First of all, there is no reason to be afraid. Not only big corporations having terabytes of data can use data-warehouses. If you have a complex domain model and a few databases with millions of records here and there it might be useful to create a separate database instance for performing analytical queries and generating reports. Also, don’t believe people who say that you will need years to build a simple DW. You don’t need more than 4 weeks to develop DW that will bring some value to your business.
It’s very important to think about the architecture of your DW as it will define tools and procedures that will be used. To make it simple: ‘Inmon VS Kimball‘ is the question you need to answer:
In Inmon’s paradigm all information is stored in 3rd normal form. Basically, the schema of your DW will be similar (more or less) to the schema of your transactional database. Inmon’s paradigm is good for creating large centralized data warehouses that will serve the whole enterprise. It’s a top down approach that requires more upfront design. The performance in this model isn’t as good as in Kimball’s model because data is normalized.
In Kimball’s paradigm you start building several databases (known as data marts) that will serve individual departments of your business. In this model DW is a combination of all data marts. The dimensional model, that is used to store data, simplifies the schema significantly and, as a result, gives you much better performance. As for me I prefer Kimball’s paradigm as you don’t have to design the whole DW upfront. You can start with creating a single data mart and it can be done by one person in a month.
Read about Kimball’s model
The best thing you can do is to buy this book “The Data Warehouse Toolkit, 2nd Edition: The Complete Guide to Dimensional Modeling”. It’s a good overview of the dimension model, which will help you if you don’t have any experience in building a DW. It helped me a lot. Highly recommend it to everyone interested in the topic.
Choosing a DBMS for your DW is a very important step. And it’s not an easy decision to make. There are two options you can choose from. If your dataset is rather small you can choose a familiar row-oriented database (such as Postgres). Just having data denormalized will improve performance. But if your database is huge it’s time to put on big boy’s pants and look at column-oriented storages:
- Row-oriented database are stable, reliable and familiar to everyone. They are perfect for transactional operations but not so good for analytical queries.
- Column-oriented databases are not very mature. As a result, they are hard to administrate, hard to find any documentation about, and they are not as reliable as row-oriented databases. It isn’t surprising taking into account the fact that these databases have been developing only for a few years. But they can give you something that MySQL can’t – performance. From my experience, they are AT LEAST 10 TIMES FASTER for complex analytical queries.
My advice would be to load some data into your favorite row-oriented database and check the performance. And ONLY if you are not happy with it go and try MonetDB, LucidDB, Infobright or some other column-oriented DBMS.
If you decided to go with a column-oriented database I can give you some advice about two databases I had experience with:
The biggest benefit you’ll get if you choose to use Infobright is that fact that it’s a mysql engine. You can use your favorite driver to connect to it (so you can use any language, any platform) and you won’t have to learn a new dialect of SQL. It’s a very solid product but the free version has severe limitations - you can use only bulk upload. No inserts, updates or deletes.
The distribution includes it’s own jdbc driver. If you are a ruby or a python developer you are out of luck. In addition, their site doesn’t contain any decent documentation. It’s just a collection of reference-like wiki page, most of which contain only a brief description of some crazy parameters such as processorMaxBytes or databaseShadowLogIncrementSize. I am 100% sure that most of all luciddb users will never change them. On the other hand, the free version of LucidDB doesn’t have any limitations.