How to Build a Data-Warehouse in 4 weeks, Part 2
I’ve talked about the first 2 steps you need to take to build your own data warehouse (Read: How to Build a Data-Warehouse in 4 weeks, Part 1). Choosing the architecture and the DBMS are the first things that need to be done. So far we have the idea of the data we need to replicate and the database we want to store data in. The missing part is the process. How do we store replicated data? How do we transform data? These are the questions I’ll answer in this post.
There are many ways of replicating data from your transaction databases to the DW. For the sake of simplicity, let’s assume that we’ll run our job replicating data once a day. And at that time our business is not working, so the transactional databases are not being updated. Let’s assume that we have two transaction databases (TDB1 and TDB2) and our DW must contain data from both of them.
IMAGE1 showing two databases with 2 different schemas, our data warehouse and a question mark showing that we don’t know how to replicate data yet.
We will populate our DW using an ETL (extract, transform, and load) job. We have two choices here:
Making it a one-step process. In this case we have only one ETL that does all the work. It extracts data, transforms it in memory and loads it into our DW.
Two-step process introduces a staging area. Instead of one ETL we have two. The first one copies data from our transaction databases into the staging area doing some minimal transformations (like converting data types). The second ETL uses heavy transformations to copy data from the staging area to the data warehouse.
Let’s take a closer look at these two approaches.
The one-step process comprises one job doing everything. It sorts and merges data from different input sources (TDB1 and TDB2) in memory and loads it into our DW. Though this approach is the simplest one it has some obvious flaws:
The process is monolithic. If you introduce some errors in your transformation and the process fails you will have to rerun the whole process again. Don’t forget that you won’t be able to do it during business hours, as your transactional database will be under load.
Generally, it’s a good idea to minimize the time you spend accessing remote servers (your database instances). Having your ETL job implemented this way won’t allow you to do it.
You won’t be able to use capabilities of your DBMS to merge data from different input data sources. Everything has to be done by your job, which may be cumbersome and error prone.
The two-step process comprises two jobs:
“Replicate to Staging”. It copies data from our transactional databases (TDB1 and TDB2) into another database - the staging area. We don’t do any complex transformations at this point. The purpose of this step is to copy all the data we haven’t processed yet.
“Populate Data Marts”. It takes that data we have in the staging area, transforms it and uploads into our DW. It also cleans the staging area after it’s processed all the data. It never processes the same data twice.
There are several benefits you will get if you choose this approach:
Only the first step touches your transactional databases. You can rerun “Populate Data Marts” as many times as you want without affecting your transaction database. It means that it can be done during business hours. This aspect is crucial as usually “Replicate to Staging” is pretty straightforward and doesn’t cause any problems.
Having all the data from our input sources in one place allows you to use capabilities of your DBMS to join, merge, and filter data.
Though introducing an additional step (copying data to the staging area) may complicate your implementation in the beginning the price is not so high if you think about benefits you will get. The process is more reliable and easier to extend. In addition, ability to use DBMS to join data from several input sources will save you plenty of time.
In addition, I’d like to share some thoughts about some implementation details.
Additional Implementation Notes: Using BI Platforms
Such BI platforms as Pentaho will give you all capabilities to write and execute ETL jobs. If you don’t have much time and you are not afraid of using drag-and-drop programming you can write all needed ETL jobs in a few days.
Though I’m a big proponent of ready-to-use solutions (such as BI platforms) writing everything from scratch is a better approach in many ways. You won’t have to deploy and support one more instance of Tomcat. Secondly, BI platforms are very far from being Agile. As a result, the only way to test your ETL job is to do it manually, which basically makes any kind of refactoring extremely painful. Also, it’s very hard to keep you ETL jobs DRY which increases the price of making changes in the future.
Additional Implementation Notes: Copying Data to Staging
Most of the tables you will need to copy will belong to one of following groups:
- Some reference tables containing up to a few thousand rows. You don’t have to bother. Just copy the whole table every night.
- Tables containing immutable data. You can use the primary id to copy new rows.
- Tables containing mutable data and having an “updated_at” kind of column. Use this column to find the data that was updated.
In some situations it’s not that easy:
- For instance, you may need to join a few tables to find updated rows. Or use many columns (such as primary_id, inserted_at and updated_at) for one table.
In the end I’d like to say one more time that it’s not as complicated as people say. Building a simple DW is a task that can be achieved by one person in a month. Of course, there is a lot of theory behind it (like how to handle different types of dimensions etc). But to bring some value to your business you don’t have to know all this theory, just understanding the basics will be enough.