Building a Data Warehouse, Part 4: Extraction, Transformation, and Load
- Part I: When to build your data warehouse
- Part II: Building a new schema
- Part III: Location of your data warehouse
In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system and in Part II we looked at how to architect a data warehouse’s table schema. In Part III, we looked at where to put the data warehouse tables. Today we are going to look at how to populate those tables and keep them in sync with your OLTP system.
No matter where your data warehouse is located, the biggest challenge with a data warehouse, especially one where you are going to do real time reporting off of, is that the data is published from the transactional system (OLTP). By definition, the data warehouse is out of date compared to the OLTP system.
Usually when you ask the boss, “How much latency can you accept between the OLTP and data warehouse systems” the boss will reply: none. While that is impossible, the more time to develop and the more money you have will to develop said system, the closer to real time you can get. Always bring up the American Express example used in Part I for proof that your system can accept at least some latency.
If you remember the code examples from Part II, most of the time you have to query OLTP data, aggregate it, and then load it into your new schema. This is known as the process of extraction, transformation, and loading (ETL) data from the OLTP system into the data warehouse. The workflow usually goes like this: at a pre-set time (at every change, hourly, nightly, or weekly) query the OLTP data (extraction) and then aggregate and flatten it out (transformation) and then copy the transformed data to the data warehouse (star or snowflake) tables (load).
You have several options ranging from near real time ETL (very complex and expensive) to nightly data dumps (least complex and least expensive.) You can also perform full ETL (wipe out the data warehouse tables completely and refill on each operation) and partial ETL (only add incremental data as per a time series.) The actual load can be done via database triggers on an add/update/delete for near real time to simple scheduled SQL batches that wipe data and run your load scripts. Most likely you will want to take a hybrid approach, depending on the maturity of your system. You have three basic options, ranging from least complex to most complex:
- Direct database dump
- ETL tool
- Database triggers
When you have a long time series to publish your data, say nightly, or weekly, you can do a direct database dump. The process would be pretty straightforward. At a regular interval (or manually) a process would start that would query the OLTP database and perform all of the aggregations, etc and then load it into a staging data warehouse database, then wipe out the production data warehouse and load the data in.
Another option is to use an ETL tool. A good example is SQL Server Integration Services (SSIS) if you are using Microsoft SQL Server. (Actually SSIS will work with multiple database, you just need a SQL Server host.) A modern ETL tool will give you the ability to segment the work into logical groups, have a control flow based on success and failure of a condition, and allows rollbacks.
A typical workflow with an ETL tool is that the ETL will run on a schedule (or based on a condition, such as a message arriving from a queue or a record written to a admin table) and have a parameter(s) passed to it. This parameter is usually a time series and the ETL will perform all of the extraction on data from the OLTP database filtered by that parameter. An ETL tool is the most likely solution you will employ, especially if you have to make frequent updates from your OLTP system to your data warehouse.
Another option is to use database triggers. For those of you that don’t know a lot about triggers, well they can lead to evil. ;) That said, they are events that fire when data changes. You can then write SQL code to run when the data changes, even ETL code. Triggers are hard to debug and difficult to maintain, so I would only suggest that you use a trigger when you need “real time” updates to your data warehouse and even then, the trigger should only write a record into an admin table that your ETL process is polling to get started.
A common design pattern with a sophisticated ETL is to do all of the ETL to a staging data warehouse and allow a check mechanism to verify that the ETL was successful. Once this check is performed (either by computer or by human, depending on the system), you can then push the data from the staging tables to the production tables. There is a SQL operator specifically for this process, the MERGE operator. MERGE looks at two tables and looks at their joins and compares the data and will do add, updates, inserts, and deletes to keep the two tables in sync. Here is how we would do that with a MERGE statement and our fact table from Part II.
--MERGE the FACT Local to FACT Remote MERGE dbo.FactOrder as t--target USING Northwind.dwh.FactOrder as s--source ON t.ProductID = s.ProductID AND t.EmployeeID = s.EmployeeID AND t.ShipperID = s.ShipperID AND t.TimeKey = s.TimeKey --joining keys --record exists in target, but data different WHEN MATCHED AND (s.OrderDate != t.OrderDate OR s.PostalCode != t.PostalCode OR s.[Total Sales] != t.[Total Sales] OR s.Discount != t.Discount OR s.[Unit Sales] != t.[Unit Sales]) THEN UPDATE SET t.OrderDate = s.OrderDate, t.PostalCode = s.PostalCode, t.[Total Sales] = s.[Total Sales], t.Discount = s.Discount, t.[Unit Sales] = s.[Unit Sales] WHEN NOT MATCHED BY SOURCE THEN --only in target, get rid of it Delete WHEN NOT MATCHED BY TARGET THEN --record only exists in source INSERT VALUES (s.OrderDate, s.PostalCode, s.ProductID,s.EmployeeID, s.ShipperID, s.[Total Sales], s.Discount, s.[Unit Sales], s.[TimeKey]);--required!!
Next we will warp up with the application development options.