Building a Data Warehouse, Part 2: Building a New Schema
In Part I we looked at when you should build your data warehouse and concluded that you should build it sooner rather than later to take advantage of reporting and view optimization. Today we will look at your options for building your data warehouse schema.
When architecting a data warehouse, you have two basic options: build a flat “reporting” table for each operation you are performing, or build with BI/cubes in mind and implement a “star” or “snowflake” schema. Let’s take a quick look at the first option and then we will take a look at the star and snowflake schemas.
Whenever the business requests a complex report, developers usually end up slowing down the system with a complex SQL statement or operation. For example, suppose that in our order entry system (OLTP) the business wants this report: show me the top ten customers in each market, including their overall rank. You would usually have to write a query that does the following:
1. Complex joins to get the unique customers
2. Roll up the sales
3. Ranking functions to determine the overall rank
4. Partition functions to segment the rank by country
5. Standard aggregates to get the sales
6. Dump all of this to a work table in order to pull out the top 10 (if you don’t do this, you will lose the overall rank)
A typical SQL statement to do steps 1-5 would look like this:
    -- Roll up total sales per customer and country
    With CTETerritory As
    (
        Select cr.Name As CountryName, CustomerID, Sum(TotalDue) As TotalAmt
        From Sales.SalesOrderHeader soh
            Inner Join Sales.SalesTerritory ter
                On soh.TerritoryID = ter.TerritoryID
            Inner Join Person.CountryRegion cr
                On cr.CountryRegionCode = ter.CountryRegionCode
        Group By cr.Name, CustomerID
    )
    -- Rank each customer overall and within their country
    Select *,
        Rank() Over (Order By TotalAmt DESC) As OverallRank,
        Rank() Over (Partition By CountryName
                     Order By TotalAmt DESC, CustomerID DESC) As NationalRank
    From CTETerritory
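The statement above stops short of step 6. A minimal sketch of that last step follows, assuming a temp table with the hypothetical name #CustomerRanks serves as the work table. Because both ranks are computed before the filter is applied, the overall rank survives when you pull out the top ten per country.

```sql
-- Steps 1-5: same query as above, written into a work table.
-- #CustomerRanks is a hypothetical name chosen for this sketch.
WITH CTETerritory AS
(
    SELECT cr.Name AS CountryName, soh.CustomerID, SUM(soh.TotalDue) AS TotalAmt
    FROM Sales.SalesOrderHeader AS soh
        INNER JOIN Sales.SalesTerritory AS ter
            ON soh.TerritoryID = ter.TerritoryID
        INNER JOIN Person.CountryRegion AS cr
            ON cr.CountryRegionCode = ter.CountryRegionCode
    GROUP BY cr.Name, soh.CustomerID
)
SELECT CountryName, CustomerID, TotalAmt,
    RANK() OVER (ORDER BY TotalAmt DESC) AS OverallRank,
    RANK() OVER (PARTITION BY CountryName
                 ORDER BY TotalAmt DESC, CustomerID DESC) AS NationalRank
INTO #CustomerRanks
FROM CTETerritory;

-- Step 6: filter on the national rank; OverallRank was already computed
-- over every customer, so it is not affected by the filter.
SELECT CountryName, CustomerID, TotalAmt, OverallRank, NationalRank
FROM #CustomerRanks
WHERE NationalRank <= 10
ORDER BY CountryName, NationalRank;

DROP TABLE #CustomerRanks;
```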
Argh! No wonder developers hate SQL and want to use ORMs! (I challenge the best ORM to make this query!)
Instead you can create a table, let’s call it SalesRankByRegion, with the fields CountryName, CustomerID, TotalSales, OverallRank, and NationalRank, and use the above SQL as part of a synchronization/load script to fill your reporting table on a regular basis. Then your SQL statement for the above query looks like this:
    SELECT *
    FROM SalesRankByRegion
    WHERE NationalRank BETWEEN 1 AND 10
    ORDER BY CountryName, NationalRank
The results look like:
That is more like it! A simple select statement is easier for the developer to write, the ORM to map, and the system to execute.
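The synchronization/load script mentioned above could look something like the following. This is only a sketch: the dbo schema, the data types, the primary key, and the full-refresh strategy (delete everything, then reload) are all assumptions for illustration, not a prescribed design.

```sql
-- One-time setup: the reporting table. Column names follow the text;
-- the schema, data types, and key are assumptions.
CREATE TABLE dbo.SalesRankByRegion
(
    CountryName  nvarchar(50) NOT NULL,
    CustomerID   int          NOT NULL,
    TotalSales   money        NOT NULL,
    OverallRank  bigint       NOT NULL,  -- RANK() returns bigint
    NationalRank bigint       NOT NULL,
    CONSTRAINT PK_SalesRankByRegion PRIMARY KEY (CountryName, CustomerID)
);

-- Scheduled refresh: replace the contents in one transaction so readers
-- never see a half-loaded table.
BEGIN TRANSACTION;

DELETE FROM dbo.SalesRankByRegion;

WITH CTETerritory AS
(
    SELECT cr.Name AS CountryName, soh.CustomerID, SUM(soh.TotalDue) AS TotalAmt
    FROM Sales.SalesOrderHeader AS soh
        INNER JOIN Sales.SalesTerritory AS ter
            ON soh.TerritoryID = ter.TerritoryID
        INNER JOIN Person.CountryRegion AS cr
            ON cr.CountryRegionCode = ter.CountryRegionCode
    GROUP BY cr.Name, soh.CustomerID
)
INSERT INTO dbo.SalesRankByRegion
    (CountryName, CustomerID, TotalSales, OverallRank, NationalRank)
SELECT CountryName, CustomerID, TotalAmt,
    RANK() OVER (ORDER BY TotalAmt DESC),
    RANK() OVER (PARTITION BY CountryName
                 ORDER BY TotalAmt DESC, CustomerID DESC)
FROM CTETerritory;

COMMIT TRANSACTION;
```

The expensive joins and ranking now run once per refresh instead of once per report request, which is the trade-off this approach makes: cheaper reads in exchange for data that is only as fresh as the last load.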
The SalesRankByRegion table is a vast improvement over having to query all of the OLTP tables (by my count there are three tables plus the temp table). While this approach has its appeal, your reporting tables will very quickly start to proliferate.
Your best option is to follow one of the two industry standards for data warehouse tables: a “star” or a “snowflake” schema. Using a schema like this gives you a few advantages. These schemas are more generic than SalesRankByRegion, which was a table built for one query/report, so you can run many different reports off each table. Another advantage is that you can build cubes very easily off a star or snowflake schema, as opposed to a bunch of tables like SalesRankByRegion.
The design pattern for building true data warehouse tables is to build a “fact” table: a table that contains detail-level (or aggregated) “facts” about something in the real world, such as an order or a customer. Inside the fact table you will also have “measures,” numeric values that describe a fact. To support your fact table you will have “dimension” tables. Dimensions are structures that categorize your data, usually in the form of a hierarchy. A “time” dimension table, for example, could have a hierarchy of OrderYear, OrderQuarter, OrderMonth, OrderDate, and OrderTime.
There are tons of tutorials on the internet that show you how to build a star or snowflake schema and the difference between them, so I will not repeat them here. (You may want to start here.) I’ll give you the high level on a simple star schema here.
Let’s say we have an order entry system, such as Northwind (the Microsoft SQL Server sample database). You can have a fact table that revolves around an order. You can then have three (or more) dimension tables that focus on time, product, and salesperson. The time dimension would roll up the order date by year, quarter, month, and date. The product dimension would roll up the product by product and category. (In most systems you would have a much deeper hierarchy for products.) The salesperson dimension would be a roll-up of the employee, the employee’s manager, and the department they work in. The key of each of these dimension tables would then be a foreign key in the fact table, alongside the measures (the numerical data describing the fact).
There is an example similar to this in Programming SQL Server 2008, a book I co-authored. Here is a modified version of that demo:
    CREATE TABLE [dwh].[DimTime] (
        [TimeKey] [int] IDENTITY (1, 1) NOT NULL Primary Key,
        [OrderDate] [datetime] NULL,
        [Year] [int] NULL,
        [Quarter] [int] NULL,
        [Month] [int] NULL
    )

    CREATE TABLE [dwh].[DimProduct] (
        [ProductID] [int] not null Primary Key,
        [ProductName] nvarchar(40) not null,
        [UnitPrice] [money] not null,
        [CategoryID] [int] not null,
        [CategoryName] nvarchar(15) not null
    )

    CREATE TABLE [dwh].[DimEmployee] (
        EmployeeID int not null Primary Key,
        EmployeeName nvarchar(30) not null,
        EmployeeTitle nvarchar(30),
        ManagerName nvarchar(30)
    )

Fact table:

    CREATE TABLE [dwh].FactOrder (
        [PostalCode] [nvarchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
        [ProductID] [int] NOT NULL,
        [EmployeeId] [int] NOT NULL,
        [ShipperId] [int] NOT NULL,
        [Total Sales] [money] NULL,
        [Discount] [float] NULL,
        [Unit Sales] [int] NULL,
        [TimeKey] [int] NOT NULL
    )
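To make the relationship between the fact table and its dimensions concrete, here is a rough sketch that declares the foreign keys and then runs one report across the star. The constraint names are hypothetical, and the demo has no dimension table for ShipperId or PostalCode, so those columns are left without a constraint. The query assumes the tables have already been loaded.

```sql
-- Tie the fact table to its three dimensions.
-- Constraint names are hypothetical.
ALTER TABLE dwh.FactOrder ADD
    CONSTRAINT FK_FactOrder_DimTime
        FOREIGN KEY (TimeKey) REFERENCES dwh.DimTime (TimeKey),
    CONSTRAINT FK_FactOrder_DimProduct
        FOREIGN KEY (ProductID) REFERENCES dwh.DimProduct (ProductID),
    CONSTRAINT FK_FactOrder_DimEmployee
        FOREIGN KEY (EmployeeId) REFERENCES dwh.DimEmployee (EmployeeID);

-- Example report: sales and units by year, quarter, and product category.
-- The measures come from the fact table; the grouping columns come from
-- the dimensions.
SELECT t.[Year], t.[Quarter], p.CategoryName,
    SUM(f.[Total Sales]) AS TotalSales,
    SUM(f.[Unit Sales])  AS UnitsSold
FROM dwh.FactOrder AS f
    INNER JOIN dwh.DimTime AS t
        ON t.TimeKey = f.TimeKey
    INNER JOIN dwh.DimProduct AS p
        ON p.ProductID = f.ProductID
GROUP BY t.[Year], t.[Quarter], p.CategoryName
ORDER BY t.[Year], t.[Quarter], TotalSales DESC;
```

Swapping DimProduct for DimEmployee, or grouping by month instead of quarter, gives a different report from the same tables, which is the flexibility a single-purpose reporting table does not offer.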
We have the basis of a star schema. Now we have to fill those tables and keep them up to date. That is a topic for Part III.