How to Design an Enterprise Data Warehouse
How to Design an Enterprise Data Warehouse
In this article, I will give some best practices that you can apply to your data warehouses.
Join the DZone community and get the full member experience.Join For Free
Within your company, you may have hundreds of data sources, many kinds of databases and data types. If this is the case, it can be very hard and complex to design a user-friendly data warehouse. Moreover, there can exist hundreds of excel files which you do not know the source.
In this article, I will give some "best practices" which you can apply to your data warehouses. Of course, The performance of your systems would be also affected by the hardware and software environment . You have to make some decisions for them independently from your data warehouse design.
For a well-designed data warehouse, I would like to list some main headlines for your to be able to be interested in. I will cover only the 1st, 2nd, and 3rd topic in this article and will write the last three headlines in my next article.
What should be the subject areas of your data warehouse?
What should be the model rules for your data warehouse?
What should be the database constraints for your database objects?
What should be the schedule times for your data flow?
What kind of data marts do you need?
Do you need any operational data source?
What Should Be the Subject Areas of Your Data Warehouse?
An Enterprise Data Warehouse model must have its own data modeling structure. It must not be the simple copies of the data sources. There could be many replicated data in many of the transactional data applications. Therefore, prior to the data warehouse modeling, the business data types of the company have to be defined so that the main subject areas of the data warehouse are to be able to defined first before modeling. The definition of the subject areas provide you the general model for the data warehouse, which means that the model is autonomous of the data sources. Different data sources and different data types can be consolidated in the same table or subject area by this general modeling practice.
What could be the subject areas? They could be:
Finance: Invoice, Fraud Cases, or any other ERP generated financial information.
Campaign and Services: What are your product campaigns, or special services your offer to your customer.
Sales: What does your company sell? The order management data can be modeled in this subject area.
Customer: Your master customer information. Or it can include any kind of new customer, returned customer transactional information.
Warehouse: If you have stock management it must include the warehouse information also.
The examples can be increased. It depends on your industry. It has to be defined what kind of analysis you need to do by using all these data.
Besides, there should be general subject areas like "Party"and "Address," which are used in most in any kind of industry. These kind of subject areas are the most flexible one that you can bring your any kind of data source. For example, if the data warehouse has the Party subject area, you can put your employee, customer, vendor information, and team information to the "Party" subject area.
What Should Be the Model Rules for Your Data Warehouse?
If the size of the transactional data is very high, it is a best practice to separate the design into two parts.
Relational data source (RDS): which is isolated from the user interfaces.
Dimensional data source (DDS): which is specifically designed for user and reporting interfaces.
The above method also provides the ability for a flexible management of the enterprise data warehouse.
First, the extracted transactional data can be kept in relational models. This makes easier the source extraction, the transformation of the format of the data and loading the data to your RDS model. This is why the RDS model should not be denormalized model.
Yes! RDS is the normalized data layer for the data warehouse. The general rules can be defined as;
The data warehouse should generate its own key values, which are called surrogate keys.
The rds layer should keep the relation between the surrogate keys and natural keys (the source system key fields), however in other layers for the user interfaces, it does not have to be transferred to the other layers.
There can be 5 type of tables
It keeps historical information between two subjects with strat date and end date information
Link Type Tables
It keeps the type of the relation for "Link Tables", for example adrress of a customer could be home or office address of the customer
Simple definition tables. The order types, payment types, address types, regions etc. The lookup tables can be Slowly Changing Dimension Type 1 or Type 2. Type 1 just keeps the current state of the record, on the other hanf Type 2 keeps the versions of data changes.
These tables just keeps the information for the transactions, like the order transactions
Master Data Tables
They are the most important company wide information, like customer base, address base etc.
There can be hierarchical data in the data warehouse.
Since the data warehouse model is designed to be more generic, a table can have different types of data sources. So the data warehouse should keep the data source list in an inventory table. And use that information in the ETL process.
There should be naming conventions for the model. It should include the abreviations of the subject area, table type for link tables, transaction tables or lookup tables etc. and the specific definition for the table.
The model also should include a standard calendar subject area for general usage in all the data warehouse layers.
What Should Be the Database Constraints for Your Database Objects?
The data warehouse systems are to be designed for optimizing the reporting performance so that it should be optimized for reading data.
If there are many source systems and thousands of tables, loading time would be very critical for the preparation time of the reports. For loading the tables rapidly, there should not be indexes on tables.
Some transactional tables can have millions/billions of records. These data can be partitioned by time or based on a key field.
Surrogate keys should be defined in numeric data types. It also decreases the data size of the tables for data warehouse lifecycle.
For handling the processed data, a recycling mechanism can also be designed. In daily transaction transfers, if there is unhandled data based on the defined transform rules, the data can be kept in recycled tables for analyzing later.
For these three headlines, I try to explain some best practices for designing a data warehouse. Based on the size of your data and business needs, the design can be changed. I will continue the topic because there are more details to share for the design of a data warehouse.
Opinions expressed by DZone contributors are their own.