Are Data Warehouses Dinosaurs?
As anybody who follows my blog knows, I am not a fan of vertical scaling. I don't like solutions that can only be implemented in a single address and storage space. Unfortunately, some analytical problems need a holistic view of the data, and this is very typical of data warehousing applications. As a result, data warehouses are expensive and often out of reach for smaller organizations. But there may be an alternative that is less expensive and horizontally scalable. What is this great revelation? Processing streams of events using an Event Stream Processor (ESP).
An ESP analyzes streams of events using a language similar to SQL. In the same way that databases and data warehouses use SQL to analyze data tables, an ESP uses its query language to analyze streams of events. The simplest way to understand an ESP is to think of events as rows in a table and the attributes of an event as the columns. Each event type is the equivalent of a table. From this perspective, it becomes straightforward to see how an ESP works. But how does this relate to replacing data warehouses?
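To make the table analogy concrete, here is a minimal sketch in plain Python rather than any real ESP product's query language. The `OrderPlaced` event type, its attributes, and the `total_by_customer` function are all hypothetical names chosen for illustration. The event type plays the role of a table, its attributes play the role of columns, and the generator behaves roughly like a `GROUP BY` query whose result is updated each time an event arrives.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderPlaced:
    """Hypothetical event type: the equivalent of a table."""
    customer: str   # attribute, equivalent to a column
    amount: float   # attribute, equivalent to a column


def total_by_customer(events):
    """Roughly: SELECT customer, SUM(amount) FROM OrderPlaced GROUP BY customer.

    Unlike a table query, the updated row is emitted after every event.
    """
    totals = defaultdict(float)
    for event in events:
        totals[event.customer] += event.amount
        yield event.customer, totals[event.customer]


# A small in-memory list stands in for a live stream of events.
stream = [
    OrderPlaced("alice", 20.0),
    OrderPlaced("bob", 5.0),
    OrderPlaced("alice", 7.5),
]
results = list(total_by_customer(stream))
```

Each element of `results` is the running total for the customer named in the event that just arrived, which is the sense in which the query is continuous rather than run once over stored rows.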
Data warehouse analysis involves aggregating information along a variety of axes as well as inverting relationships in the data. The goal is to provide the business with different perspectives on what the customers are doing. In order to do this, data is loaded into the warehouse periodically. Typically, daily ETL processes are run against the production databases to keep the warehouse fresh. This process, though, has a couple of issues beyond the cost of the warehouse infrastructure. First, the ETL places a significant load on your production databases. If your business has nice offline windows for the ETL, that's great, but if not, managing the scale becomes a challenge. Second, the warehouse is typically 24 hours or more behind production. As your business grows, this lag will grow as well.
An ESP addresses this by analyzing the changes to your data as they occur. Rather than running batch ETLs, you stream business events as the state of your data changes. This creates a more manageable scaling model for your production system, because the extracts for business analytics are spread throughout the transaction day. An ESP can also be horizontally scaled, providing a more cost-effective solution for your business. And since the ESP performs the analysis in real time, the business metrics can be current and remain that way as the business grows.
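One common way to scale event processing horizontally is to partition the stream by key, so that each worker only sees a subset of the events. The sketch below illustrates that idea under stated assumptions; it is not how any particular ESP product implements it. The `Worker` class, `partition_for`, and `route` are hypothetical names, and the workers here are plain in-process objects standing in for separate machines.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash, so the same
    customer always lands on the same worker."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class Worker:
    """Hypothetical worker holding running totals for its share of the keys."""

    def __init__(self):
        self.totals = Counter()

    def on_event(self, customer, amount):
        self.totals[customer] += amount


def route(events, workers):
    """Send each (customer, amount) event to the worker that owns its key."""
    for customer, amount in events:
        index = partition_for(customer, len(workers))
        workers[index].on_event(customer, amount)


workers = [Worker() for _ in range(3)]
route([("alice", 20), ("bob", 5), ("alice", 7), ("carol", 1)], workers)

# A global view is obtained by merging the per-worker totals.
merged = Counter()
for worker in workers:
    merged.update(worker.totals)
```

Because every event for a given customer goes to one worker, no coordination between workers is needed for per-customer metrics, and adding workers spreads the load. Metrics that cut across keys would need a merge step like the one at the end.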
Does this spell the end of data warehouses? Well, maybe, but there is one challenge with the ESP approach. While it is able to provide analytics cost effectively, it does not provide the ability to perform historical analysis. If you know what you want, then an ESP will deliver the results from the current point in time forward. But what if you want a different perspective on your business activity, and you want it over the past 3 months? One solution is to create a framework for capturing and replaying transactions, but this can be expensive. It becomes a matter of deciding the business value of performing the historical analysis.
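The capture-and-replay idea can be sketched as an append-only log that keeps every event, so that a query written later can be run over events from the past. This is a toy in-memory illustration, assuming hypothetical names (`EventLog`, `orders_per_region`) and a made-up event shape with `ts` and `region` fields. A real framework would need durable storage, which is where the expense comes in.

```python
class EventLog:
    """Hypothetical append-only capture of every event that was streamed."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def replay(self, since=0):
        """Yield captured events with a timestamp at or after `since`."""
        return (event for event in self._events if event["ts"] >= since)


def orders_per_region(events):
    """A query that did not exist when the events originally arrived."""
    counts = {}
    for event in events:
        counts[event["region"]] = counts.get(event["region"], 0) + 1
    return counts


log = EventLog()
for captured in [
    {"ts": 50, "region": "eu"},
    {"ts": 120, "region": "us"},
    {"ts": 130, "region": "eu"},
    {"ts": 200, "region": "us"},
]:
    log.append(captured)

# Run the new query over history starting at timestamp 100.
historical = orders_per_region(log.replay(since=100))
```

The cost of this approach grows with how far back the log must reach, which is the trade-off against the business value of the historical analysis.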
Whether you choose to use a data warehouse or not, ESP is definitely worth investigating as a way of delivering business analytics more cost effectively.
Published at DZone with permission of Dan Pritchett , DZone MVB. See the original article here.