Use Materialized Views to Turbo-Charge BI, Not Proprietary Middleware
Big data is full of tricky concepts and problems. One developer looks at the cases where proprietary middleware only adds to your complications.
Query performance has always been an issue in the world of business intelligence (BI), and many BI users would be happy to have their reports load and render more quickly. Traditionally, the best way to achieve this performance (short of buying a bigger database) has been to build and maintain aggregate tables at various levels, intercepting certain groups of queries so the same raw data isn't scanned over and over. Many BI tools also pull data out of databases into their own memory, into "cubes" of some sort, and run analyses off of those extracts.
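As a concrete sketch of the aggregate-table approach (the table and column names here are hypothetical, not from any particular system), a daily rollup might be built from a raw fact table so that dashboard queries hit the small summary instead of rescanning the detail:

```sql
-- Hypothetical example: a daily sales rollup built from a raw fact table.
-- Dashboards query sales_daily_agg instead of scanning raw_sales directly.
CREATE TABLE sales_daily_agg AS
SELECT
    sale_date,
    region,
    SUM(amount) AS total_amount,
    COUNT(*)    AS order_count
FROM raw_sales
GROUP BY sale_date, region;
```

Because this is a plain persisted table, it captures the data only as of the moment it was built, which is exactly the maintenance problem described next.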
Downsides of Aggregates and Cubes
Both of these approaches have the major downside of needing to maintain the aggregate or cube as new data arrives. In the past, that has been a daily event, but most warehouses are now being stream-fed in near real-time. It’s not practical to continuously rebuild aggregate tables or in-memory cubes every time a new row arrives or a historical row is updated.
Enter Hadoop. Problem Solved? No.
The problem of query performance became even more of an issue when Hadoop and data lakes emerged as common stores of big data. While the query engines of these systems have improved over time, they are not always capable of providing acceptable response times, particularly as data volumes get large and as query concurrency increases. Also, with the exception of Kudu, Hadoop and data lakes have not evolved clean ways to ingest real-time streams of data in a way that makes them instantly queryable, and particularly struggle with late-arriving or restated historical data.
Enter Proprietary Middleware. Problem Solved? Still No.
In the world of big data, many companies try to address the query response and concurrency problems by introducing custom middleware that sits between a BI tool and a Hadoop system or data lake. Products like Datameer, Platfora, AtScale, Jethro, and others were designed to improve discrete query performance, but they introduced yet another layer to the stack that you have to select, buy, manage, and maintain. You also have to tune and worry about these products separately from everything else in your stack (like your database below and your BI cache above).
Modern Data Warehouses + Modern BI to the Rescue
More recently, there has been rapid adoption of cloud-native data warehouses, such as Amazon Redshift, Google BigQuery, Qubole, and Snowflake. Nearly everyone we talk to in the data warehousing space these days is using, testing, or thinking about trying a cloud-native database for some of their data warehousing workloads.
What does this mean for BI and query performance? Well, it makes things easier to some extent, as these cloud-native systems speak fluent SQL and have characteristics similar to traditional data warehouse systems, but with vastly superior speed, scalability, and cost profiles compared to either traditional MPP warehouses or most Hadoop-based systems. Some of these cloud systems rely heavily on partitioning for performance, including micro-partitioning, which matches up well with Zoomdata's microquery and Data Sharpening support for big data. Other databases available as cloud services take a primarily in-memory approach, such as SAP HANA and MemSQL, or are re-packaged versions of on-premises database technologies.
Modern Materialized Views
While microqueries and micro-partitions provide a lot of performance boost for interactive data exploration, the classic Monday Morning problem can still arise when thousands of dashboard consumers all ask for a similar (but not identical) dashboard at the same time. In this case, aggregate tables can still be very useful to deflect some of that load from the raw underlying data.
But as discussed above, persisted aggregate tables are hard to maintain in the face of continuously arriving data. To address this, some of these cloud data stores, and most traditional data warehouse systems, support materialized views. While traditional databases have offered materialized views for quite some time, and many of the recommendations below apply to those backends as well, recent innovations allow these views to be defined with a single SQL command and then automatically maintained by the system, even as new data arrives continuously at extreme scale and speed.
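A minimal sketch of that single SQL command, using generic syntax over hypothetical table names (exact capabilities vary by warehouse; Snowflake, for example, restricts what a materialized view's defining query may contain):

```sql
-- Hypothetical example: same rollup as a classic aggregate table,
-- but declared once as a materialized view. The database engine,
-- not a batch job, keeps it in sync as raw_sales changes.
CREATE MATERIALIZED VIEW sales_daily_mv AS
SELECT
    sale_date,
    region,
    SUM(amount) AS total_amount,
    COUNT(*)    AS order_count
FROM raw_sales
GROUP BY sale_date, region;
```

The definition is the same aggregation you would have hand-maintained; the difference is that refresh becomes the database's job rather than yours.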
Even if data arrives late or out of order, or historical data is updated or restated, the modern materialized view can handle it. Even under high rates of inserts and updates, with huge volumes of data and virtually unlimited query concurrency, the modern materialized view keeps trucking without batting an eye.
How to Decide Which Materialized Views to Create
Whether you use a cloud-native database, an in-memory database, or a more traditional on-premises data warehouse, materialized views carry some maintenance cost, and they take up some amount of storage space (although typically much less than the underlying raw detail data). So it's best to look at your most commonly used dashboards, see which tables they hit and which joins they perform, and build materialized views to serve those dashboards. Then those frequently used Monday Morning dashboards can run off the materialized views, while still linking to more detailed dashboards and reports for users who want to dive into the details.
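The summary-plus-drill-down pattern can be sketched as two queries (again with hypothetical table and view names): the dashboard tile is served from the materialized view, and only users who click through touch the raw detail:

```sql
-- Summary tile on the Monday Morning dashboard: served by the
-- materialized view, so thousands of concurrent viewers never
-- touch the raw fact table.
SELECT region, SUM(total_amount) AS revenue
FROM sales_daily_mv
WHERE sale_date >= CURRENT_DATE - 7
GROUP BY region;

-- Drill-down report: hits the raw detail, but only for the
-- small fraction of users who actually ask for it.
SELECT *
FROM raw_sales
WHERE region = 'EMEA'
  AND sale_date >= CURRENT_DATE - 7;
```

This split is what deflects the Monday Morning load: the expensive scan happens on demand for a few users, not by default for everyone.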
Published at DZone with permission of Ruhollah Farchtchi, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.