Do THIS Not THAT for Modern BI: Query Data in Place, Don't Have Copies All Over
Keeping your data in one place is safer and makes life easier for data science teams. Read on to learn more about why that is.
When I started working in the BI space in the mid '90s, the "ROLAP vs. MOLAP" war was raging. On one side were MOLAP products like Arbor Software's Essbase (now owned by Oracle) and what is now SQL Server Analysis Services (originally Panorama Software). On the other was the company I worked for, MicroStrategy.
Before database engines started to become optimized for analytical versus transactional queries, the best way to do BI was to build what is called a "multi-dimensional OLAP cube," which precomputes the answers to many of the possible calculations that a user could ask for. This makes it really fast to answer users' questions, but takes a lot of time and memory to calculate the cube, and then space to store it. Since most BI was being done on small amounts of financial data, this generally worked fine.
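To make the MOLAP idea concrete, here is a minimal sketch in Python using pandas, with toy data and column names of my own choosing: it precomputes an aggregate for every combination of dimensions, so later questions become lookups rather than scans of the raw rows.

```python
# A minimal sketch of the MOLAP idea: precompute every grouping of the
# dimensions up front so later questions are answered by lookup, not by
# scanning raw rows. Toy data and column names are illustrative only.
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "product": ["A", "B", "A", "B"],
    "revenue": [100, 150, 120, 80],
})

dimensions = ["region", "product"]
cube = {}
# Precompute an aggregate for every combination of dimensions (the "cube").
for r in range(1, len(dimensions) + 1):
    for dims in combinations(dimensions, r):
        cube[dims] = sales.groupby(list(dims))["revenue"].sum()

# A later question is a cheap lookup instead of a scan of the raw data.
print(cube[("region",)]["East"])   # 250
```

Even at this toy scale the trade-off is visible: the cube answers instantly, but it has to be rebuilt whenever the underlying data changes, and it grows combinatorially with the number of dimensions.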
Then data warehousing took off in the retail sector, and retailers wanted to analyze all their sales and inventory transactions at a granular level. This was much more data, and so the cubes started to get much bigger — eventually too big to fit on a reasonable computer of the day. So along came MicroStrategy, which said, "you should query the data in place in the database when people ask questions, not pre-compute all the answers in a copy." This was called Relational OLAP, or ROLAP.
ROLAP was much more scalable than MOLAP, but was generally slower. Queries could take minutes as opposed to seconds, and to make them faster we'd usually create aggregate tables in the database, which was sort of like MOLAP, just stored in a different place. But an aggregate table is still basically a copy that needs to be created and maintained.
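For contrast with the cube sketch above, here is a minimal sketch of the ROLAP pattern using Python's built-in sqlite3 module; the schema and data are illustrative only, not from any particular warehouse.

```python
# A minimal sketch of the ROLAP pattern with sqlite3 from the standard
# library: the BI layer generates SQL against the database at question
# time, and an optional aggregate table (itself a copy to maintain)
# speeds up common rollups. Schema and data are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, revenue REAL);
    INSERT INTO sales VALUES
        ('East', 'A', 100), ('East', 'B', 150),
        ('West', 'A', 120), ('West', 'B', 80);

    -- The "aggregate table": precomputed, but stored in the warehouse
    -- and refreshed whenever the base table changes.
    CREATE TABLE sales_by_region AS
        SELECT region, SUM(revenue) AS revenue FROM sales GROUP BY region;
""")

# ROLAP: answer the question by querying the data in place...
print(conn.execute(
    "SELECT region, SUM(revenue) FROM sales GROUP BY region").fetchall())

# ...or route it to the aggregate table when one matches the question.
print(conn.execute("SELECT region, revenue FROM sales_by_region").fetchall())
```

The aggregate table lives in the same database rather than in a separate cube file, but it is still a derived copy that someone has to keep in sync with the base table.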
Eventually, most BI tools in the mid 2000s evolved to use a hybrid strategy, using MOLAP approaches for some of the data as an accelerator, and ROLAP for access to the bulk of the data. So neither approach really ever "won" the ROLAP-MOLAP war.
When Hadoop and cloud storage became popular in the mid 2010s, and the big data revolution started, many companies didn't want to store all their data in a regular relational database anymore, as the data was too big and those systems were too slow and expensive. But then new problems evolved. The first generation of big data systems was built for batch processing (think MapReduce), and it took a while to find ways to query that data quickly. Approaches like Hive sort of made it queryable, but generally too slowly for interactive analysis.
And everyone still wanted their BI to work on this new big data. So people tried a lot of different things.
Datameer, Platfora (MOLAP Is Back!)
These early Hadoop data analysis tools used the MOLAP playbook. Extract the data from Hadoop, munge it, roll it up, and copy the results into a cube. Then these players also built a hard-wired BI tool on top of these cubes, which only worked with these cubes. The net of this is that you're using a new BI (and ETL) tool on rolled-up copies of your Hadoop data. Cool? Well, initially, sort of... except that you're not leveraging the power of your Hadoop platform except as a cheap source-data file store, and you need to learn a new ETL and BI tool. Also, you need to keep those extracted cubes up to date, and create new ones when users need to look at the data in a different way.
AtScale, Jethro (MOLAP, Bring Your Own BI)
Some other startups decided to focus just on the ETL and MOLAP aspects, and to let you keep your existing BI tool. So they built engines that cube up data out of Hadoop and present it as a SQL or MDX store to existing BI tools. This approach has worked a bit better than the Datameer/Platfora approach, but it still requires care and feeding of a copy of the data, still uses your Hadoop platform as little more than a file store, and means acquiring and running another piece of technology in your BI stack.
Impala, Presto, Hive on Tez/LLAP (ROLAP?)
This is where we get back to ROLAP. The open source community decided that we needed faster query engines than Hive alone. Cloudera built Impala, Hortonworks ported Hive to run on Tez and LLAP for greater speed, and Facebook built and open sourced Presto. Presto was the first of these engines to work on data stored in Amazon S3 in addition to HDFS, and Impala and Hive have since added S3 connectivity. When run in Hadoop on HDFS data, this approach properly leverages the co-located compute and storage that Hadoop is all about. When run on S3 data it's a little fuzzier, since the Hadoop investment is still being leveraged to run the query engine, but not to store the data. These engines can run on the big, raw data rather than on copies, and can be used by many BI tools. The only catch is that they are pretty complicated to set up, the data needs to be formatted just so if you want them to really perform, and most BI tools have not optimized for these engines, so usability suffers, especially as concurrent usage levels increase. But again, this approach means no copies to define, secure, and maintain, and leveraging Hadoop is a very good thing!
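For a sense of what "query the data in place" looks like from a client, here is a hedged sketch using the PyHive package against a Presto coordinator; the hostname, catalog, schema, and table names are placeholders I've assumed, not details from the article.

```python
# A hedged sketch of querying data in place from an analysis client,
# assuming a Presto coordinator is reachable and the PyHive package is
# installed (pip install 'pyhive[presto]'). Host, catalog, schema, and
# table names below are placeholders, not values from the article.
from pyhive import presto

conn = presto.connect(
    host="presto-coordinator.example.com",  # assumed hostname
    port=8080,
    catalog="hive",       # reads the same files your Hadoop jobs use
    schema="default",
)
cur = conn.cursor()
# The query runs against the raw files on HDFS or S3 -- no cube, no copy.
cur.execute("""
    SELECT region, SUM(revenue) AS revenue
    FROM sales
    GROUP BY region
""")
for row in cur.fetchall():
    print(row)
```

The point of the sketch is the shape of the architecture: the only artifact the BI layer depends on is the raw table itself, so there is nothing extra to refresh when the data changes.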
So we've talked about three approaches to leveraging big data and making it accessible to analysis tools, each with their pros and cons. Many organizations that have invested in Hadoop, particularly on-premises Hadoop, are now actively deciding between the latter two of these approaches. But for companies going all-in on cloud, there is another approach which is quickly gathering steam:
Snowflake and BigQuery (Just a Database!)
There are advantages to full-featured databases over Hadoop that have not been fully replicated by Hadoop SQL engines. But, historically, there hasn't been a full-featured database that was scalable and cheap enough at scale to be workable, which is why Hadoop was invented in the first place. Then two teams of database die-hards went to work on building a database that would scale, and those two visions have now been realized as Snowflake and Google BigQuery. Both will hold extremely large amounts of data and still return query results quickly. So for data processing and BI, if you are willing to put all your data in the cloud, they work very well. They work less well for ad hoc and machine learning/AI workloads that need more than SQL access to data, so they aren't a complete one-stop shop. But those bridges are being built. And while you do need to load your data into these cloud databases, once you do so they can become your one master copy, and they are fast enough to not need any aggregates or cubes on top of them.
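As a hedged sketch of the "just a database" approach, here is what a query against BigQuery looks like with the google-cloud-bigquery client library, assuming credentials are already configured; the project, dataset, and table names are placeholders, not details from the article.

```python
# A hedged sketch of the "just a database" approach with BigQuery,
# assuming the google-cloud-bigquery package is installed and credentials
# are configured. The project, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # assumed project
query = """
    SELECT region, SUM(revenue) AS revenue
    FROM `my-analytics-project.sales_dw.sales`
    GROUP BY region
"""
# The warehouse scans the single governed copy of the data directly;
# no aggregate tables or cubes sit between the BI tool and the data.
for row in client.query(query).result():
    print(row.region, row.revenue)
```

Snowflake's Python connector follows the same pattern: connect, send SQL, get results back fast enough that you don't feel the need to build an accelerator copy in front of it.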
Conclusion
So if you are trying to decide what to do with all your structured data, the decision today comes down to a few scenarios:
- You're going all in on cloud, and primarily need a SQL database. In this case one of the fully hosted databases like Snowflake or BigQuery may be your best bet.
- You're going all in on cloud, need a database, but also plan to do a lot of machine learning, AI, or data engineering. In this case, a hosted Hadoop option such as Cloudera Altus or Amazon EMR may be a good choice.
- You need to have your data on-premises. An on-premises Hadoop deployment is probably your best option. Consider one that also has a cloud-hosted option, so you can potentially migrate in the future.
The good news is that all of these approaches let you query your single copy of your data. No copies, no cubes, no aggregates, no MOLAP. Just say no to anyone who wants to make a copy of your data to speed up queries — it's complex and risky, and it's not secure. Plus, it's no longer needed. Then choose a BI tool that is optimized for the modern storage engine you select.
Published at DZone with permission of Justin Langseth, DZone MVB.