Don’t Jump From the BI-on-Hadoop Ship Just Yet
The frustration of using Hadoop for BI is justified. Still, if you’re thinking of abandoning Hadoop for BI, you may want to reconsider.
After Hadoop’s five short years in the spotlight, tech experts are calling for its death. For many, it seems like Hadoop is already going the way of legacy applications: it’s not going away anytime soon, but many business leaders see it as more of a necessary evil than the miracle it once was.
The frustration is justified. The Hadoop distributed computing platform has been oversold as a solution for your entire business. But in reality, Hadoop is designed for data scientists who can code in MapReduce — not business analysts who need interactive business intelligence functionality.
Sometimes, though, the grass isn’t greener on the other side. If you’re thinking of abandoning Hadoop for BI, you may want to reconsider.
It’s True: Hadoop Isn’t Meant for Business Intelligence
Despite what you may have thought when you dove headfirst into Hadoop, it isn’t built for visualization software that promises near real-time business intelligence.
Consider three main points that summarize why Hadoop isn’t made for business intelligence (and is “failing” you as a result):
- Data lakes aren’t meant for interactive queries. The lack of guaranteed response times makes latency too much of a challenge in these situations.
- Hadoop is best-suited for ETL batch workloads and machine learning because it offers a cheap storage repository.
- Data scientists can master Hadoop, but critics point out that business users would have to learn Hive, Pig, or Spark to make it work, which isn’t going to happen anytime soon.
These “failures” of Hadoop lead to many workaround solutions to make BI work on the distributed computing platform, which is where even greater frustration often sets in.
Workaround Challenges for BI on Hadoop
Latency issues in Hadoop point to a larger problem: Big data might just be too big for business intelligence in Hadoop. Business users need insights in near real time, and Hadoop and BI tools don’t integrate seamlessly enough to make this a reality.
Your instinct might be to abandon the Hadoop ship and find a new distributed computing platform that meets the needs of both data scientists and business users. But a viable alternative is hard to find.
DataTorrent Co-Founder Phu Hoang summed up the problem best:
“Hadoop is painful. But [business leaders] don’t see another solution. Until there may be other distributed computing platforms out there, our focus will be on making that one as easy to use as possible.”
This means finding workarounds to make BI work in Hadoop. But if you’ve ever tried the following workarounds, you know they aren’t without their own challenges:
Implement a Generic, Out-of-the-Box BI Solution
Trying to force a standard BI solution into Hadoop won’t work. These solutions are often slow when connected to Hadoop, might not fit your specific use case, or might fail to flex to the querying needs of your users. However, this additional software layer is the only way to create a BI dashboard for low-latency insights.
Make Big Data Smaller With Extracts
BI tools like Tableau, Qlik, and MicroStrategy are excellent for visualizing data ingested from Excel spreadsheets. But when you start working with three billion rows of big data, typical solutions will simply freeze or crash. Some companies extract smaller sets of data to work with in standard BI solutions. However, this negates the benefits of granular big data analytics when you’re working with massive datasets. You end up with data silos across the organization and increasingly frustrated users.
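To make the granularity trade-off concrete, here is a minimal Python sketch. The rows, column names, and the `build_extract` and `can_answer` helpers are all hypothetical and exist only for illustration; a real extract would be built by a BI tool or an ETL job, not by a Python loop.

```python
from collections import defaultdict

# Hypothetical detail rows: (region, product, amount). In practice these would
# be billions of rows stored in Hadoop, not a small Python list.
ROWS = [
    ("east", "widget", 120.0),
    ("east", "gadget", 80.0),
    ("west", "widget", 200.0),
    ("west", "gadget", 50.0),
    ("west", "widget", 25.0),
]

# Columns that survive in the extract (an assumption of this sketch).
EXTRACT_COLUMNS = ("region", "amount")


def build_extract(rows):
    """Collapse detail rows into one total per region (the 'extract')."""
    totals = defaultdict(float)
    for region, _product, amount in rows:
        totals[region] += amount
    return dict(totals)


def can_answer(extract_columns, question_columns):
    """A question is answerable only if every column it needs is in the extract."""
    return set(question_columns) <= set(extract_columns)


extract = build_extract(ROWS)

# The extract is smaller and still answers "sales by region" ...
sales_by_region_ok = can_answer(EXTRACT_COLUMNS, ("region", "amount"))
# ... but "sales by product" needs a column that was aggregated away.
sales_by_product_ok = can_answer(EXTRACT_COLUMNS, ("product", "amount"))
```

The extract is easier for a desktop BI tool to handle, but any question that depends on a dropped column forces another extract, which is how the silos described above tend to multiply.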
Adopting Any of the SQL-on-Hadoop Solutions
When you choose a SQL-on-Hadoop solution like Hive, it often doesn’t work the way you expect. While these solutions are a step up from Hadoop’s lack of inherent SQL support, they aren’t enough to meet your high-performance needs. On their own, data warehousing and massively parallel processing (MPP) solutions for SQL-on-Hadoop will only fuel the perceived downward trend of Hadoop for BI.
Hadoop Isn’t Going Anywhere, but BI on Hadoop Must Get Easier
Will Hadoop always be the primary (or only) distributed computing platform? Probably not. But regardless, so many companies have billions of rows of data in Hadoop and migrating it all won’t happen in the short term.
Instead, we have to start thinking about ways to satisfy business users as well as data scientists within Hadoop-powered organizations. The best way to do this — and to end the calls for Hadoop’s death — is to take advantage of a query acceleration engine. You can enjoy the cost efficiency of Hadoop without sacrificing high-performance business intelligence.
Published at DZone with permission of Remy Rosenbaum, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.