Ever since Hadoop’s second-generation release in late 2013 brought it closer to enterprise acceptance, the blogosphere has been ringing with the sound of commentators proclaiming what was still lacking, and what hurdles that yellow elephant needed to jump in order to truly burst into the mainstream.
Yes, Hadoop 2 brought the YARN resource scheduler – and along with it a plethora of tools and options that have brought Hadoop closer to becoming a credible ‘good enough’ solution for many diverse Big Data use cases beyond batch-processed MapReduce (but shy of the extremes of scale and speed). I charted this evolution in the recent MWD report Hadoop: A driving force in the Big Data technology landscape, and you can hear more in the webinar replay Navigating the Big Data landscape.
But overall, to be taken seriously at the mainstream enterprise table, Hadoop has needed to satisfy two key tests: achieve business grade performance, security, availability, etc. (IT often won’t touch somebody else’s new toy unless it’s wearing these badges); and provide a way in via SQL. Why still SQL? Well, there’s a lot of SQL talent and BI investment out there, and it beats having to hire in a whole new army of query wranglers if you can treat Hadoop like you would any other feeder for your analytics processing.
To satisfy the first concern, a number of vendors (such as MapR and IBM) have taken the route of ‘hardening’ elements of the Hadoop ecosystem – providing API-compatible versions of components such as the filesystem, surrounding Hadoop with their own tools and connectors, and/or simply leveraging their existing proprietary infrastructure (rather than relying solely on pure open source contributions) to tweak performance, ensure robust, granular, enterprise-grade security, guarantee five 9s availability, etc.
As for SQL? Well, there are myriad options there, but they basically boil down to one of two stories: either something open source, with its roots in Apache Hive (either inspired by it, spawned from it, or developed alongside / in opposition to it and designed to iron out its kinks); or something proprietary, designed to pull a big name heritage database vendor’s flavour of Hadoop more firmly over into its realm of legacy data warehouses and all-encompassing suites.
And so Hadoop’s SQL wars rage on. Is Apache Hive SQL-like enough for you, and does its continued reliance on batch – after all, it’s a MapReduce translator at heart – slow you down? Even if Hive classic isn’t for you, what about its more modern makeovers: Hive on Tez via Hortonworks’ Stinger initiative, or Hive on Spark on YARN – do they give Hive enough of a tune-up, or has that horse been well-and-truly flogged? Is Cloudera’s alternative, Impala, a more efficient ground-up re-imagining of real-time SQL on Hadoop as a native distributed engine, shorn of any Hive-like translation overheads – however well-tuned? Is relative newcomer Apache Drill, with its ANSI SQL on both Hadoop and NoSQL, enough of a stone to kill two birds? (Drill is sponsored by MapR, who already include Impala in their Hadoop distribution line-up, as well as sporting a partnership with HP Vertica that brings its own ANSI SQL analytics on Hadoop.) And beyond the world of open source, there are big enterprise-class SQL engines such as IBM’s BigSQL (part of its BigInsights Hadoop distribution), Pivotal’s HAWQ on its HD Hadoop distribution, and Teradata’s SQL-MapReduce framework for Aster (alongside its partnered-in Hortonworks Open Distribution for Hadoop, and its recent acquisition of SQL-on-Hadoop pioneer Hadapt).
What’s all this telling us? Well, there’s a lot of money riding on SQL being the battleground where different flavours of Hadoop either demonstrate enough mettle to jump the chasm and cement their enterprise future, or jump the shark trying to over-engineer themselves into the supposed true keeper of the SQL faith. There are a lot of bets being hedged too (just look at the web of partnerships, acquisitions and home-grown initiatives some players are weaving in order to keep as many bases covered, and appeal to as many sources of Hadoop converts, as possible).
You may feel you’d like to see how the dust settles, and who’s left standing, before you commit to any one SQL-on-Hadoop movement. But you probably don’t have that luxury. Depending on your need for flexibility and/or purity, any of the above may well suit you just fine for now anyway. You may have plumped for the one with the best fit for the rest of your IT estate – especially if you’re overwhelmingly an IBM or Teradata shop, etc.; or the one which speaks most to your open source leanings; or perhaps you’ve analysed your use cases enough to really understand the nuances that separate the players in a dynamically overlapping market.
So, as Big Data franchises go, if you enjoyed Hadoop’s part 1… what are you looking for in a good SQL?