Spark Grows Up and Scales Out
Join the DZone community and get the full member experience.Join For Free
Written by Craig Wentworth.
To understand the furor that’s greeted recent vendor announcements around open source analytics computing engine Spark, and some commentary seemingly setting up a Spark versus Hadoop battle, it’s worth taking a moment to recap on what each actually is (and is not).
As I covered in last year’s MWD report on Hadoop and its family of tools, when people talk about Apache Hadoop they’re often referring to a whole framework of tools designed to facilitate distributed parallel processing of large datasets. That processing was traditionally confined to MapReduce batch jobs in Hadoop’s early days, though Hadoop 2 brought the YARN resource scheduler and opened up Hadoop to streaming, real-time querying and a wider array of analytical programming applications (beyond MapReduce).
Spark has been designed to run on top of Hadoop’s Distributed File System (amongst other data platforms) as an alternative to MapReduce – tuned for real-time streaming data processing and fast interactive queries, and with multi-genre analytics applicability (machine learning, time series, graph, SQL, streaming out-of-the-box). It gets that speed advantage by caching in-memory (rather than writing interim results to disk, as MapReduce does), but with that approach comes a need for higher-spec physical machines (compared with MapReduce’s tolerance for commodity hardware).
So, Spark isn’t about to replace Hadoop -- but it may well supplant MapReduce (especially in growing real-time use cases). Those “Spark vs Hadoop” headlines are about as meaningful as one proclaiming “mushrooms vs pizza." Yes, mushroom might be a more suitable topping than, say, pepperoni (especially in a vegetarian use case), but it’ll still be deployed on the same dough and tomato sauce pizza platform. Nobody’s about to suggest the mushroom should go it alone!
But what’s behind the headlines and the hype is a story of enterprise adoption – or at least vendors anticipating that adoption and investing in ‘the weaponization of Spark’ as it faces the more exacting standards of security, scaling performance, consistency, etc. which come with mainstream enterprise deployment.
Big names like IBM, Databricks (the company formed by the originators of Spark), and MapR made commitments in and around the Spark Summit earlier this month. MapR has announced three new Quick Start Solutions for its Hadoop distribution to help customers get started with Spark in real-time security log analytics, genome sequencing, and time series analytics; Databricks’ cloud-hosted Spark platform (formerly known as Databricks Cloud) has become generally available; and IBM announced a raft of measures designed to give Spark a significant shot in the arm – it’s open sourcing its SystemML technology to bolster Spark’s machine learning capabilities, integrating Spark into its own analytics platforms, investing in Spark training and education, committing 3,500 of its researchers and developers to work on Spark-related projects, and offering Spark as a service on its Bluemix developer cloud.
Given the overlap with Databricks’ business model (of offering development, certification, and support for Spark), IBM’s intentions are likely to tread on some toes before long – but for now, at least, both companies are content to focus on the combined push benefiting the Spark community and its enterprise aspirations overall (though clearly IBM’s betting on all this investment buying it some influence over where Spark goes next).
It’s worth bearing in mind that not all its supporters champion Spark wholesale and all the interested parties tend to be interested in particular bits of Spark (as wide-ranging as it is) because of overlaps with their own preferred toolsets. For instance, although Spark supports many analytics genres, Cloudera focuses on its machine learning capabilities (as it has its own SQL-on-Hadoop tool in Impala), and MapR and Hortonworks also promote Drill and Hive as their favoured source of SQL-on-Hadoop. IBM’s support is focused on Spark’s machine learning and in-memory capabilities (hence the SystemML open sourcing news). In the face of such strong vendor preferences, how long before some of Spark’s current features fall away (or at least start to show the effects of being starved of as much care and feeding as is bestowed upon vendors’ favourite Spark components)?
The Spark community is at much the same place the Hadoop one was at a while back – it’s showing great promise and suitability in key growth workloads (in Spark’s case, such as real-time IoT applications). However, the product as it stands is too immature for many enterprise tastes. Cue enterprise software vendors stepping up to help grow Spark up fast. Their challenge though is to smooth out the edges without smothering what made it so interesting in the first place.
Opinions expressed by DZone contributors are their own.