There is no doubt that Spark is one of the hottest (pun intended) technologies available to data engineers today. And the buzz on the exhibitor floor at the Hilton San Francisco for this year’s Spark Summit did not disappoint. All the major players who supply software and hardware tools for next-generation app development were present and accounted for, along with some intriguing newcomers to the market. The companies on the badges of attendees floating by skewed heavily toward early adopters, but some surprising traditional names made an appearance as well. This tells me that Spark usage is quite possibly ready to cross Geoffrey Moore’s chasm, if it hasn’t already.
Map/Reduce and Hadoop are already 10 years old, and in those 10 years, the proliferation of mobile devices has become the driving force behind most data scientists’ workload. There is exponentially more data coming in at faster and faster speeds, while consumers are becoming more accustomed to highly personalized experiences. Spark has become the standard for the faster and more flexible analytics that drive these applications. And with the cost of RAM dropping, technologies like Apache Ignite, with its easy integration with Spark, make for an easy 1-2 punch for bringing analytics from batch to true real time.
But why? If you think about it, Spark over HDFS creates a bottleneck at inception. Spark, with all its in-memory goodness, still has to get its data off impossibly slow disk. And once that memory-resident RDD is gone, you have to go back to turn-of-the-century spinning rust to get your answers. Why not architect your system from the get-go to take advantage of all that RAM has to offer? “But it is expensive and I have too much data!” Go check out some of the advancements from IBM and Intel. Within 2 years, you will wish you had an infrastructure architected for memory rather than disk.