Why Apache Spark?
We live in an era of “Big Data,” in which data of many types is generated at an unprecedented and still-accelerating pace. This data falls broadly into transactional data, social media content (such as text, images, audio, and video), and sensor feeds from instrumented devices.
But why pay attention to it at all? Because data is valuable for the decisions it enables.
Until a few years ago, only a handful of companies had the technology and money to invest in storing and mining huge amounts of data for invaluable insights. Everything changed when Yahoo open sourced Apache Hadoop in 2009, a disruptive move that considerably lowered the barrier to Big Data processing. Many industries, including health care, infrastructure, finance, insurance, telematics, consumer, retail, marketing, e-commerce, media, manufacturing, and entertainment, have since benefited tremendously from practical applications built on Hadoop.
Apache Hadoop provides two major capabilities:
- HDFS (Hadoop Distributed File System), a fault-tolerant way to store vast amounts of data inexpensively on horizontally scalable commodity hardware.
- The Map-Reduce computing paradigm, which provides programming constructs for mining data and deriving insights.
Figure 1 below illustrates how, in a typical Hadoop job, data is processed through a series of Map-Reduce steps, with the output of each step serving as the input to the next.
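To make the Map-Reduce model concrete, here is a toy, single-process word-count sketch in plain Python (an illustration of the paradigm only, not Hadoop's actual API): a map phase emits key/value pairs, a shuffle phase groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group the emitted values by key (word).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big insights", "big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'insights': 1}
```

In a real Hadoop job, the map and reduce phases run in parallel across the cluster, and the shuffle moves data between machines, with each step's output written to disk before the next step reads it.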
The intermediate results are stored on disk, which means that most Map-Reduce jobs are I/O bound rather than compute bound. This is not an issue for use cases such as ETL, data consolidation, and cleansing, where processing time is not much of a concern, but there are other Big Data use cases where processing time matters:
- Streaming data processing for near real-time analysis, for example, analyzing clickstream data to make video recommendations that enhance user engagement. Here we may trade off some accuracy for lower processing time.
- Interactive querying of large datasets, so that a data scientist can run ad-hoc queries against a data set.
Figure 2 below shows how Hadoop has grown into an ecosystem of specialized technologies catering to these use cases.
While we love the richness of choices among tools in the Hadoop ecosystem, there are several challenges that make the ecosystem cumbersome to use:
- A different technology stack is required for each type of use case, because solutions are rarely reusable across them.
- Proficiency in several technologies is required for productivity.
- Some technologies face version compatibility issues.
- The stack is ill-suited to fast data sharing across parallel jobs, since intermediate results must pass through disk.
These are the challenges that Apache Spark solves! Spark is a lightning-fast, in-memory cluster-computing platform that takes a unified approach to batch, streaming, and interactive use cases, as shown in Figure 3.
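The in-memory idea can be pictured with a toy analogy in plain Python (this is ordinary Python, not Spark's API): chained lazy generators pipeline each stage's output directly into the next stage in memory, much as Spark chains transformations without writing intermediate results to disk between stages, and nothing executes until a final aggregating step, much like Spark's lazy evaluation of transformations until an action is invoked.

```python
lines = ["big data big insights", "big data"]

# Each stage is a lazy generator; no intermediate result touches disk.
words   = (w for line in lines for w in line.split())  # stage 1: tokenize
upper   = (w.upper() for w in words)                   # stage 2: transform
lengths = (len(w) for w in upper)                      # stage 3: derive

# Only this final aggregation (the "action") drives the whole pipeline.
total = sum(lengths)
print(total)  # 25
```

The analogy is loose, of course: Spark additionally partitions the data across a cluster, schedules the stages in parallel, and can cache intermediate datasets in memory for reuse across jobs.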