Apache Spark has been open source’s new kid on the block. Companies are using Spark to develop sophisticated models that enable them to discover new opportunities or avoid risk. But what does the future, or at least the near future, hold for Spark? In this post, we outline five trends we see in the big data space and the role that Spark is expected to play in each.
1. There is going to be a lot more compute than storage in Hadoop, and that’s a good thing: Businesses use Apache Hadoop either to store new and existing data cost-effectively or to analyze data for new business insights. In the beginning, it was relatively easy for companies to justify their foray into Hadoop by rationalizing their data architecture, specifically by lowering the operating cost of their enterprise data warehouse: moving cold or warm data into Hadoop, or performing expensive data aggregations and transformations (ETL processes) there. This resulted in immediate savings for IT organizations, but it was also a lower value-add function that competitors could easily match.
As businesses begin to exhaust the benefits of cheap data storage, we expect them to shift their attention to extracting value from their data. This means that the processing power and memory dedicated to analyzing data will begin to outpace the resources dedicated to storing it. We expect Spark, along with other enabling technologies, to be front and center in this evolution. Spark is already emerging as the tool of choice for data scientists and business analysts, and we expect this trend to accelerate. When Spark-based workloads emerge as some of the leading analytical workloads in a company’s Hadoop cluster, that is a sign that the business is on its way to extracting insights from its data. We predict that in the next few years, leading organizations will have more IT resources devoted to compute than to storage.
2. Data on premises tends to be analyzed on premises, but not exclusively: As businesses begin to map their journey to Spark, one option for getting up and running quickly is to use a cloud-based implementation of Spark. However, this is a viable option mainly for smaller companies and startups whose data volume is still small enough to be moved into the cloud, or who started out storing their data in the cloud. For enterprises that have sizable data volumes or have invested in large data centers, moving their data into the cloud may not be practical. For this reason, we expect Spark implementations that work on existing or historical data to run largely on premises.
This distinction between small and large enterprises will blur in certain industries in the near future, as larger companies also consider running Spark in the cloud for net new data sources and use cases. For example, an automobile manufacturer might use a cloud implementation of Spark to analyze data streaming from a car’s sensors to determine its performance on the road, while an on-premises Spark cluster is used to analyze historical data and aggregated streaming data for longer-term design improvements. However, don’t expect financial services firms and companies in other regulated industries to move their data into the cloud in the near future.
3. Spark workloads will increasingly move into production: Spark started out as the playground of data scientists, where they could build sophisticated models using volumes and types of data that were not practical to work with in the past. By and large, Spark has been running in companies as part of clusters that are primarily used by data scientists for prototyping and rapid iteration. Hence the need for enterprise requirements such as security and data governance has been minimal. If enterprises are going to use Spark in production deployments to derive critical business insights, then it must have the same safeguards in place that are used to protect data in traditional data architectures, as well as in other Hadoop components. This means that access to data and models in Spark must be protected by authentication, authorization, policy administration and encryption built in at the platform layer. This approach ensures that security is applied consistently across the stack.
We predict that as companies move more of their Spark workloads into production, they will apply the same standards of enterprise-grade security and data governance to their Spark deployments as they do to their traditional data infrastructure.
4. Visualizing is believing: Traditional data analysis in companies evolved from data arranged in rows and columns, to static reports and dashboards, to interactive data visualizations that can be shared across the organization and that enable analysts and business users to ask questions of the data as they occur to them. We expect big data to follow the same trajectory. Until now, data scientists have been using Spark Core, exposed through an API, to write Java, Python, Scala, and R scripts. In the near future, we expect interactive browser-based notebooks to play a major role in enabling data engineers, data analysts and data scientists to develop, organize, execute and share code and visualize results without resorting to the command line or needing to know the cluster details.
5. Companies will increasingly be drinking their own Spark champagne: Companies have traditionally looked towards their customers to come up with applications for Spark. These include building models to place ads, predict the next purchase and stress-test the risk of financial instruments, to name just a few. However, this does not mean that customer-facing scenarios will be the sole domain of Spark. Big data also provides a number of opportunities to ease an organization’s own internal big data journey. Effective security administration and data management are two areas where an organization’s predictive models can go a long way towards making its big data initiative successful.
As organizations build their data lakes, those lakes come to contain millions of pieces of data. Organizations need an automated and dynamic way to predict which employee should be given access to which slice of data, based on that person’s roles and responsibilities. This will greatly enhance business agility and relieve system administrators, who currently grant access to data manually. The same goes for the organizational need to classify data automatically as it flows into the data lake, using either rule-based algorithms or the machine learning capabilities of Spark.
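To make the idea of automatic classification a little more concrete, here is a minimal sketch in plain Python of the rule-based end of that spectrum. Everything in it is an assumption made for illustration: the rule names, the patterns, the 0.8 threshold and the sample data are all hypothetical, and a real deployment would more likely run comparable logic over samples of each incoming column inside Spark, or replace the rules with a trained model.

```python
import re

# Hypothetical sensitivity rules as (label, pattern) pairs.
# The patterns are illustrative only and far from exhaustive.
RULES = [
    ("email", re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    ("us_ssn", re.compile(r"^\d{3}-\d{2}-\d{4}$")),
]


def classify_column(values, threshold=0.8):
    """Return the first label whose pattern matches at least `threshold`
    of the non-empty sample values, or "unclassified" otherwise."""
    sample = [v.strip() for v in values if v and v.strip()]
    if not sample:
        return "unclassified"
    for label, pattern in RULES:
        hits = sum(1 for v in sample if pattern.match(v))
        if hits / len(sample) >= threshold:
            return label
    return "unclassified"


def classify_dataset(columns):
    """Map each column name to a label, given {name: sample values}."""
    return {name: classify_column(vals) for name, vals in columns.items()}


# Made-up sample of a dataset arriving in the data lake.
incoming = {
    "contact": [
        "ann@example.com", "bob@example.org", "n/a",
        "eve@example.net", "kim@example.com", "lee@example.org",
    ],
    "tax_id": ["123-45-6789", "987-65-4321"],
    "city": ["Austin", "Oslo"],
}
labels = classify_dataset(incoming)
print(labels)
```

Labels like these could then feed the access policies discussed above, so that a column tagged as sensitive is only opened up to the roles that need it.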
Tell us your thoughts about the trends you see for Spark in the industry and in your own organizations.