Apache Spark™ is a general data processing framework, which is getting popular due to its fast data model and its flexible execution engine compared to MapReduce. In fact, Spark is becoming an essential technology for data analytics and this has become even more evident by the fact that all the top three Hadoop distributions are now including Spark. We recognized the importance of Spark early on and therefore integrated it with the Cask Data Application Platform (CDAP), a 100% open source, enterprise-ready application and integration platform, since version 2.5, which was released more than 18 months ago. Recently, with the release of CDAP version 3.4, we have much improved Spark integration in order to provide a new seamless API for Scala and Java, Spark Streaming and SparkSQL support, fine grained transaction support with Apache Tephra™ (incubating) and many other enhancements in the runtime system.
While Spark by itself is a great tool for data analytics, CDAP adds the necessary components for creating an enterprise-grade, production-ready data platform around Spark. First, let us take a look at what a typical enterprise data platform looks like:
An enterprise data platform usually consists of data pipelines for performing ETL (Extract-Transform-Load) processes to collect, clean up and store data in the cluster in some standardized form for later consumption. Once the data have landed on the cluster, different processes such as Spark can start consuming those datasets and generating new result datasets. The platform usually provides different ways to access the data, such as ad-hoc SQL queries, programmatic API or web services. Typically, there are also system-wide services, such as security, data lineage, auditing and transaction support, which apply to all data and applications that are managed by the platform. With CDAP, a lot of these processes and services are provided through the platform itself, and through extensions like Cask Hydrator and Cask Tracker.
Starting with CDAP 3.4, we have added support for running self-contained Spark programs in CDAP without any modification. This allows any existing Spark program to enjoy all the benefits provided by CDAP by simply deploying and running the Spark program from within CDAP.
When running Spark programs in CDAP, the first obvious benefit is the logs and metrics collection. Logs and metrics for Spark programs are collected in real time. It includes those emitted from the driver process as well as from all executors running in a distributed fashion. Being able to view and monitor these details in real time while the program is running, and also having the ability to query historical data is not only very important for developers, but also for DevOps staff. Just imagine how much time this feature can save you when debugging as opposed to having to go through the logs for each individual container across the cluster.
Another enhancement that we made to Spark in CDAP is the integration with Apache Tephra and CDAP Datasets to provide data integrity and data abstraction at the framework level. Tephra provides globally consistent ACID transactions for distributed data stores, such as Apache HBase, so that you don’t have to worry about running into data inconsistency in case of system failure. By default, whenever a transactional dataset is accessed, a new transaction will be created implicitly to cover the execution of the Spark job that is created for the Spark action. For example, here is how to copy a table transactionally:
In case you need a transaction that covers more than one Spark action, you can do so by creating an explicit transaction. For example, to copy a table to two tables in the same transaction:
The integration of Tephra and CDAP Datasets with Spark removes the burden from the developer to handle the complex details of the underlying storage and data consistency. For example, developers no longer need to worry about dirty reads and partial writes when two or more applications are accessing the same dataset concurrently. Also, with transactional datasets supported in CDAP, it becomes trivial to do exactly once processing in Spark Streaming.
Another benefit of using CDAP Datasets in Spark is the automatic capturing of metadata , as well as audit trail and data lineage analysis through Cask Tracker, another CDAP extension. It provides visibility into how datasets were created, used and accessed by Spark programs. Tracker is the perfect tool to give you insights on the metadata of your data.
With CDAP 3.4, we have also enhanced Cask Hydrator, a CDAP extension, to support Spark compute and the Spark Machine Learning library (MLlib) as part of the data pipeline, which can be created using a drag and drop user interface in Hydrator. With this feature, your experienced Spark developers can now create reusable modules and make them available through Hydrator, so that data scientists now can enjoy the power of Spark, too, without having to be experts on any of the low level details of a Spark cluster.
Besides the new API, Tephra integration and Hydrator enhancements, there are various improvements to make the experience of running Spark in CDAP even better. For example, we switched to use yarn-cluster mode for launching Spark programs in the Hadoop YARN cluster for better resource isolation between Spark programs that run within the same Workflow fork. The SDK implementation was also improved so that multiple Spark programs can run concurrently in the same JVM. During the integration process, we discovered and patched a couple Spark platform bugs (SPARK-13441, SPARK-14513) as well to fix runtime failure and memory leakage and contributed it back to the Spark open source community.
At Cask, we always focus on making big data technology easy to use and accessible. By integrating Spark with CDAP and making it more accessible through the graphical interfaces in Hydrator and Tracker, we are doing our part to make Spark more powerful and easier to use at the same time. We will continue to embrace Spark and keep improving the integration.