Cloudera announced the availability of an Apache Spark 2.0 Beta release for users of the Cloudera platform.
- The Dataset API further enhances Spark’s claim as the best tool for data engineering by providing compile-time type safety along with the benefits of a query-optimization engine.
- The Structured Streaming API enables the modeling of streaming data as a continuous DataFrame and expresses operations on that data with a SQL-like API.
- Spark 2.0 offers a richer collection of ML algorithms, as well as the ability to persist models and pipelines.
The Spark 2.0 Beta is available in the form of a Cloudera Manager add-on service. Add-on services are separate, standalone components from Cloudera or its ISV partners that can utilize Cloudera Manager’s distribution, configuration, monitoring, resource-management, and lifecycle-management features. Thus, on any Cloudera Manager-managed cluster with the CDH parcel installed, the beta can be deployed “side-by-side” with Spark 1.6 and treated like any other service. This initial beta release (2.0 Beta 1) is compatible with the CDH 5.7.x line, 5.8.x line, and soon-to-be-released 5.9.x line (and requires Scala 2.11); see the docs for more details.
To activate the beta, you should simply upload the Spark 2.0 Beta Custom Service Descriptor (CSD) file, which is available here, to Cloudera Manager. The CSD file contains all the configuration metadata needed to describe and manage the Spark 2.0 Beta in Cloudera Manager, including the URL of the relevant repository for parcel installation and deployment.
Installing the Spark 2.0 Beta CSD
- Download and save the Spark 2.0 Beta CSD file to your desktop.
- Log in to the Cloudera Manager Server host, and upload the CSD file to /opt/cloudera/csd (or to whatever other location you may have configured for CSD files).
- Set the file ownership to cloudera-scm:cloudera-scm with permission 644.
- Restart the Cloudera Manager Server with:
  service cloudera-scm-server restart
- Log in to the Cloudera Manager Admin Console and restart the Cloudera Management Service.
  - Do either of the following:
    - Select Clusters -> Cloudera Management Service -> Cloudera Management Service, and then select Actions -> Restart. Or:
    - On the Home -> Status tab, open the drop-down menu to the right of “Cloudera Management Service” and select Restart.
  - The Command Details window shows the progress of stopping and then starting the roles. When the message “Command completed with n/n successful subcommands” appears, the task is complete. Click Close.
- You should now see the Spark 2.0 Beta in your “Parcels” list, and from there, it can be downloaded, distributed, and activated/deactivated as needed.
- After deploying the parcel, create a “spark2 service” from the Cluster dropdown.
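The file-handling steps in the list above (copying the CSD into place, then setting its ownership and permission) can be sketched as a small shell function. This is a minimal sketch, not an official Cloudera script: the function name install_csd and the example file name used below are hypothetical, while the default directory, owner, and mode are the ones given in the steps. It assumes you run it as a user allowed to write to the CSD directory and change ownership (typically root).

```shell
#!/bin/sh
# Hypothetical helper, not part of Cloudera Manager.
# Usage: install_csd SRC [CSD_DIR] [OWNER]
#   SRC     path to the downloaded CSD file
#   CSD_DIR directory Cloudera Manager reads CSDs from
#           (defaults to /opt/cloudera/csd, as in the steps above)
#   OWNER   user:group to own the file (defaults to
#           cloudera-scm:cloudera-scm; pass "" to skip chown)
install_csd() {
    src=$1
    csd_dir=${2:-/opt/cloudera/csd}
    owner=${3-cloudera-scm:cloudera-scm}
    dest="$csd_dir/$(basename "$src")"

    # Copy the CSD into place and make it world-readable (644).
    cp "$src" "$dest" || return 1
    chmod 644 "$dest" || return 1

    # Hand the file to the Cloudera Manager service account, if requested.
    if [ -n "$owner" ]; then
        chown "$owner" "$dest" || return 1
    fi

    # Report where the file ended up.
    echo "$dest"
}
```

For example, install_csd ~/Desktop/spark2-beta-csd.jar (a placeholder file name) followed by service cloudera-scm-server restart would cover the upload, ownership, permission, and server-restart steps. Restarting the Cloudera Management Service and activating the parcel still happen in the Admin Console as described above.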
Fired Up and Ready to Go
With that, you’re now ready to explore the Spark 2.0 Beta. Once the beta is installed on your CDH cluster, you can run Spark 2.0 and Spark 1.6 jobs side by side on the same cluster.
Keep in mind that, although no support is provided for beta releases, we strongly encourage you to test early and often in preparation for the upcoming GA release, because there are significant differences between Spark 2.0 and the Spark 1.x line. As usual, please provide any and all feedback about the beta via the Cloudera Community’s Beta Releases forum.
Anand Iyer is a Director of Product Management at Cloudera.
Mark Grover is a Software Engineer working on the Spark team at Cloudera. He is a committer and PMC member on Apache Sentry and has also contributed to Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume.