
Running Apache Spark Applications in Docker Containers


Even once your Spark cluster is configured and ready, running an application still requires a fair amount of setup on your workstation. A Docker container can take care of all of it!


Apache Spark is a wonderful tool for distributed computations. However, some preparation is required on the machine where the application will run. Even assuming that your Spark cluster is already configured and ready, you still have to perform the following steps on your workstation:

  • Install the Apache Spark distribution, which contains the necessary tools and libraries.

  • Install the Java Development Kit.

  • Install and configure an SCM client, such as Git.

  • Install and configure a build tool, such as SBT.

Then, you have to check out the source code from the repository, build the binary, and submit it to the Spark cluster using the spark-submit tool. Spelled out, the manual procedure looks roughly like this (the repository URL, master URL, and class name are the same placeholders used in the Docker example below; the JAR path depends on your build configuration):
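
# Assumes the JDK, Git, SBT, and a matching Spark distribution are already installed
git clone https://github.com/mylogin/project.git
cd project
sbt package
$SPARK_HOME/bin/spark-submit \
  --master spark://my.master.com:7077 \
  --class Main \
  target/scala-2.11/project_2.11-0.1.jar

It should be clear by now that one cannot simply run an Apache Spark application... right?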

Wrong! If you have the URL of the application source code and the URL of the Spark cluster, then you can just run the application.

Let's confine all of that complexity to a Docker container: docker-spark-submit. This Docker image serves as a bridge between the source code and the runtime environment, covering all the intermediate steps.

Running applications in containers provides the following benefits:

  • Zero configuration on the machine because the container has everything it needs.

  • Clean application environment thanks to container immutability.

Here is an example of typical usage:

docker run \
  -ti \
  --rm \
  -p 5000-5010:5000-5010 \
  -e SCM_URL="https://github.com/mylogin/project.git" \
  -e SPARK_MASTER="spark://my.master.com:7077" \
  -e SPARK_DRIVER_HOST="host.domain" \
  -e MAIN_CLASS="Main" \
  tashoyan/docker-spark-submit:spark-2.2.0

The SCM_URL, SPARK_MASTER, and MAIN_CLASS parameters are self-explanatory. The other parameters are less intuitive but important; they are described below.

tashoyan/docker-spark-submit:spark-2.2.0

Choose the tag of the container image based on the version of your Spark cluster. In this example, Spark 2.2.0 is assumed.
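
For example, if your cluster runs Spark 2.1.0, you would use the matching tag instead (assuming such a tag is published; the project page lists the available tags):

# Hypothetical: pull the image matching a Spark 2.1.0 cluster
docker pull tashoyan/docker-spark-submit:spark-2.1.0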

-p 5000-5010:5000-5010

It is necessary to publish this range of network ports. The Spark driver program and Spark executors use these ports for communication.
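
Publishing the range works because the driver inside the container binds to fixed ports rather than random ones. As a sketch of the idea in plain spark-submit terms (these settings and values are an assumption for illustration, not the image's documented internals):

# Illustration only: pin driver-side ports to a known range so that
# they can be published from a container
spark-submit \
  --conf spark.driver.port=5000 \
  --conf spark.driver.blockManager.port=5001 \
  --conf spark.port.maxRetries=10 \
  ...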

-e SPARK_DRIVER_HOST="host.domain"

You have to specify the network address of the host machine where the container will run. Spark cluster nodes must be able to resolve this address; it is necessary for communication between the executors and the driver program. For detailed technical information, see SPARK-4563.
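
In practice, the host's fully qualified domain name usually works, as long as the cluster nodes can resolve it; an IP address reachable from the cluster nodes is also fine:

# Pass the host machine's FQDN; cluster nodes must be able to resolve it
# (other options are omitted here)
docker run ... \
  -e SPARK_DRIVER_HOST="$(hostname -f)" \
  ...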

Detailed instructions, as well as some examples, are available on the project page on GitHub. There you will find:

  • How to run the application code from a custom Git branch or from a custom subdirectory.

  • How to supply data for your Spark application by means of Docker volumes (a volume-mount sketch follows this list).

  • How to provide custom Spark settings or application arguments.

  • How to run docker-spark-submit on a machine behind a proxy server.
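
As a hedged illustration of the volume-based approach mentioned above (the local path and the /data mount point are placeholders; the README documents the exact conventions the image expects):

# Mount a local directory so the application can read input data
# from /data inside the container (paths are placeholders)
docker run \
  -ti \
  --rm \
  -p 5000-5010:5000-5010 \
  -v /home/me/datasets:/data \
  -e SCM_URL="https://github.com/mylogin/project.git" \
  -e SPARK_MASTER="spark://my.master.com:7077" \
  -e SPARK_DRIVER_HOST="host.domain" \
  -e MAIN_CLASS="Main" \
  tashoyan/docker-spark-submit:spark-2.2.0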

To conclude, let me emphasize that docker-spark-submit is not intended for continuous integration. CI practices assume separate stages for building, testing, and deploying, and docker-spark-submit does not follow them. Its purpose is to let people quickly try Spark applications, saving them from the configuration overhead.

