
Running Apache Spark Applications in Docker Containers

Even once your Spark cluster is configured and ready, you still have a lot of work to do before you can run an application on it. A Docker container can make that easier!

By Arseniy Tashoyan · Aug. 26, 17 · Tutorial

Apache Spark is a wonderful tool for distributed computations. However, some preparation steps are required on the machine where the application will be running. Assuming that you already have your Spark cluster configured and ready, you still have to do the following steps on your workstation:

  • Install the Apache Spark distribution, containing the necessary tools and libraries.

  • Install the Java Development Kit.

  • Install and configure an SCM client, such as Git.

  • Install and configure a build tool, such as SBT.

Then, you have to check out the source code from the repository, build the binary, and submit it to the Spark cluster using the special spark-submit tool. It should be clear now that one cannot simply run an Apache Spark application... right?
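
Spelled out as commands, the manual workflow looks roughly like this (the repository URL, jar path, and main class below are placeholders, and the exact build command depends on your project):

```shell
# Sketch of the manual workflow described above.
# Repository URL, jar path, and main class are placeholders.

# 1. Check out the source code
git clone https://github.com/mylogin/project.git
cd project

# 2. Build the application jar with SBT
sbt package

# 3. Submit the binary to the Spark cluster
spark-submit \
  --master spark://my.master.com:7077 \
  --class Main \
  target/scala-2.11/project_2.11-0.1.jar
```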

Wrong! If you have the URL of the application source code and URL of the Spark cluster, then you can just run the application.

Let’s confine the complex things to a Docker container: docker-spark-submit. This Docker image serves as a bridge between the source code and the runtime environment, covering all intermediate steps.

Running applications in containers provides the following benefits:

  • Zero configuration on the host machine, because the container has everything it needs.

  • A clean application environment, thanks to container immutability.

Here is an example of typical usage:

docker run \
  -ti \
  --rm \
  -p 5000-5010:5000-5010 \
  -e SCM_URL="https://github.com/mylogin/project.git" \
  -e SPARK_MASTER="spark://my.master.com:7077" \
  -e SPARK_DRIVER_HOST="host.domain" \
  -e MAIN_CLASS="Main" \
  tashoyan/docker-spark-submit:spark-2.2.0

The parameters SCM_URL, SPARK_MASTER, and MAIN_CLASS are self-explanatory. Other, less intuitive but important, parameters are as follows.

tashoyan/docker-spark-submit:spark-2.2.0

Choose the tag of the container image based on the version of your Spark cluster. In this example, Spark 2.2.0 is assumed.

-p 5000-5010:5000-5010

It is necessary to publish this range of network ports. The Spark driver program and Spark executors use these ports for communication.
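
For ports to stay within a published range, Spark has to be told which ports to bind; standard Spark properties can pin the driver-side ports. A sketch, assuming plain spark-submit (how to pass extra Spark conf through the docker-spark-submit image is documented in the project README):

```shell
# Pin driver-side ports into the published 5000-5010 range (sketch).
# spark.driver.port and spark.blockManager.port are standard Spark
# properties; spark.port.maxRetries lets Spark probe successive ports.
spark-submit \
  --conf spark.driver.port=5000 \
  --conf spark.blockManager.port=5001 \
  --conf spark.port.maxRetries=10 \
  --master spark://my.master.com:7077 \
  --class Main \
  my-app.jar
```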

-e SPARK_DRIVER_HOST="host.domain"

You have to specify the network address of the host machine where the container will be running. Spark cluster nodes should be able to resolve this address; it is necessary for communication between the executors and the driver program. For detailed technical information, see SPARK-4563.
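
Under the hood, this corresponds to the standard Spark property spark.driver.host. In plain spark-submit terms it would look like this (host name and master URL reuse the example values above):

```shell
# Equivalent plain spark-submit setting: tell executors how to reach
# the driver running on the host machine, not the container-internal
# address.
spark-submit \
  --conf spark.driver.host=host.domain \
  --master spark://my.master.com:7077 \
  --class Main \
  my-app.jar
```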

Detailed instructions, as well as some examples, are available on the project page on GitHub. There you can find:

  • How to run the application code from a custom Git branch or from a custom subdirectory.

  • How to supply data for your Spark application by means of Docker volumes.

  • How to provide custom Spark settings or application arguments.

  • How to run docker-spark-submit on a machine behind a proxy server.
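
For instance, supplying data through a Docker volume might look like the sketch below. The /data mount point is an assumption for illustration; the project README documents the paths and variables docker-spark-submit actually expects:

```shell
# Mount a host directory with input data into the container (sketch).
# The /data mount point is an assumption -- check the project README
# for the conventions docker-spark-submit actually uses.
docker run \
  -ti \
  --rm \
  -p 5000-5010:5000-5010 \
  -v /home/me/dataset:/data \
  -e SCM_URL="https://github.com/mylogin/project.git" \
  -e SPARK_MASTER="spark://my.master.com:7077" \
  -e SPARK_DRIVER_HOST="host.domain" \
  -e MAIN_CLASS="Main" \
  tashoyan/docker-spark-submit:spark-2.2.0
```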

To conclude, let me emphasize that docker-spark-submit is not intended for continuous integration. The intended usage is to let people quickly try Spark applications, saving them from configuration overhead. CI practices assume separate stages for building, testing, and deploying; docker-spark-submit does not follow these practices.


Published at DZone with permission of Arseniy Tashoyan. See the original article here.

Opinions expressed by DZone contributors are their own.
