
Running Apache Spark on Kubernetes

This article covers running Apache Spark on Kubernetes (K8s) to overcome dependency on specific cloud providers.

By Ramiro Alvarez Fernandez · Jul. 26, 21 · Tutorial


For the last few weeks, I’ve been deploying a Spark cluster on Kubernetes (K8s). I want to share the challenges, architecture, and solution details I’ve discovered with you.

Challenges

At Empathy, all code running in production must be cloud-agnostic. As of this publication date, Empathy has overcome its previous dependency on cloud providers, which came from using each provider’s managed Spark solution: EMR on AWS, Dataproc on GCP, and HDInsight on Azure.

These provider-specific solutions offer a simple way to deploy Spark in the cloud. However, limitations arise when a company scales up, leading to several key questions:

  • How do you orchestrate jobs?
  • How do you distribute the Spark job?
  • How do you schedule nightly jobs?
  • Where does the job configuration code live?
  • How can changes be propagated?
  • Can you reuse job definitions? Templates?
  • Can you reference the jobs through code?
  • Can you test from localhost?

These are common questions when trying to execute Spark jobs. Solving them with Kubernetes can save effort and provide a better experience.

Running Apache Spark on K8s offers us the following benefits:

  • Scalability: The new solution should be scalable for any needs.
  • Reliability: The new solution should monitor compute nodes and automatically terminate and replace instances in case of failure.
  • Portability: The new solution should be deployable in any cloud solution, avoiding dependency on a particular cloud provider. Overall, this approach saves time in thinking about orchestrating, distributing, and scheduling Spark jobs with the different cloud service providers.
  • Cost-effectiveness: You don’t need the cloud provider’s managed Spark service, so you save those costs.
  • Monitoring: The new solution should include ad-hoc monitoring.
  • K8s ecosystem: Spark shares a common ecosystem with your other workloads and offers continuous deployment, RBAC, dedicated node pools, autoscaling, etc.

The benefits are the same as Empathy’s solution for Apache Flink running on Kubernetes, as I explored in my previous article.

Apache Spark on Kubernetes

Apache Spark is a unified analytics engine for large-scale data processing, particularly handy for distributed processing. It is widely used for machine learning and is currently one of the biggest trends in technology.

Apache Spark Architecture

Spark Submit can be used to submit a Spark Application directly to a Kubernetes cluster. The flow would be as follows:

  1. Spark Submit is sent from a client to the Kubernetes API server on the master node.
  2. Kubernetes schedules a new Spark Driver pod.
  3. The Spark Driver pod communicates with Kubernetes to request Spark executor pods.
  4. Kubernetes schedules the new executor pods.
  5. Once the new executor pods are running, Kubernetes notifies the Spark Driver pod that the new Spark executor pods are ready.
  6. The Spark Driver pod schedules tasks on the new Spark executor pods.

Spark Submit Flowchart

You can schedule a Spark Application using Spark Submit (vanilla way) or using Spark Operator.

Spark Submit

Spark Submit is a script used to submit a Spark Application and launch it on the Spark cluster (an example invocation against a Kubernetes cluster is sketched after the list below). Some nice features include:

  • Kubernetes version: Not dependent on Kubernetes version.
  • Native Spark: It’s included in the Spark image.
  • Non-declarative setup: You need to plan how to orchestrate jobs yourself.
  • Define K8s resources needed: You have to mount ConfigMaps and volumes, set anti-affinity, nodeSelectors, etc. yourself.
  • CRD not needed: A Kubernetes custom resource is not needed.
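
To make the vanilla path concrete, here is a minimal sketch of submitting the bundled SparkPi example to a Kubernetes cluster with Spark Submit. The API server address, registry, namespace, and service account are placeholders for your own environment:

```bash
# Minimal sketch: submit the bundled SparkPi example in cluster mode on Kubernetes.
# <k8s-apiserver>, <registry>, the namespace, and the service account are placeholders.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.container.image=<registry>/spark:3.1.1 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.executor.instances=2 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
```

Everything after this point (orchestration, scheduling, retries) is up to you, which is exactly the non-declarative setup mentioned above.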

Spark Operator

The Spark Operator project was developed by Google and is now an open-source project. It uses Kubernetes custom resources for specifying, running, and surfacing the status of Spark Applications. Some nice features include:

  • Declarative: Application specification and management through custom resources.
  • Planned restarts: Configurable restart policy.
  • K8s resources automatically defined: Supports mounting ConfigMaps and volumes, setting pod affinity, etc.
  • Dependency injection: Inject dependencies directly.
  • Metrics: Supports collecting and exporting application-level metrics and driver/executor metrics to Prometheus.
  • Open-source community: Everyone can contribute.
Spark Submit vs Spark Operator

The image above shows the main commands of Spark Submit vs Spark Operator.

Empathy’s solution prefers Spark Operator because it allows for faster iterations than Spark Submit, where you have to create custom Kubernetes manifests for each use case.
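
For comparison, with the operator a job is expressed declaratively as a SparkApplication custom resource. The manifest below is a minimal sketch based on the operator's v1beta2 API; the namespace, image, and service account are placeholders:

```yaml
# Minimal sketch of a SparkApplication custom resource handled by the Spark Operator.
# Namespace, image, and serviceAccount are placeholders for your environment.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: <registry>/spark:3.1.1
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
  sparkVersion: "3.1.1"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    instances: 2
    cores: 1
    memory: 512m
```

Applying the manifest with kubectl is enough; the operator watches SparkApplication objects and runs the spark-submit machinery for you.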

Solution Details

To solve the questions posed in the Challenges section, ArgoCD and Argo Workflows, both CNCF projects, can help. For instance, you can schedule your favorite Spark Application workloads from Kubernetes using ArgoCD to create Argo Workflows and define sequential jobs.

The flowchart would be as follows:

  • Define your changes in git.
  • ArgoCD syncs your git changes to your K8s cluster (for instance, creating an Argo Workflow template).
  • The Argo Workflows template allows you to customize inputs, reuse configurations for multiple Spark jobs, and create nightly jobs based on Argo Workflows (a sketch of such a template follows the flowchart).
Solution flowchart
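
As a sketch of how such a template could look, the WorkflowTemplate below wraps a SparkApplication in an Argo resource step and takes the application name and main class as parameters. All names, parameters, and the image are illustrative rather than taken from Empathy's repository:

```yaml
# Sketch of a parameterized Argo WorkflowTemplate that creates a SparkApplication.
# Names, parameters, and the image are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: spark-application-template
  namespace: spark-jobs
spec:
  entrypoint: submit-spark-app
  templates:
    - name: submit-spark-app
      inputs:
        parameters:
          - name: app-name
          - name: main-class
      resource:
        action: create
        # Wait until the Spark Operator reports a terminal state.
        successCondition: status.applicationState.state == COMPLETED
        failureCondition: status.applicationState.state == FAILED
        manifest: |
          apiVersion: sparkoperator.k8s.io/v1beta2
          kind: SparkApplication
          metadata:
            generateName: "{{inputs.parameters.app-name}}-"
          spec:
            type: Scala
            mode: cluster
            image: <registry>/spark:3.1.1
            mainClass: "{{inputs.parameters.main-class}}"
            mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
            sparkVersion: "3.1.1"
            driver:
              cores: 1
              memory: 512m
              serviceAccount: spark
            executor:
              instances: 2
              cores: 1
              memory: 512m
```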

ArgoCD

ArgoCD is a GitOps continuous delivery tool for Kubernetes. The main benefits are:

  • GitOps: Using git repositories as a source of truth for defining the desired application state.
  • Declarative setup: Everything on git!
  • Traceability and automation: Application deployments can track updates to branches, tags, etc., and are automated for the specific target environments.
  • Web UI: Good-looking UI to check the workloads deployed.
  • K8s manifests: Kustomize, Helm, ksonnet, jsonnet, etc. Choose your fighter!

More detailed information can be found in their official documentation.
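
To ground this, a minimal ArgoCD Application that keeps a git folder of workflow manifests in sync with the cluster could look like the sketch below; the repository URL, path, and namespaces are placeholders:

```yaml
# Sketch of an ArgoCD Application syncing Spark workflow manifests from git.
# repoURL, path, and namespaces are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: spark-workflows
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<your-org>/spark-workflows.git
    targetRevision: main
    path: workflows
  destination:
    server: https://kubernetes.default.svc
    namespace: spark-jobs
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from git
      selfHeal: true   # revert manual drift back to the state in git
```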

Argo Workflows

Argo Workflows is a workflow solution for Kubernetes. The main benefits are:

  • Job orchestration: This allows for orchestrating jobs sequentially or creating a custom DAG.
  • Schedule workflows: Cron native.
  • Spark Applications: Easily orchestrate Spark Applications on any Kubernetes cluster.
  • Workflow Template: Reuse templates for different use cases. Input can be parameterized.
  • WebUI: Great visual UI to check the workflows’ progress.

More detailed information can be found in their official documentation.
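
For the nightly jobs mentioned earlier, a CronWorkflow can reference the WorkflowTemplate sketched in the Solution Details section and pass job-specific parameters. The schedule and values below are illustrative:

```yaml
# Sketch of an Argo CronWorkflow that runs the earlier WorkflowTemplate every night.
# Schedule and parameter values are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: nightly-spark-pi
  namespace: spark-jobs
spec:
  schedule: "0 2 * * *"        # every day at 02:00
  concurrencyPolicy: Forbid    # skip a run if the previous one is still in progress
  workflowSpec:
    workflowTemplateRef:
      name: spark-application-template
    arguments:
      parameters:
        - name: app-name
          value: spark-pi
        - name: main-class
          value: org.apache.spark.examples.SparkPi
```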

Monitoring

Once Prometheus scrapes the metrics, some Grafana dashboards are needed. The custom Grafana dashboards for Apache Spark are based on the following community dashboards:

  • ArgoCD Dashboard
  • Argo Workflow Dashboard
  • Apache Spark Operator Dashboard
  • Apache Spark Applications Dashboard
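
For the metrics to reach Prometheus in the first place, the Spark Operator's SparkApplication spec supports a monitoring section that attaches the JMX Prometheus exporter to the driver and executors. A minimal sketch, assuming the exporter jar is baked into the Spark image at the path shown:

```yaml
# Sketch of the monitoring section of a SparkApplication (Spark Operator).
# The jar path assumes an image with the JMX Prometheus exporter included.
spec:
  monitoring:
    exposeDriverMetrics: true
    exposeExecutorMetrics: true
    prometheus:
      jmxExporterJar: /prometheus/jmx_prometheus_javaagent-0.11.0.jar
      port: 8090
```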

To Sum Up

Empathy chose Spark Operator, ArgoCD, and Argo Workflows to build its Spark Application workflow solution on Kubernetes, using GitOps to propagate the changes. The setup illustrated in this article has been used in production for about one month, and the feedback is great! Everyone is happy with having a single workflow that is valid for any cloud provider, getting rid of individual cloud provider solutions.

To test it for yourself, follow these hands-on samples and enjoy deploying some Spark Applications from localhost, with all the setup described in this guide: Hands-on Empathy Repo.

I’ve also drawn upon my presentation for Kubernetes Days Spain 2021.

Though the journey was long, we’ve learned a lot along the way. I hope our innovations will help you become more cloud-agnostic too.

References

  • Spark Operator
  • Running Spark on Kubernetes
  • Implementing and Integrating Argo Workflow and Spark on Kubernetes
  • Argo Project
  • Scaling Apache Spark on Kubernetes
  • Spark Docker Image
  • Optimizing Spark Performance on Kubernetes
  • Spark on Kubernetes with Argo and Helm — GoDataDriven
  • Amazon EKS Spark ETL Workloads
  • Migrating Spark Workloads from EMR to K8s
  • Kubernetes Workflows for BigData, AI/DL
  • Hands-on Empathy Repo: Spark on Kubernetes

Published at DZone with permission of Ramiro Alvarez Fernandez. See the original article here.

Opinions expressed by DZone contributors are their own.

