Quickstart: Apache Spark on Kubernetes
See how to run Apache Spark Operator on Kubernetes.
The Apache Spark Operator for Kubernetes
Since its launch by Google in 2014, Kubernetes has gained a lot of popularity alongside Docker itself, and since 2016 it has become the de facto container orchestrator, established as a market standard, with cloud-managed versions available in all the major clouds (including DigitalOcean and Alibaba Cloud).
With this popularity came various implementations and use cases of the orchestrator, among them the execution of stateful applications, including databases, using containers.
What would be the motivation to host an orchestrated database? That’s a great question. But let’s focus on the Spark Operator running workloads on Kubernetes.
The idea of a native Spark Operator came up in 2016; before that, you couldn’t run Spark jobs natively on Kubernetes except through some hacky alternatives, like running Apache Zeppelin inside Kubernetes or creating your Apache Spark cluster inside Kubernetes (there is an example in the official Kubernetes organization on GitHub) with the Spark workers referenced in standalone mode.
However, native execution is far more interesting, as it takes advantage of the Kubernetes scheduler, which is responsible for allocating resources, providing elasticity and a simpler interface to manage Apache Spark workloads.
Considering that, the Apache Spark Operator development got attention, and it was merged and released into Spark version 2.3.0, launched in February 2018.
If you’re eager to read more about the Apache Spark proposal, you can head to the design document published on Google Docs.
As companies currently seek to reinvent themselves through the much-discussed digital transformation in order to be competitive and, above all, to survive in an increasingly dynamic market, it is common to see approaches that include Big Data, Artificial Intelligence, and Cloud Computing.
An interesting comparison between the benefits of using cloud computing instead of on-premises servers in the context of Big Data can be read on the blog of Databricks, the company founded by the creators of Apache Spark.
As we see widespread adoption of cloud computing (even by companies that could afford the hardware and run on-premises), we notice that most of these cloud implementations don’t include Apache Hadoop, since the data teams (BI/Data Science/Analytics) increasingly choose tools like Google BigQuery or AWS Redshift. Therefore, it doesn’t make sense to spin up a Hadoop cluster with the sole intention of using YARN as the resource manager.
To better understand the design of Spark Operator, the doc from GCP on GitHub is a no-brainer.
Let’s Get Hands-On!
Warming up the Engine
Now that the word has been spread, let’s get our hands on it to show the engine running. For that, let’s use:
- Docker as the container engine for Kubernetes (installation guide);
- Minikube (installation guide) to facilitate the provisioning of the Kubernetes (yes, it will be a local execution);
- kubectl installed, to interact with the Kubernetes API (if you don’t have it, follow the instructions here);
- a compiled version of Apache Spark, 2.3.0 or newer.
Once the necessary tools are installed, it’s necessary to include the Apache Spark path in the PATH environment variable, to ease the invocation of the Apache Spark executables. Simply run:
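A minimal sketch, assuming Spark was extracted to a directory like the one below (the path and version are illustrative; adjust them to your download):

```shell
# Assumption: Spark 2.4.0 extracted under the home directory; adjust as needed
export SPARK_HOME="$HOME/spark-2.4.0-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$PATH"
```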
Creating the Minikube “cluster”
At last, to have a Kubernetes “cluster”, we will start minikube, with the intention of running an example from the Spark repository called SparkPi, just as a demonstration.
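Something along these lines; the resource values below are only a suggestion, since the Spark driver and executor pods need more memory than the Minikube defaults:

```shell
# Start a local single-node Kubernetes cluster with enough room for Spark pods
minikube start --memory=4096 --cpus=4
```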
Building the Docker image
Let’s use the Minikube Docker daemon to not depend on an external registry (and only generate Docker image layers on the VM, facilitating garbage disposal later). Minikube has a wrapper that makes our life easier:
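The wrapper exports the environment variables that point the local Docker client at the daemon inside the Minikube VM:

```shell
# Point the local Docker CLI at Minikube's Docker daemon
eval $(minikube docker-env)
```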
After having the daemon environment variables configured, we need a Docker image to run the jobs. There is a shell script in the Spark repository to help with this. Considering that our PATH was properly configured, just run:
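The script ships in the Spark distribution’s bin directory; the image tag below is arbitrary:

```shell
# Build the Spark Docker image inside Minikube's daemon (-m) and tag it "latest"
$SPARK_HOME/bin/docker-image-tool.sh -m -t latest build
```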
The -m parameter here indicates a Minikube build.
Let’s take the highway and execute SparkPi, using the same spark-submit command that would be used for a Hadoop Spark cluster.
Fire in the Hole!
Mind the gap between the Scala version and the .jar name when you’re parameterizing with your Apache Spark version:
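A sketch of the submission, assuming Spark 2.4.0 built with Scala 2.11 and an image named spark:latest from the previous build step (adjust the versions and image name to your build):

```shell
# Submit the SparkPi example to the Kubernetes API exposed by Minikube;
# the jar path is the default location inside the image built by
# docker-image-tool.sh (version suffixes depend on your Spark build)
spark-submit \
  --master k8s://https://$(minikube ip):8443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=spark:latest \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
```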
What’s new is:
- --master: accepts a k8s:// prefix in the URL of the Kubernetes master API endpoint, exposed here by https://$(minikube ip):8443. BTW, in case you want to know, $(minikube ip) is a shell command substitution;
- --conf spark.kubernetes.container.image=: configures the Docker image that will run in Kubernetes.
To see the job result (and the whole execution), we can run kubectl logs, passing the name of the driver pod as a parameter:
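The driver pod name is generated by Spark, so one way is to look it up first; the label selector below relies on the spark-role=driver label that Spark applies to driver pods:

```shell
# List the pods to spot the driver, then fetch its logs via the label selector
kubectl get pods
kubectl logs $(kubectl get pods -l spark-role=driver -o jsonpath='{.items[0].metadata.name}')
```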
This brings the whole execution output (some entries omitted here); the result of the computation appears among the driver log lines, in the form "Pi is roughly 3.1...".
Finally, let’s delete the VM that Minikube generated, to clean up the environment (unless you want to keep playing with it):
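One command tears everything down:

```shell
# Delete the Minikube VM and everything running in it
minikube delete
```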
I hope your curiosity got sparked and that some ideas for further development of your Big Data workloads have been raised. If you have any doubts or suggestions, don’t hesitate to share them in the comments section.
Published at DZone with permission of Matheus Cunha. See the original article here.