Seven Steps To Deploy Kedro Pipelines on Amazon EMR

In this post, the author explains how to launch an Amazon EMR cluster and how to deploy a Kedro project to run a Spark job.

By Jo Stichbury · May. 30, 23 · Tutorial

This post explains how to launch an Amazon EMR cluster and deploy a Kedro project to run a Spark job.

Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform for applications built using open-source big data frameworks, such as Apache Spark, that process and analyze vast amounts of data with AWS.

1. Set up the Amazon EMR Cluster

One way to install Python libraries onto Amazon EMR is to package a virtual environment and deploy it. For the packaged environment to work on the cluster, it needs to be built on the same Amazon Linux 2 environment that Amazon EMR uses.

We packaged our dependencies on an Amazon Linux 2 base image using the following Dockerfile:

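Dockerfile
 
# Build on Amazon Linux 2 so the environment matches Amazon EMR
FROM --platform=linux/amd64 amazonlinux:2 AS base

RUN yum install -y python3

# Create the virtual environment and put it first on the PATH
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

COPY requirements.txt /tmp/requirements.txt

# Install venv-pack together with the project dependencies
RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install venv-pack==0.2.0 && \
    python3 -m pip install -r /tmp/requirements.txt

# Pack the virtual environment into a single archive
RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz

# Export only the packed archive as the build output
FROM scratch AS export
COPY --from=base /output/pyspark_deps.tar.gz /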


Note: This Dockerfile relies on Docker BuildKit (the --output option requires the BuildKit backend), so make sure BuildKit is available in your Docker installation.

Build the image from the directory that contains the Dockerfile and requirements.txt using the following command:

DOCKER_BUILDKIT=1 docker build --output <output-path> .

This will generate a pyspark_deps.tar.gz file at the <output-path> specified in the command above.

Use this command if your Dockerfile has a different name:

DOCKER_BUILDKIT=1 docker build -f Dockerfile-emr-venv --output <output-path> .
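
Before uploading the archive, it can be worth checking that it really contains a Python interpreter, since step 7 points PYSPARK_PYTHON at environment/bin/python. The following is a minimal sketch that assumes the archive stores its members relative to the environment root (so that bin/python sits at the top level); archive_has_python is a hypothetical helper name, not part of venv-pack.

```python
import tarfile


def archive_has_python(archive_path: str) -> bool:
    """Return True if the packed environment contains a bin/python entry."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # Some archives prefix member names with "./", so normalise them first.
        names = {name.removeprefix("./") for name in archive.getnames()}
    return "bin/python" in names


# Example usage (the path is a placeholder):
# print(archive_has_python("pyspark_deps.tar.gz"))
```

The sketch uses str.removeprefix, so it needs Python 3.9 or later.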

2. Set up CONF_ROOT

The kedro package command only packages the source code, yet the conf directory is essential for running any Kedro project. The conf directory therefore has to be made available to Kedro separately, and its location is controlled by setting CONF_ROOT.

By default, Kedro looks at the root conf folder for all its configurations (catalog, parameters, globals, credentials, logging) to run the pipelines, but this can be customised by changing CONF_ROOT in settings.py.

For Kedro versions < 0.18.5

  • Change CONF_ROOT in settings.py to the location where the conf directory will be deployed. This can be any path, e.g. ./conf or /mnt1/kedro/conf.

For Kedro versions >= 0.18.5

  • Use the --conf-source CLI parameter directly with kedro run to specify the path. CONF_ROOT need not be changed in settings.py.
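
The version split above can be expressed as a small helper that builds the arguments passed to kedro run. This is only a sketch: parse_version and kedro_run_args are hypothetical names introduced here, and the version parsing assumes a plain X.Y.Z version string.

```python
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a plain 'X.Y.Z' version string into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def kedro_run_args(kedro_version: str, pipeline: str, conf_dir: str) -> List[str]:
    """Build run arguments, adding --conf-source only where it is supported."""
    args = ["--pipeline", pipeline]
    if parse_version(kedro_version) >= (0, 18, 5):
        # From 0.18.5 the conf location can be passed on the command line.
        args += ["--conf-source", conf_dir]
    # For older versions the location has to come from CONF_ROOT in settings.py.
    return args
```

Comparing tuples of integers rather than strings matters here: as strings, "0.18.10" would sort before "0.18.5".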

3. Package the Kedro Project

Package the project by running the kedro package command from the root of your project folder. This creates a .whl file in the dist folder, which spark-submit later passes to the Amazon EMR cluster as --py-files so that the job can import the project's source code.
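
If the later steps are scripted, the wheel has to be located without hard-coding its versioned file name. A minimal sketch, assuming the dist folder holds exactly one wheel (find_wheel is a hypothetical helper):

```python
from pathlib import Path


def find_wheel(dist_dir: str) -> Path:
    """Return the single .whl file in dist_dir, failing loudly otherwise."""
    wheels = sorted(Path(dist_dir).glob("*.whl"))
    if len(wheels) != 1:
        raise FileNotFoundError(
            f"Expected exactly one wheel in {dist_dir}, found {len(wheels)}"
        )
    return wheels[0]
```

Failing when several wheels are present avoids silently uploading a stale build from an earlier kedro package run.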

4. Create .tar for conf

As described in step 2, the kedro package command only packages the source code, yet the conf directory is essential for running any Kedro project. It therefore needs to be deployed separately as a tar.gz file. Note that the archive must contain the contents of the conf folder, not the conf folder itself.

Use the following command to archive the contents of the conf directory (excluding local) and generate a conf.tar.gz file containing catalog.yml, parameters.yml and the other files needed to run the Kedro pipeline. The -C conf . arguments make tar change into conf before adding files, so the archive members do not carry a leading conf/ path. The file is passed to spark-submit through the --archives option, which unpacks the contents into a conf directory.

tar -czvf conf.tar.gz --exclude="local" -C conf .
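
The same packaging can be sketched with Python's standard tarfile module, which makes the point about archiving the contents rather than the folder explicit. pack_conf is a hypothetical helper, and the output file is assumed to be written outside the conf directory.

```python
import tarfile
from pathlib import Path


def pack_conf(conf_dir: str, output: str, exclude: str = "local") -> None:
    """Archive the contents of conf_dir (not the folder itself), skipping `exclude`."""
    root = Path(conf_dir)
    with tarfile.open(output, "w:gz") as archive:
        for entry in sorted(root.iterdir()):
            if entry.name == exclude:
                continue  # local configuration is not deployed
            # arcname drops the parent folder, so members start at base/ and so on.
            archive.add(str(entry), arcname=entry.name)


# Example usage (paths are placeholders):
# pack_conf("conf", "conf.tar.gz")
```

Because each top-level entry is added under its own name, unpacking the archive into a directory called conf yields conf/base/... rather than conf/conf/base/....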

5. Create an Entrypoint for the Spark Application

Create an entrypoint.py file that the Spark application will use to start the job. The entry point forwards its command-line arguments to the project's main function through main(sys.argv[1:]), so it can be invoked as follows:

python entrypoint.py --pipeline my_new_pipeline --params run_date:2023-02-05,runtime:cloud

This mimics the behaviour of kedro run. For local testing, the arguments can instead be hard-coded in a params list, as shown in the commented-out example in the file.

Python
 
import sys

from proj_name.__main__ import main

if __name__ == "__main__":
    # For local testing, a hard-coded list of arguments can be passed
    # instead of the command-line arguments. The example below runs
    # `my_new_pipeline` using `ThreadRunner` with a set of params:
    #
    # params = [
    #     "--pipeline",
    #     "my_new_pipeline",
    #     "--runner",
    #     "ThreadRunner",
    #     "--params",
    #     "run_date:2023-02-05,runtime:cloud",
    # ]
    # main(params)

    # Forward the command-line arguments, skipping the script name in sys.argv[0]
    main(sys.argv[1:])
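
To make the --params format used above concrete, here is a simplified illustration of how a string such as run_date:2023-02-05,runtime:cloud maps to key/value pairs. It is not Kedro's own parser, which may handle types and other separators differently; split_params is a hypothetical name and every value is kept as a string.

```python
from typing import Dict


def split_params(raw: str) -> Dict[str, str]:
    """Split 'key:value,key:value' into a dict, keeping all values as strings."""
    params: Dict[str, str] = {}
    for item in raw.split(","):
        # Split on the first colon only, so values may themselves contain colons.
        key, _, value = item.partition(":")
        params[key.strip()] = value.strip()
    return params
```

For the example command, this yields a run_date entry and a runtime entry, which is the information the pipeline receives as runtime parameters.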


6. Upload Relevant Files to S3

Upload the following artifacts to an S3 bucket that Amazon EMR can access, so that the Spark job can use them:

  • .whl file created in step 3
  • Virtual environment tar.gz created in step 1 (e.g. pyspark_deps.tar.gz)
  • .tar.gz file for the conf folder created in step 4 (e.g. conf.tar.gz)
  • entrypoint.py file created in step 5
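
The upload can be scripted as well. The sketch below takes an S3 client as a parameter and calls its upload_file method; it assumes the client was created with boto3.client("s3") and that credentials and bucket permissions are already configured. upload_artifacts is a hypothetical helper, and placing every file at the bucket root mirrors the s3://{S3_BUCKET}/<file> layout used in step 7.

```python
from pathlib import Path
from typing import Iterable, List


def upload_artifacts(s3_client, bucket: str, paths: Iterable[str]) -> List[str]:
    """Upload each local file to the bucket root and return the resulting S3 URIs."""
    uploaded = []
    for path in paths:
        key = Path(path).name  # keep only the file name as the object key
        s3_client.upload_file(str(path), bucket, key)
        uploaded.append(f"s3://{bucket}/{key}")
    return uploaded


# Example usage (assumes boto3 is installed; names are placeholders):
# import boto3
# upload_artifacts(boto3.client("s3"), "my-bucket",
#                  ["dist/<whl-file>.whl", "pyspark_deps.tar.gz",
#                   "conf.tar.gz", "entrypoint.py"])
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to exercise with a stub in tests.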

7. spark-submit to the Amazon EMR Cluster

Use the following spark-submit command as a step on Amazon EMR running in cluster mode. A few points to note:

  • pyspark_deps.tar.gz is unpacked into a folder named environment.
  • Environment variables are set to refer to the libraries unpacked in the environment directory above, e.g. PYSPARK_PYTHON=environment/bin/python.
  • The conf archive is unpacked into the folder named after the # symbol (conf in s3://{S3_BUCKET}/conf.tar.gz#conf).

How Kedro locates the conf folder depends on the Kedro version:

  • Kedro versions < 0.18.5: The folder name after the # symbol should match CONF_ROOT in settings.py.
  • Kedro versions >= 0.18.5: You could follow the same approach as earlier. However, Kedro now lets you provide the conf location through the --conf-source CLI parameter instead of setting CONF_ROOT in settings.py. The --conf-source option can therefore be specified directly in the CLI parameters, and step 2 can be skipped completely.
Shell
 
spark-submit \
    --deploy-mode cluster \
    --master yarn \
    --conf spark.submit.pyFiles=s3://{S3_BUCKET}/<whl-file>.whl \
    --archives=s3://{S3_BUCKET}/pyspark_deps.tar.gz#environment,s3://{S3_BUCKET}/conf.tar.gz#conf \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
    --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
    --conf spark.yarn.appMasterEnv.<env-var-here>={ENV} \
    --conf spark.executorEnv.<env-var-here>={ENV} \
    s3://{S3_BUCKET}/entrypoint.py --env base --pipeline my_new_pipeline --params run_date:2023-03-07,runtime:cloud
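
When the job is added as a step programmatically, the same command can be expressed as a step definition. The sketch below builds the dictionary in the shape expected by the add_job_flow_steps call of boto3's EMR client, using command-runner.jar to run spark-submit. build_emr_step, the step name and the ActionOnFailure value are assumptions made for illustration, the entry point is the entrypoint.py file uploaded in step 6, and the optional extra environment variables are left out for brevity.

```python
from typing import Dict, List


def build_emr_step(bucket: str, wheel_name: str, run_args: List[str],
                   name: str = "kedro-spark-job") -> Dict:
    """Build an EMR step definition that runs spark-submit in cluster mode."""
    archives = (
        f"s3://{bucket}/pyspark_deps.tar.gz#environment,"
        f"s3://{bucket}/conf.tar.gz#conf"
    )
    spark_submit = [
        "spark-submit",
        "--deploy-mode", "cluster",
        "--master", "yarn",
        "--conf", f"spark.submit.pyFiles=s3://{bucket}/{wheel_name}",
        "--archives", archives,
        # Point the driver and the executors at the unpacked virtual environment.
        "--conf", "spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python",
        "--conf", "spark.executorEnv.PYSPARK_PYTHON=environment/bin/python",
        f"s3://{bucket}/entrypoint.py",
    ]
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": spark_submit + list(run_args),
        },
    }


# Example usage (assumes boto3 is installed; the cluster id is a placeholder):
# import boto3
# step = build_emr_step("my-bucket", "<whl-file>.whl",
#                       ["--env", "base", "--pipeline", "my_new_pipeline"])
# boto3.client("emr").add_job_flow_steps(JobFlowId="<cluster-id>", Steps=[step])
```

Keeping the arguments in a list avoids shell quoting issues with the # and comma characters in the --archives value.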


Summary

This post describes the sequence of steps needed to deploy a Kedro project to an Amazon EMR cluster.

  • Set up the Amazon EMR cluster
  • Set up CONF_ROOT (optional for Kedro versions >= 0.18.5)
  • Package the Kedro project
  • Create a .tar for conf
  • Create an entrypoint for the Spark application
  • Upload relevant files to S3
  • spark-submit to the Amazon EMR cluster

Kedro supports a range of deployment targets, including Amazon SageMaker, Databricks, Vertex AI and Azure ML, and our documentation additionally includes a range of approaches for single-machine deployment to a production server.


Published at DZone with permission of Jo Stichbury. See the original article here.
