Setting Up Dev Endpoint Using Apache Zeppelin With AWS Glue
What is AWS Glue? How can we leverage it to successfully set up a Dev Endpoint? How can we use Apache Zeppelin? Let's explore the set-up using Apache Zeppelin.
AWS Glue is a fully managed service hosted by AWS, which relieves you of the hassle of maintaining the infrastructure. It offers serverless ETL: Glue generates the ETL code in Python or Scala and executes it in a Spark environment.
AWS Glue provisions all the required resources (a Spark cluster) at runtime to execute the Spark job, which takes about 7-10 minutes before your actual ETL code starts executing. To reduce this wait, AWS Glue provides a development endpoint, which can be connected to Apache Zeppelin (provisioned with the Spark environment) so that you can interactively run, debug, and test ETL code before deploying it as a Glue job or scheduling the ETL process.
To set up the Dev Endpoint on AWS Glue successfully, let's first go over its prerequisites:
- An IAM role for the Glue Dev Endpoint with the necessary policies
- A table in the Glue Data Catalog and the necessary connection
- I am assuming that you know your way around VPC networking, security groups, etc. For example, the Dev Endpoint requires a security group that allows port 22, since we need that for the SSH tunneling.
Next, we can move on to the process of setting up the Dev Endpoint on AWS Glue. Here's how it goes:
- Create an SSH Key Pair
- You can create it using the PuTTYgen tool, or in the AWS console under EC2 -> Network & Security -> Key Pairs
- You need the public key. In OpenSSH format, it is a single line that starts with the key type (for example, ssh-rsa) followed by a long Base64 string.
- You will need the private key in .ppk format to connect with PuTTY. If you're using PuTTYgen, you will get it in .ppk format directly, whereas EC2 gives you a .pem file that you will have to convert into .ppk using PuTTYgen's conversion tool.
- Make sure that the file has permissions of 400 or 600 on Linux. On Windows, follow the steps below:
- Right-click the PEM file > Properties > Security > Advanced.
- Make sure that you are the owner of the file and that inheritance is disabled.
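On Linux, the permission step above can also be scripted. The following is a minimal Python sketch, not part of the original setup; the function names are hypothetical, and it assumes a POSIX file system where modes such as 400 are meaningful.

```python
import os
import stat


def restrict_key_permissions(key_path):
    """Make the private key readable by its owner only (mode 400)."""
    os.chmod(key_path, stat.S_IRUSR)


def has_safe_permissions(key_path):
    """Return True if the file mode is exactly 400 or 600."""
    mode = stat.S_IMODE(os.stat(key_path).st_mode)
    return mode in (0o400, 0o600)
```

For example, calling restrict_key_permissions on your downloaded .pem file and then has_safe_permissions on the same path should return True. SSH clients typically refuse private keys that other users can read, which is why the check is strict about the exact mode.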
- Add a Dev Endpoint in the AWS Glue console:
- Development endpoint name: Give it any name
- IAM Role: Select the role which you've created
- Click Next.
- Choose "Skip Networking Information" if your data stores are in S3; otherwise, select one of the other two options according to your instances or security groups. Click Next.
- SSH tunnel for Glue Dev Endpoint: When your dev endpoint is provisioned, check that its network interface has a public address attached to it and make note of it (e.g., ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com).
Then open an SSH tunnel to that address, so that local port 9007 is forwarded to the Dev Endpoint.
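With an OpenSSH client, the tunnel is a local port forward. The Python sketch below only builds and prints the command; it does not run it. The remote target 169.254.76.1:9007 and the user name glue are assumptions taken from the AWS Glue documentation as I recall it, so verify them against the SSH command shown on your endpoint's details page. The helper name and the key file name are hypothetical, and OpenSSH expects the private key in .pem (OpenSSH) format rather than .ppk.

```python
import shlex

# Assumed values: confirm both on your Dev Endpoint's details page.
REMOTE_FORWARD_TARGET = "169.254.76.1:9007"
REMOTE_USER = "glue"


def build_tunnel_command(private_key_path, endpoint_dns, local_port=9007):
    """Return the argv list for an OpenSSH local port forward to the Dev Endpoint."""
    return [
        "ssh",
        "-i", private_key_path,  # private key in OpenSSH/.pem format
        "-N",                    # do not execute a remote command
        "-T",                    # do not allocate a terminal
        "-L", f"{local_port}:{REMOTE_FORWARD_TARGET}",
        f"{REMOTE_USER}@{endpoint_dns}",
    ]


command = build_tunnel_command(
    "my-key.pem",  # hypothetical key file name
    "ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com",  # placeholder address
)
print(shlex.join(command))
```

Replace the placeholder address with the public address you noted, then run the printed command in a terminal and leave it open while you work in Zeppelin. The local port 9007 matches the port configured in the Spark interpreter settings.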
- Download Apache Zeppelin version 0.7.3
- Unzip the file and copy the folder to the C: drive
- Start Zeppelin and go to localhost:8080 in your browser
- In the top-right corner, click Anonymous > Interpreter > search for Spark > Edit
- Check "Connect to existing process"
- Set host to localhost and port to 9007
- Under properties, set master to yarn-client
- Remove the spark.executor.memory and spark.driver.memory properties if they exist
- Save and click OK
- Under Notebook, create a new note and select the Spark interpreter
- Try your program and execute it
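If the Spark interpreter cannot connect, a quick thing to rule out is whether the tunnel is actually listening on local port 9007. The following is a minimal sketch using only the Python standard library; the function name is hypothetical.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling port_is_open("localhost", 9007) checks the local end of the tunnel. The same function, called with the endpoint's public address and port 22, gives a rough check that the security group allows SSH from your machine.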
And with that, your AWS Glue Dev Endpoint is up and running with Apache Zeppelin. Obviously, this only scratches the surface. There is a lot more to learn, but I'll cover that in another blog.
Published at DZone with permission of Shubham Muneshwar. See the original article here.