Accessing Data Using JDBC on AWS Glue
Glue supports accessing data via JDBC, and using the DataDirect JDBC connectors, you can access many different data sources for use in AWS Glue.
AWS Glue is an Extract, Transform, Load (ETL) service available as part of Amazon's hosted web services. Glue is intended to make it easy for users to connect their data in a variety of data stores, edit and clean the data as needed, and load the data into an AWS-provisioned store for a unified view.
Glue supports accessing data via JDBC, and currently, the databases supported through JDBC are Postgres, MySQL, Redshift, and Aurora. Of course, JDBC drivers exist for many other databases besides these four. Using the DataDirect JDBC connectors, you can access many other data sources for use in AWS Glue.
This tutorial demonstrates accessing Salesforce data with AWS Glue, but the same steps apply with any of the DataDirect JDBC drivers.
Download DataDirect Salesforce JDBC Driver
Download the DataDirect Salesforce JDBC driver from the Progress DataDirect website.
To install the driver, execute the JAR package by running the following command in a terminal, or simply by double-clicking the JAR file.
java -jar PROGRESS_DATADIRECT_JDBC_SF_ALL.jar
This launches an interactive Java installer, which you can use to install the Salesforce JDBC driver to your desired location as either a licensed or evaluation installation.
Note that this installs the Salesforce JDBC driver along with several other DataDirect drivers, for trial purposes, in the same folder.
Upload DataDirect Salesforce Driver to Amazon S3
- Navigate to the install location of the DataDirect JDBC drivers and locate the DataDirect Salesforce JDBC driver file, named sforce.jar.
- Upload the sforce.jar file to Amazon S3.
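The upload can be done from the AWS console or with the AWS CLI. A minimal sketch, assuming a hypothetical bucket name your-glue-assets:

```shell
# Upload the DataDirect Salesforce driver JAR to S3.
# Bucket name and key prefix are placeholders -- substitute your own.
aws s3 cp sforce.jar s3://your-glue-assets/jars/sforce.jar

# Verify the upload.
aws s3 ls s3://your-glue-assets/jars/
```

Note the full S3 path of the JAR (e.g. s3://your-glue-assets/jars/sforce.jar); you will need it when configuring the job's dependent JARs below.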
Create Amazon Glue Job
Go to the AWS Glue Console in your browser and, under ETL > Jobs, click the Add Job button to create a new job.
Fill in the name of the job, and choose/create an IAM role that grants permissions to your Amazon S3 sources, targets, temporary directory, scripts, and any libraries used by the job. For this tutorial, we just need access to Amazon S3, since the JDBC driver is stored there and the destination is also S3.
Under This job runs, choose A new script to be authored by you.
Give your script a name and choose a temporary directory in S3 for the Glue job.
Under Script Libraries and job parameters (optional), for Dependent jars path, choose the sforce.jar file in your S3 bucket.
Click the Next button and Glue will ask whether you want to add any connections that might be required by the job. This tutorial doesn't need any connections, but if you plan to use another destination such as Redshift, SQL Server, or Oracle, you can create connections to those data sources in AWS Glue and they will show up here.
Click on Next, review your configuration, and click Finish to create the job.
You should now see an editor for writing the job's Python script. Here, you write custom Python code to extract data from Salesforce using the DataDirect JDBC driver and write it to S3 or any other destination.
You can use this code sample to get an idea of how to extract data from Salesforce with the DataDirect JDBC driver and write it to S3 in CSV format. Feel free to make any changes to suit your needs, then save the job.
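As a rough illustration, here is a minimal sketch of such a script. The connection URL format, driver class name, table name, credentials, and bucket paths are assumptions drawn from the DataDirect documentation and should be replaced with your own values; the Spark portion only runs inside a Glue/Spark environment with sforce.jar on the classpath.

```python
# Sketch of a Glue job script reading Salesforce over JDBC and writing CSV to S3.
# The driver class (com.ddtek.jdbc.sforce.SForceDriver), URL format, table name,
# and S3 paths are illustrative assumptions -- adapt them to your environment.

def jdbc_options(user, password,
                 url="jdbc:datadirect:sforce://login.salesforce.com;SecurityToken=YOUR_TOKEN"):
    """Build the option map passed to spark.read.format('jdbc')."""
    return {
        "url": url,
        "driver": "com.ddtek.jdbc.sforce.SForceDriver",  # DataDirect Salesforce driver class
        "dbtable": "SFORCE.ACCOUNT",                     # example table exposed by the driver
        "user": user,
        "password": password,
    }

def run_job(user, password, output_path="s3://your-glue-assets/output/salesforce/"):
    """Read the table over JDBC and write it to S3 as CSV.

    Only runnable inside a Glue/Spark environment with the driver JAR available.
    """
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("salesforce-to-s3").getOrCreate()
    df = spark.read.format("jdbc").options(**jdbc_options(user, password)).load()
    df.write.mode("overwrite").option("header", "true").csv(output_path)
```

Keeping the connection options in a small helper like jdbc_options makes it easy to point the same script at a different DataDirect driver by swapping the URL and driver class.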
Run Glue Job
- Click the Run Job button to start the job. You can check its status by going back and selecting the job you created.
- After the job has run successfully, you should have a CSV file in S3 containing the data extracted with the DataDirect Salesforce JDBC driver.
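To confirm the output from the command line (the bucket and prefix here are hypothetical and should match whatever output path your script uses):

```shell
# Spark typically writes one or more part-*.csv files under the output prefix.
aws s3 ls s3://your-glue-assets/output/salesforce/ --recursive

# Download the results locally for inspection.
aws s3 cp s3://your-glue-assets/output/salesforce/ ./salesforce-csv/ --recursive
```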
You can use similar steps with any of the DataDirect suite of JDBC drivers, which are available for relational, big data, SaaS, and NoSQL data sources. Feel free to try any of our drivers with AWS Glue for your ETL jobs during a 15-day trial period.
Published at DZone with permission of Nishanth Kadiyala , DZone MVB. See the original article here.