
Use AWS Glue To Migrate RDS Data To Amazon Redshift


For those seeking to migrate their databases to Amazon Redshift, this tutorial provides steps on how to do so using AWS Glue.


Moving large amounts of data is always a cumbersome task, especially when adjustments have to be made along the way. However, sticking with a more traditional way of storing data or running database services isn’t efficient either.

If you are building a data lake, for example, moving from Amazon RDS to Amazon Redshift is a logical decision. Redshift integrates well with other AWS services and is a fully managed, petabyte-scale data warehouse service in the cloud, so it is well suited to handling large volumes of structured and semi-structured data. Regardless of the size of the data set, Amazon Redshift delivers fast query performance to SQL-based tools and business intelligence applications.

One of the tools that helps you take full advantage of the data warehouse platform is AWS Glue, which you can use to migrate your data from RDS to Redshift. AWS Glue is a fully managed ETL (extract, transform, and load) service for moving and transforming data between your data stores.

NOTE: AWS Glue can read data from and write data to AWS data stores such as Amazon S3, Amazon RDS, and Amazon Redshift.


AWS Glue: Copy And Unload

Moving data to and from Amazon Redshift is best done using AWS Glue. Glue is an ETL service that can also perform data enrichment and migration with predetermined parameters, which means you can do more than copy data from RDS to Redshift in its original structure.

For example, Glue supports the FindMatches ML transform and runs on Apache Spark, so you can run multiple Spark ETL jobs efficiently. You also get job bookmarks, which track the data a job has already processed and can be rewound if you need to reprocess from an earlier point.
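
As a rough sketch of how this looks outside the console, the boto3 call below defines a Glue job with bookmarks enabled. The job name, role ARN, script location, and Glue version are placeholders for illustration, not values taken from this tutorial.

```python
# Hypothetical sketch: defining a Glue job with job bookmarks enabled via boto3.
# The job name, role ARN, and script location are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_job(
    Name="rds-to-redshift-job",                                   # placeholder job name
    Role="arn:aws:iam::123456789012:role/GlueRdsRedshiftRole",    # placeholder role ARN
    Command={
        "Name": "glueetl",                                        # Spark ETL job type
        "ScriptLocation": "s3://my-glue-scripts/rds_to_redshift.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Enable job bookmarks so reruns skip already-processed data;
        # "job-bookmark-disable" and "job-bookmark-pause" are the other options.
        "--job-bookmark-option": "job-bookmark-enable",
    },
    GlueVersion="3.0",
)

# To rewind/clear the bookmark before a rerun:
glue.reset_job_bookmark(JobName="rds-to-redshift-job")
```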

The first step is to establish the ETL runtime for extracting the data stored in Amazon RDS. Configure AWS Glue crawlers to scan your RDS database directly, and Glue will build a data catalog of the discovered schemas for further processing.

To do this, go to AWS Glue and add a new connection to your RDS database. While you are at it, you can configure the connection from Glue to Redshift from the same interface. The next step is creating the data catalog; you need catalog entries for both RDS and Redshift.

Adding a crawler is a matter of identifying the schemas for data copying and unloading, although you have to make sure that the crawlers have sufficient access to read them. Assign that access to the crawlers using an IAM role.

Creating Migration Jobs

Glue Jobs are actionable runtimes that perform specific tasks. When you create a Glue Job, you define how data needs to be gathered, processed and transferred. This is the core of Glue ETL; it does the extraction, transformation, and loading of data.

You can use Python or Scala as your ETL language. When configuring Jobs, however, you want to be specific with your data mapping, including the data types for each column. You can add security configurations, additional scripts or job parameters as needed.

Glue generates the code for the process and displays a diagram of how the data flows. If you are happy with the result, you can execute the code to start migrating data from RDS to Redshift, and you can query the data in Redshift as soon as the process completes.

For larger data sets, be careful with how you define the columns you’re migrating. Be specific and make sure you define the type of each column, as sketched below, to avoid unnecessary errors and problems during the process.
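
To make the mapping step concrete, here is a trimmed-down sketch of the kind of PySpark script Glue produces for a job like this. The catalog database, table, and connection names, the column list, and the S3 staging path are all placeholders for illustration, not the values Glue would actually generate for your data.

```python
# Sketch of a Glue ETL script: extract from the RDS catalog table, map column
# types explicitly, and load into Redshift through a Glue connection.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read the RDS table through the Data Catalog entry built by the crawler.
source = glue_context.create_dynamic_frame.from_catalog(
    database="rds_catalog_db",        # placeholder catalog database
    table_name="mydb_customers",      # placeholder catalog table
    transformation_ctx="source",
)

# Transform: map source columns to target columns with explicit types.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("id", "int", "id", "int"),
        ("name", "string", "name", "string"),
        ("created_at", "timestamp", "created_at", "timestamp"),
    ],
    transformation_ctx="mapped",
)

# Load: write to Redshift through the Glue connection, staging via S3.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-connection",            # placeholder connection name
    connection_options={"dbtable": "customers", "database": "dev"},
    redshift_tmp_dir="s3://my-temp-bucket/redshift-staging/",
    transformation_ctx="sink",
)

job.commit()
```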

Steps To Move Data From RDS To Redshift Using AWS Glue

Create A Database In Amazon RDS

Create an RDS database and access it to create tables.

Create tables in the database; a sample table definition is sketched below.

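The original screenshot of the table definitions is not reproduced here, so the following is a hypothetical example that creates a simple customers table over PyMySQL; the endpoint, credentials, and schema are placeholders.

```python
# Hypothetical example: create a sample table in the RDS MySQL database.
import pymysql

conn = pymysql.connect(
    host="mydb.xxxxxxxx.us-east-1.rds.amazonaws.com",  # placeholder RDS endpoint
    user="admin",
    password="your-password",
    database="mydb",
)

create_table_sql = """
CREATE TABLE IF NOT EXISTS customers (
    id          INT PRIMARY KEY,
    name        VARCHAR(100),
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""

with conn.cursor() as cursor:
    cursor.execute(create_table_sql)
conn.commit()
conn.close()
```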

Creating A Cluster In Amazon Redshift

Create a Redshift cluster.

Create a table in the Redshift cluster that matches the source table (a sketch follows the query editor step below).


Next, launch your Amazon Redshift cluster with the Quick launcher and add details like cluster name, database name, username, and password.

In the query editor, specify the database details and run your queries.
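
As an alternative to the console query editor, the sketch below creates the matching target table through the Redshift Data API. The cluster identifier, database, user, and table schema are placeholders consistent with the earlier example.

```python
# Hypothetical sketch: create the target table via the Redshift Data API.
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

rsd.execute_statement(
    ClusterIdentifier="my-redshift-cluster",   # placeholder cluster name
    Database="dev",                            # placeholder database
    DbUser="awsuser",                          # placeholder user
    Sql="""
        CREATE TABLE IF NOT EXISTS customers (
            id          INTEGER,
            name        VARCHAR(100),
            created_at  TIMESTAMP
        )
    """,
)
```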

Create An IAM Role

Create a role in IAM that allows AWS Glue to access your RDS database, Redshift warehouse, and S3 storage.

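A minimal sketch of the same role created with boto3 is shown below. The role name is a placeholder, and the attached managed policies are deliberately broad for a walkthrough; scope them down for production use.

```python
# Sketch: create a Glue service role and attach managed policies (placeholders).
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},   # Glue assumes this role
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="GlueRdsRedshiftRole",   # placeholder role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

for policy_arn in [
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/AmazonRDSFullAccess",
    "arn:aws:iam::aws:policy/AmazonRedshiftFullAccess",
]:
    iam.attach_role_policy(RoleName="GlueRdsRedshiftRole", PolicyArn=policy_arn)
```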

Setting Up The Connections And Jobs In AWS Glue

In AWS Glue, create connections for your RDS database and your Redshift cluster.


Specify the JDBC URL as shown in Redshift, enter the user name and password of your MySQL RDS DB, and choose your VPC and subnets.
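
For reference, the sketch below registers both JDBC connections with boto3 instead of the console wizard. The connection names, URLs, credentials, subnet, and security group IDs are placeholders; substitute your own endpoints.

```python
# Hypothetical sketch: register the RDS and Redshift JDBC connections in Glue.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

def add_jdbc_connection(name, jdbc_url, username, password):
    """Create a Glue JDBC connection inside the VPC (all values are placeholders)."""
    glue.create_connection(
        ConnectionInput={
            "Name": name,
            "ConnectionType": "JDBC",
            "ConnectionProperties": {
                "JDBC_CONNECTION_URL": jdbc_url,
                "USERNAME": username,
                "PASSWORD": password,
            },
            "PhysicalConnectionRequirements": {
                "SubnetId": "subnet-0123456789abcdef0",           # placeholder
                "SecurityGroupIdList": ["sg-0123456789abcdef0"],  # placeholder
                "AvailabilityZone": "us-east-1a",
            },
        }
    )

# RDS (MySQL) connection
add_jdbc_connection(
    "rds-mysql-connection",
    "jdbc:mysql://mydb.xxxxxxxx.us-east-1.rds.amazonaws.com:3306/mydb",
    "admin",
    "your-password",
)

# Redshift connection
add_jdbc_connection(
    "redshift-connection",
    "jdbc:redshift://my-redshift-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev",
    "awsuser",
    "your-password",
)
```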

Next, go to the Redshift console and click on your cluster to view its details.


Review and finish the setup.

Add the connection details for Redshift into Glue using the same approach.


Once the connection has been created, run a test connection using the IAM role that you created earlier.


Note: While creating the connection details for RDS, you may get a "found error for connection" message; follow the AWS Glue connection troubleshooting documentation to clear the error.

Once your connection details are created, create a data catalog (crawler) for your RDS database. Crawlers work with any supported database type: MySQL, MS SQL Server, MariaDB, Aurora, PostgreSQL, or Oracle DB.

Add info about crawler
Specify crawler source type

Here "include path" indicates the database name of your RDS DB.

Add data store
Choose IAM Role
Configure crawler's output
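
The same crawler setup, sketched with boto3, looks roughly like the snippet below. The crawler, catalog database, connection, and role names are placeholders carried over from the earlier sketches, and the Path value is the include path.

```python
# Hypothetical sketch: create and start a crawler for the RDS database.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="rds-mysql-crawler",                # placeholder crawler name
    Role="GlueRdsRedshiftRole",              # IAM role created earlier (placeholder)
    DatabaseName="rds_catalog_db",           # catalog database the crawler populates
    Targets={
        "JdbcTargets": [{
            "ConnectionName": "rds-mysql-connection",
            "Path": "mydb/%",                # include path: RDS database name / all tables
        }]
    },
)

glue.start_crawler(Name="rds-mysql-crawler")
```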



Review the details and finish creating the crawler; it then appears in the Crawlers list.

Similarly, create a data catalog (crawler) for Redshift.

Once both the data catalog and the data connections are ready, run the crawlers for RDS and Redshift.

After the crawlers run, the discovered database tables and their details appear in the Tables dashboard.

With the data catalog (crawlers), data connections, and tables in place, you can now create a job to export all of the data from your RDS DB to Redshift. Specify the job name and IAM role in AWS Glue, then review, finish, and run it.
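
If you prefer to trigger and monitor the job from code rather than the console, a sketch along these lines works; the job name is the placeholder used earlier.

```python
# Hypothetical sketch: start the migration job and poll its status.
import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")

run_id = glue.start_job_run(JobName="rds-to-redshift-job")["JobRunId"]

while True:
    state = glue.get_job_run(JobName="rds-to-redshift-job", RunId=run_id)["JobRun"]["JobRunState"]
    print("Job state:", state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
```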



Once the mapping is completed, Glue generates the job script and a diagram of the flow. When execution completes, you can view the output log and then check the data in Redshift.
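
A quick sanity check of the migrated data, again through the Redshift Data API, might look like the hypothetical snippet below; the cluster, database, user, and table names are the placeholders used earlier.

```python
# Hypothetical sketch: count rows in the target table after the job finishes.
import time
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

stmt_id = rsd.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT COUNT(*) FROM customers",
)["Id"]

# Wait for the statement to finish, then fetch the row count.
while rsd.describe_statement(Id=stmt_id)["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(2)

result = rsd.get_statement_result(Id=stmt_id)
print("Rows migrated:", result["Records"][0][0]["longValue"])
```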

Benefits Of Using Glue For The Migration

There are a lot of benefits to be gained from using Redshift. The scalable nature of the data warehouse platform is a clear advantage if you want to start collecting large volumes of data for business purposes. Here, however, we should acknowledge the benefits of using AWS Glue for the migration itself.

AWS Glue turns large data migrations into a manageable task. There is no need to spend a fortune on data transfers or worry about a drawn-out migration process, and you retain complete control over the process throughout.

If you need the prowess of Redshift to manage petabytes of data, now is the perfect time to start using the service. Migrate your data by following the steps we discussed in this article and you can run fast, detailed queries on Redshift immediately.
