
Continuously Encrypt Amazon Redshift Loads with S3, KMS, and Lambda


When building a new system, the urge is to make the magic happen, get it working, and win user appreciation as fast as possible. Security, however, demands that you encrypt continuously.



One of the main goals of this blog is to help developers and data architects like us with their Amazon Redshift operations. From a full comparison with Google BigQuery, to loading Google Analytics data into Redshift and analyzing it with Tableau, to tools such as PGProxy that can load balance queries across multiple Redshift clusters, these articles aim to extend your Redshift know-how and help you do your job better.

When building a new system, our first urge is to make the magic happen, get it working, and win user appreciation as fast as we can. Security is rarely the first consideration. As the system grows, though, and especially as the amount of data we store grows, it becomes abundantly clear that there is an asset that needs protecting.

This guide will help you continuously encrypt your data loads to Redshift and keep your information safe.

About Amazon KMS

Before getting into the step-by-step details, note that we are going to use one of Amazon's main encryption tools, the AWS Key Management Service (KMS). This service allows for the creation and management of encryption keys. It is based on the envelope encryption concept, so encrypting and decrypting involve two keys: a data key that encrypts the data, and a master key that encrypts the data key.
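To make the envelope concept concrete, here is a minimal pure-Python sketch. A toy SHA-256 keystream stands in for real AES (it is for illustration only, not actual cryptography; in practice KMS performs these operations server-side and the master key never leaves KMS):

```python
import hashlib
import secrets

def xor_stream(key, data):
    # Toy keystream cipher for illustration only -- NOT real encryption.
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

master_key = secrets.token_bytes(32)  # in KMS terms: the master key (never leaves KMS)
data_key = secrets.token_bytes(32)    # per-load data key
payload = b"student records"

ciphertext = xor_stream(data_key, payload)      # the data key encrypts the data
wrapped_key = xor_stream(master_key, data_key)  # the master key encrypts (wraps) the data key

# To decrypt: unwrap the data key with the master key, then decrypt the data.
plaintext = xor_stream(xor_stream(master_key, wrapped_key), ciphertext)
```

The wrapped data key can be stored safely next to the ciphertext, because only the holder of the master key can unwrap it.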

Using Amazon Unload/Copy Utility

This Amazon Redshift utility migrates data between Redshift clusters or databases. With it, we can move data from Redshift to S3, use KMS to encrypt the data, and return it to Redshift.

Image source: GitHub.

The utility is designed to run as a scheduled activity, for example from instances running an AWS Data Pipeline shell command activity.

1. Connect to Redshift

We use PyGreSQL (Python Module) to connect to our Redshift cluster. To install PyGreSQL on an EC2 Linux instance, run these commands:

sudo easy_install pip
sudo yum install postgresql postgresql-devel gcc python-devel
sudo pip install PyGreSQL
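Once PyGreSQL is installed, opening a connection might look like the sketch below. The endpoint and credentials are placeholders; `pgdb` is PyGreSQL's DB-API 2.0 module, which takes the port appended to the host string:

```python
def redshift_connect_args(endpoint, port, database, user, password):
    # pgdb expects the port as part of the host, in "hostname:port" form.
    return {
        "host": "%s:%s" % (endpoint, port),
        "database": database,
        "user": user,
        "password": password,
    }

def connect_to_redshift(endpoint, port, database, user, password):
    """Open a DB-API connection to the cluster via PyGreSQL."""
    import pgdb  # deferred so the helper above works without PyGreSQL installed
    return pgdb.connect(**redshift_connect_args(endpoint, port, database,
                                                user, password))

# Example arguments (placeholder cluster endpoint and password):
args = redshift_connect_args(
    "source-cluster.c6jhdm5beccty.us-west-2.redshift.amazonaws.com",
    5439, "studentdb", "redshift", "password123")
```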

2. Create Tables

Next, we need to create two identical tables, one in the source Redshift cluster and another in the destination Redshift cluster. In the source cluster, we will create a table named "source_student_data," and in the destination cluster, a table named "destination_student_data."

CREATE TABLE source_student_data
(
	ADM_DATE TIMESTAMP,
	STUDENT_ID INTEGER,
	STUDENT_COUNTRY VARCHAR(10),
	STUDENT_DIVISION VARCHAR(20),
	STUDENT_STATE_NAME CHAR(15),
	STUDENT_SEMESTER CHAR(8),
	STUDENT_AP CHAR(3)
)
DISTSTYLE EVEN
SORTKEY
(
	ADM_DATE
);

CREATE TABLE destination_student_data
(
	ADM_DATE TIMESTAMP,
	STUDENT_ID INTEGER,
	STUDENT_COUNTRY VARCHAR(10),
	STUDENT_DIVISION VARCHAR(20),
	STUDENT_STATE_NAME CHAR(15),
	STUDENT_SEMESTER CHAR(8),
	STUDENT_AP CHAR(3)
)
DISTSTYLE EVEN
SORTKEY
(
	ADM_DATE
);

3. Create a KMS Master Key

Because we are going to encrypt our Redshift data with KMS, we need to create a KMS master key. In this step, we will create it using the following script, named createkmskey.sh:

#!/bin/bash
keyArn=$(aws kms create-key --description UnloadCopyUtility --key-usage "ENCRYPT_DECRYPT" --region us-west-2 --query KeyMetadata.Arn)

keyId=$(echo $keyArn | cut -d/ -f2 | tr -d '"')

aws kms create-alias --target-key-id $keyId --alias-name "alias/UnloadCopyUtility" --region us-west-2

echo "Created new KMS Master Key $keyArn with alias/UnloadCopyUtility"

4. Encrypt Your Password

After creating the KMS master key, it's time to encrypt our password. We will write a bash script named encryptvalue.sh, which generates a Base64-encoded encrypted value that we will use in our configuration file in the next step. Create encryptvalue.sh with the following contents:

#!/bin/bash

key=alias/UnloadCopyUtility

aws kms encrypt --key-id "$key" --plaintext "$1" --region us-west-2 --query CiphertextBlob --output text

After creating the script, execute the following command with a password:

$ ./encryptvalue.sh password123

The output will resemble the following:

DqPLC+63KJ29llps+IZExFl5Ce47Qrg+ptqCnAHQFHY0fBKRAQEBAgB4lGPveByOeoYb7fiGRMRZeQnuO0K4PqbagpwB0BR2NHwAAABoMGYGCSqGSIb3DQEHBqBZMFcCAQAwUgYJKoZIhvcNAQcBMB4GCWCGSAFlAwQBLjARBAwcOR73wpqThnkYsHMCARCAJbci0vUsbM9iZm8S8fhkXhtk9vGCO5sLP+OdimgbnvyCE5QoD6k=

Now, we need to copy this value and insert it into our configuration file.

5. Create Configuration File for Redshift Info

We are going to use a Python script to automatically unload and copy our Redshift data, so we need to provide its connection details beforehand. In this step, we will create a configuration file named config.json, supplying the Redshift cluster endpoints, database names, AWS access keys, and S3 bucket, as well as the Base64-encoded password created in the previous step.

{
  // the source database from which we'll export data
  "unloadSource": {
    "clusterEndpoint": "source-cluster.c6jhdm5beccty.us-west-2.redshift.amazonaws.com",
    "clusterPort": 5439,
    // base 64 encoded password for the user to UNLOAD data as. Use the encryptvalue.sh utility to generate this string 
    "connectPwd": "DqPLC+63KJ29llps+IZExFl5Ce47Qrg+ptqCnAHQFHY0fBKRAQEBAgB4lGPveByOeoYb7fiGRMRZeQnuO0K4PqbagpwB0BR2NHwAAABoMGYGCSqGSIb3DQEHBqBZMFcCAQAwUgYJKoZIhvcNAQcBMB4GCWCGSAFlAwQBLjARBAwcOR73wpqThnkYsHMCARCAJbci0vUsbM9iZm8S8fhkXhtk9vGCO5sLP+OdimgbnvyCE5QoD6k=",
    "connectUser": "redshift",
    "db": "studentdb",
    "schemaName": "public",
    "tableName": "source_student_data"
  },
  // location and credentials for S3, which are used to store migrated data while in flight
  "s3Staging": {
    "aws_access_key_id": "AWZXXXXXXXXX",
    "aws_secret_access_key": "cJnwqxxxxxxxxxxx",
    // path on S3 to use for storing in-flight data. The current date and time is appended to the prefix
    "path": "s3://redshift-copy-cluster/prefix/",
    "deleteOnSuccess": "True",
    // region to use for the S3 export
    "region": "us-west-2"
  },
  // the destination database into which we will copy data
  "copyTarget": {
    "clusterEndpoint": "destination-cluster.c6jhdm5beccty.us-west-2.redshift.amazonaws.com",
    "clusterPort": 5439,
    // base 64 encoded password for the user to COPY data as. Use the encryptvalue.sh utility to generate this string
    "connectPwd": "DqPLC+63KJ29llps+IZExFl5Ce47Qrg+ptqCnAHQFHY0fBKRAQEBAgB4lGPveByOeoYb7fiGRMRZeQnuO0K4PqbagpwB0BR2NHwAAABoMGYGCSqGSIb3DQEHBqBZMFcCAQAwUgYJKoZIhvcNAQcBMB4GCWCGSAFlAwQBLjARBAwcOR73wpqThnkYsHMCARCAJbci0vUsbM9iZm8S8fhkXhtk9vGCO5sLP+OdimgbnvyCE5QoD6k=",
    "connectUser": "redshift",
    "db": "studentdb",
    "schemaName": "public",
    "tableName": "destination_student_data"
  }
}

6. Encrypt the Data

We are using a Python script named "redshift-unload-copy.py," which unloads the source data from Redshift, encrypts it with the KMS master key while uploading to S3, and finally copies the encrypted data from S3 into the destination Redshift cluster.

$ python redshift-unload-copy.py s3://redshift-copy-cluster/config.json us-west-2
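Under the hood, the utility issues Redshift UNLOAD and COPY statements with the ENCRYPTED option, passing the KMS-generated symmetric key through the master_symmetric_key credential. A simplified sketch of how such statements could be assembled (the real script also handles key generation, quoting, and cleanup):

```python
def build_credentials(access_key, secret_key, symmetric_key):
    # master_symmetric_key carries the base64 AES key Redshift uses for
    # client-side encryption/decryption of the S3 files.
    return ("aws_access_key_id=%s;aws_secret_access_key=%s;"
            "master_symmetric_key=%s" % (access_key, secret_key, symmetric_key))

def build_unload(schema, table, s3_path, creds):
    """UNLOAD writes encrypted, gzipped slices of the table to S3."""
    return ("UNLOAD ('SELECT * FROM %s.%s') TO '%s' "
            "CREDENTIALS '%s' ENCRYPTED GZIP ALLOWOVERWRITE"
            % (schema, table, s3_path, creds))

def build_copy(schema, table, s3_path, creds):
    """COPY decrypts the same files while loading them into the target table."""
    return ("COPY %s.%s FROM '%s' CREDENTIALS '%s' ENCRYPTED GZIP"
            % (schema, table, s3_path, creds))

creds = build_credentials("AWZXXXXXXXXX", "cJnwqxxxxxxxxxxx", "<base64 data key>")
unload_sql = build_unload("public", "source_student_data",
                          "s3://redshift-copy-cluster/prefix/", creds)
copy_sql = build_copy("public", "destination_student_data",
                      "s3://redshift-copy-cluster/prefix/", creds)
```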

To access the full script, click here.

Using AWS Lambda

The solution above builds an automatic pipeline that creates a KMS master key, uploads encrypted data to S3, and copies the encrypted data back into Redshift. Next, we will use AWS Lambda to continuously encrypt newly arriving data.

At the initial stage, Lambda receives an S3 notification. From the event, Lambda reads the bucket and the key, then builds the COPY command to run against the destination Redshift cluster. To keep the function idempotent, it verifies that the file has not already been copied before executing the COPY command.
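That flow might look like the following sketch. The event shape is the standard S3 notification; the loaded-keys set and the query runner are simplified stand-ins for the downloadable script's tracking and database logic, and the IAM role ARN is a placeholder:

```python
def bucket_and_key(event):
    """Pull the bucket name and object key out of an S3 put notification."""
    s3 = event["Records"][0]["s3"]
    return s3["bucket"]["name"], s3["object"]["key"]

def build_copy_command(table, bucket, key, iam_role):
    # Simplified COPY; the real script adds the encryption-related options.
    return ("COPY %s FROM 's3://%s/%s' IAM_ROLE '%s'"
            % (table, bucket, key, iam_role))

def handle(event, already_loaded, run_query):
    """Idempotent handler: skip files that were already copied."""
    bucket, key = bucket_and_key(event)
    if key in already_loaded:
        return "skipped"
    run_query(build_copy_command("destination_student_data", bucket, key,
                                 "arn:aws:iam::805124XXXXXX:role/redshift-copy"))
    already_loaded.add(key)
    return "copied"
```

Delivering the same notification twice then runs COPY only once, which is what makes the function safe to retrigger.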

To implement this solution, we need to do the following things:

First of all, download the source file from this location.

After downloading the ZIP file, unzip it and edit the parameters of the copy.py script as follows:

iam_role = "arn:aws:redshift:us-west-2:805124XXXXXX:cluster:destination-cluster"
db_database = "studentdb"
db_user = "redshift"
db_password = "XXXXXXX"
db_port = "5439"
db_host = "destination-cluster.c6jhdm5beccty.us-west-2.redshift.amazonaws.com"
query_bucket = "s3://redshift-copy-cluster"
query_prefix = "s3://redshift-copy-cluster/prefix"

Once the copy.py file is edited, ZIP the whole folder again.

Now, in the AWS console, go to AWS Lambda and create a function.


Click Create a Lambda Function to display the next page.


Select Blank Function to configure the triggers.


Click Next.


On this page, specify the Name, Description, and Runtime environment. For our use case, the Runtime environment will be Python 2.7. From the Code entry type drop-down list, select Upload a ZIP file and select the zip file we created earlier.


Provide the Handler and Role information (choosing an existing role), and click Next.


Click Create Function to create the function in Lambda.

Upon receiving any new encrypted data in the S3 bucket, the function runs the COPY query to load it into the destination Redshift cluster.

This solution can be used to replicate real-time data from one Redshift cluster to another. In that case, we need to configure an event trigger on the S3 bucket. For details on configuring an S3 event trigger, we can follow this link.

Final Note

Using our own KMS customer-managed keys allows us to protect the Amazon Redshift data and gives us full control over who can use these keys to access the cluster data. It is worth mentioning that S3 and KMS provide an easy way to encrypt data loads at little additional charge; KMS offers a free tier of up to 20,000 requests per month.

In the past, we had to meet with our security professionals and ask them to integrate security into our data repositories; in large and complex environments, we still do. Guidelines like these, however, enable us, the data architects, to ensure at least the bare minimum when running our data solutions in the cloud. I hope this solution helps you work in a safe and secure data environment.


Published at DZone with permission of Alon Brody. See the original article here.
