Build Real-Time Analytics Applications With AWS Kinesis and Amazon Redshift
Build real-time analytics with AWS Kinesis for streaming, AWS Lambda for processing, and Amazon Redshift for scalable data analysis.
Real-time analytics enables businesses to make immediate, data-driven decisions. Unlike traditional batch processing, real-time processing allows for faster insights, better customer experiences, and more responsive operations.
In this tutorial, you’ll learn how to build a real-time analytics pipeline using AWS Kinesis for streaming data and Amazon Redshift for querying and analyzing that data.
Prerequisites
Before you begin, ensure you have:
- An AWS account
- Basic knowledge of AWS services (Kinesis, Redshift, S3)
- AWS CLI installed
- IAM permissions for Kinesis, Redshift, Lambda, and S3
Step 1: Set Up AWS Kinesis for Real-Time Data Streaming
AWS Kinesis is a fully managed service that makes it easy to collect, process, and analyze real-time data streams. For our application, we'll use Kinesis Data Streams to ingest and process streaming data.
1. Create a Kinesis Stream
- Go to the AWS Management Console, search for Kinesis, and select Kinesis Data Streams.
- Click on Create stream.
- Provide a name for your stream (e.g., real-time-data-stream).
- Set the number of shards (a shard is the base throughput unit for a stream). Start with one shard and increase later if needed.
- Click Create stream.
This will create a Kinesis Data Stream that can start receiving real-time data.
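To decide when one shard stops being enough, it helps to estimate the write load. In provisioned mode, each shard accepts up to 1 MB or 1,000 records per second for writes. The sketch below turns that into a rough shard count; the function name `estimate_shards` and the traffic figures in the usage note are illustrative assumptions, not part of the original setup.

```python
import math

# Per-shard write limits for Kinesis Data Streams in provisioned mode.
MAX_RECORDS_PER_SHARD_PER_SEC = 1000
MAX_BYTES_PER_SHARD_PER_SEC = 1024 * 1024  # 1 MB, taken here as 1 MiB


def estimate_shards(records_per_sec, avg_record_bytes):
    """Return a rough shard count for the given write load (hypothetical helper)."""
    by_count = math.ceil(records_per_sec / MAX_RECORDS_PER_SHARD_PER_SEC)
    by_bytes = math.ceil(
        records_per_sec * avg_record_bytes / MAX_BYTES_PER_SHARD_PER_SEC
    )
    # A stream always needs at least one shard.
    return max(1, by_count, by_bytes)
```

For example, `estimate_shards(2500, 200)` is limited by record count, while `estimate_shards(500, 10_000)` is limited by bytes per second. Treat the result as a starting point and watch the stream's throttling metrics before resizing.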
2. Put Data into the Kinesis Stream
We’ll use a sample application that sends real-time data (like user activity logs) to the Kinesis stream. Below is an example Python script that uses Boto3, the AWS SDK for Python, to simulate a stream of real-time events.
import boto3
import json
import time

# Initialize Kinesis client
kinesis_client = boto3.client('kinesis', region_name='us-east-1')

# Data to simulate
data = {
    "user_id": 12345,
    "event": "page_view",
    "page": "home"
}

# Stream name
stream_name = 'real-time-data-stream'

# Put data into Kinesis Stream
while True:
    kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(data).encode('utf-8'),  # Kinesis stores the payload as bytes
        PartitionKey=str(data['user_id'])  # Records with the same key go to the same shard
    )
    time.sleep(1)  # Simulate real-time data ingestion
This script sends data to your stream every second. You can modify it to send different types of events or data.
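If you need to send more than one event per second, a batched variant may be a better fit. The sketch below uses `put_records`, which accepts up to 500 records per request. The helper names (`make_event`, `to_entries`, `chunked`, `send_events`) are hypothetical, and the client is passed in so the batching logic can be exercised without AWS access.

```python
import json

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def make_event(user_id, event, page):
    """Build one event with the same fields as the sample payload."""
    return {"user_id": user_id, "event": event, "page": page}


def to_entries(events):
    """Convert events into the entry format that put_records expects."""
    return [
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": str(e["user_id"])}
        for e in events
    ]


def chunked(items, size=MAX_RECORDS_PER_CALL):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_events(client, stream_name, events):
    """Send events in batches and return how many records Kinesis rejected."""
    failed = 0
    for batch in chunked(to_entries(events)):
        response = client.put_records(StreamName=stream_name, Records=batch)
        failed += response.get("FailedRecordCount", 0)
    return failed
```

With a real client, this would be called as `send_events(boto3.client('kinesis', region_name='us-east-1'), 'real-time-data-stream', events)`. A nonzero return value means some records were throttled or rejected and should be retried.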
Step 2: Process Data in Real-Time Using AWS Lambda
Once data is in Kinesis, you can process it using AWS Lambda, a serverless compute service. Lambda can be triggered whenever new data is available in the stream.
1. Create a Lambda Function to Process Stream Data
- In the Lambda Console, click Create function.
- Choose Author from Scratch, name your function (e.g., ProcessKinesisData), and choose the Python runtime.
- Set the role to allow Lambda to access Kinesis and other services.
- Click Create function.
2. Add Kinesis as a Trigger
- In the Lambda function page, scroll to the Function overview section.
- Under Designer, click Add Trigger.
- Choose Kinesis as the trigger source.
- Select the stream you created earlier (real-time-data-stream).
- Set the batch size (e.g., 100 records).
- Click Add.
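The same trigger can also be created from code. The sketch below wraps Boto3's `create_event_source_mapping` call; the function name `add_kinesis_trigger`, the `LATEST` starting position, and the stream ARN you pass in are assumptions for illustration.

```python
def add_kinesis_trigger(lambda_client, stream_arn,
                        function_name="ProcessKinesisData", batch_size=100):
    """Attach a Kinesis stream to a Lambda function and return the mapping UUID."""
    response = lambda_client.create_event_source_mapping(
        EventSourceArn=stream_arn,
        FunctionName=function_name,
        BatchSize=batch_size,
        # Assumption: only read records that arrive after the trigger is created.
        # Use "TRIM_HORIZON" to start from the oldest record still in the stream.
        StartingPosition="LATEST",
    )
    return response["UUID"]
```

With a real client, `lambda_client` would be `boto3.client('lambda')` and `stream_arn` the ARN shown on the stream's details page in the Kinesis console.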
3. Lambda Function Code
Here is a simple Lambda function that processes data from Kinesis and stores the results in Amazon S3 (as a staging area before loading into Redshift):
import base64
import json
import boto3

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    for record in event['Records']:
        # Kinesis delivers the record data Base64-encoded, so decode it before parsing
        payload = json.loads(base64.b64decode(record['kinesis']['data']))

        # Process the payload (for now, simply logging)
        print(f"Processing record: {payload}")

        # Store processed data into S3 (for later loading into Redshift).
        # The sequence number keeps keys unique, so events from the same user
        # do not overwrite each other.
        s3_client.put_object(
            Bucket='your-s3-bucket',
            Key=f"processed/{payload['user_id']}-{record['kinesis']['sequenceNumber']}.json",
            Body=json.dumps(payload)
        )
This function takes records from Kinesis, decodes the data, processes it, and stores it in an S3 bucket.
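Lambda receives each Kinesis record with its data Base64-encoded, so the payload must be decoded before it can be parsed as JSON. The sketch below builds a minimal sample event and decodes it, which is handy for checking the decoding step locally. The helper names are hypothetical, and the event contains only the fields used here, not the full set Lambda provides.

```python
import base64
import json


def make_kinesis_event(payloads):
    """Build a minimal event shaped like the one Lambda receives from Kinesis."""
    return {
        "Records": [
            {
                "kinesis": {
                    "partitionKey": str(p["user_id"]),
                    "sequenceNumber": str(i),
                    # Kinesis hands the data to Lambda as a Base64 string.
                    "data": base64.b64encode(
                        json.dumps(p).encode("utf-8")
                    ).decode("ascii"),
                }
            }
            for i, p in enumerate(payloads)
        ]
    }


def decode_records(event):
    """Decode every record in the event back into a Python dict."""
    return [
        json.loads(base64.b64decode(record["kinesis"]["data"]))
        for record in event["Records"]
    ]
```

Passing the output of `make_kinesis_event` to a handler is a quick way to exercise it before deploying, as long as the S3 client is stubbed or pointed at a test bucket.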
Step 3: Load Processed Data into Amazon Redshift
Amazon Redshift is a fully managed data warehouse service that allows you to analyze large datasets quickly. After processing the real-time data in Lambda, we can load it into Redshift for analysis.
1. Set Up Amazon Redshift Cluster
- Go to the Amazon Redshift Console, and click Create cluster.
- Provide a name, node type, and the number of nodes.
- Under Database configurations, set up a database and user.
- Click Create cluster.
2. Create Redshift Tables
Connect to your Redshift cluster using SQL client tools like SQL Workbench/J or Aginity. Create tables that match the structure of your incoming data.
CREATE TABLE user_activity (
    user_id INT,
    event VARCHAR(50),
    page VARCHAR(100),
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
3. Set Up Data Loading from S3 to Redshift
Once your Lambda function stores data in S3, you can load it into Redshift using the COPY command. Ensure that your Redshift cluster can access S3 by creating an IAM role and attaching it to Redshift.
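Use the COPY command to load data from S3 into Redshift. Listing the columns explicitly lets the timestamp column, which has no matching field in the JSON files, fall back to its DEFAULT value:
COPY user_activity (user_id, event, page)
FROM 's3://your-s3-bucket/processed/'
IAM_ROLE 'arn:aws:iam::your-account-id:role/your-redshift-role'
JSON 'auto';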
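To run the load on a schedule instead of by hand, one option is the Redshift Data API, which Boto3 exposes as the `redshift-data` client. The sketch below builds the COPY statement and submits it with `execute_statement`. The function names are hypothetical, and the cluster identifier, database name, and database user are values you would supply from your own cluster.

```python
def build_copy_sql(table, s3_prefix, iam_role_arn,
                   columns=("user_id", "event", "page")):
    """Build a COPY statement that loads JSON files from an S3 prefix."""
    column_list = ", ".join(columns)
    return (
        f"COPY {table} ({column_list})\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "JSON 'auto';"
    )


def run_copy(data_client, cluster_id, database, db_user, sql):
    """Submit the statement through the Redshift Data API and return its ID."""
    response = data_client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    return response["Id"]
```

With a real client, `data_client` would be `boto3.client('redshift-data')`. The call is asynchronous: it returns as soon as the statement is accepted, so check the returned ID with `describe_statement` to confirm the load finished.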
Step 4: Analyze Real-Time Data in Redshift
Now that the data is loaded into Redshift, you can run SQL queries to analyze it. For example:
SELECT page, COUNT(*) AS views
FROM user_activity
GROUP BY page
ORDER BY views DESC;
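This query returns the number of views for each page, most popular first. The results reflect whatever has been loaded by the most recent COPY, so they are only as fresh as your load schedule.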
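To make the aggregation concrete, here is a small Python sketch that computes the same page-view ranking over a handful of in-memory rows. It only illustrates what the query does; the function name `page_view_counts` and the sample rows in the usage note are made up for the example.

```python
from collections import Counter


def page_view_counts(rows):
    """Return (page, views) pairs sorted by views, highest first."""
    return Counter(row["page"] for row in rows).most_common()
```

Given three rows for `home` and one for `pricing`, the function returns `home` first with a count of 3, followed by `pricing` with a count of 1, which mirrors the GROUP BY and ORDER BY in the SQL.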
Conclusion
In this tutorial, we’ve walked through how to build a real-time analytics application using AWS Kinesis for data streaming and Amazon Redshift for scalable data analytics. We used AWS Lambda to process streaming data and store it temporarily in Amazon S3, before loading it into Redshift for analysis.
This architecture is highly scalable and efficient for handling large volumes of real-time data, making it ideal for applications such as monitoring systems, user behavior analysis, and financial transactions.
With AWS’s serverless services, you can create cost-effective, highly available, and low-maintenance real-time analytics solutions that help you stay ahead of the competition.