How I Supercharged My GenAI App and Saved My Sanity
Finding a better way to handle real-time data for my GenAI app pushed me to rethink the tools I use—and that decision paid off in ways I never expected.
Ever feel like working with Redshift is like being stuck in a never-ending queue at Starbucks during morning rush hour? Everyone's waiting, getting cranky, and you're hoping that your simplest request gets processed before your next birthday!
After spending years working with various data warehousing solutions, I recently made a rather unconventional choice that I am excited to share. I switched from Amazon Redshift to DuckDB for my real-time GenAI application, and the results were eye-opening.
The Problem With Redshift
The Concurrency Nightmare
The concurrency limitation turned into a real nightmare, and here's why:
- Default concurrency limit is 50 connections (500 with large nodes).
- Each user session requires a dedicated connection.
- Connection pooling helps but adds complexity.
- Timeout issues when connections weren’t properly released.
# Common scenario with Redshift
import psycopg2
from psycopg2 import pool

try:
    # Creating connection pool
    connection_pool = pool.SimpleConnectionPool(
        1, 50,  # min and max connections
        host="myhost",
        database="mydb",
        user="myuser",
        password="mypassword"
    )
except psycopg2.Error as e:
    print("Error: Could not connect to Redshift")
    print(e)

# Even with connection pooling, we'd still hit limits during peak usage
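For contrast, here is a minimal sketch of the DuckDB side of the same setup (the file name is illustrative): there is simply no pool to manage, because the database lives inside your process.

import duckdb

# One in-process database handle; no network hop, no pool to exhaust
conn = duckdb.connect('my_database.db')

# For multi-threaded request handlers, give each worker its own cursor;
# cursors share the same in-process database instance
worker = conn.cursor()
print(worker.execute("SELECT 42").fetchone())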
Query Performance Deep Dive
The performance difference between Redshift and DuckDB was stark. Here is a real example.
-- This query in Redshift took ~35 seconds
WITH user_features AS (
SELECT
user_id,
COUNT(DISTINCT session_id) as session_count,
AVG(session_duration) as avg_duration,
SUM(interaction_count) as total_interactions
FROM user_sessions
WHERE date_trunc('day', timestamp) >= CURRENT_DATE - 30
GROUP BY user_id
),
recent_interactions AS (
SELECT
user_id,
ARRAY_AGG(item_id ORDER BY timestamp DESC LIMIT 10) as recent_items
FROM interactions
WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
GROUP BY user_id
)
SELECT
uf.*,
ri.recent_items
FROM user_features uf
LEFT JOIN recent_interactions ri ON uf.user_id = ri.user_id;
The same query in DuckDB:
# DuckDB completed in ~4 seconds
import duckdb
conn = duckdb.connect('my_database.db')
result = conn.execute("""
WITH user_features AS (
SELECT
user_id,
COUNT(DISTINCT session_id) as session_count,
AVG(session_duration) as avg_duration,
SUM(interaction_count) as total_interactions
FROM user_sessions
WHERE timestamp >= CURRENT_DATE - 30
GROUP BY user_id
),
recent_interactions AS (
SELECT
user_id,
LIST(item_id ORDER BY timestamp DESC LIMIT 10) as recent_items
FROM interactions
WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
GROUP BY user_id
)
SELECT
uf.*,
ri.recent_items
FROM user_features uf
LEFT JOIN recent_interactions ri ON uf.user_id = ri.user_id;
""").fetchdf()
Why Such A Dramatic Difference?
Architecture
- Redshift: Distributed system with network overhead.
- DuckDB: In-process database with direct memory access.
Query Execution
- Redshift: Requires data distribution across nodes.
- DuckDB: Vectorized execution on local data.
Memory Management
- Redshift: Manages shared memory across multiple connections.
- DuckDB: Direct access to process memory.
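If you want to verify the in-process effect on your own data, a quick-and-dirty timing harness is enough. This is only a sketch, and it assumes a local my_database.db containing the user_sessions table from earlier:

import time
import duckdb

conn = duckdb.connect('my_database.db')

start = time.perf_counter()
rows = conn.execute("""
    SELECT user_id, COUNT(*) AS events
    FROM user_sessions
    GROUP BY user_id
""").fetchall()
print(f"{len(rows)} rows in {time.perf_counter() - start:.2f}s")  # no network round trip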
Why DuckDB Won My Heart
Embedded Analytics Excellence
You can use literally any Python data library alongside DuckDB, which makes life so much simpler. Check out the example below:
import duckdb
import pandas as pd
import pyarrow as pa

# Create an in-memory database
# Look how simple it is to set up DuckDB
conn = duckdb.connect(':memory:')

# Load data from multiple sources
conn.execute("""
    CREATE TABLE user_data AS
    SELECT * FROM read_parquet('users/*.parquet')
""")

# Complex analytics with zero network overhead
result = conn.execute("""
    SELECT
        DATE_TRUNC('hour', timestamp) as hour,
        COUNT(*) as events,
        COUNT(DISTINCT user_id) as unique_users,
        AVG(response_time) as avg_response
    FROM user_data
    WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    GROUP BY 1
    ORDER BY 1
""").fetchdf()

# Direct integration with pandas/arrow
arrow_table = pa.Table.from_pandas(result)
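It also works the other way around: because DuckDB runs inside your Python process, it can query a pandas DataFrame in place. A small sketch with a made-up DataFrame:

import duckdb
import pandas as pd

# A tiny made-up DataFrame, just for illustration
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "response_time": [120, 80, 200],
})

conn = duckdb.connect(':memory:')
conn.register("events", events)  # expose the DataFrame to SQL as a view
slow_users = conn.execute("""
    SELECT user_id, AVG(response_time) AS avg_response
    FROM events
    GROUP BY user_id
    HAVING AVG(response_time) > 100
""").fetchdf()
print(slow_users)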
Memory Efficiency and Data Processing
Its memory management is genuinely impressive. For my application, the data files were stored in Parquet format on S3, and boy, the speed at which it chewed through them all! Here is a general example.
# Efficient handling of large CSV files
conn.execute("""
    CREATE TABLE large_dataset AS
    SELECT * FROM read_csv_auto('large_file.csv',
        SAMPLE_SIZE=1000000,
        ALL_VARCHAR=0)
""")

# Automatic type inference and compression
conn.execute("""
    SELECT
        typeof(column_name),
        approx_count_distinct(column_name) as cardinality,
        pg_size_pretty(sum(length(column_name))) as size
    FROM large_dataset
    GROUP BY 1
""")
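For Parquet in particular, DuckDB pushes column projection and filters down into the files, which is a big part of why those S3 scans felt so fast. A hedged sketch with an illustrative local path and column names:

# Only the referenced columns are read, and the date filter prunes row groups,
# so scanning a large partitioned dataset stays cheap
daily_counts = conn.execute("""
    SELECT user_id, COUNT(*) AS events
    FROM read_parquet('data/events/*.parquet')
    WHERE event_date >= CURRENT_DATE - INTERVAL '7 days'
    GROUP BY user_id
""").fetchdf()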
Advanced Analytics Integration
DuckDB ships with all the analytics functions we are so dependent on when working with conventional databases.
# Complex window functions
conn.execute("""
    WITH user_metrics AS (
        SELECT
            user_id,
            timestamp,
            metric_value,
            AVG(metric_value) OVER (
                PARTITION BY user_id
                ORDER BY timestamp
                ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
            ) as moving_avg
        FROM metrics
    )
    SELECT * FROM user_metrics
    WHERE moving_avg > 100
""")

# Machine learning feature generation
features = conn.execute("""
    SELECT
        user_id,
        COUNT(*) as event_count,
        COUNT(DISTINCT session_id) as session_count,
        MIN(timestamp) as first_seen,
        MAX(timestamp) as last_seen,
        AVG(duration) as avg_duration,
        ARRAY_AGG(DISTINCT category) as categories
    FROM events
    GROUP BY user_id
""").df()
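And since the result lands as a plain pandas DataFrame, turning it into a feature matrix for whatever model sits downstream is a one-liner (the column selection here is just an illustration):

# `features` is an ordinary pandas DataFrame at this point
numeric_cols = ["event_count", "session_count", "avg_duration"]
X = features[numeric_cols].fillna(0).to_numpy()  # ready for any ML library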
When To Stick With Redshift (Because Let’s Be Real)
It’s perfect for:
Enterprise-scale data warehousing
- Handles petabyte-scale data with ease.
- Efficient for complex JOIN operations across large tables.
- Better for data spanning multiple regions/zones.
Complex ETL Workflows
- Native integration with AWS services.
- Scheduled maintenance and backups.
- Automated scaling and optimization.
- Supports complex ETL patterns.
When your security team won't let you breathe without audit logs
- Fine-grained access control.
- Row-level security.
- Audit logging and compliance reporting.
- Integration with AWS security services.
Power-User Tips: Supercharging DuckDB
Essential Extensions and Configurations
Before we wrap up, let me share some power-user configurations that took my DuckDB implementation from great to incredible.
import duckdb
# Basic connection with some tuned settings
conn = duckdb.connect('my_db.db', config={
'temp_directory': '/path/to/temp', # Custom temp directory
'memory_limit': '8GB', # Adjust based on your needs
'threads': 12 # Parallel processing threads
})
# Load essential extensions
conn.execute("INSTALL httpfs;") # HTTP file system support
conn.execute("LOAD httpfs;")
conn.execute("INSTALL json;") # Enhanced JSON support
conn.execute("LOAD json;")
conn.execute("INSTALL sqlite;") # SQLite scanner
conn.execute("LOAD sqlite;")
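With those extensions loaded you can, for example, ingest JSON or read an existing SQLite database in place. The file and table names below are purely illustrative:

# JSON extension: infer the schema straight from newline-delimited JSON
conn.execute("""
    CREATE TABLE raw_events AS
    SELECT * FROM read_json_auto('events.jsonl')
""")

# SQLite scanner: query a legacy SQLite file without exporting it first
legacy_users = conn.execute("""
    SELECT * FROM sqlite_scan('legacy_app.db', 'users')
""").fetchdf()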
Working With Remote Data
One of the coolest features is direct access to remote files. Here’s how I handle it.
# Reading directly from S3
conn.execute("""
SET s3_region='us-east-1';
SET s3_access_key_id='your_access_key';
SET s3_secret_access_key='your_secret_key';
""")
# Now you can query S3 data directly!
conn.execute("""
SELECT *
FROM read_parquet('s3://my-bucket/data/*.parquet')
WHERE date_column >= '2024-01-01'
""")
# Or even better, create a view
conn.execute("""
CREATE VIEW recent_data AS
SELECT *
FROM read_parquet('s3://my-bucket/data/*.parquet')
WHERE date_column >= CURRENT_DATE - INTERVAL '30 days'
""")
Memory Management Like A Pro
import os

# Configure adaptive memory management
def configure_memory_settings(conn, total_gb=8):
    settings = {
        'memory_limit': f'{total_gb}GB',
        'threads': max(4, os.cpu_count() - 2),  # Leave some CPU for other processes
        'temp_directory': '/path/to/fast/ssd',  # Use fast storage for spilling
    }
    for key, value in settings.items():
        conn.execute(f"SET {key}='{value}'")

# Example usage with streaming results
def process_large_dataset(conn):
    # Stream the result in chunks instead of materializing everything at once
    conn.execute("""
        SELECT *
        FROM huge_table
        WHERE complex_condition
    """)
    while True:
        chunk = conn.fetch_df_chunk()  # one batch of rows as a pandas DataFrame
        if chunk.empty:
            break
        process_chunk(chunk)
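Wiring the two helpers together looks roughly like this (process_chunk stands in for whatever your pipeline does with each batch):

conn = duckdb.connect('my_db.db')
configure_memory_settings(conn, total_gb=16)  # size to the machine you're running on
process_large_dataset(conn)                   # streams the result batch by batch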
Troubleshooting Tips
When things don't go as planned:
# Memory debugging
conn.execute("SELECT * FROM duckdb_memory();")
# Query profiling
conn.execute("""
EXPLAIN ANALYZE
SELECT complex_query_here;
""")
# Check extension status
conn.execute("SELECT * FROM duckdb_extensions();")
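If you'd rather keep the profiles around, DuckDB can also write them out to a file; a small sketch (the output path is illustrative):

# Persist query profiles to a file instead of reading them off the console
conn.execute("PRAGMA enable_profiling='json';")
conn.execute("PRAGMA profiling_output='/tmp/duckdb_profile.json';")
conn.execute("SELECT count(*) FROM huge_table;")  # the profile lands in the file above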
The Verdict
For my GenAI application, DuckDB proved to be a perfect fit, thanks to the combination of:
- Lightning-fast query performance.
- Zero concurrency issues.
- Excellent Python integration.
- Simplified infrastructure.
But would I use it for everything? Absolutely not. It only makes sense for applications that need real-time processing, have a moderate data size, benefit from direct integration with Python/pandas, and come with low-latency requirements.
Choosing the right tool isn't about following trends. It's about understanding your specific requirements and constraints. Sometimes, simpler is better. What's your experience with DuckDB and Redshift? Drop your thoughts in the comments below.