How I Supercharged My GenAI App and Saved My Sanity
Finding a better way to handle real-time data for my GenAI app pushed me to rethink the tools I use—and that decision paid off in ways I never expected.
Ever feel like working with Redshift is like being stuck in a never-ending queue at Starbucks during morning rush hour? Everyone's waiting, getting cranky, and you're hoping that your simplest request gets processed before your next birthday!
After spending years working with various data warehousing solutions, I recently made a rather unconventional choice that I am excited to share. I switched from Amazon Redshift to DuckDB for my real-time GenAI application, and the results were eye-opening.
The Problem With Redshift
The Concurrency Nightmare
The concurrency limitation turned into a real nightmare, and here's why:
- Default concurrency limit is 50 connections (500 with large nodes).
- Each user session requires a dedicated connection.
- Connection pooling helps but adds complexity.
- Timeout issues when connections weren’t properly released.
# Common scenario with Redshift
import psycopg2
from psycopg2 import pool

try:
    # Creating connection pool
    connection_pool = pool.SimpleConnectionPool(
        1, 50,  # min and max connections
        host="myhost",
        database="mydb",
        user="myuser",
        password="mypassword"
    )
except psycopg2.Error as e:
    print("Error: Could not connect to Redshift")
    print(e)

# Even with connection pooling, we'd still hit limits during peak usage
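For contrast, here is a minimal sketch of the DuckDB side of the same setup (the file name is illustrative): there is simply no pool to manage, because the database lives inside your process.

import duckdb

# One in-process database handle; no network hop, no pool to exhaust
conn = duckdb.connect('my_database.db')

# For multi-threaded request handlers, give each worker its own cursor;
# cursors share the same in-process database instance
worker = conn.cursor()
print(worker.execute("SELECT 42").fetchone())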
Query Performance Deep Dive
The performance difference between Redshift and DuckDB was stark. Here is a real example.
-- This query in Redshift took ~35 seconds
WITH user_features AS (
SELECT
user_id,
COUNT(DISTINCT session_id) as session_count,
AVG(session_duration) as avg_duration,
SUM(interaction_count) as total_interactions
FROM user_sessions
WHERE date_trunc('day', timestamp) >= CURRENT_DATE - 30
GROUP BY user_id
),
recent_interactions AS (
SELECT
user_id,
ARRAY_AGG(item_id ORDER BY timestamp DESC LIMIT 10) as recent_items
FROM interactions
WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
GROUP BY user_id
)
SELECT
uf.*,
ri.recent_items
FROM user_features uf
LEFT JOIN recent_interactions ri ON uf.user_id = ri.user_id;
The same query in DuckDB:
# DuckDB completed in ~4 seconds
import duckdb
conn = duckdb.connect('my_database.db')
result = conn.execute("""
WITH user_features AS (
SELECT
user_id,
COUNT(DISTINCT session_id) as session_count,
AVG(session_duration) as avg_duration,
SUM(interaction_count) as total_interactions
FROM user_sessions
WHERE timestamp >= CURRENT_DATE - 30
GROUP BY user_id
),
recent_interactions AS (
SELECT
user_id,
LIST(item_id ORDER BY timestamp DESC LIMIT 10) as recent_items
FROM interactions
WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
GROUP BY user_id
)
SELECT
uf.*,
ri.recent_items
FROM user_features uf
LEFT JOIN recent_interactions ri ON uf.user_id = ri.user_id;
""").fetchdf()
Why Such A Dramatic Difference?
Architecture
- Redshift: Distributed system with network overhead.
- DuckDB: In-process database with direct memory access.
Query Execution
- Redshift: Requires data distribution across nodes.
- DuckDB: Vectorized execution on local data.
Memory Management
- Redshift: Manages shared memory across multiple connections.
- DuckDB: Direct access to process memory.
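If you want to verify the in-process effect on your own data, a quick-and-dirty timing harness is enough. This is only a sketch, and it assumes a local my_database.db containing the user_sessions table from earlier:

import time
import duckdb

conn = duckdb.connect('my_database.db')

start = time.perf_counter()
rows = conn.execute("""
    SELECT user_id, COUNT(*) AS events
    FROM user_sessions
    GROUP BY user_id
""").fetchall()
print(f"{len(rows)} rows in {time.perf_counter() - start:.2f}s")  # no network round trip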
Why DuckDB Won My Heart
Embedded Analytics Excellence
You can use literally any Python data library alongside DuckDB, which makes life so much simpler. Check out the example below:
import duckdb
import pandas as pd
import pyarrow as pa

# Create an in-memory database
# Look how simple it is to set up DuckDB
conn = duckdb.connect(':memory:')

# Load data from multiple sources
conn.execute("""
    CREATE TABLE user_data AS
    SELECT * FROM read_parquet('users/*.parquet')
""")

# Complex analytics with zero network overhead
result = conn.execute("""
    SELECT
        DATE_TRUNC('hour', timestamp) as hour,
        COUNT(*) as events,
        COUNT(DISTINCT user_id) as unique_users,
        AVG(response_time) as avg_response
    FROM user_data
    WHERE timestamp >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
    GROUP BY 1
    ORDER BY 1
""").fetchdf()

# Direct integration with pandas/arrow
arrow_table = pa.Table.from_pandas(result)
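It also works the other way around: because DuckDB runs inside your Python process, it can query a pandas DataFrame in place. A small sketch with a made-up DataFrame:

import duckdb
import pandas as pd

# A tiny made-up DataFrame, just for illustration
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "response_time": [120, 80, 200],
})

conn = duckdb.connect(':memory:')
conn.register("events", events)  # expose the DataFrame to SQL as a view
slow_users = conn.execute("""
    SELECT user_id, AVG(response_time) AS avg_response
    FROM events
    GROUP BY user_id
    HAVING AVG(response_time) > 100
""").fetchdf()
print(slow_users)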
Memory Efficiency and Data Processing
Its memory management is genuinely impressive. For my application, the data files were stored in Parquet format on S3, and boy, the speed at which it chewed through them all! Here is a general example.
# Efficient handling of large CSV files
conn.execute("""
    CREATE TABLE large_dataset AS
    SELECT * FROM read_csv_auto('large_file.csv',
        SAMPLE_SIZE=1000000,
        ALL_VARCHAR=0)
""")

# Automatic type inference and compression
conn.execute("""
    SELECT
        typeof(column_name),
        approx_count_distinct(column_name) as cardinality,
        pg_size_pretty(sum(length(column_name))) as size
    FROM large_dataset
    GROUP BY 1
""")
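For Parquet in particular, DuckDB pushes column projection and filters down into the files, which is a big part of why those S3 scans felt so fast. A hedged sketch with an illustrative local path and column names:

# Only the referenced columns are read, and the date filter prunes row groups,
# so scanning a large partitioned dataset stays cheap
daily_counts = conn.execute("""
    SELECT user_id, COUNT(*) AS events
    FROM read_parquet('data/events/*.parquet')
    WHERE event_date >= CURRENT_DATE - INTERVAL '7 days'
    GROUP BY user_id
""").fetchdf()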
Advanced Analytics Integration
DuckDB ships with all the analytics functions we are so dependent on when working with conventional databases.
# Complex window functions
conn.execute("""
    WITH user_metrics AS (
        SELECT
            user_id,
            timestamp,
            metric_value,
            AVG(metric_value) OVER (
                PARTITION BY user_id
                ORDER BY timestamp
                ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
            ) as moving_avg
        FROM metrics
    )
    SELECT * FROM user_metrics
    WHERE moving_avg > 100
""")

# Machine learning feature generation
features = conn.execute("""
    SELECT
        user_id,
        COUNT(*) as event_count,
        COUNT(DISTINCT session_id) as session_count,
        MIN(timestamp) as first_seen,
        MAX(timestamp) as last_seen,
        AVG(duration) as avg_duration,
        ARRAY_AGG(DISTINCT category) as categories
    FROM events
    GROUP BY user_id
""").df()
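And since the result lands as a plain pandas DataFrame, turning it into a feature matrix for whatever model sits downstream is a one-liner (the column selection here is just an illustration):

# `features` is an ordinary pandas DataFrame at this point
numeric_cols = ["event_count", "session_count", "avg_duration"]
X = features[numeric_cols].fillna(0).to_numpy()  # ready for any ML library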
When To Stick With Redshift (Because Let’s Be Real)
It’s perfect for:
Enterprise-scale data warehousing
- Handles petabyte-scale data with ease.
- Efficient for complex JOIN operations across large tables.
- Better for data spanning multiple regions/zones.
Complex ETL Workflows
- Native integration with AWS services.
- Scheduled maintenance and backups.
- Automated scaling and optimization.
- Supports complex ETL patterns.
When your security team won't let you breathe without audit logs
- Fine-grained access control.
- Row-level security.
- Audit logging and compliance reporting.
- Integration with AWS security services.
Power-User Tips: Supercharging DuckDB
Essential Extensions and Configurations
Before we wrap up, let me share some power-user configurations that took my DuckDB implementation from great to incredible.
import duckdb
# Basic connection with some tuned settings
conn = duckdb.connect('my_db.db', config={
'temp_directory': '/path/to/temp', # Custom temp directory
'memory_limit': '8GB', # Adjust based on your needs
'threads': 12 # Parallel processing threads
})
# Load essential extensions
conn.execute("INSTALL httpfs;") # HTTP file system support
conn.execute("LOAD httpfs;")
conn.execute("INSTALL json;") # Enhanced JSON support
conn.execute("LOAD json;")
conn.execute("INSTALL sqlite;") # SQLite scanner
conn.execute("LOAD sqlite;")
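With those extensions loaded you can, for example, ingest JSON or read an existing SQLite database in place. The file and table names below are purely illustrative:

# JSON extension: infer the schema straight from newline-delimited JSON
conn.execute("""
    CREATE TABLE raw_events AS
    SELECT * FROM read_json_auto('events.jsonl')
""")

# SQLite scanner: query a legacy SQLite file without exporting it first
legacy_users = conn.execute("""
    SELECT * FROM sqlite_scan('legacy_app.db', 'users')
""").fetchdf()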
Working With Remote Data
One of the coolest features is direct access to remote files. Here’s how I handle it.
# Reading directly from S3
conn.execute("""
SET s3_region='us-east-1';
SET s3_access_key_id='your_access_key';
SET s3_secret_access_key='your_secret_key';
""")
# Now you can query S3 data directly!
conn.execute("""
SELECT *
FROM read_parquet('s3://my-bucket/data/*.parquet')
WHERE date_column >= '2024-01-01'
""")
# Or even better, create a view
conn.execute("""
CREATE VIEW recent_data AS
SELECT *
FROM read_parquet('s3://my-bucket/data/*.parquet')
WHERE date_column >= CURRENT_DATE - INTERVAL '30 days'
""")
Memory Management Like A Pro
import os

# Configure adaptive memory management
def configure_memory_settings(conn, total_gb=8):
    settings = {
        'memory_limit': f'{total_gb}GB',
        'threads': max(4, os.cpu_count() - 2),  # Leave some CPU for other processes
        'temp_directory': '/path/to/fast/ssd',  # Use fast storage for spilling
    }
    for key, value in settings.items():
        conn.execute(f"SET {key}='{value}'")

# Example usage with streaming results
def process_large_dataset(conn):
    # Stream the result in chunks instead of materializing everything at once
    conn.execute("""
        SELECT *
        FROM huge_table
        WHERE complex_condition
    """)
    while True:
        chunk = conn.fetch_df_chunk()  # one batch of rows as a pandas DataFrame
        if chunk.empty:
            break
        process_chunk(chunk)
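Wiring the two helpers together looks roughly like this (process_chunk stands in for whatever your pipeline does with each batch):

conn = duckdb.connect('my_db.db')
configure_memory_settings(conn, total_gb=16)  # size to the machine you're running on
process_large_dataset(conn)                   # streams the result batch by batch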
Troubleshooting Tips
When things don't go as planned:
# Memory debugging
conn.execute("SELECT * FROM duckdb_memory();")
# Query profiling
conn.execute("""
EXPLAIN ANALYZE
SELECT complex_query_here;
""")
# Check extension status
conn.execute("SELECT * FROM duckdb_extensions();")
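If you'd rather keep the profiles around, DuckDB can also write them out to a file; a small sketch (the output path is illustrative):

# Persist query profiles to a file instead of reading them off the console
conn.execute("PRAGMA enable_profiling='json';")
conn.execute("PRAGMA profiling_output='/tmp/duckdb_profile.json';")
conn.execute("SELECT count(*) FROM huge_table;")  # the profile lands in the file above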
The Verdict
For my GenAI application, DuckDB proved to be a perfect fit, thanks to the combination of:
- Lightning-fast query performance.
- Zero concurrency issues.
- Excellent Python integration.
- Simplified infrastructure.
But would I use it for everything? Absolutely not. It only makes sense for applications that need real-time processing, have a moderate data size, benefit from direct integration with Python/pandas, and come with low-latency requirements.
Choosing the right tool isn't about following trends. It's about understanding your specific requirements and constraints. Sometimes, simpler is better. What's your experience with DuckDB and Redshift? Drop your thoughts in the comments below.