DuckDB for Python Developers

DuckDB is an embeddable analytical database that runs inside your Python process with zero setup. It can query CSV files, Parquet, and pandas DataFrames.

varun joshi

May. 12, 26 · Analysis

Likes (0)

Comment

Save

2.2K Views

If you have ever tried to run a quick aggregation on a 3GB CSV file in pandas, you know the ritual: wait for it to load into the memory, watch your RAM climb, maybe get a Memory Error, then reach for something heavier — a Postgres instance, a Spark cluster, a cloud warehouse. It's a lot of infrastructure for what should be a five-minute analysis.

DuckDB exists to break that cycle. It's an analytical database that runs entirely in process, requires zero setup, and can query CSV files, Parquet, and pandas DataFrames directly — often faster than tools that cost thousands of dollars a month to run. This post is for Python developers who work with data and want a sharper tool in their kit.

The Problem DuckDB Solves

Pandas struggles past a few GB. Spinning up a Postgres or Redshift cluster for local analysis is overkill. DuckDB fills the gap with zero-setup columnar SQL.

What Makes It Different From SQLite

SQLite is row-oriented and built for transactional workloads. DuckDB is columnar and vectorized — designed for aggregations over millions of rows, not point lookups.

Getting Started: Zero Config, Instant Power

The first thing that surprises all developers about DuckDB is how little stands between installation and a working query. There is no server to start, no connection string to configure, no user to create, no port to open. You install a Python package, and you are done.

Installation and First Query

pip install duckdb — that's it. No daemon, no config file, no port to open.

This is the entire setup. No Docker container. No Brew install. No system dependency, as DuckDB is a single shared library that runs entirely inside your Python process — the same model as SQLite, but built for analytical workloads instead of transactional ones.

    Python
   
 

   import duckdb

duckdb.sql("""
    SELECT region, SUM(revenue) AS total
    FROM read_csv('sales.csv')
    GROUP BY region
    ORDER BY total DESC
""").show()
  

There is no need to load the file or define a schema, as DuckDB reads the file, infers the schema automatically, executes the aggregation, and prints a formatted result table — all in one shot.

In-Memory vs. Persistent Databases

By default, calling duckdb.sql() uses a shared in-memory database that lives for the duration of your Python session. This is ideal for exploratory analysis — fast, disposable, and zero cleanup required.

When you need results to persist between sessions, or you want to build up a set of tables over time, you connect to a file instead:

    Python
   
   import duckdb

# In-memory (default) — gone when the process ends
con = duckdb.connect()

# Persistent — saved to disk, readable next session
con = duckdb.connect("analytics.db")

The file-based database behaves identically to the in-memory one from a query perspective. You can use the same SQL, the same API, and the same result methods. The only difference is durability. For notebook-based analysis, you'll typically use in-memory; for pipelines that run on a schedule and need to accumulate state, a persistent database makes more sense.

One thing to note is that DuckDB allows multiple connections for reading but only one write connection at a time. This isn't a server database — it's not designed for concurrent writes from multiple processes. That's a deliberate trade-off.

The Two APIs: SQL-First vs. Relational API

DuckDB provides two ways to express queries. First is the SQL API — using the duckdb.sql() or con.sql() and get result back.

    Python
   
   result = con.sql("SELECT * FROM sales WHERE year = 2024")

Second is a relational API that lets you chain operations programmatically without building SQL strings.

    Python
   
 

   result = (
    con.table("sales")
       .filter("year = 2024")
       .aggregate("region, SUM(revenue) AS total", "region")
       .order("total DESC")
)
  

The SQL API is the natural starting point, especially if you already think in SQL. The relational API shines when you are building queries dynamically, where string concatenation will get messy and error-prone.

Querying DataFrames: SQL Meets Pandas

One of DuckDB's most useful features has nothing to do with files on disk. It can see the pandas DataFrames that are already in your Python session and query them directly — no loading step, no registration, no copying data into a separate structure. You just reference the variable by name in your SQL string.

    Python
   
 

   import pandas as pd

sales = pd.read_csv("sales.csv")

# 'sales' is just in scope — DuckDB finds it
result = duckdb.sql("""
    SELECT region, SUM(revenue) AS total
    FROM sales
    WHERE year = 2026
    GROUP BY region
    ORDER BY total DESC
""").df()
  

The DataFrame sales is a regular Python variable. DuckDB finds it in the local scope automatically. There is no duckdb.register("sales", sales) call required, no explicit hand-off between the two libraries. From DuckDB's perspective, your DataFrame is just a table.

The single feature changes how you use Pandas in practice. Instead of chaining .groupby().agg().sort_values().reset_index() — which is expressive if you think in pandas, but dense if you think in SQL — you write a query that says exactly what you mean. For developers who come from an SQL background and learned pandas as a second language, this is a better tool.

Why Is This Faster Than You Expect?

When DuckDB queries a pandas DataFrame, it doesn't copy the underlying DataFrame.It reads it directly through the Apache Arrow memory format, which pandas uses internally. This means the hand-off between pandas and DuckDB is essentially free from a memory perspective — there's no duplication, no serialization, no conversion overhead.

The performance gap between pandas and DuckDB quickly widens as the data size grows and query complexity increases. Pandas executes operations row-by-row in many cases; DuckDB uses a columnar vectorized execution engine that processes data in batches and automatically parallelizes across all available CPU cores. For a GROUP BY aggregation on a million rows, the difference can be an order of magnitude. For a multi-table join or a window function, it can be more.

Here's a concrete illustration. Say you have a DataFrame with 5 million rows and you want the top 10 customers by revenue for each region:

    Python
   
 

   # pandas approach
result = (
    sales
    .groupby(["region", "customer_id"])["revenue"]
    .sum()
    .reset_index()
    .sort_values(["region", "revenue"], ascending=[True, False])
    .groupby("region")
    .head(10)
)

# DuckDB approach
result = duckdb.sql("""
    SELECT region, customer_id, SUM(revenue) AS total
    FROM sales
    GROUP BY region, customer_id
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY region ORDER BY SUM(revenue) DESC
    ) <= 10
""").df()
  

Both produce the same result. The DuckDB version is typically 5–15× faster on large DataFrames, and most SQL practitioners would argue it's also more readable — the intent is explicit, and there's no intermediate .reset_index() to remember.

Zero-Copy With Arrow

When query results are large—millions of rows, wide schemas—the .df() conversion involves copying data from DuckDB's internal format into pandas' memory. For most use cases, this is fine. But if you're operating at the edge of your available RAM, or you need to pass results to a system that already speaks Arrow natively (like PyArrow, Polars, or many ML frameworks), you can skip the pandas copy entirely:

    Python
   
   arrow_table = con.sql("SELECT * FROM large_table").arrow()

# Pass directly to Polars — no copy
import polars as pl
polars_df = pl.from_arrow(arrow_table)

# Or pass to PyArrow for further processing
import pyarrow.compute as pc
filtered = arrow_table.filter(pc.greater(arrow_table["revenue"], 1000))

.arrow() returns a pyarrow.Table that shares memory and DuckDB's result buffer where possible. For a 10-million-row set, this can have several hundred megabytes of peak memory usage compared to going through pandas.

Polar DataFrames Work Too

Everything described above applies equally to Polars DataFrames. DuckDB recognizes them in scope the same way it does pandas DataFrames:

    Python
   
 

   import polars as pl

orders = pl.read_parquet("orders.parquet")

result = duckdb.sql("""
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
    HAVING COUNT(*) > 5
""").pl()  # .pl() returns a Polars DataFrame directly
  

The .pl() results in the polar equivalent of .df() ds- It returns a native Polars DataFrame without an intermediate pandas conversion. If your stack already uses Polars, you can use DuckDB for the SQL-shaped parts of your pipeline and stay in Polars for everything else, with clean hand-offs in both directions.

When to Reach for Pandas vs. DuckDB

DuckDB doesn't replace pandas — it fills a different shape of problem. A useful mental split:

Reach for DuckDB when you're doing aggregations, joins, filtering, sorting, window functions, or anything that maps naturally to a SQL query. These are things DuckDB does faster, with less code, and with better memory efficiency at scale.

Reach for pandas when you need row-level Python logic — custom string parsing, calling a Python function on each row, integrating with libraries that expect a DataFrame as input. Pandas is more natural here, and DuckDB can't execute arbitrary Python inside a SQL query.

The best workflows use both. DuckDB handles the heavy analytical lifting; pandas handles the row-level transformations and the hand-offs to the rest of the Python ecosystem. Neither tool needs to know or care that the other is involved.

Reading Files Directly — CSV, Parquet, JSON

So far, we have seen how DuckDB works with querying the data that is already in memory — DataFrames loaded by pandas and Polars. But one of DuckDB's most powerful capabilities is querying files directly, without loading them into memory at all. You point DuckDB at a file, write a SQL query, and it streams through only the data it needs to answer that query. Your Python process never holds the full dataset in RAM.

This is extremely important as A 10 GB file might completely exhaust the RAM on most development machines if you try to load it into pandas. DuckDB can run aggregations on that same file in seconds, using a fraction of the memory, without any special configuration.

Querying CSV Files

    Python
   
 

   import duckdb

duckdb.sql("""
    SELECT region, COUNT(*) AS orders, SUM(revenue) AS total
    FROM read_csv('sales.csv')
    GROUP BY region
    ORDER BY total DESC
""").show()
  

DuckDB samples the file to infer column names and types automatically. In most cases, the inference is accurate enough that you don't need to specify anything. When it gets a type wrong — a numeric column stored as a string, a date in an unusual format — you can override it explicitly:

    Python
   
 

   duckdb.sql("""
    SELECT *
    FROM read_csv('sales.csv', columns = {
        'order_date': 'DATE',
        'postal_code': 'VARCHAR'
    })
    LIMIT 5
""").show()
  

Parquet: Where DuckDB Really Shines

If CSV is the format you start with, Parquet is the format you graduate to — and DuckDB treats it as a first-class citizen. The syntax is identical to CSV, just with read_parquet():

    Python
   
 

   duckdb.sql("""
    SELECT product_category, SUM(revenue) AS total
    FROM read_parquet('orders.parquet')
    GROUP BY product_category
    ORDER BY total DESC
""").show()
  

But the performance characteristics are dramatically different. Parquet is a columnar format, which means data for each column is stored together on disk.

DuckDB exploits this in two ways that matter enormously for large files.

The first is projection pushdown: if your query only touches three columns out of fifty, DuckDB reads only those three columns from disk. The other forty-seven are never touched. On a wide dataset, this can reduce I/O by 90% or more.

The second is predicate pushdown: Parquet files store min/max statistics for each row group (a chunk of roughly 100,000 rows). If your WHERE clause filters on a column and an entire row group falls outside the filter range, DuckDB skips it completely without reading a single byte of actual data. A query that filters to a narrow date range on a year's worth of data might read only 5% of the file.

Together, these two optimizations mean that querying a 20GB Parquet file can feel faster than querying a 2GB CSV — because DuckDB is doing far less actual work.

Reading From S3 and URLs

DuckDB's file reading capability extends beyond the local file system. With the httpfs extension installed. You can query files directly from S3, Google Cloud Storage, or any public HTTP URL.

    Python
   
 

   duckdb.sql("INSTALL httpfs; LOAD httpfs;")

duckdb.sql("""
    SELECT region, SUM(revenue) AS total
    FROM read_parquet('s3://my-bucket/data/sales/*.parquet')
    GROUP BY region
    ORDER BY total DESC
""").show()
  

DuckDB uses HTTP range requests to implement the same pushdown optimizations remotely as it does locally. When it can skip a row group based on statistics, it issues a range request for only the byte it needs- it does not download the entire file.

Or large Parquet files on S3, this can mean the difference between a query that transfers 200MB and one that transfers 4GB, with a corresponding difference in speed and cost.

For authenticated access, you set credentials once on the connection:

    Python
   
 

   con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
con.sql("""
    SET s3_region = 'us-east-1';
    SET s3_access_key_id = 'your_key';
    SET s3_secret_access_key = 'your_secret';
""")

result = con.sql("""
    SELECT * FROM read_parquet('s3://private-bucket/data.parquet')
    LIMIT 10
""").df()
  

For teams that store their data in cloud object storage, this makes DuckDB a surprisingly capable query engine for ad-hoc analysis — no data warehouse subscription required.

Conclusion

DuckDB occupies a gap, until recently, had no good solution in the python data ecosystem. Too big for pandas and too small to justify a warehouse, too analytical for SQLite — that middle ground where you reach something heavy and slow, or give up and sample your data down to a size that fits. DuckDB closes that gap cleanly.

What makes it stick as a tool is not any single feature. It's the accumulation of small frictions that it removes. You don't set up a server. You don't define a schema before querying a CSV. You don't register a DataFrame before running SQL against it. You don't write a boto3 pipeline to pull data off S3 before you can look at it. Each of these is a small thing, but together they add up to a qualitatively different experience — one where the gap between "I have some data" and "I have an answer" collapses to almost nothing.

The honest caveats are worth reiterating: DuckDB is not a replacement for a production database that handles concurrent writes. It is not the right choice for OLTP workloads, for multi-user applications, or anywhere you need more than one process writing to the same database at the same time. It knows what it is — an analytical engine — and it's extraordinarily good at that specific thing.

For Python developers who work with data, the practical recommendation is simple: add DuckDB to your toolkit alongside pandas, not instead of it. Use pandas where Python-level row logic is what you need. Use DuckDB where SQL is the natural shape of the problem. The two libraries hand off to each other cleanly, and together they cover nearly everything a working data developer encounters day to day.

The best place to start is the simplest one: next time you reach for pd.read_csv()on a file larger than a gigabyte, try duckdb.sql("SELECT ... FROM read_csv('file.csv')") first. The results will speak for themselves.

Database Pandas Python (language) sql

Opinions expressed by DZone contributors are their own.

Related

Trending