Getting Started With PyIceberg: A Pythonic Approach to Managing Apache Iceberg Tables

If you're an ML engineer who wants to iterate quickly on feature engineering for your models, PyIceberg provides a lightweight way to work with Apache Iceberg tables.

Junaith Haja

Aug. 20, 25 · Tutorial

Modern data platforms are evolving rapidly—driven by a need for scalability, flexibility, and analytics at scale. Lakehouse architecture sits at the center of this evolution, combining the low-cost storage of data lakes with the reliability and structure of data warehouses.

To power these lakehouses, organizations are turning to open table formats like Apache Iceberg. Originally developed at Netflix, Apache Iceberg was built to manage petabyte-scale analytics in cloud object storage. It brings database-style features—ACID transactions, schema evolution, partition pruning, and time travel—to large-scale files stored in systems like Amazon S3 or Azure Data Lake.

But most Iceberg implementations today rely on heavy compute engines like Spark, Flink, or Trino. That's a hurdle for developers and ML engineers who want to experiment locally, version feature sets, or manage metadata without deploying clusters.

Enter PyIceberg. In this article, we will learn how to use it to build a feature store for a supply chain company.

What Is PyIceberg?

PyIceberg is the official Python library for working with large datasets stored in the Apache Iceberg format. Apache Iceberg is a modern table format for organizing huge data tables, especially on cloud object stores like Amazon S3 or Google Cloud Storage. It keeps your data consistent and manageable even when you are dealing with hundreds of thousands or tens of millions of records. Working with Iceberg has usually meant running heavy, JVM-based engines like Spark or Flink, but PyIceberg removes that requirement: you can do most things using just Python.

With PyIceberg, you can easily check the structure of your data tables, add new information, or even see how the data looked at a specific point in time. This “time travel” feature is especially helpful for machine learning teams—when you need to recreate the exact dataset used in training a model, or when you want to debug something that worked before but is failing now. Since it works without needing a large server setup, PyIceberg makes local testing, development, and automation much easier.

For many companies and data teams moving towards cloud-native platforms and open formats, PyIceberg is a very handy tool. It helps you manage data like a pro without depending on big infrastructure. You can validate data, test pipelines, explore schema changes, and keep everything under control—all from your Python environment. As more teams start using Apache Iceberg, PyIceberg is becoming the go-to option for those who want something powerful but also lightweight and developer-friendly.

With PyIceberg, you can:

Create, modify, and inspect Iceberg tables locally
Append data from Pandas or Arrow tables
Perform schema evolution
Explore table metadata and snapshots
Reproduce ML training data using Iceberg time travel

It’s fast, flexible, and lightweight—perfect for modern Python-based workflows.

Real-World Example: Feature Store for Supply Chain Forecasting

Let’s say you work for a retail supply chain company. Your job is to build a daily feature store that feeds machine learning models which predict product demand across regions.

You need to:

Version your features (so ML models are reproducible)
Evolve schema as new signals are added
Inspect metadata without spinning up Spark
Enable local testing and development

This is a perfect use case for PyIceberg.

Folder Structure

supplychain_forecast/
├── .venv/ # Virtual environment
├── warehouse/ # Iceberg warehouse (local)
│ ├── pyiceberg.db # SQLite catalog
│ └── metadata/ # Iceberg metadata files
├── feature_pipeline.py # Create & append features
├── schema_evolution.py # Add new columns
├── snapshot_inspector.py # View snapshot history
├── notebooks/
│ └── iceberg_explore.ipynb # Jupyter exploration
├── data/
│ └── raw_input.csv # External inputs
├── requirements.txt # Python dependencies
└── README.md

Architecture Diagram

[Architecture diagram: image not available]

Setup and Installation

Let's set up the environment. First, create and activate a virtual environment:

    Shell

python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate

Then install PyIceberg with the SQLite catalog and PyArrow extras, along with pandas:

    Shell

pip install "pyiceberg[sql-sqlite,pyarrow]" pandas

Step 1: Create the Feature Table

    Python
   
 

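# feature_pipeline.py
from datetime import date
from pathlib import Path

import pandas as pd
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import (
    BooleanType,
    DateType,
    IntegerType,
    NestedField,
    StringType,
)

# Step 1.1: Load catalog
# Resolve the warehouse directory relative to the current working directory
# and make sure it exists, since the SQLite catalog file is stored inside it.
warehouse_dir = Path("warehouse").resolve()
warehouse_dir.mkdir(parents=True, exist_ok=True)
catalog = load_catalog(
    "local",
    type="sql",
    uri="sqlite:///warehouse/pyiceberg.db",
    warehouse=warehouse_dir.as_uri(),
)

# Step 1.2: Define schema
# Each field needs a unique ID; required=True marks a non-nullable column.
schema = Schema(
    NestedField(field_id=1, name="date", field_type=DateType(), required=True, doc="Forecast date"),
    NestedField(field_id=2, name="sku_id", field_type=StringType(), required=True, doc="SKU identifier"),
    NestedField(field_id=3, name="region", field_type=StringType(), required=True, doc="Sales region"),
    NestedField(field_id=4, name="units_sold", field_type=IntegerType(), required=False, doc="Actual sales"),
    NestedField(field_id=5, name="inventory", field_type=IntegerType(), required=False, doc="Stock level"),
    NestedField(field_id=6, name="promo_flag", field_type=BooleanType(), required=False, doc="Was promoted?"),
    NestedField(field_id=7, name="day_of_week", field_type=StringType(), required=False, doc="Signal for ML"),
)

# Step 1.3: Create Iceberg table
# The namespace must exist before a table can be created in it.
catalog.create_namespace("features")
table = catalog.create_table("features.daily_sku_demand_v1", schema)

# Step 1.4: Append initial data
df = pd.DataFrame([
    {
        "date": date(2025, 7, 8),
        "sku_id": "SKU_123",
        "region": "West",
        "units_sold": 120,
        "inventory": 340,
        "promo_flag": True,
        "day_of_week": "Tuesday",
    },
    {
        "date": date(2025, 7, 8),
        "sku_id": "SKU_456",
        "region": "East",
        "units_sold": 95,
        "inventory": 200,
        "promo_flag": False,
        "day_of_week": "Tuesday",
    },
])
# Convert with the table's own Arrow schema so types and nullability match;
# pandas alone would infer 64-bit integers and nullable columns.
arrow_table = pa.Table.from_pandas(
    df, schema=table.schema().as_arrow(), preserve_index=False
)
table.append(arrow_table)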

print("Table created and data written.")
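
The sample rows above hard-code `day_of_week`. In a daily pipeline you would more likely derive calendar features from the `date` column before appending. Here is a minimal sketch in plain Python; `add_day_of_week` is a hypothetical helper for illustration, not part of PyIceberg:

```python
from datetime import date

# Fixed English names so the result does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def add_day_of_week(rows):
    """Return copies of the rows with `day_of_week` derived from `date`."""
    enriched = []
    for row in rows:
        new_row = dict(row)  # copy, so the caller's rows are not mutated
        new_row["day_of_week"] = DAY_NAMES[row["date"].weekday()]
        enriched.append(new_row)
    return enriched


rows = add_day_of_week([
    {"date": date(2025, 7, 8), "sku_id": "SKU_123", "region": "West"},
    {"date": date(2025, 7, 12), "sku_id": "SKU_456", "region": "East"},
])
print([r["day_of_week"] for r in rows])
```

The enriched rows can then be passed to `pd.DataFrame` in place of the hand-written dictionaries.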


  

Step 2: Evolve the Schema

    Python
   
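# schema_evolution.py
from pyiceberg.catalog import load_catalog
from pyiceberg.types import FloatType

# Reconnect to the same SQLite catalog used in feature_pipeline.py
catalog = load_catalog("local", type="sql", uri="sqlite:///warehouse/pyiceberg.db")
table = catalog.load_table("features.daily_sku_demand_v1")

# New columns are optional by default, so existing rows stay valid
with table.update_schema() as update:
    update.add_column("forecast_demand", FloatType(), doc="ML prediction")
    update.add_column("forecast_confidence", FloatType(), doc="Confidence level")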

print( "Forecast columns added.")
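
Adding a column in Iceberg is a metadata change: existing data files are not rewritten, and rows written before the change read back with null in the new columns. The toy sketch below mimics that read-side behavior with plain dictionaries. It does not call PyIceberg, `project_row` is a hypothetical helper, and it matches columns by name for brevity, whereas Iceberg itself resolves them by field ID.

```python
def project_row(stored_row, current_columns):
    """Read a stored row through the current schema, filling gaps with None."""
    return {name: stored_row.get(name) for name in current_columns}


# A row written before the schema change: it has no forecast columns.
old_row = {"sku_id": "SKU_123", "units_sold": 120}

# Column list after the schema change.
current_columns = ["sku_id", "units_sold", "forecast_demand", "forecast_confidence"]

projected = project_row(old_row, current_columns)
print(projected)
```

For a feature store, this means downstream code should expect nulls in `forecast_demand` and `forecast_confidence` for any rows appended before the columns existed.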

Step 3: Inspect Metadata and Snapshots

    Python
   
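# snapshot_inspector.py
from pyiceberg.catalog import load_catalog

catalog = load_catalog("local", type="sql", uri="sqlite:///warehouse/pyiceberg.db")
table = catalog.load_table("features.daily_sku_demand_v1")

# snapshots() is a method; it returns the table's snapshot history
for snapshot in table.snapshots():
    print(f"Snapshot ID: {snapshot.snapshot_id}, Time: {snapshot.timestamp_ms}")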

To load data from a past snapshot:

    Python
   
# Pick the first snapshot recorded in the table's history
snapshot_id = table.snapshots()[0].snapshot_id
df_old = table.scan(snapshot_id=snapshot_id).to_pandas()
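
Snapshot IDs are opaque, and when reproducing a training set you often know a date rather than an ID. The sketch below picks the newest snapshot committed at or before a cutoff and renders its commit time readably. It relies only on the `snapshot_id` and `timestamp_ms` attributes used above; `describe_snapshot`, `latest_snapshot_before`, and `FakeSnapshot` are hypothetical names introduced for illustration.

```python
from collections import namedtuple
from datetime import datetime, timezone


def describe_snapshot(snapshot):
    """Render a snapshot as '<id> @ <ISO-8601 UTC commit time>'."""
    committed = datetime.fromtimestamp(snapshot.timestamp_ms / 1000, tz=timezone.utc)
    return f"{snapshot.snapshot_id} @ {committed.isoformat()}"


def latest_snapshot_before(snapshots, cutoff):
    """Return the newest snapshot committed at or before `cutoff`, or None.

    `cutoff` should be a timezone-aware datetime; a naive one would be
    interpreted in the machine's local time zone.
    """
    cutoff_ms = int(cutoff.timestamp() * 1000)
    eligible = [s for s in snapshots if s.timestamp_ms <= cutoff_ms]
    return max(eligible, key=lambda s: s.timestamp_ms, default=None)


# Stand-ins for PyIceberg snapshots, used only to keep the sketch runnable.
FakeSnapshot = namedtuple("FakeSnapshot", ["snapshot_id", "timestamp_ms"])
history = [
    FakeSnapshot(101, 1_751_932_800_000),  # 2025-07-08 00:00 UTC
    FakeSnapshot(102, 1_752_019_200_000),  # 2025-07-09 00:00 UTC
]

cutoff = datetime(2025, 7, 8, 12, 0, tzinfo=timezone.utc)
chosen = latest_snapshot_before(history, cutoff)
print(describe_snapshot(chosen))
```

With a real table, `history` would be the list returned by `table.snapshots()`, and `chosen.snapshot_id` can be passed to `table.scan(snapshot_id=...)` exactly as in the snippet above.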

Why This Matters for ML and DataOps

Benefit	Why It Helps
Time Travel	Reproduce past training sets with snapshot IDs
Schema Evolution	Add new features without migrations
Local Development	Work without Spark, Hive, or EMR
Fast Feedback Loop	Test new features in notebooks immediately
Modular Table Management	Ideal for versioned, testable data products

Limitations to Keep in Mind

Not yet suitable for high-throughput writes
No support for compaction or partition rewriting
Lacks distributed query capability (can be layered with DuckDB)
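
One reason small, frequent writes are costly is that each `append` call commits a new snapshot and writes new data files. A common mitigation is to buffer rows and append them in larger batches. The sketch below shows only the batching step in plain Python; `chunked` is a hypothetical helper, and the batch size is an arbitrary placeholder rather than a recommendation.

```python
def chunked(rows, batch_size):
    """Yield successive lists of at most `batch_size` rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


rows = [{"sku_id": f"SKU_{i}"} for i in range(5)]
batches = list(chunked(rows, batch_size=2))
print([len(batch) for batch in batches])
```

Each batch could then be converted to an Arrow table and passed to `table.append`, so the five rows here would produce three commits instead of five.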

PyIceberg is a game-changer for Python-native data lakehouse development. It gives engineers and ML teams powerful tooling to:

Create and evolve Iceberg tables
Use time travel for versioned features
Prototype rapidly using familiar tools (Pandas, Arrow, Jupyter)

Whether you're tracking demand forecasts or running experiments, PyIceberg brings the power of Apache Iceberg to your laptop, without the overhead.
