Getting Started With PyIceberg: A Pythonic Approach to Managing Apache Iceberg Tables

If you're an ML engineer who wants to run feature engineering for your models quickly, PyIceberg provides a lightweight way to do it.

By Junaith Haja · Aug. 20, 25 · Tutorial

Modern data platforms are evolving rapidly—driven by a need for scalability, flexibility, and analytics at scale. Lakehouse architecture sits at the center of this evolution, combining the low-cost storage of data lakes with the reliability and structure of data warehouses.

To power these lakehouses, organizations are turning to open table formats like Apache Iceberg. Originally developed at Netflix, Apache Iceberg was built to manage petabyte-scale analytics in cloud object storage. It brings database-style features—ACID transactions, schema evolution, partition pruning, and time travel—to large-scale files stored in systems like Amazon S3 or Azure Data Lake.

But most Iceberg implementations today rely on heavy compute engines like Spark, Flink, or Trino. That’s a hurdle for developers and ML engineers who want to experiment locally, version feature sets, or manage metadata without deploying clusters. In this article, we will learn how to use PyIceberg to build a feature store for a supply chain company.

Enter PyIceberg.

What Is PyIceberg?

PyIceberg is the official Python implementation of Apache Iceberg, and it lets you work with large datasets stored in the Iceberg table format. Apache Iceberg is a modern way of organizing huge data tables, especially on cloud object stores like Amazon S3 or Google Cloud Storage. It keeps your data consistent and easy to manage, even when you’re dealing with millions of records. Usually, to work with Iceberg, you’d need a heavy JVM-based engine like Spark or Flink, but PyIceberg removes that requirement: you can do most things using just Python.

With PyIceberg, you can easily check the structure of your data tables, add new information, or even see how the data looked at a specific point in time. This “time travel” feature is especially helpful for machine learning teams—when you need to recreate the exact dataset used in training a model, or when you want to debug something that worked before but is failing now. Since it works without needing a large server setup, PyIceberg makes local testing, development, and automation much easier.

For many companies and data teams moving towards cloud-native platforms and open formats, PyIceberg is a very handy tool. It helps you manage data like a pro without depending on big infrastructure. You can validate data, test pipelines, explore schema changes, and keep everything under control—all from your Python environment. As more teams start using Apache Iceberg, PyIceberg is becoming the go-to option for those who want something powerful but also lightweight and developer-friendly.

With PyIceberg, you can:

  • Create, modify, and inspect Iceberg tables locally
  • Append data from Pandas or Arrow tables
  • Perform schema evolution
  • Explore table metadata and snapshots
  • Reproduce ML training data using Iceberg time travel

It’s fast, flexible, and lightweight—perfect for modern Python-based workflows.

Real-World Example: Feature Store for Supply Chain Forecasting

Let’s say you work for a retail supply chain company. Your job is to build a daily feature store that feeds machine learning models which predict product demand across regions.

You need to:

  • Version your features (so ML models are reproducible)
  • Evolve schema as new signals are added
  • Inspect metadata without spinning up Spark
  • Enable local testing and development

This is a perfect use case for PyIceberg.

Folder Structure

supplychain_forecast/
├── .venv/                      # Virtual environment
├── warehouse/                  # Iceberg warehouse (local)
│   └── metadata/               # Iceberg metadata files
├── pyiceberg.db                # SQLite catalog
├── feature_pipeline.py         # Create & append features
├── schema_evolution.py         # Add new columns
├── snapshot_inspector.py       # View snapshot history
├── notebooks/
│   └── iceberg_explore.ipynb   # Jupyter exploration
├── data/
│   └── raw_input.csv           # External inputs
├── requirements.txt            # Python dependencies
└── README.md

Architecture Diagram


[Figure: PyIceberg Feature Store for Supply Chain Forecasting]


Setup and Installation

Let's set up the environment. First, create and activate a virtual environment:

Shell
 
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate


Then install PyIceberg with the SQL catalog and PyArrow extras, along with pandas:

Shell
 
pip install "pyiceberg[sql,pyarrow]" pandas
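
Before running the pipeline scripts, it can help to confirm that the active environment actually has the required packages. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the package list assumes that the `sql` extra pulls in SQLAlchemy.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Top-level import names this tutorial relies on (sqlalchemy is assumed
# to be installed by the "sql" extra of pyiceberg).
required = ["pyiceberg", "pyarrow", "pandas", "sqlalchemy"]

missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is reported as missing, check that the virtual environment is activated and rerun the install command.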


Step 1: Create the Feature Table

Python
 
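# feature_pipeline.py
from datetime import date
from pathlib import Path

import pandas as pd
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import BooleanType, DateType, IntegerType, NestedField, StringType

# Step 1.1: Load catalog (SQLite file in the project root, warehouse in ./warehouse)
warehouse_dir = Path("warehouse").resolve()
warehouse_dir.mkdir(exist_ok=True)
catalog = load_catalog(
    "local",
    type="sql",
    uri="sqlite:///pyiceberg.db",
    warehouse=warehouse_dir.as_uri(),
)

# Step 1.2: Define schema (every field needs a unique field ID;
# date, sku_id, and region are required, the rest are optional)
schema = Schema(
    NestedField(field_id=1, name="date", field_type=DateType(), required=True, doc="Forecast date"),
    NestedField(field_id=2, name="sku_id", field_type=StringType(), required=True, doc="SKU identifier"),
    NestedField(field_id=3, name="region", field_type=StringType(), required=True, doc="Sales region"),
    NestedField(field_id=4, name="units_sold", field_type=IntegerType(), required=False, doc="Actual sales"),
    NestedField(field_id=5, name="inventory", field_type=IntegerType(), required=False, doc="Stock level"),
    NestedField(field_id=6, name="promo_flag", field_type=BooleanType(), required=False, doc="Was promoted?"),
    NestedField(field_id=7, name="day_of_week", field_type=StringType(), required=False, doc="Signal for ML"),
)

# Step 1.3: Create the namespace and the Iceberg table
# (create_namespace raises an error if "features" already exists)
catalog.create_namespace("features")
table = catalog.create_table("features.daily_sku_demand_v1", schema=schema)

# Step 1.4: Append initial data
df = pd.DataFrame([
    {
        "date": date(2025, 7, 8),
        "sku_id": "SKU_123",
        "region": "West",
        "units_sold": 120,
        "inventory": 340,
        "promo_flag": True,
        "day_of_week": "Tuesday",
    },
    {
        "date": date(2025, 7, 8),
        "sku_id": "SKU_456",
        "region": "East",
        "units_sold": 95,
        "inventory": 200,
        "promo_flag": False,
        "day_of_week": "Tuesday",
    },
])

# Use an explicit Arrow schema so types and nullability match the Iceberg schema
# (pandas would otherwise produce nullable 64-bit integer columns)
arrow_schema = pa.schema([
    pa.field("date", pa.date32(), nullable=False),
    pa.field("sku_id", pa.string(), nullable=False),
    pa.field("region", pa.string(), nullable=False),
    pa.field("units_sold", pa.int32()),
    pa.field("inventory", pa.int32()),
    pa.field("promo_flag", pa.bool_()),
    pa.field("day_of_week", pa.string()),
])
arrow_table = pa.Table.from_pandas(df, schema=arrow_schema, preserve_index=False)
table.append(arrow_table)

print("Table created and data written.")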
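
The rows above hard-code `day_of_week`, but in a daily pipeline that signal would normally be derived from the date. The following is a minimal sketch of one way to do that with the standard library; the helper `build_feature_row` and the `DAY_NAMES` tuple are hypothetical names introduced for illustration, not part of PyIceberg.

```python
from datetime import date

# Fixed English day names so the result does not depend on the system locale.
# date.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")


def build_feature_row(day, sku_id, region, units_sold, inventory, promo_flag):
    """Build one feature record, deriving day_of_week from the date."""
    return {
        "date": day,
        "sku_id": sku_id,
        "region": region,
        "units_sold": units_sold,
        "inventory": inventory,
        "promo_flag": promo_flag,
        "day_of_week": DAY_NAMES[day.weekday()],
    }


row = build_feature_row(date(2025, 7, 8), "SKU_123", "West", 120, 340, True)
print(row["day_of_week"])  # Tuesday
```

A list of such dictionaries can be passed to `pd.DataFrame(...)` in place of the literal records used in Step 1.4.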


Step 2: Evolve the Schema

Python
 
# schema_evolution.py
from pyiceberg.types import FloatType

# Assumes `catalog` has been loaded with load_catalog(...) as in feature_pipeline.py
table = catalog.load_table("features.daily_sku_demand_v1")

# The schema change is committed when the `with` block exits
with table.update_schema() as update:
    update.add_column("forecast_demand", FloatType(), doc="ML prediction")
    update.add_column("forecast_confidence", FloatType(), doc="Confidence level")

print("Forecast columns added.")


Step 3: Inspect Metadata and Snapshots

Python
 
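# snapshot_inspector.py
# Assumes `catalog` has been loaded with load_catalog(...) as in feature_pipeline.py
table = catalog.load_table("features.daily_sku_demand_v1")

# snapshots() is a method; call it to get the list of snapshots
for snapshot in table.snapshots():
    print(f"Snapshot ID: {snapshot.snapshot_id}, Time: {snapshot.timestamp_ms}")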
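
The `timestamp_ms` value printed above is a Unix timestamp in milliseconds, which is hard to read at a glance. A small helper such as the one below (the name `format_snapshot_time` is hypothetical) converts it to an ISO 8601 string in UTC using only the standard library.

```python
from datetime import datetime, timezone


def format_snapshot_time(timestamp_ms: int) -> str:
    """Convert a millisecond Unix timestamp to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc).isoformat()


print(format_snapshot_time(1752000000000))  # 2025-07-08T18:40:00+00:00
```

In the loop above, `format_snapshot_time(snapshot.timestamp_ms)` could be printed in place of the raw value.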


To load data from a past snapshot:

Python
 
# snapshots() returns a list; [0] is the first snapshot recorded in the table metadata
snapshot_id = table.snapshots()[0].snapshot_id
df_old = table.scan(snapshot_id=snapshot_id).to_pandas()
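
To reproduce a training set you usually know when the model was trained, not which snapshot ID was current at that time. The sketch below shows one way to pick the right snapshot from a list of `(snapshot_id, timestamp_ms)` pairs; the helper `snapshot_id_as_of` and the sample IDs are hypothetical, and the function is plain Python rather than a PyIceberg API.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional, Tuple


def snapshot_id_as_of(snapshots: Iterable[Tuple[int, int]], as_of: datetime) -> Optional[int]:
    """Return the ID of the latest snapshot committed at or before `as_of`.

    `snapshots` holds (snapshot_id, timestamp_ms) pairs. `as_of` should be
    timezone-aware; a naive datetime is interpreted as local time.
    """
    cutoff_ms = int(as_of.timestamp() * 1000)
    eligible = [(ts, sid) for sid, ts in snapshots if ts <= cutoff_ms]
    if not eligible:
        return None  # the table had no snapshot yet at that time
    return max(eligible)[1]


# Made-up snapshot history: IDs 101 to 103 committed one second apart
history = [(101, 1_000), (102, 2_000), (103, 3_000)]
training_time = datetime.fromtimestamp(2.5, tz=timezone.utc)
print(snapshot_id_as_of(history, training_time))  # 102
```

With a real table, the pairs can be built as `[(s.snapshot_id, s.timestamp_ms) for s in table.snapshots()]`, and the returned ID passed to `table.scan(snapshot_id=...)`.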


Why This Matters for ML and DataOps

Benefit                     Why It Helps
Time Travel                 Reproduce past training sets with snapshot IDs
Schema Evolution            Add new features without migrations
Local Development           Work without Spark, Hive, or EMR
Fast Feedback Loop          Test new features in notebooks immediately
Modular Table Management    Ideal for versioned, testable data products


Limitations to Keep in Mind

  • Not yet suitable for high-throughput writes
  • No support for compaction or partition rewriting
  • Lacks distributed query capability (can be layered with DuckDB)

PyIceberg is a game-changer for Python-native data lakehouse development. It gives engineers and ML teams powerful tooling to:

  • Create and evolve Iceberg tables
  • Use time travel for versioned features
  • Prototype rapidly using familiar tools (Pandas, Arrow, Jupyter)

Whether you're tracking demand forecasts or running experiments, PyIceberg brings the power of Apache Iceberg to your laptop without the overhead of a cluster.

