Getting Started With ClickHouse for AI/ML in Python
ClickHouse is a high-performance, column-oriented OLAP database designed to handle petabyte-scale data with lightning-fast query execution.
Join the DZone community and get the full member experience.
Join For FreeAs artificial intelligence (AI) and machine learning (ML) workloads grow in complexity and volume, traditional databases often struggle to meet the performance needs of large-scale, real-time analytics. ClickHouse, a high-performance, column-oriented OLAP (Online Analytical Processing) database designed to handle petabyte-scale data with lightning-fast query execution, offers a compelling solution for data engineers and ML practitioners alike. Its unique columnar storage, vectorized execution, and support for distributed deployments make it highly suitable for processing massive datasets generated by IoT devices, web platforms, and large-scale enterprise applications. By enabling both high-speed querying and efficient storage, ClickHouse allows organizations to analyze and act on data in near real-time, which is critical for dynamic AI/ML pipelines and feature engineering tasks.
In this article, we’ll explore how to get started with ClickHouse and Python for building fast, scalable AI/ML pipelines.
We'll walk through practical examples:
- Setting up ClickHouse
- Ingesting customer data
- Querying with Python
- Aggregations
- Visualizations
- Using it in an ML pipeline
Why ClickHouse for AI/ML?
ClickHouse stands out for several reasons:
- High Performance: Its columnar storage format and vectorized query engine make analytical queries blazingly fast.
- Compression and Parallelism: Data compression reduces storage footprint, while support for parallel processing enables massive scalability.
- SQL Interface: Supports standard SQL syntax for ease of use and integration.
- Real-Time Analytics: Ideal for use cases requiring low-latency insights, including feature engineering and model monitoring.
These features make ClickHouse particularly well-suited for tasks like:
- Training set generation from large logs
- Feature aggregation in real-time systems
- Monitoring and drift detection in production ML pipelines
Prerequisites
Before we dive into Python integration, ensure the following are installed:
- Python ≥ 3.7
- clickhouse-connect (or clickhouse-driver)
- pandas, scikit-learn, numpy, seaborn
- Docker (for quick ClickHouse setup)
Step 1: Installing ClickHouse
Install the required packages:
pip install clickhouse-connect pandas numpy scikit-learn seaborn
The fastest way to spin up a ClickHouse server locally is using Docker:
docker run -d --name clickhouse-server -p 8123:8123 -p 9000:9000 clickhouse/clickhouse-server
You can access the web UI at: http://localhost:8123
Step 2: Connect to ClickHouse Using Python
Cerate a basic Python connection:
import clickhouse_connect
client = clickhouse_connect.get_client(host='localhost', port=8123)
# Test connection
print(client.query('SELECT version()').result_rows)
Step 3: Create a Table and Insert Data
Suppose we want to analyze synthetic customer behavior for churn prediction:
client.command('''
CREATE TABLE IF NOT EXISTS user_sessions (
user_id UInt64,
session_id UUID,
timestamp DateTime,
device String,
page_views UInt8,
time_spent Float32,
conversion UInt8
) ENGINE = MergeTree()
ORDER BY (user_id, timestamp)
''')
Now insert some data:
import pandas as pd
import numpy as np
from uuid import uuid4
from datetime import datetime, timedelta
n = 5000
df = pd.DataFrame({
'user_id': np.random.randint(1000, 2000, size=n),
'session_id': [str(uuid4()) for _ in range(n)],
'timestamp': [datetime.now() - timedelta(minutes=np.random.randint(0, 10000)) for _ in range(n)],
'device': np.random.choice(['mobile', 'desktop', 'tablet'], size=n),
'page_views': np.random.poisson(5, size=n),
'time_spent': np.abs(np.random.normal(300, 50, size=n)),
'conversion': np.random.binomial(1, 0.05, size=n)
})
client.insert_df('user_sessions', df)
Step 4: Query Data from ClickHouse and Preview
query = '''
SELECT user_id, session_id, timestamp, device, page_views, time_spent, conversion
FROM user_sessions
WHERE timestamp >= now() - INTERVAL 7 DAY
LIMIT 1000
'''
df = client.query_df(query)
print(df.head())
Step 5: Fetching and Exploring Data With Pandas
Fetch customer_data into a DataFrame:
query = '''
SELECT customer_id, session_id, device, region, page_views, time_spent, conversion, timestamp
FROM customer_data
WHERE timestamp >= now() - INTERVAL 7 DAY
LIMIT 1000
'''
df = client.query_df(query)
print(df.head())
Step 6: Aggregation Query for Insights
agg_query = '''
SELECT device, region,
AVG(page_views) AS avg_views,
AVG(time_spent) AS avg_time,
SUM(conversion) AS total_conversions
FROM customer_data
GROUP BY device, region
ORDER BY total_conversions DESC
'''
agg_df = client.query_df(agg_query)
print(agg_df.head())
This demonstrates how ClickHouse can feed ML pipelines efficiently by acting as a high-speed data warehouse.
Step 7: Visualizing Customer Behavior
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")
# Average Page Views per Device
plt.figure(figsize=(10,6))
sns.barplot(x='device', y='avg_views', data=df_agg, palette='Blues_d')
plt.title('Average Page Views per Device')
plt.xlabel('Device')
plt.ylabel('Average Page Views')
plt.show()
# Average Time Spent per Device
plt.figure(figsize=(10,6))
sns.barplot(x='device', y='avg_time', data=df_agg, palette='Greens_d')
plt.title('Average Time Spent per Device (seconds)')
plt.xlabel('Device')
plt.ylabel('Average Time Spent (seconds)')
plt.show()
# Total Conversions per Device
plt.figure(figsize=(10,6))
sns.barplot(x='device', y='total_conversions', data=df_agg, palette='Reds_d')
plt.title('Total Conversions per Device')
plt.xlabel('Device')
plt.ylabel('Total Conversions')
plt.show()
Step 8: Train a Simple Model Using Python
# Encode categorical variables
df['device'] = df['device'].astype('category').cat.codes
X = df[['device', 'page_views', 'time_spent']]
y = df['conversion']
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
clf = GradientBoostingClassifier().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
Such aggregated features can be joined back into training data or served to live models.
Performance Optimization Tips
- Use Materialized Views to pre-aggregate heavy features.
- Leverage LowCardinality types for efficient string storage.
- Avoid complex joins; favor denormalization.
- Use TTL clauses to automatically purge old data.
- Be cautious with FINAL in queries as it forces merges and slows queries.
Limitations
While ClickHouse delivers exceptional performance, there are certain constraints and trade-offs to be aware of:
- No built-in ML engine (unlike Vertica or SingleStore)
- Limited transaction support – not ideal for OLTP workloads
- Requires external tools for orchestration and model deployment
Future Potential
ClickHouse is gaining traction in the ML community. Possible future enhancements include:
- Native Python UDFs for embedded ML logic
- Better integration with distributed computing frameworks like Ray and Dask
- Advanced support for vector embeddings and LLM workloads
Summary
By integrating ClickHouse with Python, teams can efficiently process large datasets, automate feature aggregation, and reduce the time from data ingestion to model training. Its performance advantages make it ideal for operational analytics, predictive modeling, and even serving live features to ML models in production environments. Organizations can also implement monitoring pipelines to detect data drift or anomalies, ensuring their AI systems remain accurate and responsive.
Whether you're building fraud detection models, recommendation engines, or customer segmentation workflows, ClickHouse provides the speed, scalability, and reliability needed to support sophisticated AI/ML applications. As the demand for real-time insights continues to grow, mastering ClickHouse integration with Python can significantly enhance your data infrastructure, helping teams unlock actionable insights and maintain a competitive edge in fast-paced data-driven environments.
Opinions expressed by DZone contributors are their own.
Comments