Scaling Real-Time Data Systems With DataOps: Principles, Practices, and Use Cases
Learn how DataOps principles like CI/CD, schema versioning, and observability improve reliability in real-time streaming data pipelines.
Join the DZone community and get the full member experience.
Join For FreeEditor's Note: The following is an article written for and published in DZone's 2025 Trend Report, Data Engineering: Scaling Intelligence With the Modern Data Stack.
Real-time decision making is no longer a competitive advantage; it's becoming a baseline expectation. From fraud detection to personalized recommendations, modern systems are expected to process and respond to user activity within milliseconds. But while demand for real-time data has exploded, many engineering teams are still struggling with brittle pipelines, silent failures, and fragile deployments.
In this article, we explore how DataOps brings much-needed discipline to real-time architectures. We'll dive into principles like CI/CD, schema versioning, observability, and environment parity, and walk through a fully open-source clickstream pipeline that puts these ideas into action.
DataOps Principles for Streaming Architectures
One of the foundational principles of DataOps is treating data as a product. This means going beyond the one-off script or ad hoc transformation logic to view data pipelines as long-lived assets. A real-time data pipeline, much like an API or service, should have version control, proper documentation, automated tests, and clearly defined contracts with its consumers. This enables pipelines to evolve safely over time, much like APIs.
Another fundamental pillar of DataOps is enabling the continuous delivery of both data and metadata. While batch workflows often align deployment cycles with scheduled reports or model refreshes, real-time systems demand a different approach. Because they operate in an ongoing, always-on mode, the supporting delivery mechanisms must also be continuous, automated, and resilient. In this setting, CI/CD pipelines take on expanded responsibilities. They not only deploy new code but also track schema changes, propagate metadata like lineage and freshness, and ensure consistent configuration across development, staging, and production environments. Whether it's a transformation update, a configuration change, or a schema evolution, every modification must be treated with the same engineering discipline and testing rigor as a production software release.
Reproducibility and environment parity are equally vital in streaming architectures. The cost of "it works on my machine" can be disastrous when data is flowing in real time. Consistent configuration across environments helps prevent subtle bugs that emerge only in production. This requires a strong discipline of managing infrastructure and pipeline definitions as code.
Together, these principles form the backbone of DataOps for real-time systems. But to put them into practice, we need supporting strategies around schema management, orchestration, and monitoring. In the next few sections, we'll explore each of these dimensions in detail and learn about ways we could incorporate them in our systems.
Managing Schemas and Sources in Streaming Architectures
One of the most important aspects of building real-time pipelines is schema management. Before you can monitor or deploy pipelines safely, you need to manage the structure of your data. Schemas should be version controlled and validated before being promoted to production. For instance, using Apache Avro or Protobuf, you can define a record structure that producers and consumers agree on. These schemas are stored in a registry and validated during pull requests, ensuring backward and forward compatibility.
//Avro schema for a sample 'ClickEvent'
{
"type": "record",
"name": "ClickEvent",
"namespace": "com.example.analytics",
"fields": [
{ "name": "user_id", "type": "int" },
{ "name": "event_type", "type": "string" },
{ "name": "timestamp", "type": "long" }
]
}
In addition to managing schemas, teams should apply the same rigor to the logic and configuration that power their pipelines. Streaming SQL, transformation logic, and configuration files should all live in Git alongside test harnesses and example inputs. A modular repo structure where business logic, test data, and environment configs are separated makes it easier to review changes and trace errors when incidents arise. This way, teams can build pipelines that are testable and reproducible by treating both schemas and code as source-controlled assets.
// A modular project structure for storing configs, code and tests
clickstream-pipeline/
├── README.md
├── ci/
│ └── validate-schema.sh # CI scripts for schema compatibility, linting
├── config/
│ ├── dev/
│ │ └── flink-config.yaml # Dev-specific job config
│ ├── staging/
│ │ └── flink-config.yaml
│ └── prod/
│ └── flink-config.yaml # Production-specific parameters (e.g., Kafka topics)
├── jobs/
│ └── clickstream-sessionizer/
│ ├── src/
│ │ └── MainFlinkJob.java # Main Flink job logic
│ ├── resources/
│ │ └── application.conf # Job-level config
│ └── build.gradle # Build file for the job
├── schemas/
│ ├── v1/
│ │ └── click_event.avsc # Initial schema version
│ └── v2/
│ └── click_event.avsc # Updated schema version
├── test/
│ ├── integration/
│ │ ├── sample-click-events.json # Test data for end-to-end validation
│ │ └── expected-output.json
│ └── unit/
│ └── ClickstreamJobTest.java # Unit tests for transformation logic
├── docker/
│ └── Dockerfile # Container definition for deployment
└── .github/
└── workflows/
└── ci.yaml # GitHub Actions workflow for CI/CD
Automating Deployments and Validating Pipeline Changes
Schema management lays the foundation, but safe deployments require robust orchestration and automation. Streaming pipelines should be deployed declaratively, using versioned job definitions and Infrastructure-as-Code principles. For example, rather than manually submitting jobs or modifying runtime parameters, teams can define their Flink or Airflow jobs as code, store them in Git, and apply changes via pull requests.
name: CI/CD for Clickstream Pipeline
on: [push]
jobs:
build-test-deploy:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
- name: Build JAR
run: ./gradlew shadowJar
- name: Validate Schema
run: ./ci/validate-schema.sh
- name: Run Integration Tests
run: ./test/run-integration-tests.sh
- name: Deploy to Staging
run: ./scripts/deploy.sh staging
- name: Canary Deploy to Prod
run: ./scripts/canary-deploy.sh
This kind of automation removes the human error from critical deployments, encourages frequent iteration, and creates a paper trail for audits or incident response. It also supports true environment parity, where staging and production differ only by templated variables and not code.
Observability and On-Call Practices for Streaming Pipelines
Even with robust deployments, real-time systems will eventually fail. Metrics and observability are what turn these failures into solvable problems rather than open-ended mysteries. Every layer of a streaming pipeline emits important operational signals that help engineers monitor health, performance, and failures. The table below highlights key metrics from the most common components:
| Pipeline layer | key operational signals |
|---|---|
|
Apache Kafka |
Consumer lag, message throughput, partition skew |
|
Apache Flink |
Checkpoint duration, processing latency, error rates |
|
Apache Airflow |
DAG success/failure rates, task duration, scheduling delays |
Table 1. Key metrics for most common components
These metrics should be exposed through tools like Prometheus and visualized with dashboards (e.g., Grafana). But the real value comes from creating business-aware alerts. Instead of paging on CPU spikes, alert when event throughput drops unexpectedly or when schema violations spike.
Validation checks can also be embedded directly in job logic. A Flink job, for example, might verify that a user_id field is never null or that a specific distribution of event types remains within an expected range. These checks help surface data quality issues before they reach consumers. On-call readiness also means having clear runbooks. When an alert fires, responders should know: what changed, who owns it, and what the rollback plan is. Post-incident reviews help refine these processes over time, improving resilience across the board.
Use Case: Real-Time Clickstream Aggregation
To see DataOps in action, let's walk through a simplified clickstream pipeline. We'll use real data, simulate real-time ingestion, and process events using streaming tools, all while applying versioning, testing, and observability best practices.
Setting Up the Infrastructure Stack
Before actually writing code, we need to set up the infrastructure stack for our DataOps-enabled project. This involves installing Kafka and Apache Zookeeper for data ingestion, Flink for stream processing, Apicurio Schema Registry for schema management, and Prometheus and Grafana for observability. We can use this sample file for installation:
# Docker Compose setup for Kafka, Flink, Apicurio, Prometheus, Grafana
version: '3.8'
services:
zookeeper:
image: confluentinc/cp-zookeeper:7.0.1
environment:
ZOOKEEPER_CLIENT_PORT: 2181
ZOOKEEPER_TICK_TIME: 2000
ports:
- "2181:2181"
kafka:
image: confluentinc/cp-kafka:7.0.1
depends_on:
- zookeeper
ports:
- "9092:9092"
environment:
KAFKA_BROKER_ID: 1
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
apicurio:
image: apicurio/apicurio-registry-mem:2.3.2.Final
ports:
- "8081:8080"
environment:
QUARKUS_PROFILE: prod
REGISTRY_DATASOURCE_URL: jdbc:h2:mem:registry
REGISTRY_UI_CONFIG_URL: http://localhost:8081
jobmanager:
image: flink:1.17.1
ports:
- "8082:8081"
command: jobmanager
environment:
- JOB_MANAGER_RPC_ADDRESS=jobmanager
taskmanager:
image: flink:1.17.1
depends_on:
- jobmanager
command: taskmanager
environment:
- JOB_MANAGER_RPC_ADDRESS=jobmanager
prometheus:
image: prom/prometheus
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9090:9090"
grafana:
image: grafana/grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
Ingesting Events
The next phase focuses on data ingestion. We start by exploring the dataset and defining a schema that represents the event structure accurately. This schema is registered to enforce compatibility and enable downstream validation. Real-time events are then produced into Apache Kafka, and Apache Flink is used to ingest and process the event stream as it flows through the system.
Registering Schema
Once we have performed the above, the next step is to explore the dataset, finalize the schema, and register it. We will be using the E-Commerce Behavior Data From Multi-Category Store for our walkthrough, which includes user events such as page views and purchases with timestamps. Registering schema in a registry is essential as it validates the incoming events and enforces backward compatibility across versions. This ensures producers don't accidentally introduce breaking changes that affect downstream jobs. You can use the CLI or cURL to register your Avro schema:
cat <<EOF > click_event.avsc
{
"type": "record",
"name": "ClickEvent",
"fields": [
{"name": "user_id", "type": "long"},
{"name": "event_type", "type": "string"},
{"name": "timestamp", "type": "long"}
]
}
EOF
curl -X POST http://localhost:8081/apis/registry/v2/groups/default/artifacts -H "Content-Type: application/json" -H "X-Registry-ArtifactId: click-event" --data-binary @schemas/click_event.avsc
Produce to Kafka (Simulated Real Time)
Next, we will simulate real-time events by reading records of the e-commerce behavior dataset, pausing for 50ms between each record, and sending it to Kafka:
import time
import json
import pandas as pd
from fastavro import schemaless_writer
from io import BytesIO
from confluent_kafka import Producer
# Load schema
with open("schemas/click_event.avsc", "r") as f:
schema = json.load(f)
# Set up Kafka producer
producer = Producer({'bootstrap.servers': 'localhost:9092'})
# Load the CSV (you can limit for testing)
df = pd.read_csv("test/data/2019-Nov.csv", nrows=10000)
# Filter and rename
df = df[["user_id", "event_type", "event_time"]]
df = df.dropna()
# Convert datetime to epoch ms
df["event_time"] = pd.to_datetime(df["event_time"], utc=True)
df["event_time"] = df["event_time"].astype(int) // 10**6 # to milliseconds
# Publish to Kafka
for _, row in df.iterrows():
event = {
"user_id": int(row['user_id']),
"event_type": str(row['event_type']),
"timestamp": int(row['event_time'])
}
buffer = BytesIO()
schemaless_writer(buffer, schema, event)
producer.produce("clickstream", value=buffer.getvalue())
print(f"Sent: {event}")
time.sleep(0.05) # simulate real-time
producer.flush()
Stream Processing in Real Time
We'll use a Flink job to handle these events as they arrive, grouping them by user session, tracking user actions through conversion funnels, and counting event types within tumbling time windows, by using either SQL or the DataStream API.
-- Define source table using Avro and schema registry
CREATE TABLE clickstream (
user_id BIGINT,
event_type STRING,
timestamp TIMESTAMP(3),
WATERMARK FOR timestamp AS timestamp - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
'topic' = 'clickstream',
'properties.bootstrap.servers' = 'localhost:9092',
'scan.startup.mode' = 'earliest-offset',
'format' = 'avro-confluent',
'avro-confluent.schema-registry.url' = 'http://apicurio:8080/apis/registry/v2',
'avro-confluent.subject' = 'click-event'
);
-- Define sink table (printing for now)
CREATE TABLE event_summary (
event_type STRING,
event_count BIGINT,
window_start TIMESTAMP(3),
window_end TIMESTAMP(3)
) WITH (
'connector' = 'print'
);
-- Aggregate click events by type and time window
INSERT INTO event_summary
SELECT
event_type,
COUNT(*) AS event_count,
TUMBLE_START(timestamp, INTERVAL '10' MINUTE),
TUMBLE_END(timestamp, INTERVAL '10' MINUTE)
FROM clickstream
GROUP BY
TUMBLE(timestamp, INTERVAL '10' MINUTE),
event_type;
CI/CD Validation
The job's logic, configuration, and test cases are version controlled in Git. When a change is proposed, continuous integration runs schema validations and unit tests before deploying to a staging environment. After successful verification, a canary deployment gradually rolls out the update to production.
name: CI for Clickstream Pipeline
on: [push]
jobs:
build-test-deploy:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Build pipeline logic
run: ./gradlew shadowJar
- name: Validate Avro Schema
run: ./ci/validate-schema.sh
- name: Run integration tests
run: ./test/integration/run-tests.sh
- name: Deploy to staging
run: ./deploy/deploy.sh staging
- name: Canary deploy to prod
run: ./deploy/deploy.sh production --canary
The following table lists different configs for different environments.
| config category | what to customize | examples |
|---|---|---|
|
Kafka topics |
Use separate topics per env to isolate data |
clickstream-dev, clickstream-staging, clickstream-prod |
|
Schema registry |
Point to isolated schema registry or use group namespace |
http://localhost:8081/groups/dev |
|
Checkpoint storage |
Use different file system paths |
/tmp/flink/dev-checkpoints/, /mnt/prod-checkpoints/ |
|
State backends |
Enable RocksDB in prod, memory in dev |
rocksdb vs heap |
|
Parallelism |
Lower parallelism for dev/test |
parallelism.default: 1 (dev), 8 (prod) |
|
Job timeouts |
Use shorter timeouts for rapid iteration in dev |
execution.checkpointing.interval: 2m (dev) vs 10m (prod) |
|
Logging levels |
Verbose logging in dev, warnings only in prod |
INFO (dev), WARN or ERROR (prod) |
|
Dead letter queues |
Use isolated DLQ Kafka topics |
dlq-dev, dlq-prod |
|
Validation flags |
Enable strict checks in staging/prod |
validate.schema: true |
|
Alert thresholds |
More tolerant in dev |
Alert on lag > 10k (dev) vs > 1k (prod) |
Table 2. Configs for various environments
Observing the Pipeline
Flink exports metrics like processing lag and checkpoint duration to Prometheus. Dashboards in Grafana visualize these metrics alongside custom alerts for event drops or schema anomalies. For example, we can update the config file with following metrics and restart the flink job with these configs to export metrics to Prometheus:
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9250
Prometheus will scrape metrics at: http://flink-jobmanager:9250/metrics. To set up Prometheus, we need to edit the file as shown below:
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'flink'
static_configs:
- targets: ['jobmanager:9250']
- job_name: 'kafka'
static_configs:
- targets: ['kafka:9101']
Similarly, we can visualize these metrics in Grafana by these steps:
- Login: http://localhost:3000 (default user: admin/admin)
- Add Prometheus as a data source
- Create dashboards:
- Kafka consumer lag
- Flink job latency and checkpoint time
- Data freshness or throughput by event type
To set up alerting, use these configs in the Grafana rules:
alert:
- alert: HighLag
expr: kafka_consumer_lag > 5000
for: 2m
labels:
severity: critical
annotations:
summary: "Kafka consumer lag too high"
Conclusion
As the industry accelerates toward real-time decisioning, DataOps is no longer a nice-to-have; it's becoming essential infrastructure. The fragility of streaming systems makes traditional batch-era practices obsolete, and teams need new ways to scale trust, resilience, and speed. By applying software engineering principles to the data layer, organizations can tame the chaos of streaming pipelines. This includes versioning schemas, automating deployments, validating logic before release, and building observability into every stage of the data pipeline.
The journey doesn't require a full overhaul. Teams can start small, with schema validation or simple CI pipelines, and evolve toward full automation and on-call readiness. Those who do will reduce downtime and unlock faster iteration, better data quality, and a system that's built to last.
This is an excerpt from DZone's 2025 Trend Report, Data Engineering: Scaling Intelligence With the Modern Data Stack.
Read the Free Report
Opinions expressed by DZone contributors are their own.
Comments