Scaling Real-Time Data Systems With DataOps: Principles, Practices, and Use Cases

Learn how DataOps principles like CI/CD, schema versioning, and observability improve reliability in real-time streaming data pipelines.

Tulika Bhatt

Aug. 27, 25 · Analysis

Likes (2)

Comment

Save

1.5K Views

Editor's Note: The following is an article written for and published in DZone's 2025 Trend Report, Data Engineering: Scaling Intelligence With the Modern Data Stack.

Real-time decision making is no longer a competitive advantage; it's becoming a baseline expectation. From fraud detection to personalized recommendations, modern systems are expected to process and respond to user activity within milliseconds. But while demand for real-time data has exploded, many engineering teams are still struggling with brittle pipelines, silent failures, and fragile deployments.

In this article, we explore how DataOps brings much-needed discipline to real-time architectures. We'll dive into principles like CI/CD, schema versioning, observability, and environment parity, and walk through a fully open-source clickstream pipeline that puts these ideas into action.

DataOps Principles for Streaming Architectures

One of the foundational principles of DataOps is treating data as a product. This means going beyond the one-off script or ad hoc transformation logic to view data pipelines as long-lived assets. A real-time data pipeline, much like an API or service, should have version control, proper documentation, automated tests, and clearly defined contracts with its consumers. This enables pipelines to evolve safely over time, much like APIs.

Another fundamental pillar of DataOps is enabling the continuous delivery of both data and metadata. While batch workflows often align deployment cycles with scheduled reports or model refreshes, real-time systems demand a different approach. Because they operate in an ongoing, always-on mode, the supporting delivery mechanisms must also be continuous, automated, and resilient. In this setting, CI/CD pipelines take on expanded responsibilities. They not only deploy new code but also track schema changes, propagate metadata like lineage and freshness, and ensure consistent configuration across development, staging, and production environments. Whether it's a transformation update, a configuration change, or a schema evolution, every modification must be treated with the same engineering discipline and testing rigor as a production software release.

Reproducibility and environment parity are equally vital in streaming architectures. The cost of "it works on my machine" can be disastrous when data is flowing in real time. Consistent configuration across environments helps prevent subtle bugs that emerge only in production. This requires a strong discipline of managing infrastructure and pipeline definitions as code.

Together, these principles form the backbone of DataOps for real-time systems. But to put them into practice, we need supporting strategies around schema management, orchestration, and monitoring. In the next few sections, we'll explore each of these dimensions in detail and learn about ways we could incorporate them in our systems.

Managing Schemas and Sources in Streaming Architectures

One of the most important aspects of building real-time pipelines is schema management. Before you can monitor or deploy pipelines safely, you need to manage the structure of your data. Schemas should be version controlled and validated before being promoted to production. For instance, using Apache Avro or Protobuf, you can define a record structure that producers and consumers agree on. These schemas are stored in a registry and validated during pull requests, ensuring backward and forward compatibility.

     JSON
    
 

    //Avro schema for a sample 'ClickEvent'

{
  "type": "record",
  "name": "ClickEvent",
  "namespace": "com.example.analytics",
  "fields": [
    { "name": "user_id", "type": "int" },
    { "name": "event_type", "type": "string" },
    { "name": "timestamp", "type": "long" }
  ]
}
   

In addition to managing schemas, teams should apply the same rigor to the logic and configuration that power their pipelines. Streaming SQL, transformation logic, and configuration files should all live in Git alongside test harnesses and example inputs. A modular repo structure where business logic, test data, and environment configs are separated makes it easier to review changes and trace errors when incidents arise. This way, teams can build pipelines that are testable and reproducible by treating both schemas and code as source-controlled assets.

    Plain Text
   
 

   // A modular project structure for storing configs, code and tests

clickstream-pipeline/
├── README.md
├── ci/
│   └── validate-schema.sh             # CI scripts for schema compatibility, linting
├── config/
│   ├── dev/
│   │   └── flink-config.yaml          # Dev-specific job config
│   ├── staging/
│   │   └── flink-config.yaml
│   └── prod/
│       └── flink-config.yaml          # Production-specific parameters (e.g., Kafka topics)
├── jobs/
│   └── clickstream-sessionizer/
│       ├── src/
│       │   └── MainFlinkJob.java      # Main Flink job logic
│       ├── resources/
│       │   └── application.conf       # Job-level config
│       └── build.gradle               # Build file for the job
├── schemas/
│   ├── v1/
│   │   └── click_event.avsc           # Initial schema version
│   └── v2/
│       └── click_event.avsc           # Updated schema version
├── test/
│   ├── integration/
│   │   ├── sample-click-events.json   # Test data for end-to-end validation
│   │   └── expected-output.json
│   └── unit/
│       └── ClickstreamJobTest.java    # Unit tests for transformation logic
├── docker/
│   └── Dockerfile                     # Container definition for deployment
└── .github/
    └── workflows/
        └── ci.yaml                    # GitHub Actions workflow for CI/CD
  

Automating Deployments and Validating Pipeline Changes

Schema management lays the foundation, but safe deployments require robust orchestration and automation. Streaming pipelines should be deployed declaratively, using versioned job definitions and Infrastructure-as-Code principles. For example, rather than manually submitting jobs or modifying runtime parameters, teams can define their Flink or Airflow jobs as code, store them in Git, and apply changes via pull requests.

    Plain Text
   
 

   name: CI/CD for Clickstream Pipeline

on: [push]

jobs:
  build-test-deploy:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v3

    - name: Build JAR
      run: ./gradlew shadowJar

    - name: Validate Schema
      run: ./ci/validate-schema.sh

    - name: Run Integration Tests
      run: ./test/run-integration-tests.sh

    - name: Deploy to Staging
      run: ./scripts/deploy.sh staging

    - name: Canary Deploy to Prod
      run: ./scripts/canary-deploy.sh
  

This kind of automation removes the human error from critical deployments, encourages frequent iteration, and creates a paper trail for audits or incident response. It also supports true environment parity, where staging and production differ only by templated variables and not code.

Observability and On-Call Practices for Streaming Pipelines

Even with robust deployments, real-time systems will eventually fail. Metrics and observability are what turn these failures into solvable problems rather than open-ended mysteries. Every layer of a streaming pipeline emits important operational signals that help engineers monitor health, performance, and failures. The table below highlights key metrics from the most common components:

Pipeline layer	key operational signals
Apache Kafka	Consumer lag, message throughput, partition skew
Apache Flink	Checkpoint duration, processing latency, error rates
Apache Airflow	DAG success/failure rates, task duration, scheduling delays

Table 1. Key metrics for most common components

These metrics should be exposed through tools like Prometheus and visualized with dashboards (e.g., Grafana). But the real value comes from creating business-aware alerts. Instead of paging on CPU spikes, alert when event throughput drops unexpectedly or when schema violations spike.

Validation checks can also be embedded directly in job logic. A Flink job, for example, might verify that a user_id field is never null or that a specific distribution of event types remains within an expected range. These checks help surface data quality issues before they reach consumers. On-call readiness also means having clear runbooks. When an alert fires, responders should know: what changed, who owns it, and what the rollback plan is. Post-incident reviews help refine these processes over time, improving resilience across the board.

Use Case: Real-Time Clickstream Aggregation

To see DataOps in action, let's walk through a simplified clickstream pipeline. We'll use real data, simulate real-time ingestion, and process events using streaming tools, all while applying versioning, testing, and observability best practices.

Setting Up the Infrastructure Stack

Before actually writing code, we need to set up the infrastructure stack for our DataOps-enabled project. This involves installing Kafka and Apache Zookeeper for data ingestion, Flink for stream processing, Apicurio Schema Registry for schema management, and Prometheus and Grafana for observability. We can use this sample file for installation:

    YAML
   
 

   # Docker Compose setup for Kafka, Flink, Apicurio, Prometheus, Grafana

version: '3.8'

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.0.1
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    ports:
      - "2181:2181"

  kafka:
    image: confluentinc/cp-kafka:7.0.1
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  apicurio:
    image: apicurio/apicurio-registry-mem:2.3.2.Final
    ports:
      - "8081:8080"
    environment:
      QUARKUS_PROFILE: prod
      REGISTRY_DATASOURCE_URL: jdbc:h2:mem:registry
      REGISTRY_UI_CONFIG_URL: http://localhost:8081

  jobmanager:
    image: flink:1.17.1
    ports:
      - "8082:8081"
    command: jobmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager

  taskmanager:
    image: flink:1.17.1
    depends_on:
      - jobmanager
    command: taskmanager
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager

  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
  

Ingesting Events

The next phase focuses on data ingestion. We start by exploring the dataset and defining a schema that represents the event structure accurately. This schema is registered to enforce compatibility and enable downstream validation. Real-time events are then produced into Apache Kafka, and Apache Flink is used to ingest and process the event stream as it flows through the system.

Registering Schema

Once we have performed the above, the next step is to explore the dataset, finalize the schema, and register it. We will be using the E-Commerce Behavior Data From Multi-Category Store for our walkthrough, which includes user events such as page views and purchases with timestamps. Registering schema in a registry is essential as it validates the incoming events and enforces backward compatibility across versions. This ensures producers don't accidentally introduce breaking changes that affect downstream jobs. You can use the CLI or cURL to register your Avro schema:

    Shell
   
 

   cat <<EOF > click_event.avsc
{
  "type": "record",
  "name": "ClickEvent",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "event_type", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
EOF

curl -X POST http://localhost:8081/apis/registry/v2/groups/default/artifacts -H "Content-Type: application/json" -H "X-Registry-ArtifactId: click-event" --data-binary @schemas/click_event.avsc
  

Produce to Kafka (Simulated Real Time)

Next, we will simulate real-time events by reading records of the e-commerce behavior dataset, pausing for 50ms between each record, and sending it to Kafka:

    Python
   
 

   import time
import json
import pandas as pd
from fastavro import schemaless_writer
from io import BytesIO
from confluent_kafka import Producer

# Load schema
with open("schemas/click_event.avsc", "r") as f:
    schema = json.load(f)

# Set up Kafka producer
producer = Producer({'bootstrap.servers': 'localhost:9092'})

# Load the CSV (you can limit for testing)
df = pd.read_csv("test/data/2019-Nov.csv", nrows=10000)

# Filter and rename
df = df[["user_id", "event_type", "event_time"]]
df = df.dropna()

# Convert datetime to epoch ms
df["event_time"] = pd.to_datetime(df["event_time"], utc=True)
df["event_time"] = df["event_time"].astype(int) // 10**6  # to milliseconds

# Publish to Kafka
for _, row in df.iterrows():
    event = {
        "user_id": int(row['user_id']),
        "event_type": str(row['event_type']),
        "timestamp": int(row['event_time'])
    }

    buffer = BytesIO()
    schemaless_writer(buffer, schema, event)

    producer.produce("clickstream", value=buffer.getvalue())
    print(f"Sent: {event}")
    time.sleep(0.05)  # simulate real-time

producer.flush()
  

Stream Processing in Real Time

We'll use a Flink job to handle these events as they arrive, grouping them by user session, tracking user actions through conversion funnels, and counting event types within tumbling time windows, by using either SQL or the DataStream API.

     SQL
    
 

    -- Define source table using Avro and schema registry
CREATE TABLE clickstream (
  user_id BIGINT,
  event_type STRING,
  timestamp TIMESTAMP(3),
  WATERMARK FOR timestamp AS timestamp - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'clickstream',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'avro-confluent',
  'avro-confluent.schema-registry.url' = 'http://apicurio:8080/apis/registry/v2',
  'avro-confluent.subject' = 'click-event'
);

-- Define sink table (printing for now)
CREATE TABLE event_summary (
  event_type STRING,
  event_count BIGINT,
  window_start TIMESTAMP(3),
  window_end TIMESTAMP(3)
) WITH (
  'connector' = 'print'
);

-- Aggregate click events by type and time window
INSERT INTO event_summary
SELECT
  event_type,
  COUNT(*) AS event_count,
  TUMBLE_START(timestamp, INTERVAL '10' MINUTE),
  TUMBLE_END(timestamp, INTERVAL '10' MINUTE)
FROM clickstream
GROUP BY
  TUMBLE(timestamp, INTERVAL '10' MINUTE),
  event_type;
   

CI/CD Validation

The job's logic, configuration, and test cases are version controlled in Git. When a change is proposed, continuous integration runs schema validations and unit tests before deploying to a staging environment. After successful verification, a canary deployment gradually rolls out the update to production.

      YAML
     
 

     name: CI for Clickstream Pipeline
on: [push]
jobs:
  build-test-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Build pipeline logic
        run: ./gradlew shadowJar

      - name: Validate Avro Schema
        run: ./ci/validate-schema.sh

      - name: Run integration tests
        run: ./test/integration/run-tests.sh

      - name: Deploy to staging
        run: ./deploy/deploy.sh staging

      - name: Canary deploy to prod
        run: ./deploy/deploy.sh production --canary
    

The following table lists different configs for different environments.

config category	what to customize	examples
Kafka topics	Use separate topics per env to isolate data	clickstream-dev, clickstream-staging, clickstream-prod
Schema registry	Point to isolated schema registry or use group namespace	http://localhost:8081/groups/dev
Checkpoint storage	Use different file system paths	/tmp/flink/dev-checkpoints/, /mnt/prod-checkpoints/
State backends	Enable RocksDB in prod, memory in dev	rocksdb vs heap
Parallelism	Lower parallelism for dev/test	parallelism.default: 1 (dev), 8 (prod)
Job timeouts	Use shorter timeouts for rapid iteration in dev	execution.checkpointing.interval: 2m (dev) vs 10m (prod)
Logging levels	Verbose logging in dev, warnings only in prod	INFO (dev), WARN or ERROR (prod)
Dead letter queues	Use isolated DLQ Kafka topics	dlq-dev, dlq-prod
Validation flags	Enable strict checks in staging/prod	validate.schema: true
Alert thresholds	More tolerant in dev	Alert on lag > 10k (dev) vs > 1k (prod)

Table 2. Configs for various environments

Observing the Pipeline

Flink exports metrics like processing lag and checkpoint duration to Prometheus. Dashboards in Grafana visualize these metrics alongside custom alerts for event drops or schema anomalies. For example, we can update the config file with following metrics and restart the flink job with these configs to export metrics to Prometheus:

    YAML
   
   metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9250

Prometheus will scrape metrics at: http://flink-jobmanager:9250/metrics. To set up Prometheus, we need to edit the file as shown below:

    YAML
   
 

   global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'flink'
    static_configs:
      - targets: ['jobmanager:9250']

  - job_name: 'kafka'
    static_configs:
      - targets: ['kafka:9101']
  

Similarly, we can visualize these metrics in Grafana by these steps:

Login: http://localhost:3000 (default user: admin/admin)
Add Prometheus as a data source
Create dashboards:
- Kafka consumer lag
- Flink job latency and checkpoint time
- Data freshness or throughput by event type

To set up alerting, use these configs in the Grafana rules:

    YAML
   
 

   alert:
  - alert: HighLag
    expr: kafka_consumer_lag > 5000
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Kafka consumer lag too high"
  

Conclusion

As the industry accelerates toward real-time decisioning, DataOps is no longer a nice-to-have; it's becoming essential infrastructure. The fragility of streaming systems makes traditional batch-era practices obsolete, and teams need new ways to scale trust, resilience, and speed. By applying software engineering principles to the data layer, organizations can tame the chaos of streaming pipelines. This includes versioning schemas, automating deployments, validating logic before release, and building observability into every stage of the data pipeline.

The journey doesn't require a full overhaul. Teams can start small, with schema validation or simple CI pipelines, and evolve toward full automation and on-call readiness. Those who do will reduce downtime and unlock faster iteration, better data quality, and a system that's built to last.

This is an excerpt from DZone's 2025 Trend Report, Data Engineering: Scaling Intelligence With the Modern Data Stack.

Read the Free Report

Data (computing) Schema systems

Opinions expressed by DZone contributors are their own.

Related

Trending