DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • You Learned AI. So Why Are You Still Not Getting Hired?
  • Stop Using the ATM-Didn’t-Kill-Jobs Story to Reassure Developers About AI
  • 6 Books That Changed How I Think About Software Engineering in 2026
  • Accelerating Your Software Engineering Career With Open Source and Jakarta EE

Trending

  • Context Is the New Schema
  • Improving Java Application Reliability with Dynatrace AI Engine
  • Dear Micromanager: Your Distrust Has a Job; It’s Just Not the One You’re Doing
  • AI Agents in Java: Architecting Intelligent Health Data Systems
  1. DZone
  2. Culture and Methodologies
  3. Career Development
  4. How We Predict Dataflow Job Duration Using ML and Observability Data

How We Predict Dataflow Job Duration Using ML and Observability Data

How Real-Time Observability Signals and Cloud-Native ML Models Improve Runtime Forecasting, Capacity Planning, and Operational Efficiency in Dataflow Pipelines

By 
Deepika Singh user avatar
Deepika Singh
·
Updated by 
Madhvesh Kumar user avatar
Madhvesh Kumar
·
Dec. 16, 25 · Analysis
Likes (0)
Comment
Save
Tweet
Share
1.1K Views

Join the DZone community and get the full member experience.

Join For Free

Efficiently managing large-scale data pipelines requires not only monitoring job performance but also anticipating how long jobs will run before they begin. This paper presents a practical, telemetry-driven approach for predicting the execution time of Google Cloud Dataflow jobs using machine learning. By combining Apache Airflow for workflow coordination, OpenTelemetry for collecting traces and resource metrics, and BigQuery ML for scalable model training, we develop an end-to-end system capable of generating reliable runtime estimates. 

The solution continuously ingests real-time observability data, performs feature engineering, updates predictive models, and surfaces insights that support capacity planning, scheduling, and early anomaly detection. Experimental results across multiple regression techniques show that observability-rich signals significantly improve prediction accuracy. This work demonstrates how integrating modern observability frameworks with machine learning can help teams reduce costs, avoid operational bottlenecks, and operate cloud-based data processing systems more efficiently.

Introduction

Predicting how long a Dataflow job will run is more than a convenience — it plays a major role in planning resources, controlling costs, and meeting service level agreements (SLAs). In large data platforms where hundreds of jobs run daily, even small miscalculations can affect downstream systems, particularly in environments where latency spikes and tail behavior can have outsized effects on performance. This paper presents a practical, engineering-focused approach to estimating Google Cloud Dataflow job durations using telemetry data collected through OpenTelemetry.

The system integrates Apache Airflow for workflow management, OpenTelemetry for distributed traces and performance metrics, and BigQuery ML for model training and inference. By combining detailed runtime traces, resource usage patterns, and job-specific characteristics, we train models that can estimate the expected duration of a Dataflow job before it starts.

The pipeline continuously ingests new telemetry data, generates features, retrains models as workloads evolve, and produces predictions that can be used for scheduling, capacity planning, proactive optimization, and early anomaly detection. The architecture demonstrates how observability data and machine-learning techniques can be woven together in a real production environment without introducing unnecessary operational overhead.

Through experiments with several regression techniques, we found that the observability-driven models deliver accurate predictions. This provides teams with a practical tool for preventing unexpected delays and improving the efficiency of large-scale data operations implemented on Google Cloud Dataflow.

Methodology

Our approach combines observability data with machine learning to predict Dataflow job durations. The methodology consists of three key steps:

Data Collection and Preprocessing

Telemetry spans from OpenTelemetry traces are captured and flattened using BigQuery’s UNNEST function. This ensures that features like start time, end time, and resource attributes are accessible for modeling.

Feature Engineering

Feature engineering transforms raw telemetry into meaningful predictors:

  • Duration: Difference between end_time and start_time.
  • Job complexity indicators: Number of tasks, pipeline depth, and parallelism level.
  • Resource metrics: CPU utilization, memory footprint, and I/O latency.
  • Temporal features: Day of week, hour of execution, and historical averages.

Model Training Pipeline

  1. Data extraction and flattening using BigQuery
  2. Feature selection and cleaning
  3. Train–test split (typically 80/20)
  4. Model training using BigQuery ML regression
  5. Prediction and evaluation

BigQuery ML: How It Works

BigQuery ML allows users to create and train machine-learning models directly in BigQuery using SQL syntax. It supports regression, classification, clustering, and time-series forecasting. For this use case, a linear regression model predicts job duration based on engineered features. Training and prediction occur within BigQuery, eliminating data movement and leveraging Google’s distributed infrastructure.

Advantages

  • No need for external ML infrastructure
  • Scales seamlessly with large datasets
  • Simple SQL-based interface

Limitations

  • Limited algorithm choices compared to dedicated ML frameworks
  • Feature engineering must be done in SQL, which can be complex for advanced transformations
  • Hyperparameter-tuning options are basic

Airflow and OpenTelemetry

Apache Airflow is an orchestration platform to programmatically author, schedule, and execute workflows. OpenTelemetry is used to generate, collect, and export telemetry data (metrics, logs, and traces) to help analyze software performance and behavior. OpenTelemetry traces provide insight into how the pipeline is executed in real time and how the various modules interact.

Monitoring Airflow Logs and Metrics

Monitoring Apache Airflow involves collecting:

  • Logs: Scheduler logs, worker logs, and DAG execution logs.
  • Metrics: Task execution times, DAG-run durations, queue latencies, and resource utilization.

To achieve this, Airflow is configured to send logs and metrics to an OpenTelemetry (OTel) collector, which can then be inserted into BigQuery through a Dataflow job.

Dataflow Architecture


OpenTelemetry (OTel) Collector

The OpenTelemetry Collector offers a vendor-agnostic implementation for receiving, processing, and exporting telemetry data. The collector gathers all Airflow metrics into a central location so the data can be sent to Pub/Sub and then stored in BigQuery using Dataflow.

Configure airflow.cfg for OpenTelemetry (OTel)

Edit the airflow.cfg file to enable OpenTelemetry for Airflow metrics:

Python
 
[metrics]
otel_on = True
otel_host = localhost
otel_port = 4318


Restart Airflow services after making these changes:

Shell
 
airflow db migrate
airflow scheduler -D
airflow webserver -D


Creating a Model on BigQuery Dataset

Since OTel spans contain a nested structure, it is important to UNNEST the data to perform operations in BigQuery. For a better model, build on a selective dataset relevant to predictions.

Sample OTel Span for Dataflow Metrics

JSON-LD
 
{
  "name": "checkout.process_payment",
  "context": {
    "trace_id": "5b8aa5a2d2c872e8321cf37308d69df2",
    "span_id": "a90b4d2c872e8321"
  },
  "parent_id": "051581bf3cb55c13",
  "kind": "CLIENT",
  "start_time": "2025-10-24T15:45:00.123456789Z",
  "end_time": "2025-10-24T15:45:00.345678901Z",
  "status": {
    "code": "STATUS_CODE_OK",
    "message": ""
  },
  "attributes": {
    "http.method": "POST",
    "http.url": "https://payment-gateway.com/charge",
    "payment.method": "Credit Card",
    "order.id": "ORDER-12345"
  },
  "events": [
    {
      "name": "payment_request_sent",
      "timestamp": "2025-10-24T15:45:00.150000000Z",
      "attributes": {}
    },
    {
      "name": "payment_gateway_response",
      "timestamp": "2025-10-24T15:45:00.300000000Z",
      "attributes": {
        "http.status_code": 200,
        "payment.success": true
      }
    }
  ]
}


Model Creation: Target as Duration

The target field for the model is the duration, calculated as the difference between end_time and start_time.

BigQuery ML Model Creation

SQL
 
CREATE OR REPLACE MODEL
‘your_project.your_dataset.purchase_predictor_model‘
OPTIONS (...) AS
SELECT
-- feature selection and preprocessing here
FROM
your_project.your_dataset.your_table;


Using the Model for Predictions

SQL
 
SELECT *
FROM ML.PREDICT(
MODEL ‘project_id.dataset.model_name‘,
< Select the new data rows you want to predict on. >
);


Select and Preprocess Features

Select and preprocess the features used in the model. In this scenario, we are building a prediction model for estimating durations of Dataflow jobs, so we construct the field that contains the total duration of existing Dataflow jobs.

Dataflow Job Cost Predictor Diagram

The figure below visualizes how historical telemetry informs a BigQuery ML model that predicts runtime and guides the decision to submit or defer a Dataflow job.

Dataflow Job Cost Predictor


career

Opinions expressed by DZone contributors are their own.

Related

  • You Learned AI. So Why Are You Still Not Getting Hired?
  • Stop Using the ATM-Didn’t-Kill-Jobs Story to Reassure Developers About AI
  • 6 Books That Changed How I Think About Software Engineering in 2026
  • Accelerating Your Software Engineering Career With Open Source and Jakarta EE

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook