How We Predict Dataflow Job Duration Using ML and Observability Data

How Real-Time Observability Signals and Cloud-Native ML Models Improve Runtime Forecasting, Capacity Planning, and Operational Efficiency in Dataflow Pipelines

Deepika Singh

Updated by

Madhvesh Kumar

Dec. 16, 25 · Analysis

Likes (0)

Comment

Save

1.3K Views

Efficiently managing large-scale data pipelines requires not only monitoring job performance but also anticipating how long jobs will run before they begin. This paper presents a practical, telemetry-driven approach for predicting the execution time of Google Cloud Dataflow jobs using machine learning. By combining Apache Airflow for workflow coordination, OpenTelemetry for collecting traces and resource metrics, and BigQuery ML for scalable model training, we develop an end-to-end system capable of generating reliable runtime estimates.

The solution continuously ingests real-time observability data, performs feature engineering, updates predictive models, and surfaces insights that support capacity planning, scheduling, and early anomaly detection. Experimental results across multiple regression techniques show that observability-rich signals significantly improve prediction accuracy. This work demonstrates how integrating modern observability frameworks with machine learning can help teams reduce costs, avoid operational bottlenecks, and operate cloud-based data processing systems more efficiently.

Introduction

Predicting how long a Dataflow job will run is more than a convenience — it plays a major role in planning resources, controlling costs, and meeting service level agreements (SLAs). In large data platforms where hundreds of jobs run daily, even small miscalculations can affect downstream systems, particularly in environments where latency spikes and tail behavior can have outsized effects on performance. This paper presents a practical, engineering-focused approach to estimating Google Cloud Dataflow job durations using telemetry data collected through OpenTelemetry.

The system integrates Apache Airflow for workflow management, OpenTelemetry for distributed traces and performance metrics, and BigQuery ML for model training and inference. By combining detailed runtime traces, resource usage patterns, and job-specific characteristics, we train models that can estimate the expected duration of a Dataflow job before it starts.

The pipeline continuously ingests new telemetry data, generates features, retrains models as workloads evolve, and produces predictions that can be used for scheduling, capacity planning, proactive optimization, and early anomaly detection. The architecture demonstrates how observability data and machine-learning techniques can be woven together in a real production environment without introducing unnecessary operational overhead.

Through experiments with several regression techniques, we found that the observability-driven models deliver accurate predictions. This provides teams with a practical tool for preventing unexpected delays and improving the efficiency of large-scale data operations implemented on Google Cloud Dataflow.

Methodology

Our approach combines observability data with machine learning to predict Dataflow job durations. The methodology consists of three key steps:

Data Collection and Preprocessing

Telemetry spans from OpenTelemetry traces are captured and flattened using BigQuery’s UNNEST function. This ensures that features like start time, end time, and resource attributes are accessible for modeling.

Feature Engineering

Feature engineering transforms raw telemetry into meaningful predictors:

Duration: Difference between end_time and start_time.
Job complexity indicators: Number of tasks, pipeline depth, and parallelism level.
Resource metrics: CPU utilization, memory footprint, and I/O latency.
Temporal features: Day of week, hour of execution, and historical averages.

Model Training Pipeline

Data extraction and flattening using BigQuery
Feature selection and cleaning
Train–test split (typically 80/20)
Model training using BigQuery ML regression
Prediction and evaluation

BigQuery ML: How It Works

BigQuery ML allows users to create and train machine-learning models directly in BigQuery using SQL syntax. It supports regression, classification, clustering, and time-series forecasting. For this use case, a linear regression model predicts job duration based on engineered features. Training and prediction occur within BigQuery, eliminating data movement and leveraging Google’s distributed infrastructure.

Advantages

No need for external ML infrastructure
Scales seamlessly with large datasets
Simple SQL-based interface

Limitations

Limited algorithm choices compared to dedicated ML frameworks
Feature engineering must be done in SQL, which can be complex for advanced transformations
Hyperparameter-tuning options are basic

Airflow and OpenTelemetry

Apache Airflow is an orchestration platform to programmatically author, schedule, and execute workflows. OpenTelemetry is used to generate, collect, and export telemetry data (metrics, logs, and traces) to help analyze software performance and behavior. OpenTelemetry traces provide insight into how the pipeline is executed in real time and how the various modules interact.

Monitoring Airflow Logs and Metrics

Monitoring Apache Airflow involves collecting:

Logs: Scheduler logs, worker logs, and DAG execution logs.
Metrics: Task execution times, DAG-run durations, queue latencies, and resource utilization.

To achieve this, Airflow is configured to send logs and metrics to an OpenTelemetry (OTel) collector, which can then be inserted into BigQuery through a Dataflow job.

OpenTelemetry (OTel) Collector

The OpenTelemetry Collector offers a vendor-agnostic implementation for receiving, processing, and exporting telemetry data. The collector gathers all Airflow metrics into a central location so the data can be sent to Pub/Sub and then stored in BigQuery using Dataflow.

Configure airflow.cfg for OpenTelemetry (OTel)

Edit the airflow.cfg file to enable OpenTelemetry for Airflow metrics:

    Python
   
   [metrics]
otel_on = True
otel_host = localhost
otel_port = 4318

Restart Airflow services after making these changes:

    Shell
   
   airflow db migrate
airflow scheduler -D
airflow webserver -D

Creating a Model on BigQuery Dataset

Since OTel spans contain a nested structure, it is important to UNNEST the data to perform operations in BigQuery. For a better model, build on a selective dataset relevant to predictions.

Sample OTel Span for Dataflow Metrics

    JSON-LD
   
 

   {
  "name": "checkout.process_payment",
  "context": {
    "trace_id": "5b8aa5a2d2c872e8321cf37308d69df2",
    "span_id": "a90b4d2c872e8321"
  },
  "parent_id": "051581bf3cb55c13",
  "kind": "CLIENT",
  "start_time": "2025-10-24T15:45:00.123456789Z",
  "end_time": "2025-10-24T15:45:00.345678901Z",
  "status": {
    "code": "STATUS_CODE_OK",
    "message": ""
  },
  "attributes": {
    "http.method": "POST",
    "http.url": "https://payment-gateway.com/charge",
    "payment.method": "Credit Card",
    "order.id": "ORDER-12345"
  },
  "events": [
    {
      "name": "payment_request_sent",
      "timestamp": "2025-10-24T15:45:00.150000000Z",
      "attributes": {}
    },
    {
      "name": "payment_gateway_response",
      "timestamp": "2025-10-24T15:45:00.300000000Z",
      "attributes": {
        "http.status_code": 200,
        "payment.success": true
      }
    }
  ]
}
  

Model Creation: Target as Duration

The target field for the model is the duration, calculated as the difference between end_time and start_time.

BigQuery ML Model Creation

    SQL
   
 

   CREATE OR REPLACE MODEL
‘your_project.your_dataset.purchase_predictor_model‘
OPTIONS (...) AS
SELECT
-- feature selection and preprocessing here
FROM
your_project.your_dataset.your_table;
  

Using the Model for Predictions

    SQL
   
 

   SELECT *
FROM ML.PREDICT(
MODEL ‘project_id.dataset.model_name‘,
< Select the new data rows you want to predict on. >
);
  

Select and Preprocess Features

Select and preprocess the features used in the model. In this scenario, we are building a prediction model for estimating durations of Dataflow jobs, so we construct the field that contains the total duration of existing Dataflow jobs.

Dataflow Job Cost Predictor Diagram

The figure below visualizes how historical telemetry informs a BigQuery ML model that predicts runtime and guides the decision to submit or defer a Dataflow job.

career

Opinions expressed by DZone contributors are their own.

Related

Trending