Creating an End-to-End ML Pipeline With Databricks and MLflow

This tutorial shows how to build a complete ML pipeline on Databricks using Delta Lake for data management and MLflow for model tracking, registration, and deployment.

harshraj bhoite

Nov. 19, 25 · Tutorial


For data-centric organizations, an end-to-end machine learning (ML) pipeline that is reproducible, scalable, and traceable is essential. The integrated ecosystem of Delta Lake, Auto Loader, and MLflow in Databricks simplifies the ML lifecycle, from raw data ingestion all the way to production deployment.

This tutorial walks through building such a pipeline on Databricks, using Delta Lake for data management and MLflow for model tracking and the model registry. We cover every step in a single workflow: raw data ingestion, feature preparation, model training, and batch scoring.

Architecture Overview

Bronze → Silver → Gold → MLflow → Model Registry → Batch Scoring.

Our pipeline is structured on the classic Bronze–Silver–Gold framework.

Bronze: Raw data ingested with Auto Loader.
Silver: Cleaned, deduplicated data for trustworthy analytics.
Gold: Engineered features that are ready for model training and scoring.
ML: Model training, tracking, registration, and batch inference.
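
As a small illustration of how the layers map to tables, the sketch below builds fully qualified names in the `catalog.schema.table` form. The `main` catalog, `ml_demo` schema, and table names are the ones used in the rest of this tutorial; the `LAYER_TABLES` mapping and the `table_name` helper are hypothetical conveniences, not part of any Databricks API.

```python
# Hypothetical helper: one place to define the table behind each layer.
CATALOG = "main"
SCHEMA = "ml_demo"

LAYER_TABLES = {
    "bronze": "bronze_churn",
    "silver": "silver_churn",
    "gold": "gold_churn_features",
    "predictions": "predictions_churn",
}


def table_name(layer: str) -> str:
    """Return the fully qualified catalog.schema.table name for a layer."""
    if layer not in LAYER_TABLES:
        raise KeyError(f"Unknown layer: {layer!r}")
    return f"{CATALOG}.{SCHEMA}.{LAYER_TABLES[layer]}"


print(table_name("bronze"))
print(table_name("gold"))
```

Keeping the names in one mapping makes it harder for a later step to read from or write to the wrong layer by accident.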

1. Storing Data in the Bronze Layer

The first task is to ingest the raw data incrementally. Databricks Auto Loader streamlines this by automatically detecting new files in a designated cloud storage folder and loading them into Delta tables.

    Python

from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DoubleType
)

# Target catalog, schema, and raw landing directory
catalog = "main"
schema  = "ml_demo"
raw_dir = "dbfs:/mnt/raw/churn/"
bronze  = f"{catalog}.{schema}.bronze_churn"

spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")

# Explicit schema for the incoming CSV files
schema_hint = StructType([
    StructField("customer_id", StringType()),
    StructField("tenure", IntegerType()),
    StructField("monthly_charges", DoubleType()),
    StructField("contract_type", StringType()),
    StructField("churn", IntegerType())
])

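Ingest the files with Auto Loader.

    Python

# Incrementally load new CSV files from raw_dir into the Bronze Delta table
query = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .schema(schema_hint)
    .load(raw_dir)
    .writeStream
    .trigger(availableNow=True)  # process all available files, then stop
    .option("checkpointLocation", "dbfs:/chk/bronze_churn")
    .toTable(bronze))

# Wait for the load to finish before reading the Bronze table
query.awaitTermination()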

2. Data Cleansing and Transformation (Silver Layer)

Once the data has landed in the Bronze layer, we clean and standardize it and write the result to the Silver table.

    Python

from pyspark.sql.functions import col, trim

silver = f"{catalog}.{schema}.silver_churn"

df_bronze = spark.table(bronze)

# Deduplicate on customer_id, trim contract_type, and drop incomplete rows
df_silver = (df_bronze
  .dropDuplicates(["customer_id"])
  .withColumn("contract_type", trim(col("contract_type")))
  .filter(col("tenure").isNotNull() & col("monthly_charges").isNotNull()))

df_silver.write.mode("overwrite").format("delta").saveAsTable(silver)

Data Quality Checks (Best Practice): Run checks at this stage, such as null-ratio checks, schema validation, and duplicate-rate metrics.
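
The null-ratio and duplicate checks can be sketched in plain Python over a small sample of rows, for example dictionaries obtained from `Row.asDict()` on a collected sample. The function names and the thresholds below are hypothetical; on a full table the same ratios would normally be computed with Spark aggregations instead of on the driver.

```python
from collections import Counter


def quality_metrics(rows, key, required):
    """Compute per-column null ratios and the duplicate-key ratio."""
    total = len(rows)
    if total == 0:
        return {"row_count": 0,
                "null_ratio": {c: 0.0 for c in required},
                "duplicate_ratio": 0.0}
    null_ratio = {
        c: sum(1 for r in rows if r.get(c) is None) / total
        for c in required
    }
    key_counts = Counter(r.get(key) for r in rows)
    # Every occurrence of a key beyond the first counts as a duplicate
    duplicates = sum(n - 1 for n in key_counts.values() if n > 1)
    return {"row_count": total,
            "null_ratio": null_ratio,
            "duplicate_ratio": duplicates / total}


def failed_checks(metrics, max_null_ratio=0.05, max_duplicate_ratio=0.0):
    """Return the names of the checks that exceed the (assumed) thresholds."""
    failures = [f"null_ratio[{c}]"
                for c, v in metrics["null_ratio"].items()
                if v > max_null_ratio]
    if metrics["duplicate_ratio"] > max_duplicate_ratio:
        failures.append("duplicate_ratio")
    return failures


sample = [
    {"customer_id": "c1", "tenure": 5, "monthly_charges": 70.0},
    {"customer_id": "c2", "tenure": None, "monthly_charges": 55.5},
    {"customer_id": "c2", "tenure": 14, "monthly_charges": 55.5},
    {"customer_id": "c3", "tenure": 30, "monthly_charges": 20.0},
]
metrics = quality_metrics(sample, key="customer_id",
                          required=["tenure", "monthly_charges"])
print(metrics)
print(failed_checks(metrics))
```

A pipeline could fail the run, or only log a warning, when `failed_checks` returns a non-empty list; which of the two is appropriate depends on how strict the downstream consumers are.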

3. Feature Engineering (Gold Layer)

In the Gold layer, we perform transformations on the cleaned data to create features that will be used in machine learning models.

    Python

from pyspark.sql.functions import col

gold = f"{catalog}.{schema}.gold_churn_features"

# Derive binary features from tenure and contract type
df_gold = (spark.table(silver)
  .withColumn("is_long_tenure", (col("tenure") >= 12).cast("int"))
  .withColumn("contract_monthly", (col("contract_type") == "Monthly").cast("int"))
  .select("customer_id", "tenure", "monthly_charges", "is_long_tenure", "contract_monthly", "churn"))

df_gold.write.mode("overwrite").format("delta").saveAsTable(gold)

The outcome is a Gold table that is ready to be used for training the model.
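
To make the two derived features easy to check on boundary values, the sketch below mirrors the Gold-layer logic for a single record in plain Python. The `derive_features` function and the `LONG_TENURE_THRESHOLD` constant are hypothetical names introduced for illustration, and the sketch assumes non-null inputs (the Silver step already filters out null `tenure` values).

```python
LONG_TENURE_THRESHOLD = 12  # same cutoff as the Gold-layer code above


def derive_features(record: dict) -> dict:
    """Mirror the Gold-layer feature logic for a single customer record."""
    return {
        "customer_id": record["customer_id"],
        "tenure": record["tenure"],
        "monthly_charges": record["monthly_charges"],
        "is_long_tenure": int(record["tenure"] >= LONG_TENURE_THRESHOLD),
        "contract_monthly": int(record["contract_type"] == "Monthly"),
    }


at_cutoff = derive_features({"customer_id": "c1", "tenure": 12,
                             "monthly_charges": 70.0,
                             "contract_type": "Monthly"})
below_cutoff = derive_features({"customer_id": "c2", "tenure": 11,
                                "monthly_charges": 40.0,
                                "contract_type": "Yearly"})
print(at_cutoff)
print(below_cutoff)
```

Because the same two features are needed again at scoring time, having one tested definition of the rules helps keep training and inference consistent.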

4. Training and Tracking of Models Using MLflow

We now shift to MLflow, where we will manage the models and track the experiments.

    Python

import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score

mlflow.set_experiment("/Shared/churn_ml_experiment")

# Load the Gold features into pandas (assumes the table fits in driver memory)
pdf = spark.table(gold).toPandas()
X = pdf.drop(columns=["customer_id", "churn"])
y = pdf["churn"]

# Stratified split keeps the churn ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 6, "random_state": 42}
    clf = RandomForestClassifier(**params)
    clf.fit(X_train, y_train)

    preds = clf.predict(X_test)
    probas = clf.predict_proba(X_test)[:, 1]

    mlflow.log_params(params)
    mlflow.log_metrics({
        "f1": f1_score(y_test, preds),
        "roc_auc": roc_auc_score(y_test, probas)
    })
    # Log the model and register it as a new version of churn_rf_model
    mlflow.sklearn.log_model(clf, "model", registered_model_name="churn_rf_model")

MLflow records the following for each run:

Parameters (e.g., number of trees)
Metrics (e.g., F1 score, ROC AUC)
Artifacts (model files, plots)

Because `registered_model_name` is set, each run also registers a new model version. Once a version performs well enough, it can be promoted through the lifecycle stages.

5. Register and Transition Models

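    Python

from mlflow.tracking import MlflowClient

# Stage transitions apply to the Workspace Model Registry;
# models registered in Unity Catalog use aliases instead.
client = MlflowClient()
name = "churn_rf_model"

# Pick the newest version that has not been assigned a stage yet
latest = client.get_latest_versions(name, stages=["None"])[-1]
client.transition_model_version_stage(
    name=name,
    version=latest.version,
    stage="Staging"
)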

Tip: Automate promotion with Databricks Jobs and custom approval workflows.
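
One possible building block for such automation is a simple promotion gate that compares a candidate's metrics with those of the current production model. The sketch below is a hypothetical helper, not an MLflow or Databricks API: the `should_promote` name and the `min_gain` threshold are assumptions, while `roc_auc` is one of the metrics logged during training in this tutorial.

```python
def should_promote(candidate, production=None, metric="roc_auc", min_gain=0.01):
    """Decide whether a candidate model should replace the production model.

    candidate / production are dicts of metric name -> value, e.g. the
    metrics logged to MLflow for each run.
    """
    if metric not in candidate:
        raise KeyError(f"Candidate is missing metric {metric!r}")
    # With no production baseline, any candidate is eligible
    if production is None or metric not in production:
        return True
    return candidate[metric] - production[metric] >= min_gain


print(should_promote({"roc_auc": 0.85}, {"roc_auc": 0.80}))
print(should_promote({"roc_auc": 0.805}, {"roc_auc": 0.80}))
print(should_promote({"roc_auc": 0.70}))
```

A scheduled job could call a gate like this first and only request the stage transition, or a manual approval, when it returns `True`.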

6. Batch Inference for Scoring New Data

Once a model version has been promoted to Production, we can load it to estimate churn probabilities for new data.

The scores are written to a predictions table that can feed customer dashboards and retention systems.

    Python

import mlflow.sklearn
import pandas as pd
from pyspark.sql.functions import col

# Assumes a version has already been promoted to Production
# (the previous step only moves it to Staging).
prod_uri = "models:/churn_rf_model/Production"
# Load the native scikit-learn model; the generic pyfunc wrapper
# exposes predict() only, not predict_proba().
model = mlflow.sklearn.load_model(prod_uri)

# Rebuild the same features, in the same column order, as in training
unscored = (spark.table(silver)
  .select("customer_id", "tenure", "monthly_charges", "contract_type")
  .withColumn("is_long_tenure", (col("tenure") >= 12).cast("int"))
  .withColumn("contract_monthly", (col("contract_type") == "Monthly").cast("int"))
  .drop("contract_type"))

pdf_unscored = unscored.toPandas()
scores = model.predict_proba(pdf_unscored.drop(columns=["customer_id"]))[:, 1]

out = pd.DataFrame({
    "customer_id": pdf_unscored["customer_id"],
    "churn_score": scores
})

spark.createDataFrame(out).write.mode("overwrite").saveAsTable(f"{catalog}.{schema}.predictions_churn")

7. Automate and Monitor

Databricks simplifies productionization through:

Jobs: Chain ingestion, training, and scoring into a single scheduled workflow.
Repos: Version-control notebooks with Git.
Delta Live Tables (DLT): Automate checks for data quality and lineage.
Model monitoring: Track drift, latency, and accuracy using MLflow metrics.
Best practice: Schedule a nightly retraining job and benchmark each new model against production metrics.
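
As one way to quantify the drift mentioned above, the sketch below computes the Population Stability Index (PSI) between a baseline sample and a current sample of a numeric feature or score. PSI is a technique chosen here for illustration; the tutorial does not prescribe a specific drift measure, and the function name, bin count, and `eps` floor are assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("Baseline sample has no spread")
    width = (hi - lo) / bins

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp outliers into edge bins
            counts[idx] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]

print(psi(baseline, baseline))
print(psi(baseline, shifted))
```

Identical samples give a PSI of zero, and the value grows as the two distributions diverge. The resulting number could be logged as an MLflow metric on each scheduled run so that drift is tracked alongside accuracy.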

8. Cost and Performance Optimization

Combine Auto Loader with incremental listing for more efficient ingestion.
Use OPTIMIZE to compact small files, with ZORDER to co-locate related data.
Tune ML workloads with Hyperopt or RandomizedSearchCV.
For non-critical jobs, consider using spot or preemptible clusters.
Cost-aware engineering helps keep your ML pipelines scalable and sustainable.
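
For the compaction tip, the sketch below builds the corresponding Delta SQL statement as a string. The `optimize_statement` helper is hypothetical, and using `customer_id` as the Z-order column is an assumption; good Z-order columns are the ones most often used in query filters.

```python
def optimize_statement(table, zorder_cols=()):
    """Build a Delta OPTIMIZE statement, optionally with a ZORDER BY clause."""
    stmt = f"OPTIMIZE {table}"
    if zorder_cols:
        stmt += f" ZORDER BY ({', '.join(zorder_cols)})"
    return stmt


stmt = optimize_statement("main.ml_demo.silver_churn", ["customer_id"])
print(stmt)
# On Databricks, the statement would then be run with: spark.sql(stmt)
```

Running the statement from a scheduled maintenance job, not after every write, is usually enough for tables that are overwritten in batch.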

Conclusion

Delta Lake's reliable data management, combined with MLflow's experiment tracking and model registry, makes it straightforward to operationalize ML pipelines on Databricks.

In this tutorial, I showed how to build a reproducible end-to-end workflow, from ingestion to scoring, using the Bronze–Silver–Gold architecture. You can extend it with the Feature Store, real-time scoring, or MLOps automation modules for continuous delivery.

In a nutshell, Databricks turns ML pipelines from manual notebooks into governed, automated, and scalable systems.
