DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Integrate VSCode With Databricks To Build and Run Data Engineering Pipelines and Models
  • What Nobody Tells You About Multimodal Data Pipelines for AI Training
  • Stop Poisoning Your Models: How I Built a CV Dataset Quality Toolkit I Can Reuse Forever
  • Architecting AI-Native Cloud Platforms: Signals to Insights to Actions

Trending

  • From Data Movement to Local Intelligence: The Shift from Centralized to Federated AI
  • Architecting Petabyte-Scale Hyperspectral Pipelines on AWS
  • DevOps Is Dead, Long Live Platform Engineering
  • How to Test a PATCH API Request With REST-Assured Java
  1. DZone
  2. Data Engineering
  3. AI/ML
  4. Creating an End-to-End ML Pipeline With Databricks and MLflow

Creating an End-to-End ML Pipeline With Databricks and MLflow

This tutorial shows how to build a complete ML pipeline on Databricks using Delta Lake for data management and MLflow for model tracking, registration, and deployment.

By 
harshraj bhoite user avatar
harshraj bhoite
·
Nov. 19, 25 · Tutorial
Likes (1)
Comment
Save
Tweet
Share
2.8K Views

Join the DZone community and get the full member experience.

Join For Free

Within data-centric organizations, creating an end-to-end machine learning (ML) Pipeline that is reproducible, scalable, and traceable is an essential component. The integrated ecosystem of Delta Lake, Auto Loader, and MLflow in Databricks allows organizations to simplify the ML lifecycle from unrefined data ingestion all the way to production deployment.

This tutorial provides a comprehensive guide on constructing an end-to-end ML pipeline on Databricks, utilizing MLflow for model tracking and the model registry, and leveraging Delta Lake for data management. We will demonstrate all the tasks in a unified workflow, including raw data ingestion, feature preparation, model training, and prediction serving.

Architecture Overview

Bronze → Silver → Gold → MLflow → Model Registry → Batch Scoring.

Our pipeline is structured on the classic Bronze–Silver–Gold framework.

  • Bronze: The unrefined data ingestion phase utilizing Auto Loader.
  • Silver: Cleaned data that has been deduplicated for trustworthy analytics.
  • Gold: Engineered features that are primed for model training and scoring.
  • ML: Model training and tracking with registration and batch inference.

1. Storing Data in the Bronze Layer

The first task is to collect the raw data in the most effective and stepwise manner possible. With respect to this task, Databricks Auto Loader is capable of streamlining the process as it automatically tracks new files in the designated cloud folder and loads them into Delta tables.

Python
 
catalog = "main"
schema  = "ml_demo"
raw_dir = "dbfs:/mnt/raw/churn/"
bronze  = f"{catalog}.{schema}.bronze_churn"

spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")

from pyspark.sql.types import *
schema_hint = StructType([
    StructField("customer_id", StringType()),
    StructField("tenure", IntegerType()),
    StructField("monthly_charges", DoubleType()),
    StructField("contract_type", StringType()),
    StructField("churn", IntegerType())
])


Upload files directly with the auto loader.

Python
 
spark.readStream
 .format("cloudFiles")
 .option("cloudFiles.format", "csv")
 .option("header", "true")
 .schema(schema_hint)
 .load(raw_dir)
 .writeStream
 .trigger(availableNow=True)
 .option("checkpointLocation", "dbfs:/chk/bronze_churn")
 .toTable(bronze)


2. Data Cleansing and Transformation (Silver Layer)

Upon landing in the Bronze layer, we progress to cleaning and standardizing the data directly in the Silver table.

Python
 
silver = f"{catalog}.{schema}.silver_churn"

df_bronze = spark.table(bronze)

from pyspark.sql.functions import col, trim
df_silver = (df_bronze
  .dropDuplicates(["customer_id"])
  .withColumn("contract_type", trim(col("contract_type")))
  .filter(col("tenure").isNotNull() & col("monthly_charges").isNotNull()))

df_silver.write.mode("overwrite").format("delta").saveAsTable(silver)


Data Quality Checks (Best Practice): Employ checks at this stage, including null ratio checks, schema validation, and duplication metrics.

3. Feature Engineering (Gold Layer)

In the Gold layer, we perform transformations on the cleaned data to create features that will be used in machine learning models.

Python
 
gold = f"{catalog}.{schema}.gold_churn_features"

from pyspark.sql.functions import when

df_gold = (spark.table(silver)
  .withColumn("is_long_tenure", (col("tenure") >= 12).cast("int"))
  .withColumn("contract_monthly", (col("contract_type")=="Monthly").cast("int"))
  .select("customer_id","tenure","monthly_charges","is_long_tenure","contract_monthly","churn"))

df_gold.write.mode("overwrite").format("delta").saveAsTable(gold)


The outcome is a Gold table that is ready to be used for training the model.

4. Training and Tracking of Models Using MLflow

We now shift to MLflow, where we will manage the models and track the experiments.

Python
 
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
import pandas as pd

mlflow.set_experiment("/Shared/churn_ml_experiment")

pdf = spark.table(gold).toPandas()
X = pdf.drop(columns=["customer_id","churn"])
y = pdf["churn"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 6, "random_state": 42}
    clf = RandomForestClassifier(**params)
    clf.fit(X_train, y_train)

    preds = clf.predict(X_test)
    probas = clf.predict_proba(X_test)[:,1]

    mlflow.log_params(params)
    mlflow.log_metrics({
        "f1": f1_score(y_test, preds),
        "roc_auc": roc_auc_score(y_test, probas)
    })
    mlflow.sklearn.log_model(clf, "model", registered_model_name="churn_rf_model")


The following are some of the things MLflow integrates:

  • Parameters (e.g., number of trees)
  • Metrics (e.g,. F1 score, ROC AUC)
  • Artifacts (model files, plots)

Once a model is performing optimally, we can start registering it for basic promotion through the lifecycle stages.

5. Register and Transition Models

Python
 
from mlflow.tracking import MlflowClient
client = MlflowClient()
name = "churn_rf_model"

latest = client.get_latest_versions(name, stages=["None"])[-1]
client.transition_model_version_stage(
    name=name,
    version=latest.version,
    stage="Staging"
)


Tip: Automate promotion with Databricks Jobs and custom approval workflows. 

6. Batch Inference for Scoring New Data

We can now load the model in order to estimate the probabilities of customer churn for new data.

This enables the storage of data in a predictions table, which can be used to create customer dashboards and retention systems.

Python
 
prod_uri = "models:/churn_rf_model/Production"
import mlflow.pyfunc
model = mlflow.pyfunc.load_model(prod_uri)

unscored = (spark.table(silver)
  .select("customer_id","tenure","monthly_charges","contract_type")
  .withColumn("is_long_tenure", (col("tenure") >= 12).cast("int"))
  .withColumn("contract_monthly", (col("contract_type")=="Monthly").cast("int"))
  .drop("contract_type"))

pdf_unscored = unscored.toPandas()
scores = model.predict_proba(pdf_unscored.drop(columns=["customer_id"]))[:,1]

import pandas as pd
out = pd.DataFrame({
    "customer_id": pdf_unscored["customer_id"],
    "churn_score": scores
})

spark.createDataFrame(out).write.mode("overwrite").saveAsTable(f"{catalog}.{schema}.predictions_churn")


7. Automate and Monitor

In this aspect, Databricks aims to simplify the productionization process through:

  • Jobs: Ingesting and training the model is achieved through chaining the three processes into a scheduled workflow.
  • Repos: Control the notebook versions using Git.
  • Delta Live Tables (DLT): Automate checks for data quality and lineage.
  • Model monitoring: Measure drift and latency and monitor accuracy using MLflow metrics.
  • Best practice: Schedule a nightly retraining job for each new model to benchmark production metrics.

8. Cost and Performance Optimization

  • Combine Auto Loader with incremental listing for more efficient ingestion.
  • Use OPTIMIZE with ZORDER to compact small files.
  • Tune ML workloads with Hyperopt or RandomizedSearchCV.
  • For non-critical jobs, consider using spot or preemptible clusters.
  • Cost-aware engineering makes certain your ML pipelines are scalable and sustainable.

Conclusion

The combination of Databricks Delta Lake and its reliable data management feature, along with MLflow for experiment tracking, enables seamless operationalization of ML pipelines.

In this tutorial, I showed how to build a reproducible end-to-end workflow — from ingestion to scoring — using the Bronze–Silver–Gold architecture. You can add more with the Feature Store, real-time scoring, or MLOps automation modules for continuous delivery.

In a nutshell, Databricks changes the format of ML pipelines from manual notebooks to governed, automated, and scalable systems.

Machine learning Data (computing) Pipeline (software)

Opinions expressed by DZone contributors are their own.

Related

  • Integrate VSCode With Databricks To Build and Run Data Engineering Pipelines and Models
  • What Nobody Tells You About Multimodal Data Pipelines for AI Training
  • Stop Poisoning Your Models: How I Built a CV Dataset Quality Toolkit I Can Reuse Forever
  • Architecting AI-Native Cloud Platforms: Signals to Insights to Actions

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook