From DLT to Lakeflow Declarative Pipelines: A Practical Migration Playbook

Migrating from DLT to Lakeflow is mostly an API refactor: swapping the dlt module for pyspark.pipelines, separating streaming tables from materialized views, and updating CDC logic.

Mar. 19, 26 · Analysis

Delta Live Tables (DLT) has been a game-changer for building ETL pipelines on Databricks, providing a declarative framework that automates orchestration, infrastructure management, monitoring, and data quality in data pipelines. By simply defining how data should flow and be transformed, DLT allowed data engineers to focus on business logic rather than scheduling and dependency management. Databricks expanded and rebranded this capability under the broader Lakeflow initiative. The product formerly known as DLT is now Lakeflow Spark Declarative Pipelines (SDP), essentially the next evolution of DLT with additional features and alignment to open-source Spark.

Existing DLT pipelines are largely compatible with Lakeflow; your code will still run on the new platform without immediate changes. However, to fully leverage Lakeflow’s capabilities and future-proof your pipeline, it’s recommended that you update your code to the new API. This playbook provides a practical, engineer-focused guide to migrating from DLT to Lakeflow declarative pipelines, with side-by-side code examples, tips, and coverage of edge cases. We’ll focus on the migration logic, the code changes, and pipeline definition adjustments rather than tooling or deployment, assuming you’re using Databricks with Spark/Delta Lake as before.

Recap: What Is Delta Live Tables (DLT)?

Databricks Delta Live Tables (DLT) is a declarative ETL framework for building scalable, reliable data pipelines on Delta Lake. With DLT, engineers define a series of datasets and their transformation logic in Python or SQL, and the system handles the execution order, dependency resolution, and incremental processing automatically.

Key features of DLT included support for streaming tables, materialized views, and views. DLT pipelines also integrate data quality enforcement via expectations, allowing you to declare constraints that the pipeline can enforce or use to quarantine bad data. In short, DLT lets you focus on what transformations to do, not how to schedule or scale them, bringing a declarative approach to data engineering similar to how Kubernetes brings declarative management to infrastructure.

Meet Lakeflow Declarative Pipelines (The Evolution of DLT)

Lakeflow Spark Declarative Pipelines (SDP) is essentially DLT 2.0, a unified, declarative framework for batch and streaming ETL that Databricks introduced under the Lakeflow umbrella. Lakeflow pipelines build on the lessons of DLT and align with the open-source Spark API for declarative pipelines (introduced in Apache Spark 4.1).

In practice, Lakeflow’s pipeline API is almost identical to DLT’s, but with new naming and some expanded capabilities. Notably, at the 2025 Data and AI Summit, Databricks announced that it was contributing the core declarative pipeline engine to Apache Spark as open source. This means your pipeline code can, in principle, run on standard Spark, reducing vendor lock-in while still offering Databricks-specific enhancements. Lakeflow also made flows, the units of processing that write into a target table, an explicit concept in pipelines.

For a data engineer, the migration from DLT to Lakeflow is mostly a find-and-replace refactor plus adopting a few new best practices. The following sections will walk through the key changes with code examples. We’ll start with the simplest updates, then tackle specific features like expectations and change data capture.

Migration Steps and Code Changes

1. Update Imports and Module References

In DLT, you typically started your notebook with import dlt. In Lakeflow, the pipeline functions are accessed via the Spark pipelines module. Replace the DLT import with:

    Python
   
   from pyspark import pipelines as dp

This import gives us a dp object analogous to the old dlt. Consequently, all references to dlt in your code should be replaced with dp. This includes decorator annotations and any function calls. For example:

@dlt.table becomes @dp.table
@dlt.view becomes @dp.temporary_view
dlt.read("some_table") becomes dp.read("some_table") (for new code, Databricks recommends spark.read.table("some_table") instead)

According to Databricks, the dlt module has been superseded by pyspark.pipelines and while legacy code will still run, it’s recommended to use the new module going forward. The name changes are designed to be straightforward. In fact, you can often do a simple search-and-replace on your notebooks to swap dlt for dp and add the new import.
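The search-and-replace described above can be scripted. What follows is a minimal sketch rather than an official migration tool: the function name migrate_source and the rename table are assumptions made for illustration, and it covers only the renames mentioned in this article. It works on source text, so review the result by hand, especially each @dp.table that may need to become a materialized view.

```python
import re

# Ordered rename rules (pattern, replacement). Specific rules run first so the
# generic "dlt." -> "dp." rule does not pre-empt them. This table is an
# assumption for illustration and covers only renames named in this article.
RENAMES = [
    (re.compile(r"^import dlt[ \t]*$", re.MULTILINE), "from pyspark import pipelines as dp"),
    (re.compile(r"\bdlt\.view\b"), "dp.temporary_view"),
    (re.compile(r"\bdlt\.apply_changes\b"), "dp.create_auto_cdc_flow"),
    (re.compile(r"\bdlt\."), "dp."),
]


def migrate_source(text: str) -> str:
    """Rewrite DLT references in pipeline source text to the dp equivalents."""
    for pattern, replacement in RENAMES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    legacy = 'import dlt\n\n@dlt.view\ndef aggregated():\n    return dlt.read("raw_data")\n'
    print(migrate_source(legacy))
```

The word-boundary anchors keep the rules from touching unrelated identifiers that merely contain "dlt", such as a variable named my_dlt.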

2. Table Decorators: Distinguishing Streaming Tables vs. Materialized Views

One notable API improvement in Lakeflow is making streaming tables vs. batch tables explicit. Under DLT, you would declare all persistent tables with @dlt.table, regardless of whether they were fed by streaming sources or batch data. DLT internally figured out which tables should be streaming versus which were materialized (refreshed on each pipeline run) based on how you read the data. In Lakeflow, the syntax is more expressive:

Use @dp.table to define a streaming table.
Use @dp.materialized_view to define a materialized view.
Ephemeral in-memory views for intermediate transformations are declared with @dp.temporary_view. Temporary views are not persisted to the metastore and exist only within the pipeline’s processing graph, just as dlt.view worked previously.

Migration tip: Review each @dlt.table in your code to decide whether it should be a streaming table or a materialized view in the Lakeflow world. As a rule of thumb, if the function reads from a streaming source (for example, using spark.readStream or Auto Loader on a directory of files), use @dp.table.

If it’s doing a batch read (e.g., spark.read.format("delta").load(...) or joining already materialized tables), use @dp.materialized_view. In many cases, DLT pipelines had a mix of both types; now you’ll make that distinction explicit. For example, in DLT you might have:

    Python
   
 

   # DLT code (before migration)
import dlt

@dlt.table(name="raw_data")
def raw_data():
    return spark.readStream.format("cloudFiles")... .load("<path>")

@dlt.view
def aggregated():
    df = dlt.read("raw_data")
    return df.groupBy("category").count()

@dlt.table(name="report")
@dlt.expect("PositiveCount", "count > 0")
def report():
    return dlt.read("aggregated")
  

The equivalent Lakeflow pipeline code would be:

    Python
   
 

   # Lakeflow code (after migration)
from pyspark import pipelines as dp

@dp.table(name="raw_data")  # streaming source
def raw_data():
    return spark.readStream.format("cloudFiles")... .load("<path>")

@dp.temporary_view
def aggregated():
    df = dp.read("raw_data")
    return df.groupBy("category").count()

@dp.materialized_view(name="report")
@dp.expect("PositiveCount", "count > 0")
def report():
    return dp.read("aggregated")
  

In this example, the first table stays a streaming table, now declared with @dp.table, because it reads a streaming source. The intermediate view became @dp.temporary_view. The final table, report, is derived from a batch aggregation, so we mark it as a @dp.materialized_view. We also carried over the data quality expectation. Making these choices explicit makes the pipeline's intent clearer. Under the hood, Lakeflow still builds a dependency graph and manages incremental updates, but now you have more control over how tables are updated.

It’s worth noting that these changes align with Apache Spark’s emerging declarative pipeline syntax, meaning your @dp.table and @dp.materialized_view definitions mirror what vanilla Spark 4.1+ would accept.
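The rule of thumb from the migration tip can be expressed as a small triage helper. This is a rough heuristic sketch, not part of any Databricks tooling: the function name suggest_decorator and the list of streaming markers are assumptions, and a text match cannot see reads hidden behind helper functions, so treat the output as a starting point for manual review.

```python
# Substrings that suggest a function body reads from a streaming source.
# The list is an assumption for illustration; extend it for your codebase.
STREAMING_MARKERS = ("readStream", "read_stream", "cloudFiles")


def suggest_decorator(function_source: str) -> str:
    """Suggest a Lakeflow decorator for a function previously under @dlt.table."""
    if any(marker in function_source for marker in STREAMING_MARKERS):
        return "@dp.table"  # streaming read -> streaming table
    return "@dp.materialized_view"  # batch read -> materialized view


if __name__ == "__main__":
    streaming_body = 'return spark.readStream.format("cloudFiles").load(path)'
    batch_body = 'return spark.read.format("delta").load(path)'
    print(suggest_decorator(streaming_body))
    print(suggest_decorator(batch_body))
```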

3. Data Quality Expectations

DLT’s ability to enforce data quality constraints via expectations is preserved in Lakeflow. In DLT, you might have used the @dlt.expect, @dlt.expect_or_drop, or @dlt.expect_or_fail decorators to define rules on a table. The decorator you choose determines what happens to rows that violate the condition: @dlt.expect("Name", "condition") keeps them in the output and records the violation count as a metric, @dlt.expect_or_drop drops them, and @dlt.expect_or_fail fails the update.

In Lakeflow, the syntax is @dp.expect("Name", "condition"), with @dp.expect_or_drop and @dp.expect_or_fail as the stricter variants. Conceptually, the usage is the same: you attach one or more expectations above a @dp.table or @dp.materialized_view function.
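To make the difference between the three expectation decorators concrete, here is a plain-Python model of what each one does with violating rows. It is a simplified illustration and not how Lakeflow implements expectations: the function apply_expectation, its action argument, and the sample rows are all hypothetical, and real expectations take a SQL condition string rather than a Python callable.

```python
def apply_expectation(rows, predicate, action="warn"):
    """Model one expectation over a list of row dicts.

    action="warn" mirrors expect (keep rows, count violations),
    action="drop" mirrors expect_or_drop (remove violating rows),
    action="fail" mirrors expect_or_fail (raise on any violation).
    Returns (kept_rows, violation_count).
    """
    if action not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown action: {action!r}")
    violations = [row for row in rows if not predicate(row)]
    if action == "fail" and violations:
        raise RuntimeError(f"{len(violations)} row(s) violated the expectation")
    if action == "drop":
        kept = [row for row in rows if predicate(row)]
    else:
        kept = list(rows)
    return kept, len(violations)


if __name__ == "__main__":
    rows = [{"category": "a", "count": 3}, {"category": "b", "count": 0}]
    positive_count = lambda row: row["count"] > 0
    print(apply_expectation(rows, positive_count, "warn"))
    print(apply_expectation(rows, positive_count, "drop"))
```

In all three modes the violation count is still computed, which matches the idea that expectations double as data quality metrics.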

4. Change Data Capture (CDC) and Flows

Change data capture (CDC) is the other area where the API changes. In DLT, you would have used the Python API dlt.apply_changes() together with a declared target streaming table, or the SQL syntax APPLY CHANGES INTO in a pipeline notebook, to apply a change feed to a target. For instance, a DLT example for CDC looked like:

    Python

# DLT code (before migration)
import dlt
from pyspark.sql.functions import col

# Declare the target, then apply the change feed to it
dlt.create_streaming_table("target_table")

dlt.apply_changes(
    target="target_table",
    source="cdc_feed_table",
    keys=["id"],
    sequence_by=col("timestamp"),
    apply_as_deletes=col("operation") == "DELETE",
)

In Lakeflow, the CDC capability has been renamed rather than redesigned. The direct replacement for dlt.apply_changes() is dp.create_auto_cdc_flow(), which takes the same arguments and behaves the same way. As before, you call it at the pipeline definition level (not inside a table function) to link a source to a target, and the target must already be declared as a streaming table. In practice, migrating a DLT CDC pipeline involves:

Define an empty target table using dp.create_streaming_table(name="target_table", schema=...) to serve as the sink for changes.
Use dp.create_auto_cdc_flow(target="target_table", source="source_table", keys=[...], sequence_by=..., apply_as_deletes=...) to create the CDC flow that updates the target.

This pattern keeps the declaration of the target table separate from the CDC logic, treating CDC as a first-class flow rather than a special kind of table function. Under the hood, create_auto_cdc_flow handles upserts and deletions on the target table the same way apply_changes did, and parameters such as keys, sequence_by, and apply_as_deletes are unchanged. Databricks has simply renamed the API for clarity and future compatibility. During migration, replace any usage of dlt.apply_changes with dp.create_auto_cdc_flow. If you were using the SQL APPLY CHANGES INTO syntax in a DLT SQL notebook, the equivalent Lakeflow SQL uses a CREATE FLOW ... AS AUTO CDC INTO statement with similar clauses.
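The two steps above might look like the following in Python. This is a sketch under stated assumptions: the helper name register_cdc_flow is hypothetical, the table and column names are the ones used in the example earlier in this section, and the string forms of sequence_by and apply_as_deletes are assumed to be accepted as they were by apply_changes. The dp module is passed in as an argument only so the function can be exercised with a stand-in object outside Databricks; in a real pipeline you would use from pyspark import pipelines as dp at the top of the file and make these calls at module level.

```python
def register_cdc_flow(dp, target="target_table", source="cdc_feed_table"):
    """Declare the target streaming table and the CDC flow that feeds it.

    `dp` is expected to be the pyspark.pipelines module (or a stand-in
    exposing the same two functions). Defaults are illustrative names.
    """
    # Step 1: declare the empty streaming table that receives the changes.
    dp.create_streaming_table(name=target)

    # Step 2: declare the flow that applies the change feed to the target.
    dp.create_auto_cdc_flow(
        target=target,
        source=source,
        keys=["id"],                              # primary key column(s)
        sequence_by="timestamp",                  # ordering of change events
        apply_as_deletes="operation = 'DELETE'",  # condition marking deletes
    )
```

Wrapping the two calls in a function is also convenient when many targets share the same CDC shape, since the helper can be called once per table.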

Edge Cases and Considerations

Backward Compatibility

A big advantage is that you can migrate gradually. Databricks has made Lakeflow backward-compatible, so your old import dlt code will still run on the Lakeflow engine. This means you can perform A/B testing or a phased migration: for instance, run the pipeline as-is, then run the migrated version and compare results. Just note that new features will likely appear only in the pipelines module going forward, so to take advantage of improvements, you’ll eventually want to switch fully. Also, be aware that some system names remain prefixed with dlt for now. These legacy naming artifacts do not affect functionality, but they can be confusing. Don’t be alarmed if you see dlt in logs; it is purely cosmetic.
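For the comparison step, one lightweight approach is an order-insensitive diff of the rows each version produces. The sketch below is an assumption about how you might do that, not a Databricks feature: diff_rows is a hypothetical helper, and it expects each result as a list of dicts with hashable values (for example, collected from a small table with [row.asDict() for row in df.collect()]).

```python
from collections import Counter


def diff_rows(before, after):
    """Compare two row collections as multisets, ignoring row order.

    Returns (missing, unexpected): rows only in `before` and rows only in
    `after`, each as a list of dicts. Duplicates are counted.
    """
    def freeze(row):
        # Sort items so dicts with the same content hash identically.
        return tuple(sorted(row.items()))

    before_counts = Counter(freeze(row) for row in before)
    after_counts = Counter(freeze(row) for row in after)
    missing = [dict(key) for key in (before_counts - after_counts).elements()]
    unexpected = [dict(key) for key in (after_counts - before_counts).elements()]
    return missing, unexpected


if __name__ == "__main__":
    legacy = [{"category": "a", "count": 3}, {"category": "b", "count": 1}]
    migrated = [{"category": "b", "count": 1}, {"category": "a", "count": 3}]
    print(diff_rows(legacy, migrated))
```

Collecting to the driver only makes sense for small outputs; for large tables the same multiset idea would need to be expressed as a distributed query instead.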

Mixing SQL and Python

If your DLT pipeline was defined with SQL notebooks or a mix of SQL and Python, the migration concept is similar. SQL syntax in Lakeflow supports CREATE STREAMING TABLE, CREATE MATERIALIZED VIEW, and CREATE FLOW statements, aligning with the new terminology. Any APPLY CHANGES INTO clause in SQL should continue to work, but Databricks docs suggest using CREATE FLOW ... AS AUTO CDC INTO for consistency going forward. In Python, as we covered, use dp functions and decorators. You can even mix Lakeflow SQL and Python in one pipeline, as was possible with DLT; just ensure the naming and types line up.

Open Source Spark Pipeline Compatibility

One motivation for Lakeflow’s changes was convergence with Apache Spark’s declarative pipelines. Spark 4.1 introduced a pyspark.pipelines module with similar concepts, so you could potentially run a pipeline outside Databricks. If you plan to take a Lakeflow pipeline and run it on an open-source Spark cluster, note that not all features carry over. Core dataset definitions and basic reads will work, but features like expectations and the CDC utilities are Databricks-only. During migration, you might flag these sections if portability is a concern. For pure Databricks usage, this isn’t an issue.
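If portability matters, a quick scan can flag the sections to review. This sketch takes the paragraph above at its word that expectations and the CDC utilities are Databricks-only; the helper name find_nonportable_lines and the pattern are assumptions for illustration and should be adjusted to whatever the Spark release you target actually supports.

```python
import re

# Calls this article describes as Databricks-only. The pattern is an
# assumption for illustration; verify against your target Spark version.
NONPORTABLE = re.compile(r"\bdp\.(expect\w*|create_auto_cdc\w*)")


def find_nonportable_lines(source: str):
    """Return (line_number, matched_call) pairs for Databricks-only usage."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = NONPORTABLE.search(line)
        if match:
            hits.append((number, match.group(0)))
    return hits
```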

Performance and Observability

Migrating to Lakeflow should not degrade performance; in fact, you may see improvements or new options. Lakeflow still provides an event log, data lineage visualization, and metrics, as DLT did. After migration, validate that your pipeline updates and triggers still behave as expected. Lakeflow pipelines can run in triggered or continuous mode, depending on configuration, just like DLT. If you relied on a particular trigger mode, check that it is still configured; it is a pipeline setting that lives outside the code, so the refactor itself will not change it.

Conclusion

Migrating from Delta Live Tables to Lakeflow Declarative Pipelines is a straightforward process that mainly involves renaming APIs and clarifying table types. The declarative, engineer-friendly approach to building pipelines remains the same, but Lakeflow’s refinements bring you better alignment with open standards and future Databricks features. By updating your imports to pyspark.pipelines, switching to @dp.table or @dp.materialized_view where appropriate, and refactoring CDC and expectations to the new syntax, you’ll ensure your pipelines are future-proof.

This migration not only preserves the benefits DLT gave you but also sets the stage for leveraging new Lakeflow enhancements. Happy migrating, and enjoy the continued simplicity and power of declarative pipelines on the Databricks Lakehouse!

Opinions expressed by DZone contributors are their own.