Migrating Traditional Workloads From Classic Compute to Serverless Compute on Databricks
This tutorial explains the migration of Databricks workloads from Classic Compute to Serverless Compute for efficiency and cost effectiveness.
This article walks through how to migrate traditional workloads from Classic Compute to Serverless Compute for simpler cluster management, lower cost, better scalability, and optimized performance.
Overview
As data engineering evolves, so do the infrastructure needs of enterprise workloads. With growing demands for agility, scalability, and cost-efficiency, Databricks Serverless Compute provides a compelling alternative to classic clusters. In this article, we explore a practical roadmap to migrate your pipelines and analytics workloads from classic compute (manual clusters or job clusters) to Databricks Serverless Compute, with specific attention to data security, scheduling, costs, and operational resilience.
Why Migrate to Serverless Compute?
Before delving into the migration steps, let's compare serverless compute with Classic Compute to see why it is often the more efficient choice for these workloads:
| Feature | Classic Compute | Serverless Compute |
| --- | --- | --- |
| Cluster Management | Manual or automated | Fully managed by Databricks |
| Cost Control | Prone to idle costs | No charge for idle compute |
| Scalability | Manual configuration | Auto-scales per workload needs |
| Security Isolation | Shared VMs unless isolated | Secured, runtime-isolated compute |
| Performance Optimization | User-optimized | Databricks-optimized runtime & IO |
For data pipeline tasks that involve scheduled ETL jobs, monthly reconciliations, or ledger computations, serverless compute offers elasticity and reduced maintenance burden—ideal for small-to-medium batch workloads with predictable patterns.
Pre-Requisites: Assess the Assets of Current Workloads
Let us start by auditing your existing classic cluster workloads:
- Identify job types: ETL pipelines, reporting scripts, reconciliation logic.
- Data sources: Delta tables, JDBC, cloud storage (e.g., S3, ADLS).
- Schedule and frequency: How often do jobs run? Nightly, monthly, ad-hoc?
- Dependencies: Are there shared libraries, secrets, or initialization scripts?
- Execution environment: Python, SQL, Scala, or notebooks?
Create an inventory and tag each workload with compute and runtime needs (e.g., memory, cores, run time).
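One way to keep that inventory is as a small, typed list that can be sorted into migration waves. The sketch below is only an illustration: the `Workload` fields, the `migration_wave` rule, and the two sample entries are hypothetical, and the triage rule simply reflects the items this article flags for review later (init scripts and legacy JDBC connectors).

```python
from dataclasses import dataclass, field


@dataclass
class Workload:
    """One row of the migration inventory (all fields are illustrative)."""
    name: str
    job_type: str                 # e.g. "etl", "reporting", "reconciliation"
    language: str                 # "python", "sql", "scala", or "notebook"
    schedule: str                 # e.g. "nightly", "monthly", "ad-hoc"
    uses_init_scripts: bool = False
    data_sources: list = field(default_factory=list)


def migration_wave(w: Workload) -> str:
    """Rough triage rule (an assumption, not Databricks guidance)."""
    if w.uses_init_scripts or "jdbc" in w.data_sources:
        return "review-first"
    return "early-candidate"


# Hypothetical sample entries
inventory = [
    Workload("monthly_reconciliation", "reconciliation", "python", "monthly",
             data_sources=["delta"]),
    Workload("legacy_loader", "etl", "python", "nightly",
             uses_init_scripts=True, data_sources=["jdbc", "s3"]),
]

for w in inventory:
    print(f"{w.name}: {migration_wave(w)}")
```

Keeping the rule in one function makes it easy to adjust as you learn which dependencies actually cause trouble in your workspace.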
Migration Process Flow Walkthrough
Step 1: Set Up Serverless Compute in Databricks
a. Enable Serverless in Your Workspace
- Go to Admin Console → Compute.
- Ensure Serverless Compute is enabled.
- If required, contact your Databricks support team to enable it in your workspace (may depend on cloud provider and plan).
b. Create a Serverless SQL Warehouse (Optional)
If your workloads are SQL-heavy (e.g., ledger queries, reporting dashboards):
- Navigate to SQL → SQL Warehouses.
- Click Create → Choose Serverless → Configure autoscaling, timeouts, and permissions.
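If you prefer to script the warehouse instead of clicking through the UI, the same settings can be expressed as a request body. The sketch below only builds the payload; the field names are what we believe the SQL Warehouses REST API expects, and the warehouse name and size are hypothetical, so check both against the API reference before use.

```python
import json


def serverless_warehouse_spec(name: str, max_clusters: int = 2,
                              auto_stop_mins: int = 10) -> dict:
    """Build a request body for creating a serverless SQL warehouse.

    Field names are assumed to match the SQL Warehouses REST API;
    verify them against the API reference for your workspace.
    """
    return {
        "name": name,
        "cluster_size": "Small",             # hypothetical starting size
        "min_num_clusters": 1,
        "max_num_clusters": max_clusters,    # upper bound for autoscaling
        "auto_stop_mins": auto_stop_mins,    # idle timeout in minutes
        "warehouse_type": "PRO",
        "enable_serverless_compute": True,
    }


spec = serverless_warehouse_spec("ledger-reporting")
print(json.dumps(spec, indent=2))
```

The resulting JSON would typically be sent with an authenticated POST to the workspace's SQL Warehouses endpoint; that call is left out here so the sketch runs anywhere.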
For Python/Scala jobs, proceed to the next step.
Step 2: Migrate Jobs to Serverless Compute
a. Job Migration Steps (Databricks Workflows)
If you're using Job Clusters:
- Open the existing job from Workflows.
- Click Edit Job Settings.
- Under Cluster Configuration, change the cluster type to:
  - "Shared" Serverless Job Cluster, or
  - Use existing serverless pool (if set up).
If you're using notebooks or workflows:
- Set the attached compute to a Serverless Job Cluster.
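- Install libraries as notebook-scoped packages (for example, with %pip) or through the job's environment settings. Init scripts and cluster-level installs are tied to classic clusters and generally do not carry over to serverless compute.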
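For jobs defined as JSON (for example, exported from the Jobs API), the conversion can be sketched as removing every reference to classic compute. This assumes, as far as we understand the Jobs API, that a notebook task with no cluster fields runs on serverless compute when the workspace has it enabled; treat that as something to verify. The job name, notebook path, and cluster settings below are hypothetical.

```python
import copy

# Task-level keys that pin a task to classic compute in a Jobs API payload.
CLASSIC_COMPUTE_KEYS = ("new_cluster", "existing_cluster_id", "job_cluster_key")


def to_serverless(job_settings: dict) -> dict:
    """Return a copy of the job settings with classic compute references removed."""
    migrated = copy.deepcopy(job_settings)
    migrated.pop("job_clusters", None)       # shared job cluster definitions
    for task in migrated.get("tasks", []):
        for key in CLASSIC_COMPUTE_KEYS:
            task.pop(key, None)
    return migrated


# Hypothetical classic job definition
classic_job = {
    "name": "monthly-reconciliation",
    "job_clusters": [{"job_cluster_key": "etl", "new_cluster": {"num_workers": 4}}],
    "tasks": [{
        "task_key": "reconcile",
        "job_cluster_key": "etl",
        "notebook_task": {"notebook_path": "/Workspace/finance/reconcile"},
    }],
}

serverless_job = to_serverless(classic_job)
print(serverless_job["tasks"][0])
```

Working on a deep copy keeps the original definition intact, so the classic job can remain as a fallback while the serverless version is validated.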
b. Validate Environment Compatibility
- Make sure all libraries (e.g., Pandas, PySpark) work under the Databricks Runtime supported by serverless.
- If you use legacy Hive or JDBC connectors, confirm that they still work, or migrate to Unity Catalog / native Delta connections.
- Review any init scripts or file paths that assume a VM or disk context—they may not behave identically in serverless.
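A quick static scan can surface code that assumes a VM or disk context before you run anything on serverless. The following is a minimal sketch: the pattern list is illustrative rather than exhaustive, and `scan_source` is a hypothetical helper, not a Databricks tool.

```python
import re

# Patterns that often indicate an assumption about VM disks or mounts.
# The list is illustrative, not exhaustive.
RISKY_PATTERNS = {
    "dbfs_mount": re.compile(r"/mnt/"),
    "dbfs_fuse": re.compile(r"/dbfs/"),
    "local_file": re.compile(r"file:/"),
}


def scan_source(source: str) -> list:
    """Return (line_number, pattern_name, line) for each suspicious line."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in RISKY_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name, line.strip()))
    return findings


sample = '''df = spark.read.table("Ledger_2024")
summary.write.format("delta").mode("overwrite").save("/mnt/ledger/summary")
'''

for finding in scan_source(sample):
    print(finding)
```

A hit is only a prompt for manual review; some mounted paths may keep working, and some problems will not match any pattern.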
Step 3: Schedule Jobs and Monitor Performance
Databricks allows job scheduling and retry logic via Workflows:
- Go to Workflows → Create Job.
- Set the notebook/script path, parameters, and schedule (e.g., "Every first of the month at 3 AM").
- Configure email/Slack alerts for success/failure.
- Enable retry policy (e.g., up to 3 retries on failure).
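The same schedule, alerting, and retry choices can be captured as job settings for review or version control. This sketch assumes the Jobs API field names shown (`schedule`, `email_notifications`, `max_retries`, `min_retry_interval_millis`); the e-mail address, task key, and notebook path are hypothetical.

```python
def monthly_schedule_settings(alert_email: str) -> dict:
    """Schedule, alerting, and retry fields for a Jobs API payload."""
    return {
        "schedule": {
            # Quartz cron fields: sec min hour day-of-month month day-of-week
            "quartz_cron_expression": "0 0 3 1 * ?",    # 03:00 on the 1st of each month
            "timezone_id": "UTC",
            "pause_status": "UNPAUSED",
        },
        "email_notifications": {"on_failure": [alert_email]},
        "tasks": [{
            "task_key": "reconcile",
            "notebook_task": {"notebook_path": "/Workspace/finance/reconcile"},
            "max_retries": 3,                               # up to 3 retries on failure
            "min_retry_interval_millis": 5 * 60 * 1000,     # wait 5 minutes between attempts
        }],
    }


settings = monthly_schedule_settings("data-team@example.com")
print(settings["schedule"]["quartz_cron_expression"])
```

Note that Quartz expressions start with a seconds field, which differs from the five-field cron syntax many teams are used to.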
Use Job Metrics UI to compare performance:
- CPU and memory usage per task.
- Runtime per job before and after serverless migration.
- Cost estimation dashboards (if enabled).
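To make the before-and-after comparison concrete, you can export run durations and compare medians. The helper below is a sketch; the durations are placeholder values for illustration only, not measurements from any real workload.

```python
from statistics import median


def runtime_change_pct(before_secs, after_secs) -> float:
    """Percent change in median runtime; negative means the job got faster."""
    base = median(before_secs)
    return (median(after_secs) - base) / base * 100.0


# Placeholder durations in seconds, NOT measurements.
classic_runs = [600, 640, 620]
serverless_runs = [480, 500, 465]

print(f"{runtime_change_pct(classic_runs, serverless_runs):+.1f}%")
```

Medians are used rather than means so that a single slow run, such as one affected by a cold start, does not dominate the comparison.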
Step 4: Secure Access to Data
Most data is sensitive. Make sure to:
- Enable Unity Catalog for fine-grained access control.
- Use credential passthrough or service principals for access to cloud storage.
- Store secrets using Databricks Secrets and access them securely in jobs.
Example:
python
import os
import pyspark.sql.functions as F
# Read the database password from a Databricks secret scope instead of hard-coding it
db_pass = dbutils.secrets.get(scope="-secrets", key="db-password")
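Building on the secret lookup above, connection options can be assembled in one place so that no credential appears in notebook code. This is a sketch: the `db-url` and `db-user` keys and the `demo-secrets` scope are hypothetical, and a local stand-in replaces `dbutils.secrets.get` so the example runs outside Databricks.

```python
from typing import Callable


def jdbc_options(get_secret: Callable[[str, str], str], scope: str) -> dict:
    """Assemble JDBC reader options without hard-coding credentials.

    In a notebook or job, pass dbutils.secrets.get as get_secret.
    """
    return {
        "url": get_secret(scope, "db-url"),             # hypothetical key
        "user": get_secret(scope, "db-user"),           # hypothetical key
        "password": get_secret(scope, "db-password"),   # key used in the example above
    }


# Local stand-in for dbutils.secrets.get so the sketch runs outside Databricks.
fake_store = {
    ("demo-secrets", "db-url"): "jdbc:postgresql://host:5432/ledger",
    ("demo-secrets", "db-user"): "etl_user",
    ("demo-secrets", "db-password"): "not-a-real-password",
}
options = jdbc_options(lambda scope, key: fake_store[(scope, key)], "demo-secrets")
print(sorted(options))
```

Inside a job, the returned dictionary could be passed to a JDBC reader with `spark.read.format("jdbc").options(**options)`, with the table name supplied separately.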
Step 5: Optimize and Scale
Once migrated, apply these optimization steps:
- Use Delta Lake for all tables to benefit from caching and ACID compliance.
- Apply Z-Ordering on frequently filtered columns (e.g., account_id, period).
- Take advantage of the Photon engine in serverless SQL warehouses for faster computation.
- Monitor for underutilized compute—tune autoscaling thresholds accordingly.
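Z-Ordering is applied with Delta Lake's `OPTIMIZE ... ZORDER BY` command. The helper below is a small sketch that builds the statement; the three-level table name is hypothetical, and the statement would be run with `spark.sql` inside a notebook or job.

```python
def zorder_statement(table: str, columns: list) -> str:
    """Build an OPTIMIZE ... ZORDER BY statement for a Delta table."""
    if not columns:
        raise ValueError("ZORDER BY needs at least one column")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"


# Hypothetical catalog, schema, and table name
stmt = zorder_statement("finance.accounting.ledger_2024", ["account_id", "period"])
print(stmt)
# In a notebook or job: spark.sql(stmt)
```

Z-Ordering tends to help most on columns that appear in filters, so it is worth limiting the list to a few such columns.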
Step 6: Example Use Case: Monthly Accounting Reconciliation
Suppose your classic cluster runs a notebook like this:
python
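from pyspark.sql import functions as F

# Load ledger entries
df = spark.read.table("Ledger_2024")
# Summarize debits and credits per account (aliases keep column names Delta-safe)
summary = df.groupBy("account_id").agg(
    F.sum("debit").alias("total_debit"),
    F.sum("credit").alias("total_credit"),
)
# Write the summary to a Delta location
summary.write.format("delta").mode("overwrite").save("/mnt/ledger/summary")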
To migrate:
- Move this notebook to a scheduled workflow with a serverless job cluster.
- Replace paths like /mnt/... with Unity Catalog references if possible.
- Ensure access to Ledger_2024 via catalog permissions.
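After migration, the same aggregation can read from and write to Unity Catalog tables instead of a mounted path. The sketch below expresses it as a single SQL statement; the catalog and schema names and the `ledger_summary` target table are hypothetical, and only the `Ledger_2024` table and its columns come from the notebook above.

```python
def reconciliation_sql(catalog: str, schema: str) -> str:
    """Per-account debit/credit totals, written to a Unity Catalog table."""
    source = f"{catalog}.{schema}.Ledger_2024"
    target = f"{catalog}.{schema}.ledger_summary"    # stands in for /mnt/ledger/summary
    return (
        f"CREATE OR REPLACE TABLE {target} AS "
        "SELECT account_id, SUM(debit) AS total_debit, SUM(credit) AS total_credit "
        f"FROM {source} GROUP BY account_id"
    )


sql = reconciliation_sql("finance", "accounting")    # hypothetical catalog and schema
print(sql)
# In the serverless job: spark.sql(sql)
```

Writing to a governed table rather than a path means access to the output is controlled by the same catalog permissions as the input.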
Key Considerations and Limitations
| Consideration | Notes |
| --- | --- |
| Cold Start Time | First request may have slight delay (~10s) |
| External Libraries | Prefer libraries installed via PyPI or workspace libraries |
| Job Isolation | No direct access to DBFS root or cluster-local files |
| Networking Constraints | If you rely on VPC peering or private endpoints, check compatibility with serverless network architecture |
Post-Migration Lookouts
- Cost Monitoring: Serverless charges are usage-based. Regularly monitor cost via Databricks billing dashboards.
- Audit Logging: Ensure audit logs are configured to track access and execution.
- Security Hardening: Apply appropriate workspace controls, token lifetimes, and access levels for production environments.
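For cost monitoring beyond the dashboards, usage can also be queried directly where billing system tables are enabled. The sketch below builds such a query; it assumes the `system.billing.usage` table with `usage_date`, `sku_name`, and `usage_quantity` columns, and that serverless SKU names contain the word SERVERLESS, so confirm both in your account before relying on it.

```python
def serverless_usage_sql(days: int = 30) -> str:
    """Daily usage per SKU for SKUs whose name mentions SERVERLESS."""
    if days <= 0:
        raise ValueError("days must be positive")
    return (
        "SELECT usage_date, sku_name, SUM(usage_quantity) AS usage_units "
        "FROM system.billing.usage "
        "WHERE sku_name LIKE '%SERVERLESS%' "
        f"AND usage_date >= date_sub(current_date(), {int(days)}) "
        "GROUP BY usage_date, sku_name "
        "ORDER BY usage_date"
    )


query = serverless_usage_sql(7)
print(query)
# In a notebook or SQL warehouse session: spark.sql(query)
```

The result is in usage units rather than currency; converting it to cost requires your account's pricing.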
Conclusion
Migrating from classic compute to serverless compute in Databricks can significantly improve cost efficiency, manageability, and scalability, especially for structured workloads such as accounting. By following a structured migration path (inventory, compute setup, job conversion, and optimization), you can ensure a smooth transition without sacrificing performance or security.
This migration is a strategic step toward modernizing your data and AI infrastructure. Although the transition introduces architectural and operational changes, the benefits in agility, cost savings, and scalability are significant. By meeting the prerequisites and adopting a methodical migration strategy, your team can make full use of Databricks Serverless Compute.
Approach the migration incrementally: start with non-critical workloads, then expand serverless usage to core and critical data pipelines and jobs.