DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • What Nobody Tells You About Multimodal Data Pipelines for AI Training
  • Beyond Big Data: Designing Agentic Data Pipelines for AI Workloads
  • AI-Driven DevOps for SaaS: From Reactive to Predictive Pipelines
  • Designing AI-Assisted Integration Pipelines for Enterprise SaaS

Trending

  • Optimizing Databricks Spark Pipelines Using Declarative Patterns
  • YOLOv5 PyTorch Tutorial
  • A Comprehensive Guide to Prompt Engineering
  • Leveraging Apache Flink Dashboard for Real-Time Data Processing in AWS Apache Flink Managed Service
  1. DZone
  2. Data Engineering
  3. AI/ML
  4. Setting Up DBT and Snowpark for Machine Learning Pipelines

Setting Up DBT and Snowpark for Machine Learning Pipelines

This guide describes how to install and set up DBT and Snowpark for machine-learning model pipelines to streamline data processing and feature engineering tasks.

By 
Sevinthi Kali Sankar Nagarajan user avatar
Sevinthi Kali Sankar Nagarajan
·
Dec. 11, 24 · Tutorial
Likes (3)
Comment
Save
Tweet
Share
13.5K Views

Join the DZone community and get the full member experience.

Join For Free

AI/ML workflows excel on structured, reliable data pipelines. To streamline these processes, DBT and Snowpark offer complementary capabilities: DBT is for modular SQL transformations, and Snowpark is for programmatic Python-driven feature engineering.

Here are some key benefits of using DBT, Snowpark, and Snowflake together:

  1. Simplifies SQL-based ETL with DBT’s modularity and tests.
  2. Handles complex computations with Snowpark’s Python UDFs.
  3. Leverages Snowflake’s high-performance engine for large-scale data processing.

Here’s a step-by-step guide to installing, configuring, and integrating DBT and Snowpark into your workflows.

Step 1: Install DBT

In Shell, you can use Python’s pip command for installing packages. Assuming Python is already installed and added to your PATH, follow these steps: 

Shell
 
# Set up a Python virtual environment (recommended):
python3 -m venv dbt_env
source dbt_env/bin/activate

# Install DBT and the Snowflake adapter:
pip install dbt-snowflake

# Verify DBT installation
dbt --version


Step 2: Install Snowpark

Shell
 
# Install Snowpark for Python
pip install snowflake-snowpark-python

# Install additional libraries for data manipulation
pip install pandas numpy

# Verify Snowpark installation
python -c "from snowflake.snowpark import Session; print('successful Snowpark installation')"


Step 3: Configuring DBT for Snowflake

DBT requires a profiles.yml file to define connection settings for Snowflake.

Locate the DBT Profiles Directory

By default, DBT expects the profiles.yml file in the ~/.dbt/ directory. Create the directory if it doesn’t exist:

Shell
 
mkdir -p ~/.dbt


Create the profiles.yml File

Define your Snowflake credentials in the following format:

YAML
 
my_project:
  outputs:
    dev:
      type: snowflake
      account: your_account_identifier
      user: your_username
      password: your_password
      role: your_role
      database: your_database
      warehouse: your_warehouse
      schema: your_schema
  target: dev


Replace placeholders like your_account_identifier with your Snowflake account details.

Test the Connection

Run the following command to validate your configuration:

Shell
 
dbt debug


If the setup is correct, you’ll see a success message confirming the connection.

Step 4: Setting Up Snowpark

Ensure Snowflake Permissions

Before using Snowpark, ensure your Snowflake user has the following permissions:

  • Access to the warehouse and schema.
  • Ability to create and register UDFs (User-Defined Functions).

Create a Snowpark Session

Set up a Snowpark session using the same credentials from profiles.yml:

Python
 
from snowflake.snowpark import Session

def create_session():
    connection_params = {
        "account": "your_account_identifier",
        "user": "your_username",
        "password": "your_password",
        "role": "your_role",
        "database": "your_database",
        "warehouse": "your_warehouse",
        "schema": "your_schema",
    }
    return Session.builder.configs(connection_params).create()

session = create_session()
print("Snowpark session created successfully")


Register a Sample UDF

Here’s an example of registering a simple Snowpark UDF for text processing:

Python
 
def clean_text(input_text):
    return input_text.strip().lower()

session.udf.register(
    func=clean_text,
    name="clean_text_udf",
    input_types=["string"],
    return_type="string",
    is_permanent=True
)
print("UDF registered successfully")


Step 5: Integrating DBT With Snowpark

You have a DBT model named raw_table that contains raw data.

raw_table DBT Model Definition 

SQL
 
-- models/raw_table.sql

SELECT * 
FROM my_database.my_schema.source_table


Use Snowpark UDFs in DBT Models

Once you’ve registered a UDF in Snowflake using Snowpark, you can call it directly from your DBT models.

SQL
 
-- models/processed_data.sql

WITH raw_data AS (
    SELECT id, text_column
    FROM {{ ref('raw_table') }}
),
cleaned_data AS (
    SELECT
        id,
        clean_text_udf(text_column) AS cleaned_text
    FROM raw_data
)
SELECT * FROM cleaned_data;


Run DBT Models

Execute your DBT models to apply the transformation:

Shell
 
dbt run --select processed_data


Step 6: Advanced AI/ML Use Case

For AI/ML workflows, Snowpark can handle tasks like feature engineering directly in Snowflake. Here’s an example of calculating text embeddings:

Create an Embedding UDF

Using Python and a pre-trained model, you can generate text embeddings:

Python
 
from transformers import pipeline

def generate_embeddings(text):
    model = pipeline("feature-extraction", model="bert-base-uncased")
    return model(text)[0]

session.udf.register(
    func=generate_embeddings,
    name="generate_embeddings_udf",
    input_types=["string"],
    return_type="array",
    is_permanent=True
)


Integrate UDF in DBT

Call the embedding UDF in a DBT model to create features for ML:

SQL
 
-- models/embedding_data.sql

WITH raw_text AS (
    SELECT id, text_column
    FROM {{ ref('raw_table') }}
),
embedded_text AS (
    SELECT
        id,
        generate_embeddings_udf(text_column) AS embeddings
    FROM raw_text
)
SELECT * FROM embedded_text;


Best Practices

  • Use DBT for reusable transformations: Break down complex SQL logic into reusable models.
  • Optimize Snowpark UDFs: Write lightweight, efficient UDFs to minimize resource usage.
  • Test Your Data: Leverage DBT’s testing framework for data quality.
  • Version Control Everything: Track changes in DBT models and Snowpark scripts for traceability.

Conclusion

By combining DBT’s SQL-based data transformations with Snowpark’s advanced programming capabilities, you can build AI/ML pipelines that are both scalable and efficient. This setup allows teams to collaborate effectively while leveraging Snowflake’s computational power to process large datasets.

Whether you’re cleaning data, engineering features, or preparing datasets for ML models, the DBT-Snowpark integration provides a seamless workflow to unlock your data’s full potential.

AI Data processing Pipeline (software)

Opinions expressed by DZone contributors are their own.

Related

  • What Nobody Tells You About Multimodal Data Pipelines for AI Training
  • Beyond Big Data: Designing Agentic Data Pipelines for AI Workloads
  • AI-Driven DevOps for SaaS: From Reactive to Predictive Pipelines
  • Designing AI-Assisted Integration Pipelines for Enterprise SaaS

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook