
Snowflake Data Processing With Snowpark DataFrames

Snowpark is a new developer library in Snowflake that provides an API to process data using programming languages like Scala (and later on Java or Python), instead of SQL.

By Istvan Szegedi · Jan. 04, 22 · Tutorial

Introduction

Snowpark is a new developer library in Snowflake that provides an API to process data using programming languages like Scala (and later on Java or Python) instead of SQL. The core concept in Snowpark is the DataFrame, which represents a set of data, e.g., a number of rows from a database table, that can be processed using object-oriented or functional programming techniques with our favorite tools. The notion of Snowpark DataFrames is similar to Apache Spark or Python pandas DataFrames.

Developers can also create user-defined functions (UDFs) that are pushed down to the Snowflake server, where they operate directly on the data. Code execution uses lazy evaluation, which reduces the amount of data exchanged between the Snowflake warehouse and the client.
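
To illustrate (this example is not part of the original walkthrough), here is a minimal sketch of a scalar UDF in Snowpark Scala, assuming the session created in the next section and a hypothetical phone-number cleanup:

Scala
 
// Hypothetical sketch: a scalar UDF that strips non-digit characters.
// Snowpark serializes the lambda and executes it on the Snowflake server.
val stripNonDigits = udf((phone: String) => phone.replaceAll("[^0-9]", ""))

// Applying the UDF is just another lazy transformation.
val dfPhones = session.table("salesforce_account")
  .select(stripNonDigits(col("phone")).as("phone_digits"))
dfPhones.show()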

The current version of Snowpark runs on Scala 2.12 and JDK 8, 9, 10, or 11. It is now in public preview, available for all accounts.

Architecture

From an architecture perspective, a Snowpark client is similar to an Apache Spark driver program: it executes the client code, pushes the generated SQL query to the Snowflake warehouse, and, once the Snowflake compute servers have processed the data, receives the result back in DataFrame format.


Figure 1: Snowpark Architecture


Broadly speaking, Snowpark operations can be put into two categories: transformations and actions. Transformations are lazily executed, so they do not cause the DataFrame to be evaluated; examples of transformations are select, filter, sort, groupBy, etc. Actions are the opposite: they trigger evaluation of the DataFrame, so Snowpark sends the SQL query for the DataFrame to the server, and the result is downloaded into the client's memory. Examples of actions include show, collect, take, etc.
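
To make the distinction concrete, here is a small sketch (using the session and table introduced in the next section): the two transformations below only build up a query plan on the client, and nothing reaches Snowflake until the final show() call.

Scala
 
// Transformations: lazily build the SQL plan; no query is sent yet.
val pipeline = session.table("salesforce_account")
  .filter(col("type") === "Customer - Direct")
  .select(col("accountname"), col("phone"))

// Action: the SQL is generated, pushed to the warehouse, and the
// result is downloaded into the client's memory.
pipeline.show()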

Snowpark in Action

Before we can execute any Snowpark transformations and actions, we need to connect to the Snowflake warehouse and establish a session.

Scala
 
import com.snowflake.snowpark._
import com.snowflake.snowpark.functions._ // col, udf, etc. used throughout this post

object Main {
  def main(args: Array[String]): Unit = {
    // Replace the <placeholders> below.
    val configs = Map(
      "URL" -> "https://<SNOWFLAKE-INSTANCE>.snowflakecomputing.com:443",
      "USER" -> "<USERNAME>",
      "PASSWORD" -> "<PASSWORD>",
      "ROLE" -> "SYSADMIN",
      "WAREHOUSE" -> "SALESFORCE_ACCOUNT",
      "DB" -> "SALESFORCE_DB",
      "SCHEMA" -> "SALESFORCE"
    )
    // Create the session and verify connectivity by listing the tables.
    val session = Session.builder.configs(configs).create
    session.sql("show tables").show()
  }
}
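
Hard-coding credentials is fine for a quick demo. As an alternative (a sketch; the file name is a placeholder), the session builder can read the same keys from a properties file:

Scala
 
// Sketch: load URL, USER, PASSWORD, etc. from a properties file
// instead of embedding them in the source.
val session = Session.builder.configFile("profile.properties").create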


From the Snowflake Admin UI perspective, we have a SALESFORCE_DB database with a SALESFORCE schema that has 3 tables: SALESFORCE_ACCOUNT represents accounts from our Salesforce instance, SALESFORCE_ORDER stores orders initiated by these accounts, and SALESFORCE_ACCOUNT_ORDER is a joined table that will store the joined query result (we will come back to this later in the post):


Figure 2: Snowflake database tables


To retrieve the first 10 rows from the Salesforce_Account table, we can simply execute the following DataFrame methods:

Scala
 
// Create a DataFrame from the data in the "salesforce_account" table.
val dfAccount = session.table("salesforce_account")

// To print out the first 10 rows, call:
dfAccount.show()


Snowpark will convert the code into SQL and push it to Snowflake warehouse for execution:

[main] INFO com.snowflake.snowpark.internal.ServerConnection - Execute query [queryID: XXXX]  SELECT  *  FROM ( SELECT  *  FROM (salesforce_account)) LIMIT 10


The output in our VSCode IDE looks like this:

Figure 3: VSCode Output for Salesforce_Account table
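
show() only prints a preview. If we need the rows in client memory, an action such as collect() downloads the full result as Row objects (a sketch; note that this pulls every row, so use it with care on large tables):

Scala
 
import com.snowflake.snowpark.Row

// collect() is an action: the query runs in the warehouse and the
// entire result set is downloaded into the client's memory.
val rows: Array[Row] = dfAccount.collect()
rows.take(3).foreach(println)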


We can also filter certain rows and execute transformations (e.g., selecting columns) on the DataFrame:

Scala
 
val dfFilter = session.table("salesforce_account").filter(col("type") === "Customer - Direct")
dfFilter.show()

val dfSelect = session.table("salesforce_account").select(col("accountname"), col("phone"))
dfSelect.show()


Snowpark will generate the corresponding SQL queries and push them down to the Snowflake compute server:

 
[main] INFO com.snowflake.snowpark.internal.ServerConnection - Execute query [queryID: XXXX]  SELECT  *  FROM ( SELECT  *  FROM ( SELECT  *  FROM (salesforce_account)) WHERE ("TYPE" = 'Customer - Direct')) LIMIT 10
  
 [main] INFO com.snowflake.snowpark.internal.ServerConnection - Execute query [queryID: XXXX]  SELECT  *  FROM ( SELECT "ACCOUNTNAME", "PHONE" FROM ( SELECT  *  FROM (salesforce_account))) LIMIT 10


And the VSCode output:

Figure 4: VSCode output for filter and select methods


The Snowpark DataFrame API also allows joining DataFrames. In our example, the Salesforce_Order table in Snowflake contains the orders executed by these Salesforce accounts. We can pull that data into a DataFrame and join it with the account records:

Scala
 
val dfOrder = session.table("salesforce_order")
dfOrder.show()

val dfJoin = dfAccount.join(dfOrder, col("sfdcid") === col("accountid"))
  .select(col("accountname"), col("phone"), col("productname"), col("amount"))
dfJoin.show()


Snowpark converts the DataFrame methods into SQL and pushes the queries to the Snowflake warehouse. The output will look like this in VSCode:

Figure 5: VSCode output for join method
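
The joined DataFrame is still lazy, so it can be transformed further before anything runs. For example, here is a sketch (not from the original walkthrough) of aggregating order amounts per account, where sum comes from com.snowflake.snowpark.functions:

Scala
 
// Sketch: total order amount per account; evaluated only at show().
val dfTotals = dfJoin
  .groupBy(col("accountname"))
  .agg(sum(col("amount")).as("total_amount"))
dfTotals.show()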

If we want to persist the calculated results, we can do so using the saveAsTable method:

Scala
 
dfJoin.write.mode(SaveMode.Overwrite).saveAsTable("salesforce_account_order")


The generated SQL query looks like this:

[main] INFO com.snowflake.snowpark.internal.ServerConnection - Execute query [queryID: XXXX]  CREATE  OR  REPLACE  TABLE salesforce_account_order AS  SELECT  *  FROM ( SELECT "ACCOUNTNAME", "PHONE", "PRODUCTNAME", "AMOUNT" FROM ( SELECT  *  FROM (( SELECT "ACCOUNTNAME" AS "ACCOUNTNAME", "PHONE" AS "PHONE", "TYPE" AS "TYPE", "SFDCID" AS "SFDCID" FROM ( SELECT  *  FROM (salesforce_account))) AS SNOWPARK_TEMP_TABLE_UKKLR6UCHN6POXL INNER JOIN ( SELECT "ACCOUNTID" AS "ACCOUNTID", "PRODUCTNAME" AS "PRODUCTNAME", "AMOUNT" AS "AMOUNT" FROM ( SELECT  *  FROM (salesforce_order))) AS SNOWPARK_TEMP_TABLE_36DEOZXTQJUYKLD ON ("SFDCID" = "ACCOUNTID"))))


And as a result, Snowpark will create a new table or replace the existing one and store the generated data:

Figure 6: Snowflake Salesforce_Account_Order table
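
SaveMode.Overwrite recreates the table on every run. If we wanted to keep the existing rows instead, the same write with SaveMode.Append (a sketch, assuming the target table's schema matches) inserts into the table:

Scala
 
// Sketch: append the joined rows to the existing table
// instead of dropping and recreating it.
dfJoin.write.mode(SaveMode.Append).saveAsTable("salesforce_account_order")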

Conclusion

Snowpark opens up a wide variety of operations and tools for data processing. It allows building very complex and advanced data pipelines. Its key strength is that it can push custom code down to the Snowflake warehouse and execute it close to the data, reducing unnecessary data transfer.

Tags: Database, Data processing, SQL, Apache Spark, Scala (programming language)

Opinions expressed by DZone contributors are their own.
