Best Practices and Hacks: Data Loading in Snowflake From Stages

This second article of the series will help you to understand the best practices for file-based data loading in Snowflake.

By Kedhar Natekar · Apr. 24, 23 · Tutorial

Continuing our Snowflake blog series: after learning how to set up a Snowflake account using system-defined roles, we will now explore the best practices for loading data into Snowflake from files.

Let’s begin!

Snowflake supports file-based data ingestion through internal and external stages. However, there are several factors to consider when ingesting data, including the frequency of data arrival, file sizes, and the loading technique used, such as the COPY command, external tables, or Snowpipe. The file format, such as CSV, Parquet, or JSON, also plays a critical role in choosing the right approach.

Making the wrong choice can result in increased costs and slower performance. 

This blog provides insights into these approaches to help you select the best one while loading data into Snowflake.

Faster Loading of Parquet Files From S3

To load Parquet files from S3 into Snowflake faster, avoid the COPY command, which is better suited to CSV-format files placed in external stages.

Instead, it is recommended to create external tables on top of the Parquet files and enable the “vector scan performance flag” scan property.

However, to enable this property, you need to contact Snowflake support and have it enabled at the account level.
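
As a rough sketch of the external-table approach (the stage, storage integration, table, and column names below are illustrative, not from this article), the setup could look like this:

```sql
-- Illustrative external stage over an S3 path holding Parquet files.
CREATE OR REPLACE STAGE my_parquet_stage
  URL = 's3://my-bucket/events/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (TYPE = PARQUET);

-- External table exposing the Parquet files without copying them first.
-- Parquet columns are read from the VALUE variant of each row.
CREATE OR REPLACE EXTERNAL TABLE events_ext (
  event_id   VARCHAR        AS (value:"event_id"::VARCHAR),
  event_time TIMESTAMP_NTZ  AS (value:"event_time"::TIMESTAMP_NTZ)
)
WITH LOCATION = @my_parquet_stage/
AUTO_REFRESH = FALSE
FILE_FORMAT = (TYPE = PARQUET);

-- Query or INSERT ... SELECT from the external table; the vectorized Parquet
-- scan mentioned above is an account-level flag enabled by Snowflake support.
SELECT COUNT(*) FROM events_ext;
```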

Parallel Operations

To improve the efficiency of loading data from stages into Snowflake, it is recommended to split the data into compressed files of roughly 100-250 MB each.

By default, each node can handle 8 threads in parallel. Therefore, if you have 32 files, a Medium warehouse with 4 nodes can process all of them simultaneously, running 32 threads in parallel. 

It’s important to keep in mind that the performance of parallel operations in Snowflake can also be affected by the complexity of the operations being performed. Additionally, Snowflake may not always run the threads as defined in parallel due to various factors such as resource availability and load balancing. Therefore, it’s important to monitor and adjust the parallelism accordingly to ensure optimal performance.

Also keep in mind that a warehouse is billed for a minimum of about a minute each time it resumes, so it is worth ensuring that the data-loading warehouse has at least a minute’s worth of work to make full use of that billed time.
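
For instance, a split-and-load pass over roughly 32 compressed chunks might look like the following sketch (the warehouse, table, stage, and file names are illustrative):

```sql
-- A Medium warehouse has 4 nodes, i.e. roughly 32 parallel load threads.
ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'MEDIUM';

-- Load all ~100-250 MB gzipped chunks in one COPY statement;
-- Snowflake distributes the matched files across the warehouse's threads.
COPY INTO sales
  FROM @my_csv_stage/sales/
  PATTERN = '.*sales_chunk_[0-9]+[.]csv[.]gz'
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');
```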

Purging Files From Stage

To optimize performance, remove files from the stage after they are successfully loaded by running the COPY command with the PURGE = TRUE option.

This deletes the staged files once they are loaded, which not only improves performance but also spares subsequent COPY commands from scanning the entire bucket path to find new files to load.
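
A minimal sketch (the table and stage names are illustrative):

```sql
-- Staged files are deleted automatically after they load successfully.
COPY INTO sales
  FROM @my_csv_stage/sales/
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
  PURGE = TRUE;
```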

Loading Large Files in Snowflake

Suppose you need to load a single large file, gigabytes in size, that may contain bad data in some of its rows. In such cases, it is advisable to load the data while ignoring the errors. Failing to do so can waste credits: for example, if an error row is encountered an hour into processing, the entire operation is aborted and the credits consumed up to that point are wasted.

Setting ON_ERROR = 'CONTINUE' loads the good rows and skips the bad ones. However, it is always preferable to split large files into smaller chunks so that a warehouse can process them in parallel. If that is not possible and you still want to load a single large file, first confirm that it is acceptable to continue loading while ignoring the bad data.
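
A sketch of that pattern (the table, stage, and file names are illustrative):

```sql
-- Load a single large file, skipping rows that fail parsing instead of aborting.
COPY INTO sales
  FROM @my_csv_stage/big/
  FILES = ('sales_full.csv.gz')
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
  ON_ERROR = 'CONTINUE';

-- Review the rows rejected by the most recent COPY into this table.
SELECT * FROM TABLE(VALIDATE(sales, JOB_ID => '_last'));
```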

Best Practice While Loading JSON Data

When loading JSON data into Snowflake, it’s important to note that the output from some applications can be a single large array consisting of multiple rows combined together. However, due to the limitations of the VARIANT datatype (which has a maximum size of 16 MB), such a huge JSON output cannot be loaded in its entirety. 

Moreover, it’s not appropriate to load a single huge array as a single row. Instead, the array should be split into multiple rows. To achieve this, use the STRIP_OUTER_ARRAY=TRUE option in the FILE_FORMAT parameter of the COPY command. This option excludes the outer array and identifies individual rows, allowing for efficient loading of JSON data into Snowflake.
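
For example (the table and stage names are illustrative):

```sql
-- Land each element of the outer JSON array as its own row in a VARIANT column.
CREATE OR REPLACE TABLE raw_events (payload VARIANT);

COPY INTO raw_events
  FROM @my_json_stage/events/
  FILE_FORMAT = (TYPE = JSON STRIP_OUTER_ARRAY = TRUE);
```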

Snowpipe File Loading Best Strategies

Loading data with Snowpipe incurs charges both for the compute time taken to load each file placed on the stage and for the overhead of maintaining the queue of files waiting to be loaded. If a large number of files of varying sizes land in the stage location continuously, loading can slow down and costs can increase.

To keep loading efficient, keep file sizes within the 100-250 MB range and avoid sudden surges of files in the stage location, which lengthen the queue.

You can optimize the loading process by placing the files in the stage at a certain frequency, which is close to the time it takes to load a single file using Snowpipe. This will help ensure efficient utilization of resources and reduce costs.
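
As a sketch, a basic auto-ingest pipe and a quick check on its backlog might look like this (the pipe, table, and stage names are illustrative, and AUTO_INGEST assumes cloud event notifications, such as S3 event notifications, are already wired up):

```sql
CREATE OR REPLACE PIPE sales_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO sales
    FROM @my_csv_stage/sales/
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Returns JSON including pendingFileCount; a steadily growing backlog suggests
-- files are arriving faster than the pipe can drain its queue.
SELECT SYSTEM$PIPE_STATUS('sales_pipe');
```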

Make Use of the Internal Stage Wherever Applicable

To optimize data loading in Snowflake, it is recommended to use the internal stage instead of external stages such as S3, as it results in faster loading to the Snowflake table. 

It is also important to compare the cost of storing data in Snowflake's internal stage with the cost of the cloud provider's object storage. To reduce costs, purge the data from the stage once it is no longer needed.
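
A minimal sketch of the internal-stage flow (the stage, table, and file names are illustrative; the PUT command runs from a client such as SnowSQL):

```sql
CREATE OR REPLACE STAGE my_internal_stage;

-- From SnowSQL on the client machine; PUT compresses and uploads the local file:
-- PUT file:///data/sales_2023_04.csv @my_internal_stage AUTO_COMPRESS = TRUE;

-- Load and purge the staged copy in one step.
COPY INTO sales
  FROM @my_internal_stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
  PURGE = TRUE;

-- Or clean up explicitly later if PURGE was not used.
REMOVE @my_internal_stage PATTERN = '.*[.]csv[.]gz';
```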

Thank you for reading this article. In the upcoming installment of this series, we will explore the best practices for optimizing the use of Snowflake warehouses and tables.


Published at DZone with permission of Kedhar Natekar. See the original article here.

Opinions expressed by DZone contributors are their own.
