Best Practices and Hacks: Data Loading in Snowflake From Stages
This second article of the series will help you to understand the best practices for file-based data loading in Snowflake.
Join the DZone community and get the full member experience.Join For Free
Continuing our Snowflake blog series, after learning about setting up a Snowflake account using System-defined roles, we will explore the best practices for loading data from a file into Snowflake.
Snowflake supports file-based data ingestion through internal and external stages. However, there are various factors to consider when performing data ingestion, including the frequency of data arrival, file sizes, and the data loading techniques used such as copy command, external tables, and Snowpipe, among others. Additionally, the file format used, such as CSV, Parquet, JSON, etc., also plays a critical role in choosing the right approach.
Making the wrong choice can result in increased costs and slower performance.
This blog provides insights into these approaches to help you select the best one while loading data into Snowflake.
Faster Loading of Parquet Files From S3
If you want to load Parquet files from S3 faster into Snowflake, you should not use the COPY command, which is suitable for CSV format files placed in external stages.
Instead, it is recommended to use external tables on top of the Parquet file and enable the “vector scan performance flag” scan property.
However, to enable this property, you need to contact Snowflake support and have it enabled at the account level.
To improve the efficiency of loading data from stages into Snowflake, it is recommended to create files in the range of 100-250MB with compression.
By default, each node can handle 8 threads in parallel. Therefore, if you have 32 files, a Medium warehouse with 4 nodes can process all of them simultaneously, running 32 threads in parallel.
It’s important to keep in mind that the performance of parallel operations in Snowflake can also be affected by the complexity of the operations being performed. Additionally, Snowflake may not always run the threads as defined in parallel due to various factors such as resource availability and load balancing. Therefore, it’s important to monitor and adjust the parallelism accordingly to ensure optimal performance.
It is important to ensure that the warehouse used for data loading runs for at least a minute to fully utilize the cost for active time.
Purging Files From Stage
To optimize performance, it is recommended to remove files from the stage after successful loading using the COPY command with the PURGE=True option.
This ensures that the staged files are deleted once loaded, which not only improves performance but also eliminates the need for COPY commands to scan the entire bucket path to check for files to load.
Loading Large Files in Snowflake
Suppose you need to load a single large file in gigabytes, which may or may not contain bad data in some of its rows. In such cases, it is advisable to load the data by ignoring the errors. Failing to do so can result in the wastage of credits. For example, if the data is being processed using a warehouse and an error row is identified after an hour, the entire operation will be aborted, leading to credit wastage.
Using “ON_ERROR” as “CONTINUE” will load the good data and ignore the bad rows. However, it is always recommended to load large files by splitting them into smaller chunks so that parallel processing can be utilized using a warehouse. If that’s not possible and you still want to load a large file, it’s recommended to check if it’s okay to continue loading by ignoring the bad data.
Best Practice While Loading JSON Data
When loading JSON data into Snowflake, it’s important to note that the output from some applications can be a single large array consisting of multiple rows combined together. However, due to the limitations of the VARIANT datatype (which has a maximum size of 16 MB), such a huge JSON output cannot be loaded in its entirety.
Moreover, it’s not appropriate to load a single huge array as a single row. Instead, the array should be split into multiple rows. To achieve this, use the STRIP_OUTER_ARRAY=TRUE option in the FILE_FORMAT parameter of the COPY command. This option excludes the outer array and identifies individual rows, allowing for efficient loading of JSON data into Snowflake.
Snowpipe File Loading Best Strategies
Loading data using Snowpipe can result in charges for the time taken to load the file placed on the stage and the overhead of maintaining the queue for files waiting to be loaded. This means that if a large number of files with different sizes are continuously loaded into the stage location, it could lead to slower loading of the data and increased expenses.
To ensure effective loading, it is best to ensure that the file sizes are within the limits of 100-250MB and that there is no sudden surge of files in the stage location, which could increase the queue.
You can optimize the loading process by placing the files in the stage at a certain frequency, which is close to the time it takes to load a single file using Snowpipe. This will help ensure efficient utilization of resources and reduce costs.
Make Use of the Internal Stage Wherever Applicable
To optimize data loading in Snowflake, it is recommended to use the internal stage instead of external stages such as S3, as it results in faster loading to the Snowflake table.
It is also important to compare the costs of storing data in the stage location of Snowflake with that of the object storage of cloud providers. To reduce costs, it is advisable to purge the data from the stage when it is no longer needed.
Thank you for reading this article. In the upcoming installment of this series, we will explore the best practices for optimizing the use of Snowflake warehouses and tables.
Published at DZone with permission of Kedhar Natekar. See the original article here.
Opinions expressed by DZone contributors are their own.
Microservices: Quarkus vs Spring Boot
What Is mTLS? How To Implement It With Istio
Write a Smart Contract With ChatGPT, MetaMask, Infura, and Truffle
AI Technology Is Drastically Disrupting the Background Screening Industry