Optimal Use of Snowflake Warehouse and Tables

The third article of this series presents a deep-dive into how efficiently we can use Snowflake tables and warehouses and how data sharing occurs across accounts.

By Kedhar Natekar · May 26, 2023 · Tutorial

In the previous blog, we discussed the best practices to follow while loading data into Snowflake from stages.

Continuing the Snowflake blog series, let's understand how to use Snowflake warehouses and tables optimally.

Snowflake Virtual Warehouses

Virtual warehouses are one of the critical components of the Snowflake architecture, and choosing the right configuration for them can save a lot of Snowflake credits.

Below are some of the best practices to consider when selecting the configuration for a warehouse.

How To Decide Which Warehouse Size Is Optimal

  • For simple queries, it’s preferable to use X-Small or Small warehouses.
  • For complex queries, or when large datasets need to be scanned, use larger warehouses.
  • Query performance improves roughly linearly as the warehouse size increases until optimal performance is reached; beyond that point, there is no significant difference in performance.
  • To determine the optimal warehouse for complex queries, it is recommended to test with different warehouse sizes and note the query execution time for each (see the sketch after this list).
  • If you are running a single query on the warehouse, it is best to choose a warehouse size on which the query runs for at least a minute to optimize the cost (Snowflake bills a minimum of 60 seconds each time a warehouse resumes).
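As a minimal sketch of that comparison, assuming two test warehouses (WH_TEST_S and WH_TEST_M) and a representative query against a hypothetical table MY_DB.MY_SCHEMA.MY_LARGE_TABLE, execution times can then be compared from the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view:

SQL
 
-- Create two test warehouses of different sizes (names are examples)
create warehouse if not exists WH_TEST_S with warehouse_size = 'SMALL'  auto_suspend = 60 initially_suspended = true;
create warehouse if not exists WH_TEST_M with warehouse_size = 'MEDIUM' auto_suspend = 60 initially_suspended = true;

-- Run the same representative query on each warehouse
use warehouse WH_TEST_S;
select count(*) from MY_DB.MY_SCHEMA.MY_LARGE_TABLE;

use warehouse WH_TEST_M;
select count(*) from MY_DB.MY_SCHEMA.MY_LARGE_TABLE;

-- Compare elapsed times per warehouse (ACCOUNT_USAGE views can lag by up to ~45 minutes)
select warehouse_name, warehouse_size, total_elapsed_time / 1000 as elapsed_sec
from SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
where query_text ilike 'select count(*) from MY_DB.MY_SCHEMA.MY_LARGE_TABLE%'
order by start_time desc;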

Auto Suspension of Warehouse

Auto Suspend allows warehouses to be suspended automatically when they are not in use, thereby saving cost.

The Auto Suspend property can be disabled when:

  • There are frequent, steady workloads on the warehouse 24/7.
  • You need the virtual warehouse readily available at all times so that queries can be served quickly from the cache.

Note that disabling auto suspend can lead to heavy billing, so choose wisely.

The general practice is to keep auto suspension enabled so that you only pay for the active time and not for the idle time.

By default, the AUTO_SUSPEND time interval is 600 seconds.

This may not be optimal if, say, you run a query once every 10 minutes and the query execution time is 1 minute.

In such cases, it’s always better to set the AUTO_SUSPEND time interval according to your needs.

The time interval for Auto Suspend can be decided based on the following factors (a sketch of adjusting it follows this list):

  1. The time interval between two subsequent queries executed on the warehouse.
  2. The average query execution time.
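As a minimal sketch, assuming a warehouse named MY_WH (a hypothetical name), the interval can be adjusted, or auto suspend disabled, with ALTER WAREHOUSE:

SQL
 
-- Suspend automatically after 60 seconds of inactivity
alter warehouse MY_WH set auto_suspend = 60;

-- Disable auto suspend entirely (use with caution: the warehouse keeps consuming credits until suspended manually)
alter warehouse MY_WH set auto_suspend = null;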

Cost vs. Performance While Setting the Auto_Suspend Limit

Let’s assume that there is a recurring pattern of executing similar queries every 75 seconds, with an average query execution time of 10 seconds, and the warehouse has been set to AUTO_SUSPEND after 60 seconds.

In such cases, below is what would happen:

  0th sec   → Query is initiated and starts the warehouse
  10th sec  → Query is executed successfully
  70th sec  → Warehouse auto suspends
  75th sec  → Query is initiated and starts the warehouse
  85th sec  → Query is executed successfully
  145th sec → Warehouse auto suspends
  150th sec → Query is initiated and starts the warehouse
  160th sec → Query is executed successfully
  220th sec → Warehouse auto suspends

And so on…

If you notice, the AUTO_SUSPEND of 60 seconds is not benefiting us here when we consider the cost vs. performance trade-off.

The total uptime of the warehouse in the above case is 210 sec.

Total uptime, if AUTO_SUSPEND was disabled, would have been 220 sec.

But the benefit of disabling AUTO_SUSPEND in this given scenario would have been faster query processing time.

Every time the warehouse is restarted, the data has to be fetched from remote storage into the local cache before the query is processed.

But if AUTO_SUSPEND were disabled, since the queries are similar, they could be served from the local disk cache, resulting in faster query performance: perhaps a couple of seconds instead of 10.

And if the exact same query were reissued and there had been no data change, the output would be returned in milliseconds, directly from the result cache.

So, always consider the trade-off between saving credits by suspending a warehouse versus maintaining the cache of data from previous queries to help with performance.
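To put numbers on this trade-off, the credits consumed by each warehouse can be reviewed from the SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY view; a minimal sketch:

SQL
 
-- Credits consumed per warehouse over the last 7 days
select warehouse_name,
       sum(credits_used) as total_credits
from SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
where start_time >= dateadd(day, -7, current_timestamp())
group by warehouse_name
order by total_credits desc;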

Now that we understand the costs associated with warehouses in Snowflake, let’s take a look at how data storage in Snowflake affects the overall billing.

Data Storage Costs Over Snowflake Internal Tables

Data storage costs are frequently overlooked in Snowflake as they are considered inexpensive. However, it is crucial to carefully consider the type of tables to create in Snowflake, taking into account the costs associated with time travel, fail-safe, data sharing, and cloning strategies. 

This understanding will aid in developing effective approaches for managing internal tables.

If you come from an RDBMS background, you may assume that executing “create table” in Snowflake will create a normal table. However, this is not the case. Instead, it creates a permanent table with time travel enabled, which can lead to increased costs if such a table is not needed. Any inserts, updates, and deletes on these tables count toward data storage, and with frequent DML operations, the size of the time travel data can grow in no time.

Hence, if you have not decided which kind of table you need to create, then always use:

CREATE TRANSIENT TABLE instead of CREATE TABLE

This has to be communicated to developers, as the general habit is to always use CREATE TABLE.

For a normal table that is large in size and has high churn, the costs can grow exponentially.

Note: not all tables need time travel, so use CREATE TABLE wisely.
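As an illustration, here is a minimal sketch of both options using hypothetical table names; the DATA_RETENTION_TIME_IN_DAYS parameter controls the time travel window (0 or 1 day for transient tables, up to 90 days for permanent tables on Enterprise Edition):

SQL
 
-- Transient table: no fail-safe, time travel limited to 0 or 1 day
create transient table STAGE_ORDERS (order_id number, amount number)
  data_retention_time_in_days = 0;

-- Permanent table with an explicit 7-day time travel window instead of the account default
create table DIM_CUSTOMER (customer_id number, customer_name string)
  data_retention_time_in_days = 7;

-- Check the retention currently in effect for a table
show parameters like 'DATA_RETENTION_TIME_IN_DAYS' in table DIM_CUSTOMER;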

For Example

Suppose we have a table that is 200GB in size and receives frequent updates. This table is configured for time travel, specifically for 90 days, and it’s assumed that each record in the table undergoes an update at least 20 times during this period. After the 90-day period, the table will be moved to Fail Safe storage by Snowflake, where it will be stored for 7 days.

Hence, below are the storage stats for the table:

[Image: storage stats for the permanent table with 90-day time travel]

Although the table itself is only 0.2 TB, the total storage billed is 32.2 TB when time travel is enabled.

And below is the case if the same table had been a transient table with 0 days of time travel:

[Image: storage stats for the same table as a transient table with 0 days of time travel]

Although you can enable time travel for up to 90 days, choose the number of days that suits your needs.

For example, in production, if I know that any data-related issue can be identified and fixed within 7 days, then I’d set the time travel period to 7 days.

Even in the worst case, if the issue persists for more than 7 days, you can contact Snowflake support and get a copy of the Fail-safe data.

If you are going with the transient table approach for critical tables, then the best practice is to always keep a backup at periodic intervals.

Although the backup table costs the same as the actual table, the total cost of both combined would still be much lower than that of a table with time travel.

Use Cloning Instead of CTAS in Snowflake

Cloning in Snowflake is a powerful concept and can save a lot of cost if used well.

Typical use cases would be:

  • Creating a copy of a table, for example, to debug an issue.
  • Creating a backup copy of an existing table.

When a table is cloned, the micro-partitions are shared between the main table and the cloned table as of the time of cloning.

CTAS queries, which are used in many databases, duplicate the data; with cloning, the underlying data in the form of micro-partitions remains shared, which saves storage cost.

If any DML is performed on the main table or the cloned table after cloning, the new micro-partitions are not shared.

Hence, the best practice is to clone tables wherever needed rather than using CTAS queries (see the sketch below). Similarly, cloning can be done at the database and schema levels, also saving a lot of cost.
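A minimal sketch of the difference, using hypothetical object names; the clone shares micro-partitions with its source at the moment of cloning, while CTAS writes a full physical copy:

SQL
 
-- Zero-copy clone: no data is duplicated at clone time
create table SALES_DEBUG clone SALES;

-- CTAS: duplicates all micro-partitions and doubles the storage
create table SALES_COPY as select * from SALES;

-- Cloning also works at the schema and database level
create schema ANALYTICS_BACKUP clone ANALYTICS;
create database REPORTING_BACKUP clone REPORTING;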

Data Sharing To Share the Data Across Accounts

Cloning an object is not possible across accounts, so we tend to opt for replicating objects between accounts instead.

Some use cases here could be:

The production database is in Snowflake Account A1 under Org O.

The dev database is in Snowflake Account A2 under the same Org O.

And you need to test the dev data engineering pipeline with the same source tables as in the Production account.

Now, since cloning source tables across accounts is not possible, in such cases we can opt for data sharing between accounts.

Let's see how this works:

In the production account, let’s assume that we have a database PAYER_PROD and a schema PAYER_ANALYTICS_MASTER, inside which we have a source table AWSCUR_MAIN that needs to be shared with the development account.

Then follow the steps below.

In the production account:

SQL
 
-- Use AccountAdmin role
use role ACCOUNTADMIN;

-- Creates a Share object
create share AWSCUR_MAIN_NON_PROD;

-- Grants necessary privileges to the share

grant usage on database PAYER_PROD to share AWSCUR_MAIN_NON_PROD;
grant usage on schema PAYER_PROD.PAYER_ANALYTICS_MASTER to share AWSCUR_MAIN_NON_PROD;
grant select on table PAYER_PROD.PAYER_ANALYTICS_MASTER.AWSCUR_MAIN to share AWSCUR_MAIN_NON_PROD;


-- Add accountid of the Development snowflake to the share
alter share AWSCUR_MAIN_NON_PROD add accounts=<dev_account_id>;


In the development account:

SQL
 
-- Use AccountAdmin role
use role ACCOUNTADMIN;

-- Create a database out of the shared object
CREATE DATABASE PAYER_PROD FROM SHARE <<orgid>>.<<accountid>>."AWSCUR_MAIN_NON_PROD";

-- Grant the database to respective roles in development account
GRANT IMPORTED PRIVILEGES ON DATABASE PAYER_PROD TO ROLE "<<rolename>>";


Since data is shared between accounts and not replicated, there are no additional storage costs associated with it. Only metadata is shared.

Any updates on production data will reflect in the dev Snowflake database at no additional cost.

The only cost is for querying the data in the dev environment, which is billed per warehouse usage.
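To verify the setup on either side, the share and its grants can be inspected with standard commands (object names are from the example above):

SQL
 
-- On the provider (production) account: list shares and inspect what this one exposes
show shares;
desc share AWSCUR_MAIN_NON_PROD;

-- On the consumer (development) account: confirm the imported database exists
show databases like 'PAYER_PROD';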


Published at DZone with permission of Kedhar Natekar. See the original article here.

Opinions expressed by DZone contributors are their own.
