Choosing Between Modern Data Warehouses
Choosing Between Modern Data Warehouses
Consider data volume, on-premises vs. cloud, scalability, and pricing in this comparison of Redshift, BigQuery, and Snowflake.
Join the DZone community and get the full member experience.Join For Free
Download the Scale-Out and High Availability whitepaper. Learn why leading enterprises choose the Couchbase NoSQL database over MongoDB™ after evaluating side by side.
When our customers ask us what the best data warehouse is for their growing company, we consider the answer based on their specific needs. Usually, they need near-real-time data for a low price without the need to maintain data warehouse infrastructure. In this case, we advise them to use modern data warehouses such as Redshift, BigQuery, or Snowflake.
Most of the modern data warehouse solutions are aimed at working with raw data that makes them big data-compatible. This approach allows you to redefine your analytics on the fly without needing to reingest your data stored in a warehouse.
In this post, we would like to dig deeper into the factors to consider while choosing a data warehouse. They are:
- Volume of data
- Dedicated human resources for the support and maintenance
- Scalability (horizontal vs. vertical)
- Pricing models
You need to know the estimates of the volume of data that you will be dealing with. If it’s a clear-cut scenario where you are working with datasets ranging in the range of hundreds of TBs or in petabytes, then it is strongly advised to go for non-relational databases. The architecture of such databases that supports working with huge datasets is engrained in their DNA.
On the other hand, many relational databases have really great time-proven query optimizers. You can consider them as an option for an analytics warehouse as soon as your dataset fits into one single node.
Let’s look at some mathematics relevant to dataset size:
- The sweet spot of Postgres, MySQL, MSSQL, and many other RDBMSs is up to 1TB of data involved in analytics. If this size is exceeded, you may experience degraded performance.
Amazon Redshift, Google BigQuery, Snowflake, and Hadoop-based solutions support a dataset size up to multiple petabytes in an optimal manner.
On-Premises vs. Cloud
Another important aspect to evaluate is whether you have any dedicated resources for the maintenance, support, and fixes for your database, if any. This facet plays an important role in the comparison.
If you have dedicated resources for support and maintenance, you have earned yourself a lot more options in choosing the database.
You can opt for creating your own big data warehousing options based on something like Hadoop or Greenplum. These systems do require significant setup, maintenance engineering resources, and skilled personnel.
But if you don’t have any dedicated resources for maintenance, then you limit your options a bit. We advise going for a modern data warehouse solution like Redshift, BigQuery, or Snowflake. As an administrator or user, you don’t need to worry about deploying, hosting, resizing VMs, handling replication, or encrypting. You can start using it by issuing SQL commands.
When you start working with a database, you expect it to be scalable enough to support your further growth. Broadly, database scalability can be achieved in two ways: horizontally or vertically.
Horizontal scalability refers to the addition of more machines, whereas vertical scalability means the addition of resources into a single node to increase its capability.
Redshift provides easy scalability options. With a few mouse clicks, you can increase the number of nodes and configure them to meet your needs. Redshift scales very well until you hit about 100TB of data that is processed at once in a query. The computing capacity of your Redshift cluster will always rely on the count of nodes in your cluster, unlike some other data warehouse options.
We reach a point where solutions such as BigQuery come into play. There’s virtually no cluster capacity, as BigQuery can allocate up to 2,000 slots, which is the equivalence of nodes in Redshift. Also, due to this multi-tenancy strategy, even when customers’ concurrency demands grow, BigQuery scales seamlessly with those demands and can go over this limitation of 2,000 slots if required.
BigQuery relies on Colossus, which is Google’s latest generation distributed file system. Colossus allows BigQuery users to scale to dozens of petabytes in storage seamlessly without paying the penalty of attaching much more expensive computing resources.
Snowflake is built on Amazon S3 cloud storage and its storage layer holds all the diverse data, tables, and query results. Because this storage layer is engineered to scale completely independently of computing resources, it ensures that maximum scalability can be achieved effortlessly for big data warehousing and analytics.
In addition to this, Snowflake offers multiple virtual warehouses, at nearly any scale and concurrency, which can simultaneously operate on the same data while fully enforcing global system-wide transactional integrity and keeping it scalable.
If you go with the self-hosted option like Hadoop, your pricing will consist mostly of VM or hardware bills. AWS provides an EMR solution, which is a cost-effective option to consider when going with Hadoop stuff.
Drilling down further into RedShift vs. BigQuery vs. Snowflake, all of them offer on-demand pricing, but each one comes with its own unique pricing model flavor.
Amazon RedShift offers three pricing models:
- On-demand pricing: No upfront commitments and cost; you simply pay an hourly rate depending upon the types and number of nodes in your cluster. Here, one important factor often ignored is that the rates vary depending upon region. The rates cover both computation and data storage.
- Spectrum pricing: You merely pay for the bytes scanned while querying against Amazon S3.
- Reserved instance pricing: If you are sure that you will be running on RedShift for at least a few years, then you can save up to 75% over on-demand rates by opting for reserved instance pricing.
Google BigQuery offers scalable, flexible pricing options and charges for data storage, streaming inserts, and querying data, but loading and exporting data are free of charge. The pricing strategy of BigQuery is quite unique, as it is based on a rate per GB for storage and a bytes-scanned rate for queries. Also, it offers cost control mechanisms that enable you to cap your daily costs to an amount that you choose. It offers a long-term pricing model, as well.
Snowflake offers on-demand pricing, which is similar to BigQuery and Redshift Spectrum. Unlike BigQuery, computational usage is billed on a per-second basis rather than bytes scanned, with a minimum of 60 seconds. Snowflake decouples data storage from computation, and hence the billing is individual for both of them.
The storage rate starts at around $40/TB/month for the standard edition and remains the same for other editions. On the other hand, for compute the rate starts at $2.00 per hour for standard and goes up to $4.00 per hour for the enterprise edition.
The general advice for choosing a data warehouse, which we usually provide to our customers, is the following:
- Use an index-optimized RDBMS such as Postgres, MySQL, or MSSQL when data volumes are much less than 1TB of data in total and much less than 500M rows per analyzed table, and the whole database can fit into a single node.
- Use modern data warehouses like Redshift, BigQuery, or Snowflake when your data volume is between 1TB and 100TB. Also consider Hadoop with Hive, Spark SQL, or Impala as a solution if you have access to this expertise and you can allocate dedicated human resources to support that.
- When your data volume is over 100TB, use BigQuery, Snowflake, Redshift Spectrum, or the self-hosted Hadoop equivalent solution.
Published at DZone with permission of Pavel Tiunov . See the original article here.
Opinions expressed by DZone contributors are their own.