Criteria for Selecting a Data Warehouse Platform
Criteria for Selecting a Data Warehouse Platform
There are many possible data warehouse choices, and introducing the correct platform really does set the parameters for your company’s information culture.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
I loved this line from an article I recently stumbled upon:
“Choosing between the different types of data warehouse platforms can be simplified once you know which deployment option best meets your project requirements.”
It’s not that it isn’t true. But kid yourself not: simplified and simple are not the same thing.
It remains mind-bogglingly complex and tedious to squeeze actionable insights out of the abundant wealth of data that is presently available. That’s why it is essential to make good choices from the outset when selecting a data warehouse platform. As the knight told Indiana Jones when he had to select the Holy Grail: “Choose wisely.” Whether implementing a new data warehouse solution or expanding on your existing one, you need to choose the best option available.
If you are expanding your existing data warehouse, you need to compare your current solution to its competitors’ to see if other vendors offer more relevant features or are better in terms of performance. The choice is even more complicated if you are a first-time user, as you have no prior experience to base your selection on.
In any case, the magic happens in that sweet spot where cost, performance, and simplicity are perfectly balanced according to your needs. Keep in mind, though, that — as the case is with most technologies — whatever you choose today will probably be outdated sooner than you expect, so make sure to reassess your options on the ongoing basis.
Criteria for Selecting Your Perfect Data Warehouse
While there is no one, generic “right” answer, there are better and worse choices for every specific use case. And there is much to be lost by choosing poorly. To avoid excruciating pain of being stuck with a poorly fitted solution, I recommend using the following criteria for evaluating data warehouse platforms and vendors.
First of all, let’s get the cloud vs. on-prem question out of the way. Many companies mistakenly believe that DWaaS (data warehouse as a service) should be low on your list because speed constraints are imposed by network latency with cloud access. This leads many to mistakenly go for on-premises deployment. In fact, many issues from security to scalability and flexibility in changing node types are inherently suboptimal in an on-premise solution.
For most (especially mid-sized users), superior performance and availability can be achieved with leading cloud data warehouse providers. I truly believe that unless strict compliance requirements prohibit the DWaaS option, most companies are better of with the cloud vendor when it comes to their data warehouse and general analytics infrastructure needs.
However, it is a common mistake to believe that cloud solutions do not require extensive in-house tuning and administration. Anyone who's ever dealt with data management in the cloud knows that the tasks involved are complicated and ongoing. That being said, relative to on-premise solutions, it’s a walk in the park.
Cloud Vendors: Redshift Is in the Lead
Panoply, Periscope Data, and many others have run extensive performance testing between different cloud technologies. In most cases, AWS Redshift came out on top, but in some categories, Google BigQuery or Snowflake had the upper hand.
Panoply ran a performance benchmark test comparing Redshift vs. BigQuery. We discovered that contrary to previous findings, which didn’t take optimization into account, when reasonably optimized, Redshift outperforms BigQuery in 9 out of 11 use-cases hands down. The only instance where BigQuery had demonstrated superior performance was in big join operations.
After investigating Redshift, Snowflake, and BigQuery, Periscope data also declared Redshift as a clear winner when it comes to both price and performance. They found that Redshift is the best choice for real-time query speeds on their customers’ typical data volumes.
For a company building on a massive growth, the infrastructure scalability in the cloud should be measured in terms of costs, resources, and simplicity. Most infrastructure cloud providers provide an "easy" way to scale your cluster, while some, like Google BigQuery, scale seamlessly in the background.
BigQuery’s most significant advantage, in my opinion, is the seamless and fast resizing of a cluster up to petabyte scale. Unlike Redshift, there is no need to constantly track and analyze the cluster size and growth in an effort to optimize its size to fit the current dataset requirements. However, looking at the analysis performed by Panoply and Periscope data, when the cluster is properly optimized, Redshift shows a highly competitive pricing in comparison to BigQuery:
“At seven cents a query, it would cost around $70/day for each customer. We could get faster speeds at a lower price point using an 8 node dc1.large Redshift cluster for each customer at $48/day, so migrating to BigQuery would not be cost-effective for us.”
In addition, Redshift scalability allows users to enjoy an increase in performance when resources including memory and I/O capacity increase. Panoply seamlessly scales Redshift users’ cloud footprints according to the amount of data and queries and the complexity of the queries. It scales the cluster on demand, ensuring the data warehouse performance is well balanced with costs.
Panoply analysis shows that estimating query and data volume costs with BigQuery is complex. This leads to unpredictable fees feeding into users’ uncertainty regarding the costs involved, leading them to try to limit queries and data volume, all of which can negatively impact on the organization’s data analytics capabilities.
This complexity of cost calculation was somewhat solved with Snowflake's solution of bundled CPU pricing, but again, foreseeing your query needs in advance is a challenge remaining unanswered.
The leaders in cloud infrastructure technologies Amazon, Google, and Microsoft are all generally reliable, especially when compared to the on-premise option where many more factors in the chain rely on you. That being said, no matter how reputable the vendor, as the recent AWS S3 outage shows, even the best vendors can have bad days. You need to look not only at the frequency of such incidents (obviously, the less the better) but also, how quickly and thoroughly the vendor responded to the downtime.
Reliable and professional support is one of the main criteria to consider when choosing your DWaaS platform. In my opinion, none of the vendors truly provide a good enough SLA for solving today's on-demand support needs for data savvy customers. This shortcoming is one of the reasons Panoply provides a data architect that is dedicated to each account; a real person responsible for taking care of your real data needs.
Usability, Security, and Integration
As your data grows, the number of data sources increases and data logic becomes more complex, you’ll also want to add management features and functions such as DBA productivity tools, monitoring utilities, locking schemes, and other security mechanisms, remote maintenance capabilities, and user chargeback functionality into your infrastructure. The ability to change data types and implement new tables and indexes at will can sometimes be a lengthy process and taking this into account beforehand can prevent agonizing pain in the future.
When ingesting data into your analytics architecture it is important to assess the type of methodology to do so. The difference between the correct ingestion methodology and the wrong one can be the difference between data loss and rich data and the difference between a well-organized schema and a data swamp.
For instance, Snowflake supports simultaneous users’ queries through different virtual warehouses. According to Periscope Data, you can:
“...have your overnight ETL processes run on slower and less expensive warehouse resources, and then enable real-time ad-hoc queries through a more powerful warehouse during business hours.”
However, with Redshift scale and operational efficiency, ETL can be referred to as a rigid and outdated paradigm.
That is why Panoply follows the ELT process, whereby all of the raw data is instantly available in real-time and the transformations happen asynchronously on query time. This makes Panoply both a data lake and a data warehouse, allowing users to have constant and real-time access to their raw data. This means they can iterate their transformations in real-time, with updates instantly applied on newly inserted data as well. Finally, customized, advanced transformations are also possible via the Panoply UI console and take just minutes to set up and run.
Supported Data Types
Consider your needs carefully. A polyglot approach involves multiple data platform types. These range from relational and analytical databases to NoSQL DBMSs and new platforms, such as Spark and Hadoop. While this adds complexity, it also provides data warehouse users with the ability to couple historical BI with more forward-looking predictive analytics and data mining. Very important from a BI perspective.
Backup and Recovery
BigQuery automatically replicates data to ensure its availability and durability. However, complete loss of data due to a disaster is less common than the need for fast, instant restore of a specific table or even a specific record. For both purposes, Redshift automatically stores backups to S3 and enables you to revisit data at any point in time over the previous ninety days. In all cases, retrieval includes a series of actions that can make the instant recovery a cumbersome and lengthy operation.
Since Panoply is powered by Redshift, backup to S3 is obvious, but we take it one step further. Leveraging Panoply’s Revision History tables, users keep track of every change to any of their database rows right within their data warehouse, making it immediately available to analysts with simple SQL queries. This makes file uploading to S3 and database extraction redundant when needing to travel back to any point in time and quickly see how the data has changed.
It is frequently beneficial to stay within a common ecosystem. For midsize organizations where a mixture of flexibility and simplicity is important, it is often worth staying with a single supplier that offers compatible technology across different platforms.
Google Amazon and Microsoft all have amazing ecosystems. This is the main reason why you rarely see a company using Redshift combined with Google infrastructure and why major providers are spending so much money and effort trying to migrate companies from their current providers into their ecosystems
Regarding the foundational decision of data warehouse platform, it should be clear that there are many possible choices, and introducing the correct platform really does set the parameters for your company’s information culture. Good luck and choose wisely!
Published at DZone with permission of Yaniv Leven . See the original article here.
Opinions expressed by DZone contributors are their own.