A Cloud Data Warehouse Comparison
A Cloud Data Warehouse Comparison
Find the right cloud data warehouse platform for your big data management needs.
Join the DZone community and get the full member experience.Join For Free
Snowflake, Panoply, and Repods are cloud services that allow you to ingest, process, store, and access data in a managed cloud infrastructure. In particular, each service provides compute and storage resources for data. These are considered the defining features of what we call a Cloud Data Warehouse Platform and distinguishes them from other cloud data services that allow you to present or process data in the cloud, but do not allow you to store data in large amounts.
Although storing and processing data is at the heart of a data warehouse, it doesn’t stop there. For a data warehouse to serve as a long term manageable basis for analytics, more functionality is required. To get a good overview of the overall data management tasks, we start with a rather complete list of criteria concerning the data engineering and analytics life cycle. The three platforms are not targeting the exact same audiences and feature sets, and as such, are not directly comparable in all aspects. Regarding Panoply and Snowflake, the comparison is based only on the information they are sharing publicly on the Internet.
Panoply uses the Amazon Redshift Data Service together with the Elasticsearch Database and Amazon S3 Storage and Spark Compute Architecture under the hood. Amazon redshift is a scalable database with roots in the Postgres database architecture and added cluster abilities; it runs solely as an Amazon Web Service.
This architecture allows for online scaling by adding more nodes to a cluster. Different Panoply Clients share the same infrastructure so highly demanding query loads of one client could affect the query performance of another client. (In fact, Panoply manages the position of your database under the hood and could potentially locate your database on a separate Redshift cluster.) With Panoply, you should be able to create multiple databases, which then are just Amazon Redshift databases. Databases in this sense have separate storage areas but share the same query engine (i.e. DBMS system).
Snowflake’s underlying architecture layers are not disclosed in detail by Snowflake, but overall they also use an online scaling platform with a clear separation of storage and compute resources. Snowflake allows you to create and manage multiple data warehouses in one account. You can configure your compute cluster sizes per warehouse in detail and even configure online auto-scaling for each warehouse. In Snowflake, you can scale-up (more resources on one machine) as well as scale-out (more machines), without any interruption of service. Data Warehouses in Snowflake do not share compute resources to ensure stable performance of each warehouse. Snowflake provides direct database access with external tools, as if they were local to the database.
Repods underlying architecture consists of native PostgreSQL (version>10) and TimescaleDB for huge time partitioned data and many additional data warehouse-related services. Storage is managed and scaled through dedicated storage clusters delivering solid IO speed and online petabyte scaling. Scaling the compute resources currently requires a few seconds of downtime of the data warehouse and is not elastic. You can create and manage and share multiple data warehouses per account. Different data warehouse instances in the platform rely on dedicated resources not shared with any other instance in the cluster for stable query performance.
We categorize import interfaces into four different sections.
Files — still the most common form of data.
Web Services — Plenty available on the net with relevant data.
Databases — Although data is stored in traditional d databases in many organizations, in most cases, direct database access is not exposed to the internet and therefore not available for Cloud Data Platforms. Web Services can be placed in between on-premise databases and cloud services to handle security aspects and access control. Another alternative is the use of ssh-tunneling over secure jump hosts.
Realtime streams — Realtime data streams are delivered by messaging routers and are not used very much today but are going to become more important with the rise of IoT.
Panoply features plenty of import options in all four categories. Based on our information, however, Panoply can neither pull files from a cloud bucket or SFTP according to an automated schedule nor request a RESTful URL according to a schedule.
Snowflake focuses on loading only files (cat. II), but it allows you to load files from cloud storage, including Amazon S3 or Microsoft Azure. Snowflake lets you observe the sources for the arrival of new files and automatically load them.
In Repods, you can upload files, load them from S3 Buckets, or load data from external SFTP Servers including automatic load of new files. Repods does not provide you with a separate Interface for all possible services on the Web, but it gives you a generic API to schedule Web Requests from any kind of service (e.g. Socrata). Repods does not allow you to import data from databases. With Repods you can subscribe and listen to topics on message routers (currently WAMPonly) to ingest data in configurable micro-batches.
Data Transformation ETL
Data imported into the data platform usually has to undergo some data transformations to be usable for analysis down the stream. This process is traditionally called ETL (Extract, Transform, Load). These processes usually create a table from raw data, assign data types, filter values, join existing data, create derived columns/rows, and apply all kinds of custom logic to the raw data.
Creating and managing ETL processes is sometimes called data engineering and is the most time-consuming task in any data environment. In most cases, this task takes up 80% of the overall human efforts. Larger data warehouses can contain thousands of ETL processes with different stages, dependencies, and processing sequences.
In Panoply, you use code to create data transformations. These transformations either provide virtual data results that are recomputed each time you access the data (“Views”) or materialized to save the recompute effort for each access. If the data is materialized, it has to manually be refreshed to update the information (as to what we know). E.g. if new data is available in the sources, you have to hit refresh to execute the transformation.
Depending on the size of the sources and complexity of the transformation, it can be prohibitive to not store intermediate results in a dedicated and managed result table. A very common pattern is the correct incremental load of new data into the already existing data in a table. Depending on the requirements, this can involve very complex transformations. Panoply does not offer specific support for historization. (See the section on historization below.)
Snowflake has a similar approach to Panoply in how it handles data transformations. You can use the Snowflake SQL dialect to implement data transforms in form SQL queries and then materialize them in new tables as you see fit. In Snowflake, you get low-level control over your data object, which makes it comparable to working with pure databases. You can create tables, indexes, views, queries, partitions, etc. as you can do in many other traditional database systems like Postgres, MySQL, Oracle, or DB2.
Snowflake includes an advanced time travel feature that allows you to query a table as it was seen at a specific point in time (up to 90 days back). In the Postgres database, this feature was discontinued long ago due to high-performance overhead and complexity. Snowflake does not come with support for automated historization of data out of the box.
In Repods you can create so-called “Pipes” to transform raw data into a specific data warehouse table (Evo Table). Transformations are also created using Postgres SQL queries. In these queries, you do not have to reimplement data insert strategies for each transformation, since the actual insert into the target table is always handled by an integrated post-process that applies technicalities like deduplication, key generation, historization, and versioning.
Larger data warehouse systems can easily contain hundreds of tables with hundreds of automated ETL processes managing data flow. Errors at runtime are almost unavoidable, and many of them have to be taken care of by manual intervention. With this amount of complexity, you definitely need a way to monitor what is going on in the platform.
In Panoply, you can view a history of executed queries and executed jobs. You get an alert-view listing problems in your sources, services, and other assets. The query log polls the server for updates every few seconds (no realtime refresh).
Snowflake comes with monitoring reports giving you aggregated information on the number of active queries in a certain time range and their resource consumption. In Snowflake you can access the internal metadata tables with SQL to extract any kind of information on query activity in the system. Snowflake also shows historic query execution in the web interface.
Repods offers a realtime updating graphical overview showing details on every pipe that is currently and was historically executed. This overview allows you to drill through thousands of pipe executions in a convenient historical context. You can also search for specific pipe executions in a search drawer. Repods also allows you to analyze the system log with SQL to extract any kind of information about pipe execution.
The usability of the tools depends very much on the targeted audience of each tool. Usability is a very broad and subjective category and considers things such as how easy it is to create and manage objects (users, data warehouses, tables, transformations, reports, etc.) in the platform. Often, there exists a trade-off between the level of control a user gets and the level of simplicity.
The simpler the less control. All three platforms by their nature live in a cloud environment and provide user access using a browser-based web application. All three platforms also follow a hybrid approach regarding a point and click interface and using code.
Snowflake provides the user with a significant level of control similar to what you would expect from a barebone database system. Most of this control is handled by writing code in a code panel (imports, tables, views, indexes, etc.); major tasks like the creation and deletion of whole data warehouses and users can be handled through web forms. The web interface does not respond to activities on the platform and handles updates by refreshing its views in regular time intervals. (The query history refreshes every 10 seconds.)
In Panoply, the creation of objects is handled via web forms. The setup of imports can be handled through web forms as well, while transforms and analytics are entered through code panels. Panoply is only available in English.
In Repods custom transformation logic is handled in code panels and the creation of objects is handled through web forms. For analytics in Repods, you can use either a workbook-style approach to create custom analytical queries, or you can use a built-in OLAP tool to drill through the data model with point and click for quick insight.
This category evaluates the support for user interaction and sharing of work and data. All three platforms allow multiple users to work together in a data environment. There is no discussion or chat functionality in all three platforms.
Panoply has one infrastructure environment for all users, but users can create multiple Amazon Redshift databases on this infrastructure. Panoply has no communication features between users on the platform, so there are few opportunities to provide documentation for data objects and transformations in the platform. You can manage teams with the roles “Admin” and “Editor” on Panoply.
In Snowflake, multiple data warehouses with fine-grained access control can be managed per account. Similar to access control you commonly have on database architectures, you can create custom roles with custom privileges on all objects in data warehouses.
In Repods, you can manage multiple data warehouses (“Data Pods”) per user and provide access for other users similar to how the GitHub platform is organized. The platform is realtime in the sense that user interactions are instantly visible to all other users. There are the access roles “Viewer,” “Reporter,” “Developer,” “Admin,” and “Owner.” Each user can be assigned one of these roles in each pod. Inside the pod, tables can be grouped using “Labels,” and each user can individually be authorized on labels.
The need to manage longer histories of data is at the core of each data warehouse effort. In fact, the data warehousing task itself could be summarized as the task to merge separate chunks of data into homogenous data history. Data is naturally generated over time, and as such, there arises the need to increment an existing data stock with new data.
To robustly manage data histories data historization is used. Technically, time ranges in tables are tracked using dedicated time range columns. Managing these time ranges efficiently when new data arrives is called data historization. This is different from data versioning in the sense that historization is concerned with real-life timestamps, whereas versioning is usually concerned with technical insert timestamps (See the section below.)
Repods provides support for the historization of data out of the box. Historization will be applied after the data transformation before insert into the data warehouse table. The algorithm is maximally space efficient in the sense that it minimizes the number of records to represent the histories. For slowly changing data this approach can greatly reduce the size of tables by orders of magnitude and therefore can have a large beneficial impact on overall query performance.
In the other two platforms, historization time ranges are managed by user-provided transformation logic. Panoply provides a feature they call “history tables,” which according to our definition, will be discussed in the “Data Versioning” section below.
By versioning data, you can track data corrections over time for later recovery of old analysis. Versioning allows you to apply nondestructive corrections to existing data. When comparing versioning capabilities, you have to consider the ease of creating versions and the ease of recovering or querying versions. Versioning can be handled on different levels of the system:
Create version snapshots on the storage subsystems — similar to backups.
The underlying database system might come with support for version tracking.
Versioning might be handled by the data warehouse system.
Versioning can be implemented as a custom transformation logic in user space.
The final variant is a technical and complex process but could be implemented in all three systems.
Panoply provides continuous backups as a built-in versioning option. You can recover a point-in-time snapshot at any time in the backup time frame, but you cannot query versions in the active system using this method. In Panoply, you can also create “history-tables” that insert companion tables to the original tables.
When updates occur on the original table, the system automatically inserts the same records into the companion table enriched with the time ranges representing the changes. We consider this process as a versioning process because the managed time ranges are based on technical insert timestamps and not on business validity time stamps. These History Tables allow you to query different versions of the same data with SQL.
With Snowflake, you can easily create snapshots of all data and get the time travel feature provided by the database system. The later allows you to flexibly query data with SQL, as seen at a certain point in time. This feature is available only in the enterprise edition and covers 90 days of history. Longer-term versioning logic has to be implemented by users.
Repods does not yet provide continuous backups and comes with version tracking designed for each data warehouse use case. You can specify “freeze” timestamps to ensure recoverability of the data, as seen at the time of the freeze. You can flexibly query the data, as seen at old freeze times using simple SQL (infinite days back).
Versioning often comes with excess creation of version records during load times, blowing up your table sizes with a negative impact on query performance. To avoid this, you can specify a second date to prevent versioning of data newer than another date. With this feature, you can, for example, freeze all data before October while loading October data.
Analysis / Reporting
A data platform's purpose is to prepare raw data for analysis and store this data for longer histories. Analysis can be conducted in many different ways.
There are many tools, referred to as Business Intelligence Tools, concerned only with creating analytical and human-readable extracts of data. To prepare consumable data chunks for presentation, a data platform provides features to create data extracts and aggregates from a larger data stock.
Panoply and Snowflake provide you with SQL code editors to create analytical aggregates that you use for creating data transformations. Snowflake saves query results for 24 hours. Furthermore, these two platforms allow you to access the platform with other tools via ODBC (and similar) secured by username and password.
So, you can use all kinds of dedicated analysis tools like Jupyter workbooks or PowerBI to analyze your data. You can however not use the platform resources to run Python or train machine learning models. Also, you can not integrate the results of these workbooks into your data workflow inside the platform.
Repods currently does not allow access to the platform with external tools. However, Repods provides you with their own workbooks style to create data stories and analytical extracts. Workbooks are sheets consisting of many SQL Code panels together with documentation panels (using “Markdown”). Besides workbook-style analysis, Repods also contains an OLAP (online analytical processing) interface that lets you drill through data models using a point and click approach to create report results.
Nowadays, creating machine learning models is a new requirement data platforms have to serve. Most methods are implemented using Python or R with a wide variety of specialized libraries like numpy, pandas, scikit learn, tensor flow, and pytorch.
None of the platforms currently provides a way to conduct data science tasks directly on the platform. The current strategy is to access the platform with specified tools and conduct all data science-related tasks outside of the platform. While this opens up a great choice of tools for you to pick from, you again are facing the challenge to host and manage compute resources to back the potentially very demanding machine learning jobs.
External Access and APIs
Here we want to compare how the three platforms present their content to external consumers. Possible channels that are considered here are
- SQL Access.
- API Access (REST Request).
- Notifications via Text Push or Email.
- File Exports.
Panoply provides direct ODBC (or similar) access to the platform secured by username and password. This enables almost all standard Business Intelligence tools like Looker or Tableau to connect to Panoply and use the data. Panoply tracks system alerts but does not send notifications to your phone or email for that.
Snowflake also provides SQL Access for other tools to consume the data inside of Snowflake. Additionally, Snowflake provides the possibility to export files into Cloud buckets (AWS S3 or MS Azure). Snowflake allows users to set up email notifications for general Snowflake service availability only. You can neither set up system push notifications nor content based notifications that alert you if, for example, the number of customers in France exceeds 1,000.
In Repods, you can create API Access Keys and control access to prepared data extracts of the platform. You can then hand out these access keys to external consumers to access those resources via a standard REST request in any programming language. In this way, you can, for example, connect the Business Intelligence tool PowerBI to the Repods platform to create dashboards. File exports of unlimited size are supported via direct downloads from the browser. No kind of push notification is currently supported.
All three platforms require a certain level of proficiency by the user. Data engineering is no casual task. Proper documentation on the details of the platform features is therefore required for professional use of the platform.
Snowflake provides the most extensive online documentation. It is comparable to the way databases or programming languages are documented. For Snowflake, this is necessary because most of the functionalities are provided through the use of Snowflake’s own dialect of SQL. Therefore, large parts of the documentation are concerned with the Snowflake SQL capabilities. Additionally, Snowflake provides many guides and YouTube videos to get you started on the overall usage of the platform.
Panoply also offers online documentation, but it is not as detailed as Snowflake’s. In part, this is also not necessary because Panoply exposes the underlying Amazon Redshift tables directly to users for SQL-Access, so the Amazon Redshift SQL Documentation is a valid reference for the Panoply's SQL dialect. Panoply also offers general guidance concepts for data warehousing and a few tutorials on YouTube.
Detailed explanations of objects are positioned at places in the web app where the objects are managed. Additionally, Repods has a series of articles on Medium.com and a few YouTube explainer videos.
Security, in general, can be separated in the security of storage (data in rest), interaction (data in transport), and access control. All three platforms encrypt data in rest and in transport. Regarding access control, all three platforms provide password-based authentication of users. Snowflake additionally provides the most secure Two Factor Authentication mechanism of the three platforms.
Panoply and Snowflake allow you to access underlying databases through ODBC-like interfaces with direct server connections. Exposing database ports on the internet can be considered a security risk, and to mitigate this, you can (and should) use secure ssh-tunneling with a dedicated jump-host to access these platforms in my opinion.
To build a complete data warehouse platform, Snowflake and Panoply need to be combined with additional tools to cover the missing aspects of data warehousing. Most notably, the automation of data transformations is an important requirement of every data warehouse architecture. Depending on the organizational context, often one limiting resource in data warehousing is the technical proficiency of the team that needs to deliver information.
Requiring more tools that need to be administered and used in coordination is often not an ideal solution in these scenarios. All three platforms require basic data engineering know-how from users, since all three platforms allow users to use SQL code at important points in the platform.
Snowflake may be a good choice if you are looking for an online database architecture that is used primarily in an analytics context and you do not want to invest in the system and hardware administration. Snowflake clearly requires database administrators to manage the platform and is addressing larger data environments.
As stated above, Snowflake misses many important features that distinguish a data warehouse from a database. Since Snowflake's usage patterns seem to be quite similar to a general-purpose database, it may also be an option to use a managed database on AWS instead if you do not have special requirements that are only met by the Snowflake offering.
Panoply is a simpler platform than Snowflake. Neither system, hardware, nor database administration know-how is required to use Panoply. Like Snowflake, Panoply misses many aspects that distinguish a data warehouse from a database. It seems that currently, Panoply is better suited for less complex data environments than the other two platforms.
Repods is simpler than Snowflake because it does not require any administrative hardware, system, or database know-how from the user. Repods also covers the major data warehousing aspects of automation, historization, surrogate key maintenance, and analytics. The Repods management interface is ideal for a project-oriented approach to data warehouses as opposed to a big central data warehouse architecture; Repods could be described as a data catalog consisting of data warehouses.
Published at DZone with permission of Marko Petzold . See the original article here.
Opinions expressed by DZone contributors are their own.