Data Warehouse Architecture: On-Premise vs. Cloud
Setting up a data warehouse is often necessary for decision-makers who deal with large volumes of data. Should it be set up on-premise or in the cloud? We take a look.
For organizations that want to ask big questions and get big answers in return, data warehousing is magic. Centralizing data into a single source of truth makes business intelligence, strategy, and decision-making easy. Data goes in, insights come out. And whether you do that with a traditional, on-premise warehouse or a cloud-based one, both can be beneficial for realizing business goals and objectives, albeit in slightly different ways.
But choosing to implement a traditional warehouse over a modern, cloud-based one brings more than just surface-level differences in usability. The underlying architecture of on-premise vs. cloud can be a significant factor in how your organization allocates resources and budget for data management and intelligence gathering over time.
In short, the prescribed structure of an on-premise solution has to be configured and managed on-site by your own IT team. These solutions are often more expensive, rigid, and complicated than cloud-based alternatives, which may give you pause when evaluating what’s right for your company.
Conversely, cloud-based warehouses tend to be more adaptable. They can be configured in any number of ways, depending on who’s providing the platform. Regardless, they all utilize a self-service model and support various integrations and add-ons to scale data analysis as your business changes. As a result, cloud data warehousing tends to be cheaper, more flexible, and easier for IT teams to manage.
To better understand how architecture plays a role in determining the right data warehouse solution, let’s take a closer look at how on-premise and cloud-based warehouses are built and the upfront investment in people and resources that each requires.
On-Premise Data Warehouse Architecture
Virtually all on-premise data warehouses sit on top of a three-tier, nine-layer architecture.
The tiers provide the general structure for how data is collected, stored, and used. At the bottom tier, a database server collects data from multiple sources, such as financial, sales and marketing, customer, and inventory systems, while an OLAP (online analytical processing) server in the middle tier makes the data usable for analysis. In the top tier, users can then query, access, and manipulate the data through a variety of tools.
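To make the tiers concrete, here is a rough Python sketch of the kind of roll-up query the middle and top tiers exist to serve. SQLite stands in for the warehouse database, and the sales_facts table is a made-up example rather than any particular product's schema.

```python
# Illustrative only: SQLite stands in for the warehouse, and the sales_facts
# table is a hypothetical example of data already loaded into storage.
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_facts (year INTEGER, region TEXT, amount REAL)"
)
warehouse.executemany(
    "INSERT INTO sales_facts VALUES (?, ?, ?)",
    [(2023, "west", 120.5), (2023, "east", 98.0), (2024, "west", 210.0)],
)

# The OLAP (middle) tier is optimized for aggregations like this one; a BI
# tool in the top tier would issue the query and chart the result.
rollup = warehouse.execute(
    """
    SELECT year, region, SUM(amount) AS total_sales
    FROM sales_facts
    GROUP BY year, region
    ORDER BY year, total_sales DESC
    """
).fetchall()

for year, region, total in rollup:
    print(year, region, total)
```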
Within each tier are layers where the action happens.
- Data enters the database (data source) where it’s then extracted and cleaned (data extraction).
- A staging area further cleans the data and holds it before it’s funneled into the data warehouse.
- An ETL process (extract, transform, load) changes the data from its original state into a form in which it can be analyzed, often through a third-party ETL tool (a minimal sketch of these steps follows this list).
- Then the data is loaded, or stored, in the actual data warehouse (data storage).
- Data logic is applied, which sets rules for how the data will be reported on, and data presentation makes the data presentable for users in tables, graphs, emails, alerts, and other forms.
- Together, metadata — think of it as "information about the information" — and system operations give sys admins knowledge about how the data is stored and how the warehouse operates.
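As a rough illustration of those layers, the sketch below walks one batch of records through extraction, staging and cleaning, transformation, and loading. SQLite again stands in for both the source system and the warehouse, and the orders table and its columns are assumptions made purely for demonstration; a real pipeline would rely on a dedicated ETL tool and the warehouse's own bulk loader.

```python
# A minimal, illustrative ETL sketch. The table names, columns, and the use of
# SQLite as a stand-in for both the source system and the warehouse are
# assumptions for demonstration only.
import sqlite3

# Data source layer: a hypothetical sales system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount TEXT, region TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "120.50", "west"), (2, None, "east"), (3, "87.00", "WEST")],
)

# Data extraction layer: pull the raw rows out of the source.
raw_rows = source.execute("SELECT id, amount, region FROM orders").fetchall()

# Staging layer: drop incomplete rows and normalize values before loading.
staged = [
    (order_id, float(amount), region.lower())
    for order_id, amount, region in raw_rows
    if amount is not None and region is not None
]

# Transformation step: reshape the data into the reporting structure,
# here a per-region total, which is a typical analytical grain.
totals = {}
for _, amount, region in staged:
    totals[region] = totals.get(region, 0.0) + amount

# Data storage layer: load the transformed data into the warehouse. The data
# logic and presentation layers would then report on this table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_by_region (region TEXT, total REAL)")
warehouse.executemany("INSERT INTO sales_by_region VALUES (?, ?)", totals.items())
warehouse.commit()
```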
For each tier and layer, your IT team, and possibly data scientists, must make sure the system functions properly and that the data is secure and being processed correctly. Any change within the business will also change the data it collects and uses, requiring manual reconfiguration, testing, and a number of other time-consuming steps to adapt the warehouse and its associated processes.
Maintaining an on-premise data warehouse is not for the faint of heart; it succeeds only when organizations can invest heavily in both hardware and staff with the right skills, training, and expertise.
Cloud Data Warehouse Architecture
Data warehouses in the cloud are built differently. Each warehouse provider offers its own unique structure, distributing workloads and processing data across several physical servers, networks, or software tools while making data easily accessible — and more powerful — for users.
Here’s a look at how some of the more popular cloud data warehousing options are built:
- Amazon Redshift is structured like a traditional data warehouse, but lives in the cloud. Each cluster contains compute nodes fronted by a leader node, which handles communication between the compute nodes and the users. The leader node parses incoming queries, distributes the work across the compute nodes, then aggregates the results and returns them to users.
- Similar to Redshift, Snowflake is also built on a cluster architecture and delivered as a service. It integrates data ingestion, storage, and analysis into one system, but separates storage and compute to enable rapid scaling and more efficient resource utilization.
- Meanwhile, Google BigQuery uses a serverless, data-warehouse-as-a-service model. It integrates with several third-party tools and services so users can run interactive queries on datasets at high speed. The underlying architecture is hidden from users; behind the scenes, the service manages machine resources to scan columns and rows of data and return query results (see the query sketch after this list).
- Microsoft’s Azure SQL Data Warehouse combines SQL relational databases with massively parallel processing (MPP), allowing users to run complex queries. Along with other Azure tools, users can easily integrate data storage, transmission, and processing services into automated data pipelines.
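To give a feel for the self-service model, here is a sketch of running the same aggregate query against two of the services above from Python. The project, account, warehouse, and credential values are placeholders, and the snippet assumes the vendors' official client libraries (google-cloud-bigquery and snowflake-connector-python) are installed with valid credentials configured and a "sales" table in place; it illustrates the access pattern, not a complete integration.

```python
# A sketch of the self-service access pattern for two cloud warehouses.
# All identifiers below are placeholders, not real accounts or datasets.
from google.cloud import bigquery
import snowflake.connector

QUERY = "SELECT region, SUM(amount) AS total_sales FROM sales GROUP BY region"

# Google BigQuery: serverless, so there is no cluster to size or manage.
bq_client = bigquery.Client(project="my-example-project")  # placeholder project
for row in bq_client.query(QUERY).result():
    print(row["region"], row["total_sales"])

# Snowflake: compute is provisioned as a named virtual warehouse, scaled
# independently of storage.
conn = snowflake.connector.connect(
    user="EXAMPLE_USER",          # placeholder credentials
    password="example-password",
    account="example-account",
    warehouse="ANALYTICS_WH",     # the compute cluster
    database="EXAMPLE_DB",
    schema="PUBLIC",
)
try:
    cur = conn.cursor()
    cur.execute(QUERY)
    for region, total_sales in cur.fetchall():
        print(region, total_sales)
finally:
    conn.close()
```

Either way, the provider handles the underlying hardware and scaling; the user's job is mostly writing queries and wiring up integrations.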
Because cloud data warehouses are self-service, they are built to be user-friendly. They’re intended to allow your existing rank-and-file IT staff to customize and manage workflows, with expert help available when needed as part of the service. That’s why the initial investments and ongoing expenses for a cloud warehouse are significantly lower than for a traditional one: there is no need to hire a raft of data scientists or purchase and maintain expensive hardware.