What Is Data Warehousing?
What Is Data Warehousing?
A look at the benefits of data warehousing for big data professionals, and the various ways data warehouses can aid with the ingestion and analysis of data.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
Today's enterprise relies on the effective collection, storage, integration, and analysis of data. These activities have moved to the heart of revenue generation, cost containment, and profit optimization. As such, it's no surprise that the amounts of data generated - as well as the number and types of data sources - have exploded.
Data-driven companies require heavy-duty solutions for managing and analyzing large quantities of data across their organizations. These systems must be scalable, reliable, and secure enough for regulated industries, as well as flexible enough to support a wide variety of data types. These requirements go way beyond the capabilities of any traditional database. That's where the data warehouse comes in.
Breaking it Down: What Is a Data Warehouse, Anyway?
A data warehouse is a large-capacity repository that sits on top of multiple databases. Whereas the conventional database is optimized for a single data source, such as payroll information, the data warehouse is designed to handle a variety of data sources, such as sales data, data from marketing automation, real-time transactions, SaaS applications, SDKs, APIs, and more.
There are other differences, as well. For example, single-source databases are built for speed, employing online transactional processing (OLTP) to insert and edit small transactions. However, due to their structure, they do not lend themselves to advanced analytics. In contrast, a data warehouse uses online analytical processing (OLAP), which is designed for fast, sophisticated analysis.
Databases and data warehouses do have some similarities, however. Besides the fact that they are both repositories for large amounts of data, both can be queried. And they both have the ability to store data in tables (although databases only store data in two-dimensional tables; data warehouses contain multidimensional tables with layers of columns and rows).
On-Premise and in the Cloud
Companies are increasingly moving away from on-premise data warehouses to the cloud, leveraging the cost savings and scalability managed services can provide. The architecture of these cloud-enabled data warehouses differs from that of their traditional, on-premise counterparts.
Traditional data-warehouse architecture is separated into three tiers: one for the database server that extracts data from multiple data sources, one for the OLAP server (which transforms the data), and one for the client level.
Cloud-based data warehouses are an entirely different animal. Their architecture varies tremendously among vendors. For example, Amazon’s Redshift is essentially a cloud-based representation of on-premise data warehouses. BigQuery is serverless so it manages computing resources dynamically and hides resource-management decisions from the user.
The cloud offers some distinct advantages:
- It's managed. Instead of hiring your own data-warehousing team, a cloud data warehouse lets you outsource the management hassle to professionals who must meet service level agreements (SLAs).
- It outperforms on-premise data warehouses. Cloud-based solutions offer superior reliability and speed. They are generally more secure than on-premise data warehouses, making them a good choice for the enterprise.
- It's built for scale. Cloud-based data warehouses are elastic, so you can instantly add capacity.
- It's more cost-effective. With cloud, you pay for what you use. Some providers charge by throughput. Others charge per hour per node. In every case, you avoid the mammoth costs incurred by an on-premise data warehouse that runs 24 hours a day, seven days a week.
Check out this guide to selecting the right cloud-based data warehouse for your environment.
Do You Need a Data Warehouse?
Some businesses and industries require more data analysis than others. For example, Amazon uses real-time data to adjust prices three or four times a day. Insurance companies track policies, sales, claims, payroll, and more. They also use machine learning to predict fraud. Gaming companies must track and react to user behavior in real-time to enhance the player's experience. Data warehouses make all of these activities possible.
If your organization has or does any of the following, you're probably a good candidate for a data warehouse:
- Multiple sources of disparate data.
- Big-data analysis and visualization - both asynchronously and in real-time.
- Custom report generation/ad-hoc analysis.
- Data mining.
- Machine learning/AI.
- Data science.
These activities and assets require more than the traditional single-source database can provide. They require an "industrial-strength" data warehouse.
Published at DZone with permission of Garrett Alley , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.