Database vs. Data Warehouse
Looking to make data-driven decisions but aren't sure whether you need a database or a data warehouse? Here are some pros and cons for each, as well as some advice for your use case.
Join the DZone community and get the full member experience.Join For Free
You've got data. Your competitors have data. We all have data, and lots of it. It's a data-driven world and that leaves us with a question: what do we do with it all?
The short answer to our question of what to do with all that data is to put it in a database. A database is the basic building block of your data solution. Data has to live somewhere, and for most applications, that's a database. It's basically an organized collection of data. Typically, the type of database used for this is an OLTP (online transaction processing) database.
But there's more to the picture than storing information from one source or application. Today's business is built on data and OLTP databases aren't typically designed to excel at running analysis across very large data sets consisting of multiple data sources. As you begin to accumulate more and more data from multiple sources, and need to do things like transform and perform analysis on it, having the data from your multiple, disparate sources stored in and across multiple OLTP databases can become a liability. Performing separate analysis on each data source is inefficient and costly at best.
You'll need a better place to keep data from all of those data sources — a place that allows you to maintain a single repository of, and run analytics on, all your data sources and streams simultaneously.
Data Warehouse Overview
A better answer to our question is to centralize the data in a data warehouse. A data warehouse is basically a database (or group of databases) specially designed to store, filter, retrieve, and analyze very large collections of data. Data warehouses are OLAP (Online Analytical Processing) based and designed for analysis. The modern approach is to put data from all of your databases (and data streams) into a monolithic data warehouse. This allows you to perform visualization and analysis one time — on the bulk of your data simultaneously rather than multiple times on smaller chunks — without having to merge or reconcile the results.
For data warehouses, the choice is between on-premise and cloud-based solutions. On-premise data warehouses (think Oracle, IBM, Teradata, etc.) typically excel at flexibility and security. You have more control over management and configuration when you host the servers or have direct access to them.
Cloud-based data warehouses (such as Amazon Redshift, Google BigQuery, Snowflake, etc.) provide more scalability and lower entry and maintenance costs. You can spin up (and pay for) additional computing power and storage only when you need it, for example. Further, the resources are always available so you can get up and running quickly, without having to wait for new hardware or capacity to be purchased, installed, and brought online. We talk about how to pick a data warehouse in A Guide to Selecting the Right Cloud Data Warehouse.
How They Stack Up
Used for storing data from one or a limited number of applications or sources.
Pros: Processing digital transactions, established technology
Cons: Reporting, visualization, and analysis cannot be performed across a very large integrated set of data sources and streams
Used for aggregating data from many different data sources, and make that data available for visualization, reporting, and analysis. Purpose-built for analysis.
Pros: Better support for reporting, analysis, big data, data retrieval, and visualization, designed to store data from any number of data sources
Cons: Costly compared to a single database, preparation/configuration of data prior to ingestion, (for cloud data warehouses) less control over access and security configuration
What Works Best for You?
If you're dealing with more than one (or just a few) applications and data sources, you'll likely find that OLTP databases and RDBMSs are not a good solution. Here's the thing: the number of data sources and data streams is growing every day. The proliferation of new cloud and SaaS offerings is resulting in a flood of data crucial to your business. Keeping all of that data in their siloed sources causes problems with analysis. How can you know what you have? How can you find what you need? How can you analyze it all?
Once you start having to sync data from multiple databases, you've reached the point where you should consider implementing some kind of extract, transform, load (ETL) process to move your data from your databases and data sources/streams to a single data warehouse.
Ultimately, today's data-driven business environment relies on speedy, thorough analysis. For many companies, that means getting your data quickly and accurately from potentially many different databases (and other data sources/streams) into a powerful, cloud-based, data warehouse — possibly with some transformation along the way.
Published at DZone with permission of Garrett Alley, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.