Data Warehouses: Past, Present, and Future
Data warehouses are increasing in importance as the amount of data at our disposal grows exponentially. We look at their history, where they are, and where they're going.
In today’s world, data is being generated at a rapid pace, especially as enterprises across virtually every industry undergo digital transformation. We’re also seeing unprecedented demand to equip every business decision maker with access to real-time data so that they can make the best-informed decisions for the business. More than ever, global companies are dispersing virtual teams across the world, empowering them with the ability and tooling to make informed business decisions using all available data.
For instance, retailers seek to base purchasing recommendations not just on past purchases and browsing history but on all publicly available information about the customer, such as their profession and employer, their viewing and listening interests, sports and hobbies, travel patterns, and the restaurants they frequent. Providing this holistic view of the customer requires bringing together a variety of data from a multitude of sources, and managing and analyzing that data at scale can be challenging.
In order to make data actionable and useful for business, companies need a way to store, label, and analyze it in an efficient and cost-effective way. Enter the data warehouse.
The term “data warehouse” is commonly credited to American computer scientist Bill Inmon, who began developing the concept in the 1970s. The first data warehouses were on-premise servers designed to perform at gigabyte scale; modern mobile phones have more storage and processing power than they did. Today’s data warehouses have to be architected from the ground up to accommodate petabytes of data with fairly interactive response times. As the earliest generations of data warehouses showed their age and creaked under the strain of growing analytic workloads, hardware was traded in for more agile software, and data warehouses moved to the cloud. This evolution has resulted in three distinct generations of cloud data warehouses.
The Past – Gen I
The first generation of cloud data warehouses removed the complexity of setting up the rather intricate infrastructure required to support a clustered MPP data warehouse. Not only were the hardware and operating system environment preconfigured; so too was the data warehouse itself. Technologies such as Amazon Redshift were heralded as changing the way data warehouses would be deployed in the future, and adoption grew rapidly.
First-generation data warehouses provide many benefits beyond the simplicity of deployment: they can scale up and down as business needs change, they are part of an ecosystem of data integration and application development which enables building new classes of applications, and they are built on a platform designed for resiliency and security.
However, they can be limiting. Because first-generation cloud data warehouses are offered only in the cloud, adopters need to find an alternative solution for data that must reside on-premise for compliance or sensitivity reasons. The promise of integration with other related services also comes with a price – plumbing them together can be difficult and time-consuming.
The tight coupling of storage and compute in first-generation cloud data warehouses means that it’s not possible to shut down compute without impacting storage, so the meter is always running. This can become cost-prohibitive for use cases where access to analytics during business hours would suffice. While pricing may start low, production workloads at scale can get expensive.
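To make the "meter is always running" point concrete, here is a rough back-of-the-envelope sketch in Python. The hourly rate and the 10-hours-a-day, 5-days-a-week schedule are purely illustrative assumptions, not figures from any vendor's price list.

```python
# Illustrative arithmetic only: HOURLY_RATE and the business-hours schedule
# are assumptions made up for this sketch, not real vendor pricing.
HOURS_PER_WEEK = 24 * 7            # compute that can never be paused
BUSINESS_HOURS_PER_WEEK = 10 * 5   # assumed: 10 hours a day, 5 days a week
HOURLY_RATE = 4.0                  # assumed cost units per cluster-hour


def weekly_compute_cost(hourly_rate: float, hours_running: float) -> float:
    """Cost of keeping a compute cluster up for the given number of hours."""
    return hourly_rate * hours_running


# Coupled storage and compute: the cluster runs all week.
always_on = weekly_compute_cost(HOURLY_RATE, HOURS_PER_WEEK)

# Decoupled: compute is shut down outside business hours.
business_hours = weekly_compute_cost(HOURLY_RATE, BUSINESS_HOURS_PER_WEEK)

# Fraction of the always-on bill that pausing compute would avoid.
savings_fraction = 1 - business_hours / always_on

print(f"always on: {always_on:.2f}, business hours only: {business_hours:.2f}")
print(f"share of compute spend avoided: {savings_fraction:.0%}")
```

Under these assumed numbers, roughly 70% of the weekly compute bill pays for hours when nobody is querying. Storage costs, which continue either way, are deliberately left out of the sketch.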
First-gen cloud data warehouses were delivered by cloud platform providers, and their deployment is typically limited to the provider's own cloud platform (Redshift is limited to AWS, BigQuery is limited to Google Cloud, etc.). This can be a challenge when business needs require an alternative.
The Present – Gen II
The massive demand for first-generation cloud data warehouses clearly illustrated that there was a significant market opportunity and room for more than just the cloud platform providers.
Second-generation cloud data warehouses have all of the benefits of cloud infrastructure such as scalability, security, and robustness, but they’re not tied to a single cloud provider in the same way that the native first-gen warehouses are. They also address some of the shortcomings of the first generation. They provide a fully managed cloud data warehouse service where the customer can focus on growing their business and not on roll-your-own data infrastructure. They also provide for truer cloud economics, where the expectation is that you only pay for what you use. The second-generation cloud data warehouses, a generation defined by the likes of Snowflake, have changed the economics of enterprise data warehouses forever.
But second-gen data warehouses have their own limitations. They are cloud-native solutions, which, as with the first generation, means that a second technology needs to be selected to meet on-premise data needs. Additionally, their costs start low but rise quickly as additional compute clusters are spun up to meet growing user needs.
The Future – Gen III
A critical requirement for many organizations is that software services offer an on-premise equivalent that enables the same technologies, skills, and applications to run in the cloud as well as on-premise for sensitive data that may be subject to regulatory requirements. There should also be the ability to join data from cloud data warehouses with their on-premise alternatives seamlessly from within a single query. This is where the third-generation data warehouse comes in.
Third-generation data warehouses solve this challenge by allowing data to be simultaneously stored on-premise and in the cloud, connecting all data to the broader data ecosystem regardless of location and allowing organizations to leverage the real-time insights provided by their data – all of their data. This hybrid capability is one of their main differentiators. It is especially important for industries with regulatory compliance requirements (such as financial services, healthcare, and pharma) that want to leverage the same technologies for their on-premise and cloud analytics needs, as well as seamlessly join on-premise and cloud-resident data. It also allows the same skills, technology, and applications to be used for both cloud and on-premise deployment, greatly reducing the staff required to administer hybrid deployments and ultimately reducing costs.
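As a toy illustration of joining on-premise and cloud-resident data in a single query, the sketch below uses Python's built-in sqlite3 module and attaches two separate in-memory databases to one connection. SQLite is only a stand-in here: the schema names onprem and cloud, the tables, and the sample rows are all hypothetical, and nothing in it reflects how any particular warehouse implements hybrid queries.

```python
import sqlite3

# One connection with two attached databases. Both live in memory here;
# in this sketch they merely stand in for two physically separate locations.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS onprem")  # hypothetical on-premise store
conn.execute("ATTACH DATABASE ':memory:' AS cloud")   # hypothetical cloud store

# Sensitive customer records stay "on-premise".
conn.execute("CREATE TABLE onprem.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO onprem.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)

# High-volume order data lives "in the cloud".
conn.execute("CREATE TABLE cloud.orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO cloud.orders VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 12.5)],
)

# A single query joins across both locations.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM onprem.customers AS c
    JOIN cloud.orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.name
    """
).fetchall()

for name, total in rows:
    print(name, total)
```

The point of the sketch is the shape of the query: the analyst writes one SQL statement and qualifies each table by where it lives, rather than exporting data from one system and loading it into the other first.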
Earlier generations also struggled with large user volumes running mixed workloads at enterprise scale. To address this, third-generation cloud data warehouses aim to provide a service suited to use cases where data volumes are high and query complexity is varied, and where the organization wants to provide every business decision maker with real-time access to all of the relevant data. The third generation is designed to be a component in a broader cloud strategy and comes with integrations for hundreds of data sources, including popular SaaS solutions like Salesforce, NetSuite, Workday, and ServiceNow, so data from those services can be seamlessly blended to provide insights.
Additionally, previous cloud data warehouses experienced issues with concurrency. The third-generation solutions were designed for top-tier enterprises that need to accommodate hundreds of users querying data in parallel, and they are built to execute complex queries for large numbers of users over massive data volumes. Having robust concurrency capabilities, without costs rising as more users are given access to the data, means that organizations can truly harness the data across every business function at scale to provide actionable insights.
Data warehouses have transformed over the years, beginning as on-premise workhorse solutions and then transitioning into the cloud – with each new iteration addressing problems raised by the previous one.
In the third generation, we see data warehouses working as hybrid solutions, combining the capabilities of on-premise and cloud data to harness real-time insights. Third-gen warehouses are designed for large enterprises that want to give their business decision makers all of the information they need to make an informed decision, regardless of where that information may live. Moving forward, we expect to see more organizations adopt the third generation.