Four Stages that Revolutionized Database Architecture
This article describes the traditional database and machine architectures from Oracle, Microsoft, and IBM and provides a high-level description of the main solutions.
Join the DZone community and get the full member experience.Join For Free
This article describes the traditional database and machine architectures from Oracle, Microsoft, and IBM. It provides a high-level description of the main solutions, including shared memory, shared disk, and shared nothing. It describes how each option attempts to address the challenges of performance and scalability and describes the advantages and drawbacks of each.
It goes on to explain how a Data Warehouse startup has revolutionized database design with a solution designed for the cloud.
Don’t worry if you don’t have a Ph.D. in hardware architecture — this is a Plain English article.
Types of Workloads
Broadly speaking, any given database application needs to deal with:
- A Transactional Workload: Also called an Online Transaction Processing (OLTP) workload, this involves hundreds or even thousands of users executing short-lived online transactions. Typical applications are e-commerce (eg. Amazon) or flight reservation systems, whether online or on the web.
- An Analytic Workload: Also referred to as Online Analytic Processing (OLAP). Whereas OLTP applications exist to run the business, OLAP applications help analyze what’s happening. These workloads tend to be executed by back-office analysts and managers intent upon understanding trends in customer behavior, including what’s selling, where and (ideally) why.
- A Mixed Workload: Sometimes referred to as Hybrid Transaction and Analytic Processing (HTAP). This is a relatively new challenge that involves running a combined OLTP and OLAP workload on the same platform, often in real time. Typically, data is captured, processed and analyzed within milliseconds of a real-world event. It is used (for example), to detect in-transaction credit card fraud based upon past behavior, or to support a real-time dynamic auction for advertisers as web pages are displayed.
Whereas OLTP applications tend to insert or update a few rows at a time, OLAP applications load and query millions or even billions of rows at a time, and the workload profile is significantly different.
Despite the distinctly different workload, the major database vendors (Oracle, Microsoft, and IBM) have stubbornly insisted upon extending their existing product line to meet all requirements, which left a window of opportunity for specialized database solutions.
As user numbers and data volumes grow and analytic queries demand likewise, it is increasingly difficult to deliver raw data for analysis, let alone produce business insights in a timely fashion. To make matters worse, many businesses are working on a digital transformation program, which often involves installing analytic solutions within the operational data pipeline. This means near real-time business intelligence is no longer a should-have requirement but a must-have solution, with analytic results delivered within seconds of a real-world event.
As a result, it becomes increasingly important to be agile and respond quickly to changes in demand, and a fixed on-premises infrastructure is no longer an acceptable approach.
As any database architect will testify, the key non-functional requirements of almost any database platform include:
- Maximizing Query Performance: Which is achieved by improving the speed to deliver results to users by reducing the latency — the time between a request and a response.
- Maximizing Batch Throughput: Which involves batch processing data and reports as fast as possible to deliver fresh insights as quickly as possible to real-world events.
- Scalability and Elasticity: Which implies the ability to quickly react to a sudden increase in processing demands without a significant impact on performance, conversely, during quiet times, scaling down the compute resources to match the demands and control costs, and perhaps even suspending the compute resources when they are no longer needed — assuming you pay for compute resources on a pay-as-you-use basis.
- Resilience and High Availability: Which implies a resilient solution that has no single point of failure, and the loss of a single component or even an entire node does not threaten the entire system. In addition, High Availability means even if the primary site fails, operations can continue, perhaps using a hot standby.
Some of the above requirements are mutually exclusive, as, for example, it’s possible to tune a system to support millisecond performance for thousands of users, but difficult to maximize batch query throughput on the same platform.
Traditional Database Architecture
Until recently, the most common options available were:
- Shared everything SMP systems: Symmetric Multi-processing systems use a single platform in which multiple processes execute in a shared-everything environment. The system load can be increased until throughput is constrained by either CPU, memory, or disk access, and the only real option available is to scale up by adding additional or faster components or migrating to a larger machine. Database solutions have been available from Oracle, Microsoft, and IBM, among others, since the 1980s.
- Shared Disk – Real Application Clusters (RAC): Is a solution deployed by Oracle to scale up a single SMP machine into a cluster of two or more servers, which act as a single database.
- Share Nothing systems: Massively Parallel Processing (MPP) solutions deploy a number of independent nodes to form a single cluster, which gives the benefit of near linear scalability. When more processing or storage is needed, simply add additional nodes. MPP database vendors include HP Vertica, Greenplum, and Teradata.
The Shared Everything (SMP) Solution
The diagram below illustrates a typical shared-everything solution. It shows a single database server that executes queries, with a hot-standby system accepting synchronously replicated changes to provide high availability. In the event of primary system failure, all operations are switched to the standby system.
The above architecture has the advantage that it is battle-hardened and has delivered a reasonable general-purpose platform since the 1980s. It’s a simple architecture that, for many use cases, is a good option. It does, however, have the following drawbacks:
- Performance and Throughput: Can be constrained. Since memory, disk, and processes are all shared by every user, the load can be increased until one component becomes the bottleneck, after which query performance and batch throughput decline. During busy periods or as data volumes grow, users may wait a considerable amount of time for queries to complete, leading to frustration. The only short-term solution available is the increasing use of workload management (query queuing), which can be even more frustrating.
- Scalability: Options are limited in the short term. The only options available involve adding memory, faster processors and disks, and experience that demonstrates that a 100 percent increase in processor speed seldom results in the doubling of overall performance, as the bottleneck is simply moved to another component.
- Elasticity: It is not an option. This means you typically need to size the machine for the maximum anticipated workload and hope you never exceed it. This is, of course, inefficient use of an expensive resource, as it means you seldom run the server at 100 percent capacity. It also means you start with an over-sized capacity server and then gradually experience reduced performance as workload increases over time.
- Resilience and High Availability: It can be remarkably expensive as both a primary and standby server need dual redundant components and power supplies to avoid a single point of failure. Finally, you’ll need a duplicate standby server for high availability, which adds to the cost of this solution.
"You cannot put CPU in the bank and save it for later. So, if you are running with idle cycles you should be looking for more ways to use it." — Tom Kyte (Oracle)
In addition to the issues described above, the standard database architecture used by Oracle, Microsoft, and IBM involves caching data in a buffer pool to reduce disk I/O, logging changes to a write-ahead buffer, and locking and latching to ensure data consistency and ACID compliance. This architecture was driven by the need to make the best use of expensive and relatively slow CPU and memory resources. However, although disk bandwidth and latency are still a problem, memory capacity and CPU performance are over a thousand times bigger and faster.
This has huge implications for system throughput, and the chart below illustrates a surprising feature of this traditional (perhaps legacy) architecture:
Taken from Dr. Michael Stonebraker’s research paper, OLTP Through the Looking Glass, it shows the latency built into the traditional database architecture. It illustrates a 93 percent latency overhead resulting from locking, latching, and buffer management. To demonstrate the problem and deliver a potential solution, his team built the H-Store prototype, an in-memory OLTP database, which was eventually implemented as VoltDB.
Dr. Stonebraker’s achievement was remarkable, as the H-Store database achieved a TCP-C benchmark of over 70,000 transactions per second compared to an unnamed commercial rival that barely managed 850tps on the same dual-core hardware despite significant DBA tuning. That’s an 82x performance gain compared to a commercially available database on the same platform.
To put this further in context, the H-Store prototype achieved 35,000 transactions per CPU core, which is around 35 times better than the highest performing TCP-C record at the time, which achieved around 1,000 transactions per second per CPU core.
Dr. Stonebraker then went on to build and launch a commercial database management system, VoltDB, which a benchmark demonstrated was nearly twice as fast as the leading NoSQL database, Cassandra, while exceeding the level of ACID compliance provided by almost every database management system, including Oracle.
"If you want to do OLTP, you want a row-based memory store, and if you want to do data warehousing, you want a disk-based column store. Those are fundamentally different things." — Dr Michael Stonebraker (MIT)
Real Application Clusters (RAC)
First launched in 2001 by Oracle, the RAC architecture attempted to tackle some of the scalability limitations of a single database server, which were being addressed by the MPP architecture described further below.
The diagram below illustrates how this shared-disk architecture works:
In this solution, two or more database servers are closely clustered to operate as a single server connected by fast InfiniBand connection sharing access to a common data store on disk or SSD. All machines in the cluster synchronize memory using a technique called cache fusion, which can reduce disk I/O if data is available in another node. This does, however, limit scalability, as additional nodes increase the corresponding network coordination needed to synchronize the cache, and most clusters never exceed 3 or 4 large servers.
Although the RAC architecture provides a potential scalability option for Oracle databases, the limit on the number of cluster nodes means it’s been less successful than the MPP architecture. It also needs some careful application design, as any attempt to modify the same data block on more than one node will result in pinging whereby the block is transferred between the participating nodes to complete update operations.
It is sometimes possible to segment an application, for example, using one node for batch loading and the other for queries, but this may not be feasible in many cases. As a result, most RAC clusters tend to be made up a small number of very large SMP servers, and this architecture does not provide the incremental and top-line scalability of the MPP option.
Although still commercially available and supported by Oracle, the Exadata Appliance has tended to take center stage, and this is the preferred solution for cloud deployment.
The Shared Nothing: MPP Solution
The diagram below illustrates the shared-nothing massively parallel processing (MPP) architecture in which a number of independently running servers are connected to form a single compute cluster.
Unlike the SMP system, the MPP solution is potentially highly scalable, as additional nodes can be added, which extends both compute and storage capacity. Some solutions (eg. Vertica) support deployment on inexpensive commodity hardware to control costs, while maintaining system resilience using data replication. Some MPP-based solutions can also continue running with a reduced performance despite the loss of one or more nodes and can even be scaled up without machine downtime.
On these systems, as data is loaded, it is sharded (distributed) across the cluster, so each node is responsible for a subset of the data, which makes them a great solution for VLDBs (Very Large Databases) and analytic query platforms.
Typically, the table designer selects one or more columns as the partition distribution key, and this is used to distribute data across the cluster. This has the advantage that queries that filter rows by the distribution key are executed only on the relevant nodes. Any queries that aggregate results across the entire data set are executed in parallel on every node, with potentially huge performance gains when compared to an SMP solution.
The diagram below illustrates two of the most common sharding mechanisms. Replication (used for small volume reference data) duplicates the data on each node, whereas the Hash Distribution method uses a consistent hashing algorithm to evenly distribute rows — most often, large volume transaction tables across nodes in the cluster.
This architecture does, however, come with some drawbacks:
- Performance and Throughput:Can be impressive, especially against very large data volumes that execute queries in parallel across multiple nodes. Equally, if the query includes a filter by the partition key, it not only reduces the volume of data scanned, but the query is only executed on the relevant nodes, which leaves free capacity for other users. However, there are several potential issues:
- Data Placement: Is critical to maximize performance and throughput. If the selection of the partition key and physical data placement is not optimal, performance will suffer as data is copied at run time between nodes to complete join operations. For example, when attempting to join two large tables where each is distributed using a different key, the database will copy the smaller table to the relevant nodes to complete the operation.
- Potential Data Skew: Related to data placement, this highlights the risk of limited throughput as a result of skewed data values, which affect data placement. In this scenario, if a sub-optimal partition method is selected, a majority of data and, therefore, queries, will be sent to a single node, which will limit concurrency. Again, this highlights the need for careful physical database design on MPP systems, as a full database reorganization will be processor and time intensive.
- Capped Throughput: Queries that use a secondary index (that cut across the partition key) or include the entire data-set are potentially executed on every node in the cluster. This will limit overall system concurrency, as (for example), if each node in the cluster can handle 10 concurrent queries, then that sets a maximum for these query types. Of course, not every query will hit every node in the cluster, but it’s a potential limitation to be aware of.
- Update Scalability: While inserts are executed in parallel at speed, updates against distributed tables or multiple rows across the cluster require coordination between nodes which can limit scalability. Likewise, frequent inserts or updates to reference data replicated across the cluster can limit performance and scalability.
- Scalability and Elasticity: MPP databases provide a level of scalability (the ability to cope with ever larger data volumes), which is simply not possible on single node SMP systems. However, adding nodes to a pure MPP solution often requires downtime or a resource-intensive data redistribution. New hardware must be procured and installed, and the data re-distributed to the additional nodes. Deploying new hardware can take days or even weeks to order and deploy, and on some applications, downtime is simply not an option. Likewise, elasticity (the ability to quickly reduce the number of active nodes to match machine resources to current demand) is not an option on most MPP solutions — even in the cloud.
- Over Provisioning Storage: As each MPP node adds both compute and storage capacity, the result is an arithmetically scalable solution that can handle more concurrent queries against larger data sets. However, when data volumes grow, (for example to support a Data Lake where much of the data is seldom queried), there is a risk of significant over-provisioning. Research by CTO Tripp Smith demonstrated a variance as much as 300:1. At the extreme, he found some clients had 300 times the compute capacity they needed to support their analytic workload, which is a hugely inefficient use of machine resources.
- Data Replication: Although disk storage is relatively inexpensive, MPP systems rely on data replication (duplication) of reference data to support local join operations and reduce inter-node data transfers, and this additional storage cost adds to the total cost of ownership.
In conclusion, the MPP architecture provides huge benefits in terms of its ability to scale concurrent queries and data volumes to hundreds of terabytes, but the limitations mean they tend to peak at about a petabyte. They can, however, be deployed using commodity servers and are equally found on premises and in the cloud.
EPP – Elastic Parallel Processing
Recently, a completely new and unique cloud-based architecture has started to emerge. One that separates compute processing from storage and has all the advantages of the MPP architecture, with the ability to deploy entire new hardware within seconds, and even size the cluster up or down on demand as needed.
The diagram below illustrates the Snowflake Computing architecture in which an unlimited number of MPP clusters can be deployed over a shared storage (deployed on either Amazon S3 or Microsoft Azure Blob storage). The entire system is coordinated and controlled using a stateless Service Layer that handles concurrency, transactions, and ACID compliance.
The Elastic Parallel Processing technology deployed by Snowflake has all the performance advantages of MPP solutions but without the drawbacks, including:
- Data Placement: It's no longer needed. Unlike MPP systems, which permanently store data in each cluster, Snowflake stores all data in a common Storage Service that is available to all clusters. When a query is executed, the Service Layer determines the necessary data placement at run-time, leaving the designer and developer free to concentrate on the business challenges.
- Data Skew: It is also not an issue, as again, data is permanently stored separately from the compute clusters. This means if one node in the cluster is overloaded, the other free nodes can simply pick up the work needed. Furthermore, there’s no need to transfer gigabytes of data between nodes, only the pointers to data need be transferred, and data is cached in SSD storage in the compute clusters for performance.
- Secondary Indexes: These don’t place a cap on throughput because Snowflake doesn’t have traditional B-Tree indexes. Instead, data is stored in variable length Micro-Partitions of between 1-100Mb in size, and the min/max value of each column is held which supports automatic partition elimination on every column.
- Over provisioning of storage: This is not an issue, as you simply pay for the storage used on a fixed monthly charge per terabyte, with compute processing charged by the second. This, along with the ability to natively handle structured and semi-structured data in the database, makes it a great solution for a Data Lake.
The diagram below illustrates one the greatest benefits of the EPP architecture — the ability to run multiple parallel compute clusters, each sized to a different work-load but potentially running against the same data. This means ELT loads can be run separately from data-intensive data science tasks and avoid impacting the sub-second response times needed on corporate dashboards.
As a potential Enterprise Analytics Platform, the Snowflake Data Warehouse is an outstanding solution, and currently, there’s nothing on the market to compare.
Conclusion and Outlook
Since the early 1980s, the traditional database architectures reigned supreme, and as recent as 2016, Gartner estimated that Oracle, Microsoft, and IBM held around 89 percent of this $36b market. However, those headline figures hide significant disruption. The Database Engine Ranking web site ranks over 300 databases of which 50 percent are open source, and nine of the top 20 are non-relational (NoSQL) databases.
The wholesale move to cloud computing is further challenging the status with special OLTP and OLAP databases on the rise, some of which run both on-premises and in the cloud.
In conclusion, although the traditional shared-memory, shared-disk, and shared-nothing solutions have been hugely successful over the past 40 years, the development of Hadoop, NoSQL, and cloud-based solutions are challenging the dominance of Oracle, Microsoft, and IBM.
Finally, new innovations from HTAP vendors (also known as NewSQL technology) and from vendors including VoltDB and MemSQL are delivering millisecond performance with full ACID transactions, while Snowflake stands out as the best Enterprise Analytics Platform.
All in all, there’s never been a better time to work in the database industry, as it’s moved from slow evolution to rapid revolution.
Published at DZone with permission of John Ryan, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.