HTAP: One Size Fits All?
In this post, I briefly explain the concept of HTAP and discuss how we at PingCAP designed TiDB to support OLTP and OLAP workloads.
An important idea in the database world is that specialized databases will outperform general-purpose databases. Michael Stonebraker, an A. M. Turing Award Laureate and one of the most influential people in the database world, made this argument in his paper, "One Size Fits All": An Idea Whose Time Has Come and Gone.
This is a rational judgment because it's tough enough to build a database that supports either Online Transactional Processing (OLTP) or Online Analytical Processing (OLAP) workloads, let alone one that supports both at the same time. The dilemma is that today, many users face growing demands from mixed OLTP and OLAP workloads. How do we crack this?
HTAP ≠ OLTP + OLAP
Yes, HTAP is NOT a straight integration of OLTP and OLAP.
A good analogy is a motorhome. It's sometimes called "a home on wheels," but it really isn't a combination of a car and a house. Instead, a motorhome is a unique experience: a special product to meet special needs. So is HTAP.
HTAP is designed for special scenarios, not solely OLTP, OLAP, or a combination of the two.
The Rise of HTAP Scenarios
In recent years, the demand for real-time data processing and analytics has grown rapidly. Traditional databases that specialize more in offline data processing are failing to meet users' growing needs. There are two major reasons behind this.
First, the technology stacks for real-time data processing have been constantly developing and maturing. Take the big data ecosystem as an example. Real-time computation frameworks have evolved from Apache Storm with simple semantics, to Apache Storm with Trident on top of it, and then to Apache Flink with complex semantics and built-in state storage. Only now, after all these changes, have stream processing frameworks been widely adopted in many complicated real-time analytical scenarios. These frameworks are paired with downstream sinks with different characteristics, which in turn accelerates the innovation of real-time applications.
In addition, users keep trying new ideas to digitize their business operations in real time, technology stacks are becoming easier to use, and the development of database technology further encourages the spread of real-time applications.
Second, the digital transformation process is speeding up in many traditional industries, generating new demands. Processing tasks that were once impossible are now requirements for a well-run business.
Take China's express delivery industry as an example. As its market size continues to expand, delivery orders have grown enormously. Real-time monitoring and analytics of those orders have become a must and can help optimize many aspects of operations, such as delivery routes and penalty management. Traditional offline analytics can't meet these demands, especially during large shopping festivals when transactions peak.
Today, more and more users are facing scenarios with mixed workloads, rather than pure OLTP or OLAP. We call these HTAP scenarios. Traditional OLAP solutions are too cumbersome to meet the new demands, and pure OLTP databases can't meet them either. What users really want is a solution that sits between OLTP and OLAP databases.
HTAP: PingCAP's Solution
At PingCAP, our product strategy is that TiDB is an OLTP-oriented database supplemented with an OLAP capability. That is to say, our ambition is in the fields of OLTP and HTAP. Since we are discussing HTAP today, I will skip the OLTP part and focus on HTAP.
I explained earlier why HTAP is not a straight combination of OLTP and OLAP. I can explain a bit more from our own experience.
Previously, we faced scenarios with many data hub applications. Users intended to converge data from data silos on different business lines to the same real-time centralized data store, and then deliver data services and analytics on top of it. In another case, users planned to build a read replica replicated from their OLTP database. This replica was used to support separate analytics and data serving workloads and respond to unlimited queries and analytical services.
The scenarios above require the database to:
- Have a distributed architecture similar to traditional data warehouses to support data aggregation at a similar scale.
- Ensure data consistency and real-time performance as transactional databases do, and also provide index-based data retrieval and large-scale analytics based on columnstore.
- Connect smoothly with offline data warehouses.
That is to say, the database has to focus on mixed OLTP and OLAP workloads. It can also be lighter than traditional data warehouses because it does not need to:
- Have complicated computing models inside for offline scenarios.
- Support petabytes of data storage; the amount of real-time data usually does not reach the limit of a data warehouse's cold storage.
We have also met with scenarios where the major task was transactional processing but real-time analytics was occasionally required.
HTAP is what the scenarios above are all about. TiDB's HTAP capability is designed for those requirements, and it is real-time, agile, and lightweight:
- Users can run transactions and analytical queries at the same time through a unified front end.
- Rowstore and columnstore are kept consistent in real time.
- The row and column resources are isolated, and the replication mechanism ensures load balancing and automatic fault recovery.
- Transactional services are processed in a stateless, stand-alone service node group, while analytical queries are processed in a vectorization-accelerated, Massively Parallel Processing (MPP) mode.
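To make the rowstore/columnstore idea in the list above more concrete, here is a minimal Python sketch of one logical table kept in two layouts. It is a toy model only and does not reflect how TiDB is implemented: the class `ToyHybridStore` and its methods are hypothetical names invented for illustration, and the "replication" is simply a second in-memory write.

```python
class ToyHybridStore:
    """Toy model (not TiDB's implementation): one logical table, two layouts."""

    def __init__(self, columns):
        self.columns = list(columns)
        # Row layout: primary key -> full row. Suits point reads and updates.
        self.rows = {}
        # Column layout: column name -> {primary key -> value}. Suits scans.
        self.cols = {c: {} for c in self.columns}

    def upsert(self, key, row):
        # The transactional write lands in the row layout...
        self.rows[key] = dict(row)
        # ...and is applied to the column layout in the same step, standing in
        # for real-time replication between the two stores.
        for c in self.columns:
            self.cols[c][key] = row[c]

    def point_lookup(self, key):
        # OLTP-style access: fetch one whole row by key.
        return self.rows[key]

    def column_sum(self, column):
        # OLAP-style access: scan a single column, ignoring the others.
        return sum(self.cols[column].values())


store = ToyHybridStore(["customer", "amount"])
store.upsert(1, {"customer": "a", "amount": 10})
store.upsert(2, {"customer": "b", "amount": 25})
store.upsert(1, {"customer": "a", "amount": 15})  # update an existing order

latest_order = store.point_lookup(1)
total_amount = store.column_sum("amount")
```

Both access paths see the update to order 1, which is the property the list describes: the two layouts hold the same logical data, and each serves the kind of query it is good at.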
The diagram below shows how TiDB handles OLTP and OLAP workloads simultaneously and independently.
In addition, in scenarios with mixed workloads, it is impossible to cleanly split complicated tasks and then use different types of databases to cope with them. TiDB removes the need for that split.
TiDB can be used as a data hub for users to make high-concurrency short queries with complex indexing just like they did with traditional databases. They can also use TiDB's columnstore and MPP technology on the same logical data to accelerate large-scale, real-time analytics, and its performance is never inferior to traditional specialized OLAP databases. What's more, TiDB's cost-based optimizer (CBO) can automatically route different types of queries to different storage or computing engines.
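The idea of cost-based routing can be sketched in a few lines of Python. This is an illustrative toy and not TiDB's optimizer: the cost formulas, the `SCAN_DISCOUNT` constant, and every name below are assumptions made up for this example. It only shows the general principle of estimating a cost per engine and picking the cheaper one.

```python
from dataclasses import dataclass

# Hypothetical factor: assumes a sequential columnar scan is cheaper per value
# than a row fetch. The number is invented for illustration.
SCAN_DISCOUNT = 0.1


@dataclass
class QueryProfile:
    table_rows: int         # total rows in the table
    matching_rows: int      # estimated rows that satisfy the filter
    columns_read: int       # columns the query actually needs
    total_columns: int      # columns in the table
    has_usable_index: bool  # whether an index can serve the filter


def row_store_cost(q: QueryProfile) -> float:
    # With an index, only matching rows are touched; otherwise every row is.
    # Each row fetch reads the whole row, so all columns are paid for.
    rows_touched = q.matching_rows if q.has_usable_index else q.table_rows
    return float(rows_touched * q.total_columns)


def column_store_cost(q: QueryProfile) -> float:
    # A column scan visits every row but reads only the needed columns.
    return q.table_rows * q.columns_read * SCAN_DISCOUNT


def choose_engine(q: QueryProfile) -> str:
    # Pick whichever engine has the lower estimated cost.
    return "row" if row_store_cost(q) <= column_store_cost(q) else "column"


point_lookup = QueryProfile(1_000_000, 1, 10, 10, True)
big_aggregate = QueryProfile(1_000_000, 1_000_000, 2, 10, False)

lookup_engine = choose_engine(point_lookup)
aggregate_engine = choose_engine(big_aggregate)
```

Under these assumed costs, the indexed point lookup goes to the row store and the two-column aggregate over the whole table goes to the column store. A real optimizer uses statistics and far richer cost models, but the decision has the same shape.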
In this post, I briefly explained the concept of HTAP and introduced how we designed TiDB to support OLTP and OLAP workloads.
Published at DZone with permission of Xiaoyu Ma. See the original article here.