How Doris + Hudi Turned the Impossible Into the Everyday
Want to know how this "data giant ship" navigates the waves? Follow this article to uncover the amazing story of Doris and Hudi, the "dream team" of data.
In the world of big data, there's a legend that goes like this: a data scientist, constantly worried about query performance and working late every night to optimize SQL, discovered the "perfect match" of Doris and Hudi and immediately kicked into "supersonic" mode, with query speeds so fast that even the boss couldn't believe it!
Today, this legend is widely circulated in the data community. Many data engineers jokingly say that processing data used to be like crossing a river in a canoe — slow and risky. Now, with the "giant ship" of Doris + Hudi, they can not only sail smoothly but also elegantly travel through time to view every exciting moment of historical data.
Exploring the Perfect Integration of Doris and Hudi
Have you heard the saying, "Haste makes waste"? In the big data realm, this couldn't be more true. To analyze petabyte-scale data both quickly and accurately, relying solely on a database is no longer sufficient. Just as a martial arts master needs to "cultivate both internal and external skills," modern data architecture needs the combination of data lakes and data warehouses.
The combination of Apache Doris and Apache Hudi is like the "double swords" of a martial arts duo. One focuses on high-performance queries, while the other excels at real-time data management. Together, they are rewriting the rules of big data analytics.
Smart Query Optimization
Imagine you're a librarian. If someone is looking for a book, would you search every single shelf from the first to the last? Of course not. You'd look at the index card and directly locate the corresponding shelf. Doris does the same when reading Hudi data.
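For CoW (Copy-on-Write) tables, which are like neatly arranged new books, Doris directly uses the native Parquet reader to achieve "one-step access." For MoR (Merge-on-Read) tables, which are like books with correction slips attached, Doris adapts split by split: a split whose base file has no pending updates can still go through the native reader, while a split with logged updates has to be merged at read time first.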
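To make the CoW/MoR difference concrete, here is a toy Python model. It is not Doris or Hudi code: `read_split`, `base_rows`, and `log_rows` are hypothetical names, and the model simply assumes that a split is a base file plus zero or more logged updates keyed by a primary key. With no logged updates, the base rows are already the current state; otherwise, later log records override base rows that share the same key.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def read_split(base_rows: List[Row], log_rows: List[Row], key: str) -> Tuple[str, List[Row]]:
    """Return (reader_used, rows) for one toy split."""
    if not log_rows:
        # Nothing to merge: the base file alone is the current state.
        return "native", list(base_rows)
    merged = {row[key]: row for row in base_rows}
    for row in log_rows:
        # Log records are applied in order, so the last write wins.
        merged[row[key]] = row
    return "merge", [merged[k] for k in sorted(merged)]


base = [{"c_custkey": 31, "c_name": "Ann"}, {"c_custkey": 32, "c_name": "Bob"}]
log = [{"c_custkey": 32, "c_name": "Bobby"}]

print(read_split(base, [], "c_custkey"))   # base rows returned untouched
print(read_split(base, log, "c_custkey"))  # key 32 replaced by the logged update
```

The point of the sketch is the branch at the top: the cheap path is available whenever there is nothing to merge, which is why a freshly compacted MoR table can behave much like a CoW table at query time.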
Let's look at a real-world example: An e-commerce customer's order analysis system with daily data volumes reaching the TB level. After adopting the Doris + Hudi solution, over 90% of queries achieved "lightning-fast" responses — millisecond-level latency. The secret lies in Doris' smart data access strategy:
-- See how fast this query is
SELECT * FROM customer_mor WHERE c_custkey = 32;
Running the same statement with EXPLAIN shows the following in the execution plan:
# hudiNativeReadSplits indicates how many
# split files are read using the parquet native reader
hudiNativeReadSplits=66/101
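This means that out of 101 data splits, 66 were read using the high-speed native reader, and only the remaining 35 had to take the slower merge path. It's like a library where you only need to check correction slips for about a third of the books, so performance naturally "soars."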
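If you want to track this counter across queries, a small helper can pull it out of the plan text. The following is a minimal sketch, assuming the counter always appears in the `hudiNativeReadSplits=<native>/<total>` form shown above; the function name `native_read_share` is made up for illustration.

```python
import re
from typing import Tuple


def native_read_share(plan_text: str) -> Tuple[int, int, float]:
    """Extract (native_splits, total_splits, share) from EXPLAIN output text."""
    match = re.search(r"hudiNativeReadSplits=(\d+)/(\d+)", plan_text)
    if match is None:
        raise ValueError("no hudiNativeReadSplits counter found")
    native, total = int(match.group(1)), int(match.group(2))
    # Guard against an empty scan so we never divide by zero.
    share = native / total if total else 0.0
    return native, total, share


native, total, share = native_read_share("hudiNativeReadSplits=66/101")
print(f"{native} of {total} splits read natively ({share:.1%}); {total - native} needed merging")
# prints: 66 of 101 splits read natively (65.3%); 35 needed merging
```

A falling share over time is a hint that more splits are carrying unmerged updates.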
The Elegant Evolution Beyond Speed
Doris's support for Hudi is not just about speed. It's more like a versatile "data artist" that can present data in various forms:
Time Travel
Remember time travel in sci-fi movies? In the world of Doris + Hudi, this is not fiction. Every data change leaves a "time mark," allowing you to go back to any point in time to view the data state.
Xiao Zhang, a user in the finance industry, recently experienced this firsthand:
"After the system update last Friday, a transaction record changed mysteriously. When the boss asked for an explanation, my heart sank. Fortunately, with the time travel feature, a single SQL query took me back to before the update, and I immediately found the issue. This move impressed the boss, who exclaimed '666!' (Chinese internet slang for 'awesome')."
-- Time travel: FOR TIME AS OF reads the table as of a historical snapshot time
-- (the time format is consistent with the one on the Hudi official website)
SELECT * FROM financial_trans
FOR TIME AS OF '2024-12-18 22:00:00';
It's like having a "time machine" that can go back to any historical version of the data.
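Conceptually, a time travel query picks the latest committed version at or before the requested time. The sketch below is a toy Python model of that lookup, not how Doris or Hudi implement it; `TIMELINE`, `as_of`, and the transaction values are all invented for illustration.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical commit timeline: (commit time, table state after that commit),
# kept sorted by commit time.
TIMELINE = [
    (datetime(2024, 12, 18, 9, 0), {"tx-1": 100}),
    (datetime(2024, 12, 18, 21, 30), {"tx-1": 100, "tx-2": 250}),
    (datetime(2024, 12, 20, 18, 0), {"tx-1": 100, "tx-2": 999}),
]


def as_of(timeline, when):
    """Return the table state from the latest commit at or before `when`."""
    times = [commit_time for commit_time, _ in timeline]
    idx = bisect_right(times, when)
    if idx == 0:
        # The requested time predates the first commit: nothing existed yet.
        return {}
    return timeline[idx - 1][1]


# The state as of 22:00 on Dec 18 still shows tx-2 with its original value.
print(as_of(TIMELINE, datetime(2024, 12, 18, 22, 0)))
```

Because old versions are looked up rather than reconstructed by hand, comparing "before" and "after" a suspicious change comes down to two queries with different timestamps.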
Incremental Awareness
Incremental reads work like a "sixth sense" that precisely captures every data movement, returning only the rows that changed after a given point in time:
-- Doris provides @incr syntax support for Incremental Read
-- Get the most recent data changes
SELECT * FROM customer_mor@incr('beginTime'='xxx');
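The idea behind an incremental read can be sketched as a filter on each row's commit time. This is a toy Python model under stated assumptions: `incremental_read` is a hypothetical helper, the rows are invented, the begin bound is treated as exclusive and the optional end bound as inclusive, and commit times are fixed-width `yyyyMMddHHmmss` strings so that plain string comparison orders them correctly. `_hoodie_commit_time` is the metadata column Hudi adds to each record.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]


def incremental_read(rows: List[Row], begin_time: str, end_time: Optional[str] = None) -> List[Row]:
    """Return rows committed after begin_time and, if given, up to end_time."""
    return [
        row
        for row in rows
        if row["_hoodie_commit_time"] > begin_time
        and (end_time is None or row["_hoodie_commit_time"] <= end_time)
    ]


ROWS = [
    {"c_custkey": 31, "_hoodie_commit_time": "20241218090000"},
    {"c_custkey": 32, "_hoodie_commit_time": "20241218213000"},
    {"c_custkey": 33, "_hoodie_commit_time": "20241220180000"},
]

# Everything committed after the first commit: customers 32 and 33.
print(incremental_read(ROWS, "20241218090000"))
```

A downstream job that remembers the last commit time it processed can pass it as the next begin time and pick up only what is new.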
With Doris' enhancement, Hudi data tables have transformed into "all-round players." Whether it's real-time analysis, historical review, or incremental processing, they can handle it with ease. It's like upgrading an ordinary sword to a divine weapon that can "fly by itself."
The Art of Lakehouse Integration With Doris + Hudi
At a technical salon of a major internet company, architect Old Wang told an interesting story: "Back when we first used Hudi, it was like a marathon runner: it could go the distance, but something was missing. Since we met Doris, it's like taking a high-speed train, faster and more efficient!"
This analogy resonated with everyone in the room. Indeed, the combination of Doris + Hudi is shining in various fields, for example:
Ad Click Analysis
An ad platform processes hundreds of millions of click data daily. Previously, running a conversion rate analysis would take until midnight, but now the results are available before lunch. The key is that the data is immediately available for querying, making strategy adjustments more flexible.
-- Ad click conversion per ad over the last hour
-- (counts are aggregated so that GROUP BY ad_id returns one row per ad)
SELECT ad_id,
       SUM(click_count) AS click_count,
       SUM(convert_count) AS convert_count,
       SUM(convert_count) / SUM(click_count) AS cvr
FROM ad_stats@incr('beginTime'='earliest')
WHERE event_time >= date_sub(now(), interval 1 hour)
GROUP BY ad_id;
The Lakehouse architecture is changing the game in the data world. As Old Wang said, "In the past, we built bridges between data lakes and data warehouses; now, Doris + Hudi have built a highway."
As a tech guru said, "The data world is always changing, but the pursuit of ultimate performance remains constant." Let's continue on this journey of data exploration!
Stay tuned for more interesting, useful, and valuable content in the next issue!