Lakehouse: Starting With Apache Doris + S3 Tables

This is the first article in the “Lakehouse: What’s the Big Deal?” series, in which I will periodically explore the Lakehouse. Your comments and discussion are welcome.

By Mingyu Chen · Mar. 25, 2025 · Tutorial

Among the numerous AI products launched at the AWS re:Invent 2024 conference, Amazon S3 Tables might have seemed inconspicuous. However, for database professionals, the emergence of S3 Tables signifies that the era of modular data analytics has arrived.

S3 Tables are a special type of S3 bucket with built-in support for Apache Iceberg. This new bucket type, the Table Bucket, joins the previously introduced General Purpose Bucket and Directory Bucket. In other words, you can now create a table bucket as easily as you would create a standard S3 bucket.

S3 Tables come with the Iceberg table format pre-integrated and expose Iceberg-compatible APIs externally. Beyond the table format, they also include built-in table management services that automatically optimize and organize table data. For instance, tasks like merging small files, which previously required an engine such as Spark, can now be handled directly by S3 Tables (at a cost, of course). If standard S3 buckets are rough, unfinished houses, then table buckets are fully furnished, move-in-ready homes.

According to the official website, S3 Tables offer the following features:

  • Scalability. With a simple purchase, you get a table storage system with the scalability and availability of S3.
  • Fully managed. Built-in table management services, such as compaction, snapshot management, and garbage file cleanup, automatically optimize query efficiency and cost over time.
  • Enhanced performance. Compared to storing Iceberg tables in general-purpose S3 buckets, query performance can be up to 3 times faster, and transactions per second can be up to 10 times higher.
  • Seamless integration. S3 Tables integrate seamlessly with AWS Glue and commonly used query and analytics engines.

It seems like things have suddenly become much simpler:

  1. Purchase a table bucket.
  2. Launch a query engine.

“Congratulations, you now have a Lakehouse!”

Using S3 tables

“However, is it really that simple?”
 “Yes, but not entirely.”

Next, I’ll walk you through how to seamlessly build a simplified version of a Lakehouse using Apache Doris + S3 Tables. After that, we’ll circle back to our main topic: “Lakehouse: What’s the Big Deal?”

Hands-On Guide to Apache Doris + S3 Tables

S3 Tables are compatible with the Iceberg API, and Apache Doris already has robust support for the Iceberg table format. Therefore, only minimal code modifications are required. However, due to the limited documentation available for S3 Tables, I encountered a few unexpected issues along the way.

Official support for S3 Tables in Apache Doris is scheduled for release in Q1 2025.

1. Purchase a Table Bucket

This step is very simple. Click to create, give it a name, and you’re done.

Purchase a table bucket

Once created, you will receive an ARN, which will later serve as the warehouse address for Iceberg.

Currently, almost no operations can be performed on a table bucket through the AWS console. You can’t even delete it or view its contents. All operations must be carried out via the AWS CLI or analytical components like Amazon EMR, Athena, or Redshift (and some of these only support read operations, not table creation or writes).

However, we will soon have another option: Apache Doris.

2. Launch an Apache Doris Cluster

There is a quick-start guide for Apache Doris, but support for S3 Tables has not been released yet, so here I run Doris from my own dev branch.

3. Embark on the S3 Tables Journey

From here on, things become incredibly straightforward. All operations can be completed using SQL within Doris.

Creating a Catalog

SQL
 
CREATE CATALOG iceberg_s3 PROPERTIES (
    'type' = 'iceberg',
    'iceberg.catalog.type' = 's3tables',
    'warehouse' = 'arn:aws:s3tables:us-east-1:169698000000:bucket/doris-s3-table-bucket',
    's3.region' = 'us-east-1',
    's3.endpoint' = 's3.us-east-1.amazonaws.com',
    's3.access_key' = 'AKIASPAWQE3ITEXAMPLE',
    's3.secret_key' = 'l4rVnn3hCmwEXAMPLE/lht4rMIfbhVfEXAMPLE'
);
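
If the catalog was created successfully, it should appear in Doris's catalog list. As a quick sanity check (a minimal step, assuming default Doris behavior):

SQL
 
SHOW CATALOGS;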


Creating a Database

SQL
 
SWITCH iceberg_s3;
CREATE DATABASE my_namespace;
SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| my_namespace       |
| mysql              |
+--------------------+


Creating a Table, Inserting, and Querying

SQL
 
CREATE TABLE partition_table (
 `ts` DATETIME COMMENT 'ts',
 `id` INT COMMENT 'col1',
 `pt1` STRING COMMENT 'pt1',
 `pt2` STRING COMMENT 'pt2'
)
PARTITION BY LIST (day(ts), pt1, pt2) ();
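
Here, the PARTITION BY LIST clause maps onto Iceberg partition transforms: day(ts) becomes a day transform, while pt1 and pt2 become identity partitions. To inspect the resulting definition, something along these lines should work (a hedged check, assuming Doris exposes external Iceberg tables through its usual statement):

SQL
 
SHOW CREATE TABLE partition_table;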


SQL
 
INSERT INTO partition_table VALUES
("2024-01-01 08:00:00", 1000, "us-east", "PART1"),
("2024-01-02 10:00:00", 1002, "us-sout", "PART2");

SELECT * FROM partition_table;
+----------------------------+------+---------+-------+
| ts                         | id   | pt1     | pt2   |
+----------------------------+------+---------+-------+
| 2024-01-02 10:00:00.000000 | 1002 | us-sout | PART2 |
| 2024-01-01 08:00:00.000000 | 1000 | us-east | PART1 |
+----------------------------+------+---------+-------+


Time Travel

We can insert another batch of data and then use the iceberg_meta() table function to view the snapshots in Iceberg:

SQL
 
INSERT INTO partition_table VALUES
("2024-01-03 08:00:00", 1000, "us-east", "PART1"),
("2024-01-04 10:00:00", 1002, "us-sout", "PART2");


SQL
 
SELECT * FROM iceberg_meta(
  'table' = 'iceberg_s3.my_namespace.partition_table',
  'query_type' = 'snapshots'
)\G
*************************** 1. row ***************************
 committed_at: 2025-01-15 23:27:01
  snapshot_id: 6834769222601914216
    parent_id: -1
    operation: append
manifest_list: s3://80afcb3f-6edf-46f2-7fhehwj6cengfwc7n6iz7ipzakd7quse1b--table-s3/metadata/snap-6834769222601914216-1-a6b2230d-fc0d-4c1d-8f20-94bb798f27b1.avro
      summary: {"added-data-files":"2","added-records":"2","added-files-size":"5152","changed-partition-count":"2","total-records":"2","total-files-size":"5152","total-data-files":"2","total-delete-files":"0","total-position-deletes":"0","total-equality-deletes":"0","iceberg-version":"Apache Iceberg 1.6.1 (commit 8e9d59d299be42b0bca9461457cd1e95dbaad086)"}
*************************** 2. row ***************************
 committed_at: 2025-01-15 23:30:00
  snapshot_id: 5670090782912867298
    parent_id: 6834769222601914216
    operation: append
manifest_list: s3://80afcb3f-6edf-46f2-7fhehwj6cengfwc7n6iz7ipzakd7quse1b--table-s3/metadata/snap-5670090782912867298-1-beeed339-be96-4710-858b-f39bb01cc3ff.avro
      summary: {"added-data-files":"2","added-records":"2","added-files-size":"5152","changed-partition-count":"2","total-records":"4","total-files-size":"10304","total-data-files":"4","total-delete-files":"0","total-position-deletes":"0","total-equality-deletes":"0","iceberg-version":"Apache Iceberg 1.6.1 (commit 8e9d59d299be42b0bca9461457cd1e95dbaad086)"}


Use the FOR VERSION AS OF syntax to query a specific snapshot:

SQL
 
SELECT * FROM partition_table FOR VERSION AS OF 5670090782912867298;
+----------------------------+------+---------+-------+
| ts                         | id   | pt1     | pt2   |
+----------------------------+------+---------+-------+
| 2024-01-04 10:00:00.000000 | 1002 | us-sout | PART2 |
| 2024-01-03 08:00:00.000000 | 1000 | us-east | PART1 |
| 2024-01-01 08:00:00.000000 | 1000 | us-east | PART1 |
| 2024-01-02 10:00:00.000000 | 1002 | us-sout | PART2 |
+----------------------------+------+---------+-------+


SQL
 
SELECT * FROM partition_table FOR VERSION AS OF 6834769222601914216;
+----------------------------+------+---------+-------+
| ts                         | id   | pt1     | pt2   |
+----------------------------+------+---------+-------+
| 2024-01-02 10:00:00.000000 | 1002 | us-sout | PART2 |
| 2024-01-01 08:00:00.000000 | 1000 | us-east | PART1 |
+----------------------------+------+---------+-------+
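
Besides snapshot IDs, Doris also supports time-based travel on Iceberg tables with FOR TIME AS OF, which resolves to the snapshot that was current at the given timestamp. A hedged example, using a timestamp between the two commit times shown by iceberg_meta() above (so it should resolve to the first snapshot):

SQL
 
SELECT * FROM partition_table FOR TIME AS OF "2025-01-15 23:28:00";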


Of course, the data written by Doris can be queried by other engines like Spark:

Scala
 
scala> spark.sql("SELECT * FROM s3tablesbucket.my_namespace.`partition_table` ").show()
+-------------------+----+-------+-----+
|                 ts|  id|    pt1|  pt2|
+-------------------+----+-------+-----+
|2024-01-02 10:00:00|1002|us-sout|PART2|
|2024-01-01 08:00:00|1000|us-east|PART1|
|2024-01-04 10:00:00|1002|us-sout|PART2|
|2024-01-03 08:00:00|1000|us-east|PART1|
+-------------------+----+-------+-----+


So far, we have completed our first hands-on experience with Doris + S3 Tables. I will share more S3 Tables features, such as automatic compaction and snapshot management, after further exploration.

As this scenario shows, Doris serves purely as a stateless query engine: the metadata and data (those of the S3 Tables, in this case) all live in the table bucket. We no longer need to worry about housekeeping details such as storage space, replicas, and data balancing, and we can easily share the same data with other systems (such as Spark) to run different query workloads.
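
One practical note for this shared setup: Doris caches external Iceberg metadata, so snapshots committed by another engine may not be visible immediately. A hedged sketch of the usual refresh commands (cache TTLs can also be tuned on the catalog):

SQL
 
-- Refresh everything under the catalog
REFRESH CATALOG iceberg_s3;
-- Or refresh a single table
REFRESH TABLE iceberg_s3.my_namespace.partition_table;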

Lakehouse: What’s the Big Deal? (1)

OK, after this brief experience, let’s get back to the topic: “Lakehouse: What’s the Big Deal?”

Since Databricks proposed the concept, the Lakehouse has become almost a commodity: virtually every analytical database and query engine claims to be a Lakehouse (Doris included). So, what exactly is a Lakehouse? What problems does it actually solve? By summarizing my observations and thoughts in this field over the years, I hope to share my understanding of the Lakehouse with you.

With the launch of S3 Tables, let’s start by talking about two features of the Lakehouse: Data Sharing, and Diverse Workloads and Collaboration.

Data Sharing

From distributed systems to cloud-native architectures, the most significant change has been the shift from purely Shared-Nothing designs to Shared-Data. The core idea of Shared-Nothing is divide and conquer: it combines many “small” nodes into a large distributed cluster and moves computation close to the nodes that store the data. Nodes process data independently, avoiding network transfer, reducing mutual dependencies, and maximizing parallel processing. In the big data era, Shared-Nothing solved the problem of expensive, hard-to-scale computing power.

However, as cloud infrastructure has matured, object storage represented by S3, with its low cost, 99.999999999% (eleven nines) durability, and nearly unlimited capacity, has become the standard storage infrastructure of the cloud era.

Data sharing rests on two foundations. First, storage systems represented by S3 enable "One Data, Access Everywhere": object storage provides a unified access endpoint, allowing any system to access the same data from any location.

Second, the diversity of access interfaces allows different types of data to live on S3: from the most basic PUT and GET operations that read and write files with object semantics, to file-system semantics layered on top of S3 such as the S3A protocol, and now table semantics provided by S3 Tables. S3 fully meets a data lake's requirements for data diversity.

Therefore, whether the data is an AI training file, an audio or video file, a log, or a two-dimensional relational table, it can be stored in a single unified system and shared from there.

Data sharing

Data sharing addresses issues such as data consistency across systems, along with the conversion cost and latency of moving data back and forth, enabling a Single Source of Truth.

Diverse Workloads and Collaboration

When we talk about Iceberg, we usually discuss its role as an upgraded replacement for the Hadoop-era data stack, and naturally how distributed computing engines such as Spark, Doris, and Trino handle large-scale data with it. However, as Iceberg appears in more application scenarios, we find that it is no longer limited to storing and processing massive, distributed data. Combined with data-sharing capabilities, it makes diverse analytical workloads and collaborative processing possible.

For example, in big data processing, users can use Spark/Hive to perform batch data processing on Iceberg; use Flink/RisingWave to stream OLTP database changes into Iceberg through CDC or conduct stream processing; and use Doris/Trino to perform interactive query analysis on Iceberg. 

On the other hand, users can also use DuckDB/Daft to access Iceberg on their Macs at any time and explore the data they are interested in. They can also process AI training data on Iceberg through the PyIceberg project. Excitingly, all of this happens on the same dataset. We no longer need to rebuild clusters, create tables, and import data for temporary data analysis needs. Nor do we need to worry about data credibility and inconsistent data quality. Everything becomes quite natural and flexible.

Diverse workloads and collaboration

Beyond organizing previously scattered, low-quality data in a standardized way through table semantics, Iceberg further enhances collaborative data processing through version management. Readers of this article are likely familiar with Git and have surely used GitHub, the most typical example of multi-person collaborative development through version management: with branches, tags, and MVCC (multi-version concurrency control), we can work on the same project together without interfering with each other.

Iceberg provides similar Branch and Tag capabilities, allowing users to create snapshots and branches on table data (instead of redundantly storing multiple copies) and run different data processing on branches without affecting the main data.

For example, a DBA can create a branch to independently verify the performance of different data indexes, while another user can create a snapshot that retains only 30 days of data for historical analysis.
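
As a concrete illustration, here is roughly what that branching workflow looks like using the Iceberg extensions for Spark SQL (a sketch, not Doris syntax; the branch and tag names are made up, and the exact DDL may vary by Iceberg version):

SQL
 
-- Create a branch for experiments and a tag on the current snapshot (kept for 30 days)
ALTER TABLE s3tablesbucket.my_namespace.partition_table CREATE BRANCH index_experiment;
ALTER TABLE s3tablesbucket.my_namespace.partition_table CREATE TAG monthly_baseline RETAIN 30 DAYS;

-- Write to the branch without touching the main table
INSERT INTO s3tablesbucket.my_namespace.partition_table.branch_index_experiment
VALUES (TIMESTAMP '2024-01-05 08:00:00', 1003, 'us-west', 'PART3');

-- Read the branch explicitly
SELECT * FROM s3tablesbucket.my_namespace.partition_table VERSION AS OF 'index_experiment';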

Branch and Tag

To Be Continued

Built on data sharing, diverse workloads, and collaboration, the Lakehouse gives us a new paradigm for data analysis. S3 Tables further simplify the setup and maintenance of the underlying storage, delivering a nearly out-of-the-box experience.

In follow-up articles, I will discuss more Lakehouse features, as well as where real-time data warehouses and real-time query engines like Doris fit in this architecture. Everyone is welcome to leave comments and join the discussion.


