Guide to Repairing Damaged Apache Doris Tablets

Learn how to identify and repair damaged tablets in Apache Doris using built-in tools. Covers replica validation, recovery steps, and handling missing rowsets.

By Darren Xu · Jun. 03, 25 · Tutorial

A tablet in Apache Doris is damaged. Can it be repaired? Will data be lost?

It's really hard to say.

Why is it hard to say?

It depends mainly on how many replicas the table has.

Apache Doris achieves high availability of data through multiple replicas. For example, you can specify three replicas when creating a table with either of the following properties:

-- specify 3 replicas
"replication_allocation" = "tag.location.default: 3"
-- or
"replication_num" = "3"

If one replica is damaged, users will hardly notice, because Doris repairs it automatically.
However, if two replicas are damaged, the table can no longer be read from or written to, and manual repair is required.

But all of this assumes a high-availability setup. What if there is only one replica?
Doris defaults to three replicas. That is, if the replica count is not specified during table creation, the table still has three replicas. The single-replica situation arises only when the user explicitly sets 1 replica (which does happen, for cost reasons or in test scenarios).

How to Judge Whether a Tablet Is Damaged?

Generally, a tablet is damaged when a query fails with the following error:

Failed to get scan range, no queryable replica found in tablet: xxxx

Or with this one:

Failed to initialize storage reader,..., fail to Find path in version_graph

Note: The second error can occur because a version may be lost during replica migration. This was fixed in 2.0.3, so users of older versions are advised to upgrade as soon as possible.

In either case, some tablets of the corresponding table are in an abnormal state and need to be repaired using the methods in the following sections.

How to Repair a Damaged Tablet?

When either error occurs, the error message carries the tablet_id of the affected tablet. Suppose the tablet_id is 606202; you can repair it as follows. (In practice, replace it with your own damaged tablet_id.)
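As a minimal sketch, the tablet_id can be pulled out of the first error message with a regular expression. The function name is hypothetical, and the pattern assumes the message wording quoted above; other Doris versions may phrase it differently.

```python
import re
from typing import Optional

# Matches the "no queryable replica" error quoted in this article.
# The exact wording is an assumption and may differ between Doris versions.
NO_REPLICA_RE = re.compile(r"no queryable replica found in tablet:\s*(\d+)")


def extract_tablet_id(error_message: str) -> Optional[int]:
    """Return the tablet_id carried by the error message, or None if absent."""
    match = NO_REPLICA_RE.search(error_message)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    msg = "Failed to get scan range, no queryable replica found in tablet: 606202"
    print(extract_tablet_id(msg))
```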

Query Failure Situation

1. Run show tablet xxxx (here, show tablet 606202) and get the detail cmd from the output.


2. Execute the detail cmd and find the BE that hosts the replica (the compact status URL contains the IP of the BE).


Output of the detail cmd


3. Execute curl on the compact status URL from step 2. In this example, it is curl http://be_ip:http_port/api/compaction/show?tablet_id=606202.

Image of curl command execution

Check the rowsets and missing_rowsets of this replica. Focus on the maximum version among the rowsets (here it is 34) and on missing_rowsets. In this example, the replica holds versions 0 through 34 with no gaps in between (missing_rowsets is empty).

Note: The special version here is actually the visible version of the partition. It can also be viewed with show partitions from <table_name> where PartitionName = '';

The special version in the query statement is [0, 35], and this BE does not contain version 35. So version 35 needs to be added to this BE.
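The comparison above can be sketched in a few lines. This assumes that each entry in the "rowsets" array of the compaction status output begins with a version range such as "[0-34]"; that format, the sample entries, and the function names are assumptions for illustration, so check them against your own curl output.

```python
import re
from typing import List

# Assumed entry format: "[start-end] <other fields>", e.g. "[2-34] ...".
ROWSET_RANGE_RE = re.compile(r"^\[(\d+)-(\d+)\]")


def max_rowset_version(rowsets: List[str]) -> int:
    """Return the highest end version found among the rowset entries."""
    ends = []
    for entry in rowsets:
        match = ROWSET_RANGE_RE.match(entry.strip())
        if match:
            ends.append(int(match.group(2)))
    if not ends:
        raise ValueError("no rowset version ranges found")
    return max(ends)


def versions_to_fill(rowsets: List[str], special_version: int) -> List[int]:
    """List the versions from (maximum version + 1) through the special version."""
    top = max_rowset_version(rowsets)
    return list(range(top + 1, special_version + 1))


if __name__ == "__main__":
    sample = ["[0-1] sample entry", "[2-34] sample entry"]  # made-up entries
    print(versions_to_fill(sample, 35))
```

With the numbers from this example (maximum version 34, special version 35), the only version to add is 35.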

If missing_rowsets in the result of step 3 is not empty, as in the following example:

Result of missing version of command.

This indicates that some versions are indeed lost. In a three-replica scenario, check whether the replicas on the other BEs are in the same situation. If all of them have lost these versions and the logs of the corresponding BEs contain the following information:

An image of BE logs

It means that all three replicas are damaged and data has indeed been lost. The safest approach is to re-import the data for the corresponding partition.

If losing a small amount of data is acceptable for your use case, you can repair the tablet as described in the following steps.


4. First, confirm whether automatic repair is possible.

In a multi-replica scenario, check whether there are healthy replicas. A replica is healthy when version >= special version && last failed version = -1 && isBad = false, and missing rowsets is empty when you curl its compaction status.
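The health condition can be written as a small predicate. This is only a sketch: the ReplicaState class and its field names are hypothetical containers for the values you read from the detail cmd output and the compaction status, not part of any Doris client library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReplicaState:
    """Hypothetical holder for values copied from the detail cmd and curl output."""
    version: int
    last_failed_version: int
    is_bad: bool
    missing_rowsets: List[str] = field(default_factory=list)


def is_healthy(replica: ReplicaState, special_version: int) -> bool:
    """Apply the health rule: version, last failed version, isBad, missing rowsets."""
    return (
        replica.version >= special_version
        and replica.last_failed_version == -1
        and not replica.is_bad
        and not replica.missing_rowsets
    )


if __name__ == "__main__":
    good = ReplicaState(version=35, last_failed_version=-1, is_bad=False)
    stale = ReplicaState(version=34, last_failed_version=35, is_bad=False)
    print(is_healthy(good, 35), is_healthy(stale, 35))
```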

If there is such a replica, set the replica that reports the query error as bad. Refer to the command: https://doris.apache.org/docs/sql-manual/sql-statements/table-and-view/data-and-status-management/SET-REPLICA-STATUS
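For illustration, a helper like the one below can build the statement. The property names follow the SET REPLICA STATUS page linked above as far as I know; verify them against the documentation for your Doris version. The function name and the backend_id 10001 are made up for this sketch.

```python
def set_replica_bad_sql(tablet_id: int, backend_id: int) -> str:
    """Build an ADMIN SET REPLICA STATUS statement that marks one replica as bad."""
    # int() guards against accidentally splicing arbitrary text into the SQL.
    return (
        "ADMIN SET REPLICA STATUS PROPERTIES("
        f'"tablet_id" = "{int(tablet_id)}", '
        f'"backend_id" = "{int(backend_id)}", '
        '"status" = "bad");'
    )


if __name__ == "__main__":
    # 10001 is a placeholder; use the BackendId shown by the detail cmd.
    print(set_replica_bad_sql(606202, 10001))
```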

Wait a while (it may take a minute or two), and then execute the detail cmd from step 2 again. If all replicas are healthy (version >= special version && last failed version = -1 && isBad = false) and missing rowsets is empty when you curl the compact status of each, the repair is successful. Execute "select count(*) from table" to confirm that the table is readable.

If there is no problem, the automatic repair is successful, and you don't need to read further. If there are still problems, continue reading.

5. Methods for filling empty rowsets

If all three replicas are damaged, or the table has a single replica, you can repair the tablet by filling in empty rowsets.

In this example, the repair URL uses start_version = 35 and end_version = 35.

This example lacks only one rowset. In practice, more may be missing (every version from the maximum version + 1 through the special version). Call the repair method once for each missing rowset.

Refer to the command: https://doris.apache.org/docs/admin-manual/open-api/be-http/pad-rowset
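A sketch of generating one repair URL per missing version follows. The path and query parameter names are taken from the pad-rowset page linked above as I understand it; the function name and the port 8040 are assumptions, and the HTTP method to use should be confirmed on that page.

```python
from typing import List
from urllib.parse import urlencode


def pad_rowset_urls(be_host: str, http_port: int, tablet_id: int,
                    max_version: int, special_version: int) -> List[str]:
    """Return one pad_rowset URL per version in (max_version, special_version]."""
    urls = []
    for version in range(max_version + 1, special_version + 1):
        query = urlencode({
            "tablet_id": tablet_id,
            "start_version": version,
            "end_version": version,
        })
        urls.append(f"http://{be_host}:{http_port}/api/pad_rowset?{query}")
    return urls


if __name__ == "__main__":
    # be_ip and 8040 are placeholders for your BE address and HTTP port.
    for url in pad_rowset_urls("be_ip", 8040, 606202, 34, 35):
        print(url)
```

Each URL pads exactly one version, which matches the advice to call the repair method once per missing rowset.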

Filling in empty rowsets this way makes the data queryable again, but the data in the missing versions is lost, so query results will contain less data.

6. After repair, judge whether the last fail version needs to be modified.

After the repair, execute "show tablet xxx" again and check whether the last fail version of this replica equals -1. If its versions are all filled in but last fail version = version + 1, you also need to manually set the last fail version to -1.

Refer to the command: https://doris.apache.org/docs/sql-manual/sql-statements/table-and-view/data-and-status-management/SET-REPLICA-VERSION
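The check and the matching statement might look like this. The property names are my reading of the SET REPLICA VERSION page linked above, so confirm them for your Doris version; the function names and backend_id 10001 are hypothetical.

```python
def needs_last_failed_reset(version: int, last_failed_version: int) -> bool:
    """True when the replica is filled in but last fail version = version + 1."""
    return last_failed_version == version + 1


def reset_last_failed_sql(tablet_id: int, backend_id: int) -> str:
    """Build an ADMIN SET REPLICA VERSION statement that sets last_failed_version to -1."""
    return (
        "ADMIN SET REPLICA VERSION PROPERTIES("
        f'"tablet_id" = "{int(tablet_id)}", '
        f'"backend_id" = "{int(backend_id)}", '
        '"last_failed_version" = "-1");'
    )


if __name__ == "__main__":
    if needs_last_failed_reset(version=35, last_failed_version=36):
        print(reset_last_failed_sql(606202, 10001))
```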

Older versions of Doris may not include this SQL. If this SQL is not supported and the table is single-replica or all of its replicas are damaged, the tablet cannot be recovered.

If there is no problem, use "select count(*) from table_xx" to check whether the table is readable. If it is, the repair is complete.

Special Scenario Handling

Consider a logging scenario: single-replica storage is used and a certain tablet is damaged, but losing some data is acceptable as long as the table can still be queried, and no separate repair is required. What should be done?

Just set the variables skip_missing_version and skip_bad_tablet to true. Both default to false.
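As a rough sketch, the two variables can be set from any Python DB-API cursor connected to Doris over the MySQL protocol. The function name is hypothetical, and the sketch sets the variables at session level, which is an assumption; the article does not say whether session or global scope is intended.

```python
SKIP_STATEMENTS = (
    "SET skip_missing_version = true;",
    "SET skip_bad_tablet = true;",
)


def enable_skip_damaged(cursor) -> None:
    """Run both SET statements on a DB-API cursor (session scope assumed)."""
    for statement in SKIP_STATEMENTS:
        cursor.execute(statement)


if __name__ == "__main__":
    class PrintCursor:
        """Stand-in cursor that prints statements instead of sending them."""
        def execute(self, sql: str) -> None:
            print(sql)

    enable_skip_damaged(PrintCursor())
```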

My SQL variables

Summary

Those are the more common solutions. What if the tablet still can't be fixed, or you're not sure how to proceed?

Reach out to the Doris community members. They are all very enthusiastic!

If you have repaired the tablet with the methods above but still want to know why the damage occurred, you can bring the corresponding logs to the community members and let them help with the analysis.


Published at DZone with permission of Darren Xu. See the original article here.
