Guide to Repairing Damaged Apache Doris Tablets
Learn how to identify and repair damaged tablets in Apache Doris using built-in tools. Covers replica validation, recovery steps, and handling missing rowsets.
A tablet in Apache Doris is damaged. Can it be repaired? Will data be lost?
It's really hard to say.
Why is it hard to say?
This is mainly due to the following reasons:
Apache Doris relies on multiple replicas for high availability of data. For example, you can specify three replicas when creating a table with either of the following properties:
// specify 3 replicas
"replication_allocation" = "tag.location.default: 3"
//or
"replication_num"="3"
Doris defaults to three replicas, so a table created without either property still has three. With multiple replicas, a damaged replica can usually be rebuilt from a healthy one. The risk of data loss arises mainly when the user explicitly sets 1 replica (which does happen, for cost reasons or in test scenarios).
How to Judge Whether a Tablet Is Damaged?
A tablet is probably damaged if a query fails with an error like one of the following:
Failed to get scan range, no queryable replica found in tablet: xxxx
Failed to initialize storage reader,..., fail to Find path in version_graph
Note: One cause of the second error is that a version may be lost during replica migration, a bug that was fixed in 2.0.3. (Users of older versions should upgrade as soon as possible.)
When either error appears, some tablets in the corresponding table are in an abnormal state and need to be repaired using the methods in the following sections.
How to Repair a Damaged Tablet?
When this happens, the error message carries the tablet_id as a series of digits. Suppose the tablet_id is 606202; you can repair it as follows. (In practice, replace it with your own damaged tablet_id.)
Query Failure Situation
1. Run show tablet xxxx (here, 606202) and get the detail cmd from the output.

2. Execute the detail cmd and find the replica on the BE in question (the compact status url of each replica contains the ip of its BE).
3. Execute curl <the compact status url from step 2>. In this example, it is curl http://be_ip:http_port/api/compaction/show?tablet_id=606202.

Check the rowsets and missing_rowsets of this replica. Focus on the maximum version of the rowsets (here it is 34) and on missing_rowsets. In this example, the rowsets of this replica cover versions 0 ~ 34, and no version is missing in the middle (missing_rowsets is empty).
Note: The special version here is actually the visible version of the partition. It can also be viewed through show partitions from <table_name> where PartitionName = '';
The special version in the query statement is [0, 35], and this BE does not contain version 35. So version 35 needs to be added to this BE.
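The comparison in step 3 can be sketched in Python. This is a minimal sketch, not part of Doris: it assumes the compaction status response has been parsed into a dict with "rowsets" and "missing_rowsets" keys, and that each rowset entry starts with a "[start-end]" version range. The helper name and the sample entries are hypothetical, so check them against the actual output of your BE.

```python
import re

# Assumption: each "rowsets" entry starts with a "[start-end]" version range.
_ROWSET_RANGE = re.compile(r"^\[(\d+)-(\d+)\]")


def summarize_compaction_status(status: dict, visible_version: int) -> dict:
    """Summarize one replica's parsed compaction status (hypothetical helper)."""
    max_version = -1
    for entry in status.get("rowsets", []):
        match = _ROWSET_RANGE.match(entry.strip())
        if match:
            # The end of the range is the highest version in that rowset.
            max_version = max(max_version, int(match.group(2)))
    return {
        "max_version": max_version,
        "missing_rowsets": list(status.get("missing_rowsets", [])),
        # Versions above the replica's maximum that the query still needs.
        "versions_behind": list(range(max_version + 1, visible_version + 1)),
    }


# Hypothetical sample shaped like the example in the text (rowsets 0 ~ 34).
sample = {
    "rowsets": ["[0-1] 0 DATA NONOVERLAPPING", "[2-34] 1 DATA NONOVERLAPPING"],
    "missing_rowsets": [],
}
summary = summarize_compaction_status(sample, visible_version=35)
print(summary)
```

With a visible version of 35 and a maximum rowset version of 34, the only version the replica is behind on is 35, which matches the example above.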
If missing_rowsets in the result of step 3 is not empty, for example, as in the following:

This indicates that some versions are indeed lost. In a three-replica scenario, check whether the replicas on the other BEs are in the same situation. If the versions are lost on all of them and the following information is in the logs of the corresponding BEs:

It means that all three replicas are damaged, so data has indeed been lost. The safest approach is to re-import the data for the corresponding partition.
If losing a small amount of data is acceptable for subsequent use, you can follow the repair steps in the following sections.
4. First, confirm whether automatic repair is possible.
In a multi-replica scenario, check whether there are healthy replicas. A healthy replica means version >= special version && last failed version = -1 && isBad = false, and missing rowsets is empty when you curl its compact status url.
If there is such a replica, set the replica that reported the query error as bad. Refer to the command: https://doris.apache.org/docs/sql-manual/sql-statements/table-and-view/data-and-status-management/SET-REPLICA-STATUS
Wait for a while (it may take a minute or two), and then execute the detail cmd from step 2 again. If all replicas are healthy (version >= special version && last failed version = -1 && isBad = false) and missing rowsets is empty when you curl each compact status url, the repair is successful. Execute "select count(*) from table" to check that the query works.
If there is no problem, the automatic repair is successful, and you don't need to read further. If there are still problems, continue reading.
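The health condition used in step 4 can be written as a small predicate. This is a minimal sketch: the ReplicaState class and its field names are hypothetical stand-ins for the values you read from the detail cmd output and from the compact status url.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaState:
    """Values collected by hand for one replica (hypothetical structure)."""
    version: int
    last_failed_version: int
    is_bad: bool
    missing_rowsets: list = field(default_factory=list)


def is_healthy(replica: ReplicaState, special_version: int) -> bool:
    """version >= special version && last failed version = -1 && isBad = false,
    with no missing rowsets."""
    return (
        replica.version >= special_version
        and replica.last_failed_version == -1
        and not replica.is_bad
        and len(replica.missing_rowsets) == 0
    )


good = ReplicaState(version=35, last_failed_version=-1, is_bad=False)
stale = ReplicaState(version=34, last_failed_version=35, is_bad=False)
print(is_healthy(good, 35), is_healthy(stale, 35))
```

Automatic repair needs at least one replica for which the predicate is true. If it is false for every replica, continue with step 5.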
5. Methods for filling empty rowsets
If all three replicas are damaged, or in a single-replica situation, you can repair the tablet by filling in empty rowsets.
In this example, the repair url uses start_version = 35 and end_version = 35.
This example lacks only one rowset. In practice, more may be missing (the missing rowsets run from the maximum version + 1 ~ special version). Call the repair method once for each missing rowset.
Refer to the command: https://doris.apache.org/docs/admin-manual/open-api/be-http/pad-rowset
Filling in empty rowsets makes the data queryable again, but the data in the missing versions is lost, so queries will return less data than before.
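The "one call per missing rowset" rule can be sketched as a URL generator. This is a minimal sketch assuming the pad_rowset endpoint takes tablet_id, start_version and end_version as query parameters, as described in the linked documentation. The helper name is hypothetical, and port 8040 is only an assumed value for http_port. The code builds the URLs but does not send any request.

```python
from urllib.parse import urlencode


def pad_rowset_urls(be_host, http_port, tablet_id, max_version, special_version):
    """Return one pad_rowset URL per missing version (hypothetical helper)."""
    urls = []
    # Missing rowsets run from maximum version + 1 up to the special version.
    for version in range(max_version + 1, special_version + 1):
        query = urlencode({
            "tablet_id": tablet_id,
            "start_version": version,
            "end_version": version,
        })
        urls.append(f"http://{be_host}:{http_port}/api/pad_rowset?{query}")
    return urls


# Values from the example: maximum version 34, special version 35.
urls = pad_rowset_urls("be_ip", 8040, 606202, 34, 35)
for url in urls:
    print(url)
```

Each URL is then called with a POST request, for example curl -X POST "<url>", against the BE that holds the damaged replica.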
6. After the repair, check whether the last fail version needs to be modified.
Execute "show tablet xxx" again and check whether the last fail version of this replica is equal to -1. If all of its versions have been filled in but last fail version = version + 1, the last fail version also needs to be manually changed to -1.
Refer to the command: https://doris.apache.org/docs/sql-manual/sql-statements/table-and-view/data-and-status-management/SET-REPLICA-VERSION
Older versions of Doris may not include this SQL statement. If the statement is not supported and the table is single-replica, or all replicas of a multi-replica table are damaged, the tablet cannot be recovered.
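The check in step 6 and the matching statement can be sketched as follows. This is a minimal sketch based on the ADMIN SET REPLICA VERSION syntax in the linked documentation. Both helper names and the backend_id value 10001 are hypothetical, and the statement is only available in Doris versions that support this command.

```python
def needs_last_failed_reset(version: int, last_failed_version: int) -> bool:
    """True when last fail version = version + 1, as described in step 6."""
    return last_failed_version == version + 1


def reset_last_failed_sql(tablet_id: int, backend_id: int) -> str:
    """Build an ADMIN SET REPLICA VERSION statement (hypothetical helper)."""
    return (
        "ADMIN SET REPLICA VERSION PROPERTIES("
        f'"tablet_id" = "{tablet_id}", '
        f'"backend_id" = "{backend_id}", '
        '"last_failed_version" = "-1");'
    )


# Example: the replica is filled up to version 35 but still reports
# last fail version 36.
if needs_last_failed_reset(version=35, last_failed_version=36):
    print(reset_last_failed_sql(606202, 10001))
```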
Finally, execute "select count(*) from table_xx" to check whether the table is readable. If it is, the tablet is back to normal.
Special Scenario Handling
Consider a logging scenario: single-replica storage is used and a certain tablet is damaged, but losing some data is acceptable as long as the table can be queried, and no separate repair is required. What should be done?
Just set the variables skip_missing_version and skip_bad_tablet to true. The default is false.

Summary
Those are the more common solutions. What if the tablet still can't be fixed, or you don't know how to proceed?
Take the initiative and reach out to the Doris community members. They are all very enthusiastic!
If you have repaired the tablet using the methods above but still want to understand why the damage occurred, you can also bring the corresponding logs to the community members and ask them to help with the analysis.
Published at DZone with permission of Darren Xu. See the original article here.