Problem Analysis in Apache Doris StreamLoad Scenarios
Troubleshoot common Apache Doris StreamLoad import errors, including missing partitions, data type mismatches, and special character issues, for smooth syncing.
Apache Doris provides multiple ways to import data, including StreamLoad, HdfsLoad (gradually replacing BrokerLoad), RoutineLoad, MySQLLoad, and others. StreamLoad is the most commonly used method, as many data synchronization tools like Flink, Spark, and DataX use it under the hood.
Since StreamLoad is the underlying mechanism for the Flink Doris Connector, the Spark Doris Connector, and DataX, most data import issues surface at the StreamLoad layer. This article discusses common import errors and their solutions.
Common Data Import Errors
1. Partition Not Created
Schema
CREATE TABLE IF NOT EXISTS tb_dynamic_partition_test2 (
`sid` LARGEINT NOT NULL COMMENT "Student ID",
`name` VARCHAR(50) NOT NULL COMMENT "Student Name",
`class` INT COMMENT "Class",
`age` SMALLINT COMMENT "Age",
`sex` TINYINT COMMENT "Gender",
`phone` LARGEINT COMMENT "Phone",
`address` VARCHAR(500) NOT NULL COMMENT "Address",
`date` DATETIME NOT NULL COMMENT "Date of Data Entry"
)
ENGINE=olap
DUPLICATE KEY (`sid`, `name`)
PARTITION BY RANGE(`date`)()
DISTRIBUTED BY HASH (`sid`) BUCKETS 4
PROPERTIES (
"dynamic_partition.enable"="true",
"dynamic_partition.start"="-3",
"dynamic_partition.end"="1",
"dynamic_partition.time_unit"="DAY",
"dynamic_partition.prefix"="p_",
"dynamic_partition.replication_num"="1",
"dynamic_partition.buckets"="4"
);
StreamLoad Command
curl --location-trusted -u root -H "column_separator:," -T /mnt/disk2/test.csv http://ip:8030/api/test/tb_dynamic_partition_test2/_stream_load
Error
Reason: no partition for this tuple. tuple=+---------------+---------------+--------------------+--------------------+--------------------+------------------+-----------------+----------------+
Solution
This error occurs when a row's partition value falls outside every existing partition. With the dynamic partition settings above (start = -3, end = 1), only dates in that window have partitions, so rows with older or newer dates have nowhere to land. To resolve this, create the missing partition manually:
-- Disable dynamic partition
ALTER TABLE tb_dynamic_partition_test2 SET ("dynamic_partition.enable" = "false");
-- Add partition
ALTER TABLE tb_dynamic_partition_test2 ADD PARTITION p_20240426 VALUES [("2024-04-26 00:00:00"), ("2024-04-27 00:00:00")) ("replication_num"="1");
-- Re-enable dynamic partition
ALTER TABLE tb_dynamic_partition_test2 SET ("dynamic_partition.enable" = "true");
After adding the partition, data should import successfully.
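To confirm that the partition now covers the target date, you can list the table's partitions. A minimal check, assuming the MySQL client and the FE's default query port 9030 (fe_host is a placeholder):
mysql -h fe_host -P 9030 -uroot -e "SHOW PARTITIONS FROM test.tb_dynamic_partition_test2;"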
2. Data and Column Type Mismatch
Schema
CREATE TABLE IF NOT EXISTS test (
`sid` LARGEINT NOT NULL COMMENT "Student ID",
`name` VARCHAR(5) NOT NULL COMMENT "Student Name",
`class` INT COMMENT "Class",
`age` SMALLINT COMMENT "Age",
`sex` TINYINT COMMENT "Gender",
`phone` LARGEINT COMMENT "Phone",
`address` VARCHAR(5) NOT NULL COMMENT "Address",
`date` DATETIME NOT NULL COMMENT "Date of Data Entry"
)
ENGINE=olap
DUPLICATE KEY (`sid`, `name`)
DISTRIBUTED BY HASH (`sid`) BUCKETS 4
PROPERTIES (
"replication_num"="1"
);
StreamLoad Command
curl --location-trusted -u root -H "column_separator:," -T /mnt/disk2/liyuanyuan/data/test.csv http://10.16.10.6:18739/api/test/test/_stream_load
Data
1, lisixxxxxxxxxxxxxxxxxxxx, 1001, 18, 1, 1008610010, bj, 2024-04-26
Error
Reason: column_name[name], the length of input is too long than schema. first 32 bytes of input str: [lisixxxxxxxxxxxxxxxxxxxx] schema length: 5; actual length: 24;
Solution
The name column's data exceeds the length defined in the schema. To fix this, increase the length of the name column:
ALTER TABLE test MODIFY COLUMN name VARCHAR(50);
Data should now import successfully.
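If widening the column is not an option and dropping the occasional oversized row is acceptable, StreamLoad's max_filter_ratio header tolerates a fraction of filtered rows instead of failing the whole load. A minimal sketch (the 0.1 threshold is an arbitrary example):
curl --location-trusted -u root -H "column_separator:," -H "max_filter_ratio:0.1" -T /mnt/disk2/liyuanyuan/data/test.csv http://10.16.10.6:18739/api/test/test/_stream_load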
3. Mismatched Columns Between Data and Schema
Schema
CREATE TABLE IF NOT EXISTS test2 (
`sid` LARGEINT NOT NULL COMMENT "Student ID",
`name` VARCHAR(50) NOT NULL COMMENT "Student Name",
`class` INT COMMENT "Class",
`age` SMALLINT COMMENT "Age",
`sex` TINYINT COMMENT "Gender",
`phone` LARGEINT COMMENT "Phone",
`address` VARCHAR(50) NOT NULL COMMENT "Address",
`date` DATETIME NOT NULL COMMENT "Date of Data Entry"
)
ENGINE=olap
DUPLICATE KEY (`sid`, `name`)
DISTRIBUTED BY HASH (`sid`) BUCKETS 4
PROPERTIES (
"replication_num"="1"
);
Data
1, xxxxxxxxxxxxxxxxxxxxxxx, 1001, 18, 1, 1008610010, beijing, 2024-04-26, test_column
StreamLoad Command
curl --location-trusted -u root -H "column_separator:," -T /mnt/disk2/liyuanyuan/data/test2.csv http://10.16.10.6:18739/api/test/test2/_stream_load
Error
Reason: actual column number in CSV file is more than schema column number. Actual number: 9, schema column number: 8.
Solution
To fix this, add the extra column to the schema (or, if the field is not needed, discard it with a column mapping, as shown after the ALTER statement below):
ALTER TABLE test2 ADD COLUMN new_col VARCHAR(50);
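Alternatively, the extra field can be discarded with a column mapping: name every CSV field in the columns header and map the surplus one to a temporary name that matches no table column, so it is ignored. A sketch (skip_col is an arbitrary placeholder):
curl --location-trusted -u root -H "column_separator:," -H "columns:sid,name,class,age,sex,phone,address,date,skip_col" -T /mnt/disk2/liyuanyuan/data/test2.csv http://10.16.10.6:18739/api/test/test2/_stream_load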
4. Special Characters in CSV Causing Import Failure
Special characters in the data can break CSV imports, especially when field values contain the column delimiter itself. A good solution in such cases is to use the JSON format instead of CSV.
If the Spark or Flink engine is used for the import, set the following connector parameters.
Solution
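// StreamLoad properties passed through to the connector (e.g., via the Flink
// Doris Connector's DorisExecutionOptions.setStreamLoadProp), telling the sink
// to ship rows as line-delimited JSON instead of CSV: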
properties.setProperty("format", "json");
properties.setProperty("read_json_by_line", "true");
Alternatively, use the proper escape sequences for handling special characters in CSV files.
Handling Special Characters in CSV Files
1. Data Containing Quotation Marks
When dealing with CSV files where data is enclosed in quotation marks, it’s important to configure StreamLoad with the appropriate settings.
Example Schema
CREATE TABLE IF NOT EXISTS test3 (
`sid` LARGEINT NOT NULL COMMENT "Student ID",
`name` VARCHAR(50) NOT NULL COMMENT "Student Name",
`class` INT COMMENT "Class",
`age` SMALLINT COMMENT "Age",
`sex` TINYINT COMMENT "Gender",
`phone` LARGEINT COMMENT "Phone",
`address` VARCHAR(50) NOT NULL COMMENT "Address",
`date` DATETIME NOT NULL COMMENT "Date of Data Entry"
)
ENGINE=olap
DUPLICATE KEY (`sid`, `name`)
DISTRIBUTED BY HASH (`sid`) BUCKETS 4
PROPERTIES (
"replication_num"="1"
);
Data
"1","xxxxxxx","1001","18","1","1008610010","beijing","2024-04-26"
StreamLoad Command
curl --location-trusted -u root -H "column_separator:," -H "enclose:\"" -H "trim_double_quotes:true" -T /path/to/test3.csv http://ip:8030/api/test/test3/_stream_load
Solution
enclose: specifies the enclosure character that wraps field values.
trim_double_quotes: when set to true, strips the outermost double quotation marks from each field when the CSV file is split.
2. Data Containing Partial Quotes
Data
"1","xx,x,x,xxx",1001,18,"1",1008610010,"bei,jing",2024-04-26
StreamLoad Command
curl --location-trusted -u root -H "column_separator:," -H "enclose:\"" -H "trim_double_quotes:true" -T /mnt/disk2/liyuanyuan/data/test4.csv http://10.16.10.6:18739/api/test/test4/_stream_load
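No additional settings are needed here: the enclose character applies only to fields that are actually wrapped in it, so quoted fields containing the separator (such as "bei,jing") stay intact, while unquoted fields are parsed as usual.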
Handling Windows Line Endings
If data imported from Windows (with \r\n line endings) causes issues where queries do not return the expected results, check whether Windows line endings are present.
Solution
Use od -c to check for \r\n, and specify the correct line delimiter during import:
-H "line_delimiter:\r\n"
Using Expressions in StreamLoad
Example 1
CREATE TABLE test_streamload (
user_id BIGINT NOT NULL COMMENT "User ID",
name VARCHAR(20) COMMENT "User Name",
age INT COMMENT "User Age"
)
DUPLICATE KEY (user_id)
DISTRIBUTED BY HASH (user_id) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
StreamLoad Command
curl --location-trusted -u "root:" -T /path/to/data.csv -H "format:csv" -H "column_separator:," -H "columns:user_id,tmp,age,name=upper(tmp)" http://ip:8030/api/test/test_streamload/_stream_load
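Here the columns header binds the three CSV fields to user_id, a temporary column tmp, and age, and then derives name as upper(tmp). A hypothetical input row 1,alice,25 would load as user_id=1, name=ALICE, age=25.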
Example 2
CREATE TABLE test_streamload2 (
c1 INT,
c2 INT,
c3 VARCHAR(20)
)
DUPLICATE KEY (c1)
DISTRIBUTED BY HASH (c1) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
StreamLoad Command
curl --location-trusted -u "root:" -T /path/to/data.csv -H "format:csv" -H "column_separator:," -H "columns:c1,c2,A,B,C,c3=CONCAT(A,B,C)" http://ip:8030/api/test/test_streamload2/_stream_load
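In this example, the five CSV fields map to c1, c2, and the temporary columns A, B, and C; c3 is then derived as CONCAT(A, B, C). A hypothetical input row 1,2,x,y,z would load as c1=1, c2=2, c3=xyz.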
Conclusion
By understanding and addressing these common data import errors, you can significantly reduce the time spent troubleshooting and ensure smoother data synchronization with Apache Doris.