Best Practices for Syncing Hive Data to Apache Doris : From Scenario Matching to Performance Tuning

Learn how to efficiently sync and analyze big data by combining Hive’s storage with Doris’s real-time analytics using various sync strategies and optimizations.

Darren Xu

Jul. 17, 25 · Analysis

Likes (0)

Comment

Save

4.2K Views

In the realm of big data, Hive has long been a cornerstone for massive data warehousing and offline processing, while Apache Doris shines in real-time analytics and ad-hoc query scenarios with its robust OLAP capabilities. When enterprises aim to combine Hive’s storage prowess with Doris’s analytical agility, the challenge lies in efficiently and reliably syncing data between these two systems.

This article provides a comprehensive guide to Hive-to-Doris data synchronization, covering use cases, technical solutions, model design, and performance optimization.

Core Use Cases and Scope

When target data resides in a Hive data warehouse and requires accelerated analysis via Doris’s OLAP capabilities, key scenarios include:

Reporting and ad-hoc queries: Enable fast analytics through synchronization or federated queries.
Unified data warehouse construction: Build layered data models in Doris to enhance query efficiency.
Federated query acceleration: Directly access Hive tables from Doris to avoid frequent data ingestion.

Technical Pathways and Synchronization Modes

Synchronization Modes

Full/Incremental sync: Suitable for low-update-frequency scenarios (e.g., log data, dimension tables) where a complete data model is needed in Doris.
Federated query mode: Ideal for high-frequency, small-data-volume scenarios (e.g., real-time pricing data) to reduce storage costs and ingestion latency by querying Hive directly from Doris.

Technical Solutions Overview

Four mainstream approaches exist, chosen based on data volume, update frequency, and ETL complexity:

In-Depth Analysis of Four Synchronization Solutions

Broker Load: Asynchronous Sync for Large Datasets

Core Principle

Leverage Doris’s built-in Broker service to asynchronously load data from HDFS (where Hive data resides) into Doris, supporting full and incremental modes.

Use Case

Data scale: Suitable for datasets ranging from tens to hundreds of GB, stored in HDFS and accessible by Doris.
Performance: Syncing a 5.8GB SSB dataset (60M rows) takes 140–164 seconds, achieving 370k–420k rows/sec (cluster-dependent).
Key Operations:
Table optimization: Temporarily set replication_num=1 during ingestion for speed, then adjust to 3 replicas for durability.
Partition conversion: Convert Hive partition fields (e.g., yyyymm) to Doris-compatible date types using str_to_date.
HA configuration: Include namenode addresses in WITH BROKER for HDFS high-availability setups.

Doris On Hive: Low-Latency Federated Queries

Core Principle

Use a Catalog to access Hive metadata, enabling direct queries or INSERT INTO SELECT syncs.

Use Case

Small datasets (e.g., pricing tables) with frequent updates (minute-level), no pre-aggregation needed in Doris.
Supports text, parquet, and ORC formats (Hive ≥ 2.3.7).
Advantages:
No data landing in Doris; direct join queries between Hive and Doris tables with sub-0.2-second latency.

Spark Load: Performance Acceleration for Complex ETL

Core Principle

Offload data preprocessing to an external Spark cluster, reducing Doris’s computational pressure.

Use Case

Data cleaning: Complex data cleaning (e.g., multi-table JOINs, field transformations) with Spark accessing HDFS
Performance: 5.8GB synced in 137 seconds (440k rows/sec), outperforming Broker Load
Configuration:
Spark settings: Update Doris FE config (fe.conf) with spark_home and spark_resource_path:

    Shell
   
   enable_spark_load = true  
spark_home_default_dir = /opt/cloudera/parcels/CDH/lib/spark  
spark_resource_path = /opt/cloudera/parcels/CDH/lib/spark/spark-2x.zip

External Resource Creation:

    SQL
   
 

   CREATE EXTERNAL RESOURCE "spark0"
PROPERTIES
(
"type" = "spark",
"spark.master" = "yarn",
"spark.submit.deployMode" = "cluster",
"spark.executor.memory" = "1g",
"spark.yarn.queue" = "queue0",
"spark.hadoop.yarn.resourcemanager.address" = "hdfs://nodename:8032",
"spark.hadoop.fs.defaultFS" = "hdfs://nodename:8020",
"working_dir" = "hdfs://nodename:8020/tmp/doris",
"broker" = "broker_name_1"
); 
  

DataX: Heterogeneous Data Source Compatibility

Core Principle

Use Alibaba’s open-source DataX tool with custom hdfsreader and doriswriter plugins.

Use Case

Non-standard file formats (e.g., CSV) or non-HA HDFS environments.

Drawback: Lower performance (5.8GB in 1,421 seconds, 40k rows/sec) — use as a fallback.

Configuration example:

    JSON
   
 

   {  
  "job": {  
    "content": [  
      {  
        "reader": {  
          "name": "hdfsreader",  
          "parameter": {  
            "path": "/data/ssb/*",  
            "defaultFS": "hdfs://xxxx:9000",  
            "fileType": "text"  
          }  
        },  
        "writer": {  
          "name": "doriswriter",  
          "parameter": {  
            "feLoadUrl": ["xxxx:18040"],  
            "database": "test",  
            "table": "lineorder3"  
          }  
        }  
      }  
    ]  
  }  
}
  

Decision Tree for Solution Selection

Priority: Broker Load – Large datasets (≥ 10GB), minimal ETL, high throughput needs.
Second choice: Doris on Hive – Small datasets (< 1GB), frequent updates, federated query requirements.
Complex ETL: Spark Load – Data preprocessing needed; leverage Spark cluster resources.
Fallback: DataX – Special formats or network constraints; prioritize compatibility over performance.

Data Modeling and Storage Optimization

Data Model Selection

Aggregate model: Ideal for log statistics; stores aggregated metrics by key to reduce data volume.
Unique model: Ensures key uniqueness for slowly changing dimensions (equivalent to Replace in aggregate).
Duplicate model: Stores raw data for multi-dimensional analysis without aggregation.

Data Type Mapping

String to Varchar: Use Varchar for Doris key columns (avoid String); reserve 3x Hive field length for Chinese characters.
Type consistency: Convert Hive dates to Doris Date/DateTime and numeric types to Decimal/Float to avoid query-time conversions.

Partitioning and Bucketing Strategies

Partition keys: Reuse Hive partition fields (e.g., year-month) converted via str_to_date for pruning.
Bucket keys: Choose high-cardinality fields (e.g., order ID); keep single bucket size under 10GB to avoid skew and segment limits (default ≤ 200).

Performance Comparison and Best Practices

Optimization Tips

Small file merging: Use HDFS commands to merge small files and reduce Broker Load scanning.
Model tuning: Use a Duplicate model for fast ingestion, then create materialized views for query speed.
Monitoring: Track load status with SHOW LOAD in Doris.

Conclusion

Combining Hive and Doris unlocks synergies between offline storage and real-time analytics. By choosing the right sync strategy (prioritizing Broker/Spark Load), optimizing data models (using Aggregate for storage and bucketing for skew), and leveraging federated queries (Doris on Hive), enterprises can build efficient data architectures. Test with small datasets (e.g., SSB) before scaling to production, and stay updated with Doris community improvements (e.g., predicate pushdown) for ongoing performance gains.

Big data Extract, transform, load SPARK (programming language) Performance Apache

Opinions expressed by DZone contributors are their own.

Related

Trending

Best Practices for Syncing Hive Data to Apache Doris : From Scenario Matching to Performance Tuning

Learn how to efficiently sync and analyze big data by combining Hive’s storage with Doris’s real-time analytics using various sync strategies and optimizations.

Core Use Cases and Scope

Technical Pathways and Synchronization Modes

Synchronization Modes

Technical Solutions Overview

In-Depth Analysis of Four Synchronization Solutions

Broker Load: Asynchronous Sync for Large Datasets

Core Principle

Use Case

Doris On Hive: Low-Latency Federated Queries

Core Principle

Use Case

Spark Load: Performance Acceleration for Complex ETL

Core Principle

Use Case

DataX: Heterogeneous Data Source Compatibility

Core Principle

Use Case

Decision Tree for Solution Selection

Data Modeling and Storage Optimization

Data Model Selection

Data Type Mapping

Partitioning and Bucketing Strategies

Performance Comparison and Best Practices

Optimization Tips

Conclusion

Related

Partner Resources