DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Optimizing Databricks Spark Pipelines Using Declarative Patterns
  • Boost Your Spark Jobs: How Photon Accelerates Apache Spark Performance
  • Apache Spark 3 to Apache Spark 4 Migration: What Breaks, What Improves, What's Mandatory
  • Hadoop on AmpereOne Reference Architecture

Trending

  • Designing API-First EMR Architectures in .NET: Enabling Modular Growth in Compliance-Driven Systems
  • From Indicators to Insights: Automating IOC Enrichment Using Python and Threat Feeds
  • DevOps and Platform Engineering Readiness Checklist: Everything Needed for a Scalable, Secure, High-Velocity Delivery Platform
  • How to Format Articles for DZone
  1. DZone
  2. Data Engineering
  3. Big Data
  4. Best Practices for Syncing Hive Data to Apache Doris :  From Scenario Matching to Performance Tuning

Best Practices for Syncing Hive Data to Apache Doris :  From Scenario Matching to Performance Tuning

Learn how to efficiently sync and analyze big data by combining Hive’s storage with Doris’s real-time analytics using various sync strategies and optimizations.

By 
Darren Xu user avatar
Darren Xu
·
Jul. 17, 25 · Analysis
Likes (0)
Comment
Save
Tweet
Share
4.0K Views

Join the DZone community and get the full member experience.

Join For Free

In the realm of big data, Hive has long been a cornerstone for massive data warehousing and offline processing, while Apache Doris shines in real-time analytics and ad-hoc query scenarios with its robust OLAP capabilities. When enterprises aim to combine Hive’s storage prowess with Doris’s analytical agility, the challenge lies in efficiently and reliably syncing data between these two systems. 

This article provides a comprehensive guide to Hive-to-Doris data synchronization, covering use cases, technical solutions, model design, and performance optimization.

Core Use Cases and Scope

When target data resides in a Hive data warehouse and requires accelerated analysis via Doris’s OLAP capabilities, key scenarios include:

  • Reporting and ad-hoc queries: Enable fast analytics through synchronization or federated queries.
  • Unified data warehouse construction: Build layered data models in Doris to enhance query efficiency.
  • Federated query acceleration: Directly access Hive tables from Doris to avoid frequent data ingestion.

Technical Pathways and Synchronization Modes

Synchronization Modes

  • Full/Incremental sync: Suitable for low-update-frequency scenarios (e.g., log data, dimension tables) where a complete data model is needed in Doris.
  • Federated query mode: Ideal for high-frequency, small-data-volume scenarios (e.g., real-time pricing data) to reduce storage costs and ingestion latency by querying Hive directly from Doris.

Technical Solutions Overview

Four mainstream approaches exist, chosen based on data volume, update frequency, and ETL complexity:

In-Depth Analysis of Four Synchronization Solutions

Broker Load: Asynchronous Sync for Large Datasets

Core Principle

Leverage Doris’s built-in Broker service to asynchronously load data from HDFS (where Hive data resides) into Doris, supporting full and incremental modes.

Use Case

  • Data scale: Suitable for datasets ranging from tens to hundreds of GB, stored in HDFS and accessible by Doris.
  • Performance: Syncing a 5.8GB SSB dataset (60M rows) takes 140–164 seconds, achieving 370k–420k rows/sec (cluster-dependent).
    Key Operations:
  • Table optimization: Temporarily set replication_num=1 during ingestion for speed, then adjust to 3 replicas for durability.
  • Partition conversion: Convert Hive partition fields (e.g., yyyymm) to Doris-compatible date types using str_to_date.
  • HA configuration: Include namenode addresses in WITH BROKER for HDFS high-availability setups.

Doris On Hive: Low-Latency Federated Queries

Core Principle

Use a Catalog to access Hive metadata, enabling direct queries or INSERT INTO SELECT syncs.

Use Case

  • Small datasets (e.g., pricing tables) with frequent updates (minute-level), no pre-aggregation needed in Doris.
  • Supports text, parquet, and ORC formats (Hive ≥ 2.3.7).
    Advantages:
  • No data landing in Doris; direct join queries between Hive and Doris tables with sub-0.2-second latency.

Spark Load: Performance Acceleration for Complex ETL

Core Principle

Offload data preprocessing to an external Spark cluster, reducing Doris’s computational pressure.

Use Case

  • Data cleaning: Complex data cleaning (e.g., multi-table JOINs, field transformations) with Spark accessing HDFS
  • Performance: 5.8GB synced in 137 seconds (440k rows/sec), outperforming Broker Load
    Configuration:
  • Spark settings: Update Doris FE config (fe.conf) with spark_home and spark_resource_path:
Shell
 
enable_spark_load = true  
spark_home_default_dir = /opt/cloudera/parcels/CDH/lib/spark  
spark_resource_path = /opt/cloudera/parcels/CDH/lib/spark/spark-2x.zip  


External Resource Creation:

SQL
 
CREATE EXTERNAL RESOURCE "spark0"
PROPERTIES
(
"type" = "spark",
"spark.master" = "yarn",
"spark.submit.deployMode" = "cluster",
"spark.executor.memory" = "1g",
"spark.yarn.queue" = "queue0",
"spark.hadoop.yarn.resourcemanager.address" = "hdfs://nodename:8032",
"spark.hadoop.fs.defaultFS" = "hdfs://nodename:8020",
"working_dir" = "hdfs://nodename:8020/tmp/doris",
"broker" = "broker_name_1"
); 


DataX: Heterogeneous Data Source Compatibility

Core Principle

Use Alibaba’s open-source DataX tool with custom hdfsreader and doriswriter plugins.

Use Case

Non-standard file formats (e.g., CSV) or non-HA HDFS environments.

Drawback: Lower performance (5.8GB in 1,421 seconds, 40k rows/sec) — use as a fallback.

Configuration example:

JSON
 
{  
  "job": {  
    "content": [  
      {  
        "reader": {  
          "name": "hdfsreader",  
          "parameter": {  
            "path": "/data/ssb/*",  
            "defaultFS": "hdfs://xxxx:9000",  
            "fileType": "text"  
          }  
        },  
        "writer": {  
          "name": "doriswriter",  
          "parameter": {  
            "feLoadUrl": ["xxxx:18040"],  
            "database": "test",  
            "table": "lineorder3"  
          }  
        }  
      }  
    ]  
  }  
}


Decision Tree for Solution Selection

  • Priority: Broker Load  –  Large datasets (≥ 10GB), minimal ETL, high throughput needs.
  • Second choice: Doris on Hive  –  Small datasets (< 1GB), frequent updates, federated query requirements.
  • Complex ETL: Spark Load  –  Data preprocessing needed; leverage Spark cluster resources.
  • Fallback: DataX  –  Special formats or network constraints; prioritize compatibility over performance.

Data Modeling and Storage Optimization

Data Model Selection

  • Aggregate model: Ideal for log statistics; stores aggregated metrics by key to reduce data volume.
  • Unique model: Ensures key uniqueness for slowly changing dimensions (equivalent to Replace in aggregate).
  • Duplicate model: Stores raw data for multi-dimensional analysis without aggregation.

Data Type Mapping

  • String to Varchar: Use Varchar for Doris key columns (avoid String); reserve 3x Hive field length for Chinese characters.
  • Type consistency: Convert Hive dates to Doris Date/DateTime and numeric types to Decimal/Float to avoid query-time conversions.

Partitioning and Bucketing Strategies

  • Partition keys: Reuse Hive partition fields (e.g., year-month) converted via str_to_date for pruning.
  • Bucket keys: Choose high-cardinality fields (e.g., order ID); keep single bucket size under 10GB to avoid skew and segment limits (default ≤ 200).

Performance Comparison and Best Practices

Performance comparison and best practices


Optimization Tips

  • Small file merging: Use HDFS commands to merge small files and reduce Broker Load scanning.
  • Model tuning: Use a Duplicate model for fast ingestion, then create materialized views for query speed.
  • Monitoring: Track load status with SHOW LOAD in Doris.

Conclusion

Combining Hive and Doris unlocks synergies between offline storage and real-time analytics. By choosing the right sync strategy (prioritizing Broker/Spark Load), optimizing data models (using Aggregate for storage and bucketing for skew), and leveraging federated queries (Doris on Hive), enterprises can build efficient data architectures. Test with small datasets (e.g., SSB) before scaling to production, and stay updated with Doris community improvements (e.g., predicate pushdown) for ongoing performance gains.

Big data Extract, transform, load SPARK (programming language) Performance Apache

Opinions expressed by DZone contributors are their own.

Related

  • Optimizing Databricks Spark Pipelines Using Declarative Patterns
  • Boost Your Spark Jobs: How Photon Accelerates Apache Spark Performance
  • Apache Spark 3 to Apache Spark 4 Migration: What Breaks, What Improves, What's Mandatory
  • Hadoop on AmpereOne Reference Architecture

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook