How Doris and Hive Work Together to Maximize Data Analysis Efficiency
At 3 a.m., the office is filled only with the dim glow of the computer screens. Data engineer Xiao Ming is struggling with two "heavyweights" — Doris and Hive.
At 3 a.m., the office is lit only by the dim glow of computer screens. Data engineer Xiao Ming is struggling with two "heavyweights": Doris and Hive. "Export, clean, import..." He mechanically repeats these steps between the two components until he starts seeing stars.
This scene is all too common in data teams, and it makes one wonder: do we really have to shuffle data by hand for the rest of our lives? Just then, Doris extended an "olive branch" to Hive: the Hive Catalog. It's like arranging a perfect marriage for this "data couple," allowing Doris to read and write Hive data directly so the two systems can "fly together." Whether the data lives on HDFS or object storage, and whether the workload is a simple query or a complex analysis, one Catalog covers it!
This amazing feature caught Xiao Ming's attention, and he could finally say goodbye to those cumbersome data synchronization tasks. Let's uncover this "life-saving tool" for data engineers together!
The Perfect Encounter of Doris and Hive
Late at night, Xiao Ming was staring at the screen, worried. As a data engineer, he faced a tricky problem: The company's data was scattered between Doris and Hive systems, and every cross-system data analysis required manual export and import, which was cumbersome and inefficient.
"If only Doris could directly read and write Hive data..." he muttered to himself.
Xiao Ming is not the only one with this concern. With the explosive growth of data, enterprise data architectures have become increasingly complex, with data storage scattered across various systems. How to connect these data silos and achieve unified data access and analysis has become a common technical pain point.
The good news is that Apache Doris addresses this problem with its Hive Catalog feature. It acts as a bridge between Doris and Hive, letting the two systems work together without manual data movement.
Starting from version 2.1.3, Doris can use the Hive Catalog not only to query and import data from Hive but also to create tables and write data back, which makes a unified lakehouse architecture practical.
The core value of Hive Catalog is that it provides a unified data access layer. Data developers no longer need to care where the data is physically stored; all data operations can be done through Doris. For example, you can register a Hive Metastore as a Catalog in Doris:
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.16.47:7004'
);
Once set up, you can operate Hive tables just like regular Doris tables. Not only does it support queries, but it also allows write operations such as INSERT and CREATE TABLE AS SELECT. The system automatically handles complex details such as partition management and file format conversion.
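To make this concrete, here is a minimal sketch of what working through the catalog might look like. It assumes the catalog is named hive as above and that Doris is version 2.1.3 or later for the write statements. The database and table names (sales_db, orders, orders_daily, orders_2024, and internal.demo_db.customers) are hypothetical placeholders, not objects from the original setup.

```sql
-- Hypothetical names throughout: sales_db, orders, orders_daily,
-- orders_2024, and internal.demo_db.customers are placeholders.

-- Switch to the Hive catalog and pick a Hive database.
SWITCH hive;
USE sales_db;

-- Query a Hive table as if it were a Doris table.
SELECT order_date, COUNT(*) AS order_cnt
FROM orders
GROUP BY order_date;

-- Join a Hive table with a Doris internal table using fully qualified names.
SELECT c.customer_name, SUM(o.amount) AS total_amount
FROM hive.sales_db.orders o
JOIN internal.demo_db.customers c ON o.customer_id = c.customer_id
GROUP BY c.customer_name;

-- Write an aggregate back into an existing Hive table.
INSERT INTO hive.sales_db.orders_daily
SELECT order_date, COUNT(*) AS order_cnt
FROM hive.sales_db.orders
GROUP BY order_date;

-- Create a new Hive table from a query result (CTAS).
CREATE TABLE orders_2024 ENGINE=hive AS
SELECT * FROM orders WHERE order_date >= '2024-01-01';
```

The fully qualified catalog.database.table form is handy for cross-system joins, since it avoids switching catalogs mid-session.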
Even more excitingly, Doris also covers security. By integrating Kerberos authentication and Ranger permission management, it fits into the security setup enterprises already run. Access control can be as fine-grained as the column level, which helps keep data access compliant.
Now, Xiao Ming finally smiled. With Hive Catalog, his daily work efficiency has improved significantly. Cross-system data analysis has become so simple, as smooth as operating within the same system.
This is just the beginning. In the following sections, we will explore more powerful features of Hive Catalog. Let's take a look at the new chapter of Doris + Hive data lakehouse integration!
Core Features of Doris-Hive Catalog
Xiao Ming recently faced a new challenge. The company's data analysis scenarios are becoming increasingly complex, with both traditional HDFS storage and the introduction of object storage.
How can Doris elegantly handle these different storage media? Let's delve into the powerful features of Doris Hive Catalog in a simple and understandable way.
Diverse Storage Support
Each storage system has its own strengths. HDFS + Hive suits large-scale offline processing of full historical data, while object storage offers high scalability and low cost.
Hive Catalog provides a unified access interface that hides the differences between the underlying storage systems:
-- Connect to S3
CREATE CATALOG hive_s3 PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://172.0.0.1:9083",
"s3.endpoint" = "s3.us-east-1.amazonaws.com",
"s3.region" = "us-east-1",
"s3.access_key" = "ak",
"s3.secret_key" = "sk",
"use_path_style" = "true"
);
-- Optional properties:
-- s3.connection.maximum: Maximum number of S3 connections, default 50
-- s3.connection.request.timeout: S3 request timeout, default 3000ms
-- s3.connection.timeout: S3 connection timeout, default 1000ms
-- Connect to OSS
CREATE CATALOG hive_oss PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://172.0.0.1:9083",
"oss.endpoint" = "oss.oss-cn-beijing.aliyuncs.com",
"oss.access_key" = "ak",
"oss.secret_key" = "sk"
);
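The examples above cover object storage. For the HDFS case mentioned at the start of this section, a sketch of a catalog over an HA HDFS cluster might look like the following. The nameservice name (your-nameservice), the NameNode IDs (nn1, nn2), the user name, and all addresses are placeholders to replace with your cluster's values. The dfs.* keys are the usual Hadoop HDFS client settings, passed through as catalog properties.

```sql
-- Connect to Hive on HDFS with NameNode HA.
-- All names and addresses below are placeholders.
CREATE CATALOG hive_hdfs PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://172.0.0.1:9083",
"hadoop.username" = "hive",
"dfs.nameservices" = "your-nameservice",
"dfs.ha.namenodes.your-nameservice" = "nn1,nn2",
"dfs.namenode.rpc-address.your-nameservice.nn1" = "172.21.0.2:8088",
"dfs.namenode.rpc-address.your-nameservice.nn2" = "172.21.0.3:8088",
"dfs.client.failover.proxy.provider.your-nameservice" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);
```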
Intelligent Metadata Management
Doris employs an intelligent metadata caching mechanism to provide high-performance queries while ensuring data consistency:
Local Cache Policy
Doris caches table metadata locally to reduce how often it has to access HMS. When the cache exceeds its size threshold, Doris evicts entries using an LRU (Least Recently Used) strategy.
Smart Refresh
[Notification Event Diagram]
By subscribing to HMS Notification Events, Doris can detect metadata changes promptly. Separately from event subscription, you can also configure a timed refresh when creating the Catalog:
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
'metadata_refresh_interval_sec' = '3600'
);
You can also manually refresh as needed:
-- Refresh the specified Catalog.
REFRESH CATALOG ctl1 PROPERTIES("invalid_cache" = "true");
-- Refresh the specified Database.
REFRESH DATABASE [ctl.]db1 PROPERTIES("invalid_cache" = "true");
-- Refresh the specified Table.
REFRESH TABLE [ctl.][db.]tbl1;
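The square brackets above mark optional name qualifiers. As a concrete sketch, suppose a Hive-side job has just added a partition to a table and you want Doris to see it immediately rather than wait for the next refresh. The names hive, sales_db, and orders, and the dt partition column, are hypothetical.

```sql
-- Hypothetical names: catalog hive, database sales_db, table orders,
-- partition column dt.

-- Invalidate the cached metadata for this one table.
REFRESH TABLE hive.sales_db.orders;

-- Later queries pick up the newly added partition.
SELECT COUNT(*) FROM hive.sales_db.orders WHERE dt = '2024-06-01';
```

Refreshing a single table is the narrowest option. Refreshing a whole database or catalog invalidates more cache, so it is best kept for broader changes.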
Enterprise-Level Security Features
Security is always a top priority in enterprise data management. Hive Catalog also provides a complete security solution:
Ranger Permission Control
Apache Ranger is a security framework for enabling, monitoring, and managing comprehensive data access security across the Hadoop platform.
Doris supports using Apache Ranger to authorize access to a specified external Hive Catalog.
Currently, it supports Ranger authorization at the database, table, and column levels, but it does not support encryption, row-level permissions, Data Mask, or similar functions.
After configuring the FE environment, add the following properties when creating the Catalog:
-- access_controller.properties.ranger.service.name refers to the type of service
-- For example, hive, hdfs, etc. It is not the value of ranger.plugin.hive.service.name in the configuration file.
"access_controller.properties.ranger.service.name" = "hive",
"access_controller.class" = "org.apache.doris.catalog.authorizer.ranger.hive.RangerHiveAccessControllerFactory",
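Put together, a complete statement with these two properties might look like the sketch below. The catalog name hive_ranger and the metastore address are placeholders. It also assumes the FE-side Ranger configuration mentioned above is already in place.

```sql
-- Placeholder catalog name and metastore address.
-- Assumes the FE has already been configured for Ranger.
CREATE CATALOG hive_ranger PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://172.0.0.1:9083",
"access_controller.properties.ranger.service.name" = "hive",
"access_controller.class" = "org.apache.doris.catalog.authorizer.ranger.hive.RangerHiveAccessControllerFactory"
);
```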
Kerberos Authentication
In addition to integrating with Ranger, Doris Hive Catalog also supports seamless integration with the existing Kerberos authentication system in enterprises. For example:
CREATE CATALOG hive_krb PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
'hive.metastore.sasl.enabled' = 'true',
'hive.metastore.kerberos.principal' = 'your-hms-principal',
'hadoop.security.authentication' = 'kerberos',
'hadoop.kerberos.keytab' = '/your-keytab-filepath/your.keytab',
'hadoop.kerberos.principal' = 'your-principal@YOUR.COM',
'yarn.resourcemanager.principal' = 'your-rm-principal'
);
Xiao Ming can now flexibly choose storage methods and security modes according to different business needs, truly achieving unified management and efficient analysis of Doris + Hive data.
The boundaries between data lakes and data warehouses are blurring, and Doris has built a bridge connecting the two worlds through Hive Catalog. With the continuous evolution of technology, we look forward to seeing more innovative application scenarios.