DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Last call! Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • Reporting in Microservices: How To Optimize Performance
  • Why Database Migrations Take Months and How to Speed Them Up
  • Unmasking Entity-Based Data Masking: Best Practices 2025
  • Fixing Common Oracle Database Problems

Trending

  • Kubeflow: Driving Scalable and Intelligent Machine Learning Systems
  • Building Enterprise-Ready Landing Zones: Beyond the Initial Setup
  • Mastering Fluent Bit: Installing and Configuring Fluent Bit on Kubernetes (Part 3)
  • Streamlining Event Data in Event-Driven Ansible
  1. DZone
  2. Data Engineering
  3. Big Data
  4. How to Improve Hive Query Performance With Hadoop

How to Improve Hive Query Performance With Hadoop

Apache Hive is a powerful tool for analyzing data. It's very important that you know how to improve the performance of query when you are processing petabytes of data.

By 
Jitendra Bafna user avatar
Jitendra Bafna
DZone Core CORE ·
Mar. 03, 17 · Tutorial
Likes (9)
Comment
Save
Tweet
Share
84.8K Views

Join the DZone community and get the full member experience.

Join For Free

Apache Hive is a data warehouse built on the top of Hadoop for data analysis, summarization, and querying. Hive provides an SQL-like interface to query data stored in various data sources and file systems.

Use Tez Engine

Apache Tez Engine is an extensible framework for building high-performance batch processing and interactive data processing. It is coordinated by YARN in Hadoop. Tez improved the MapReduce paradigm by increasing the processing speed and maintaining the MapReduce ability to scale to petabytes of data.

Tez engine can be enabled in your environment by setting hive.execution.engine to tez:

set hive.execution.engine=tez;

Use Vectorization

Vectorization improves the performance by fetching 1,024 rows in a single operation instead of fetching single row each time. It improves the performance for operations like filter, join, aggregation, etc.

Vectorization can be enabled in the environment by executing below commands.

set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;

Use ORCFile

Optimized Row Columnar format provides highly efficient ways of storing the hive data by reducing the data storage format by 75% of the original. The ORCFile format is better than the Hive files format when it comes to reading, writing, and processing the data. It uses techniques like predicate push-down, compression, and more to improve the performance of the query.

Consider two tables: employee and employee_details, tables that are stored in a text file. Let's say we will use join to fetch details from both tables.

Select a.EmployeeID, a.EmployeeName, b.Address,b.Designation from Employee a
Join Employee_Details b
On a.EmployeeID=b.EmployeeID;

Above query will take a long time, as the table is stored as text. Converting this table into ORCFile format will significantly reduce the query execution time.

Create Table Employee_ORC (EmployeeID int, EmployeeName varchar(100),Age int)
STORED AS ORC tblproperties("compress.mode"="SNAPPY");

Select * from Employee Insert into Employee_ORC;

Create Table Employee_Details_ORC (EmployeeID int, Address varchar(100)
                                  ,Designation Varchar(100),Salary int)
STORED AS ORC tblproperties("compress.mode"="SNAPPY");

Select * from Employee_Details Insert into Employee_Details_ORC;
Select a.EmployeeID, a.EmployeeName, b.Address,b.Designation from Employee_ORC a
Join Employee_Details_ORC b
On a.EmployeeID=b.EmployeeID;

ORC supports compressed (ZLIB and Snappy), as well as uncompressed storage.

Use Partitioning

With partitioning, data is stored in separate individual folders on HDFS. Instead of querying the whole dataset, it will query partitioned dataset.

Create Temporary Table and Load Data Into Temporary Table

Create Table Employee_Temp(EmloyeeID int, EmployeeName Varchar(100), 
                           Address Varchar(100),State Varchar(100),
                           City Varchar(100),Zipcode Varchar(100))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA INPATH '/home/hadoop/hive' INTO TABLE Employee_Temp;

Create Partitioned Table

Create Table Employee_Part(EmloyeeID int, EmployeeName Varchar(100), 
                           Address Varchar(100),State Varchar(100),
                           Zipcode Varchar(100))
PARTITIONED BY (City Varchar(100))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Enable Dynamic Hive Partition

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

Import Data From Temporary Table To Partitioned Table

Insert Overwrite table Employee_Part Partition(City) Select EmployeeID,
EmployeeName,Address,State,City,Zipcode from Emloyee_Temp;

Use Bucketing

The Hive table is divided into a number of partitions and is called Hive Partition. Hive Partition is further subdivided into clusters or buckets and is called bucketing or clustering.

Create Table Employee_Part(EmloyeeID int, EmployeeName Varchar(100), 
                           Address Varchar(100),State Varchar(100),
                           Zipcode Varchar(100))
PARTITIONED BY (City Varchar(100))
Clustered By (EmployeeID) into 20 Buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Cost-Based Query Optimization

Hive optimizes each query's logical and physical execution plan before submitting for final execution. However, this is not based on the cost of the query during the initial version of Hive.

During later versions of Hive, query has been optimized according to the cost of the query (like which types of join to be performed, how to order joins, the degree of parallelism, etc.).

To use cost-based optimization, set the below parameters at the start of the query.

set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

Summary

Apache Hive is a very powerful tool for analyzing data, and it supports batch and interactive data processing. It is one of the most-used techniques by data analysts and data scientists. It is very important that you know how to improve the performance of query when you are processing petabytes of data.

Now you know how to improve the performance of the Hive query!

Database Data (computing) hadoop

Opinions expressed by DZone contributors are their own.

Related

  • Reporting in Microservices: How To Optimize Performance
  • Why Database Migrations Take Months and How to Speed Them Up
  • Unmasking Entity-Based Data Masking: Best Practices 2025
  • Fixing Common Oracle Database Problems

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!