DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Because the DevOps movement has redefined engineering responsibilities, SREs now have to become stewards of observability strategy.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Related

  • Resolving Parameter Sensitivity With Parameter Sensitive Plan Optimization in SQL Server 2022
  • Comparing Managed Postgres Options on The Azure Marketplace
  • Useful System Table Queries in Relational Databases
  • Introducing Graph Concepts in Java With Eclipse JNoSQL

Trending

  • Designing a Java Connector for Software Integrations
  • Strategies for Securing E-Commerce Applications
  • When Airflow Tasks Get Stuck in Queued: A Real-World Debugging Story
  • Designing AI Multi-Agent Systems in Java
  1. DZone
  2. Data Engineering
  3. Databases
  4. How to Use Vectorized Reader in Hive

How to Use Vectorized Reader in Hive

I have faced some issues with using the Vectorized Reader in Hive. I've written this blog to help you avoid the confusion I faced.

By 
Anubhav Tarar user avatar
Anubhav Tarar
·
Jul. 24, 17 · Tutorial
Likes (5)
Comment
Save
Tweet
Share
8.9K Views

Join the DZone community and get the full member experience.

Join For Free

The reason for writing this blog is that I tried to use Vectorized Reader in Hive, but faced some problems with its documentation. That's why I've decided to write this blog.

Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. A standard query execution system processes one row at a time. This involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of 1,024 rows at a time. Within the block, each column is stored as a vector (an array of a primitive data type). Simple operations like arithmetic and comparisons are done by quickly iterating through the vectors in a tight loop, with no or very few function calls or conditional branches inside the loop.

To use vectorized query execution, you must store your data in ORC format like so:

set hive.vectorized.execution.enabled = true ;

How to Query

To use vectorized query execution, you must store your data in ORC format. Just follow the below steps:

Start Hive CLI and create an ORC table with some data:

hive> create table vectortable(id int) stored as orc;
                OK
Time taken: 0.487 seconds
hive>set hive.vectorized.execution.enabled = true;

hive> insert into vectortable values(1);

Query ID = hduser_20170713203731_09db3954-246b-4b23-8d34-1d9d7b62965c

Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop)

2017-07-13 20:37:33,237 Stage-1 map = 100%, reduce = 0% Ended Job = job_local722393542_0002

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver. Moving data to: hdfs://localhost:54310/user/hive/warehouse/vectortable/.hive-staging_hive_2017-07-13_20-37-31_172_3262390557269287245-1/-ext-10000

Loading data to table default.vectortable Table default.vectortable stats: [numFiles=1, numRows=1, totalSize=199, rawDataSize=4]

MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 321 HDFS Write: 545 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK Time taken: 2.672 seconds

Now, query the table with the explain command to see whether Have is using vectorized execution or not.

Note: When Fetch is used in the plan instead of Map, it does not vectorize. So, first set hive.fetch.task.conversion=none:

hive> explain select id from vectortable where id>=1;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1

STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: vectortable
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (id >= 1) (type: boolean)
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: int)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Execution mode: vectorized

Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink

Time taken: 0.081 seconds, Fetched: 33 row(s)

As you can see in the explain command, Execution mode: vectorized is printed and is enabled for the query.

Database

Published at DZone with permission of Anubhav Tarar, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Resolving Parameter Sensitivity With Parameter Sensitive Plan Optimization in SQL Server 2022
  • Comparing Managed Postgres Options on The Azure Marketplace
  • Useful System Table Queries in Relational Databases
  • Introducing Graph Concepts in Java With Eclipse JNoSQL

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!