Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Partitioning in Apache Hive

DZone's Guide to

Partitioning in Apache Hive

Hive is great for performing queries on large datasets, but what should you do when you need to filter data on specific column values? Partition!

· Big Data Zone
Free Resource

Need to build an application around your data? Learn more about dataflow programming for rapid development and greater creativity. 

Hive is a good tool for performing queries on large datasets — especially datasets that require full table scans. But quite often, there are instances in which users need to filter the data on specific column values. And that's where partitioning comes into play. A partition is nothing but a directory that contains the chunk of data. When we do partitioning, we create a partition for each unique value of the column.

Let's run a simple example to see what it is. The syntax to create a partition table is:

create table tablename(colname type) partitioned by(colname type);

If hive.exec.dynamic.partition.mode is set to strict, then you need to do at least one static partition. In non-strict mode, all partitions are allowed to be dynamic:

yyy

partition

Here, we created a table named emp info with two fields, name and address. We partitioned the table by column ID of type  nt and then we inserted the value in this table. It’s important to consider the cardinality of the column that will be partitioned on. Selecting a column with high cardinality will result in fragmentation of data. Do not over-partition the data. With too many small partitions, the task of recursively scanning the directories becomes more expensive than a full table scan of the table.

Below is the syntax for inserting values:

insert into partition values();

First, we inserted a record with id=1. Now, we insert another record with id=2.

par

Now, go to the /user/hive/warehouse/default/empinfo directory in your Hadoop Filesystem.

ppp.png

As we can see, there are two partitions: one with name id=1 and one with id=2. Now, when a select query is fired with the where clause, it will not scan the full table but will scan only the required partition.

www.png

If you try it with a non-partitioned table with a large dataset, it will take more time, as it will have to go through entire table scan.

I hope this blog is helpful. Happy coding!

Check out the Exaptive data application Studio. Technology agnostic. No glue code. Use what you know and rely on the community for what you don't. Try the community version.

Topics:
big data ,hadoop ,apache hive ,partitioning

Published at DZone with permission of Anubhav Tarar, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}