Hive is a good tool for performing queries on large datasets — especially datasets that require full table scans. But quite often, there are instances in which users need to filter the data on specific column values. And that's where partitioning comes into play. A partition is nothing but a directory that contains the chunk of data. When we do partitioning, we create a partition for each unique value of the column.
Let's run a simple example to see what it is. The syntax to create a partition table is:
create table tablename(colname type) partitioned by(colname type);
hive.exec.dynamic.partition.mode is set to
strict, then you need to do at least one static partition. In non-strict mode, all partitions are allowed to be dynamic:
Here, we created a table named
emp info with two fields, name and address. We partitioned the table by column ID of type
nt and then we inserted the value in this table. It’s important to consider the cardinality of the column that will be partitioned on. Selecting a column with high cardinality will result in fragmentation of data. Do not over-partition the data. With too many small partitions, the task of recursively scanning the directories becomes more expensive than a full table scan of the table.
Below is the syntax for inserting values:
insert into partition values();
First, we inserted a record with id=1. Now, we insert another record with id=2.
Now, go to the
/user/hive/warehouse/default/empinfo directory in your Hadoop Filesystem.
As we can see, there are two partitions: one with name id=1 and one with id=2. Now, when a select query is fired with the
where clause, it will not scan the full table but will scan only the required partition.
If you try it with a non-partitioned table with a large dataset, it will take more time, as it will have to go through entire table scan.
I hope this blog is helpful. Happy coding!