Partitioning and Bucketing in Apache Hive can greatly assist in breaking tabular data collections into more manageable portions.
Hive Partitioning is a method of separating hive table data into several files/directories, which improves efficiency and aids data organization. When a column with a high search query has low cardinality, such as “Country,” “City,” “Brand,” “Make,” “Department,” and so on, partitioning is beneficial.
However, partitioning columns with a high cardinality, such as ProductID, StudentID, Timestamp, and price, should be avoided since it will result in a large number of directories that will be difficult to maintain and will result in bad performance.
Partitions can be split into buckets using the hash function of a column in bucketing. It adds more structure to the data, allowing for more efficient searches. Bucketed tables generate almost evenly distributed data file portions and, as a result, provide efficient sampling and quicker query results.
Bucketing can be done with or without partitioning. The diagrams below demonstrate how bucketing divides the data.
1) Bucketed Table (Without Partitioning)
2) Bucketed Table With Partitioning
The following are the examples to create a bucketed table.
CREATE TABLE emp2(
CLUSTERED BY (deptno) INTO 3 buckets;
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE emp2 SELECT * FROM emp;
CREATE TABLE tblCustomerDetails(
PARTITIONED BY (Country STRING)
CLUSTERED BY (State) SORTED BY (City) INTO 50 BUCKETS;
The table above is divided by country, as you can see. Then there’s the sub-division “State.” Unlike partitioned columns, bucketed columns are part of the table definitions.
Since the LOAD DATA command will not function for the above-mentioned tables, Hive should have a staging table where the INSERT INTO query can be used.
Hope you find this article helpful.
Please subscribe for more interesting updates.