Bucketing In Apache Hive

Partitioning and Bucketing in Apache Hive can greatly assist in breaking tabular data collections into more manageable portions.

Hive Partitioning is a method of separating hive table data into several files/directories, which improves efficiency and aids data organization. When a column with a high search query has low cardinality, such as “Country,” “City,” “Brand,” “Make,” “Department,” and so on, partitioning is beneficial.

However, partitioning columns with a high cardinality, such as ProductID, StudentID, Timestamp, and price, should be avoided since it will result in a large number of directories that will be difficult to maintain and will result in bad performance.

Partitions can be split into buckets using the hash function of a column in bucketing. It adds more structure to the data, allowing for more efficient searches. Bucketed tables generate almost evenly distributed data file portions and, as a result, provide efficient sampling and quicker query results.

Bucketing can be done with or without partitioning. The diagrams below demonstrate how bucketing divides the data.

1) Bucketed Table (Without Partitioning)

Bucketing_Without_Partitioning

2) Bucketed Table With Partitioning

Bucketing_With_Partitioning

The following are the examples to create a bucketed table.

Example-1:
CREATE TABLE emp2(
  empno INT,
  ename STRING,
  job STRING,
  mgr INT,
  hiredate STRING,
  sal DECIMAL(9,2),
  comm DECIMAL(9,2),
  deptno INT)
CLUSTERED BY (deptno) INTO 3 buckets;

SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE emp2 SELECT * FROM emp;

Clusteredby

Example2:
CREATE TABLE tblCustomerDetails(
CustFirstName STRING,
CustLastName STRING,
Address STRING,
City STRING,
State STRING,
Zip STRING,
MobileNumber STRING)
PARTITIONED BY (Country STRING)
CLUSTERED BY (State) SORTED BY (City) INTO 50 BUCKETS;

The table above is divided by country, as you can see. Then there’s the sub-division “State.” Unlike partitioned columns, bucketed columns are part of the table definitions.

Since the LOAD DATA command will not function for the above-mentioned tables, Hive should have a staging table where the INSERT INTO query can be used.

Hope you find this article helpful.

Please subscribe for more interesting updates.