Bucketing In Apache Hive

Partitioning and Bucketing in Apache Hive can greatly assist in breaking tabular data collections into more manageable portions. 

Hive Partitioning is a method of separating hive table data into several files/directories, which improves efficiency and aids data organization. When a column with a high search query has low cardinality, such as “Country,” “City,” “Brand,” “Make,” “Department,” and so on, partitioning is beneficial.

However, partitioning columns with a high cardinality, such as ProductID, StudentID, Timestamp, and price, should be avoided since it will result in a large number of directories that will be difficult to maintain and will result in bad performance.

Partitions can be split into buckets using the hash function of a column in bucketing. It adds more structure to the data, allowing for more efficient searches. Bucketed tables generate almost evenly distributed data file portions and, as a result, provide efficient sampling and quicker query results.

Bucketing can be done with or without partitioning. The diagrams below demonstrate how bucketing divides the data.

1) Bucketed Table (Without Partitioning)

Bucketing_Without_Partitioning

2) Bucketed Table With Partitioning

Bucketing_With_Partitioning

The following are the examples to create a bucketed table.

Example-1:

CREATE TABLE emp2(
          empno INT,
          ename STRING,
          job STRING,
          mgr INT,
          hiredate STRING,
          sal DECIMAL(9,2),
          comm DECIMAL(9,2),
          deptno INT)
CLUSTERED BY (deptno) INTO 3 buckets;

SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE emp2 SELECT * FROM emp;

Clusteredby

Example2: 
CREATE TABLE tblCustomerDetails(
                CustFirstName  STRING,
                CustLastName   STRING,
                Address                STRING,
                City                        STRING,
                State                      STRING,
                Zip                          STRING,
                MobileNumber   STRING)
      PARTITIONED BY (Country STRING)
      CLUSTERED BY (State) SORTED BY (City) INTO 50 BUCKETS;

The table above is divided by country, as you can see. Then there’s the sub-division “State.” Unlike partitioned columns, bucketed columns are part of the table definitions. 

Since the LOAD DATA command will not function for the above-mentioned tables, Hive should have a staging table where the INSERT INTO query can be used.

Hope you find this article helpful.

Please subscribe for more interesting updates.

6 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s