Please see my previous post on bucketing and bucketed tables for more information. Bucketed Sorted Tables will be explored in this post.
As discussed in the previous article, Apache Hive’s partitioning and bucketing features can help divide down tabular data collections into more manageable chunks. In bucketing, partitions can be divided into buckets using the hash function of a column. It gives the data greater structure, making searches more efficient.
Bucketed tables produce nearly uniformly dispersed data file portions, allowing for efficient sampling and faster query responses.
Bucketed Sorted Table:
While the table is bucketed (clustered by) based on a column, data can be sorted in increasing order of the provided column within each bucket. The user can execute efficient sampling on the clustered column. When assessing queries, the sorting attribute allows internal operators to take advantage of the more well-known data structure, which increases efficiency. If any of the columns are lists or maps, the keywords MAP KEYS and COLLECTION ITEMS can be used.
Please refer to the following example for better understanding.
Example-1:
CREATE TABLE emp2(
empno INT,
ename STRING,
job STRING,
mgr INT,
hiredate STRING,
sal DECIMAL(9,2),
comm DECIMAL(9,2),
deptno INT)
CLUSTERED BY (deptno) SORTED BY (empno) INTO 3 buckets;
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE emp2 SELECT * FROM emp;
In the example above, the emp2 table is bucketed (clustered by) deptno and within each bucket the data is sorted by empno.
Example2:
CREATE TABLE tblCustomerDetails(
CustFirstName STRING,
CustLastName STRING,
Address STRING,
City STRING,
State STRING,
Zip STRING,
MobileNumber STRING)
PARTITIONED BY (Country STRING)
CLUSTERED BY (State) SORTED BY (City) INTO 50 BUCKETS;
In the example above, the “tblCustomerDetails” table is bucketed (clustered by) “State” and within each bucket the data is sorted by “City”.
I hope you found this article to be both informative and useful.
Please continue to follow us for more interesting information.