Cluster By Clause in Hive

CLUSTER BY and DISTRIBUTE BY are used mainly with the Transform/Map-Reduce Scripts. However, it might be beneficial in SELECT statements if the output of a query has to be partitioned and sorted for subsequent queries. Hive distributes the data among reducers using the columns in CLUSTER BY. Multiple reducers will use the CLUSTER BY columns.

CLUSTER BY is used to bucket the table, which means that the number of partitions in CLUSTER BY will determine how many files are produced and dispersed in memory based on the hash of the column combination values mentioned in CLUSTER BY. When you save these files to disk, they will be redistributed depending on the hash of the combination of values in the CLUSTERED BY columns.

The following example shows how to utilize CLUSTER BY in a SELECT query.

Example:
SELECT deptno, job, COUNT(*) FROM emp
GROUP BY job, deptno
CLUSTER BY deptno;

When this query is run, it will return results to multiple reducers on the back end.

Clusterby-1

Please note that CLUSTER BY is a short-cut for both Distribute By and Sort By.

SELECT deptno, job, COUNT(*) FROM emp

GROUP BY job, deptno
DISTRIBUTE BY deptno
SORT BY deptno;

DistributeBy_SortBy-1

Hope you find this article helpful.

Please subscribe for more interesting updates.

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s