CLUSTER BY and DISTRIBUTE BY are used mainly with the Transform/Map-Reduce Scripts. However, it might be beneficial in SELECT statements if the output of a query has to be partitioned and sorted for subsequent queries. Hive distributes the data among reducers using the columns in CLUSTER BY. Multiple reducers will use the CLUSTER BY columns.
CLUSTER BY is used to bucket the table, which means that the number of partitions in CLUSTER BY will determine how many files are produced and dispersed in memory based on the hash of the column combination values mentioned in CLUSTER BY. When you save these files to disk, they will be redistributed depending on the hash of the combination of values in the CLUSTERED BY columns.
The following example shows how to utilize CLUSTER BY in a SELECT query.
Example:
SELECT deptno, job, COUNT(*) FROM emp
GROUP BY job, deptno
CLUSTER BY deptno;
When this query is run, it will return results to multiple reducers on the back end.
Please note that CLUSTER BY is a short-cut for both Distribute By and Sort By.
SELECT deptno, job, COUNT(*) FROM emp
GROUP BY job, deptno
DISTRIBUTE BY deptno
SORT BY deptno;
Hope you find this article helpful.
Please subscribe for more interesting updates.
One comment