Cluster By vs Clustered By in Hive

Clustered by and Sorted by dividing the keys into several buckets and then sorting the buckets. Cluster by guarantees that each of the N reducers has non-overlapping ranges, then sorts the reducers by those ranges. CLUSTERED BY is is used in CREATE TABLE statements, whereas CLUSTER BY is utilized in SELECT statements.

Let’s take a closer look at each one.

CLUSTER BY:

CLUSTER BY and DISTRIBUTE BY are used mainly with the Transform/Map-Reduce Scripts. However, it might be beneficial in SELECT statements if the output of a query has to be partitioned and sorted for subsequent queries. Hive distributes the data among reducers using the columns in CLUSTER BY. Multiple reducers will use the CLUSTER BY columns.

CLUSTER BY is used to bucket the table, which means that the number of partitions in CLUSTER BY will determine how many files are produced and dispersed in memory based on the hash of the column combination values mentioned in CLUSTER BY. When you save these files to disk, they will be redistributed depending on the hash of the combination of values in the CLUSTERED BY columns.

The following example shows how to utilize CLUSTER BY in a SELECT query.

Example:
SELECT deptno, job, COUNT(*) FROM emp
GROUP BY job, deptno
CLUSTER BY deptno;

When this query is run, it will return results to multiple reducers on the back end.

CLUSTERED BY:
The CLUSTERED BY clause in the CREATE TABLE statement helps in the creation of bucketed tables, which is a useful approach for splitting table data sets into smaller chunks.

Example:
CREATE TABLE emp2(
empno INT,
ename STRING,
job STRING,
mgr INT,
hiredate STRING,
sal DECIMAL(9,2),
comm DECIMAL(9,2),
deptno INT)
CLUSTERED BY (deptno) INTO 3 buckets;