Usage of Collect_Set Function in Apache Hive

Collect_Set is one of the aggregate functions that help in return a set of objects with duplicate elements eliminated. This will help in bringing the column values into a single row. All values will be displayed in an array format.

The datasets on which we are about to practice the aggregate function “Collect_Set” are shown below.

Dataset1:
Table: tbSalesData
tbSalesData_dataset_aggr

Dataset2:
Table: StudentMarks

Let’s begin the exercises:

SELECT COLLECT_SET(city) from tbSalesData;

SELECT COLLECT_SET(id), COLLECT_SET(city) from tbSalesData;

It can be used along with other aggregate functions such as SUM, AVG, MAX, MIN, etc.

SELECT COLLECT_SET(city), SUM(amount) FROM tbSalesData;

SELECT
           StudentName,
           COLLECT_SET(Subject1),
           COLLECT_SET(Subject2),
           COLLECT_SET(Subject3),
           SUM(Subject1)+SUM(Subject2)+SUM(subject3) from StudentMarks
GROUP BY StudentName;

Hope you liked this post.

Please do follow for more interesting updates.

2 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s