ORDER BY vs SORT BY in Apache Hive

ORDER BY in HiveQL is similar to ORDER BY in SQL language, both in syntax and functionality. Order by is a clause that we use with the “SELECT” statement in Hive queries to sort data in ascending or descending order based on our preferences. The good and bad about ORDER BY is it ensures global ordering, but it does so by passing all data via a single reducer, which is inefficient for big datasets.

On the other hand, “SORT BY” sorts data by reducer, which means that “SORT BY” sorts the rows before passing them to the reducer, and when there are many reducers, “SORT BY” may produce partially sorted final results; additionally, there is a possibility of having a number of sorted files with overlapping ranges because each reducer may receive data from multiple sources.

In short, the distinction between “ORDER BY” and “SORT BY” is that ORDER BY ensures overall order in the output, whilst SORT BY simply assures ordering of the rows inside a reducer.

Please note that SORT BY consumes more resources than ORDER BY.

Syntax for ORDER BY:
SELECT * FROM TableName ORDER BY ColumnName

Example:
SELECT * FROM SampleData ORDER BY RandCol LIMIT 10;

OrderBy

Syntax for SORT BY:
SELECT * FROM TableName SORT BY ColumnName

Example:
SELECT * FROM sampledata SORT BY RandCol LIMIT 10;
Sorting and Limiting in the above example invokes two map reduce jobs.

SortBy-1

Despite the fact that they both appear the same and perform the same purpose (sorting), their functioning is not the same. The same query took 28.723 seconds for ORDER BY and 49.066 seconds for SORT BY, as shown in the screenshots.

Hope you find this article helpful.

Please do subscribe for more interesting updates.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s