Performance Tuning in Hive

Performance tuning is the process of ensuring that an application’s SQL queries execute as quickly as feasible. The procedures may or may not differ from those used in typical RDBMS applications, but the end aim is that every developer and analyst should be aware of the tuning process in order to save time, storage and decrease resource use.

1) File Formats:
There are several file formats supported by Hive that gives the best compression ratio such as Sequence, Row columnar, ORC, Parquet, etc. Choosing the right file format can provide substantial advantages, such as faster read and write speeds. Read: File Formats and Compressions in Hive.

2) Enable Compression:
Compression properties can be enabled in Hive-shell before creating and loading the data into the table. Enabling compression in Hive can increase query execution performance while also saving storage space on the HDFS cluster.

3) Partitioning:
Partitioning is a method of separating a table’s data into different pieces based on values, making data retrieval faster by allowing us to choose the partitioned data we need. Read: Partitioning in Hive.

4) Indexing:
Indexes are specific lookups that can be used by the database search engine to speed up data retrieval. Indexes help to speed up query processing and search operations. Read: Indexing in Hive. Read: Indexes in Apache Hive.

5) Bucketing:
Bucketing is another way of splitting table data or partitioned data into more manageable pieces, which is done using the hash function of a table column to provide the data more structure that may be utilized for more efficient searches. Read: Bucketing in Hive.

6) Sampling Techniques:
Data sampling is the best practice to understand the data patterns and trends of large datasets by looking at the smaller portion of the data instead of retrieving the large dataset. The main advantages of sampling are lower cost and faster data collection than measuring the entire huge dataset. Read: Data Sampling Techniques in Hive.

7) Joins:
Wherever feasible, avoid sub-queries and instead utilize Joins. Because the retrieval time of a join query is always faster than the retrieval time of a subquery. Joins allow you to avoid running numerous queries against the tables. Read: Joins in HiveQL.

8) Auto-Map Joins & Skew Joins & Bucketed Map Joins:
If you need to join a large table with a small table, you may use Auto Map-Join, which saves the small table in the local cache on each node and joins it with the larger table during the Map phase. A skew join is used when there is a table with skew data in the joining column. We can use bucketed map join to increase speed if tables are bucketed by a certain column and these tables are utilized in joins. These can be enabled using SET commands in Hive or setting the properties in hive-site.xml file.

9) Execution Engine
The command “set hive.execution.engine=tez” switches the execution engine from MapReduce to Tez, which provides faster Hive query processing.

Apart from the above, there are several approaches to enhance the performance such as enabling Vectorization, Cost Based Optimization and Parallel execution depending on the work-load, etc.