Skewed tables are tables in which a few column values occur far more often than the rest, so the data distribution is skewed. When the heavily skewed values are declared on the table, Hive splits the rows for those values into separate files (or separate directories, when list bucketing is used) and takes this into account at query time, so that it can skip or include whole files where possible, which improves performance.
In this post, we'll look at how to use this feature to improve performance for tables with skewed values in one or more columns.
Let's start by looking at how to create a table with several skewed values in a single column.
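-- The skewed column must be one of the table's columns (here: key).
-- STORED AS DIRECTORIES is optional; it turns on list bucketing,
-- which places each skewed value in its own subdirectory.
CREATE TABLE tbl_skew_single (key STRING, value STRING)
SKEWED BY (key) ON ('120', '130', '140') STORED AS DIRECTORIES;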
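To check that the skew definition was recorded, and to see the kind of query it is meant to help, something like the following could be run. This is only a sketch: it assumes tbl_skew_single was created with key as the skewed column, and the filter value '120' is simply one of the skewed values listed above.

```sql
-- Show the table metadata; the detailed output should include
-- the skewed columns and skewed values declared on the table.
DESCRIBE FORMATTED tbl_skew_single;

-- A lookup on one of the declared skewed values. This is the kind of
-- filter for which Hive can consider only the files or directory
-- holding that value instead of scanning the whole table.
SELECT key, value
FROM tbl_skew_single
WHERE key = '120';
```

Whether files are actually skipped for a given query depends on the Hive version and configuration, so treat this as an illustration rather than a guaranteed speed-up.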
And here is an example of a table with two skewed columns.
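-- Each entry in the ON list is a (col1, col2) pair of skewed values.
-- STORED AS DIRECTORIES is optional, as in the single-column case.
CREATE TABLE tbl_skew_multiple (col1 STRING, col2 INT, col3 STRING)
SKEWED BY (col1, col2) ON (('v1', 1), ('v2', 2), ('v3', 3), ('v4', 4)) STORED AS DIRECTORIES;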
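Skew information does not have to be fixed at creation time; it can also be changed later with ALTER TABLE. The statements below are a sketch under assumptions: tbl_skew_multiple already exists as defined above, and the pair ('v5', 5) is a made-up value used only for illustration.

```sql
-- Filter on a declared (col1, col2) pair of skewed values.
SELECT col3
FROM tbl_skew_multiple
WHERE col1 = 'v1' AND col2 = 1;

-- Redefine which value pairs are treated as skewed.
-- ('v5', 5) is a hypothetical pair, not taken from real data.
ALTER TABLE tbl_skew_multiple
SKEWED BY (col1, col2) ON (('v1', 1), ('v5', 5));

-- Turn skew handling off again for the table.
ALTER TABLE tbl_skew_multiple NOT SKEWED;
```

Changes made with ALTER TABLE generally apply to data written afterwards, so existing files are not rearranged by these statements.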
I hope you found this post to be informative.