Skewed tables are those in which some column values occur more frequently than others. As a result, the distribution is skewed. Hive will automatically separate skewed values into different files and take this into consideration during searches so that it can skip or include whole files if possible; thus enhances the performance.
Look at the below dataset.
Name | Town | Country |
Ampthill | Bedfordshire | England |
Arlesey | Bedfordshire | England |
Bedford | Bedfordshire | England |
Biggleswade | Bedfordshire | England |
Dunstable | Bedfordshire | England |
Flitwick | Bedfordshire | England |
Sandiacre | Derbyshire | England |
Shallcross | Derbyshire | England |
Ashburton | Devon | England |
Axminster | Devon | England |
Newtown | Powys | Wales |
Johnstone | Strathclyde | Scotland |
Assume you have millions, if not billions, of such records. And the data is transferred to you on a daily basis. If you look at the data above, you’ll notice that it can be partitioned by country. However, if you look closely, you will notice that the data you are receiving is higher for the town of ‘Bedfordshire’. If you create partitions based on town and country, you may end up with too many chunks, which is unnecessary because you know for which towns you receive more data. Skew enters the picture here.
While creating the table if you provide the column and the value, Hive will separate the data into different directories so that while querying it can improve the performance.
Syntax:
CREATE TABLE <T> (Schema) skewed by (keys) on (‘value1’, ‘value2’) [STORED as DIRECTORIES];
Example:
CREATE TABLE TownsData(
Name STRING,
Town STRING,
Country String)
SKEWED BY (Town) on (‘Bedfordshire’) [STORED AS DIRECTORIES];
The benefit of using this feature is that the values that appear more frequently than others are separated out into separate files or directories. And the execution engine uses this information during query execution to make processing more efficient. The data is separated and stored into directories if you use STORED AS DIRECTORIES clause while creating the table.
For more examples and practice, please click here.
I hope you found this post to be informative.
Please enter your email address to receive notifications of new postings.
5 comments