Apache Hive – Skewed Tables

Skewed tables are those in which some column values occur more frequently than others. As a result, the distribution is skewed. Hive will automatically separate skewed values into different files and take this into consideration during searches so that it can skip or include whole files if possible; thus enhances the performance.

Look at the below dataset.

NameTownCountry
AmpthillBedfordshireEngland
ArleseyBedfordshireEngland
BedfordBedfordshireEngland
BiggleswadeBedfordshireEngland
DunstableBedfordshireEngland
FlitwickBedfordshireEngland
SandiacreDerbyshireEngland
ShallcrossDerbyshireEngland
AshburtonDevonEngland
AxminsterDevonEngland
NewtownPowysWales
JohnstoneStrathclydeScotland

Assume you have millions, if not billions, of such records. And the data is transferred to you on a daily basis. If you look at the data above, you’ll notice that it can be partitioned by country. However, if you look closely, you will notice that the data you are receiving is higher for the town of ‘Bedfordshire’. If you create partitions based on town and country, you may end up with too many chunks, which is unnecessary because you know for which towns you receive more data. Skew enters the picture here.

While creating the table if you provide the column and the value, Hive will separate the data into different directories so that while querying it can improve the performance.

Syntax:
CREATE TABLE <T> (Schema) skewed by (keys) on (‘value1’, ‘value2’) [STORED as DIRECTORIES];

Example:
CREATE TABLE TownsData(
Name STRING,
Town STRING,
Country String)
SKEWED BY (Town) on (‘Bedfordshire’) [STORED AS DIRECTORIES];

The benefit of using this feature is that the values that appear more frequently than others are separated out into separate files or directories. And the execution engine uses this information during query execution to make processing more efficient. The data is separated and stored into directories if you use STORED AS DIRECTORIES clause while creating the table.

For more examples and practice, please click here.

I hope you found this post to be informative.

Please enter your email address to receive notifications of new postings.

5 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s