As stated in the earlier article, Skewed tables are those in which some column values occur more frequently than others. As a result, the distribution is skewed. Hive will automatically separate skewed values into different files and take this into consideration during searches so that it can skip or include whole files if possible; thus enhances the performance.
Let’s practice a little to get a better understanding of it. Here’s the sample data for this exercise.
You can download the sample dataset here.
Step1: Creating a dummy table to store the data
CREATE TABLE SkewTest(YYYYMM INT, UserID INT, Score INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\t’;
Step2: Load and Verify the data
LOAD DATA LOCAL INPATH ‘Desktop/SkewedTablesData.txt’ INTO TABLE skewtest;
SELECT * FROM SkewTest;
Step3: Create the Skew Table
CREATE TABLE SkewTable(
YYYYMM INT,
UserID INT,
Score INT)
SKEWED BY (UserID) on (10022,9999) STORED AS DIRECTORIES;
Step4: Now Insert the data from dummy table
INSERT INTO SkewTable SELECT * FROM SkewTest;
Step5: Verify how the sub-directories are created. You’ll see two subdirectories as shown below in the /user/hive/warehouse directory.
/user/hive/warehouse/skewtable/userid=10022
/user/hive/warehouse/skewtable/userid=9999
I hope you found this post to be informative.
Please enter your email address to receive notifications of new postings.
3 comments