Regular Expression Extract (Regexp_Extract) In Apache Hive

Regular expression functions identify precise patterns of characters in the given string and returns the matching text item from the string/data.

Syntax: REGEXP_EXTRACT(STRING subject, STRING pattern, INT index)

Let’s see practically how it works.

Below is the sample records from the dataset:
ReleaseYear MovieTitle
1969 Downhill Racer
1970 M*A*S*H
1970 The Party at Kitty and Stud’s
1970 Lovers and Other Strangers
1970 The Sidelong Glances of a Pigeon Kicker
1970 Hercules in New York
1971 Bananas
1971 Klute
1972 What’s Up, Doc?
1973 No Place to Hide

If you look at the data, there are two columns however it is difficult to convert these into structured format due to the delimitation. It can only be done using regular expressions. Let’s convert this data into columns using REGEXP_EXTRACT function. 

–Creating a table to hold the entire data in a single column.
–Let’s create dummy table to hold the entire data.
CREATE TABLE movies(data STRING);
— Loading the data
LOAD DATA LOCAL INPATH ‘Desktop/movies.csv’ INTO TABLE movies;
–Converting the data into meaningful columns.
SELECT REGEXP_EXTRACT(data, ‘^\\d+’, 0),
              REGEXP_EXTRACT(data, ‘\\s+(\\S.*)’, 1)
FROM movies;

movies_regexp_extract

Let me explain the meaning of regular expression provided for regexp_extract function:

^\\d – matches the digit at the start of the string (^). + symbol represents one or more number of characters. Thus, it captures the first string from the beginning that has any number of digits.

\\s
 – Helps in skipping the space character. + symbol helps in skipping more than one space character.

\\S
– Captures any non-whitespace char. And the dot and asterisk (.*) matches any zero or more chars other than line break chars. Index argument is 1 so that REGEXP_EXTRACT returns the captured group group without the initial whitespace.

With this function, let’s see what else we can do and what else we can add.

SORTING:
SELECT REGEXP_EXTRACT(data,’\\s+(\\S.*)’,1)x FROM movies ORDER BY x LIMIT 20;

sort1
SEARCHING:
Find out the names of the employees whose last-name starts with ‘S’

Data:
Will Smith
Black Smith
Scott James
James Miller
Allen Jackson
Some Name

SELECT REGEXP_EXTRACT(name,'(.*) (S.*)’,0) from someNames;

Result:
Will Smith
Black Smith

Find out the names whose first-name or last-name starts with ‘S’

SELECT REGEXP_EXTRACT(name,'(.*)(S.*)’,0) from someNames;

Result:
Some Name
Will Smith
Black Smith
Scott James


Hope you find this article. Please do follow for more interesting updates.

164 comments

  1. I’m truly enjoying the design and layout of your
    site. It’s a very easy on the eyes which makes it much more pleasant for me to come here and
    visit more often. Did you hire out a developer to create your theme?
    Excellent work!

    Like

  2. I have to thank you for the efforts you have
    put in writing this website. I really hope to view the same high-grade content by you in the future as well.
    In truth, your creative writing abilities has motivated
    me to get my own, personal site now 😉

    Like

  3. Howdy this is kinda of off topic but I was wanting to know if blogs use WYSIWYG editors or if you have to
    manually code with HTML. I’m starting a blog
    soon but have no coding skills so I wanted to get guidance from someone with
    experience. Any help would be enormously appreciated!

    Like

  4. Your style is really unique in comparison to other people I’ve
    read stuff from. Thank you for posting when you have the opportunity, Guess I will just bookmark
    this web site.
    Here is my web site Chandra

    Like

  5. I’m really impressed with your writing talents as well as with the format on your blog.
    Is that this a paid subject matter or did you customize it your self?
    Anyway keep up the nice high quality writing, it’s rare to peer a great weblog like this
    one these days..

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s