Using XPath to read XML in Hive

XPath is used to navigate through elements and attributes in an XML document. XPath Short, XPath Int, XPath Long, XPath Float, XPath Double, XPath Number, and XPath String are only a few of the XPath functions available in Apache Hive. Though each of these functions produces a different result, they all operate in the same way. All is based on the details in the XML text. No XML validation is performed however malformed xml returns an error.

Let’s do some exercises to understand how XPath can be used.

Here’s my sample XML data.
<CustInfo>
<CustName>Melanie Pike</CustName>
<CustPhone>00971305374745</CustPhone>
</CustInfo>
<CustInfo>
<CustName>Rose Moore</CustName>
<CustPhone>00971305374746</CustPhone>
</CustInfo>
<CustInfo>
<CustName>Dalillah</CustName>
<CustPhone>00971305374747</CustPhone>
</CustInfo>

My text will look like this-

XML_SampleData
Let’s create a table in Apache Hive and load this data into it.

–Creating the test table to hold the xml document in a single column.
CREATE TABLE TestXML (line STRING);

–Importing the data into the table.
LOAD DATA LOCAL INPATH ‘Desktop/Docs/xmltest.csv’ INTO TABLE TestXML;

Now, let’s use XPath to retrieve needed information.

SELECT XPATH(line, ‘/CustInfo/CustName/text()’) as CustName,
              XPATH(line, ‘/CustInfo/CustPhone/text()’) as CustPhone
FROM TestXML;

xpath_xml_scrshot

This way, XPath will help us retrieving the necessary information from the XML document.

Hope you will find this article.

Please do subscribe to receive the notifications of the latest posts.

One comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s