Loading data from many flat files into an RDBMS is a challenging operation. But in Hive, it won’t take long since Apache Hive is designed to make it easy to load, read, write, and manage massive and multiple datasets stored in distributed storage using SQL.
The example below will show you how to load several flat files into a table in Apache Hive.
Let’s consider the below datasets for the practice.
File: USPopulation1.txt
Phyllis L Pike,4961,John Calvin Drive,Chicago,IL,60606
Teri B Smith,3430,Caynor Circle,Branchburg,NJ,08876
Charmaine M Crumble,3731,Cerullo Road,Louisville,KY,40205
Michael S Aguilera,943,Sundown Lane,Austin,TX,78748
File: USPopulation2.txt
Agnes R Lavine,4841,Glory Road,Nashville,TN,37201
Teddy C Fenderson,205,Camden Place,Mount Pleasant,SC,29464
James S Her,2491,Jim Rosa Lane,SAN JOSE,CA,95120
Susan A Yamasaki,1014,Parrill Court,LIMERICK,NY,13657
Let’s store these flat files into HDFS in a specific directory.
[cloudera@quickstart ~]$ hadoop fs -mkdir /user/cloudera/multifiles
The above command creates a directory called “multifiles” in /user/cloudera.
[cloudera@quickstart ~]$
hadoop fs -put Desktop/Docs/USPopulation1.txt /user/cloudera/multifiles
[cloudera@quickstart ~]$
hadoop fs -put Desktop/Docs/USPopulation2.txt /user/cloudera/multifiles
The preceding commands move the flat files from the local system to the “multifiles” directory in /user/cloudera.
[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/multifiles
The above command will list out the files from the specified directory.
Result is:
Found 2 items
-rw-r–r– 1 cloudera cloudera 217 2021-06-10 14:52 /user/cloudera/multifiles/USPopulation1.txt
-rw-r–r– 1 cloudera cloudera 212 2021-06-10 14:52 /user/cloudera/multifiles/USPopulation2.txt
We have now successfully imported the flat files into HDFS using the given method. Now, in Apache Hive, let’s create an external table and connect these files to it.
CREATE EXTERNAL TABLE MultifilesEx(
Name String,
Street Int,
Area String,
City String,
State String,
Zip Int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘,’
LOCATION ‘/user/cloudera/multifiles’
The preceding command not only creates a table but also loads data from the provided location where the flat files were put. It reads data from files automatically, therefore there is no need to import or load any data manually.
Hope you find this article helpful.
Please subscribe to receive notifications on latest updates.