Input File Format Constraints in Hive

The goal of this article is to discuss the constraints on using the various file formats and compression codecs available in Apache Hive for different kinds of data. These are not hard limitations so much as points that need a workaround, and the article describes how to deal with them.

As discussed earlier, HiveQL handles structured data only, much like SQL. Hive keeps its metadata (table definitions, column types, storage locations) in a metastore, which uses an embedded Derby database by default. The table data itself is stored as files in the backend storage layer, typically HDFS, and Hive presents it in a structured, tabular form when it is queried. Hive can handle several file formats: Text file, Sequence File, RC, Avro, ORC and Parquet.
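As a minimal sketch of how the file format is chosen, the HiveQL below declares the format with the STORED AS clause and then inspects the table. The table and column names (events_text, events_parquet, id, payload) are hypothetical and only serve as an illustration.

```sql
-- Hypothetical tables: same columns, different storage formats.
CREATE TABLE IF NOT EXISTS events_text (
  id      INT,
  payload STRING
)
STORED AS TEXTFILE;

CREATE TABLE IF NOT EXISTS events_parquet (
  id      INT,
  payload STRING
)
STORED AS PARQUET;

-- Shows, among other details, the table's storage location and the
-- input/output format classes Hive uses to read and write its files.
DESCRIBE FORMATTED events_parquet;
```

The other formats are selected the same way, with SEQUENCEFILE, RCFILE, ORC or AVRO in place of PARQUET, provided the Hive version in use supports that keyword.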

You can load a plain data file from HDFS or the local file system directly into a table stored as TEXTFILE. The same does not work for tables stored as Sequence File, RC, ORC or Parquet. LOAD DATA only copies or moves the file into the table's directory; it does not convert the file into the table's format or compress it. A raw text file placed in such a table therefore cannot be read back correctly (only files that are already in the table's own format can be loaded this way). As a result, we must first create a staging table that stores data in TEXTFILE format and load the raw file into it. After that, we can SELECT from the staging table and INSERT into the table that has been set up as SEQUENCE FILE, RC, ORC or Parquet. Hive writes the rows in the target format, applying any configured compression, during that INSERT.
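The two-step load can be sketched in HiveQL as follows. The table names, columns, delimiter and the local path /tmp/employees.csv are all assumptions made for the example, not values from a real system.

```sql
-- Step 1: staging table in TEXTFILE format (hypothetical schema).
CREATE TABLE IF NOT EXISTS employees_staging (
  id     INT,
  name   STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load the raw comma-separated file as-is; no conversion happens here.
-- The path is a placeholder for wherever the input file lives.
LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees_staging;

-- Step 2: target table in the desired format (ORC in this sketch).
CREATE TABLE IF NOT EXISTS employees_orc (
  id     INT,
  name   STRING,
  salary DOUBLE
)
STORED AS ORC;

-- Hive reads the text rows and rewrites them as ORC files.
INSERT OVERWRITE TABLE employees_orc
SELECT id, name, salary
FROM employees_staging;
```

To load from HDFS instead of the local file system, drop the LOCAL keyword; in that case the source file is moved into the table's directory rather than copied.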

The format of the input file is not chosen at random. It all depends on what we want to achieve in the end, because each file format has advantages and disadvantages. Read and write times can be slower or faster. Some compressed files cannot be split, while splittable formats let us read only a portion of a file rather than the complete file. Some formats support schema evolution, which allows us to alter the fields in a dataset. Some offer advanced compression support, which allows columnar files to be compressed with a compression codec without compromising these features.
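As a hedged illustration of two of these trade-offs, the sketch below sets a compression codec on an ORC table through a table property and then adds a column afterwards. The table and column names are hypothetical, and SNAPPY is just one of the codecs ORC accepts.

```sql
-- Hypothetical ORC table compressed with Snappy via a table property.
CREATE TABLE IF NOT EXISTS sales_orc (
  order_id INT,
  amount   DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Schema evolution: append a new column to the table definition.
-- Rows written before this change return NULL for the new column.
ALTER TABLE sales_orc ADD COLUMNS (region STRING);
```

How far schema changes can go (renaming, reordering or retyping columns) depends on the file format and the Hive version, so it is worth checking before relying on anything beyond appending columns.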

Hope you find this article helpful.

