Apache Hive Data Model

Apache Hive is built on top of Apache Hadoop, which is a distributed, fault-tolerant, and open source data warehouse platform for reading, writing, and handling massive datasets stored directly in HDFS or other data management structures such as Apache HBase. Hive is characterized by the ability to query massive datasets using Apache Tez or MapReduce.

Hive was designed to allow non-programmers who are familiar with SQL to work with petabytes of data using a SQL-like interface known as HiveQL.

Apache Hive Data Model:
Data in Hive organized into

  • Tables
  • Partitions
  • Buckets (also known as Clusters)

Tables:
Tables in Hive are similar to spreadsheets in that data is logically structured in a row-and-column manner. Each column represents a field in the record, and each row represents a distinct record.

hive data model

Tables can be classified into-

  • Internal Tables / Managed Tables
  • External Tables
  • Temporary Tables

Internal Table:
As mentioned in the previous post, when the data is temporary or if you want Hive to control the life cycle of the table and data, internal tables will be created. In internal tables, data and metadata are kept within the Hive warehouse by default. Prior to dropping some internal table, one must be careful as it would erase the data along with the metadata.

External Table:

When the data isn’t transient and you don’t want Hive to manage the table’s and data’s life cycle, you’ll consider creating external tables, as discussed in the previous article. As stated, Hive does not own the data; instead, a table would be built on top of the data that is stored elsewhere. The DROP TABLE statement would only remove the table from Hive, but the data will remain in HDFS.

Temporary Table:

A temporary table in Apache Hive, as the name indicates, is a table that works like a regular table but exists only temporarily in Hive’s temp space. It will remain till the user’s session expires. They are most often used to save interim results from batch processing that do not need to be saved permanently.

Partitions:
Partitioning is a method of organizing tables by dividing them into smaller portions based on partition keys, which are fundamental factors in defining how data is kept in the database.

Partitioning has the advantage of distributing execution load horizontally. There is a significant increase in performance while data retrieval since searching in data chunks is much faster than searching in the entire table.

Partitions can be classified into-

  • Static Partitions
  • Dynamic Partitions

Static Partition:
The technique of putting individual input data files into a partition table is known as static partitioning. In other words, you create a partition in the database and transfer the file into it manually. These static partitions are commonly used for importing large files into Hive tables since they minimize loading time.

Dynamic Partition:
Dynamic partitioning refers to the amount of manual intervention necessary to load substantial amounts of data into a partition table is low, or when a single insert statement divides the data into partitions. Dynamic partitioning often loads data from a non-partitioned table.

Although dynamic partitions take longer to load data than static partitions, they are the ideal solution when you have a significant amount of data stored in a table. Apart from this, if you wish to split a number of columns but don’t know how many, dynamic partitioning is an option.

Bucketing:
Bucketing divides data into smaller, more manageable chunks. Bucketing in hive, in other words, is the concept of breaking data down into ranges, or buckets, to give it greater structure and allow it to be used for more efficient queries.

Bucketing can be with or without partitioning.

Read : “Interanal Tables in Hive
Read : “External Tables in Hive
Read : “Differences between Internal and External Tables
Read : “Partitioning in Hive
Read : “Bucketing in Hive
Read : “Partitioned, Bucketed and Skewed Tables

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s