Apache Hive Introduction

Apache Hive:
Apache Hive is a distributed, fault-tolerant, open source data warehouse platform built on top of Apache Hadoop. It is used for reading, writing, and managing massive datasets stored directly in HDFS or in other data storage systems such as Apache HBase. Hive queries these datasets by running jobs on Apache Tez or MapReduce.

Designed For:
Hive was designed to allow non-programmers who are familiar with SQL to work with petabytes of data using a SQL-like interface known as HiveQL.

How it works:
Hive translates HiveQL queries into MapReduce or Tez jobs that run on Yet Another Resource Negotiator (YARN), Apache Hadoop’s resource management and job scheduling framework. It queries data stored in a distributed storage system such as the Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (S3).
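
To make the translation step more concrete, here is a minimal sketch in plain Python of how an aggregate query such as SELECT dept, COUNT(*) FROM employees GROUP BY dept could be broken into map, shuffle, and reduce phases. It is a toy, single-process illustration and not Hive's actual query planner; the table name employees, the column dept, the sample rows, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of an "employees" table: (name, dept)
ROWS = [
    ("asha", "sales"),
    ("bo", "engineering"),
    ("chen", "sales"),
    ("dee", "engineering"),
    ("eli", "sales"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _name, dept in rows:
        yield dept, 1


def shuffle_phase(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which gives COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_dept(rows):
    """Run the three phases in order on an in-memory list of rows."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_dept(ROWS))
```

In a real cluster the map and reduce steps would run in parallel on many machines, each working on its own slice of the data, which is what lets this pattern scale to very large tables.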

Storage:
Hive stores its database and table metadata in a metastore, a database-backed or file-backed store that enables easy data abstraction and discovery.
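
The idea behind the metastore can be sketched roughly as follows: the data files hold only raw text, while a separate catalog records each table's columns, types, and location, and that schema is applied when the data is read. This is a toy model in plain Python, not the real metastore API; the METASTORE dictionary, the path /warehouse/employees, and the read_table function are all hypothetical.

```python
# Toy "metastore": table metadata is kept separately from the data itself.
METASTORE = {
    "employees": {
        "location": "/warehouse/employees",  # hypothetical path, not read here
        "delimiter": ",",
        "columns": [("name", str), ("age", int)],
    }
}

# Raw lines as they might sit in a file in distributed storage.
RAW_LINES = ["asha,31", "bo,45"]


def read_table(table, lines, metastore=METASTORE):
    """Apply the schema recorded in the metastore to raw delimited lines."""
    meta = metastore[table]
    names = [name for name, _ in meta["columns"]]
    types = [cast for _, cast in meta["columns"]]
    rows = []
    for line in lines:
        fields = line.split(meta["delimiter"])
        # Pair each raw field with its column name and convert it to its type.
        rows.append({n: t(v) for n, t, v in zip(names, types, fields)})
    return rows


if __name__ == "__main__":
    for row in read_table("employees", RAW_LINES):
        print(row)
```

Because the schema lives in the catalog and not in the files, the same raw data can be described, explored, and queried as a table without being rewritten.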

Traditional RDBMS vs Apache Hive:
Traditional relational databases are designed for interactive queries on small to medium datasets, and they struggle to store or process massive datasets. Hive, on the other hand, processes large datasets efficiently by running batch jobs across a distributed cluster.

Advantages:
Apache Hive allows massive-scale analytics. It provides a central source of data that can be quickly accessed and enables data-driven decisions.

Most of these features are covered in depth on this site, with various examples and datasets. I hope it helps you.

Please do follow this site for more interesting updates.
