This is a collection of Apache Hive interview questions and answers. I do not claim full ownership of the material, as most of the questions and answers were gathered from various websites.
What is the difference between Pig and Hive?
Hive is typically used for structured data while Apache Pig is typically used for semi-structured data.
While the schema is not necessary for Apache Pig, it is necessary for Hive.
While Apache Pig uses a procedural data flow language, Hive uses a declarative language that adheres to the SQL dialect.
While Hive is primarily used for reporting, Apache Pig is primarily used for programming.
Hive typically runs on the server side of a Hadoop cluster, whereas Apache Pig runs on the client side.
When to use Pig Latin and HiveQL?
Pig and Hive are two major components of the Hadoop ecosystem that simplify writing complex Java MapReduce jobs. However, many developers are unsure about when to use each. The answer largely mirrors the comparison above.
- Hive is used mainly by data analysts whereas Pig is generally used by researchers and programmers.
- Hive is used for completely structured data whereas Pig is used for semi-structured data.
- Hive has a declarative ‘SQL’ish language (HiveQL) whereas Pig Hadoop Component has a procedural data flow language (Pig Latin)
- Hive Hadoop Component is mainly used for creating reports whereas Pig Hadoop Component is mainly used for programming.
- Hive Hadoop Component operates on the server side of any cluster whereas Pig Hadoop Component operates on the client side of any cluster.
- Hive Hadoop Component can be used for ETL, but Pig Hadoop Component is generally the stronger ETL tool for big data because of its powerful transformation and processing capabilities.
- Hive can start an optional Thrift-based server that lets remote clients submit queries directly to Hive for execution; Pig has no equivalent feature.
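The declarative-versus-procedural contrast above can be seen in a small example (a sketch; `logs` and its `status` column are hypothetical names). The roughly equivalent Pig Latin is shown as comments alongside the HiveQL:

```sql
-- HiveQL: declare the result you want in a single SQL-like statement.
SELECT status, COUNT(*) AS hits
FROM logs
GROUP BY status;

-- The equivalent Pig Latin spells out the data flow step by step:
--   raw     = LOAD 'logs' AS (status:chararray);
--   grouped = GROUP raw BY status;
--   hits    = FOREACH grouped GENERATE group, COUNT(raw);
--   DUMP hits;
```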
List a few commonly used Hive services.
- Command Line Interface (cli)
- Hive Web Interface (hwi)
- HiveServer (hiveserver)
- RCFile reader (rcfilecat), a tool for printing the contents of an RC file
How can you monitor all the open and aborted transactions in the system, along with the transaction ID and the transaction state?
Administrators can monitor Hive transactions using the ‘SHOW TRANSACTIONS’ command, which is supported in Hive version 0.13.0 and above.
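A minimal sketch of the command, assuming a Hive 0.13.0+ installation with ACID transactions enabled (i.e. hive.txn.manager set to the DbTxnManager):

```sql
-- Lists all open and aborted transactions known to the metastore.
-- Output includes the transaction ID, state (e.g. OPEN or ABORTED),
-- the user who started the transaction, and the originating host.
SHOW TRANSACTIONS;
```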
What is the use of HCatalog?
HCatalog can be used to share data structures with external systems. It exposes the Hive metastore to users of other Hadoop tools (such as Pig and MapReduce) so that they can read data from and write data to Hive’s data warehouse.
Where is table data stored in Apache Hive by default?
By default, Hive stores table data under the HDFS directory /user/hive/warehouse. This location is controlled by the hive.metastore.warehouse.dir property in hive-site.xml.
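A quick way to verify the warehouse location from the Hive shell (a sketch; `employees` is a hypothetical table name used for illustration):

```sql
-- Prints the currently configured warehouse path.
SET hive.metastore.warehouse.dir;

-- The 'Location:' field in the output shows where this table's data lives.
DESCRIBE FORMATTED employees;
```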
Explain the difference between partitioning and bucketing.
- Partitioning and bucketing of tables are both done to improve query performance. Partitioning speeds up queries only when the query filters on the partitioning scheme, i.e. by timestamp ranges, by location, etc. Bucketing yields no benefit by default: queries must join or sample on the bucketed columns to take advantage of it.
- Partitioning helps eliminate data when the partition columns are used in the WHERE clause. Bucketing organizes the data within each partition into a fixed number of files, so the same set of column values always lands in the same bucket. Bucketing also speeds up joins on the bucketed columns.
- In the partitioning technique, a partition is created for every unique value of the column and there could be a situation where several tiny partitions may have to be created. However, with bucketing, one can limit it to a specific number and the data can then be decomposed in those buckets.
- In simpler words, in Hive, a partition is essentially a directory, whereas a bucket is a file.
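The points above can be sketched in DDL (a sketch; `page_views` and its columns are hypothetical names). Each distinct `dt` value becomes a directory, and within it `user_id` values are hashed into 32 bucket files:

```sql
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING)          -- one directory per dt value
CLUSTERED BY (user_id) INTO 32 BUCKETS;  -- 32 files per partition
```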
Explain the different types of partitioning that can be done in Hive.
Partitioning in Hive helps trim the data when executing the queries to speed up the processing. Partitions are created when data is inserted into the table.
In static partitions, the name of the partition is hardcoded into the insert statement whereas, in a dynamic partition, Hive automatically identifies the partition based on the value of the partition field.
Whether a static or dynamic partition should be chosen depends on how data is loaded into the table, the data requirements, and the format in which data is produced at the source. With dynamic partitions, the complete data in the file is read and split across partitions by a MapReduce job based on a particular field in the file. Dynamic partitions are usually helpful in ETL flows in the data pipeline.
In a nutshell, when loading data from huge files, static partitions are preferred over dynamic partitions as they save time in loading data. The partition is added to the table and then the file is moved into the static partition. The partition column value can be obtained from the file name without having to read the complete file.
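The two styles can be sketched as follows (`sales` and `staging_sales` are hypothetical table names used for illustration):

```sql
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (sale_date STRING);

-- Static partition: the partition value is hardcoded in the statement.
INSERT OVERWRITE TABLE sales PARTITION (sale_date = '2024-01-01')
SELECT id, amount FROM staging_sales WHERE dt = '2024-01-01';

-- Dynamic partition: Hive derives the partition value from the last
-- column of the SELECT; dynamic partitioning must be enabled first.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE sales PARTITION (sale_date)
SELECT id, amount, dt AS sale_date FROM staging_sales;
```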
When executing Hive queries in different directories, why is metastore_db created in all places from where Hive is launched?
When running Hive in embedded mode, it creates a local metastore. When the user runs a query, Hive first checks whether a metastore already exists. The property javax.jdo.option.ConnectionURL defined in hive-site.xml has the default value jdbc:derby:;databaseName=metastore_db;create=true.
This value means that embedded Derby is used as the Hive metastore and that the metastore location is metastore_db, which is created only if it does not already exist. Because metastore_db is a relative path, it gets created in every directory from which you launch Hive. This property can be changed in hive-site.xml to an absolute path so that a single metastore location is used instead of a metastore_db subdirectory being created in multiple places.
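For example, the embedded Derby metastore can be pointed at an absolute path in hive-site.xml (a config sketch; /opt/hive/metastore_db is an arbitrary example path):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/opt/hive/metastore_db;create=true</value>
  <description>Absolute path so one metastore_db is used regardless of
    the directory from which Hive is launched.</description>
</property>
```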