Apache Impala Architecture

Impala is a massively parallel query engine that runs on hundreds of servers in existing Hadoop clusters. Unlike traditional relational database management systems, where query processing and the underlying storage engine are tightly coupled, Impala's query engine is decoupled from the storage layer.

Impala improves the efficiency of SQL queries on Apache Hadoop while maintaining a familiar user interface. Impala allows you to query data in real time, whether it’s stored in HDFS or Apache HBase, using SELECT, JOIN, and aggregate functions. Impala also shares Apache Hive’s metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax), making it a familiar and coherent platform for batch and real-time queries. (As a result, Hive users can use Impala with minimal setup time.)

Impala Architecture
There are three services in an Impala deployment.

The Impala daemon (impalad) service has two roles: it receives client queries and manages their execution across the cluster, and it executes individual query fragments on behalf of other Impala daemons. When an Impala daemon manages query execution in the first role, it is referred to as the query’s coordinator. All Impala daemons are symmetric, however, so any of them can play either role. This property helps with load balancing and fault tolerance.
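The two roles can be illustrated with a small toy model. This is only a sketch: the `Daemon` class, its method names, and the round-robin distribution of fragments are assumptions made for illustration and do not reflect Impala's actual scheduler or API.

```python
from dataclasses import dataclass, field


@dataclass
class Daemon:
    """Toy stand-in for one impalad process (hypothetical, not Impala's API)."""
    name: str
    executed: list = field(default_factory=list)

    def execute_fragment(self, fragment: str) -> str:
        # Every daemon can run fragments on behalf of a coordinator.
        self.executed.append(fragment)
        return f"{fragment}@{self.name}"

    def coordinate(self, fragments: list, cluster: list) -> list:
        # The daemon that receives the client query acts as coordinator:
        # it hands fragments out round-robin (an assumed policy) across the
        # cluster, itself included, and collects the results.
        results = []
        for i, fragment in enumerate(fragments):
            worker = cluster[i % len(cluster)]
            results.append(worker.execute_fragment(fragment))
        return results


cluster = [Daemon("impalad-1"), Daemon("impalad-2"), Daemon("impalad-3")]
fragments = ["scan-a", "scan-b", "scan-c", "aggregate"]

# Because the daemons are symmetric, any of them can be the coordinator.
results_from_1 = cluster[0].coordinate(fragments, cluster)
results_from_3 = cluster[2].coordinate(fragments, cluster)
```

In this model the same plan is produced no matter which daemon coordinates, which is the property that lets a client or load balancer send a query to any node.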

An Impala daemon is deployed on every machine in the cluster that also runs a datanode process (the block server for the underlying HDFS deployment), so there is normally one Impala daemon per machine. This allows Impala to take advantage of data locality and read blocks from the local filesystem without going over the network.
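A locality-aware assignment of blocks to daemons might look roughly like the following. The function `assign_scans`, the host names, and the least-loaded tie-breaking rule are all hypothetical; the sketch only shows the idea of preferring a daemon that sits on the same host as a block replica.

```python
def assign_scans(block_replicas, daemon_hosts):
    """Assign each block to a daemon, preferring a host that stores a replica.

    block_replicas: dict mapping block id -> list of hosts holding a replica
    daemon_hosts:   list of hosts that run an Impala daemon
    Returns a dict mapping block id -> (host, is_local).
    """
    load = {host: 0 for host in daemon_hosts}
    assignment = {}
    for block, replicas in block_replicas.items():
        local = [h for h in replicas if h in load]
        # Prefer a co-located daemon; fall back to any daemon (remote read).
        candidates = local if local else daemon_hosts
        host = min(candidates, key=lambda h: load[h])  # least-loaded candidate
        load[host] += 1
        assignment[block] = (host, bool(local))
    return assignment


blocks = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node2", "node3"],
    "blk_3": ["node1", "node3"],
    "blk_4": ["node9"],  # no daemon on this host, so the read must be remote
}
plan = assign_scans(blocks, ["node1", "node2", "node3"])
```

Here the first three blocks are read locally, while `blk_4` has no co-located daemon and falls back to a remote read.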

The Statestore daemon (statestored) is Impala’s metadata publish-subscribe service. It distributes cluster-wide metadata to all Impala processes.

Finally, the Catalog daemon (catalogd) serves as Impala’s catalog repository and metadata access gateway. Through catalogd, Impala daemons can execute DDL commands that are reflected in external catalog stores such as the Hive Metastore. Changes to the system catalog are then broadcast to the Impala daemons through the statestore.
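The interaction between the three services can be sketched as a simple publish-subscribe flow. The classes, the topic name "catalog", and the callback mechanism below are illustrative assumptions; the real statestored protocol is considerably more involved.

```python
class Statestore:
    """Toy publish-subscribe hub (hypothetical; not the real statestored protocol)."""

    def __init__(self):
        self.subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, update):
        for callback in self.subscribers.get(topic, []):
            callback(update)


class ImpalaDaemon:
    def __init__(self, name, statestore):
        self.name = name
        self.catalog_cache = {}  # local copy of table metadata
        # Merge every catalog update into the local cache.
        statestore.subscribe("catalog", self.catalog_cache.update)


class CatalogDaemon:
    def __init__(self, statestore):
        self.statestore = statestore
        self.tables = {}

    def execute_ddl(self, table, schema):
        # A real catalogd would also apply the change to the Hive Metastore.
        self.tables[table] = schema
        self.statestore.publish("catalog", {table: schema})


statestore = Statestore()
daemons = [ImpalaDaemon(f"impalad-{i}", statestore) for i in range(3)]
catalogd = CatalogDaemon(statestore)
catalogd.execute_ddl("sales", ["id INT", "amount DOUBLE"])
```

After the DDL call, every daemon's local cache contains the new `sales` table without any daemon having contacted the catalog directly.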

Query processing interfaces:
Clients can reach Impala through JDBC/ODBC drivers, the impala-shell command-line tool, and the Hue web user interface. Impala itself communicates with the Hive Metastore to obtain table metadata.
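From a client's point of view, these interfaces all come down to sending SQL over a connection and reading rows back. The sketch below shows that pattern with Python's DB-API. It uses an in-memory `sqlite3` database purely as a stand-in so it runs without a cluster; against Impala, the connection would instead come from an Impala client library such as impyla. The `run_query` helper and the `orders` table are hypothetical.

```python
import sqlite3


def run_query(connection, sql):
    """Run one SQL statement over any DB-API 2.0 connection and return all rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        cursor.close()


# Stand-in connection so the sketch runs without a cluster. Against Impala,
# the connection would come from an Impala client library instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("ann", 10.0), ("bob", 5.0), ("ann", 2.5)])

# A SELECT with an aggregate function, the kind of query Impala is built for.
rows = run_query(conn, "SELECT customer, SUM(amount) FROM orders "
                       "GROUP BY customer ORDER BY customer")
```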

Sources:
impala.apache.org
cidrdb.org