Calculating “Approximate Median” in Cloudera Impala, Apache Hive, SQL Server, Oracle and MySQL

APPROX_MEDIAN is an approximate inverse distribution function that accept a nonstop/continuous dispersion model. It takes a numeric or datetime value and returns an estimated middle value or an approximate interpolated value that would be the middle value once the values are sorted. Nulls are ignored in the calculation.

In short, median is the middle value of a set of ordered data.

Median = {(n + 1) ÷ 2}th value

is the number of values in a set of data.

This function is available few RDBMSs like Oracle SQL*Plus and Cloudera Impala, also in Hive, we can achieve it using PERCENTILE function. If the function is not available in the RDBMS in which you work, we still can get the approximate median value in simple steps which we will discuss later.

Click here to get the “Emp” dataset from my previous post if the table and data not exists in your database.

Let’s see how we implement it in Cloudera Impala first.

SELECT appx_median(sal) FROM emp;

appxmedian1

Result: 
appx_median(Sal)
20000.00

SELECT DeptID, appx_median(sal) FROM emp GROUP BY DeptID;

appxmedian2
Result: as shown in the picture.

The same can be accomplished in Hive with a different function.

SELECT DeptID, PERCENTILE(CAST(sal AS INT),0.5) FROM emp GROUP BY DeptID;

appxmedian_hive

Let’s try in Oracle SQL*Plus 12c
SELECT department_id “Department”,
       APPROX_MEDIAN(salary DETERMINISTIC) “Median Salary”
  FROM employees
  GROUP BY department_id  

In SQL Server:
The below will work only if the compact mode is 110 or higher
SELECT
  percentile_cont(0.25) WITHIN GROUP(ORDER BY sal) OVER () AS percentile_cont_25,
  percentile_cont(0.50) WITHIN GROUP(ORDER BY sal) OVER () AS percentile_cont_50,
  percentile_cont(0.75) WITHIN GROUP(ORDER BY sal) OVER () AS percentile_cont_75,
  percentile_cont(0.95) WITHIN GROUP(ORDER BY sal) OVER () AS percentile_cont_95
FROM emp;

In case of MySQL, there are many ways to calculate the median value. The workarounds can be found here.

Hope you find this article useful in calculating approximate median in Big Data technologies like Cloudera Impala, Apache Hive and various traditional RDBMSs.

2 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s