Apache Hive Aggregate Functions – Part-2

This is continuation part of “Apache Hive Aggregate Functions”

Function: VARIANCE & VAR_POP

Returns the population variance of the total number of records present in the specified column.
Syntax: variance(col), var_pop(col)

Example:
SELECT VARIANCE(amount) FROM tbSalesData;
The result is 14618.639
As per the calculation, the variance supposed to be 15836.858 however in Apache Hive 1.1.0, it will be calculated as population variance. Apache Impala returns 15836.858.

SELECT VAR_POP(amount) FROM tbSalesData;
The result is 14618.639

Function: VAR_SAMP
Returns the unbiased sample variance of a numeric column in the group.
Syntax: var_samp(col)

Example:
SELECT VAR_SAMP(amount) FROM tbSalesData;
Returns 15836.858 as the result.

Function: STDDEV_POP
Returns the standard deviation of a numeric column in the group.
Syntax: stddev_pop(col)

Example:
SELECT STDDEV_POP(amount) FROM tbSalesData;
Returns 120.90 as the result.

Function: STDDEV_SAMP
Returns the unbiased sample standard deviation of a numeric column in the group.
Syntax: stddev_samp(col)

Example:
SELECT STDDEV_SAMP(amount) FROM tbSalesData;
Returns 125.84 as the result.

Function: COVAR_POP
Returns the population covariance of a pair of numeric columns in the group based on the below formula.
(SUM(x*y) – SUM(x) * SUM(y) / COUNT(*)) / COUNT(*)
x is the independent variable and y is the dependent variable.
Syntax: covar_pop(col1, col2)

Example:
— Create the table and Insert the values using Impala-shell;
CREATE TABLE CovarPopExample(id int, value1 decimal(7,2), value2 decimal(7, 2));
INSERT INTO CovarPopExample VALUES(1001, 4, 0);
INSERT INTO CovarPopExample VALUES(1001, 5, 4);
INSERT INTO CovarPopExample VALUES(1002, 6, 5);
INSERT INTO CovarPopExample VALUES(1002, 7, 6);
INSERT INTO CovarPopExample VALUES(1002, 8, 7);

Execute the below in Apache Hive. This function is not available in Apache Impala (as of version 2.7.0).
SELECT id, COVAR_POP(value1, value2) FROM CovarPopExample GROUP BY id;
Returns the following
1001     1.0
1002     0.6666666666666666

Function: COVAR_SAMP
Returns the sample covariance of a pair of a numeric columns in the group. This function is not available in Apache Impala (as of version 2.7.0).
Syntax: covar_samp(col1, col2)

Example:
SELECT id, COVAR_SAMP(value1, value2) FROM CovarPopExample GROUP BY id;
Returns the following result.
1001    2.0
1002    1.0

Function: CORR

Returns the Pearson coefficient of correlation of a pair of a numeric columns in the group. This function is not available in Apache Impala (as of version 2.7.0)
Syntax: corr(col1, col2)

Example:
SELECT id, CORR(value1, value2) FROM CovarPopExample GROUP BY id;
Returns 1575.00 as the result.
1001    0.9999999999999999
1002    0.9999999999999999
SELECT CORR(value1, value2) FROM CovarPopExample;
Returns 0.936 as the result.

Function: PERCENTILE
Returns the exact pth percentile of a column in the group (does not work with floating point types). p must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.
This function is not available in Apache Impala (as of version 2.7.0)
Syntax: percentile(BIGINT col, p)

Example:
SELECT PERCENTILE(159900102,1);
Returns 1.59900102E8 as the result.

Function: PERCENTILE_APPROX
Returns an approximate pth percentile of a numeric column (including floating point types) in the group. The B parameter controls approximation accuracy at the cost of memory. Higher values yield better approximations, and the default is 10,000. When the number of distinct values in col is smaller than B, this gives an exact percentile value.
This function is not available in Apache Impala (as of version 2.7.0)
Syntax: percentile_approx(DOUBLE col, p [, B])

Example:
SELECT PERCENTILE_APPROX(159900102,0.9);
Returns 1.59900102E8 as the result.