Blog

back to blog
March 15, 2014 | Dipti Joshi

Windowing Functions for Analytics

 

Statistical windowing functions allow analysts to run statistical calculations in-database, without needing to use an external statistical processing environment such as R. Since most statistical processing platforms are limited to the memory of the single server on which they’re run, the ability to run statistics in-database on InfiniDB, utilizing multiple cores and the processing power of multiple nodes, is a game-changer for Data Scientists and Business Intelligence analysts looking to do deep analytics on Big Data.

The term “windowing” refers to the fact that InfiniDB performs a calculation on a subset, or a “window”, of the data. Although these functions provide aggregate values, unlike traditional aggregate functions using GROUP BY clause in standard SQL syntax, they return multiple rows for each group. Windowing functions simplify complex analytics tasks for computing cumulative, moving, centered or reporting aggregates without running sub-queries or writing stored procedures.  

Traditional Aggregate Function

Windowing Aggregate  Function

compute aggregates by creating groups

compute aggregates via partitions and window frames

output is one row for each group

 

output is one row for each input row

 

only one way of aggregating for each group

different rows in the same partition can have different window frames

only one way of grouping for each SELECT

aggregates in the same SELECT can use different partitions

Window functions operate on multiple rows, or groups of rows, within the result set returned from a query. Partitioning and windowing clauses define the groups of rows that a window function operates on. The order within these groups is defined by an ordering clause for the purpose of function evaluation. The general form of a windowing function is:

FUNCTION_NAME(expr) OVER [PARTITION  BY value, [value,…] ][ORDER BY rule, [rule, …]] [{ROWS|RANGE}  {start|between}]

As an example, to find the moving average of employees’ salaries within their department as ordered by their hiring date as well as their ranking by hiring date within their department:

mysql> select *,  avg(salary) over(partition by depname order by enroll_date ROWS UNBOUNDED PRECEDING) as moving_avg, rank() over(partition by depname order by enroll_date) as rank from empsalary order by depname, rank;

+-----------+-------+--------+-------------+------------+------+

| depname   | empno | salary | enroll_date | moving_avg | rank |

+-----------+-------+--------+-------------+------------+------+

| develop   |     8 |   6000 | 2006-10-01  |  6000.0000 |    1 |

| develop   |    10 |   5200 | 2007-08-01  |  5600.0000 |    2 |

| develop   |    11 |   5200 | 2007-08-15  |  5466.6667 |    3 |

| develop   |     7 |   4200 | 2008-01-01  |  5150.0000 |    4 |

| develop   |     9 |   4500 | 2008-01-01  |  5020.0000 |    4 |

| develop   |    12 |   7000 | 2008-08-09  |  5350.0000 |    6 |

| personnel |     2 |   3900 | 2006-12-23  |  3900.0000 |    1 |

| personnel |     5 |   3500 | 2007-12-10  |  3700.0000 |    2 |

| sales     |     1 |   5000 | 2006-10-01  |  5000.0000 |    1 |

| sales     |     3 |   4800 | 2007-08-01  |  4900.0000 |    2 |

| sales     |     4 |   4800 | 2007-08-08  |  4866.6667 |    3 |

+-----------+-------+--------+-------------+------------+------+

11 rows in set (0.02 sec)

 

Understanding Partition and Frame

  • For each row, calculation for an aggregation is done over a  FRAME of rows
  • The PARTITION of a row is the group of rows that have a value for a specific column same as the current row
  • FRAME for each row is  a subset of a PARTITION for the row

Example: The table below shows Partition and Frame for each row for this SQL statement

SELECT x,y,sum(x) OVER (PARTITION BY y RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) FROM a ORDER BY x, y

ROW NUMBER

X

Y

PARTITION

FRAME

FRAME

FRAME

FRAME

1

1

1

Partition for rows 1 to 4

Frame for row 1 sum(x) =22

     

2

4

1

Frame for row 2 sum(x) = 21

   

3

7

1

Frame for row 3 sum(x) = 17

 

4

10

1

Frame for row 4 sum(x) = 10

5

2

2

Partition for rows 5 to 7

Frame for row 5 sum(x) = 15

     

6

5

2

Frame for row 6 sum(x) = 13

   

7

8

2

Frame for row 7 sum(x) = 8

 

8

3

3

Partition for rows 8 to 10

Frame for row 8 sum(x) =18

     

9

6

3

Frame for row 9 sum(x) =15

   

10

9

3

Frame for row 10 sum(x) = 9

 

 

Processing Order

A SQL statement using a windowing function in its projection list, processes the join, where, group by and having clause of the query before computing the function as the order shown below.

    1. JOIN, WHERE, GROUP BY, HAVING CLAUSE

    2. PARTITIONS created and function applied to each row

    3. Final ORDER BY of the main query

Supported Windowing functions in InfiniDB

Aggregate

  • MAX, MIN, COUNT, SUM, AVG, STD, STDDEV, STDDEV_POP, VARIANCE, VAR_POP, VAR_SAMP

 Ranking

  • RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST, NTILE, PERCENTILE, PERCENTILE_CONT, PERCENTILE_DISC, MEDIAN, ROW_NUMBER

FIRST/LAST

  • NTH_VALUE, FIRST_VALUE, LAST_VALUE

LEAD/LAG

  • LAG, LEAD

 

Do you have a specific use case for windowing functions or another windowing function for InfiniDB that you would like to discuss?  Let us know in the comments, below, via email at info@ifninidb.co, or on Twitter @infiniDB or @dipti_smg.

Category
Categories