Incremental aggregation

The reduce function provides a closed formula mechanism for defining incremental aggregation over a collection $C$ by calling a reducer function $r$ for each element $e$ in $C$ in some order to finally produce an aggregated value $v$ . Each call to the reducer updates an intermediate accumulator being a partial result in the aggregation $v$ .

For bags, vectors and arrays the reduce function returns a single aggregated result for the entire collection.

The reduce function is defined for streams as well but behaves differently in that streamed reduce produces a new aggregated value for each new element produced by the stream.

Simple reduce

In its simplest form the reduce function has two arguments reduce(c,r). The reducer r is here a binary function $r(a,e_i)$ that returns a new value of accumulator $a$ for each element $e_i$ in $C$ . The accumulator has the same type as the elements and is initialized to $e_1$ . The result of the aggregation is the value of $a$ after all elements in $C$ have been visited by reduce in some order decided by the query processor.

We can define the semantics of simple reduce with the following incremental recurrence formulas for a collection $C$ with elements $e_1,...,e_N$ :

Initialization: $a_1 = e_1 \newline$ Reduction: $a_i = r(a_{i-1}, e_i), i > 1 \newline$ Finalization: $v = a_N$

Example: The following expression computes 1+2+3+4.

reduce((values 1,2,3,4),'+')

Here, the accumulator $a$ is a partial sum in the final aggregation.

The same reducer can be applied on vectors too.

reduce([1,2,3,4],'+')

For arrays the reducer is applied on each array element independent of the array's shape:

Examples:

reduce(array([1,2,3,4]),'+')

reduce(array([[1,2],[3,4]]),'+')

If the collection is empty, the result of a reduction is always null. If the collection contains a single element $e_1$ the result is that element.

Example:

reduce([1],'+')

The result type of reduce(c,r) is the same as the type of the elements in r.

Example:

typesig(reduce((values 1.1, 2.3), '+'))

Both the argument and result types of the reducer must be compatible with the types of the element of the reduced collection.

Example: The following expression fails.

reduce((values 'a','b'),'*')

If the reducer is overloaded, a matching resolvent is applied on the elements.

Example:

reduce((values [1,2],[3,4]), '+')

In general, simple reduction works for aggregations where the reductor is commutative, i.e. $r(x,y) = r(y,x)$ .

Examples:

reduce((values 1,2,3),'*')

reduce((values 1,2,3),'max')

note

Choosing non-commutative reducer, such as -, should be avoided. It causes the result of $reduce(C,r)$ to be unstable because the result will depend on in what order the system chooses the elements in $C$ for which $r$ is called.

The second argument of reduce, the reducer, is normally the name of a generic function. It can also be a functional.

Example:

reduce((values 1,2,3),thefunction('max'))

note

The reduce function is a higher order function since the reducer function in an argument.

Parameterized reduction

Many aggregate functions require that the accumulator is separate from the elements in the collection. For this, extra parameters can be provided when calling reduce, parameterized reduction. The extra parameters provide one or several accumulators used during the reduction. Parameterized reduction requires reduction initialization to set initial accumulator values before calling the reducer for $e_1$ and reduction finalization to compute the final aggregated value $v$ .

Example: To count the elements $count(C)$ in a collection $C$ having the elements $e_1,...,e_N$ the following recurrence formulas are used.

Initialization: $cnt_0 = 0 \newline$ Reduction: $cnt_i = cnt_{i-1} + 1 \newline$ Finalization: $count(C) = cnt_N$

The following function, the reducer, implements the reduction for $count(C)$ .

create function count_reducer(Integer cnt, Object e) -> Integer next_cnt
  as cnt + 1;

Using count_reducer we can define an aggregate function count1 with the accumulator initialized by $cnt_0 = 0$ as follows.

create function count1(Bag b) -> Integer
  as reduce(b, 'count_reducer', 0)

Test it:

count1(values 1,2,'a')

Here the accumulator cnt is separate from the elements in the bag and initialized by the extra argument 0 in the call to reduce. The finalization is the final value of the accumulator cnt returned by the call to reduce.

The type of the accumulator must be compatible with the type of the extra argument and the accumulator argument of the reducer.

Example: The following fails.

reduce((values 1,2,3), 'count_reducer', 'a')

Often aggregations require more than one accumulator. They can be added and initialized as more arguments to reduce.

Example: The mean of a collection $C$ with elements $e_i, i=1..N$ can be computed by the following recurrence formulas.

Initialization: $\newline cnt_0 = 0 \newline sum_0 = 0.0 \newline$ Reduction: $\newline cnt_i = cnt_{i-1} + 1 \newline sum_i = sum_{i-1} + e_i \newline$ Finalization: $\newline mean(C) = \frac{sum_N}{cnt_N}$

In this case the reducer needs two accumulators, cnt and sum, for $cnt_i$ and $sum_i$ , respectively. The reducer for $e_i$ is defined as follows based of the reduction step:

create function mean1_reducer(Integer cnt, Real sum, Real e)
                            -> (Integer next_cnt, Integer next_sum)
  as select cnt + 1,
            sum + e;

Now we can define mean1 for bags as follows:

create function mean1(Bag of Number b) -> Real
  as select sum/count
       from Integer count, Real sum
      where (count, sum) = reduce(b, 'mean1_reducer', 0, 0.0)

note

The initialization of accumulators cnt and sum are passed as extra parameters 0, 0.0 to reduce. The finalization is the expression sum/count.

Test it:

mean1(values 1,2,3)

note

The reduce function is a variadic function accepting a varying number of arguments.

Materialized reduction

Unlike bags and streams, whose elements are dynamically generated, vectors and arrays are materialized as in-memory data structures. This makes it efficient to run multi-pass aggregations over vectors and arrays.

We could define mean1 using mean1_reducer also for vectors:

create function mean1(Vector of Number v) -> Real
  as select sum/count
       from Integer count, Real sum
      where (count, sum) = reduce(v, 'mean1_reducer', 0, 0.0)

Test it:

mean1([1,2,3])

However, here we should notice that vectors are materialized so their sizes are precomputed and accessible with the function dim(Vector v)->Integer.

The aggregate function mean1(Vector of Number v) -> Real can therefore be implemented more efficiently without the counter as:

create function mean1(Vector of Number v) -> Real
  as reduce(v, '+') / dim(v)

Test it:

mean1([1,2,3])

Analogously, it can be implemented for arrays as:

create function mean1(Array a) -> Real
  as reduce(a, '+') / dim(a)

Test it:

mean1(array([1,2,3]))

One-pass reduction

Since elements in bags are generated it is preferred to use a one-pass reduction algorithm in order to avoid generating elements several times. One pass reduction algorithms are based on recurrence formulas as for mean1 above.

For example, the classical formula to compute the variance of a collection $C$ with elements $e_i$ , $i=1..N$ is two pass, since the mean has to be computed before the summation:

$var(C) = \frac{\sum_{i=1}^N(e_i - mean(C))^2}{N - 1}$

The classical formula can be used for computing the variance of vectors and arrays, since they are materialized. The materialized variance reduction needs two accumulators, the accumulated variance and the mean value computed beforehand.

create function variance1_reducer(Real var, Real mean, Real e)
                                -> (Real next_var, Real next_mean)
  as select var + (e - mean)^2,
            mean;

note

The accumulator mean is a really a constant parameter so the reducer keeps it constant.

In the aggregate function variance1(v) we initialize var to 0 and mean to mean1(v) defined above.

create function variance1(Vector of Number v) -> Real
  as select var/(dim(v) - 1)
       from Real var, Real mean
      where (var, mean) = reduce(v, 'variance1_reducer', 0, mean1(v))

Test it:

variance1([1,2,3])

Example: We could define variance1 for bags as follows

create function variance1(Bag of Number b) -> Real
  as select var/ (count(b) - 1)
       from Real var, Real mean
      where (var, mean) = reduce(b, 'variance1_reducer', 0, mean1(b))

Test it:

variance1(values 1,2,3)

The serious problem with the naive definition of variance1(Bag of Number)->Real is that the elements in the bag will be generated three times, by count, mean1, and variance1. A one-pass data reduction algorithm would be much better.

Welford's on line algorithm is a recommended numerically stable algorithm for computing statistics of a collection in one pass. It is defined by the following recurrence formulas for a collection $C$ with elements $e_i, i=1..N$ .

Initialization: $\newline cnt_0 = 0 \newline m_0 = 0 \newline s_0 = 0 \newline$ Reduction: $\newline cnt_i = cnt_{i-1} + 1 \newline m_i = e_i$ when $cnt_i = 1\newline m_i = m_{i-1} + \frac{e_i - m_{i-1}}{cnt_i}$ when $cnt_i > 1\newline s_i = s_{i-1} + (e_i - m_i) * (e_i - m_{i-1}) \newline$ Finalization: $\newline var(C) = \frac{s_N}{cnt_N - 1}$

Based on these formulas, we define a numerically stable and one-pass variance2(Bag of Number)->Real having the following reducer:

create function variance2_reducer(Integer cnt, Real m, Real s, Real e)
                                -> (Integer next_cnt, Real next_m, Real next_s)
  as select next_cnt, next_m, next_s
      where next_cnt = cnt + 1
        and case when next_cnt = 1 then next_m = e
                                   else next_m = m + (e - m)/next_cnt
             end
        and next_s = s + cnt*(e - m)^2/next_cnt

The definition of variance2 becomes:

create function variance2(Bag of Number b) -> Real
   as select s/(cnt - 1)
        from Integer cnt, Real m, Real s
       where (cnt, m, s) = reduce(b, 'variance2_reducer', 0, 0, 0)

Test it:

variance2(values 1,2,3,4,5)

Incremental aggregation

Simple reduce​

Parameterized reduction​

Materialized reduction​

One-pass reduction​

Simple reduce

Parameterized reduction

Materialized reduction

One-pass reduction