Skip to main content

Incremental aggregation

The reduce function provides a closed formula mechanism for defining incremental aggregation over a collection CC by calling a reducer function rr for each element ee in CC in some order to finally produce an aggregated value vv. Each call to the reducer updates an intermediate accumulator being a partial result in the aggregation vv.

For bags, vectors and arrays the reduce function returns a single aggregated result for the entire collection.

The reduce function is defined for streams as well but behaves differently in that streamed reduce produces a new aggregated value for each new element produced by the stream.

Simple reduce

In its simplest form the reduce function has two arguments reduce(c,r). The reducer r is here a binary function r(a,ei)r(a,e_i) that returns a new value of accumulator aa for each element eie_i in CC. The accumulator has the same type as the elements and is initialized to e1e_1. The result of the aggregation is the value of aa after all elements in CC have been visited by reduce in some order decided by the query processor.

We can define the semantics of simple reduce with the following incremental recurrence formulas for a collection CC with elements e1,...,eNe_1,...,e_N:

Initialization: a1=e1a_1 = e_1 \newline Reduction: ai=r(ai1,ei),i>1a_i = r(a_{i-1}, e_i), i > 1 \newline Finalization: v=aNv = a_N

Example: The following expression computes 1+2+3+4.

reduce((values 1,2,3,4),'+')

Here, the accumulator aa is a partial sum in the final aggregation.

The same reducer can be applied on vectors too.

reduce([1,2,3,4],'+')

For arrays the reducer is applied on each array element independent of the array's shape:

Examples:

reduce(array([1,2,3,4]),'+')
reduce(array([[1,2],[3,4]]),'+')

If the collection is empty, the result of a reduction is always null. If the collection contains a single element e1e_1 the result is that element.

Example:

reduce([1],'+')

The result type of reduce(c,r) is the same as the type of the elements in r.

Example:

typesig(reduce((values 1.1, 2.3), '+'))

Both the argument and result types of the reducer must be compatible with the types of the element of the reduced collection.

Example: The following expression fails.

reduce((values 'a','b'),'*')

If the reducer is overloaded, a matching resolvent is applied on the elements.

Example:

reduce((values [1,2],[3,4]), '+')

In general, simple reduction works for aggregations where the reductor is commutative, i.e. r(x,y)=r(y,x)r(x,y) = r(y,x).

Examples:

reduce((values 1,2,3),'*')
reduce((values 1,2,3),'max')
note

Choosing non-commutative reducer, such as -, should be avoided. It causes the result of reduce(C,r)reduce(C,r) to be unstable because the result will depend on in what order the system chooses the elements in CC for which rr is called.

The second argument of reduce, the reducer, is normally the name of a generic function. It can also be a functional.

Example:

reduce((values 1,2,3),thefunction('max'))
note

The reduce function is a higher order function since the reducer function in an argument.

Parameterized reduction

Many aggregate functions require that the accumulator is separate from the elements in the collection. For this, extra parameters can be provided when calling reduce, parameterized reduction. The extra parameters provide one or several accumulators used during the reduction. Parameterized reduction requires reduction initialization to set initial accumulator values before calling the reducer for e1e_1 and reduction finalization to compute the final aggregated value vv.

Example: To count the elements count(C)count(C) in a collection CC having the elements e1,...,eNe_1,...,e_N the following recurrence formulas are used.

Initialization: cnt0=0cnt_0 = 0 \newline Reduction: cnti=cnti1+1cnt_i = cnt_{i-1} + 1 \newline Finalization: count(C)=cntNcount(C) = cnt_N

The following function, the reducer, implements the reduction for count(C)count(C).

create function count_reducer(Integer cnt, Object e) -> Integer next_cnt
as cnt + 1;

Using count_reducer we can define an aggregate function count1 with the accumulator initialized by cnt0=0cnt_0 = 0 as follows.

create function count1(Bag b) -> Integer
as reduce(b, 'count_reducer', 0)

Test it:

count1(values 1,2,'a')

Here the accumulator cnt is separate from the elements in the bag and initialized by the extra argument 0 in the call to reduce. The finalization is the final value of the accumulator cnt returned by the call to reduce.

The type of the accumulator must be compatible with the type of the extra argument and the accumulator argument of the reducer.

Example: The following fails.

reduce((values 1,2,3), 'count_reducer', 'a')

Often aggregations require more than one accumulator. They can be added and initialized as more arguments to reduce.

Example: The mean of a collection CC with elements ei,i=1..Ne_i, i=1..N can be computed by the following recurrence formulas.

Initialization: cnt0=0sum0=0.0\newline cnt_0 = 0 \newline sum_0 = 0.0 \newline Reduction: cnti=cnti1+1sumi=sumi1+ei\newline cnt_i = cnt_{i-1} + 1 \newline sum_i = sum_{i-1} + e_i \newline Finalization: mean(C)=sumNcntN\newline mean(C) = \frac{sum_N}{cnt_N}

In this case the reducer needs two accumulators, cnt and sum, for cnticnt_i and sumisum_i, respectively. The reducer for eie_i is defined as follows based of the reduction step:

create function mean1_reducer(Integer cnt, Real sum, Real e)
-> (Integer next_cnt, Integer next_sum)
as select cnt + 1,
sum + e;

Now we can define mean1 for bags as follows:

create function mean1(Bag of Number b) -> Real
as select sum/count
from Integer count, Real sum
where (count, sum) = reduce(b, 'mean1_reducer', 0, 0.0)
note

The initialization of accumulators cnt and sum are passed as extra parameters 0, 0.0 to reduce. The finalization is the expression sum/count.

Test it:

mean1(values 1,2,3)
note

The reduce function is a variadic function accepting a varying number of arguments.

Materialized reduction

Unlike bags and streams, whose elements are dynamically generated, vectors and arrays are materialized as in-memory data structures. This makes it efficient to run multi-pass aggregations over vectors and arrays.

We could define mean1 using mean1_reducer also for vectors:

create function mean1(Vector of Number v) -> Real
as select sum/count
from Integer count, Real sum
where (count, sum) = reduce(v, 'mean1_reducer', 0, 0.0)

Test it:

mean1([1,2,3])

However, here we should notice that vectors are materialized so their sizes are precomputed and accessible with the function dim(Vector v)->Integer.

The aggregate function mean1(Vector of Number v) -> Real can therefore be implemented more efficiently without the counter as:

create function mean1(Vector of Number v) -> Real
as reduce(v, '+') / dim(v)

Test it:

mean1([1,2,3])

Analogously, it can be implemented for arrays as:

create function mean1(Array a) -> Real
as reduce(a, '+') / dim(a)

Test it:

mean1(array([1,2,3]))

One-pass reduction

Since elements in bags are generated it is preferred to use a one-pass reduction algorithm in order to avoid generating elements several times. One pass reduction algorithms are based on recurrence formulas as for mean1 above.

For example, the classical formula to compute the variance of a collection CC with elements eie_i, i=1..Ni=1..N is two pass, since the mean has to be computed before the summation:

var(C)=i=1N(eimean(C))2N1var(C) = \frac{\sum_{i=1}^N(e_i - mean(C))^2}{N - 1}

The classical formula can be used for computing the variance of vectors and arrays, since they are materialized. The materialized variance reduction needs two accumulators, the accumulated variance and the mean value computed beforehand.

create function variance1_reducer(Real var, Real mean, Real e)
-> (Real next_var, Real next_mean)
as select var + (e - mean)^2,
mean;
note

The accumulator mean is a really a constant parameter so the reducer keeps it constant.

In the aggregate function variance1(v) we initialize var to 0 and mean to mean1(v) defined above.

create function variance1(Vector of Number v) -> Real
as select var/(dim(v) - 1)
from Real var, Real mean
where (var, mean) = reduce(v, 'variance1_reducer', 0, mean1(v))

Test it:

variance1([1,2,3])

Example: We could define variance1 for bags as follows

create function variance1(Bag of Number b) -> Real
as select var/ (count(b) - 1)
from Real var, Real mean
where (var, mean) = reduce(b, 'variance1_reducer', 0, mean1(b))

Test it:

variance1(values 1,2,3)

The serious problem with the naive definition of variance1(Bag of Number)->Real is that the elements in the bag will be generated three times, by count, mean1, and variance1. A one-pass data reduction algorithm would be much better.

Welford's on line algorithm is a recommended numerically stable algorithm for computing statistics of a collection in one pass. It is defined by the following recurrence formulas for a collection CC with elements ei,i=1..Ne_i, i=1..N.

Initialization: cnt0=0m0=0s0=0\newline cnt_0 = 0 \newline m_0 = 0 \newline s_0 = 0 \newline Reduction: cnti=cnti1+1mi=ei\newline cnt_i = cnt_{i-1} + 1 \newline m_i = e_i when cnti=1mi=mi1+eimi1cnticnt_i = 1\newline m_i = m_{i-1} + \frac{e_i - m_{i-1}}{cnt_i} when cnti>1si=si1+(eimi)(eimi1)cnt_i > 1\newline s_i = s_{i-1} + (e_i - m_i) * (e_i - m_{i-1}) \newline Finalization: var(C)=sNcntN1\newline var(C) = \frac{s_N}{cnt_N - 1}

Based on these formulas, we define a numerically stable and one-pass variance2(Bag of Number)->Real having the following reducer:

create function variance2_reducer(Integer cnt, Real m, Real s, Real e)
-> (Integer next_cnt, Real next_m, Real next_s)
as select next_cnt, next_m, next_s
where next_cnt = cnt + 1
and case when next_cnt = 1 then next_m = e
else next_m = m + (e - m)/next_cnt
end
and next_s = s + cnt*(e - m)^2/next_cnt

The definition of variance2 becomes:

create function variance2(Bag of Number b) -> Real
as select s/(cnt - 1)
from Integer cnt, Real m, Real s
where (cnt, m, s) = reduce(b, 'variance2_reducer', 0, 0, 0)

Test it:

variance2(values 1,2,3,4,5)