Basic data reduction
| Guide specification | |
|---|---|
| Guide type: | Wasm code |
| Requirements: | None |
| Recommended reading: | None |
Introduction
In this guide we look at different ways of doing data reduction
through sampling. We will illustrate the data reduction techniques on
streams from the built-in synthetic stream generator simstream(),
but the techniques apply just as well to data streams from real
sensors.
Input data
The function simstream(pace) generates a stream with a new simulated
numerical element every pace seconds. To look at only the first 1.0
seconds of simstream(0.1), we wrap the stream in timeout():
timeout(simstream(0.1), 1.0)
Count-based sampling
The easiest way to reduce the data rate is to sample the output of
simstream(pace). One way of sampling is to use the function
winagg(s,size,stride). For every stride elements of stream s,
winagg() returns a window as a vector of the last size elements.
Choosing a window size smaller than the stride yields a down-sampled
stream. For example, a size of 1 and a stride of 10 keeps one element
out of every ten:
winagg(simstream(0.1), 1, 10)
The preceding example returns a stream of windows, each holding one
element out of every ten elements in simstream(). We use vector
indexing to get the first element of each window in the result stream:
select Stream of v[1]
from Vector v
where v in winagg(simstream(.1), 1, 10)
Set a larger value of size and see how it affects the output!
This was an easy way of reducing the data rate of a stream through downsampling.
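To make the windowing rule concrete, here is a minimal Python sketch of count-based windows. It is a model for illustration only, not SA Engine code: the name count_windows is hypothetical, and the sketch assumes that a window is emitted each time stride new elements have arrived and that it holds the most recent size of them.

```python
from collections import deque
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def count_windows(stream: Iterable[T], size: int, stride: int) -> Iterator[List[T]]:
    """Yield the most recent `size` elements each time `stride` elements arrive.

    Assumption: windows are aligned to multiples of `stride`; the real
    winagg() may align its first window differently.
    """
    recent = deque(maxlen=size)  # keeps only the newest `size` elements
    for count, element in enumerate(stream, start=1):
        recent.append(element)
        if count % stride == 0:
            yield list(recent)


readings = range(1, 31)  # stand-in for a numeric stream: 1, 2, ..., 30

# size=1, stride=10: one element out of every ten
sampled = [window[0] for window in count_windows(readings, 1, 10)]
print(sampled)  # [10, 20, 30]

# A larger size keeps more of the elements that precede each sample point
wider = list(count_windows(readings, 3, 10))
print(wider)  # [[8, 9, 10], [18, 19, 20], [28, 29, 30]]
```

With size smaller than stride, the elements between windows are simply dropped, which is what reduces the data rate.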
Time-based sampling
In certain applications it is more meaningful to sample windows over
time-stamped streams than to use the counting windows formed by
winagg(). The built-in stream function twinagg(s,size,stride) is
similar to winagg(s,size,stride), but the input parameters size and
stride are specified in seconds rather than in number of stream
elements. Every stride seconds, the function returns a time-stamped
window containing the elements received in stream s during the last
size seconds. Note that twinagg() requires s to be a time-stamped
stream.
We can timestamp any stream using ts():
timeout(ts(simstream(0.1)), 1.0)
Now that we have a time-stamped stream we can use twinagg() on the stream:
twinagg(ts(simstream(0.1)), 1.0, 1.0)
In the preceding example, each window contains the elements of
simstream(0.1) received during one second (size=1.0), and the stride
is also one second (stride=1.0), so every element of simstream(0.1)
appears in exactly one window. Windows like these, where the size
equals the stride, are called temporal tumbling windows.
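The behavior of time-based windows can be sketched in Python as follows. This is a model under stated assumptions, not the twinagg() implementation: the name time_windows is hypothetical, timestamps are integer milliseconds so that the arithmetic stays exact, each window is labeled with its end time, and a window is emitted only when an element at or after its end time arrives.

```python
from typing import Iterable, Iterator, List, Tuple

Stamped = Tuple[int, float]  # (timestamp in ms, value)


def time_windows(
    stream: Iterable[Stamped], size: int, stride: int
) -> Iterator[Tuple[int, List[float]]]:
    """Every `stride` ms, yield (window_end, values from the last `size` ms).

    Assumptions: the stream is ordered by timestamp, the first window ends
    `stride` ms after the first element, and a trailing partial window is
    not emitted when the stream ends.
    """
    buffer: List[Stamped] = []
    next_emit = None
    for t, value in stream:
        if next_emit is None:
            next_emit = t + stride
        while t >= next_emit:
            start = next_emit - size
            yield next_emit, [v for (ts, v) in buffer if start <= ts < next_emit]
            next_emit += stride
            # Drop elements that can no longer fall inside a later window
            buffer = [(ts, v) for (ts, v) in buffer if ts >= next_emit - size]
        buffer.append((t, value))


# One element every 100 ms for 2.5 s; the value is just the element index
stream = [(i * 100, float(i)) for i in range(25)]

# size == stride gives tumbling windows: each element lands in one window
windows = list(time_windows(stream, size=1000, stride=1000))
print([len(values) for _, values in windows])  # [10, 10]

# Taking the first element of each window samples once per second
samples = [values[0] for _, values in windows]
print(samples)  # [0.0, 10.0]
```

If size were larger than stride, consecutive windows would overlap and share elements. If size were smaller than stride, some elements would fall between windows and be dropped.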
We see that twinagg(s,size,stride) forms a stream of time-stamped
temporal windows of the elements in s. A time-stamped window consists
of a time stamp and a vector that represents the elements of the
window. To get the elements of a window w we use the value(w)
function. The following query extracts the window elements from the
twinagg() result and returns the first element in the window vector,
thereby sampling one element from the stream each second:
select value(tsv)[1]
from Timeval of Vector tsv, Stream of Timeval s
where s = ts(simstream(.1))
and tsv in twinagg(s, 1.0, 1.0)
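We can extract the time stamp from a time-stamped window with the
function timestamp(). If we parameterize the example above with the
variables streamrate and samplingrate, we can adjust how often the
stream produces an element and how often it is sampled. Despite their
names, both variables are periods in seconds: streamrate is the pace
passed to simstream(), and samplingrate is used as both the size and
the stride of twinagg():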
select timestamp(tsv), value(tsv)[1]
from Timeval of Vector tsv, Stream of Timeval s,
Number streamrate, Number samplingrate
where streamrate = .02 and samplingrate = .5
and s = ts(simstream(streamrate))
and tsv in twinagg(s, samplingrate, samplingrate);
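As a rough check of how much this query reduces the data, the short Python sketch below computes how many stream elements fall into each window for the values used above. The helper name elements_per_window is hypothetical, and the calculation assumes the stream delivers elements at exactly the requested pace.

```python
from fractions import Fraction


def elements_per_window(stream_period: str, sampling_period: str) -> Fraction:
    """Return how many elements arrive during one sampling period.

    Periods are given as decimal strings and converted to exact fractions
    to avoid floating-point rounding.
    """
    return Fraction(sampling_period) / Fraction(stream_period)


ratio = elements_per_window("0.02", "0.5")
print(ratio)  # 25
```

Since the query keeps only the first element of each window, roughly one element in 25 is retained under that assumption.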
Visit Streams for more on streams.
Conclusion
This guide has shown how to do data reduction by sampling. As a next step, we recommend reading the Data reduction on edge devices guide, where we try these concepts on a real edge device.
