# *k*-means clustering

SA Engine supports *k*-means clustering through the `kmeans` system model.

System models are loaded with the `system_models:load()` function.

`system_models:load("kmeans");`

Once the system model is loaded, the first step is to call `kmeans:generate(name)` to generate a new instance of *k*-means for your analysis.

`kmeans:generate('my_test');`

The command generates a number of *k*-means functions with the prefix `kmeans_my_test:`. Indexed stored functions are generated to allow fast
training and inference on the model, along with functions for
populating the model with training data, training the model, and
inferring feature vectors.

To illustrate *k*-means we will generate random training data points in
three square shapes using the following function:

```
create function gen_datapoint(Number rad, Number noise, Number i)
                            -> Vector of Number
  as [rad * mod(i,3) + frand(noise) + 0.5 - noise/2,
      rad * mod(i,3) + frand(noise) + 0.5 - noise/2];
```

Try generating a vector of 1000 random data points with:

`select Vector of gen_datapoint(1, 0.8, range(1000));`
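To make the geometry of the training data concrete, here is a minimal Python sketch of the same generator. It is not SA Engine code. It assumes that `frand(noise)` draws a uniform random number between 0 and `noise`, and that `range(1000)` enumerates 1 to 1000; both are assumptions about OSQL made for illustration only.

```python
import random

def gen_datapoint(rad, noise, i, rng=random):
    """Python analogue of the OSQL gen_datapoint function above."""
    # mod(i,3) selects one of three squares placed along the diagonal.
    offset = rad * (i % 3)
    # Assumption: frand(noise) is uniform between 0 and noise, so each
    # coordinate lies within noise/2 of offset + 0.5.
    return [offset + rng.uniform(0, noise) + 0.5 - noise / 2,
            offset + rng.uniform(0, noise) + 0.5 - noise / 2]

# Assumption: range(1000) enumerates 1..1000.
points = [gen_datapoint(1, 0.8, i) for i in range(1, 1001)]
```

Under these assumptions, each point falls in a square of side 0.8 centered at (0.5, 0.5), (1.5, 1.5) or (2.5, 2.5), which is why three clusters is a natural choice when training the model.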

To train the *k*-means model we need to store the training set in the stored function `kmeans_my_test:datapoints()`. To populate the stored function we use the function `kmeans_my_test:add_data()`. Let's populate it with three random square shapes:

`kmeans_my_test:add_data(select Stream of gen_datapoint(1,0.8,range(1000)));`

Now that we have populated our dataset with random points we can visualize them in a scatter plot:

```
//plot: Scatter plot
select Vector of { "x": v[1], "y": v[2] }
  from Vector v, Number n
 where v = kmeans_my_test:datapoints(n);
```

Now that we have a training dataset, we can move on to training the *k*-means model. We do this by calling the function `kmeans_my_test:fit(Number max_iter, Number num_clusters)`, where `max_iter` is the maximum number of training iterations and `num_clusters` is the number of clusters to create. For more details on these parameter settings, see *k*-means.

*Example:*

`kmeans_my_test:fit(10000,3);`
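The source does not describe how `fit` is implemented. To show what the two parameters typically control, the following is a minimal Python sketch of the standard Lloyd's iteration for *k*-means. The helper names `closest` and `fit`, and the choice to seed the centroids with the first `num_clusters` points, are assumptions of this sketch rather than SA Engine behavior.

```python
import math

def closest(centroids, point):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: math.dist(centroids[c], point))

def fit(points, max_iter, num_clusters):
    """Run at most max_iter Lloyd iterations and return the centroids."""
    # Assumption: seed the centroids with the first num_clusters points.
    centroids = [list(p) for p in points[:num_clusters]]
    for _ in range(max_iter):
        # Assignment step: group every point under its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            groups[closest(centroids, p)].append(p)
        # Update step: move each centroid to the mean of its group;
        # a centroid with an empty group stays where it is.
        updated = [[sum(xs) / len(g) for xs in zip(*g)] if g else c
                   for g, c in zip(groups, centroids)]
        if updated == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = updated
    return centroids
```

In a sketch like this, `max_iter` only acts as a safety cap: on well-separated data the loop usually stops long before the limit because the centroids stop moving.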

The *k*-means model is now trained, so let's have a look at the
result. To visualize the clusters we use a scatter plot where each point
is colored by its cluster. An additional value in each input vector
to the scatter plot specifies the color of the point, i.e.
each vector can have the format `[x,y,color]`. The color is specified as an integer in the
interval *[-1,255]*.

*Example:*

```
//plot: Multi plot
{
  "sa_plot": "Scatter plot",
  "size_axis": "none",
  "color_axis": 3
};
select vector of v
  from Vector v, Number cid, Vector of number row
 where row in (select stream of kmeans_my_test:datapoints(i) from Number i)
   and cid = kmeans_my_test:classify(row)
   and v = concat(row,[cid]);
```

We can see the three different clusters here, each with a distinct color. Let's use this model to classify a stream of 2D data points.
This is done with the function `kmeans_my_test:classify(Vector of Number feature Vector point) -> Number`, which returns the closest
cluster for the point `point`.

```
//plot: Multi plot
{
  "sa_plot": "Scatter plot",
  "size_axis": "none",
  "color_axis": 3
};
select vector of v
  from Vector v, Number cid, Vector of number row
 where row in gen_datapoint(1,1.4,siota(1,10000))
   and cid = kmeans_my_test:classify(row)
   and v = concat(row,[cid]);
```
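Conceptually, classification is a nearest-centroid lookup followed by appending the cluster id as the color value. The Python sketch below illustrates this, assuming that "closest" means smallest Euclidean distance, which the source does not state. The centroid and row values are hypothetical, and the 0-based cluster ids are an artifact of Python, not a statement about the ids SA Engine returns.

```python
import math

def classify(centroids, point):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: math.dist(centroids[c], point))

# Hypothetical centroids, roughly where the three squares are centered.
centroids = [[0.5, 0.5], [1.5, 1.5], [2.5, 2.5]]
# Hypothetical incoming data points.
rows = [[0.4, 0.7], [1.9, 1.2], [2.2, 2.9]]

# Mirrors concat(row,[cid]): append the cluster id as the third value,
# giving vectors in the [x, y, color] format used by the scatter plot.
colored = [row + [classify(centroids, row)] for row in rows]
```

Because the new points are generated with a larger noise value (1.4 instead of 0.8), the squares overlap, and points in the overlap are assigned to whichever centroid happens to be nearest.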

## Save the trained *k*-means model

You can save the *k*-means instance `my_test` as a model by calling the function `kmeans:save_model(Charstring instance, Charstring model)`:

`kmeans:save_model("my_test", "my_trained_kmeans_model");`

More information on how to work with models can be found in the Managing models guide and in Working with queries and models in the SA Studio Manual.

## API

All *k*-means functions are listed in the OSQL API.