*k*-means clustering

This model implements *k*-means clustering.

The first step is to generate a new instance of *k*-means for your
analysis. This is done with the function `kmeans:generate(Charstring name)`:

    load_system_model("kmeans");
    kmeans:generate('my_test');

The command generates a number of *k*-means functions with the prefix
`kmeans_my_test:`. Indexed stored functions are generated to allow fast
training and inference on the model, along with functions for
populating the model with training data, training the model, and
inferring feature vectors.

To illustrate *k*-means we will generate random training data points in
three square shapes using the following function:

    create function gen_datapoint(Number rad, Number noise, Number i)
                                -> Vector of Number
      as [rad * mod(i,3) + frand(noise) + 0.5 - noise/2,
          rad * mod(i,3) + frand(noise) + 0.5 - noise/2];

Try generating a vector of 1000 random data points with:

`select Vector of gen_datapoint(1, 0.8, range(1000));`
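
To make the arithmetic in `gen_datapoint` concrete, here is a minimal Python sketch of the same formula. It assumes that `frand(noise)` returns a uniformly distributed number between 0 and `noise`, which the text above does not state. The `rng` parameter is a hypothetical addition for reproducibility. Under that assumption, with `rad = 1` and `noise = 0.8`, each point falls in one of three squares of side 0.8 centered at (0.5, 0.5), (1.5, 1.5) and (2.5, 2.5).

```python
import random

def gen_datapoint(rad, noise, i, rng=random):
    """Python analogue of the OSQL gen_datapoint function.

    Assumption: frand(noise) is uniform between 0 and noise.
    """
    offset = rad * (i % 3)      # selects one of three squares: 0, rad or 2*rad
    shift = 0.5 - noise / 2     # centers each square at offset + 0.5
    # x and y draw independent noise, so every cluster fills a square
    return [offset + rng.uniform(0, noise) + shift,
            offset + rng.uniform(0, noise) + shift]

# 1000 points, cycling through the three squares
points = [gen_datapoint(1, 0.8, i) for i in range(1000)]
```

Because both coordinates use the same offset, the three squares lie along the diagonal rather than side by side.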

The training set of data points is stored in the function
`kmeans_my_test:datapoints`. It is populated with the function
`kmeans_my_test:add_data`. Let's populate it with three random square
shapes:

`kmeans_my_test:add_data(select Stream of gen_datapoint(1,0.8,range(1000)));`

Now that we have populated our dataset with random points we can start by visualizing them as a scatter plot:

    {
      "sa_plot": "Scatter plot"
    };
    select Vector of { "x": v[1], "y": v[2] }
      from Vector v, Number n
     where v = kmeans_my_test:datapoints(n);

Now that we have populated our training dataset with random 2D points,
we move on to training the *k*-means model by calling the function
`kmeans_my_test:fit(Number max_iter, Number num_clusters)`. `max_iter` is the maximum
number of iterations when training, and `num_clusters` is the number of clusters to create.
For more details on these parameter settings, please read
*k*-means.
*Example:*

`kmeans_my_test:fit(10000,3);`
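
The text does not say which training procedure `fit` uses internally. As an illustration only, the following Python sketch shows the standard Lloyd iteration that *k*-means usually refers to, with `max_iter` and `num_clusters` playing the same roles as above. The random initialization, the `seed` parameter, and the early exit on convergence are assumptions of this sketch, not documented behavior of the model.

```python
import math
import random

def fit(points, max_iter, num_clusters, seed=0):
    """Plain Lloyd iteration; returns the list of cluster centers."""
    rng = random.Random(seed)
    # Assumption: start from randomly chosen data points.
    centers = rng.sample(points, num_clusters)
    for _ in range(max_iter):
        # Assignment step: group every point with its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(num_clusters),
                          key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if new_centers == centers:   # nothing moved, so stop early
            break
        centers = new_centers
    return centers

# Two well-separated pairs of points give centers at (0, 0.5) and
# (10, 10.5); the order of the two centers depends on the seed.
centers = fit([[0, 0], [0, 1], [10, 10], [10, 11]], max_iter=100, num_clusters=2)
```

In this sketch `max_iter` is only an upper bound: on well-separated data such as the three squares above, the loop typically stops after a handful of iterations.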

The *k*-means model is now trained, so let's have a look at the
result. To visualize the clusters we use a scatter plot where each point
is colored by its cluster. An additional value in the input vectors
to the scatter plot specifies the color of each point, i.e.
each vector can have the format `[x,y,color]`. The color is specified as an integer in the
interval *[-1,255]*.

*Example:*

    {
      "sa_plot": "Scatter plot",
      "size_axis": "none",
      "color_axis": 3
    };
    select vector of v
      from Vector v, Number cid, Vector of number row
     where row in (select stream of kmeans_my_test:datapoints(i) from Number i)
       and cid = kmeans_my_test:classify(row)
       and v = concat(row,[cid]);

We can see the three different clusters here, each with a distinct color. Let's use this model to classify a stream of 2D datapoints.
This is done using the function `kmeans_my_test:classify(Vector of Number point) -> Number`,
which returns the closest cluster for the point `point`:

    {
      "sa_plot": "Scatter plot",
      "size_axis": "none",
      "color_axis": 3
    };
    select vector of v
      from Vector v, Number cid, Vector of number row
     where row in gen_datapoint(1,1.4,siota(1,10000))
       and cid = kmeans_my_test:classify(row)
       and v = concat(row,[cid]);
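
Conceptually, returning "the closest cluster" amounts to a nearest-center lookup. The Python sketch below illustrates that idea with Euclidean distance. The center coordinates are hypothetical values chosen to match the three squares generated earlier, and the zero-based cluster numbering is an assumption of the sketch; the ids returned by `kmeans_my_test:classify` may be numbered differently.

```python
import math

def classify(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))

# Hypothetical trained centers, one per square
centers = [[0.5, 0.5], [1.5, 1.5], [2.5, 2.5]]

# Label a few new points by their nearest center
labels = [classify(p, centers) for p in [[0.4, 0.7], [2.9, 2.2], [1.6, 1.4]]]
# labels == [0, 2, 1]
```

With the larger noise value of 1.4 used in the query above, the generated squares are wider than the spacing between them, so points in the overlapping regions are simply assigned to whichever center is nearest.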

# Save the trained *k*-means model

You can create a user model of the kmeans instance `my_test`
by calling the function
`kmeans:save_model(Charstring instance, Charstring model)`:

`kmeans:save_model("my_test", "my_trained_kmeans_model");`

The model `my_trained_kmeans_model` is now a regular model that can be found in the OSQL-editor and
deployed like any other user-defined model.