k-means clustering

This model implements k-means clustering.

The first step is to generate a new instance of k-means for your analysis. This is done with the function kmeans:generate(Charstring name):

load_system_model("kmeans");
kmeans:generate('my_test');

The command generates a number of k-means functions with the prefix kmeans_my_test:. These include indexed stored functions that allow fast training and inference on the model, along with functions for populating the model with training data, training the model, and classifying feature vectors.

To illustrate k-means we will generate random training data points in three square shapes using the following function:

create function gen_datapoint(Number rad, Number noise, Number i)
-> Vector of Number
as [rad * mod(i,3) + frand(noise) + 0.5 - noise/2,
rad * mod(i,3) + frand(noise) + 0.5 - noise/2];
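
To make the shape of this data concrete, here is a rough Python mirror of the function above. It is only an illustration, not part of the model, and it assumes that frand(noise) draws uniformly from the interval [0, noise). Under that assumption each point falls in a square of side noise centered at 0.5, 1.5 or 2.5 (for rad = 1) on both axes.

```python
import random

def gen_datapoint(rad, noise, i, rng=random):
    """Illustrative Python counterpart of the OSQL gen_datapoint.

    Assumption: frand(noise) is uniform on [0, noise).
    """
    # mod(i,3) selects one of three square centers along the diagonal.
    center = rad * (i % 3) + 0.5
    # Shifting by noise/2 centers the uniform noise on the square center.
    return [center + rng.uniform(0, noise) - noise / 2,
            center + rng.uniform(0, noise) - noise / 2]

# 1000 points spread over the three squares.
points = [gen_datapoint(1, 0.8, i) for i in range(1000)]
```

With rad = 1 and noise = 0.8 the squares have side 0.8 and their centers are 1 apart, so they do not overlap.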

Try generating a vector of 1000 random data points with:

select Vector of gen_datapoint(1, 0.8, range(1000));

The training set of data points is stored in the function kmeans_my_test:datapoints. It is populated with the function kmeans_my_test:add_data. Let's populate it with three random square shapes:

kmeans_my_test:add_data(select Stream of gen_datapoint(1,0.8,range(1000)));

Now that we have populated our dataset with random points we can start by visualizing them as a scatter plot:

{
"sa_plot": "Scatter plot"
};
select Vector of { "x": v[1], "y": v[2] }
from Vector v, Number n
where v = kmeans_my_test:datapoints(n);

With the training dataset populated with random 2D points, we move on to training the k-means model by calling the function kmeans_my_test:fit(Number max_iter, Number num_clusters). max_iter is the maximum number of training iterations and num_clusters is the number of clusters to create. For more details on these parameter settings, see k-means.

Example:

kmeans_my_test:fit(10000,3);
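
To give a feel for what the two parameters control, the following Python sketch shows the textbook k-means procedure (Lloyd's algorithm). It is not the implementation behind kmeans_my_test:fit. The initialization from randomly chosen data points, the seed parameter and the helper names are assumptions made for this illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closest(point, centroids):
    """Index of the centroid nearest to point."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))

def fit(points, max_iter, num_clusters, seed=0):
    """Textbook k-means: returns a list of num_clusters centroids."""
    rng = random.Random(seed)
    # Assumption: start from randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, num_clusters)]
    for _ in range(max_iter):
        # Assignment step: group points by their nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            groups[closest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its group;
        # an empty group keeps its previous centroid.
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:  # converged before reaching max_iter
            break
        centroids = updated
    return centroids
```

In this sketch max_iter is only an upper bound: the loop stops as soon as the centroids no longer move, which on well-separated data usually happens after a handful of iterations.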

The k-means model is now trained, so let's have a look at the result. To visualize the clusters we use a scatter plot where each point is colored by its cluster. An additional value in the input vectors to the scatter plot specifies the color of each point, i.e. each vector can have the format [x,y,color]. The color is specified as an integer in the interval [-1,255].

Example:

{
"sa_plot": "Scatter plot",
"size_axis": "none",
"color_axis": 3
};
select vector of v
from Vector v, Number cid, Vector of number row
where row in (select stream of kmeans_my_test:datapoints(i) from Number i)
and cid = kmeans_my_test:classify(row)
and v = concat(row,[cid]);

We can see the three different clusters here, each with a distinct color. Let's use this model to classify a stream of 2D data points. This is done using the function kmeans_my_test:classify(Vector of Number point) -> Number, which returns the closest cluster for the feature vector point.

{
"sa_plot": "Scatter plot",
"size_axis": "none",
"color_axis": 3
};
select vector of v
from Vector v, Number cid, Vector of number row
where row in gen_datapoint(1,1.4,siota(1,10000))
and cid = kmeans_my_test:classify(row)
and v = concat(row,[cid]);
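
Conceptually, returning the closest cluster means picking the cluster center with the smallest distance to the point. The Python sketch below illustrates this under stated assumptions: the centroid values are hypothetical, Euclidean distance is assumed, and the 0-based cluster numbering is a choice made here that need not match the ids returned by kmeans_my_test:classify.

```python
import math

def classify(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: math.dist(point, centroids[c]))

# Hypothetical centroids, roughly where training on the three squares could end up.
centroids = [[0.5, 0.5], [1.5, 1.5], [2.5, 2.5]]

print(classify([0.4, 0.7], centroids))  # → 0
print(classify([2.9, 2.2], centroids))  # → 2
```

Note that the query above uses noise 1.4, so the generated squares have side 1.4 while their centers are only 1 apart. The squares therefore overlap, and a point in an overlapping region is assigned to whichever cluster center is nearest, regardless of which square it was generated from.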

Save the trained k-means model

You can create a user model of the kmeans instance my_test by calling the function kmeans:save_model(Charstring instance, Charstring model):

kmeans:save_model("my_test", "my_trained_kmeans_model");

The model my_trained_kmeans_model is now a regular model that can be found in the OSQL editor and deployed like any other user-defined model.