
k-means clustering

SA Engine supports k-means clustering through the kmeans system model.

System models are loaded with the system_models:load() function.

system_models:load("kmeans");

Once the system model is loaded, the first step is to call the function kmeans:generate(name) to generate a new k-means instance for your analysis.

kmeans:generate('my_test');

The command generates a number of k-means functions with the prefix kmeans_my_test:. Indexed stored functions are generated to allow fast training and inference on the model, along with functions for populating the model with training data, training the model, and inferring feature vectors.

To illustrate k-means we will generate random training data points in three square shapes using the following function:

create function gen_datapoint(Number rad, Number noise, Number i)
-> Vector of Number
as [rad * mod(i,3) + frand(noise) + 0.5 - noise/2,
rad * mod(i,3) + frand(noise) + 0.5 - noise/2];
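To see what shape this data takes, here is a minimal Python sketch of the same formula. It is not SA Engine code. It assumes that frand(noise) returns a uniform random number between 0 and noise, and the 0-based index range is also an assumption. Under that assumption each coordinate lies within noise/2 of rad * mod(i,3) + 0.5, so the points fall into three squares along the diagonal.

```python
import random


def gen_datapoint(rad, noise, i):
    """Python mirror of the OSQL gen_datapoint function.

    Assumption: frand(noise) is uniform between 0 and noise.
    """
    def coord():
        # Centre of the square is rad * (i mod 3) + 0.5; its side is `noise`.
        return rad * (i % 3) + random.uniform(0, noise) + 0.5 - noise / 2

    return [coord(), coord()]


# 1000 points spread over three squares centred near
# (0.5, 0.5), (1.5, 1.5) and (2.5, 2.5) when rad = 1.
points = [gen_datapoint(1, 0.8, i) for i in range(1000)]
```

With rad = 1 and noise = 0.8 the squares have side 0.8 and centres one unit apart, so they do not overlap.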

Try generating a vector of 1000 random data points with:

select Vector of gen_datapoint(1, 0.8, range(1000));

To train the k-means model we need to store the training set in the stored function kmeans_my_test:datapoints(). To populate the stored function we use the function kmeans_my_test:add_data(). Let's populate it with three random square shapes:

kmeans_my_test:add_data(select Stream of gen_datapoint(1,0.8,range(1000)));

Now that we have populated our dataset with random points we can visualize them in a scatter plot:

//plot: Scatter plot
select Vector of { "x": v[1], "y": v[2] }
from Vector v, Number n
where v = kmeans_my_test:datapoints(n);

Now we have a training dataset and can move on to training the k-means model. We do this by calling the function kmeans_my_test:fit(Number max_iter, Number num_clusters), where max_iter is the maximum number of training iterations and num_clusters is the number of clusters to create. For more details on these parameter settings, see k-means.

Example:

kmeans_my_test:fit(10000,3);
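For readers who want to see what the two parameters control, the following Python sketch shows the textbook k-means loop (Lloyd's algorithm). It is an illustration only and not SA Engine's implementation. The function name kmeans_fit, the choice of the first num_clusters points as initial centroids, and the stop-when-unchanged rule are assumptions made for this example.

```python
def kmeans_fit(points, max_iter, num_clusters):
    """Textbook k-means sketch; returns the list of centroids."""
    # Assumption: start from the first num_clusters points.
    centroids = [list(p) for p in points[:num_clusters]]
    for _ in range(max_iter):
        # Assignment step: put each point in the group of its nearest centroid.
        groups = [[] for _ in range(num_clusters)]
        for p in points:
            nearest = min(
                range(num_clusters),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        new_centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if new_centroids == centroids:
            break  # converged before reaching max_iter
        centroids = new_centroids
    return centroids


sample = [[0.0, 0.0], [5.0, 5.0], [10.0, 10.0],
          [0.2, 0.0], [5.2, 5.0], [10.2, 10.0]]
centroids = kmeans_fit(sample, 100, 3)
```

In this sketch max_iter only caps the loop: training stops earlier as soon as the centroids no longer move, which is why a generous value such as 10000 costs little on well-separated data.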

The k-means model is now trained, so let's have a look at the result. To visualize the clusters we use a scatter plot where each point is colored by its cluster. An additional value in each input vector to the scatter plot specifies the color of the point, i.e. each vector can have the format [x,y,color]. The color is specified as an integer in the interval [-1,255].

Example:

//plot: Multi plot
{
"sa_plot": "Scatter plot",
"size_axis": "none",
"color_axis": 3
};
select vector of v
from Vector v, Number cid, Vector of number row
where row in (select stream of kmeans_my_test:datapoints(i) from Number i)
and cid = kmeans_my_test:classify(row)
and v = concat(row,[cid]);

We can see the three different clusters here, each with a distinct color. Let's use this model to classify a stream of 2D data points. This is done with the function kmeans_my_test:classify(Vector of Number point) -> Number, which returns the cluster closest to point.

//plot: Multi plot
{
"sa_plot": "Scatter plot",
"size_axis": "none",
"color_axis": 3
};
select vector of v
from Vector v, Number cid, Vector of number row
where row in gen_datapoint(1,1.4,siota(1,10000))
and cid = kmeans_my_test:classify(row)
and v = concat(row,[cid]);
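The query above assigns each generated point to a cluster and appends the cluster id as a third value. A minimal Python sketch of that idea follows. It is not SA Engine code: the centroid values are hypothetical, squared Euclidean distance is an assumption, and the 0-based cluster ids used here may differ from the ids that kmeans_my_test:classify returns.

```python
def classify(point, centroids):
    """Return the index of the centroid closest to point."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )


# Hypothetical centroids, roughly where the three training squares sit.
centroids = [[0.5, 0.5], [1.5, 1.5], [2.5, 2.5]]
rows = [[0.4, 0.6], [1.7, 1.2], [2.9, 2.8]]

# Same shape as concat(row,[cid]) in the query: [x, y, color].
colored = [row + [classify(row, centroids)] for row in rows]
```

Because the query generates points with noise 1.4, the squares overlap, and points in the overlap are simply colored by whichever centroid is nearest.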

Save the trained k-means model

You can create a model of the k-means instance my_test by calling the function kmeans:save_model(Charstring instance, Charstring model):

kmeans:save_model("my_test", "my_trained_kmeans_model");

More information on how to work with models can be found in the Managing models guide and in Working with queries and models in the SA Studio Manual.

API

All k-means functions are listed in the OSQL API.