k-means clustering
This model implements k-means clustering.
The first step is to generate a new instance of k-means for your
analysis. This is done with the function kmeans:generate(Charstring name):
load_system_model("kmeans");
kmeans:generate('my_test');
The command generates a number of k-means functions with the prefix
kmeans_my_test:. Indexed stored functions are generated to allow fast
training and inference on the model, along with functions for
populating the model with training data, training the model, and
classifying feature vectors.
To illustrate k-means we will generate random training data points in three square shapes using the following function:
create function gen_datapoint(Number rad, Number noise, Number i)
-> Vector of Number
as [rad * mod(i,3) + frand(noise) + 0.5 - noise/2,
rad * mod(i,3) + frand(noise) + 0.5 - noise/2];
Try generating a vector of 1000 random data points with:
select Vector of gen_datapoint(1, 0.8, range(1000));
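To make the shape of the generated data concrete, here is a minimal Python sketch of the same formula. It is an illustration rather than part of the model, and it assumes that frand(noise) draws uniformly from the interval [0, noise); under that assumption each point falls in one of three squares with side length noise, centered at (0.5, 0.5), (0.5 + rad, 0.5 + rad) and (0.5 + 2*rad, 0.5 + 2*rad).

```python
import random


def gen_datapoint(rad, noise, i, rng=random):
    """Python analogue of the OSQL gen_datapoint function.

    Assumption: frand(noise) is uniform on [0, noise).
    """
    # mod(i, 3) selects one of three squares along the diagonal.
    # Adding 0.5 - noise/2 centers the noise around rad * mod(i, 3) + 0.5.
    offset = rad * (i % 3) + 0.5 - noise / 2
    x = offset + rng.uniform(0, noise)
    y = offset + rng.uniform(0, noise)
    return [x, y]


# 1000 points, cycling through the three squares.
rng = random.Random(0)
points = [gen_datapoint(1, 0.8, i, rng) for i in range(1000)]
```

With rad = 1 and noise = 0.8 the squares do not overlap, which is why three clusters are a natural choice later on.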
The training set of data points is stored in the function
kmeans_my_test:datapoints. It is populated with the function
kmeans_my_test:add_data. Let's populate it with three random square
shapes:
kmeans_my_test:add_data(select Stream of gen_datapoint(1,0.8,range(1000)));
Now that we have populated our dataset with random points we can start by visualizing them as a scatter plot:
{
"sa_plot": "Scatter plot"
};
select Vector of { "x": v[1], "y": v[2] }
from Vector v, Number n
where v = kmeans_my_test:datapoints(n);
Now that we have populated our training dataset with random 2D points,
we move on to training the k-means model by calling the function
kmeans_my_test:fit(Number max_iter, Number num_clusters). max_iter is
the maximum number of iterations when training, and num_clusters is
the number of clusters to create.
For more details on these parameter settings, please read k-means.
Example:
kmeans_my_test:fit(10000,3);
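To show how the two parameters interact, the following Python sketch implements the standard Lloyd iteration for k-means. It is not the model's actual implementation, which this page does not describe. The name fit mirrors the OSQL function, while the initial and seed parameters and the helper sq_dist are hypothetical additions for the illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit(points, max_iter, num_clusters, initial=None, seed=0):
    """Return num_clusters centroids after at most max_iter iterations."""
    if initial is None:
        # Assumption: start from randomly chosen data points.
        initial = random.Random(seed).sample(points, num_clusters)
    centroids = [list(p) for p in initial]
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        groups = [[] for _ in range(num_clusters)]
        for p in points:
            nearest = min(range(num_clusters),
                          key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        # An empty group keeps its previous centroid.
        updated = [
            [sum(xs) / len(g) for xs in zip(*g)] if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if updated == centroids:
            break  # converged before reaching max_iter
        centroids = updated
    return centroids


data = [[0, 0], [10, 10], [20, 20], [0, 1], [10, 11], [20, 21]]
centroids = fit(data, 10000, 3, initial=data[:3])
```

In this sketch max_iter is only an upper bound: the loop stops as soon as the centroids no longer move, so a large value such as 10000 mainly guards against slow convergence.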
The k-means model is now trained, so let's have a look at the
result. To visualize the clusters we use a scatter plot where each point
is colored by its cluster. An additional value in the input vectors
to the scatter plot specifies the colors of the points, i.e.
each vector can have the format [x,y,color]. The color is specified as
an integer in the interval [-1,255].
Example:
{
"sa_plot": "Scatter plot",
"size_axis": "none",
"color_axis": 3
};
select vector of v
from Vector v, Number cid, Vector of number row
where row in (select stream of kmeans_my_test:datapoints(i) from Number i)
and cid = kmeans_my_test:classify(row)
and v = concat(row,[cid]);
We can see the three different clusters here, each with a distinct color. Let's use this model to classify a stream of 2D datapoints.
This is done using the function kmeans_my_test:classify(Vector of Number point) -> Number,
which returns the closest cluster for the point point.
{
"sa_plot": "Scatter plot",
"size_axis": "none",
"color_axis": 3
};
select vector of v
from Vector v, Number cid, Vector of number row
where row in gen_datapoint(1,1.4,siota(1,10000))
and cid = kmeans_my_test:classify(row)
and v = concat(row,[cid]);
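Returning the closest cluster can be read as a nearest-centroid lookup. The Python sketch below illustrates that idea under two assumptions that this page does not confirm for the OSQL model: distance is Euclidean, and clusters are numbered from 0 in the order of the centroid list. The centroid values are hypothetical, chosen to match the centers of the three training squares.

```python
import math


def classify(point, centroids):
    """Return the index of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


# Hypothetical centroids at the centers of the three training squares.
centroids = [[0.5, 0.5], [1.5, 1.5], [2.5, 2.5]]
labels = [classify(p, centroids) for p in [[0.4, 0.6], [1.4, 1.7], [2.9, 2.2]]]
```

Because the query above uses noise = 1.4 instead of 0.8, the generated squares overlap, and points in the overlap are assigned to whichever centroid happens to be nearer.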
Save the trained k-means model
You can create a user model of the kmeans instance my_test
by calling the function
kmeans:save_model(Charstring instance, Charstring model):
kmeans:save_model("my_test", "my_trained_kmeans_model");
The model my_trained_kmeans_model is now a regular model that can be
found in the OSQL-editor and deployed like any other regular
user-defined model.