DBSCAN
SA Engine supports the DBSCAN clustering algorithm through the dbscan system model.
System models are loaded with the system_models:load() function.
system_models:load("dbscan");
The first step is to generate a new instance of DBSCAN for your analysis. This is done with the function dbscan:generate(Charstring name):
dbscan:generate('my_test');
The command generates a number of DBSCAN functions with the prefix my_test:. These include indexed stored functions that allow fast training and inference on the model, along with functions for populating the model with training data, training the model, and running inference on feature vectors.
To illustrate DBSCAN we will generate random training data points in circular shapes using the following function:
create function gen_datapoint(Number rad, Number noise, Number i)
-> Vector of Number
as [rad * sin(i*2*pi()/100)+frand(noise)-noise/2,
rad * cos(i*2*pi()/100)+frand(noise)-noise/2];
Try generating a vector of 100 random data points with:
//plot: Scatter plot
select Vector of gen_datapoint(1, 0.2, range(100));
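For readers who want to check the arithmetic outside SA Engine, the following is a minimal Python sketch of the same data generator. It is not part of the SA Engine API. It assumes that frand(noise) draws a uniform random number between 0 and noise, so that subtracting noise/2 centers the jitter on zero; the rng parameter is a hypothetical addition for reproducibility.

```python
import math
import random

def gen_datapoint(rad, noise, i, rng=random):
    """Python counterpart of the OSQL gen_datapoint function (sketch)."""
    # Assumption: frand(noise) is uniform between 0 and noise, so
    # frand(noise) - noise/2 is a jitter centered on zero.
    angle = i * 2 * math.pi / 100
    jitter_x = rng.uniform(0, noise) - noise / 2
    jitter_y = rng.uniform(0, noise) - noise / 2
    return [rad * math.sin(angle) + jitter_x,
            rad * math.cos(angle) + jitter_y]

# 100 points around a circle of radius 1 with noise 0.2.
# Any 100 consecutive integers cover one full turn of the circle.
points = [gen_datapoint(1.0, 0.2, i) for i in range(100)]
```

Each point lies within noise/2 of the circle along each axis, which is why a larger noise value produces a thicker ring.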
The training set of data points is stored in the function my_test:datapoints. It is populated with the function my_test:dbscan_add_data. Let's populate it with two circular shapes, one with radius 1.0 and one with radius 0.5:
my_test:dbscan_add_data(select Stream of gen_datapoint(1.0, 0.2, range(1000)));
my_test:dbscan_add_data(select Stream of gen_datapoint(0.5, 0.2, range(500)));
Now that we have populated our dataset with random points we can start by visualizing them as a scatter plot:
//plot: Scatter plot
select vector of { "cos(x)": v[1], "sin(x)": v[2] }
from Vector v, Number n
where v = my_test:datapoints(n);
Now we can train the DBSCAN model by calling the function my_test:dbscan(Number eps, Number min_nbr), where eps is the maximum distance between two points for them to be classified as neighbors, and min_nbr is the minimum number of neighbor points required to form a cluster. For more details on these parameter settings, please read DBSCAN.
Example:
my_test:dbscan(0.1, 3);
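To make the roles of eps and min_nbr concrete, here is a minimal, unoptimized Python sketch of the textbook DBSCAN procedure. It is not SA Engine's implementation. In particular, whether the neighbor count includes the point itself is an assumption (this sketch excludes it), and the function name and return format are hypothetical.

```python
import math

def dbscan(points, eps, min_nbr):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # Indices of all other points within distance eps of point i.
        return [j for j in range(len(points))
                if j != i and math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)       # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_nbr:
            labels[i] = -1              # provisional noise, may become a border point
            continue
        cluster_id += 1                 # point i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_nbr:
                queue.extend(j_nbrs)    # j is also a core point, keep growing
    return labels

data = [(0, 0), (0, 0.05), (0.05, 0), (1, 1), (1, 1.05), (1.05, 1), (5, 5)]
labels = dbscan(data, eps=0.1, min_nbr=2)
# labels == [0, 0, 0, 1, 1, 1, -1]
```

A smaller eps or a larger min_nbr makes it harder for points to qualify as core points, so more of them end up labeled as noise.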
The DBSCAN model is now trained, so let's have a look at the result. To visualize the clusters we use a scatter plot where each point is colored by its cluster. The points for each cluster are stored in the function my_test:clustered_points(Number cluster_id). Two additional values in the input vectors to the scatter plot specify the sizes and colors of the points (i.e., each vector can have the format [x,y,size,color]). The color is specified as an integer in the interval [-1,255].
Example:
//plot: Multi plot
{
"sa_plot": "Scatter plot",
"size_axis": "none",
"color_axis": "cluster"
};
select vector of { "cos(x)": v[1],
"sin(x)": v[2],
"cluster": cid
}
from Number cid, Number pid, Vector v
where pid in my_test:clustered_points(cid)
and v = my_test:datapoints(pid);
We can see two clusters here, the inner and the outer circle. Let's use this model to classify a stream of 2D data points. This is done with the function my_test:dbscan_classify(Vector feature_vector, Number eps, Number minpts) -> Number, which returns -1 if the point is an outlier and the cluster id if it belongs to a cluster.
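The source does not describe how the classification decision is made internally. As an illustration only, the Python sketch below shows one plausible rule: a new point is assigned to the most common cluster among the clustered training points within eps, and is reported as an outlier (-1) if fewer than minpts such points exist. The rule, the function name, and the argument layout are all assumptions, not SA Engine's documented behavior.

```python
import math
from collections import Counter

def classify(point, train_points, train_labels, eps, minpts):
    """Return a cluster id for point, or -1 if it looks like an outlier."""
    # Assumed rule: collect the cluster ids of training points that lie
    # within eps of the new point, ignoring training points that are noise.
    near = [label for p, label in zip(train_points, train_labels)
            if label != -1 and math.dist(point, p) <= eps]
    if len(near) < minpts:
        return -1                       # too few clustered neighbors: outlier
    return Counter(near).most_common(1)[0][0]

# Hypothetical training result: two small clusters and one noise point.
train_points = [(0, 0), (0, 0.05), (0.05, 0), (1, 1), (1, 1.05), (1.05, 1), (5, 5)]
train_labels = [0, 0, 0, 1, 1, 1, -1]

inner = classify((0.02, 0.02), train_points, train_labels, eps=0.1, minpts=2)
outer = classify((1.02, 1.02), train_points, train_labels, eps=0.1, minpts=2)
far = classify((3, 3), train_points, train_labels, eps=0.1, minpts=2)
# inner == 0, outer == 1, far == -1
```

Under this rule, classifying with the same eps and minimum-neighbor values that were used for training keeps the outlier decision consistent with the trained clusters.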
We generate a random data set by calling:
select vector of gen_datapoint(bag(0.5, 1.0), 0.3, iota(1, 200));
This generates 400 data points, 200 for each of the radii 0.5 and 1.0. The noise is 0.3.
Simulated classification:
//plot: Multi plot
{
"sa_plot": "Scatter plot",
"size_axis": "none",
"color_axis": "cluster"
};
select vector of { "cos(x)": v[1],
"sin(x)": v[2],
"cluster": label
}
from Vector v, Number label
where v in gen_datapoint(bag(0.5, 1.0), 0.3, iota(1, 200))
and label = my_test:dbscan_classify(v, 0.1, 3);
Save the trained DBSCAN model
If you use SA Studio in the cloud or self-hosted you can save the DBSCAN instance to a model with one of the following functions:
dbscan:save_model(Charstring instance, Charstring model)
dbscan:save_model(Charstring instance, Charstring model, Charstring file)
The first saves the DBSCAN instance as master-weights.json in the user model model. The second saves the DBSCAN instance to the file specified by file.
Note that model must exist in your user model directory. Otherwise the save fails with the error No model named <model>.
More information on how to work with models can be found in the Managing models guide and in Working with queries and models in the SA Studio Manual.
API
All DBSCAN functions are listed in the OSQL API.