
Density-based spatial clustering of applications with noise (DBSCAN)

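This model implements DBSCAN, a density-based clustering algorithm that groups closely packed points into clusters and marks isolated points as noise.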

The first step is to generate a new instance of DBSCAN for your analysis. This is done with the function dbscan:generate(Charstring name):


load_system_model("dbscan");
dbscan:generate('my_test');

The command generates a number of DBSCAN functions with the prefix my_test:. Indexed stored functions are generated to allow fast training and inference on the model, along with functions for populating the model with training data, training the model, and classifying feature vectors.

To illustrate DBSCAN we will generate random training data points in circular shapes using the following function:

create function gen_datapoint(Number rad, Number noise, Number i)
-> Vector of Number
as [rad * sin(i*2*pi()/100)+frand(noise)-noise/2,
rad * cos(i*2*pi()/100)+frand(noise)-noise/2];
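
For readers who want to check the geometry outside OSQL, the same generator can be sketched in plain Python. This is only an illustration, not part of the dbscan model. It assumes that frand(noise) draws a uniform value between 0 and noise, so that subtracting noise/2 centres the jitter on zero.

```python
import math
import random

def gen_datapoint(rad, noise, i):
    """Point i of 100 on a circle of radius rad, with jitter on each coordinate."""
    angle = i * 2 * math.pi / 100
    # Assumption: frand(noise) is uniform on [0, noise]; subtracting noise / 2
    # shifts the jitter to the range [-noise / 2, noise / 2].
    jitter_x = random.uniform(0, noise) - noise / 2
    jitter_y = random.uniform(0, noise) - noise / 2
    return [rad * math.sin(angle) + jitter_x, rad * math.cos(angle) + jitter_y]

# 100 consecutive indices cover one full turn of the circle.
points = [gen_datapoint(1, 0.2, i) for i in range(100)]
```

With rad = 1 and noise = 0.2, every point lies within about 0.14 of the unit circle, which is why the plot shows a thin ring.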

Try generating a vector of 100 random data points with:

select Vector of gen_datapoint(1, 0.2, range(100));

Notice that the automatic scaling makes the shape oval rather than circular. This can be fixed by changing the visualization to Multi plot and prefixing the query with a visual formatting. Once the data is plotted, click the lock in the upper left corner, then grab the small rectangle in the lower right corner and drag to resize the plot until the cluster looks circular.

Example:

{"sa_plot":"Scatter plot"};
select Vector of gen_datapoint(1, 0.2, range(100));

See Visualization for further details.

The training set of data points is stored in the function my_test:datapoints. It is populated with the function my_test:dbscan_add_data. Let's populate it with two random circular shapes:

my_test:dbscan_add_data(select Stream of gen_datapoint(1,0.2,range(1000)));
my_test:dbscan_add_data(select Stream of gen_datapoint(0.5,0.2,range(500)));

Now that we have populated our dataset with random points we can start by visualizing them as a scatter plot:

select vector of { "cos(x)": v[1], "sin(x)": v[2] }
from Vector v, Number n
where v = my_test:datapoints(n);

With the training dataset populated, we move on to training the DBSCAN model by calling the function my_test:dbscan(Number eps, Number minPts). eps is the maximum distance between two points for them to be considered neighbors, and minPts is the minimum number of neighboring points required for a group of points to be classified as a cluster. For more details on these parameter settings, see DBSCAN.

Example:

my_test:dbscan(0.1,3);
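
To make the roles of eps and minPts concrete, here is a minimal sketch of the textbook DBSCAN procedure in plain Python. It is not the implementation behind my_test:dbscan, and the names region_query, dbscan and NOISE are chosen for this illustration only. The sketch assumes that a point counts as its own neighbor when comparing against min_pts, which is the usual convention; the OSQL model may count differently.

```python
import math

NOISE = -1

def region_query(points, idx, eps):
    """Indices of all points within distance eps of points[idx], itself included."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)      # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may still become a border point later
            continue
        cluster_id += 1                # points[i] is a core point: start a cluster
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id # noise reachable from a core point: border
            if labels[j] is not None:
                continue               # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels

points = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.05),   # dense group
          (1.0, 1.0), (1.05, 1.0), (1.0, 1.05),   # second dense group
          (5.0, 5.0)]                             # isolated point
labels = dbscan(points, eps=0.1, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

A larger eps merges nearby groups into one cluster, while a larger min_pts turns sparse groups into noise.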

The DBSCAN model is now trained, so let's have a look at the result. To visualize the clusters we use a scatter plot where each point is colored by its cluster. The points for each cluster id are stored in the function my_test:clustered_points(Number cluster_id). Two additional values in the input vectors to the scatter plot specify the sizes and colors of the points, i.e. each vector can have the format [x,y,size,color]. The color is specified as an integer in the interval [-1,255].

Example:

{
"sa_plot": "Scatter plot",
"size_axis": "none",
"color_axis": "cluster"
};
select vector of { "cos(x)": v[1],
"sin(x)": v[2],
"cluster": cid
}
from Number cid, Number pid, Vector v
where pid in my_test:clustered_points(cid)
and v = my_test:datapoints(pid);
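
The query above is a join between two lookups: cluster id to point ids, and point id to coordinates. The Python sketch below mimics that join with two dictionaries. The contents of datapoints and clustered_points here are made-up sample values for illustration, not output from the model. Note that Python indexes vectors from 0, whereas the OSQL query uses v[1] and v[2].

```python
# Hypothetical sample data standing in for the stored functions.
datapoints = {1: [0.0, 1.0], 2: [0.1, 0.9], 3: [0.5, 0.0]}   # point id -> vector
clustered_points = {0: [1, 2], 1: [3]}                        # cluster id -> point ids

# One record per (cluster id, point id) pair, like the select statement.
rows = [
    {"cos(x)": datapoints[pid][0], "sin(x)": datapoints[pid][1], "cluster": cid}
    for cid, pids in clustered_points.items()
    for pid in pids
]
print(len(rows))  # 3
```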

We can see two clusters here, the inner and the outer circle. Let's use this model to classify a stream of 2D data points. This is done by using the function my_test:dbscan_classify(Vector feature_vector, Number eps, Number minpts) -> Number, which returns -1 if the point is an outlier and the cluster id if it belongs to a cluster.

We generate a random data set by calling:

select vector of gen_datapoint(bag(0.5,1),0.3,iota(1,200));

This generates 400 data points: 200 for each of the radii 0.5 and 1, with a noise of 0.3.
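
The count follows from the function being applied to every combination of radius and index. As a quick check in plain Python, assuming bag(0.5,1) contributes two radii and iota(1,200) contributes the integers 1 through 200:

```python
from itertools import product

radii = [0.5, 1]
indices = range(1, 201)            # 1..200 inclusive, like iota(1,200)
combos = list(product(radii, indices))
print(len(combos))                 # prints 400
```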

Simulated classification:

{
"sa_plot": "Scatter plot",
"size_axis": "none",
"color_axis": "cluster"
};
select vector of { "cos(x)": v[1],
"sin(x)": v[2],
"cluster": label
}
from Vector v, Number label
where v in gen_datapoint(bag(0.5,1),0.3,iota(1,200))
and label = my_test:dbscan_classify(v, 0.1, 3);
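
The exact rule that my_test:dbscan_classify applies is not spelled out here. One plausible rule, sketched below in plain Python as an assumption, is to treat a new point as an outlier unless at least min_pts training points lie within eps of it, and otherwise to give it the most common cluster id among those nearby points. The function name classify and the sample data are hypothetical.

```python
import math
from collections import Counter

NOISE = -1

def classify(point, train_points, train_labels, eps, min_pts):
    """Label a new point from already clustered training points (assumed rule)."""
    near = [label for q, label in zip(train_points, train_labels)
            if math.dist(point, q) <= eps]
    if len(near) < min_pts:
        return NOISE                       # too few neighbors: outlier
    clustered = [label for label in near if label != NOISE]
    if not clustered:
        return NOISE                       # only noise points nearby
    return Counter(clustered).most_common(1)[0][0]

train_points = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.05),
                (1.0, 1.0), (1.05, 1.0), (1.0, 1.05)]
train_labels = [0, 0, 0, 1, 1, 1]

print(classify((0.02, 0.02), train_points, train_labels, 0.1, 3))  # 0
print(classify((3.0, 3.0), train_points, train_labels, 0.1, 3))    # -1
```

This also shows why the classification call takes the same eps and minpts arguments as training: the density test is repeated for each new point.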

Save the trained DBSCAN model

You can save the DBSCAN instance my_test to a model by calling the function dbscan:save_model(Charstring instance, Charstring model). This saves the DBSCAN instance as master-weights.json in the user model named model:

Caution: the model must already exist in your user model directory. Otherwise the save fails with the error No model named <model>.

dbscan:save_model("my_test", "test");

Note: If you look into the generated files for the model test, you will find a file named master-weights.json, which contains all the DBSCAN data for your DBSCAN instance.

If you do not want to save the DBSCAN instance in the file master-weights.json, you can use the function dbscan:save_model(Charstring instance, Charstring model, Charstring file), which saves it in the file named file under the model named model instead.