Density-based spatial clustering of applications with noise (DBSCAN)
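This model implements DBSCAN, a clustering algorithm that groups densely packed points into clusters and marks isolated points as noise.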
The first step is to generate a new instance of DBSCAN for your
analysis. This is done with the function dbscan:generate(Charstring name):
load_system_model("dbscan");
dbscan:generate('my_test');
The command will generate a number of DBSCAN functions with the prefix
my_test:. Indexed stored functions are generated to allow fast
training and inference on the model, along with functions for
populating the model with training data, training the model, and
inferring feature vectors.
To illustrate DBSCAN we will generate random training data points in circular shapes using the following function:
create function gen_datapoint(Number rad, Number noise, Number i)
-> Vector of Number
as [rad * sin(i*2*pi()/100)+frand(noise)-noise/2,
rad * cos(i*2*pi()/100)+frand(noise)-noise/2];
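For readers more familiar with Python, the generator above can be sketched roughly as follows. This is an illustration only, not part of the dbscan model. It assumes that frand(noise) draws uniformly from the interval [0, noise), so that subtracting noise/2 centers the jitter around zero. The Python names are hypothetical.

```python
import math
import random

def gen_datapoint(rad, noise, i, rng=random):
    """Return [x, y] on a circle of radius rad, with uniform jitter added."""
    angle = i * 2 * math.pi / 100
    # Assumption: frand(noise) is uniform on [0, noise); subtracting
    # noise/2 shifts the jitter to the range [-noise/2, noise/2).
    jitter_x = rng.uniform(0, noise) - noise / 2
    jitter_y = rng.uniform(0, noise) - noise / 2
    return [rad * math.sin(angle) + jitter_x,
            rad * math.cos(angle) + jitter_y]

# 100 jittered points around the unit circle (indices 0..99 in this sketch)
points = [gen_datapoint(1, 0.2, i) for i in range(100)]
```

With rad = 1 and noise = 0.2, every generated point lies within about 0.15 of the unit circle, which is why the plotted data forms a ring.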
Try generating a vector of 100 random data points with:
select Vector of gen_datapoint(1, 0.2, range(100));
Here we notice that the automatic scaling makes the shape oval rather than circular. This can be fixed by changing the visualization to Multi plot and prefixing the query with a visual formatting. Once the data is plotted, click the lock in the upper left corner, then grab the small rectangle in the lower right corner and drag to resize the plot until the cluster looks circular.
Example:
{"sa_plot":"Scatter plot"};
select Vector of gen_datapoint(1, 0.2, range(100));
See Visualization for further details.
The training set of data points is stored in the function
my_test:datapoints. It is populated with the function
my_test:dbscan_add_data. Let's populate it with two random circular
shapes:
my_test:dbscan_add_data(select Stream of gen_datapoint(1,0.2,range(1000)));
my_test:dbscan_add_data(select Stream of gen_datapoint(0.5,0.2,range(500)));
Now that we have populated our dataset with random points we can start by visualizing them as a scatter plot:
select vector of { "cos(x)": v[1], "sin(x)": v[2] }
from Vector v, Number n
where v = my_test:datapoints(n);
With the training dataset populated with random 2D points, we move on
to training the DBSCAN model by calling the function
my_test:dbscan(Number eps, Number minPts). Here eps is the maximum
distance between two points for them to be classified as neighbors,
and minPts is the minimum number of neighboring points required to
form a cluster. For more details on these parameter settings, please
read DBSCAN.
Example:
my_test:dbscan(0.1,3);
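To make the roles of eps and minPts concrete, here is a minimal Python sketch of the textbook DBSCAN procedure. It is not the implementation behind my_test:dbscan. It assumes Euclidean distance, counts a point as its own neighbor when comparing against min_pts, and numbers clusters from 0. All of these are assumptions made for illustration.

```python
import math

NOISE = -1  # label for outliers, matching the -1 convention used in this text

def region_query(points, idx, eps):
    """Indices of all points within distance eps of points[idx], itself included."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        labels[i] = cluster_id  # i is a core point: start a new cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep growing
        cluster_id += 1
    return labels

# Two tight groups of three points each, plus one isolated point
points = [(0, 0), (0, 0.05), (0.05, 0), (1, 1), (1, 1.05), (1.05, 1), (5, 5)]
labels = dbscan(points, eps=0.1, min_pts=3)
```

In this toy data set the two tight groups end up in separate clusters and the isolated point is labeled -1. A larger eps merges nearby groups, and a larger min_pts turns sparse regions into noise.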
The DBSCAN model is now trained, so let's have a look at the
result. To visualize the clusters we use a scatter plot where each
point is colored by its cluster. The points for each cluster id are
stored in the function my_test:clustered_points(Number cluster_id).
Two additional values in the input vectors to the scatter plot specify
the sizes and colors of the points, i.e. each vector can have the
format [x,y,size,color]. The color is specified as an integer in the
interval [-1,255].
Example:
{
"sa_plot": "Scatter plot",
"size_axis": "none",
"color_axis": "cluster"
};
select vector of { "cos(x)": v[1],
"sin(x)": v[2],
"cluster": cid
}
from Number cid, Number pid, Vector v
where pid in my_test:clustered_points(cid)
and v = my_test:datapoints(pid);
We can see two clusters here, the inner and the outer circle. Let's
use this model to classify a stream of 2D data points. This is done by
using the function my_test:dbscan_classify(Vector feature_vector,
Number eps, Number minpts) -> Number, which returns -1 if the point
is an outlier and the cluster id if it belongs to a cluster.
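The exact rule used by my_test:dbscan_classify is not described here. As a rough illustration of how a new point could be labeled against an already clustered training set, the Python sketch below assumes one plausible rule: a point with fewer than min_pts training points within eps is an outlier, and otherwise it takes the most common cluster id among those neighbors. The function name and the rule itself are assumptions, not the model's documented behavior.

```python
import math
from collections import Counter

def classify(point, train_points, train_labels, eps, min_pts):
    """Return a cluster id for point, or -1 if it looks like an outlier."""
    # Labels of all training points within distance eps of the new point
    near = [lab for p, lab in zip(train_points, train_labels)
            if math.dist(point, p) <= eps]
    if len(near) < min_pts:
        return -1  # too few neighbors: treat as an outlier
    clustered = [lab for lab in near if lab != -1]
    if not clustered:
        return -1  # only noise points nearby
    # Assumed rule: majority vote among the clustered neighbors
    return Counter(clustered).most_common(1)[0][0]

train_points = [(0, 0), (0, 0.05), (0.05, 0), (1, 1), (1, 1.05), (1.05, 1)]
train_labels = [0, 0, 0, 1, 1, 1]

inner = classify((0.02, 0.02), train_points, train_labels, eps=0.1, min_pts=3)
outer = classify((1.02, 1.02), train_points, train_labels, eps=0.1, min_pts=3)
stray = classify((3, 3), train_points, train_labels, eps=0.1, min_pts=3)
```

Here the first two points fall inside existing clusters and receive their ids, while the third has no training points nearby and is returned as -1.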
We generate a random data set by calling:
select vector of gen_datapoint(bag(0.5,1),0.3,iota(1,200));
This generates 400 data points, 200 for each of the radii 0.5 and 1, with a noise of 0.3.
Simulated classification:
{
"sa_plot": "Scatter plot",
"size_axis": "none",
"color_axis": "cluster"
};
select vector of { "cos(x)": v[1],
"sin(x)": v[2],
"cluster": label
}
from Vector v, Number label
where v in gen_datapoint(bag(0.5,1),0.3,iota(1,200))
and label = my_test:dbscan_classify(v, 0.1, 3);
Save the trained DBSCAN model
You can save the dbscan instance my_test to a model by calling the
function dbscan:save_model(Charstring instance, Charstring model).
This will save the dbscan instance as master-weights.json into the
user model named by model. Note that model must exist in your user
model directory. Otherwise the save will give the error
No model named <model>.
Example:
dbscan:save_model("my_test", "test");
Note: If you look into the generated files for the model test, you
will find a file named master-weights.json, which contains all the
DBSCAN data for your DBSCAN instance.
If you do not want to save the DBSCAN instance in the file
master-weights.json, you can use the function
dbscan:save_model(Charstring instance, Charstring model, Charstring file),
which will save it in the file named by file under the model named by
model instead.