
Splunk Machine Learning App / Toolkit - Using DBSCAN Clustering Algorithm

hbrandt84
Path Finder

Hi,

I want to use the clustering algorithm "DBSCAN" from the Machine Learning Toolkit.
(https://docs.splunk.com/Documentation/MLApp/2.3.0/User/Algorithms) --> listed under "clustering algorithms"

While implementing it, I noticed that the toolkit's version accepts only one parameter: eps
(the maximum distance between two samples for them to be considered in the same cluster)

If you look up any definition of the DBSCAN algorithm, for example...
(https://en.wikipedia.org/wiki/DBSCAN)
...you will notice that DBSCAN needs 2 parameters to work:

  • eps (epsilon): maximum distance between two samples for them to count as neighbors --> provided
  • minPts: minimum number of samples within eps of a point for it to count as a core point of a cluster --> missing

Does anybody know why the second parameter is missing?
I don't see how the algorithm can work without it.

nryabykh
Path Finder

You need to modify the $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/bin/algos/DBSCAN.py file. In the __init__ function, replace the line

out_params = convert_params(options.get('params', {}), floats=['eps'])

with this one:

out_params = convert_params(options.get('params', {}), floats=['eps', 'min_samples'])

After this change, you can write something like fit DBSCAN eps=0.1 min_samples=2 * in your SPL queries.
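To illustrate why the second parameter matters, here is a minimal pure-Python sketch of DBSCAN. It is not the toolkit's or scikit-learn's implementation; the function name dbscan and the sample points are made up for illustration. It follows scikit-learn's convention that a point counts toward its own neighborhood.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_samples):
    """Return one cluster label per point; -1 marks noise (illustrative sketch)."""
    n = len(points)
    # Neighborhood of each point: every point within eps, the point itself included
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # reachable from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point too, keep expanding
    return labels

# Hypothetical sample data: two tight groups and one isolated point
points = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.05), (1.0, 1.0), (1.05, 1.0), (5.0, 5.0)]
labels_min2 = dbscan(points, eps=0.1, min_samples=2)
labels_min5 = dbscan(points, eps=0.1, min_samples=5)
print(labels_min2)
print(labels_min5)
```

With min_samples=2 the two tight groups become clusters and the isolated point is noise. With min_samples=5 and the same eps, no point has enough neighbors, so everything is labeled noise.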


niketn
Legend

@hbrandt84, I concur. scikit-learn also documents two parameters, min_samples and eps (http://scikit-learn.org/stable/modules/clustering.html#dbscan)

However, the algorithm description and class details show that both parameters are optional, because each has a default value:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

Based on the following scikit-learn code for DBSCAN, I would expect min_samples to default to 5 when it is not passed (https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/cluster/dbscan_.py#L156):

def dbscan(X, eps=0.5, min_samples=5, metric='minkowski',
           algorithm='auto', leaf_size=30, p=2, sample_weight=None, n_jobs=1):

And:

def __init__(self, eps=0.5, min_samples=5, metric='euclidean',
             algorithm='auto', leaf_size=30, p=None, n_jobs=1):
    self.eps = eps
    self.min_samples = min_samples
    self.metric = metric
    self.algorithm = algorithm
    self.leaf_size = leaf_size
    self.p = p
    self.n_jobs = n_jobs

However, this needs to be confirmed, and ideally the Machine Learning Toolkit would be enhanced to expose min_samples as an input parameter for DBSCAN.
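One way to check the defaults locally is to construct the estimator without arguments and read its parameters back. This is only a sketch: it assumes scikit-learn is installed in the Python environment you run it in, which may not be the same version that ships with the toolkit.

```python
from sklearn.cluster import DBSCAN

# Construct the estimator with no arguments, as a caller passing no parameters would
defaults = DBSCAN().get_params()

# Both values come from the constructor signature quoted above
print(defaults["eps"], defaults["min_samples"])
```

If min_samples shows up here with a default, then a wrapper that forwards only eps would still run, just with that default in place.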

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"