Splunk Search

How do I rank nodality from kmeans data?

MonkeyK
Builder

I have been trying to do kmeans analysis of some data. I see some of my evaluation points falling into lots of clusters, but with heavy weighting towards 1-2 clusters. Is there a way to call this out?

[my search]
| kmeans k=100 col1 col2 col3
| eventstats count as clusterConnectionEvents by CLUSTERNUM
| eval clusterConnectionEvents=CLUSTERNUM."(".clusterConnectionEvents.")"
| stats dc(CLUSTERNUM) as clusterCount values(clusterConnectionEvents) values(connectionCount) count by

gives me a first item with

clusterCount=26

values(clusterConnectionEvents)
1(7)
10(1)
100(14)
12(100)
14(19)
16(9)
2(50247)
20(2)
37(203)
39(122)
4(472)
40(75)
48(17)
5(2)
50(16)
52(8)
53(39)
59(8)
73(3)
74(20)
75(142)
80(2)
81(13)
83(4)
84(58)
87(96)

This clearly has a huge node at cluster 2.

and another

clusterCount=12

values(clusterConnectionEvents)
1(4)
14(2)
16(5)
2(59)
32(3)
4(11)
48(2)
59(148)
75(170)
84(2)
87(69)
89(5)

which clearly has nodes at clusters 59 and 75 (and maybe 2 and 87 as well).

For other items, the nodes are less pronounced. These are less interesting to me.

Is there a way to score such data so that items with the vast majority of their values falling into 1-2 buckets come to the top of a list?


DalJeanis
SplunkTrust

Hmmm. I'm not sure I agree with your distinction between interesting and non-interesting clusters. Until you know the characteristics of a cluster, you don't know why the system decided it WAS a cluster. But we can agree that identifying the largest clusters is more critical initially, since they hold the bulk of your data.

It is easy enough to do something like this...

[my search] 
| kmeans k=100 col1 col2 col3
| eval rectype="detail" 
| appendpipe 
    [| stats count as CLUSTERCOUNT by CLUSTERNUM | sort 0 - CLUSTERCOUNT + CLUSTERNUM | eval rectype="ClusterSummary"]

This gives you a set of data at the end that summarizes your clusters.

Or you could do this to get rid of all events that are not in your three biggest clusters...

[my search] 
| kmeans k=100 col1 col2 col3
| eval rectype="detail" 
| appendpipe 
    [| stats count as CLUSTERCOUNT by CLUSTERNUM | sort 3 - CLUSTERCOUNT + CLUSTERNUM | eval rectype="ClusterSummary"]
| eventstats max(CLUSTERCOUNT) as keepme by CLUSTERNUM
| where isnotnull(keepme) AND rectype="detail"
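
To turn this into a ranking, one option is to score each item by the share of its events that land in its two largest clusters and sort on that share. This is a minimal, untested sketch. `item` is a hypothetical name standing in for whatever field you split your stats by, and `events`, `itemTotal`, `clusterRank`, `topEvents`, and `topShare` are illustrative field names. Note that it counts each item's own events per cluster, not the cluster-wide totals that `eventstats count ... by CLUSTERNUM` produces.

```spl
[my search]
| kmeans k=100 col1 col2 col3
| stats count as events by item, CLUSTERNUM
| eventstats sum(events) as itemTotal by item
| sort 0 item, -events
| streamstats count as clusterRank by item
| where clusterRank <= 2
| stats sum(events) as topEvents, max(itemTotal) as itemTotal by item
| eval topShare=round(topEvents / itemTotal, 3)
| sort 0 -topShare
```

A `topShare` near 1 means nearly all of an item's events fall into one or two clusters. Changing the 2 in the `where` clause widens or narrows what counts as concentrated.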

MonkeyK
Builder

Sorry Dal, I left out the meaning of the query to keep my question from getting too complex. Generally, the query looks for malware beacons by finding traffic that is similar in period, size, and duration. The clustering is done on those values.

I made the nodes easier to see by throwing away all clusters that hold less than 1% of the total clustered events.
In the case of my first example, this left just the one cluster, which is what I wanted to see. So maybe I could just play with the percentage that I throw away.
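
The 1% cutoff described above could be sketched roughly as follows, assuming the percentage is taken over all clustered events. `totalEvents` and `clusterEvents` are illustrative field names, not taken from the original search, and the search is untested.

```spl
[my search]
| kmeans k=100 col1 col2 col3
| eventstats count as totalEvents
| eventstats count as clusterEvents by CLUSTERNUM
| where clusterEvents >= 0.01 * totalEvents
```

Changing 0.01 is the knob for how much gets thrown away.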
