Splunk Search

How do I rank nodality from kmeans data?

MonkeyK
Builder

I have been trying to do kmeans analysis of some data. I see some of my evaluation points falling into lots of clusters, but with heavy weighting towards 1-2 clusters. Is there a way to call this out?

[my search]
| kmeans k=100 col1 col2 col3
| eventstats count as clusterConnectionEvents by CLUSTERNUM
| eval clusterConnectionEvents=CLUSTERNUM."(".clusterConnectionEvents.")"
| stats dc(CLUSTERNUM) as clusterCount values(clusterConnectionEvents) values(connectionCount) count by

gives me a first item with

clusterCount=26

values(clusterConnectionEvents)
1(7)
10(1)
100(14)
12(100)
14(19)
16(9)
2(50247)
20(2)
37(203)
39(122)
4(472)
40(75)
48(17)
5(2)
50(16)
52(8)
53(39)
59(8)
73(3)
74(20)
75(142)
80(2)
81(13)
83(4)
84(58)
87(96)

This clearly has a huge node at cluster 2.

and another

clusterCount=12

values(clusterConnectionEvents)
1(4)
14(2)
16(5)
2(59)
32(3)
4(11)
48(2)
59(148)
75(170)
84(2)
87(69)
89(5)

which clearly has nodes at clusters 59 and 75 (and maybe 2 and 87 as well).

For other items, the nodes are less pronounced. These are less interesting to me.

Is there a way to score such data so that items with the vast majority of their values falling into 1-2 buckets come to the top of a list?


DalJeanis
SplunkTrust

Hmmm. I'm not sure I agree with your distinction between interesting and non-interesting clusters. Until you know the characteristics of a cluster, you don't know why the system decided it WAS a cluster. But we can agree that identifying the largest clusters is initially more critical, since they hold the bulk of your data.

It is easy enough to do something like this...

[my search] 
| kmeans k=100 col1 col2 col3
| eval rectype="detail" 
| appendpipe 
    [| stats count as CLUSTERCOUNT by CLUSTERNUM | sort 0 - CLUSTERCOUNT + CLUSTERNUM | eval rectype="ClusterSummary"]

This gives you a set of data at the end that summarizes your clusters.

Or you could do this to get rid of all events that are not in your biggest 3 clusters...

[my search] 
| kmeans k=100 col1 col2 col3
| eval rectype="detail" 
| appendpipe 
    [| stats count as CLUSTERCOUNT by CLUSTERNUM | sort 3 - CLUSTERCOUNT + CLUSTERNUM | eval rectype="ClusterSummary"]
| eventstats max(CLUSTERCOUNT) as keepme by CLUSTERNUM
| where isnotnull(keepme) AND rectype="detail"
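
If the goal is a ranked list rather than a filter, one possible sketch is to score each item by the share of its events that land in its two largest clusters. This is an illustration under assumptions, not a tested search: item is a hypothetical placeholder for whatever field the original stats command groups by, and events, itemEvents, clusterRank, top2Events and top2Share are names made up for the example.

```
[my search]
| kmeans k=100 col1 col2 col3
| stats count as events by item CLUSTERNUM
| eventstats sum(events) as itemEvents by item
| sort 0 item -events
| streamstats count as clusterRank by item
| stats dc(CLUSTERNUM) as clusterCount max(itemEvents) as itemEvents sum(eval(if(clusterRank<=2, events, 0))) as top2Events by item
| eval top2Share=round(top2Events/itemEvents, 3)
| sort 0 -top2Share -itemEvents
```

The first stats counts events per item and cluster, eventstats adds each item's total, and streamstats numbers the clusters within each item from largest to smallest. Items that concentrate in one or two clusters score close to 1 and sort to the top. If the numbers in parentheses in the question are per-item event counts, the first example would score about 0.98 (50719 of 51699 events) and the second about 0.66 (318 of 480).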

MonkeyK
Builder

Sorry Dal, I left out the meaning of the query to keep my question from getting too complex. Generally the query looks for malware beacons by looking for traffic that is similar in period, size, and duration. The clusters are clustering on those values.

I tightened up my ability to see the nodes by throwing away all clusters that hold fewer than 1% of the total clustered events.
In the case of my first example, this left just the one cluster, which is what I wanted to see. So maybe I could just play with the percentage that I throw away.
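
A filter like that could be sketched roughly as follows. This is only an illustration, not the exact search used in this thread: it assumes the 1% is measured against each item's own total, item is a hypothetical placeholder for the field the stats command groups by, and events, itemEvents, minShare, clusterLabel and keptEvents are names invented for the example.

```
[my search]
| kmeans k=100 col1 col2 col3
| stats count as events by item CLUSTERNUM
| eventstats sum(events) as itemEvents by item
| eval minShare=0.01
| where events >= minShare * itemEvents
| eval clusterLabel=CLUSTERNUM."(".events.")"
| stats dc(CLUSTERNUM) as clusterCount values(clusterLabel) as clusters sum(events) as keptEvents by item
| sort 0 clusterCount -keptEvents
```

Changing minShare adjusts how aggressively small clusters are discarded. If the counts in the first example are per-item counts, the cutoff at 0.01 is about 517 events, so only cluster 2 (50247) survives, which matches the result described above.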
