
To run a Z-score search against email logs, do I need to use a summary index, or can I get the average count and then perform a Z-score analysis?

jwalzerpitt
Influencer

I'd like to run some Z-score searches against my email logs, specifically to find outliers whose sending volume exceeds their own average by more than 1.5 standard deviations (STDEV). A plain Z-score search turns up too many legitimate senders that simply have higher output than others (Gmail, Verizon mail, etc.).

To help me weed out high-volume senders, I was thinking I'd need to calculate the average count for each sender per day, week, or month and then run a Z-score against that to find outliers. (I'd also appreciate any suggestions regarding what to compare the average with; perhaps an integration with timewrap?)

Would I need a summary index for this, or could I do this in one search?

Thx

1 Solution

aljohnson_splun
Splunk Employee

@jwalzerpitt Have you checked out the Machine Learning Toolkit? There is an assistant in there that does just this, and has custom visualizations and dashboards to help you in the process.

As an aside: yes, there are lots of ways you could do this. I think a per-sender moving average makes sense. Here is an example that looks for z-scores beyond 1.5 for a group of hostnames in some proxy logs:

sourcetype=cisco_wsa_squid earliest=-2w
| bin _time span=10m
| stats count by s_hostname, _time 
| streamstats window=6 mean(count) as mu, stdev(count) as sigma by s_hostname
| eval upper_bound = mu + (1.5 * sigma), lower_bound = mu - (1.5 * sigma)
| where count > upper_bound OR count < lower_bound
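
For reference, the last two lines are equivalent to computing the z-score explicitly and filtering on its absolute value (adding a guard for sigma = 0, which would otherwise make z null):

| eval z = (count - mu) / sigma
| where sigma > 0 AND abs(z) > 1.5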

You could change the window, span, or group-by fields to get other analyses. I'd also suggest checking out some of the searches in the ML Toolkit, as they have nice examples of using the interquartile range or median absolute deviation for doing similar things.
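
As a rough sketch of the median absolute deviation idea (not the Toolkit's implementation; the second streamstats window lags the first, and 1.4826 is the usual constant that scales a MAD up to a standard-deviation equivalent under normality):

sourcetype=cisco_wsa_squid earliest=-2w
| bin _time span=10m
| stats count by s_hostname, _time
| streamstats window=6 median(count) as med by s_hostname
| eval abs_dev = abs(count - med)
| streamstats window=6 median(abs_dev) as mad by s_hostname
| where mad > 0 AND abs(count - med) > 1.5 * 1.4826 * mad

Because medians ignore the occasional spike, this tends to be more robust than a mean/stdev window when the traffic is bursty.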

jwalzerpitt
Influencer

And for clarification, I'm trying to do the average count on a per-sender basis...
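
Something like the earlier search, adapted to a daily per-sender count, is what I have in mind (the sourcetype and sender field names here are placeholders for whatever the email logs actually use):

sourcetype=<email_sourcetype> earliest=-30d
| bin _time span=1d
| stats count by sender, _time
| streamstats window=7 mean(count) as mu, stdev(count) as sigma by sender
| eval z = (count - mu) / sigma
| where sigma > 0 AND z > 1.5

Since I only care about senders above their own average, the filter is one-sided here.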


cmerriman
Super Champion

Are you using anomalydetection for the Z-score? Can you post your syntax at all?
