
To run a Z-score search against email logs, do I need to use a summary index, or can I get the average count and then perform a Z-score analysis?

jwalzerpitt
Influencer

I'd like to run some Z-score searches against my email logs, specifically to find outliers whose sending volume exceeds their own average by more than 1.5 standard deviations (STDEV). A plain Z-score search turns up too many legitimate senders that simply have higher output than others (Gmail, Verizon mail, etc.).

To help me weed out high-volume senders, I was thinking I'd need to calculate the average count for each sender per day, week, or month and then run a Z-score against that to find outliers. (I'd also appreciate any suggestions regarding what to compare the average with; perhaps an integration with timewrap?)

Would I need a summary index for this, or could I do this in one search?

Thx

1 Solution

aljohnson_splun
Splunk Employee

@jwalzerpitt Have you checked out the Machine Learning Toolkit? There is an assistant in there that does just this, and has custom visualizations and dashboards to help you in the process.

As an aside: yes, there are lots of ways you could do this. I think a per-sender moving average makes sense. Here is an example that looks for z-scores beyond 1.5 for a group of hostnames in some proxy logs:

sourcetype=cisco_wsa_squid earliest=-2w
| bin _time span=10m
| stats count by s_hostname, _time 
| streamstats window=6 mean(count) as mu, stdev(count) as sigma by s_hostname
| eval upper_bound = mu + (1.5 * sigma), lower_bound = mu - (1.5 * sigma)
| where count > upper_bound OR count < lower_bound
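
For reference, the last two lines are equivalent to computing the z-score explicitly and filtering on its absolute value (adding a guard for sigma = 0, which would otherwise make z null):

| eval z = (count - mu) / sigma
| where sigma > 0 AND abs(z) > 1.5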

You could change the window, span, or group-by fields to get other analyses. I'd also suggest checking out some of the searches in the ML Toolkit, as they have nice examples of using the interquartile range or median absolute deviation for doing similar things.
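
As a rough sketch of the median absolute deviation idea (not the Toolkit's implementation; the second streamstats window lags the first, and 1.4826 is the usual constant that scales a MAD up to a standard-deviation equivalent under normality):

sourcetype=cisco_wsa_squid earliest=-2w
| bin _time span=10m
| stats count by s_hostname, _time
| streamstats window=6 median(count) as med by s_hostname
| eval abs_dev = abs(count - med)
| streamstats window=6 median(abs_dev) as mad by s_hostname
| where mad > 0 AND abs(count - med) > 1.5 * 1.4826 * mad

Because medians ignore the occasional spike, this tends to be more robust than a mean/stdev window when the traffic is bursty.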

jwalzerpitt
Influencer

And for clarification, I'm trying to do the average count on a per-sender basis...
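
Something like the earlier search, adapted to a daily per-sender count, is what I have in mind (the sourcetype and sender field names here are placeholders for whatever the email logs actually use):

sourcetype=<email_sourcetype> earliest=-30d
| bin _time span=1d
| stats count by sender, _time
| streamstats window=7 mean(count) as mu, stdev(count) as sigma by sender
| eval z = (count - mu) / sigma
| where sigma > 0 AND z > 1.5

Since I only care about senders above their own average, the filter is one-sided here.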


cmerriman
Super Champion

Are you using anomalydetection for the Z-score? Can you post your syntax at all?
