How to calculate stdev for a count of one field ba...

jwalzerpitt · ‎03-17-2017

I am trying to figure out how to calculate the stdev of the number of emails a user sends. I have the following search so far where I am calculating the count of MessageId per SenderAddress:

index=exchange | stats count as MessageId by SenderAddress

I'm getting tripped up trying to bring in/carry over the count of MessageId per SenderAddress. Do I need to create an eval for this field and then plug that into the rest of the search of:

Thx

DalJeanis · ‎03-17-2017

The main question is defining what item you are trying to calculate the stdev for.

If you want to calculate the stdev for the number of recipients per email, then you need to calculate individual records for the number of recipients in each email, and then calculate the stdev.

index=exchange 
| eval RecipCount=mvcount(myrecipientMVfield)
| stats avg(RecipCount) as RecipCountAvg, stdev(RecipCount) as RecipCountStdev by  SenderAddress

On the other hand, if you want to calculate the stdev of the average number of emails he sends per day, regardless of how many recipient they are to, then you need to calculate how many he sends each day, then calculate the stdev.

index=exchange | bin _time span=1d 
| stats count as EmailsSentPerDay by SenderAddress _time 
| stats count as DaysCalculated, avg(EmailsSentPerDay) as EmailsSentAvg, stdev(EmailsSentPerDay) as EmailsSentStdev by  SenderAddress

So, FIRST figure out what exact statistic it is that you are counting that you want to know how variable it is. Then isolate that thing as a single record whose variability is in question. Then calculate ( and apply z scores etc).

jwalzerpitt · ‎03-20-2017

Thx for the reply. I want to calculate the stdev of the avg number of emails per sender per day, regardless of recipients while eliminating the senders who legitimately send a high number of emails (mailing lists, etc), which would leave me with senders who get flagged as outliers.

As always the problem is defining the time range for the baseline to compare against. 1 day? 7 days? 30 days? And so on. Eventually, I'd like to put the second search through the ML toolkit to see what can be identified.

Thx again

DalJeanis · ‎03-20-2017

Okay, so I'd start with a very long baseline to calculate your typical users. Say 90 days.

You're probably going to have to develop a verified blacklist of some sort in order to accomplish "not counting 'legitimate' high-volume users". Probably create a standing csv file of your blacklist, use a left-join from your data to inputcsv and kill any record that doesn't come back null from the join.

Before that, you need to explore your data and find out how many types of users you have.

Here's what I'd do. First, produce a single file of how many emails are produced each day by each user, going back at least 90 days...

 index=exchange earliest=-91d@d latest=-1d@d | bin _time span=1d 
 | stats count as EmailsSentPerDay by SenderAddress _time 
 | outputcsv dailyuserstats.csv

Next, do a little statistical analysis with this to determine your "classes" of user.

 | inputcsv dailyuserstats.csv 
 | stats count as DaysCalculated, avg(EmailsSentPerDay) as EmailsSentAvg, stdev(EmailsSentPerDay) as EmailsSentStdev by  SenderAddress
 | eval RelPctSD = EmailsSentStdev/EmailsSentAvg
 | kmeans k=10 EmailsSentAvg EmailsSentStdev RelPctSD cfield=SenderClusters

Now you have the users clustered by their characteristics. The cluster numbers are essentially random. You can play with the k= value (probably keeping it in the range of 3-10) in order to see how the system thinks about different partitions. Do some visualization of the clusters, perhaps by (A) aggregating each cluster and mapping each on a scatterplot at the point of its centroid, size based on count of users, or (B) doing a scatterplot with the points colored as per their cluster. You can pick any two of the three measures, since the third is just a combination of the other two, and any pair could present the data slightly different than the other two pairs.

Hopefully, after the first time you do that, you will have an idea of what clusters you want to exclude from the data. You may even already have the information you are looking for. Hard to tell, without having your actual data to play with.

Once you've done a little futzing around with the data, come back with a new question and we'll help you take the next step.

jwalzerpitt · ‎03-20-2017

Thx - I'll start plugging away at the two searches you created above (greatly appreciated).

Fooling around, I combined your second original search with niketnaily's:

index=exchange | bin _time span=1d | stats count as EmailsSent by SenderAddress | eventstats mean(EmailsSent) AS mean_sent, stdev(EmailsSent) AS stdev_sent | eval Z_score=round(((EmailsSent-mean_sent)/stdev_sent),2) | table SenderAddress, EmailsSent, mean_sent, Z_score

It appears the mean_sent is comparing every user against the overall average instead of against the user's own average. How would I individually calculate the avg of number of emails sent per user and then calculate the z_score against that avg? Looking for users who are outliers based on the amount of email they send, not the total of emails sent by all users.

Thx

niketn · ‎03-25-2017

@jwalzerpitt seems like you need to add by SenderAddress to eventstats as well.

| eventstats mean(EmailsSent) AS mean_sent, stdev(EmailsSent) AS stdev_sent by SenderAddress

Please try it out and let us know if it worked.

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

jwalzerpitt · ‎03-28-2017

Thx for the reply

The only thing that's not being calculated is the Z score and I'm trying to figure out why.

jwalzerpitt · ‎03-28-2017

Looks like something is possibly wrong with the way I'm calculating stdev:

eventstats mean(EmailsSent) AS mean_sent, stdev(EmailsSent) AS stdev_sent by SenderAddress

Both of these are coming up blank

niketn · ‎03-17-2017

I am not sure why you are performing eventstats for mean and stdev ans not through stats. However, hope you are looking for the following:

 index=exchange 
| eventstats mean(field) AS mean_field, stdev(field) AS stdev_field 
| stats count as MessageId last(mean_field) as mean_field last(stdev_field) as stdev_field by SenderAddress
| eval Z_score=round(((field-mean_field)/stdev_field),2) 
| where Z_score>1.5 OR Z_score<-1.5

Besides, if you are computing Z score for anomaly detection you can refer to Splunks anomalydetection command. https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Anomalydetection

You might also find some useful SPL commands for your usecase if you go through 5-6 pages of advanced statistics in Splunk documentation: https://docs.splunk.com/Documentation/Splunk/latest/Search/Aboutadvancedstatistics

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

jwalzerpitt · ‎03-20-2017

Thx for the search and the links to anomaly detection and the advanced stats,

How to calculate stdev for a count of one field based on another?

Introducing Splunk Enterprise 9.2

Adoption of RUM and APM at Splunk

Routing logs with Splunk OTel Collector for Kubernetes