Splunk Search

Standard Deviation Difference

colinmchugo
Explorer

Hi all,

I am trying to measure the deviation of one set of data from another, as a percentage, to see if there is a large change in data flow. As a rule, 0-10/20% would be low, and anything above that would be flagged as unusual.

I am using the following, but it's very slow. Any ideas?

earliest=-30d@d latest=@d
| bucket _time span=24h
| stats count as count by _time
| stats avg(count) as DailyAvgOfMonth
| appendcols [ search source=agent_id="*" earliest=@d latest=now | stats count as Today ]
| bucket _time span=24h
| eval percent_difference=((Today/DailyAvgOfMonth)*100)
| table percent_difference
| streamstats sum(hourcount) AS AccumulatedTOTAL
| table _time AccumulatedTOTAL
| fillnull value=0

thanks in advance

Colin


DalJeanis
SplunkTrust

Okay, different strategy to make this screaming fast.

Run this once. This will not be screaming fast, but it will only need to run once...

 (earliest=-30d@d latest=@d index="main" source="splunk" agent_id="*") 
 | bucket _time span=1d 
 | stats count as daycount by _time
 | outputcsv append=t mydaycounts.csv
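
Before wiring up the daily job, you can sanity-check the seed file (a hypothetical verification step, assuming the CSV was written where you expect):

| inputcsv mydaycounts.csv
| stats count as days_loaded, min(_time) as earliest_day, max(_time) as latest_day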

Run this once a day...

(earliest=-1d@d latest=@d index="main" source="splunk" agent_id="*") 
 | bucket _time span=1d 
 | stats count as daycount by _time
 | inputcsv append=t mydaycounts.csv
 | dedup _time
 | where _time >= relative_time(now(),"-30d@d") 
 | outputcsv append=f mydaycounts.csv
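
To actually run that once a day, one option (not from the original post, just a sketch) is a scheduled saved search. A rough savedsearches.conf stanza might look like this; the stanza name and cron schedule are made up, so adjust to your environment:

[refresh_mydaycounts]
cron_schedule = 5 0 * * *
enableSched = 1
dispatch.earliest_time = -1d@d
dispatch.latest_time = @d
search = (index="main" source="splunk" agent_id="*") | bucket _time span=1d | stats count as daycount by _time | inputcsv append=t mydaycounts.csv | dedup _time | where _time >= relative_time(now(),"-30d@d") | outputcsv append=f mydaycounts.csv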

Run this to get your report/alert...

(earliest=@d latest=@h index="main" source="splunk" agent_id="*")
| bucket _time span=1d 
| stats count as todaycount by _time 
| rename COMMENT as "The above gets you one record per day for today."

| rename COMMENT as "This adds your 30-day history and calcs your averages."
| inputcsv append=t mydaycounts.csv
| stats avg(daycount) as avgcount, stdev(daycount) as stdevcount, avg(todaycount) as todaycount, max(_time) as startofday

| rename COMMENT as "Now we need to know what percentage of the day we have gone through. 25% shows as .2500"
| addinfo 
| eval ThePercent = round((1 + info_max_time - startofday) / 86400, 4)

| rename COMMENT as "And we check to see if the change in data flow is unusual..."
| eval FullDayCalc = todaycount / ThePercent
| eval FullDaySD = abs(FullDayCalc - avgcount)/stdevcount

| rename COMMENT as "The above tells you how many Stdevs you are from the norm right now..."

| rename COMMENT as "Add this to the end to kill all the normal stuff as an alert. ..."
| where FullDaySD > 2
| rename COMMENT as "Or leave this off and set the alert to only run  FullDaySD > 2 ..."

DalJeanis
SplunkTrust

@colinmchugo - Is this still an issue or did you work it out?


DalJeanis
SplunkTrust

Okay, this is a different approach to your problem. Whenever possible, you want to avoid subsearch [] and join, because they can be very machine intensive. And, for now, just NEVER USE appendcols. There is almost no good use case for it.

This version should be pretty darn fast, relatively. It puts all the records together in a pot, then teases apart the different types of records, then does the statistics on them and reports results.

(earliest=-30d@d latest=@d index=bar source=foo) OR 
(earliest=@d latest=@h source=agent_id="*" index=baz)   
| bucket _time span=1d 
| stats count as daycount by _time 
| rename COMMENT as "The above gets you one record per day for history, one for today.  Only records with source=agent_id will have the highest (@d) timestamp."

| rename COMMENT as "This moves the count for today into a new field, then calculates our averages."
| eventstats max(_time) as startofday
| eval todaycount = if(_time=startofday,daycount,null())
| eval daycount = if(_time<startofday,daycount,null())
| stats avg(daycount) as avgcount, stdev(daycount) as stdevcount, avg(todaycount) as todaycount, max(startofday) as startofday

| rename COMMENT as "Now we need to know what percentage of the day we have gone through. 25% shows as .2500"
| addinfo 
| eval ThePercent = round((1 + info_max_time - startofday) / 86400, 4)

| rename COMMENT as "And we check to see if the change in data flow is unusual..."
| eval FullDayCalc = todaycount / ThePercent
| eval FullDaySD = abs(FullDayCalc - avgcount)/stdevcount
| where FullDaySD > 2

Caveat: The above strategy is not statistically accurate, because the hourly variability isn't accounted for. I would expect the first few hours every day to throw lots of false positives. If I get some time later, I may code you a version that accounts for that. As is, I would say don't run it for the first quarter of the day, at a guess.


Okay, we can use the Central Limit Theorem to improve this. See this link for an overview of this fundamental statistical law: https://www.khanacademy.org/math/statistics-probability/sampling-distributions-library/sample-means/...

Let's assume the hours are independent samples -- which is another dubious assumption, but a MUCH better dubious assumption than ignoring it completely.

Under that assumption, a daily count is the sum of 24 independent hourly counts, so the stdevcount we have calculated from the daily totals will tend to cluster around the real hourly standard deviation (let's call it stdevhour) multiplied by the square root of 24. Thus, the value of stdevhour is calculated as stdevcount / sqrt(24).

When we extrapolate a full day's count from only N observed hours, we multiply an N-hour sum by 24/N, so the stdev of the full-day estimate is (24/N) * sqrt(N) * stdevhour = sqrt(24/N) * stdevcount. But, let's get rid of variables and use an example that's just a number: 6 hours. The approximate stdev of a full-day estimate built from 6 hours (1/4 of a day) can be expressed as any of these:

(sqrt(24) * stdevcount) / (sqrt(6))
(sqrt(24) * stdevcount) * (1 / sqrt(6))
(sqrt(24) * (1 / sqrt(6)) * stdevcount) 
(sqrt(24)/sqrt(6)) * stdevcount
2 * stdevcount

stdevcount * (sqrt(24) / sqrt(6))
stdevcount * 1/(sqrt(6)/sqrt(24))
stdevcount / (sqrt(6)/sqrt(24))
stdevcount / (sqrt(6/24))
stdevcount / sqrt(.25)
stdevcount / (.5)

Notice that the next-to-the-last formula has a decimal fraction in it. We've already calculated ThePercent, so we can avoid computing stdevhour entirely (it was only needed conceptually) and replace the entire last section with this ...

| rename COMMENT as "And we check to see if the change in data flow is unusual..."
| rename COMMENT as "Here we spread the stdev using the Central Limit Theorem, assuming that hourly samples are independent..."
| eval CurrentDailyDeviation = abs(todaycount/ThePercent - avgcount)
| eval CurrentSD = sqrt(ThePercent)*(CurrentDailyDeviation / stdevcount)
| where CurrentSD > 2
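
To sanity-check the algebra, here is a throwaway makeresults run (the numbers are made up) showing that the first form in the list above and the ThePercent form agree:

| makeresults
| eval stdevcount = 100, ThePercent = 0.25
| eval form_first = (sqrt(24) * stdevcount) / sqrt(6)
| eval form_percent = stdevcount / sqrt(ThePercent)
| rename COMMENT as "Both evaluate to 200, i.e. 2 * stdevcount for a quarter of a day"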

colinmchugo
Explorer

Dal,

I completed the search with the extra additions you recommended, but it is taking the same length of time as the old one. There are over 7.5 million events, but no stats or visuals as yet.

The full formula is the following, thanks

(earliest=-30d@d latest=@d index="main" source="splunk" agent_id="*") OR (earliest=@d latest=@h index="main" source="splunk" agent_id="*")   
 | bucket _time span=1d 
 | stats count as daycount by _time 
 | rename COMMENT as "The above gets you one record per day for history, one for today.  Only records with source=agent_id will have the highest (@d) timestamp."
 | rename COMMENT as "This moves the count for today into a new field, then calculates our averages."
 | eventstats max(_time) as startofday
 | eval todaycount = if(_time=startofday,daycount,null())
 | eval daycount = if(_time<startofday,daycount,null())
 | stats avg(daycount) as avgcount, stdev(daycount) as stdevcount, avg(todaycount) as todaycount, max(startofday) as startofday
 | rename COMMENT as "Now we need to know what percentage of the day we have gone through. 25% shows as .2500"
 | addinfo
 | eval ThePercent = round((1 + info_max_time - startofday) / 86400, 4)
 | rename COMMENT as "And we check to see if the change in data flow is unusual..."
 | eval FullDayCalc = todaycount / ThePercent
 | eval FullDaySD = abs(FullDayCalc - avgcount) / stdevcount
 | where FullDaySD > 2

thanks

Colin


colinmchugo
Explorer

Thank you Dal

I really appreciate your assistance. I have run the query and I am getting events, but I am not getting any percentage, statistics, or the visualisation that I want. I am curious: should I put | table percent_difference at the end to get the percentage difference?

thanks again

Colin


colinmchugo
Explorer

Thanks DalJeanis

I really appreciate it; I am reviewing your instructions. I would love it if you could account for the false positives as you mentioned, as I do have to get these figures right, but let me know whether you have time.

Thanks so much

Colin
