Getting Data In

How to smooth spikes in event data using Splunk?

sfncook
Explorer

We have a process that produces 8,000 requests per second, which are consumed by a server. We average only about 2 timeout events per second. A few times per day, however, this timeout rate will spike to 1,000 timeouts for no more than a second or two. We don't care about these spikes, but we need to know as quickly as possible when the consuming service is down. How can I "smooth" the event count so that we can ignore the spikes and still be notified within about 5 minutes of an outage? I'm thinking of something like a Kalman filter (I'm not a mathematician) that acts on the past 5 minutes of data and runs every 5 minutes. A normal average won't do the trick because it can't tell the difference between performance degradation and a spike. It doesn't seem like the predictive functions native to Splunk would work right out of the box. Any other ideas? Thanks.


kaufmanm
Communicator

You can look at the median value, e.g. something like:

source=*timeout.log earliest=-6m@m latest=-1m@m | timechart span=1m count(_raw) AS timeouts | stats median(timeouts)

Then, if the median(timeouts) value is over 120 (or whatever you consider an outage), you could generate an alert.
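
For example, something along these lines (120 is just a placeholder, roughly your normal 2 timeouts/sec over a one-minute bucket; adjust it to whatever you consider an outage):

source=*timeout.log earliest=-6m@m latest=-1m@m | timechart span=1m count(_raw) AS timeouts | stats median(timeouts) AS median_timeouts | where median_timeouts > 120

If you save that as a search that runs every 5 minutes and alerts whenever it returns a result, a one- or two-second spike only inflates a single one-minute bucket, so the median of the five buckets stays near normal, while a sustained outage raises most of the buckets, and therefore the median, over the threshold.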

I believe you can bucket all the way down to 1s if you want to be notified faster (say, within a minute or two).

sfncook
Explorer

I'm going to repost this new problem in a different thread, as I feel like kaufmanm answered my original question and I'm now dealing with a separate issue. Thanks, kaufmanm!


sfncook
Explorer

So the following two searches both behave in exactly the same way: sometimes the 0's are there and sometimes those buckets are missing.

1.) source="/var/log/spread/error_log" [strin | timechart span=1m count(_raw) as timeout_count | fillnull value=0 timeout_count

2.) source="/var/log/spread/error_log" [strin | timechart span=1m count(_raw) as timeout_count
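
One possible workaround, and this is just a guess on my part that I have not verified, would be to force the missing buckets to exist before the fillnull runs, something roughly like:

... | timechart span=1m count(_raw) as timeout_count | makecontinuous _time span=1m | fillnull value=0 timeout_count

(I don't know whether makecontinuous will create rows at the very edges of the search window, so take that with a grain of salt.)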


sfncook
Explorer

So timechart is producing variable results, and I think it has to do with how the browser or the client terminates the HTTP response. In any case, sometimes timechart returns the 0's and sometimes it doesn't. The fillnull command does not 'fill in' the empty one-minute buckets when the response does not contain the 0's. Reading through other forum answers, it seems 'fillnull' is commonly the suggestion for this problem, but it does not appear to work. [part 2 to follow...]


kaufmanm
Communicator

The first data set should always have a median of 0 and the second a median of 1176. It looks like instead of treating those times as 0 it may be seeing no value at all, in which case you might need to use a fillnull.

e.g. | fillnull value=0 timeouts | stats median(timeouts)

Before you pipe your data set to the median command, you want to verify that you have the correct series, e.g. 0, 0, 1212, 0, 0, and not just 1212.
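
Putting that together with the earlier search, e.g.:

source=*timeout.log earliest=-6m@m latest=-1m@m | timechart span=1m count(_raw) AS timeouts | fillnull value=0 timeouts | stats median(timeouts)

To eyeball the series first, run it without the final | stats median(timeouts) and check that every one-minute bucket shows up with a value.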

sfncook
Explorer

Here is some sample data:

SPIKE: (I want to ignore these events)
t0: 0
t1: 0
t2: 1212
t3: 0
t4: 0
Splunk's results:
median: sometimes it's '0' and sometimes it's '1212'
average: when the median is '0', the avg is '242.4000'; when the median is '1212', the avg is '1212.0000'.

ACTUAL EVENT:
t0: 2395
t1: 1387
t2: 459
t3: 1176
t4: 708
Splunk's results:
median: 1176
average: 1225

So, yes, you're right. If Splunk gave a consistent result for the median function during a 'spike', it would be perfect. But it appears to have a flaw in it.

Thanks so much for the response, kaufmanm.


sfncook
Explorer

No, median does not appear to do the trick, mainly because the Splunk median function returns different values when I run the same query. By that I mean I keep hitting the search button without changing any input parameters (as far as I can tell), and sometimes the median value is '0' and sometimes it is '1212'. Maybe this is now becoming a new question for answers.splunk.com. (sample data to follow)
