Getting Data In

How to smooth spikes in event data using Splunk?

sfncook
Explorer

We have a process that produces 8,000 requests per second, which are consumed by a server. We average only about 2 timeout events per second. A few times per day, however, this timeout rate will spike to 1,000 timeouts for no more than a second or two. We don't care about these spikes, but we need to know as quickly as possible when the consuming service is down. How can I "smooth" the event count so that we can ignore the spikes and still be notified within about 5 minutes of an outage? I'm thinking of something like a Kalman filter (I'm not a mathematician) that acts on the past 5 minutes of data and runs every 5 minutes. A normal average won't do the trick because it can't tell the difference between performance degradation and a spike. It doesn't seem like the predictive functions native to Splunk would work right out of the box. Any other ideas? Thanks.


kaufmanm
Communicator

You can look at the median value, e.g. something like:

source=*timeout.log earliest=-6m@m latest=-1m@m | timechart span=1m count(_raw) AS timeouts | stats median(timeouts)

Then, if the median(timeouts) value is over 120 (or whatever you consider an outage), you could generate an alert.
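
For example, something along these lines (120 is just a placeholder, roughly your normal 2 timeouts/sec over a one-minute bucket; adjust it to whatever you consider an outage):

source=*timeout.log earliest=-6m@m latest=-1m@m | timechart span=1m count(_raw) AS timeouts | stats median(timeouts) AS median_timeouts | where median_timeouts > 120

If you save that as a search that runs every 5 minutes and alerts whenever it returns a result, a one- or two-second spike only inflates a single one-minute bucket, so the median of the five buckets stays near normal, while a sustained outage raises most of the buckets, and therefore the median, over the threshold.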

I believe you can bucket all the way down to 1s if you want to be notified faster (say, within a minute or two).

sfncook
Explorer

I'm going to repost this new problem in a different thread, as I feel like kaufmanm answered my original question and I'm now dealing with a separate issue. Thanks, kaufmanm!


sfncook
Explorer

So the following two searches both behave in exactly the same way: sometimes the 0's are there and sometimes those buckets are missing.

1.) source="/var/log/spread/error_log" [strin | timechart span=1m count(_raw) as timeout_count | fillnull value=0 timeout_count

2.) source="/var/log/spread/error_log" [strin | timechart span=1m count(_raw) as timeout_count
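
One possible workaround, and this is just a guess on my part that I have not verified, would be to force the missing buckets to exist before the fillnull runs, something roughly like:

... | timechart span=1m count(_raw) as timeout_count | makecontinuous _time span=1m | fillnull value=0 timeout_count

(I don't know whether makecontinuous will create rows at the very edges of the search window, so take that with a grain of salt.)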


sfncook
Explorer

So timechart is producing variable results, and I think it has to do with how the browser or the client terminates the HTTP response. In any case, sometimes timechart returns the 0's and sometimes it doesn't. The fillnull command does not 'fill in' the empty one-minute buckets when the response does not contain the 0's. Reading through other forum answers, it seems 'fillnull' is commonly the suggestion for this problem, but it does not appear to work. [part 2 to follow...]


kaufmanm
Communicator

The first data set should always have a median of 0 and the second a median of 1176. It looks like instead of treating those times as 0 it may be seeing no value at all, in which case you might need to use a fillnull.

e.g. | fillnull value=0 timeouts | stats median(timeouts)

Before you pipe your data set to the median command, you want to verify that you have the correct series, e.g. 0, 0, 1212, 0, 0, and not just 1212.
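
Putting that together with the earlier search, e.g.:

source=*timeout.log earliest=-6m@m latest=-1m@m | timechart span=1m count(_raw) AS timeouts | fillnull value=0 timeouts | stats median(timeouts)

To eyeball the series first, run it without the final | stats median(timeouts) and check that every one-minute bucket shows up with a value.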

sfncook
Explorer

Here is some sample data:

SPIKE: (I want to ignore these events)
t0: 0
t1: 0
t2: 1212
t3: 0
t4: 0
Splunk's results:
median: sometimes it's '0' and sometimes it's '1212'
average: when the median is '0', the avg is '242.4000'; when the median is '1212', the avg is '1212.0000'.

ACTUAL EVENT:
t0: 2395
t1: 1387
t2: 459
t3: 1176
t4: 708
Splunk's results:
median: 1176
average: 1225

So, yes, you're right. If Splunk gave a consistent result for the median function during a 'spike', it would be perfect. But it appears to have a flaw in it.

Thanks so much for the response, kaufmanm.


sfncook
Explorer

No, median does not appear to do the trick, mainly because the Splunk median function returns different values when I run the same query. By that I mean I keep hitting the search button without changing any input parameters (as far as I can tell), and sometimes the median value is '0' and sometimes it is '1212'. Maybe this is now becoming a new question for answers.splunk.com. (sample data to follow)
