Alerting

Alerts based on calculated percent errors?

Cuyose
Builder

I see a lot of answers here that are fine if you are running a scheduled search over a set time range and just piping to "search percent>5" or whatever.

I want to trigger an alert as quickly as possible, but I am unclear on how the "rolling time window" within an alert definition works. Below is what I have after my base query for the alert query itself.

|bin _time span=5m
|stats count(eval(errorSystem="ERROR"))  AS fail_cnt, count as total by _time
|fillnull value=0
|eval pct=100*fail_cnt/total

Now this works just fine and can be vetted by piping it to timechart span=5m sum(pct).

My dilemma, though, is: do I eval this every minute in a saved search? I am attempting to use a custom alert with a rolling real-time window of 5 minutes and a search for pct>5. The problem I am running into is that the alert triggers when the historical timechart does not support the threshold. My thinking is that when there is a cluster of errors, it is evaluated for inclusion in every 5-minute window it appears in, which I believe catches more than the artificial 5-minute bin I have.

I think I may have an idea as I type this, but what is the recommended way to do this?


DalJeanis
Legend

How about using latest=-2m, binning at 1m, and using accum plus delta with p=6?

  earliest=-10m latest=-2m
 | bin _time span=1m
 | stats count(eval(errorSystem="ERROR")) AS fail_cnt, count as total_cnt by _time
 ``` running totals, so a difference between rows gives a window sum ```
 | accum fail_cnt as fail_cum
 | accum total_cnt as total_cum
 ``` delta p=6 subtracts the value from 6 one-minute buckets back ```
 | delta fail_cum as fail_delta p=6
 | delta total_cum as total_delta p=6
 | eventstats max(_time) as maxtime
 | eval fail_pct=100*fail_delta/total_delta
 ``` only evaluate the newest bucket for the alert condition ```
 | where (fail_pct >= 5) and (_time==maxtime)

Probably also want a 5m+ throttle on it.
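
For reference, that throttle can be set via the alert's "Throttle" option in the UI or directly in savedsearches.conf; a minimal sketch, assuming a hypothetical stanza name for the alert:

 [error_pct_rolling]
 alert.suppress = 1
 alert.suppress.period = 5m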


martin_mueller
SplunkTrust

bin and timechart will turn five-minute buckets into "round" five minutes, i.e. starting at :00, :05, etc. - they're not a rolling window through time.
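
You can see the snapping behaviour with a quick synthetic search (the timestamp here is just an illustrative assumption):

 | makeresults
 | eval _time=strptime("2024-01-01 12:03:00", "%Y-%m-%d %H:%M:%S")
 | bin _time span=5m
 ``` _time now reads 12:00 - the bucket snaps to the round boundary, not a window ending at 12:03 ```
 | eval bucket_start=strftime(_time, "%H:%M")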

If your indexing delay is low and known, you could run this with earliest of -6m@m and latest of -m@m every minute:

base search | stats count(eval(errorSystem="ERROR")) AS fail_cnt, count AS total | eval pct = 100 * fail_cnt / total | where pct > 5

That will roll a five-minute window through time in one-minute steps.
Unless there's an automated reaction happening when the alert triggers, going real-time is rarely useful and brings more trouble than it's worth.
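
For a scheduled (non-real-time) version of that, a rough savedsearches.conf sketch (the stanza name and base search are placeholders; the trigger condition simply fires on any returned row):

 [error_pct_rolling_5m]
 search = <your base search> | stats count(eval(errorSystem="ERROR")) AS fail_cnt, count AS total | eval pct = 100 * fail_cnt / total | where pct > 5
 dispatch.earliest_time = -6m@m
 dispatch.latest_time = -1m@m
 enableSched = 1
 cron_schedule = * * * * *
 counttype = number of events
 relation = greater than
 quantity = 0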


martin_mueller
SplunkTrust

Without your data and exact config it's impossible to guess what's going on.


Cuyose
Builder

I was still attempting to use the rolling time range within the alert, rather than scheduling all the alert searches every minute or every five. I am changing back to your suggestion and vetting it.

I am still interested in the original issue, though: being able to use Splunk's rolling alert window on a calculated field, which appears to be where the problem lies.


Cuyose
Builder

That's what I was thinking as well. I also wanted to surface these alert statuses on a dashboard and was running into "too many results to display" UI warnings, but it became obvious that the alerts can run in the rolling window (taking any indexing latency into account) while the dashboards can run on a set schedule. It won't be exactly the same, but it should be close.

I'll give it a try after my meeting this morning and see how it goes.

Thanks!


Cuyose
Builder

It appears this does not work as I would think. Removing the time span from the alert makes it trigger far too aggressively: I get a hit for 33% errors using an alert rolling window searching for pct>33, yet timecharting the same alert query in 5-minute buckets doesn't return any 5-minute segment above 6.


Cuyose
Builder

So far I have arrived at ditching the bin _time part and converting the stats command to a straight timechart with fixedrange=false and span=1m, using a rolling time window of 15m with latest of -2m.

The latest=rt-2m was initially used to try not to trigger on incomplete time buckets, but it may be possible to remove it.
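
A sketch of that pipeline, with the time range supplied by the alert's rolling window and the threshold value only as a placeholder:

 base search
 | timechart span=1m fixedrange=false count(eval(errorSystem="ERROR")) AS fail_cnt, count AS total
 | fillnull value=0
 | eval pct=if(total>0, 100*fail_cnt/total, 0)
 | where pct > 5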

Still looking for any pointers on how anyone else has handled this. Triggering on raw error counts is easy; however, when your traffic pattern makes the error count per time period swing like a sine wave, you need to use a percentage for this.
