Currently, we have a search that is set to trigger if it returns a single result, and then throttle for 10 minutes before going again.
We'd like to kind of do the opposite: If the search is STILL returning results (same host OR other hosts) after 10 minutes' time, THEN trigger an alert.
At the moment, this search returns alerts as soon as they happen, but sometimes it's a single alert and therefore a minor warning, and sometimes it's continuous (aka a service is actually down). We'd like to get Splunk to trigger if the same alert is still firing after 10 minutes, which usually indicates a problem with a particular host.
Build a search that uses timechart count with a span= that covers what would normally be your search's time window, then count how many threshold crossings you have with | where count>threshold | stats count | where count>10 and alert on that.
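A minimal sketch of that timechart approach, under some assumptions: the base search (index=main sourcetype=app_logs "service check failed") is a made-up stand-in for your real alert condition, the span is 1 minute, and "threshold" is taken to be zero, so any minute with at least one hit counts as a crossing. The final comparison uses >= 10 rather than > 10 because a 10-minute window at span=1m yields at most 10 buckets.

```spl
index=main sourcetype=app_logs "service check failed" earliest=-10m@m latest=@m
| timechart span=1m count
| where count > 0
| stats count AS minutes_with_hits
| where minutes_with_hits >= 10
```

Saved as an alert that triggers when the number of results is greater than zero, this should only fire when every one of the last 10 minutes had at least one matching event. Loosen the last line (say, >= 8) if you want to tolerate a quiet minute or two.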
Basically, you want to create a test that says the alert condition has not NOT been true for X minutes. Typically, the question is, "How to alert when my CPU has been over X% for Y minutes?"
Which is to say, "How do I know my CPU has been OVER X% for Y minutes and has NOT been UNDER X% for those Y minutes?"
The overall strategy is: create records for Y minutes or more back, at whatever frequency you think is reasonable for your use case, that have either a 1 or 0 for a field that means "the alert condition is true". Use streamstats to group them based on changes in that value. Finally, use eventstats to count the group, and if the group is large enough (has enough minutes or seconds) to meet your criteria, let the group through to throw the alert.
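One way that strategy could look in SPL, as a sketch rather than a tested alert. The base search is hypothetical, and the field names alerting, prev_alerting, changed, run_id and run_minutes are invented for illustration. It assumes one record per minute and Y = 10 minutes.

```spl
index=main sourcetype=app_logs "service check failed" earliest=-30m@m latest=@m
| timechart span=1m count
| eval alerting=if(count > 0, 1, 0)
| streamstats current=f window=1 last(alerting) AS prev_alerting
| eval changed=if(isnull(prev_alerting) OR alerting != prev_alerting, 1, 0)
| streamstats sum(changed) AS run_id
| eventstats count AS run_minutes BY run_id
| where alerting=1 AND run_minutes >= 10
```

The first streamstats picks up the previous minute's flag, the second turns each change of value into a new run_id, and eventstats then stamps every row with the length of its run. Only rows that belong to an "alerting" run of 10 minutes or more survive the final where.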
Here are a couple of those to review. The second one points to three more:
https://answers.splunk.com/answers/507811/how-to-edit-my-real-time-alert-to-trigger-when-ave.html
https://answers.splunk.com/answers/557838/create-an-alert-based-on-cpu-being-at-95-for-a-spa.html
You can search over the last 10 or 15 minutes, calculate the duration between the first alert event and the latest alert event, and alert when that duration reaches 10 minutes.
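A rough sketch of that first-to-last duration check, assuming a 15-minute search window and the same hypothetical base search; first_seen, last_seen and duration_sec are illustrative field names.

```spl
index=main sourcetype=app_logs "service check failed" earliest=-15m
| stats count AS fired, min(_time) AS first_seen, max(_time) AS last_seen
| eval duration_sec = last_seen - first_seen
| where duration_sec >= 600
```

Note that this only measures the gap between the first and last hit, so two isolated events 10 minutes apart would also pass. Add BY host to the stats line if you want the check per host.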
The event itself is brief, I just need to know how many times it's fired over the last 10 minutes (and if it's still going). Most of what I'm finding calculates event duration as opposed to difference between events firing.
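For the "how many times has it fired, and is it still going" reading, something along these lines might be closer. It is only a sketch: the base search is hypothetical, and both the minimum count (more than 1) and the "still going" cutoff (last hit within 120 seconds) are arbitrary values to tune.

```spl
index=main sourcetype=app_logs "service check failed" earliest=-10m
| stats count AS fired, max(_time) AS last_seen
| eval secs_since_last = now() - last_seen
| where fired > 1 AND secs_since_last <= 120
```

This counts the brief events directly instead of measuring a duration, and uses the age of the most recent one as a proxy for whether the condition is still active.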