Hi there,
I am in the process of setting up a proof-of-concept Splunk environment that will replace our current alerting system. We currently use a combination of syslog and swatch (syslog watcher) to alert on error codes across our applications (via email to a number of different recipients, depending on the alert). We have about 15 different applications that can generate a total of about 900 unique alert codes. One of the main issues with our current system is that it cannot do any velocity checking on alerts (i.e. only alert if there are 3 ERR_101 alerts within a set amount of time).
I can achieve the above if I take a small subset of the error codes and set up an alert with the trigger being number of occurrences per x minutes.
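For a single error code, the sketch below is roughly what I have working today: a scheduled search whose alert trigger fires on result count. (Index and field names here are placeholders for my environment.)

```
index=app_logs "ERR_101" earliest=-5m
| stats count
```

The saved alert then runs every 5 minutes with the trigger condition "number of results > 0" and a custom condition of `count >= 3`. This works, but it means one saved search per error code.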
The problem is that when I try to scale this up it becomes bloated and hard to manage, and it ends up as several different real-time searches (which would affect performance).
I want to build (without re-inventing the wheel too much) something that will allow me to tune the email recipient for each alert and also the number of occurrences within a configurable time-frame to alert on.
Taking ERR_002 from the table below as an example: if there are 3 occurrences of this error within 60 minutes, an email will be sent to appteam@abc.com.
| Error Code | Email | NumOccurrences | Timeframe (min) |
| ERR_001 | support@abc.com,oncall@abc.com | 3 | 1 |
| ERR_002 | appteam@abc.com | 3 | 60 |
I am not looking for a complete answer to this problem, just a bit of guidance into how I would go about achieving this within Splunk. I have investigated lookup tables but have been unable to use values in the table to customise the alert.
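The direction I have been experimenting with is roughly the sketch below, assuming the table above is saved as a CSV lookup called `alert_thresholds.csv` (with a lookup definition named `alert_thresholds`) and that the error code can be extracted from the raw event. The index name, the `rex` pattern, and the field names are all assumptions about my data, not a working solution:

```
index=app_logs earliest=-60m
| rex field=_raw "(?<error_code>ERR_\d+)"
| lookup alert_thresholds error_code OUTPUT Email NumOccurrences Timeframe
| eval age_min = (now() - _time) / 60
| where age_min <= Timeframe
| stats count by error_code, Email, NumOccurrences
| where count >= NumOccurrences
```

The idea is that each surviving result row carries its own `Email` value, which could then drive the notification (e.g. via `sendemail`, or a custom alert action). Where this falls down for me is the last step: using the `Email` value from the lookup as the actual alert recipient.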
Any guidance would be much appreciated.
Cormac