Splunk Search

How to send a recovery alert only when there is a corresponding alert?

anasar
New Member

Hi,

We have many indexes, like server and core, and we have a lookup table with two columns: exceptions and threshold. The requirement is to search for all exceptions in each index, and if any exception from the lookup table is found in the index and its count is equal to or greater than the threshold value from the lookup, then alert.

After 5 minutes, it should search again, and if the exceptions are no longer happening in the index, it should send a recovery alert only for those exceptions for which we already sent an alert. So the intention here is to send a recovery alert only when there was a corresponding problem alert.

While alerting, we need to specify which exception happened, how many events occurred (this should be greater than or equal to the threshold), the source IP, etc. The recovery alert should also contain the same info.

Example:

Lookup table: exception.csv
exceptions,threshold
OutOfMemoryError,1
ORA-1112, 5
JVMExceptions, 2
etc.   
1 Solution

gcusello
SplunkTrust

Hi anasar,
if the exceptions are in a field, they are easier to manage:

your_search [ | inputlookup exception.csv | fields exceptions ] 
| stats count by exceptions 
| lookup exception.csv exceptions OUTPUT threshold 
| where count >= threshold 
| table exceptions count threshold 

If instead (as I think is your case) the exceptions are strings inside your raw events, it's less easy!

your_search [ | inputlookup  exception.csv | rename exceptions AS query | fields query ] 
| rename _raw as rawText
| eval foo=[
   | inputlookup exception.csv 
   | eval query="%"+exceptions+"%" 
   | stats values(query) AS query 
   | eval query=mvjoin(query,",") 
   | fields query 
   | format "" "" "" "" "" ""
   ]
| eval foo=split(foo,",") 
| mvexpand foo 
| where like(rawText,foo)
| eval exceptions=trim(foo,"%") 
| stats count by exceptions 
| lookup exception.csv exceptions OUTPUT threshold 
| where count >= threshold 
| table exceptions count threshold 

This solves the first requirement, but I'm not sure about the second one: "After 5 minutes, search again, and if the exceptions are no longer happening in the index, send a recovery alert only for those exceptions for which we already sent an alert. So the intention here is to send a recovery alert only when there was a corresponding alert."
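
A possible approach (only a sketch, not tested) could be to save the exceptions that fired the alert into a state lookup with outputlookup at the end of the alert search; the lookup name alert_state.csv here is just a placeholder:

| where count >= threshold 
| table exceptions count threshold 
| outputlookup alert_state.csv

Then the recovery search, scheduled 5 minutes later, could read that state back, recount the same exceptions over the last 5 minutes (reusing the same matching logic as above), and keep only the ones that dropped below their threshold:

| inputlookup alert_state.csv 
| rename count AS alerted_count 
| join type=left exceptions 
    [ search your_search earliest=-5m 
      | stats count AS current_count by exceptions ] 
| fillnull value=0 current_count 
| lookup exception.csv exceptions OUTPUT threshold 
| where current_count < threshold 
| table exceptions alerted_count current_count threshold

At the end you could also overwrite alert_state.csv with only the exceptions that are still over threshold, so the next run doesn't send the same recovery alert twice.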

Bye.
Giuseppe


anasar
New Member

Thank you gcusello. It works.


anasar
New Member

I'm planning to use a summary index to save the alerts and use it for the recovery alerts, but I'm not clear on how to do the other parts. Please help.
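
Something like this is what I had in mind (not tested; the index name summary_alerts is just a placeholder). The alert search would write what fired into the summary index:

| where count >= threshold 
| table exceptions count threshold 
| collect index=summary_alerts

and the recovery search would read the previous run back from the summary index (for example index=summary_alerts earliest=-10m@m latest=-5m@m) and compare it against a fresh count over the last 5 minutes, sending the recovery alert only for the exceptions that are now below their threshold. Is that the right direction?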


anasar
New Member

Also, I need to add one more column to the lookup file: severity, which says whether the exception is a warning (1) or critical (2). So while sending the mail we need to look at the severity, and the alert subject will be "Critical Problem alert ..." or "Warning Problem alert ...". The recovery alert ("Recovery ...") doesn't need to refer to the severity field. So exception.csv will look like:

exceptions, threshold, severity
OutOfMemoryError, 1, 2
ORA-1114, 5, 1
JVMExceptions, 10, 2
etc.
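
For the subject, I was thinking of something like this at the end of the alert search (not tested; subject is just a field I would pass to the email action):

| lookup exception.csv exceptions OUTPUT threshold severity 
| where count >= threshold 
| eval subject=case(tonumber(severity)==2, "Critical Problem alert: ".exceptions, 
                    tonumber(severity)==1, "Warning Problem alert: ".exceptions, 
                    true(), "Problem alert: ".exceptions) 
| table exceptions count threshold severity subject

and then use $result.subject$ in the subject line of the email alert action, if I'm understanding the tokens correctly.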
