
Search and Alert Question: Summarize found events, set threshold, alert only on new events.

jewettg
Explorer

So my question is based on something I am trying to do, but my splunk-foo is not powerful enough to figure this out!

So, I have the following scenario: we have a door control system, connected to the network, that is not very smart. Whenever the networking team makes significant changes (usually over the holidays), they have to bring the network down. The network "reboot" causes what we call "micros" (one of many interfaces for the badge readers, mag-lock controllers, etc.) to go completely brain-dead. They require a reboot to come back online. While brain-dead, doors will not open, badges will not be read, etc. A BAD THING!

When this happens, the master server that I monitor with Splunk shoots out these errors (see below). Depending on the scope of the network outage, the number of lines (micros) it reports can vary. Each line has a different timestamp and a different micro number. It usually spits them all out within a few seconds of each other.

Example:

02:15:00.425 dbmgr : E - SUP invalid table -1 from micro 105
02:15:00.429 dbmgr : E - SUP invalid table -1 from micro 33
02:15:00.476 dbmgr : E - SUP invalid table -1 from micro 305
02:15:00.524 dbmgr : E - SUP invalid table -1 from micro 40
---- snip ----
02:15:01.434 dbmgr : E - SUP invalid table -1 from micro 38
02:15:01.536 dbmgr : E - SUP invalid table -1 from micro 21
02:15:01.554 dbmgr : E - SUP invalid table -1 from micro 42
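
For reference, the micro number in those lines can be pulled out with a simple rex extraction, something along these lines. The index and sourcetype names are placeholders (not our real ones), and micro_id is just what I am calling the extracted field:

index=main sourcetype=door_controller "SUP invalid table"
| rex "from micro (?<micro_id>\d+)"
| table _time, micro_id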

I need to alert on these events, let our on-call people know it has happened, and have them reboot those micros. I know how to do this, but the problem I am having is that each time an alert fires (in real time), a "help desk ticket" (for the on-call people) gets generated, and the on-call people also get an SMS alert for each ticket. NOT GOOD. This could create 10-20 tickets and SMS alerts. I really do not want to piss off the on-call folks with ticket bombs!

I would like Splunk to collect the matching events for a short period of time and then send a single summary, listing the events that happened, to the email address that generates the ticket and SMS alert. It would then wait a couple of hours (set up via a threshold) before alerting the on-call staff again. It might take them 15-20 minutes to wake up, log in, and try to remotely reboot the micros, or they may have to travel into the office and reboot them (30-40 minutes).

So I have the alert set up and the threshold set up. Can anyone help me figure out how to summarize the events and have them listed in a single email?
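
What I have in mind is a single-row summary along these lines, listing every affected micro in one result, but I am not sure how to wire it into the alerting so only one email and ticket go out (again, the index and sourcetype names are placeholders):

index=main sourcetype=door_controller "SUP invalid table"
| rex "from micro (?<micro_id>\d+)"
| stats count AS error_count, values(micro_id) AS affected_micros, min(_time) AS first_seen
| convert ctime(first_seen)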

1 Solution

jewettg
Explorer

Never mind! Found the answer after enough playing around.

The best way to do what I wanted is to use the following alert settings:

Alert Type: Scheduled. Cron Schedule.
Trigger Condition: Number of Results is > 0.
When triggered, execute actions: [ √ ] ONCE    [    ] For each result
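
For anyone who lands here later, here is a sketch of what the equivalent could look like in savedsearches.conf, including a throttle that keeps the alert quiet for a couple of hours after it fires. The stanza name, index, sourcetype, cron schedule, and email address are placeholders, not our real values:

[Door Micros Brain-Dead Alert]
# Scheduled search: one summary row listing every affected micro
search = index=main sourcetype=door_controller "SUP invalid table" \
| rex "from micro (?<micro_id>\d+)" \
| stats count AS error_count, values(micro_id) AS affected_micros
enableSched = 1
cron_schedule = */5 * * * *
dispatch.earliest_time = -5m
dispatch.latest_time = now
# Trigger when the search returns anything (Number of Results > 0)
counttype = number of events
relation = greater than
quantity = 0
# Fire the action once per search run, not once per result
alert.digest_mode = 1
# Throttle: stay quiet for a couple of hours after an alert fires
alert.suppress = 1
alert.suppress.period = 2h
# Email action: the address that cuts the ticket / SMS
action.email = 1
action.email.to = oncall-tickets@example.com
action.email.inline = 1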
