Alerting

Visibly taking responsibility for a generated alert

SharplyUnclear
Engager

I am working on a call centre solution where alerts are raised (dropped calls, email queues building up, average call length too long, etc.) and displayed in a panel of a common Splunk application to a set of team leaders. When the problem goes away, the alert status goes 'green' (and the alert should disappear from the display panel).

I want a team leader to be able to say that they're taking responsibility for the alert, so that no-one else has to concern themselves with it, and for this information to be propagated to all users.

I would expect there to be 5-20 alerts active at any one time (in theory there could be a few hundred, but that would represent Armageddon). What approach would people take to designing this solution? Is it practical, say, to hold the alert information in a transient CSV file, and to capture an owner's decision to take responsibility for fixing the problem from an individual screen? Could I use inputcsv and outputcsv to control this mechanism, and would the status be propagated consistently across the system?
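To make this concrete, here is the sort of thing I have in mind (all file and field names are illustrative, and $ack_alert_id$ and $env:user$ would be dashboard tokens). A team leader clicking "take ownership" would run something like:

| inputcsv alerts.csv
| eval owner=if(alert_id="$ack_alert_id$", "$env:user$", owner)
| outputcsv alerts.csv

and the shared panel would be driven by:

| inputcsv alerts.csv
| where status!="green"
| table alert_id, description, status, owner

My concern is whether two team leaders writing with outputcsv at the same time could clobber each other's updates, which is really my consistency question above.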


SharplyUnclear
Engager

Thanks for your feedback and for broadly confirming the direction I'm taking. We're not going to implement a "poor man's" database transactional model, so there is a small chance that two people respond at the same time. I'll also make sure that only one instance of a particular alert is displayed on the bespoke panel we're controlling output to, along the lines sketched below.
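For the single-instance display, I'm planning something along these lines (assuming each row carries a stable alert_id and a raised_at timestamp, both names illustrative):

| inputcsv alerts.csv
| sort - raised_at
| dedup alert_id
| where status!="green"

so the panel only ever shows the most recent instance of each alert.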

I'll update this note later with how I get on.


dwaddle
SplunkTrust

In a traditional IT role, this is a good case for a Splunk partner like PagerDuty (www.pagerduty.com). The pre-built integrations hand off alerts from Splunk to PagerDuty as incidents, and PagerDuty maintains the responsible party (and their responsiveness). PagerDuty also handles escalations in the event of unresponsiveness.

But I think you would struggle to use PagerDuty for this role in the system you've described. If you're going to have to maintain state, what you're describing sounds reasonable - lookups for state are a common solution. One potential issue is if you have multiple instances of a given alert - which one is someone acknowledging / taking responsibility for?
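As a rough sketch of the lookup approach (file and field names here are purely illustrative, and you may need a lookup definition for alert_acks.csv), you could key each acknowledgement to both the alert and the specific instance. Recording an acknowledgement from a dashboard:

| makeresults
| eval alert_id="$alert_id$", instance_time="$instance_time$", owner="$env:user$", acked_at=now()
| fields - _time
| outputlookup append=true alert_acks.csv

Then the shared panel can enrich alerts with any recorded owner:

| inputlookup alerts.csv
| lookup alert_acks.csv alert_id, instance_time OUTPUT owner
| where status!="green"

Keying on instance_time as well as alert_id means an acknowledgement applies to one specific occurrence, which addresses the multiple-instances ambiguity.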
