Splunk Search

Count Consecutive Errors and alert

ashleyherbert
Communicator

Hi,

We have some transaction logs which log business event transactions.
I have a requirement to alert when a particular transaction fails 10 consecutive times. If a transaction is successful in the middle, it should not alert. Each transaction is a single event with a Status field which tells whether it was successful or failed.

I've tried a few different ways of doing this, but haven't found a good way. Is there any search that will find consecutive events of a field?

For example, here's the search I'd use if I just wanted a count of the errors:

index=prod sourcetype=esb-appaudit Status=TechnicalFault | stats count by ServiceName | search count>10

This would give me the ServiceNames that had more than 10 errors in the time period I'm searching, but not 10 consecutive errors.

Any help would be appreciated.

Thanks,

Ash


kristian_kolb
Ultra Champion

Hi, assuming that you can live with "near-realtime" you can run a scheduled search each minute along the lines of:

your_source_or_sourcetype | head 10 | where Status=="Success"

Then you set alerting for the saved search to trigger if "number of events" is zero.

Depending on the actual load on the system, you may end up with hundreds of consecutive failures between the scheduled searches, but you'll get notified within a minute anyway. Play around with scheduling parameters to run as often as needed.
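The check that scheduled search performs — "did none of the last 10 events succeed?" — can be sketched outside Splunk. A minimal Python sketch, assuming events arrive as dicts with a Status field, newest first (mirroring `| head 10`):

```python
def should_alert(events, window=10):
    """True when none of the most recent `window` events succeeded.

    `events` is newest-first, like the output of `| head 10`; each
    event is a dict with a "Status" field, as in the original logs.
    """
    recent = events[:window]
    # Require a full window so a freshly started service can't alert,
    # then check that no "Success" appears anywhere in it.
    return len(recent) == window and not any(
        e["Status"] == "Success" for e in recent
    )

# Ten straight failures should alert; one success in the window should not.
failures = [{"Status": "TechnicalFault"}] * 10
mixed = failures[:5] + [{"Status": "Success"}] + failures[:4]
```

This matches the "trigger when number of events is zero" setup: alerting when zero successes survive the filter is the same as alerting when the whole window failed.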

UPDATE: upon closer inspection of your question, I see that several services write to the same log, right? In that case this will not work as intended. Have to think a bit further on that one.

UPDATE2:
Played around a little with a different kind of log and I think this does what you want when you run it as a real-time search (assuming that "Success" is the opposite of "TechnicalFault"):

index=prod sourcetype=esb-appaudit | streamstats count(Status) as StatCount global=f window=10 by ServiceName, Status | chart values(StatCount) values(Status) AS XXX by ServiceName | where XXX != "Success"

This prints out a little more information than you need, but you can use that info to verify that the results are correct. This is my first shot at streamstats, so it can most likely be made shorter/more efficient.
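The consecutive-error logic this streamstats search approximates can be sketched in Python. This is a rough sketch of the intent, not the exact search semantics; the event dicts and field values are taken from the thread:

```python
from collections import defaultdict

def consecutive_failure_alerts(events, threshold=10, fail="TechnicalFault"):
    """Yield a ServiceName the moment it hits `threshold` consecutive
    failures; any other status resets that service's streak.

    `events` is oldest-first; each is a dict with ServiceName and
    Status fields, roughly what `streamstats ... window=10 by
    ServiceName, Status` tracks per service.
    """
    streak = defaultdict(int)  # consecutive failures per service
    for e in events:
        svc = e["ServiceName"]
        if e["Status"] == fail:
            streak[svc] += 1
            if streak[svc] == threshold:  # fire once per run of failures
                yield svc
        else:
            streak[svc] = 0  # a success in the middle resets the count

run_a = [{"ServiceName": "A", "Status": "TechnicalFault"}] * 10
ok_b = [{"ServiceName": "B", "Status": "Success"}]
```

The key property, matching the requirement in the question, is that a single success for a service resets its count, so only unbroken runs of failures can reach the threshold.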

Hope this helps,

Kristian


ashleyherbert
Communicator

Thanks Kristian. Yeah, I've got to go through the number of different Services we plan to have. Most likely I'll try to group them by the window size they require; hopefully there should only be a couple of different searches that way.

I appreciate all your help.

Cheers,
Ash


kristian_kolb
Ultra Champion

Sorry, but I don't think it is possible to have different window sizes in the same search.

If you only have a few, fairly static, services it could be possible to make one search job per ServiceName, though this might consume more resources than you'd like. A search job typically uses one CPU core and some memory, so if you have 5 ServiceNames you'd like to monitor for failures in real-time, that would mean that 5 cores would be tied down for this constantly.

Please vote up and/or mark as "Answered" if you're satisfied.

/kristian


ashleyherbert
Communicator

Here's the search I'm working with at the moment:

index=prod sourcetype=esb-appaudit | replace *Completed with Completed, *TechnicalFault with TechnicalFault, *BusinessFault with Completed, *GatewaySysError with TechnicalFault in Status | streamstats count(Status) as StatCount global=f window=400 by ServiceName, Status | stats last(StatCount) as TotalCount, values(Status) as Statuses by ServiceName | join type=left ServiceName [| inputcsv ESBServiceThresholds.csv] | eval Alert=if(TotalCount>Threshold,"Alert",null()) | search Alert="Alert" Statuses!=Completed
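The per-service threshold comparison that the join against ESBServiceThresholds.csv performs can be sketched in Python. The service names below are hypothetical; the field roles (TotalCount, Threshold) follow the search above:

```python
def threshold_alerts(counts, thresholds, default=10):
    """Flag services whose consecutive-failure count exceeds their
    per-service threshold, mirroring the lookup-then-eval step above.

    `counts` maps ServiceName -> current failure count (TotalCount);
    `thresholds` maps ServiceName -> Threshold from the CSV.
    """
    return {
        svc: n for svc, n in counts.items()
        if n > thresholds.get(svc, default)  # fall back when unlisted
    }

counts = {"OrderService": 12, "PaymentService": 3}
thresholds = {"OrderService": 10, "PaymentService": 20}
```

Because the left join leaves Threshold null for unlisted services, a default (here 10) stands in for whatever fallback the real search would need.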

Thanks for your help.
Cheers,
Ash


ashleyherbert
Communicator

Hi Kristian,
This looks promising. I've been playing around with it and it's mostly doing what I want. The next trick is to work out how I can have different thresholds for each ServiceName, but that's a different issue (I would somehow have to change the streamstats window size based on the Threshold...).


Ayn
Legend

My suggestion is to build transactions (Splunk's transactions, that is, not to be confused with the transactions you are gathering from your logs), joining events based on ServiceName and specifying that the transaction should be closed when an event with a success status is found. That way, any transaction with more than 10 events in it will be one with more than 10 consecutive errors (because any successful event would close the transaction). Because transaction always outputs a field called eventcount, you can then search for transactions that have more than 10 events in them. I don't know what the "success" status is in your case, but for simplicity I'm calling it "Success". In that case you'd do something like:

index=prod sourcetype=esb-appaudit | transaction ServiceName endswith=eval(Status=="Success") | search eventcount>10
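The grouping idea behind that transaction search can be sketched in Python. Note one deliberate difference: this sketch also reports runs not yet closed by a success, whereas the transaction command would not emit the transaction until the closing event arrives (the limitation raised later in this thread). Event shapes are hypothetical:

```python
def long_failure_runs(events, limit=10):
    """Track, per service, the run of events since the last "Success"
    (a success closes the run, like endswith=eval(Status=="Success"))
    and report services whose current run exceeds `limit` failures.

    `events` is oldest-first; each is a dict with ServiceName and
    Status fields.
    """
    runs = {}
    for e in events:
        svc = e["ServiceName"]
        if e["Status"] == "Success":
            runs[svc] = 0  # the closing event ends the transaction
        else:
            runs[svc] = runs.get(svc, 0) + 1
    return [svc for svc, n in runs.items() if n > limit]

bad_a = [{"ServiceName": "A", "Status": "TechnicalFault"}] * 11
closed_b = ([{"ServiceName": "B", "Status": "TechnicalFault"}] * 3
            + [{"ServiceName": "B", "Status": "Success"}])
```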

Ayn
Legend

You might be able to use streamstats for this. It has the option to check a window of events and perform statistics on that, so you could for instance do something like

... | streamstats window=10 count(eval(Status=="TechnicalFault")) by ServiceName

and take it from there. I can't say anything about whether this could prove useful as I haven't used streamstats much myself.


Ayn
Legend

OK, I see. What is the desired behaviour? I'm guessing that you want an alert only once the count has reached 11, but nothing after that? That could be a bit tricky; at least I can't think of a way to do it off the top of my head. If it would be OK to have an alert fired every time 11 consecutive failures have been reported since the last alert was fired, you could just add maxevents=11 to your transaction command.


ashleyherbert
Communicator

Hi Ayn,
Thanks for your suggestion. I've been playing with it and it mostly works; however, I intend to run this as a real-time search/alert to notify us when there appears to be a problem with the system. With the transaction command, the transaction doesn't appear until it finds a 'Success' event (i.e., once the system is working again), so I can't actually alert on it. If the system is actually having an issue, there won't be any 'Success' events to close the transactions.
