Splunk Search

Count Consecutive Errors and alert

ashleyherbert
Communicator

Hi,

We have some transaction logs which log business event transactions.
I have a requirement to alert when a particular transaction fails 10 consecutive times. If a transaction is successful in the middle, it should not alert. Each transaction is a single event with a Status field which tells whether it was successful or failed.

I've tried a few different ways of doing this, but haven't found a good way. Is there any search that will find consecutive events of a field?

For example, here's the search I'd use if I just wanted a count of the errors:

index=prod sourcetype=esb-appaudit Status=TechnicalFault | stats count by ServiceName | search count>10

This would give me the ServiceNames that had more than 10 errors in the time period I'm searching, but not 10 consecutive errors.

Any help would be appreciated.

Thanks,

Ash


kristian_kolb
Ultra Champion

Hi, assuming that you can live with "near-realtime" you can run a scheduled search each minute along the lines of:

your_source_or_sourcetype | head 10 | where Status=="Success"

Then you set alerting for the saved search to trigger if "number of events" is zero.

Depending on the actual load on the system, you may end up with hundreds of consecutive failures between the scheduled searches, but you'll get notified within a minute anyway. Play around with scheduling parameters to run as often as needed.
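The check that scheduled search performs — "did none of the last 10 events succeed?" — can be sketched outside Splunk. A minimal Python sketch, assuming events arrive as dicts with a Status field, newest first (mirroring `| head 10`):

```python
def should_alert(events, window=10):
    """True when none of the most recent `window` events succeeded.

    `events` is newest-first, like the output of `| head 10`; each
    event is a dict with a "Status" field, as in the original logs.
    """
    recent = events[:window]
    # Require a full window so a freshly started service can't alert,
    # then check that no "Success" appears anywhere in it.
    return len(recent) == window and not any(
        e["Status"] == "Success" for e in recent
    )

# Ten straight failures should alert; one success in the window should not.
failures = [{"Status": "TechnicalFault"}] * 10
mixed = failures[:5] + [{"Status": "Success"}] + failures[:4]
```

This matches the "trigger when number of events is zero" setup: alerting when zero successes survive the filter is the same as alerting when the whole window failed.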

UPDATE: upon closer inspection of your question, I see that several services write to the same log, right? In that case this will not work as intended. Have to think a bit further on that one.

UPDATE2:
Played around a little with a different kind of log and I think this does what you want when you run it as a real-time search (assuming that "Success" is the opposite of "TechnicalFault"):

index=prod sourcetype=esb-appaudit | streamstats count(Status) as StatCount global=f window=10 by ServiceName, Status | chart values(StatCount) values(Status) AS XXX by ServiceName | where XXX != "Success"

This prints out a little more information than you need, but you can use that info to verify that the results are correct. This is my first shot at streamstats, so it can most likely be made shorter/more efficient.
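The consecutive-error logic this streamstats search approximates can be sketched in Python. This is a rough sketch of the intent, not the exact search semantics; the event dicts and field values are taken from the thread:

```python
from collections import defaultdict

def consecutive_failure_alerts(events, threshold=10, fail="TechnicalFault"):
    """Yield a ServiceName the moment it hits `threshold` consecutive
    failures; any other status resets that service's streak.

    `events` is oldest-first; each is a dict with ServiceName and
    Status fields, roughly what `streamstats ... window=10 by
    ServiceName, Status` tracks per service.
    """
    streak = defaultdict(int)  # consecutive failures per service
    for e in events:
        svc = e["ServiceName"]
        if e["Status"] == fail:
            streak[svc] += 1
            if streak[svc] == threshold:  # fire once per run of failures
                yield svc
        else:
            streak[svc] = 0  # a success in the middle resets the count

run_a = [{"ServiceName": "A", "Status": "TechnicalFault"}] * 10
ok_b = [{"ServiceName": "B", "Status": "Success"}]
```

The key property, matching the requirement in the question, is that a single success for a service resets its count, so only unbroken runs of failures can reach the threshold.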

Hope this helps,

Kristian


ashleyherbert
Communicator

Thanks Kristian. Yeah, I've got to go through the number of different Services we plan to have. Most likely I'll try to group them by the window size they require; hopefully there should only be a couple of different searches that way.

I appreciate all your help.

Cheers,
Ash


kristian_kolb
Ultra Champion

Sorry, but I don't think it is possible to have different window sizes in the same search.

If you only have a few, fairly static, services it could be possible to make one search job per ServiceName, though this might consume more resources than you'd like. A search job typically uses one CPU core and some memory, so if you have 5 ServiceNames you'd like to monitor for failures in real-time, that would mean that 5 cores would be tied down for this constantly.

Please vote up and/or mark as "Answered" if you're satisfied.

/kristian


ashleyherbert
Communicator

Here's the search I'm working with at the moment:

index=prod sourcetype=esb-appaudit | replace *Completed with Completed, *TechnicalFault with TechnicalFault, *BusinessFault with Completed, *GatewaySysError with TechnicalFault in Status | streamstats count(Status) as StatCount global=f window=400 by ServiceName, Status | stats last(StatCount) as TotalCount, values(Status) as Statuses by ServiceName | join type=left ServiceName [| inputcsv ESBServiceThresholds.csv] | eval Alert=if(TotalCount>Threshold,"Alert",null()) | search Alert="Alert" Statuses!=Completed
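The per-service threshold comparison that the join against ESBServiceThresholds.csv performs can be sketched in Python. The service names below are hypothetical; the field roles (TotalCount, Threshold) follow the search above:

```python
def threshold_alerts(counts, thresholds, default=10):
    """Flag services whose consecutive-failure count exceeds their
    per-service threshold, mirroring the lookup-then-eval step above.

    `counts` maps ServiceName -> current failure count (TotalCount);
    `thresholds` maps ServiceName -> Threshold from the CSV.
    """
    return {
        svc: n for svc, n in counts.items()
        if n > thresholds.get(svc, default)  # fall back when unlisted
    }

counts = {"OrderService": 12, "PaymentService": 3}
thresholds = {"OrderService": 10, "PaymentService": 20}
```

Because the left join leaves Threshold null for unlisted services, a default (here 10) stands in for whatever fallback the real search would need.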

Thanks for your help.
Cheers,
Ash


ashleyherbert
Communicator

Hi Kristian,
This looks promising. I've been playing around with it and it's mostly doing what I want. The next trick is to work out how I can have different thresholds for each ServiceName, but that's a different issue (I would somehow have to change the streamstats window size based on the Threshold...).


Ayn
Legend

My suggestion is to build transactions (Splunk's transactions, that is, not to be confused with the transactions you are gathering from your logs), joining events based on ServiceName and specifying that the transaction should be closed when an event with a success status is found. That way, any transaction with more than 10 events in it will be one with more than 10 consecutive errors (because any successful event would close the transaction). Because transaction always outputs a field called eventcount, you can then search for transactions that have more than 10 events in them. I don't know what the "success" status is in your case, but for simplicity I'm calling it "Success". In that case you'd do something like:

index=prod sourcetype=esb-appaudit | transaction ServiceName endswith=eval(Status=="Success") | search eventcount>10
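The grouping idea behind that transaction search can be sketched in Python. Note one deliberate difference: this sketch also reports runs not yet closed by a success, whereas the transaction command would not emit the transaction until the closing event arrives (the limitation raised later in this thread). Event shapes are hypothetical:

```python
def long_failure_runs(events, limit=10):
    """Track, per service, the run of events since the last "Success"
    (a success closes the run, like endswith=eval(Status=="Success"))
    and report services whose current run exceeds `limit` failures.

    `events` is oldest-first; each is a dict with ServiceName and
    Status fields.
    """
    runs = {}
    for e in events:
        svc = e["ServiceName"]
        if e["Status"] == "Success":
            runs[svc] = 0  # the closing event ends the transaction
        else:
            runs[svc] = runs.get(svc, 0) + 1
    return [svc for svc, n in runs.items() if n > limit]

bad_a = [{"ServiceName": "A", "Status": "TechnicalFault"}] * 11
closed_b = ([{"ServiceName": "B", "Status": "TechnicalFault"}] * 3
            + [{"ServiceName": "B", "Status": "Success"}])
```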

Ayn
Legend

You might be able to use streamstats for this. It has the option to check a window of events and perform statistics on that, so you could for instance do something like

... | streamstats window=10 count(eval(Status=="TechnicalFault")) by ServiceName

and take it from there. I can't say anything about whether this could prove useful as I haven't used streamstats much myself.


Ayn
Legend

OK, I see. What is the desired behaviour? I'm guessing that you want an alert only once the count has reached 11, but nothing after that? That could be a bit tricky; at least I can't think of a way to do it off the top of my head. If it would be OK to have an alert fired every time 11 consecutive failures have been reported since the last alert was fired, you could just add maxevents=11 to your transaction command.


ashleyherbert
Communicator

Hi Ayn,
Thanks for your suggestion. I've been playing with it and it mostly works; however, I intend to run this as a real-time search/alert to notify us when there appears to be a problem with the system. With the transaction command, the transaction doesn't appear until it finds a 'Success' event (i.e., once the system is working again), so I can't actually alert on it. If the system is actually having an issue, there won't be any 'Success' events to close the transactions.
