Alerting

Smoothing a running average

Esperteyu
Explorer

Hi,

So what I've been trying to do lately is create an alert on top of an errors/total ratio. The option I've focused on for the moment (not that I wouldn't like something more accurate if I can get it) is to alert if the last-10-minutes ratio exceeds the last-24-hours ratio by more than 20 percentage points.
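To make the trigger condition concrete, here is a minimal sketch in Python (not Splunk code; the function name and the sample numbers are made up for illustration):

```python
def should_alert(errors_10m, total_10m, errors_24h, total_24h, threshold_points=20):
    """Alert when the last-10-minute error rate exceeds the
    last-24-hour error rate by more than `threshold_points`
    percentage points."""
    rate_10m = 100 * errors_10m / total_10m
    rate_24h = 100 * errors_24h / total_24h
    return rate_10m - rate_24h > threshold_points

# e.g. 30% errors in the last 10 minutes vs 5% over 24 hours -> alert
print(should_alert(30, 100, 50, 1000))  # True
```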

For that I tried to use trendline, with no luck, as I need a by clause (as per https://answers.splunk.com/answers/692424/trendline-to-work-grouped-by-field.html) and a few other things, and eventually came up with "something".
Then I started thinking about smoothing the 24-hour average so that spikes couldn't go undetected, and found the outlier command, which I used in a very naive way. The query I have at the moment is this one:

index="logger" "Raw Notification" 
| bin _time span=10m
| eval _raw=replace(_raw,"\\\\\"","\"") 
| rex "\"RawRequest\":\"(?<raw_request>.+)\"}$" 
| eval json= raw_request 
| spath input=json output=country_code path=customer.billingAddress.countryCode 
| spath input=json output=card_scheme path=paymentMethod.card.cardScheme 
| spath input=json output=acquirer_name path=processing.authResponse.acquirerName 
| spath input=json output=transaction_status path=transaction.status 
| spath input=json output=reason_messages path=history{}.reasonMessage
| eval acquirer= card_scheme . ":" . acquirer_name . ":" . country_code
| eval final_reason_message=mvindex(reason_messages, 1)
| eval error=if(like(transaction_status,"%FAILED%"),1,0)
| eval error_message=if(like(transaction_status,"%FAILED%"),final_reason_message, null()) 
| stats count as total sum(error) as errors mode(error_message) as most_common_error_message by _time, acquirer
| eval ten_minutes_error_rate=100*exact(errors)/exact(total) 
| outlier action=tf total errors
| sort 0 _time
| streamstats time_window=24h sum(total) as twentyfour_hours_total sum(errors) as twentyfour_hours_errors by acquirer
| eval twentyfour_hours_error_rate=100*exact(twentyfour_hours_errors)/exact(twentyfour_hours_total)
| eval outlier = ten_minutes_error_rate - twentyfour_hours_error_rate
| where outlier > 20
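In case it helps to reason about what the `streamstats time_window=24h` step is doing per acquirer, here is a rough Python equivalent of the trailing-window sums (a sketch with made-up bucket data, not how Splunk implements it):

```python
from collections import deque

def rolling_24h_rates(buckets):
    """buckets: list of (epoch_seconds, total, errors) tuples, one per
    10-minute bin for a single acquirer, sorted by time ascending.
    Returns the trailing-24h cumulative error rate at each bucket,
    mimicking `streamstats time_window=24h sum(total) sum(errors)`."""
    window = deque()              # buckets still inside the trailing 24h
    sum_total = sum_errors = 0
    rates = []
    for t, total, errors in buckets:
        window.append((t, total, errors))
        sum_total += total
        sum_errors += errors
        # evict buckets that fell out of the 24h window
        while window and window[0][0] <= t - 24 * 3600:
            _, old_total, old_errors = window.popleft()
            sum_total -= old_total
            sum_errors -= old_errors
        rates.append(100 * sum_errors / sum_total)
    return rates
```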

I would like to get some critique of it. With some sample data it detected what I expected it to detect, but I'm not sure whether I'm reinventing the wheel now that I know about the outlier command. (I don't think it's that easy to just use outlier to detect the outliers in my case: since I'm dealing with ratios, my understanding is that standard deviations and averages have to be thought through carefully, which is why I compared against the ratio itself and not the average of the ratios.) Does the query make any sense at all, with or without the outlier step?

Thanks
