Alerting

Smoothing a running average

Esperteyu
Explorer

Hi,

What I've been trying to do lately is create an alert on top of an errors/total ratio. The option I've focused on for the moment (not that I wouldn't like something more accurate if I can get it) is to alert when the ratio over the last 10 minutes exceeds the ratio over the last 24 hours by more than 20 percentage points.
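Put in SPL terms, the condition I'm after boils down to something like this (just a sketch of the idea; the field names are the ones computed in the query further down):

| eval ten_minutes_error_rate=100*errors/total
| eval twentyfour_hours_error_rate=100*twentyfour_hours_errors/twentyfour_hours_total
| where (ten_minutes_error_rate - twentyfour_hours_error_rate) > 20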

For that I tried to use trendline, with no luck, since I need a by clause (as per https://answers.splunk.com/answers/692424/trendline-to-work-grouped-by-field.html), and after a few other attempts I eventually came up with "something".
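For what it's worth, the grouped moving-average shape I was going for, using streamstats since it does take a by clause, would be something like this (the window of six 10-minute buckets is just an example):

| streamstats window=6 avg(ten_minutes_error_rate) as smoothed_error_rate by acquirer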
Then I started to think about smoothing the 24-hour average so that spikes wouldn't go undetected, and found the outlier command, which I used in a very naive way, so the query I have at the moment is this one:

index="logger" "Raw Notification" 
| bin _time span=10m
| eval _raw=replace(_raw,"\\\\\"","\"") 
| rex "\"RawRequest\":\"(?<raw_request>.+)\"}$" 
| eval json= raw_request 
| spath input=json output=country_code path=customer.billingAddress.countryCode 
| spath input=json output=card_scheme path=paymentMethod.card.cardScheme 
| spath input=json output=acquirer_name path=processing.authResponse.acquirerName 
| spath input=json output=transaction_status path=transaction.status 
| spath input=json output=reason_messages path=history{}.reasonMessage
| eval acquirer= card_scheme . ":" . acquirer_name . ":" . country_code
| eval final_reason_message=mvindex(reason_messages, 1)
| eval error=if(like(transaction_status,"%FAILED%"),1,0)
| eval error_message=if(like(transaction_status,"%FAILED%"),final_reason_message, null()) 
| stats count as total sum(error) as errors mode(error_message) as most_common_error_message by _time, acquirer
| eval ten_minutes_error_rate=100*exact(errors)/exact(total) 
| outlier action=TF total errors
| sort 0 _time
| streamstats time_window=24h sum(total) as twentyfour_hours_total sum(errors) as twentyfour_hours_errors by acquirer
| eval twentyfour_hours_error_rate=100*exact(twentyfour_hours_errors)/exact(twentyfour_hours_total)
| eval outlier = ten_minutes_error_rate - twentyfour_hours_error_rate
| where outlier > 20

I would like to get some criticism on it. With some sample data it detected what I expected it to detect, but I'm not sure whether I'm reinventing the wheel now that I know about the outlier command, or whether the query makes any sense at all, with or without the outlier. (I don't think it's that easy to just use outlier to detect the outliers in my case: since I'm working with ratios, my understanding is that the use of standard deviations and averages has to be thought through carefully, which is why I compared against the ratio of the sums rather than the average of the ratios.)
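For reference, the more textbook avg/stdev approach I was hesitant about would look roughly like this (the threshold of 3 standard deviations is just an arbitrary example):

| streamstats time_window=24h avg(ten_minutes_error_rate) as avg_rate stdev(ten_minutes_error_rate) as stdev_rate by acquirer
| where ten_minutes_error_rate > avg_rate + 3*stdev_rate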

Thanks
