Hi, I have a requirement to create an alert for some of my APIs that are monitored in Splunk.
I've created a search that counts the successes and failures for each API, calculates the failure rate, and triggers the alert when the rate exceeds 10%.
The problem is that the alert also fires for short-lived spikes. For example, the error rate jumps for 5 minutes and then recovers on its own. I don't want the alert to trigger in that situation, because it causes unnecessary callouts to people for investigations that aren't needed.
How can I create an alert that runs every 30 minutes, checks the failure rate for each 5-minute window within the last 30 minutes, and triggers only if the failure rate stays above the threshold for more than 15-20 minutes?
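One approach I'm considering (just a rough, untested sketch) is to bucket the last 30 minutes into 5-minute windows with bin, compute the failure rate per window per service, flag each window that breaches 10%, and then only keep services where at least 4 of the 6 windows (i.e. 20 minutes) are breached. The field names breached and breachedBuckets are just placeholders I made up, and the 30-minute range would normally come from the alert's time range rather than being hard-coded in the search:
index=api_prod (message.httpResponseCode=50* OR message.httpResponseCode=20*)
| rename message.serviceName as serviceName message.httpResponseCode as httpResponseCode
| bin _time span=5m
| stats count as totalrequests count(eval(like(httpResponseCode, "50%"))) as failedrequest by serviceName _time
| eval failureRatePercentage = round((failedrequest / totalrequests) * 100, 2)
| eval breached = if(failureRatePercentage > 10, 1, 0)
| stats sum(breached) as breachedBuckets avg(failureRatePercentage) as avgFailureRate by serviceName
| where breachedBuckets >= 4
Is something like this the right direction, or is there a better pattern for checking that a failure rate is sustained rather than a brief spike?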
This is my base search:
index=api_prod (message.httpResponseCode=50* OR message.httpResponseCode=20*)
| rename message.serviceName as serviceName message.httpResponseCode as httpResponseCode
| stats count as totalrequests count(eval(like(httpResponseCode, "20%"))) as successrequest count(eval(like(httpResponseCode, "50%"))) as failedrequest by serviceName
| eval Total = successrequest + failedrequest
| eval failureRatePercentage = round(((failedrequest/totalrequests) * 100),2)
| where failureRatePercentage > 10
| fields - Total
| table serviceName, totalrequests, successrequest, failedrequest, failureRatePercentage
Any guidance is really appreciated.
Best Regards,
Shashank