Alerting

Alert when request time taken is above threshold for specified consecutive requests

alex_egyed
Engager

I'm trying to set up an alert for this use case:

When the request time taken for an API is above an X-second threshold for Y consecutive GET/POST/PUT requests, send an alert.

The challenges I'm facing come from having multiple APIs, multiple HTTP methods, and, per API, separate thresholds for both seconds and consecutive-request counts. The thresholds are declared in a .csv file that anyone can easily update and then upload as a lookup table.

| api | GET_time_threshold_s | GET_count_consecutive_overtime_threshold | POST_time_threshold_s | POST_count_consecutive_overtime_threshold | PUT_time_threshold_s | PUT_count_consecutive_overtime_threshold |
|---|---|---|---|---|---|---|
| OrdersApi | 0.5 | 7 | 0.8 | 5 | 1.5 | 3 |

So far I've come up with a solution that works for a single API only, but I'm unsure what the lowest-maintenance approach would be. I don't know how to pass a lookup table field to the window argument of the streamstats command, so I created a separate query to generate the search command.

Generate search query

| inputlookup api_lookup_with_thresholds.csv
| where api="OrdersApi"
| eval query="sourcetype=IIS host=\"Prod*\" api=\"OrdersApi\" 
| eval time_taken_s = round(time_taken/1000, 3) 
| lookup api_lookup_with_thresholds.csv api 
| eval is_GET_time_over_threshold=if(cs_method=\"GET\" AND time_taken_s >= GET_time_threshold_s, 1, 0), 
    is_POST_time_over_threshold=if(cs_method=\"POST\" AND time_taken_s >= POST_time_threshold_s, 1, 0), 
    is_PUT_time_over_threshold=if(cs_method=\"PUT\" AND time_taken_s >= PUT_time_threshold_s, 1, 0) 
| sort +_time 
| streamstats window=" + GET_count_consecutive_overtime_threshold  + " global=false sum(is_GET_time_over_threshold)  as rolling_over_GET_threshold  by api, cs_method 
| streamstats window=" + POST_count_consecutive_overtime_threshold + " global=false sum(is_POST_time_over_threshold) as rolling_over_POST_threshold by api, cs_method 
| streamstats window=" + PUT_count_consecutive_overtime_threshold  + " global=false sum(is_PUT_time_over_threshold)  as rolling_over_PUT_threshold  by api, cs_method 
| table _time, api, cs_method, time_taken_s, rolling_over_GET_threshold, rolling_over_POST_threshold, is_GET_time_over_threshold, is_POST_time_over_threshold" 
| return $query

The result would be a query like below that targets only OrdersApi.

Monitor search query

sourcetype=IIS host="Prod*" api="OrdersApi" 
| eval time_taken_s = round(time_taken/1000, 3) 
| lookup api_lookup_with_thresholds.csv api 
| eval is_GET_time_over_threshold=if(cs_method="GET" AND time_taken_s >= GET_time_threshold_s, 1, 0),
  is_POST_time_over_threshold=if(cs_method="POST" AND time_taken_s >= POST_time_threshold_s, 1, 0),
  is_PUT_time_over_threshold=if(cs_method="PUT" AND time_taken_s >= PUT_time_threshold_s, 1, 0) 
| sort +_time 
| streamstats window=7 global=false sum(is_GET_time_over_threshold)  as rolling_over_GET_threshold  by api, cs_method 
| streamstats window=5 global=false sum(is_POST_time_over_threshold) as rolling_over_POST_threshold by api, cs_method 
| streamstats window=3 global=false sum(is_PUT_time_over_threshold)  as rolling_over_PUT_threshold  by api, cs_method

Is there a way to execute the generated search command from another search? Is there a better way to solve the use case while keeping maintenance as low as possible? Should I consider using the REST API to generate all the searches automatically?
I'm looking for a solution where uploading a new .csv file doesn't require updating all the search queries.

As an alternative, I thought of saving the search above as a saved search with api, get_window, post_window, and put_window parameters and calling it from another search, one per API, but I couldn't read the values from the lookup table and pass them to the saved search.
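One possible way to drive a search per lookup row is Splunk's map command, which runs its search string once per input result and substitutes $field$ tokens with that row's values before execution. A hedged sketch for the GET case only (POST/PUT would be analogous; maxsearches and the field names assume the lookup above):

| inputlookup api_lookup_with_thresholds.csv
| map maxsearches=20 search="search sourcetype=IIS host=\"Prod*\" api=\"$api$\"
    | eval time_taken_s=round(time_taken/1000, 3)
    | eval is_GET_time_over_threshold=if(cs_method=\"GET\" AND time_taken_s>=$GET_time_threshold_s$, 1, 0)
    | sort 0 +_time
    | streamstats window=$GET_count_consecutive_overtime_threshold$ global=false sum(is_GET_time_over_threshold) as rolling_over_GET_threshold by api, cs_method"

Because the token substitution happens before the inner search is parsed, the window= argument can come from the lookup row, which a plain subsearch cannot do. Note that map launches one subsearch per row, so it can be expensive with many APIs.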

1 Solution

DalJeanis
Legend

1) Restructure your file like so -

reqapi reqtype reqtrigger reqcount

This is not completely necessary, but it will help your brain see the simplicity of the solution.
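Applied to the OrdersApi row from the original table, the restructured lookup would look like this (using reqtrigger as the threshold column name, since the lookup below outputs reqtrigger):

reqapi,reqtype,reqtrigger,reqcount
OrdersApi,GET,0.5,7
OrdersApi,POST,0.8,5
OrdersApi,PUT,1.5,3

One row per API/method pair means adding a new API or method is just adding rows, with no change to the search.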

2) Then try this...

your search that gets _time, reqapi, reqtype and reqelapsed

| rename COMMENT as "first we put the records into order" 
| sort 0 reqapi reqtype _time   

| rename COMMENT as "now we look up the trigger time and flag the records which qualify for the trigger" 
| lookup mylookup reqapi reqtype OUTPUT reqtrigger reqcount
| eval overtime=if(reqelapsed>=reqtrigger,1,0)

| rename COMMENT as "use streamstats to check whether the record is different from the prior record" 
| streamstats current=f last(overtime) as priortime by reqapi reqtype
| eval newgroup=if(overtime=priortime,0,1) 
| streamstats sum(newgroup) as groupno by reqapi reqtype

| rename COMMENT as "figure out how many records belong to the group"
| rename COMMENT as "and let a trigger=1 group pass if it's bigger than the required count" 
| eventstats count as groupsize by reqapi reqtype groupno
| where (groupsize >= reqcount) AND (overtime=1)
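Mapped onto the original IIS data (assuming cs_method plays the role of reqtype, time_taken is in milliseconds, and the restructured lookup is uploaded as a hypothetical api_thresholds.csv), the same pipeline might look like:

sourcetype=IIS host="Prod*"
| eval reqapi=api, reqtype=cs_method, reqelapsed=round(time_taken/1000, 3)
| sort 0 reqapi reqtype _time
| lookup api_thresholds.csv reqapi reqtype OUTPUT reqtrigger reqcount
| eval overtime=if(reqelapsed>=reqtrigger, 1, 0)
| streamstats current=f last(overtime) as priortime by reqapi reqtype
| eval newgroup=if(overtime=priortime, 0, 1)
| streamstats sum(newgroup) as groupno by reqapi reqtype
| eventstats count as groupsize by reqapi reqtype groupno
| where groupsize>=reqcount AND overtime=1

The run-length trick is the key design choice: instead of a fixed streamstats window per method, newgroup increments whenever overtime flips, so groupno labels each run of consecutive over-threshold requests and groupsize measures its length, letting reqcount come straight from the lookup.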



DalJeanis
Legend

@alex_egyed - did you get everything you needed?
