I'm trying to set up an alert for this use case:
When the request time taken for an API is above an X-second threshold for Y consecutive requests (on a GET/POST/PUT request), send an alert.
The challenges I'm facing come from having multiple APIs, multiple HTTP methods, and a separate time threshold and consecutive-request threshold for each. The thresholds are declared in a .csv file that can be easily updated by anyone and then uploaded as a lookup table.
| api | GET_time_threshold_s | GET_count_consecutive_overtime_threshold | POST_time_threshold_s | POST_count_consecutive_overtime_threshold | PUT_time_threshold_s | PUT_count_consecutive_overtime_threshold |
|---|---|---|---|---|---|---|
| OrdersApi | 0.5 | 7 | 0.8 | 5 | 1.5 | 3 |
So far I came up with a solution that works just for a single API, but I'm unsure what the best, lowest-maintenance solution is. I don't know how to pass a lookup table field to the `window` argument of the `streamstats` command, so I created a separate query that generates the search command.
Generate search query
| inputlookup api_lookup_with_thresholds.csv
| where api="OrdersApi"
| eval query="sourcetype=IIS host=\"Prod*\" api=\"OrdersApi\"
| eval time_taken_s = round(time_taken/1000, 3)
| lookup api_lookup_with_thresholds.csv api
| eval is_GET_time_over_threshold=if(cs_method=\"GET\" AND time_taken_s >= GET_time_threshold_s, 1, 0),
is_POST_time_over_threshold=if(cs_method=\"POST\" AND time_taken_s >= POST_time_threshold_s, 1, 0),
is_PUT_time_over_threshold=if(cs_method=\"PUT\" AND time_taken_s >= PUT_time_threshold_s, 1, 0)
| sort +_time
| streamstats window=" + GET_count_consecutive_overtime_threshold + " global=false sum(is_GET_time_over_threshold) as rolling_over_GET_threshold by api, cs_method
| streamstats window=" + POST_count_consecutive_overtime_threshold + " global=false sum(is_POST_time_over_threshold) as rolling_over_POST_threshold by api, cs_method
| streamstats window=" + PUT_count_consecutive_overtime_threshold + " global=false sum(is_PUT_time_over_threshold) as rolling_over_PUT_threshold by api, cs_method
| table _time, api, cs_method, time_taken_s, rolling_over_GET_threshold, rolling_over_POST_threshold, is_GET_time_over_threshold, is_POST_time_over_threshold"
| return $query
The result would be a query like below that targets only OrdersApi.
Monitor search query
sourcetype=IIS host="Prod*" api="OrdersApi"
| eval time_taken_s = round(time_taken/1000, 3)
| lookup api_lookup_with_thresholds.csv api
| eval is_GET_time_over_threshold=if(cs_method="GET" AND time_taken_s >= GET_time_threshold_s, 1, 0),
is_POST_time_over_threshold=if(cs_method="POST" AND time_taken_s >= POST_time_threshold_s, 1, 0),
is_PUT_time_over_threshold=if(cs_method="PUT" AND time_taken_s >= PUT_time_threshold_s, 1, 0)
| sort +_time
| streamstats window=7 global=false sum(is_GET_time_over_threshold) as rolling_over_GET_threshold by api, cs_method
| streamstats window=5 global=false sum(is_POST_time_over_threshold) as rolling_over_POST_threshold by api, cs_method
| streamstats window=3 global=false sum(is_PUT_time_over_threshold) as rolling_over_PUT_threshold by api, cs_method
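To actually fire an alert from the rolling sums, a final condition could compare each rolling count to its consecutive-count threshold from the lookup (a sketch using the fields from the query above; a rolling sum can only reach the window size when every request in the window was over the time threshold):

```
| where (cs_method="GET"  AND rolling_over_GET_threshold  >= GET_count_consecutive_overtime_threshold)
     OR (cs_method="POST" AND rolling_over_POST_threshold >= POST_count_consecutive_overtime_threshold)
     OR (cs_method="PUT"  AND rolling_over_PUT_threshold  >= PUT_count_consecutive_overtime_threshold)
```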
Is there a way to execute the generated search command in another search? Is there a better way to solve the use case while keeping maintenance as low as possible? Should I think about using the API to generate all the searches automatically?
I'm trying to find a solution where uploading a new .csv file doesn't require updating all the search queries.
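One direction I've been considering for the "execute the generated search" part is the `map` command, which runs a subsearch for each input row and substitutes `$field$` tokens before the subsearch runs, so the window values could come straight from the lookup. A rough, untested sketch (shown for GET only):

```
| inputlookup api_lookup_with_thresholds.csv
| map maxsearches=20 search="search sourcetype=IIS host=\"Prod*\" api=\"$api$\"
    | eval time_taken_s = round(time_taken/1000, 3)
    | streamstats window=$GET_count_consecutive_overtime_threshold$ global=false
        sum(eval(if(cs_method=\"GET\" AND time_taken_s >= $GET_time_threshold_s$, 1, 0))) as rolling_over_GET_threshold by api, cs_method"
```

But I'm not sure this is the right tool, given `map`'s subsearch limits.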
As an alternative solution I was thinking of saving the search above as a savedsearch with `api`, `get_window`, `post_window`, and `put_window` parameters and calling it from another search, one per API, but I couldn't read the values from the lookup table and pass them to the saved search.
1) Restructure your file like so -
reqapi reqtype reqtrigger reqcount
This is not completely necessary, but it will help your brain see the simplicity of the solution.
2) Then try this...
your search that gets _time, reqapi, reqtype and reqelapsed
| rename COMMENT as "first we put the records into order"
| sort 0 reqapi reqtype _time
| rename COMMENT as "now we look up the trigger time and flag the records which qualify for the trigger"
| lookup mylookup reqapi reqtype OUTPUT reqtrigger reqcount
| eval overtime=if(reqelapsed>=reqtrigger,1,0)
| rename COMMENT as "use streamstats to check whether the record is different from the prior record"
| streamstats current=f last(overtime) as priortime by reqapi reqtype
| eval newgroup=if(overtime=priortime,0,1)
| streamstats sum(newgroup) as groupno by reqapi reqtype
| rename COMMENT as "figure out how many records belong to the group"
| rename COMMENT as "and let an overtime=1 group pass if it is at least as big as the required count"
| eventstats count as groupsize by reqapi reqtype groupno
| where (groupsize >= reqcount) AND (overtime=1)
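Using the numbers from the original table, the restructured lookup file would look something like this (`reqtrigger` and `reqcount` being the names the search above expects; one row per API and method):

```csv
reqapi,reqtype,reqtrigger,reqcount
OrdersApi,GET,0.5,7
OrdersApi,POST,0.8,5
OrdersApi,PUT,1.5,3
```

Adding a new API or changing a threshold then only means editing the CSV - the search itself never changes.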
@alex_egyed - did you get everything you needed?