Splunk Search

How to calculate limits for agents based on their activity?

exmuzzy
Explorer

I want to receive notifications if agents fall below or exceed their normal activity for the current day of the week and hour.
My plan is to build a daily lookup table containing the perc90 event counts per day of the week / hour.

The search to create the daily lookup table looks like this:

index=api* 
| stats count(eval(result=="OK")) as success, count(eval(result!="OK")) as fail by agent, method, date_month, date_mday, date_wday, date_hour, date_minute 
| stats min(success) as min_success, min(fail) as min_fail, perc90(success) as perc90_success, perc90(fail) as perc90_fail, max(success) as max_success, max(fail) as max_fail by agent, method, date_wday, date_hour, date_minute 
| strcat agent "@" method "@" date_wday date_hour ":" date_minute key 
| fields key, min_success, min_fail, perc90_success, perc90_fail, max_success, max_fail 
| outputlookup agent_limits append=false key_field=key

But now I'm not sure how to move on.
How do I get an overview showing whether agent activity is within its normal range?
Should it be an alert, a dashboard, or a pivot?

And how do I correct the limits manually if an agent has gone beyond its normal activity range, but that turns out to be a new norm or just growth over time?
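For example, I imagine the comparison search (scheduled as an alert) could look roughly like this, reusing the key from my lookup-building search above and assuming agent_limits is defined as a lookup; the exact threshold fields are just my assumption of how it might fit together:

index=api*
| stats count(eval(result=="OK")) as success, count(eval(result!="OK")) as fail by agent, method, date_wday, date_hour, date_minute
| strcat agent "@" method "@" date_wday date_hour ":" date_minute key
| lookup agent_limits key OUTPUT min_success, max_success, min_fail, max_fail
| where success < min_success OR success > max_success OR fail > max_fail

Each live count would be matched against the stored limits for the same agent / method / weekday / time slot.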


DalJeanis
Legend

You're a bit deep into the weeds. The first thing you need to do is understand your data.

What you have there is the minimum, maximum, and 90th percentile of activity for that particular hour and minute of that day of the week. Obviously, I don't know your data, but is that really a reasonable representation of how your data changes? Is 4 AM on Monday that much different from 4 AM on Tuesday? Is 4:03 AM on Wednesday really that different from 4:05 AM on Wednesday?

First, run this and look at the results. What is the shape of your data?

index = api*  agent=* method=*
| bin _time span=1m
| stats count(eval(result=="OK")) as success, count(eval(result!="OK")) as fail by agent, method, date_wday, date_hour, date_minute, _time
| eval total= success+fail
| eval good = round(100*success/total,0)
| eval bad = round(100*fail/total,0)
| stats min(success) as min_success, min(fail) as min_fail, min(total) as min_total, 
    perc90(success) as perc90_success, perc90(fail) as perc90_fail, perc90(total) as perc90_total, 
    max(success) as max_success, max(fail) as max_fail, max(total) as max_total, 
    avg(success) as avg_success, avg(fail) as avg_fail, avg(total) as avg_total, 
    stdev(success) as stdev_success, stdev(fail) as stdev_fail, stdev(total) as stdev_total, 
    perc90(good) as perc90_good, perc90(bad) as perc90_bad,
    min(good) as min_good, min(bad) as min_bad,
    max(good) as max_good, max(bad) as max_bad,
    avg(good) as avg_good, avg(bad) as avg_bad,
    stdev(good) as stdev_good, stdev(bad) as stdev_bad,
    max(_time) as _time
    by agent, method, date_wday, date_hour, date_minute 
| table _time agent method min_* perc* max_* avg* stdev* date*
| outputcsv mytest.csv 

That gives you a csv file of your results for this particular moment in time, back one full week. Then you can use the csv of results to explore the data. That will give you a picture of your average week. From there, there are literally hundreds of ways to go, depending on what you find.
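For example, something like this (just a sketch; swap in whichever agent and method you actually want to look at) lets you eyeball one slice of the week from that csv:

| inputcsv mytest.csv
| where agent="someAgent" AND method="someMethod"
| eval slot = date_wday." ".date_hour.":".date_minute
| table slot min_total avg_total perc90_total max_total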

You are trying to understand the features of your data, and find out what is most important. How does the total number of transactions ebb and flow? Are there times of day that are different in character? Are there agents or methods that are more volatile than others? More predictable?
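For instance, a rough way to rank volatility from that same csv (a sketch, shown for the total fields only) is the coefficient of variation per slot, averaged per agent and method:

| inputcsv mytest.csv
| eval cv_total = round(stdev_total / avg_total, 2)
| stats avg(cv_total) as avg_cv_total by agent, method
| sort - avg_cv_total

The higher the number, the less predictable that agent/method combination is from minute to minute.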


DalJeanis
Legend

Is the number of transactions that are good most important, or the percentage?

How does the 90th percentile relate to the avg and the stdev? Basically, this gives you the z score of the 90th percentile:

| eval z_x = round((perc90_x - avg_x) / stdev_x, 2)

That's usually going to be about 1.28, but the amount by which it differs might tell you something about the distribution you are looking at.
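Applied to the csv above, that check could look something like this (a sketch, shown for the total fields only):

| inputcsv mytest.csv
| eval z_total = round((perc90_total - avg_total) / stdev_total, 2)
| stats min(z_total) as min_z, avg(z_total) as avg_z, max(z_total) as max_z by agent, method

Combinations whose z values sit well away from 1.28 probably have a skewed rather than roughly normal distribution, so a plain avg/stdev band won't describe them well.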
