Recently, we implemented AlertThrottle, a terrific little app that limits (in our case) the number of emails sent when something is past a particular value. The second half of our task is to identify events that stay below a threshold but represent a significant jump in value.
Example: A disk is 20% full and jumps to 50% in the space of an hour, but that doesn't trigger the disk alert, which is set at 85%.
I am looking for guidance or suggestions on how to go about this. Basically, we are comparing two values over a period of time, and if the change exceeds a limit within that moving time frame, an alert is triggered.
You can calculate the percent difference over your time range and then alert if the difference is higher than your threshold amount. For example, let's say you want to know the % difference of the disk_space field:
your search | stats range(disk_space) as difference list(disk_space) as list | streamstats max(list) as maxSelect window=1 | eval percent_difference=((difference/maxSelect)*100)
Then set an alert condition that fires when percent_difference > 30 to alert on a change of more than 30%.
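To make the arithmetic concrete, here is a small sketch that runs the same calculation on two synthetic readings matching the disk example from the question (20, then 50). The makeresults events and the counter field n are made up for illustration, and max() is applied directly in stats rather than via list and streamstats, which should give the same result here.

```spl
| makeresults count=2
| streamstats count as n
| eval disk_space=if(n=1, 20, 50)
| stats range(disk_space) as difference max(disk_space) as maxSelect
| eval percent_difference=((difference/maxSelect)*100)
```

With these two readings, difference is 30 and maxSelect is 50, so percent_difference comes out at 60, which would trip a > 30 condition. Note that range() is max minus min, so a drop from 50 to 20 yields the same figure; comparing latest(disk_space) against earliest(disk_space) is one way to limit the alert to increases.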
This is really great @ftk! I re-purposed it for a SQL replication alert that is often very spiky (values from 0 up to 25 000 and back to 0 in a 3-minute span). I changed the logic a little to ensure we had an actual problem for a set period of time. The following search is for 5 data points over a 5-minute time frame.
index=sql source=blah sourcetype=sp_pendingcmds
| where pendingcmdcount>= 10000
| stats range(pendingcmdcount) as difference list(pendingcmdcount) as list
| streamstats latest(list) as maxSelect window=1, count(list) as listcount
| where listcount>=5
| eval percent_difference=((difference/maxSelect)*100)
I then set the alert to check percent_difference > 50.
It's working a treat. I hope it helps someone else.
This is pretty good; however, you are still specifying a fixed threshold for the alert condition (in this case, 30%). How do you know if 30% is the right choice?
What's more effective is to use an anomaly detection approach to determine whether the current data is statistically unlikely given observed past behavior/values. That type of analysis is inherently hard to do on your own, but there's an app called Prelert Anomaly Detective that will do it for you!
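For a rough sense of what a statistical baseline looks like in plain SPL, the sketch below compares each hourly reading against the average and standard deviation of the previous 24 hourly readings for the same host. It is far cruder than a dedicated anomaly detection app, and the 7-day range, 1-hour span, 24-point window and cutoff of 3 are all arbitrary assumptions to tune for your data.

```spl
your search earliest=-7d
| bin _time span=1h
| stats latest(disk_space) as disk_space by _time, host
| sort 0 _time
| streamstats current=f window=24 avg(disk_space) as baseline_avg stdev(disk_space) as baseline_stdev by host
| eval zscore=if(baseline_stdev>0, (disk_space-baseline_avg)/baseline_stdev, null())
| where zscore > 3
```

Rows that survive the final where are readings more than 3 standard deviations above that host's recent baseline, so no fixed percentage has to be chosen up front, although the cutoff itself is still a judgment call.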
I imagine you could do your stats on a per-server basis:
| stats range(disk_space) as difference list(disk_space) as list by host
This works for one host-filesystem pair but falls apart when the search contains results from many different filesystems/hosts. Is there any way to account for that besides limiting the search to one host/filesystem?
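One way to cover many hosts and filesystems in a single search is to carry the grouping through with a by clause, so each host/filesystem pair gets its own row and its own percentage. This is a sketch only: the filesystem field name is an assumption (use whatever field identifies the mount point in your data), and max() is computed directly in stats instead of going through list and streamstats.

```spl
your search
| stats range(disk_space) as difference max(disk_space) as maxSelect by host, filesystem
| eval percent_difference=((difference/maxSelect)*100)
| where percent_difference > 30
```

Each remaining row is a host/filesystem pair whose values moved by more than 30% of its maximum within the search window, so the alert can simply trigger when the number of results is greater than 0.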