Splunk Search

How to edit my real time alert to trigger when average CPU and memory usage exceeds 70% in a 2 minute span?

DPZ_Luke
Explorer

I want an alert thrown whenever a two minute interval shows the average CPU and average Memory usage both exceeding 70%.
But I am stymied in that the append command does not work in real-time searches, and I can't figure out an alternative.
I've tried many variations of the following, to no avail.

[host="HOST1" sourcetype="Perfmon:CPU Load" counter="% Processor Time" | bucket _time span=2m | where Value > 70] AND [host="HOST1" source="Perfmon:Memory" collection=Memory object=Memory counter="% Committed Bytes In Use" |bucket _time span=2m |where Value > 70]

This seems like something that should be very common.
Any ideas?

0 Karma
1 Solution

DalJeanis
SplunkTrust

My assumption is that you will always have at least one record for each host in every 10-second period. The code can be adjusted to longer pulse-time if needed.

  your base search providing _time, host, CPU% (where 70 means 70%) and memory% (where 70 means 70%)
| table _time host pctCPU pctMemory
| bucket _time span=10s
| stats avg(pctCPU) as pctCPU1, max(pctCPU) as maxCPU1, 
    avg(pctMemory) as pctMemory1, max(pctMemory) as maxMemory1 by host _time 

The above code takes whatever records are returned by your base search, and chunks them up to a 10-second pulse, showing the average and max stats across that pulse.

Now you have one record per host every 10 seconds. You can proceed here two different ways. You can either use streamstats with a rolling two minute window, or you can set fixed two-minute windows.

Fixed Period Method: now we bucket it up to the 2-minute mark (but use eventstats to retain the individual 10s-level records for inspection later)...

| bucket span=2m _time as time2
| eventstats avg(pctCPU1) as pctCPU2, max(maxCPU1) as maxCPU2, 
    avg(pctMemory1) as pctMemory2, max(maxMemory1) as maxMemory2  by host time2

Rolling Window Method...or we use streamstats with a 2-minute rolling window (which is 12 of the ten-second chunks) ...

| streamstats window=12 global=false avg(pctCPU1) as pctCPU2, max(maxCPU1) as maxCPU2, 
    avg(pctMemory1) as pctMemory2, max(maxMemory1) as maxMemory2  by host

.. and in either of the above cases, the following selects the records to alert on...

| where  pctCPU2>=70 AND pctMemory2>=70


0 Karma

DPZ_Luke
Explorer

Thanks for the update. I've made those changes.
However, it seems to be doing something unexpected. It is generating alerts for events that occurred yesterday that met the criteria. I kinda expected it to only trigger if new events are flagged.
It makes me wonder, is it rerunning the search for the entirety of log history at every cron interval and then triggering on each instance?
How do I configure it to only recent/new events?

0 Karma

DPZ_Luke
Explorer

Thanks all, between all the help, I got it working. Final Query:
* host="HOST1" sourcetype="Perfmon:CPU Load" object=Processor counter="% Processor Time" instance=_Total |bucket _time span=2m | eval PercentProcessorTime=Value | append [search host="HOST1" source="Perfmon:Memory" collection=Memory object=Memory counter="% Committed Bytes In Use" |bucket _time span=2m | eval PercentCommittedBytesInUse=Value] | streamstats avg(PercentProcessorTime) as "CPU",avg(PercentCommittedBytesInUse) as "Memory" by host window=2 | table _time CPU Memory | where CPU > 70 AND Memory > 70

with a scheduled alert running on cron expression */2 * * * * (every 2 minutes)

0 Karma

DalJeanis
SplunkTrust

Don't use streamstats in that scenario; just use stats. That "window=2" only gets you the last two records, whatever they are and in whatever order they arrive. It does not give you an actual average across 2 minutes, just an average across every pair of records (by host) that flows through the stream.

host="HOST1" (sourcetype="Perfmon:CPU Load" object="Processor" counter="% Processor Time") 
 OR (sourcetype="Perfmon:Memory" collection=Memory object=Memory counter="% Committed Bytes In Use")
 | eval PercentProcessorTime=if(sourcetype="Perfmon:CPU Load",Value,null())
 | eval PercentCommittedBytesInUse=if(sourcetype="Perfmon:Memory",Value,null())
 | bucket _time span=2m
 | stats avg(PercentProcessorTime) as CPU, avg(PercentCommittedBytesInUse) as Memory by host _time
 | where CPU>70 AND Memory>70

Update, March 24: even worse, it turns out that | streamstats avg(...) by host window=2 calculates the average of however many of the last two records in the stream have the same host value as the current record, i.e. sometimes one, sometimes two. You can only use that combination if you have sorted (or run stats) immediately beforehand so that the records are in host and time order, or if you set global=false so that Splunk keeps a separate window=2 for each host rather than one window across the whole stream (global=true is the default).
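
A small self-contained sketch of that difference, using makeresults to fabricate six synthetic events that alternate between two made-up hosts, A and B. None of the field names here (n, val, avg_global, avg_per_host) come from the thread; they are placeholders for illustration. With the default global=true, the two-record window spans the whole stream, so each average should only see the current record; with global=false, each host should get its own two-record window.

```spl
| makeresults count=6
| streamstats count as n
| eval host=if(n%2==1, "A", "B"), val=n*10
| streamstats window=2 global=true avg(val) as avg_global by host
| streamstats window=2 global=false avg(val) as avg_per_host by host
| table n host val avg_global avg_per_host
```

Comparing the avg_global and avg_per_host columns side by side is a quick way to check which behavior a given Splunk version actually applies before relying on it in an alert.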

0 Karma

DPZ_Luke
Explorer

Well, that is good to know... I have a search like this in a dashboard, modified to try for an alert every 3 minutes (cron expression). I made the values super low to ensure it hits constantly, as a test, but this is not even triggering even though the values are low enough.

  • host="HOST1" sourcetype="Perfmon:CPU Load" object=Processor counter="% Processor Time" instance=_Total |bucket _time span=2m| eval PercentProcessorTime=Value | append [search host="HOST1" source="Perfmon:Memory" collection=Memory object=Memory counter="% Committed Bytes In Use" | eval PercentCommittedBytesInUse=Value] | stats avg(PercentProcessorTime) as "CPU",avg(PercentCommittedBytesInUse) as "Memory" | table _time CPU Memory |where CPU > 5 AND Memory > 25

Any ideas?

0 Karma

woodcock
Esteemed Legend

The problem is that you are using real-time; don't. It is WAY more trouble than it is worth. Instead run a search over the last X minutes and then schedule it for every X/2 minutes. Try to make X as big as you can stomach.
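
As a rough sketch of that advice with X = 4: the time modifiers below restrict each run to the last four whole minutes, and the alert would be scheduled every two minutes (cron */2 * * * *). The sourcetypes and the Value field are taken from the queries earlier in this thread; X = 4 and the CPU and Memory field names are assumptions chosen only for illustration.

```spl
host="HOST1" earliest=-4m@m latest=@m
    ((sourcetype="Perfmon:CPU Load" object="Processor" counter="% Processor Time")
    OR (sourcetype="Perfmon:Memory" collection=Memory object=Memory counter="% Committed Bytes In Use"))
| eval CPU=if(sourcetype="Perfmon:CPU Load", Value, null())
| eval Memory=if(sourcetype="Perfmon:Memory", Value, null())
| bucket _time span=2m
| stats avg(CPU) as CPU, avg(Memory) as Memory by host _time
| where CPU>70 AND Memory>70
```

Bounding the time range this way keeps a scheduled run from re-evaluating older history. Because consecutive runs overlap by two minutes, the same bucket can match twice, so some form of alert throttling may be wanted.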

0 Karma

DPZ_Luke
Explorer

DalJeanis,
I could see your technique working, except with my example, how do you specify what value is pctCPU and which is pctMemory?

This is my base search.
host="HOST1" (sourcetype="Perfmon:CPU Load" object="Processor" counter="% Processor Time") OR (sourcetype="Perfmon:Memory" collection=Memory object=Memory counter="% Committed Bytes In Use")

0 Karma

DalJeanis
SplunkTrust

First do this and find the name of the value field that has the CPU percent...

host="HOST1" (sourcetype="Perfmon:CPU Load" object="Processor" counter="% Processor Time") | head 1

Then do this and find the name of the value field that has the memory percent...

host="HOST1" (sourcetype="Perfmon:Memory" collection=Memory object=Memory counter="% Committed Bytes In Use") | head 1

...then this becomes your base search...

host="HOST1" (sourcetype="Perfmon:CPU Load" object="Processor" counter="% Processor Time") 
OR (sourcetype="Perfmon:Memory" collection=Memory object=Memory counter="% Committed Bytes In Use")
| eval pctCPU=if(sourcetype="Perfmon:CPU Load",    .....CPU field name.....  , null())
| eval pctMemory=if(sourcetype="Perfmon:Memory",   ....Memory field name...  , null())
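
In DPZ_Luke's own queries in this thread, both sourcetypes carry the reading in a field called Value. Assuming that holds, the base search and the rolling-window method could be assembled roughly as follows. This is a sketch, not something tested against real Perfmon data: the pulse is widened to 1 minute (an assumption, since CPU is reportedly sampled once a minute), so a two-minute window is two pulses, and the explicit sort is added so that each host's records reach streamstats in time order.

```spl
host="HOST1" ((sourcetype="Perfmon:CPU Load" object="Processor" counter="% Processor Time")
    OR (sourcetype="Perfmon:Memory" collection=Memory object=Memory counter="% Committed Bytes In Use"))
| eval pctCPU=if(sourcetype="Perfmon:CPU Load", Value, null())
| eval pctMemory=if(sourcetype="Perfmon:Memory", Value, null())
| bucket _time span=1m
| stats avg(pctCPU) as pctCPU1, avg(pctMemory) as pctMemory1 by host _time
| sort 0 host _time
| streamstats window=2 global=false avg(pctCPU1) as pctCPU2, avg(pctMemory1) as pctMemory2 by host
| where pctCPU2>=70 AND pctMemory2>=70
```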
0 Karma

niketn
Legend

[Updated - Included Stats command, changed span=1m]

You can try something like the following. I have used timechart but you can use stats. Also depending upon whether you want to use the query for Dashboard or Alert, you can set the time range for the search and might not need explicit span=1m or bucket _time span=1m

host="HOST1" (sourcetype="Perfmon:CPU Load" object="Processor" counter="% Processor Time") OR (sourcetype="Perfmon:Memory" collection=Memory object=Memory counter="% Committed Bytes In Use")
| search Value>70
| eval metrics=object." - ".Value
| timechart span=1m count as eventcount values(metrics) as metrics
| search eventcount >1

Stats Command

source="Perfmon:*" (object="Processor*" counter="% Processor Time") OR (object=Memory counter="% Committed Bytes In Use")
| search Value>70 
| eval metrics=object." - ".Value
| bin span=1m _time 
| stats count as eventcount values(metrics) as metrics by _time
| search eventcount>1
0 Karma

DPZ_Luke
Explorer

This is for an alert. I'm a newb so not sure when to use timechart or stats.
The above solution is closer but it fails because I record CPU every 1 minute and the span=2m is picking up the CPU value at, for example, 12:00:00 and 12:01:00 so it always hits even though that is two CPU hits and not one CPU/one Memory.
I've tried changing span to 1m and it works but fires off alerts every 5 seconds, for the same value, so that is no good.
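
One way to avoid treating two CPU readings as a match is to count distinct metric types rather than events. A sketch, assuming (as in the query above) that the object field distinguishes Processor events from Memory events and that Value holds the reading; metricTypes is a made-up field name. Note that this still tests individual readings against the threshold, not the two-minute averages.

```spl
host="HOST1" ((sourcetype="Perfmon:CPU Load" object="Processor" counter="% Processor Time")
    OR (sourcetype="Perfmon:Memory" collection=Memory object=Memory counter="% Committed Bytes In Use"))
| where Value>70
| eval metrics=object." - ".Value
| bin span=2m _time
| stats dc(object) as metricTypes values(metrics) as metrics by host _time
| where metricTypes>=2
```

A bucket only survives the final filter when both a Processor reading and a Memory reading above 70 fall inside it.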

0 Karma

niketn
Legend

@DPZ_Luke, first of all, sorry: the search should have filtered for Value greater than 70. I have corrected that in my answer and have also added a query using the stats command. I am hoping you run the query over the last 1 minute with span=1m.

Based on your scenario you should go for streamstats (@DalJeanis has suggested one). How soon do you want your search queries to run (or what is the schedule for your alert)?
What is the period you want to look back to identify threshold breach for both Memory and CPU?

For example, you could run the search:
1) every 2 minutes for the past 2 minutes, or
2) every 2 minutes for the past 3 minutes, or
3) every minute for the past 1 minute, or
4) every minute for the past 2 minutes

0 Karma