Splunk Search

How do I alert if CPU is greater than 97% for more than 15 minutes?

matthew_foos
Path Finder

Splunkers,

I'm looking for some kind of time modifier that will make the following alert fire only if CPU has been at 97% or higher for more than 15 minutes.

Here is the search string I've started working with:

index=perfmon sourcetype="Perfmon:CPU" counter="% Processor Time" instance=_Total
| stats max(cpu_load_percent) as load by host
| eval load = round(load, 2)
| where load >=97
| rename host as Host, load as "% Processor Time"

Any advice would be great.

Thanks.


Raschko
Communicator

You can use the streamstats command with time_window instead of stats.

Try this:

index=perfmon sourcetype="Perfmon:CPU" counter="% Processor Time" instance=_Total
| sort 0 _time
| streamstats time_window=15min avg(cpu_load_percent) as last15min_load count by host
| eval last15min_load = if(count < 18, null(), round(last15min_load, 2))
| where last15min_load >= 97
| table host, _time, cpu_load_percent, last15min_load, count

For each event, the streamstats command looks at that host's events from the preceding 15 minutes and calculates the average load.
It also yields the count of events in that window, which the next eval command uses.

The eval line checks that the event count is at least 18, to make sure there are enough logged events for the average to be meaningful.
Otherwise you will get alerts at every reboot, because the window will then contain only 1 event with high load.
I took 18 because that's the event count I get within 3 minutes from one host (1 event / 10 sec).
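
A possible refinement, offered as a sketch rather than a tested alert: the search above averages the load over the window, so a short dip below 97% can still produce an alert, and a count of 18 only guarantees about 3 minutes of data. If the alert should require every sample in the 15 minutes to be at or above 97%, min() can take the place of avg(), and the count threshold can be raised toward what a full window holds (15 min x 6 events per min = 90 at 1 event / 10 sec). The threshold of 85 is an assumed value that leaves room for a few missed samples, and min15_load is a field name chosen here for illustration.

```spl
index=perfmon sourcetype="Perfmon:CPU" counter="% Processor Time" instance=_Total
| sort 0 _time
| streamstats time_window=15m min(cpu_load_percent) as min15_load count by host
| where count >= 85 AND min15_load >= 97
| table host, _time, cpu_load_percent, min15_load, count
```

If your hosts log at a different interval, scale the count threshold accordingly (for example, roughly 30 for 1 event / 30 sec).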

HTH
