Hello,
I want to check whether my processor has exceeded a certain percentage for a given amount of time, and then send an alert.
I have the search query in the format below. Kindly guide me and, if possible, correct the query:
index=windows sourcetype="perfmon:cputime" counter="% Processor Time" | timechart span="1s" avg(Value) | search count >10 by host
But I am not sure where in this query to check whether the value exceeds a certain percentage.
Note that the perfmon data is not collected every second, so a span of "1s" in the timechart does not make sense.
What if you had a search that ran every 10 minutes and alerted for any host that had an average processor time in excess of 90%?
index=windows sourcetype="perfmon:cputime" counter="% Processor Time" earliest=-13m latest=-3m
| stats avg(Value) as AvgProcessorTime by host
| where AvgProcessorTime > 90
You can set the alert condition to # results > 0, because this search returns one result for each host that has a high AvgProcessorTime. Note that my search has a 3-minute "lag" (latest=-3m); this allows time for the data to be collected and returned across the environment. You can eliminate it if you want, but it will affect the accuracy.
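To make the window arithmetic concrete, here is a minimal Python sketch of what the search above computes. It assumes the perfmon events are available as hypothetical (epoch_seconds, host, value) tuples; the function name and parameters are illustrative only and are not part of Splunk. As with earliest/latest, the window start is treated as inclusive and the end as exclusive.

```python
from collections import defaultdict


def hosts_over_threshold(samples, now, threshold=90.0, window_s=600, lag_s=180):
    """Return hosts whose average % Processor Time exceeds `threshold`
    in the window [now - lag_s - window_s, now - lag_s).

    `samples` is assumed to be an iterable of (epoch_seconds, host, value)
    tuples, a hypothetical stand-in for the perfmon:cputime events.
    """
    start = now - lag_s - window_s  # corresponds to earliest=-13m
    end = now - lag_s               # corresponds to latest=-3m
    total = defaultdict(float)
    count = defaultdict(int)
    for ts, host, value in samples:
        if start <= ts < end:
            total[host] += value
            count[host] += 1
    # Like: stats avg(Value) as AvgProcessorTime by host | where AvgProcessorTime > 90
    return sorted(h for h in total if total[h] / count[h] > threshold)


samples = [
    (1300, "web01", 95), (1500, "web01", 92),  # average 93.5, alerts
    (1300, "db01", 50), (1500, "db01", 99),    # average 74.5, no alert
    (1900, "db01", 100),                       # inside the 3-minute lag, ignored
]
print(hosts_over_threshold(samples, now=2000))  # ['web01']
```

Each host in the returned list corresponds to one result row, which is why "# results > 0" is a suitable trigger condition.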
@Iguinn this query works well for detecting sudden CPU spikes, but from an impact perspective the business is more interested in whether CPU goes beyond the threshold and stays there for 3 or more minutes, which might affect server performance. Any pointers for doing that?
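One way to express "stays above the threshold for 3 or more minutes" is to average the samples per minute (as bin _time span=1m followed by stats would) and then look for a run of consecutive minutes above the threshold. The Python sketch below illustrates that logic under stated assumptions: events are hypothetical (epoch_seconds, host, value) tuples, and a minute with no samples at all is treated as breaking the run. It is not a Splunk API.

```python
from collections import defaultdict


def hosts_with_sustained_load(samples, threshold=90.0, min_minutes=3):
    """Return hosts whose per-minute average exceeds `threshold` for at
    least `min_minutes` consecutive minutes.

    `samples` is assumed to be an iterable of (epoch_seconds, host, value).
    A minute with no samples breaks the run (an assumption).
    """
    # Per-minute averages, similar to: bin _time span=1m | stats avg(Value) by host _time
    total = defaultdict(float)
    count = defaultdict(int)
    for ts, host, value in samples:
        key = (host, int(ts // 60))
        total[key] += value
        count[key] += 1

    minutes_by_host = defaultdict(list)
    for host, minute in total:
        minutes_by_host[host].append(minute)

    alerting = []
    for host, minutes in minutes_by_host.items():
        run = longest = 0
        prev = None
        for minute in sorted(minutes):
            avg = total[(host, minute)] / count[(host, minute)]
            if avg > threshold:
                # Extend the run only if this minute directly follows the previous one
                run = run + 1 if prev == minute - 1 else 1
            else:
                run = 0
            longest = max(longest, run)
            prev = minute
        if longest >= min_minutes:
            alerting.append(host)
    return sorted(alerting)


samples = [
    (0, "web01", 95), (60, "web01", 85), (90, "web01", 99), (120, "web01", 95),
    (0, "db01", 95), (60, "db01", 95), (120, "db01", 50), (180, "db01", 95),
    (0, "app01", 95), (60, "app01", 95), (180, "app01", 95), (240, "app01", 95),
]
print(hosts_with_sustained_load(samples))  # ['web01']
```

Here web01 averages 95, 92 and 95 over three consecutive minutes and alerts; db01 dips to 50 in its third minute, and app01 has no data for minute 2, so neither reaches a 3-minute run. A simpler approximation that stays within the original search is to cover only a 3-minute window and require the minimum rather than the average to exceed the threshold (stats min(Value) as MinCPU by host | where MinCPU > 90), although a single low sample would then suppress the alert.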
Hi @saurabh_tek
I am trying to solve the same problem you mentioned. I hope you were able to resolve it; if so, could you please let me know how you handled it? There are several threads with similar questions, but none of the answers actually worked for me.
Thanks!
Thanks for the response; the query worked well. I have one more question. Until now I only knew that Splunk sends the count of the events that occurred during the time window. Is there any way to send the actual matching content in the email when the alert fires? In other words, can we make the report more intuitive and clear by including the matching text in the email body (not for perfmon data, but when parsing logs)?
Thanks in advance,
Tushar