Hello Splunkers. Noob here. I have an alert that fires when any of three metrics (listed in the title) goes above 75%. I just need the alert to also show which top offending processes are causing the overages. Here is my query so far, which does work to show when any of the 3 main metrics goes above 75%.
index=blah (sourcetype="PerfDisk" OR sourcetype="PerfCPU" OR sourcetype="PerfMem" OR sourcetype="PerfProcess") (host=blah OR host=blah OR host=blah OR host=blah ) earliest=-5m
| stats avg(%CommittedBytes) as mem_use_prcnt
avg(cpuLoadPerc) as cpu_load_prcnt
avg(%DiskTime) as disk_utilization_prcnt
by host
| eval fire_it_up = case(cpu_load_prcnt > 75,1,
mem_use_prcnt > 75,1,
disk_utilization_prcnt > 75,1,
true() ,0)
| where fire_it_up > 0
| table host cpu_load_prcnt mem_use_prcnt disk_utilization_prcnt
Any ideas on getting the top offending processes causing the overages? Any help is much appreciated.
Here is what I ended up doing:
index=win sourcetype="Perf:logDisk" instance!=_Total (host=myhost) earliest=-5m
| eval volume = instance
| stats avg(%_Disk_Time) as diskUse% by volume host
| join type=left host
[| search index=win sourcetype="Perf:Process" category=%_Processor_Time=* NOT(instance IN(_Total, Idle)) (host=myhost) earliest=-5m
| stats avg(%_Processor_Time) as %_Processor_Time by host instance
| sort -%_Processor_Time
| streamstats count by host
| where count=1
| eval %_Processor_Time=round('%_Processor_Time')
| eval Additional_InfoCPU = "Top Resource Task=" . instance . ", Task Time=" . '%_Processor_Time'
| fields host Additional_InfoCPU ]
I then repeated that for a bunch of other metrics (mem%, cpu%, etc.), each in its own subsearch.
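As a rough sketch of what one of those repeated subsearches might look like for memory, the block below mirrors the CPU subsearch. The counter field `Working_Set` and the output field `Additional_InfoMem` are assumptions for illustration and are not taken from the original search, so check which memory counters your `Perf:Process` events actually carry. It also uses `dedup host` after the sort as a shorter alternative to `streamstats count by host | where count=1`.

```spl
| join type=left host
    [ search index=win sourcetype="Perf:Process" NOT (instance IN (_Total, Idle)) host=myhost earliest=-5m
    | stats avg(Working_Set) as working_set by host instance
    | sort 0 - working_set
    | dedup host
    | eval working_set_mb = round(working_set / 1024 / 1024, 1)
    | eval Additional_InfoMem = "Top Resource Task=" . instance . ", Working Set MB=" . working_set_mb
    | fields host Additional_InfoMem ]
```

Because the rows are sorted in descending order first, `dedup host` keeps only the highest-consuming process for each host.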
Then
| eval "mem_use_%" = round('mem_use_%', 2)
| eval "cpu_load_%" = round('cpu_load_%', 2)
| eval "disk_utilization_%" = round('disk_utilization_%', 2)
| eval "Individual_DiskUse_%" = round('Individual_DiskUse_%', 2)
| eval fire_alert = case('cpu_load_%' > 75,1,
'mem_use_%' > 75,1,
'Individual_DiskUse_%' > 75,1,
true() ,0)
| where fire_alert>0
| stats values(volume) values(DiskUse_%) by everything you want
| Table it all out
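If it helps to see in the alert which metric crossed the threshold, one possible variation on the 0/1 flag is to build a multivalue field instead. This is only a sketch: the field name `breached` is made up for illustration, and the metric field names are assumed to match the ones used in the eval steps above.

```spl
| eval breached = mvappend(
    if('cpu_load_%' > 75, "cpu", null()),
    if('mem_use_%' > 75, "memory", null()),
    if('Individual_DiskUse_%' > 75, "disk", null()))
| where isnotnull(breached)
```

Rows where no metric exceeds 75 end up with a null `breached` and are dropped, so this should keep the same hosts as the `case()` flag while also recording the reason.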
Hi @spluzer
"I just need to add into the alert what top offending processes are causing the overages" ... well then, you need to capture the process names under CPU, memory, or disk. I am sure they are mentioned in your events somewhere?
You can't just go by sourcetype; all that would tell you is that if CPU spikes above 75%, the data came from the PerfCPU sourcetype.
Perhaps you have more granular details than that, such as which process names appear under those sourcetypes?
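A quick way to check whether process names are present might be something like the sketch below. It assumes the process name lives in a field called `instance` in the `PerfProcess` sourcetype, which may not be true in your environment; `index=blah` is the placeholder from the original question.

```spl
index=blah sourcetype="PerfProcess" earliest=-5m
| stats count by host instance
```

If this returns one row per process per host, those `instance` values are what you would rank to find the top offenders.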
@spluzer If your problem is resolved, please accept the answer to help future readers.