Alerting

Create a real-time alert that triggers when count > 2 within 1 minute

damonmanni
Path Finder

Use Case:
• Our Jira instance crashes intermittently when there is heavy load on the server.
• The cause is that JVM Garbage Collection (GC) does not run effectively as server load increases, eventually crashing Java. CPU usage also climbs to 390% as the JVM struggles and consumes resources.

Splunk Goal:
• Monitor in real time and alert the admin when Splunk sees GC beginning to struggle, so that the admin can do a graceful restart of Jira before it crashes.

KPI:
• Search the logs for “Full GC”; if more than 2 hits are found within a 1-minute span, the JVM is heading out of control, so send an email alert to the admin.

I need your help on:
From my attempts below, I think I am extracting what I need as a report, but I don't know how to make the alert trigger only when the per-minute count (which is 5 in the two flagged results below) is > 2, and NOT on the total number of event records found that day (5+1=6, or 1+5+1=7).
It seems that the total number of events returned (6 or 7) will always trip the alert, which is not what we want.

Here is my setup:
I set up my real-time alert as:
• Type: Real-time (alert setup screenshot omitted)

My attempts:
• Query-1 – This gives me a few target dates with known failures, so I can use that data to test with:

index=jira sourcetype=gc host=mdc2vr8223 source="gc-" "[Full GC" | bucket _time span=1m | stats count by _time | eval occurred=if(count>2,"Possible GC issue occurring","GC ok") | table occurred, _time, count

    occurred                      _time                 count

 1  GC ok                         2018-01-31 18:00:00   1
    ... etc ...
15  GC ok                         2018-02-14 18:55:00   1
16  GC ok                         2018-02-15 23:00:00   1
17  Possible GC issue occurring   2018-02-19 07:48:00   5
18  GC ok                         2018-02-19 08:08:00   1
19  GC ok                         2018-02-21 10:12:00   1
20  Possible GC issue occurring   2018-02-21 10:14:00   5
21  GC ok                         2018-02-21 10:28:00   1
22  GC ok                         2018-02-25 15:00:00   1
23  GC ok                         2018-03-01 03:00:00   1

• Query-2 – To simulate a real-time trigger, I took Query-1 and ran it against one of the danger dates above:

index=jira sourcetype=gc host=mdc2vr8223 source="gc-" "[Full GC" earliest="02/19/2018:00:00:00" latest="02/19/2018:23:00:00" | bucket _time span=1m | stats count by _time | eval occurred=if(count>2,"Possible GC issue occurring","GC ok") | table occurred, _time, count

    occurred                      _time                 count

 1  Possible GC issue occurring   2018-02-19 07:48:00   5
 2  GC ok                         2018-02-19 08:08:00   1

What am I missing?
cheers,
Damon

starcher
Influencer

If you're making an alert, never let non-alert conditions create rows. That's fine for a report, but not for an alert. Instead of "GC ok" use null(), and put | where isnotnull(occurred) after the stats. Then you will get only rows where the alert condition is met.
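
For reference, applying that change to Query-1 might look something like the following (the search is otherwise unchanged from the original; only the last two pipes differ):

index=jira sourcetype=gc host=mdc2vr8223 source="gc-" "[Full GC" | bucket _time span=1m | stats count by _time | eval occurred=if(count>2,"Possible GC issue occurring",null()) | where isnotnull(occurred) | table occurred, _time, count

With only the breaching minutes left in the results, the alert's trigger condition can simply be "number of results greater than 0".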

Also, never run "real-time" searches. Run a scheduled search over short intervals instead, like every 5 minutes.
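
As a rough sketch of what that schedule could look like in savedsearches.conf (the stanza name, email address, and exact time window below are illustrative assumptions, not from this thread; the attribute names are standard Splunk alert settings):

[Jira Full GC surge]
# run every 5 minutes over the last few whole minutes, with a little overlap
search = index=jira sourcetype=gc host=mdc2vr8223 source="gc-" "[Full GC" | bucket _time span=1m | stats count by _time | eval occurred=if(count>2,"Possible GC issue occurring",null()) | where isnotnull(occurred)
enableSched = 1
cron_schedule = */5 * * * *
dispatch.earliest_time = -6m@m
dispatch.latest_time = @m
# trigger when the search returns any rows at all
counttype = number of events
relation = greater than
quantity = 0
alert.track = 1
action.email = 1
action.email.to = jira-admin@example.com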

0 Karma

damonmanni
Path Finder

Thank you, starcher! I did everything you mentioned, and that did the trick.
cheers,
D


DalJeanis
Legend

@damonmanni - we converted starcher's comment to an answer. Please accept the answer so that your question will show as closed.
