Solved: How to build an ongoing alert that catches a sudden rise (spike) in a certain error code?

gingersoftware · 07-23-2018

Hi Guys,

I could really use an ongoing alert that catches a sudden rise (spike) in a specific error code (such as 404 or 502).
I've tried to work out how to achieve that, and... well... I could really use your help 🙂

As I understand it, the search needs to "know" or "sense" the normal traffic level (I'm not sure over what window, maybe 1 or 2 hours) and alert when that error code spikes compared to the previous 1-2 hours.
I think the spike threshold should be the error code exceeding 5% of total traffic for longer than 90 seconds.
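The requirement above (error share over 5% of traffic, lasting longer than 90 seconds) can be pinned down as a small check. This is a minimal Python sketch, not Splunk code; it assumes the traffic has already been split into contiguous, equal-length time buckets of (total, errors) counts, and the function name `sustained_spike` is hypothetical.

```python
def sustained_spike(buckets, span_seconds, pct_threshold=5.0, min_duration=90):
    """Return True if the error share exceeds pct_threshold (percent of all
    requests) in consecutive buckets spanning more than min_duration seconds.

    buckets: list of (total_requests, error_requests) tuples, one per
    contiguous time bucket of span_seconds each, oldest first.
    """
    run = 0  # seconds the threshold has been exceeded without a break
    for total, errors in buckets:
        pct = 100.0 * errors / total if total else 0.0
        if pct > pct_threshold:
            run += span_seconds
            if run > min_duration:
                return True
        else:
            run = 0  # streak broken, start counting again
    return False


# 30-second buckets of 1,000 requests each; errors are 8%, 9%, 7%, 6% for
# four buckets in a row (120 s), so the alert condition is met.
print(sustained_spike([(1000, 10), (1000, 80), (1000, 90), (1000, 70), (1000, 60)], 30))  # True
```

Requiring the streak to be strictly longer than 90 seconds means three 30-second buckets (exactly 90 s) are not enough on their own.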

I appreciate your help.

HiroshiSatoh · 07-23-2018

I use the predict command when I build alerts based on statistical analysis. I find it easier to tune the prediction parameters to the current situation than to work out a variety of custom logic.

index=(your index) ("404" OR "502" OR ...)
| timechart span=90s count 
| predict lower95=lower upper95=upper algorithm=LL count as predict
| where count>'upper(predict)'

Parameters to adjust: span=90s, upper95, the search time range, and (optionally) algorithm.
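In the search above, predict models the count series and adds a 95% confidence band, and the where clause keeps only buckets above the upper bound. The Python sketch below shows the same "alert when above the upper band" idea but swaps in a much simpler technique than predict's LL (Local Level) model: a trailing mean plus 1.96 standard deviations of the previous `window` buckets. With span=90s, the assumed window of 40 buckets covers one hour. The function name and defaults are hypothetical.

```python
import statistics


def above_upper_band(counts, window=40, z=1.96):
    """Return indexes of buckets whose count exceeds mean + z * stdev of the
    preceding `window` buckets (a simple stand-in for predict's upper95).

    counts: per-bucket counts, oldest first. With span=90s, window=40 means
    the baseline is the previous hour. z=1.96 is the two-sided 95% point of
    a normal distribution.
    """
    flagged = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mean = statistics.fmean(baseline)
        sd = statistics.pstdev(baseline)
        if counts[i] > mean + z * sd:
            flagged.append(i)
    return flagged


# One hour of slightly noisy counts, then a normal bucket and a spike.
counts = [100, 102, 98, 101, 99] * 8 + [100, 300]
print(above_upper_band(counts))  # [41]
```

Because the baseline trails the current bucket, a long-lasting spike in this sketch gradually raises its own threshold, so the window length matters as much as the confidence level.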


Anam · 08-09-2018

Hi @gingersoftware

My name is Anam Siddique and I am the Community Content Specialist for Splunk Answers. Please accept the appropriate answer that worked for you so other members of the community can benefit from it. If none of the answers have worked for you so far please post further comments so someone can help you.

Thanks

felipesewaybric · 08-03-2018

Timewrap will do the trick.
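One way to read the timewrap suggestion: line up each time bucket with the same bucket one period earlier (for example, one hour ago) and alert when the current value is much larger. Below is a minimal Python sketch of that comparison, not Splunk code; it assumes equal-length buckets, a period given as a number of buckets (with span=90s, one hour is 40 buckets), and a 2× spike factor, all of which are illustrative choices. The function name is hypothetical.

```python
def compare_to_previous_period(counts, period, factor=2.0):
    """Return indexes of buckets whose count is more than `factor` times the
    count in the same bucket one period earlier.

    counts: per-bucket counts, oldest first, equal-length buckets.
    period: period length in buckets (e.g. 40 buckets of 90 s = 1 hour).
    Buckets whose earlier counterpart is 0 are skipped, since an empty
    baseline gives no meaningful ratio.
    """
    flagged = []
    for i in range(period, len(counts)):
        previous = counts[i - period]
        if previous > 0 and counts[i] > factor * previous:
            flagged.append(i)
    return flagged


# Two hours of 90-second buckets: a steady first hour, then a jump at the end.
counts = [10] * 40 + [12] * 39 + [50]
print(compare_to_previous_period(counts, period=40))  # [79]
```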

woodcock · 08-03-2018

Check out this INCREDIBLE answer by @mmodestino here:

https://answers.splunk.com/answers/511894/how-to-use-the-timewrap-command-and-set-an-alert-f.html

I heard that he was going to create a blog post or app based on this. @mmodestino, how has this answer evolved since then?

gingersoftware · 07-26-2018

Thanks,

Could you help me modify this script to fit your description?

tag=NginxLogs (host=www1 OR host=www2)
|stats count by status
|eventstats sum(count) as total
|eval perc=round((count/total)*100,2)
|where status="404" AND perc>5

Thanks
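The search above counts events per status, adds the overall total to every row with eventstats, and then computes each status's share. The same arithmetic in plain Python is sketched below for sanity-checking the 5% threshold; the function names are hypothetical. Note that, as written, the search produces one percentage for the entire search time range rather than one per interval, which is why a per-interval timechart is needed for a spike alert.

```python
from collections import Counter


def status_percentages(statuses):
    """Mirror `stats count by status | eventstats sum(count) as total |
    eval perc=round((count/total)*100,2)` for a list of status codes."""
    counts = Counter(statuses)            # stats count by status
    total = sum(counts.values())          # eventstats sum(count) as total
    return {s: round(c / total * 100, 2) for s, c in counts.items()}


def should_alert(statuses, status="404", threshold=5.0):
    """Mirror `where status="404" AND perc>5`."""
    return status_percentages(statuses).get(status, 0.0) > threshold


sample = ["200"] * 90 + ["404"] * 6 + ["502"] * 4
print(status_percentages(sample))  # {'200': 90.0, '404': 6.0, '502': 4.0}
print(should_alert(sample))        # True
```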

Noah_Woodcock · 08-03-2018

Predictions are the way to go.

HiroshiSatoh · 07-27-2018

For example, something like this:

tag=NginxLogs (host=www1 OR host=www2)
|timechart span=1h count as total,count(eval(status="404")) as count
|eval perc=round((count/total)*100,2)
|fields - count,total
|predict lower95=lower upper95=upper algorithm=LL perc as predict
|where perc>'upper(predict)'

Since this is only a sample, adjust the parameters for your actual environment and test it.
If you remove the where clause, you can view the results (perc with its prediction and confidence bounds) on a chart.
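The timechart line above does two things at once: it splits events into fixed time buckets and, within each bucket, counts all events and the events with the target status. A rough Python equivalent is sketched below to make that bucketing explicit; the function name is hypothetical, buckets are assumed to align to multiples of `span` in epoch (UTC) time, and, unlike timechart, buckets with no events are simply left out. The resulting per-bucket percentages are the series that predict then models.

```python
from collections import defaultdict


def error_share_by_span(events, status="404", span=3600):
    """Group (epoch_seconds, status) events into span-second buckets and return
    {bucket_start: perc}, where perc is the share of events with `status`.

    Roughly mirrors:
      timechart span=1h count as total, count(eval(status="...")) as count
      | eval perc=round((count/total)*100,2)
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for ts, st in events:
        bucket = ts - ts % span  # start of the bucket this event falls into
        totals[bucket] += 1
        if st == status:
            matches[bucket] += 1
    return {b: round(matches[b] / totals[b] * 100, 2) for b in sorted(totals)}


events = [(0, "200"), (10, "404"), (20, "200"), (30, "200"), (3600, "404"), (3700, "404")]
print(error_share_by_span(events))  # {0: 25.0, 3600: 100.0}
```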
