Alerting

Create a general alert based on any item exceeding an 'out of the norm' threshold.

mdavis43
Path Finder

Is there a way to create a general alert that can trigger when anything suddenly experiences a significant amount of increased log messages? For example...

any_host experiences a large number of the same event (login failures, multipath errors, read-only file systems, etc.) in a given time period

It's basically a catch-all type of alert, in place of writing an alert for every possible 'large increase in activity'. We're writing alerts as we see things come in, and there is a lot of stuff falling through the cracks.


jtrucks
Splunk Employee

A method I'm working on for this involves doing the following:

  1. Perform a search over a longer period of time and count your events, such as: source="mystuff" | timechart span=1m count | collect index=mysummary
  2. Update the above search at regular, short intervals to keep the data in the summary index up to date.
  3. Perform a search against the summary index to do | timechart span=1h avg(count),stdev(count) ... as a subsearch ...
  4. ... inside a search that checks the count across a short, current interval of time.
  5. Then eval whether the current count is more than 3 standard deviations above or below the average.
  6. Alert if the above is true.

I'm still working out the exact technical details, but the above is the approach I'm currently experimenting with.
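
A minimal sketch of how steps 3-5 might be wired together (the index, source, and field names here are placeholders, and this assumes the per-minute counts from step 1 are stored in the summary index as a literal count field):

  source="mystuff" earliest=-1m@m latest=@m
  | stats count AS current_count
  | appendcols
      [ search index=mysummary earliest=-7d@d
        | stats avg(count) AS avg_count stdev(count) AS stdev_count ]
  | eval upper = avg_count + (3 * stdev_count), lower = avg_count - (3 * stdev_count)
  | where current_count > upper OR current_count < lower

The outer search counts the most recent one-minute bucket, the appendcols subsearch pulls the baseline average and standard deviation out of the summary index, and the alert is set to fire whenever the search returns a result.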

--
Jesse Trucks
Minister of Magic

jcoates_splunk
Splunk Employee

Or you could just use the predict command. It calculates a band of normalcy, and then you eval whether your real flow has left the band. I wrote a blog post showing how to do that here: http://blogs.splunk.com/2012/11/04/predict-detect/
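
One way to express that check in SPL (a rough sketch with placeholder names, not the exact search from the post; the upper95/lower95 field names follow the command's default naming and may differ depending on the options and version you use):

  source="mystuff"
  | timechart span=10m count
  | predict count AS pred
  | eval breach = if(count > 'upper95(pred)' OR count < 'lower95(pred)', 1, 0)
  | where breach = 1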

jcoates_splunk
Splunk Employee

Thanks for the follow-up... indeed, non-normal data basically leaves you with two choices: A) don't do that, or B) chop off the outliers. I hear that choice B plus an economics degree = profit! But seriously, if the consequences of a false positive are dire, it's best not to screw with algorithmic planning and detection unless you're going to add a layer of smarts to it. Bayesian smarts are fairly effective within a reasonable domain, but again you want to be thinking about the big picture of inputs and outcomes. For instance, SpamAssassin on your email is low impact, but HFT on your savings can leave a mark.


richcollier
Path Finder

Actually, 90 makes the band tighter - 99 is the widest (least sensitive). I also tried the trendline solution above - similar result (http://i.imgur.com/CVEsMzF.png). No worries - it's just that some data's behavior doesn't conform to a Gaussian distribution, so using averages and +/- standard deviations can give misleading results.


jcoates_splunk
Splunk Employee

I think that goes the other way, as in 90 would have lower sensitivity than 99... not sure though. Also, here's another way to do it which doesn't look into the future (yanked from a search Coccyx wrote):

... | trendline sma20(Sales) as trend | eventstats stdev(Sales) as stdev | eval trend=if(isnull(trend),Sales,trend) | eval "High Prediction"=trend+(2*stdev) | eval "Low Prediction"=if(trend-(1.5*stdev)>0,trend-(1.5*stdev),0) | fields - stdev, trend
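
(If you wanted an alertable result rather than a chart from that, one option - not part of the original search - would be to tack an out-of-band check onto the end:

  ... | eval outlier = if(Sales > 'High Prediction' OR Sales < 'Low Prediction', 1, 0)
  | where outlier = 1

and trigger the alert whenever any rows come back.)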

richcollier
Path Finder

Thanks for the suggestions! Yes, changing the algorithm to "LL" seems to be the best one for this kind of data, and I also changed the range to the 99th percentile (widest possible). The number of false alerts is much lower than before (no longer 50), but is still about 4-5 for a 4-hour window. (http://i.imgur.com/vcHfltD.png)


jcoates_splunk
Splunk Employee

Hi Rich,

Agreed -- luckily there are some tuning options for that command. Here's the manual for reference: http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Predict

My initial inclination is to broaden the range from the 95th percentile, using the upper and lower options. You might also try out different algorithms. I suspect that fare request response times have some periodicity, so LLP or LLT might work better. See the rough example below.
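
Roughly, both knobs sit right on the command; the metric and field names here are just placeholders:

  ... | timechart span=15m avg(response_time) AS rt
  | predict rt AS pred algorithm=LLP upper99=upper lower99=lower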

Jack


richcollier
Path Finder

Hmmm... I tried the predict command, as suggested in your blog, on some response time data, and during a 4-hour window of what I know are "normal" values, the upper95 and lower95 bands were crossed almost 50 times (http://i.imgur.com/IZIbgqY.png). That is a lot of false alerts.


richcollier
Path Finder

The Prelert Anomaly Detective app uses machine-learning algorithms to automatically learn the baseline rates of your events (or the values of performance metrics) and uses that information to detect anomalies in current data. It can auto-learn the baseline in 3 modes:

  • over a wide search period that you define
  • comparing two discrete time periods against each other
  • ongoing in real-time (by using summary indexes that are created for you)

Sounds like it would be useful for your use case!

tjensen
Explorer

I'm also interested in this topic. Did you find a solution already?


mdavis43
Path Finder

Digging on my own, I think I've figured out how to do it using summary indexes. I'll answer my own question once we've developed the procedures.


sneighbour
Engager

I'd also be interested in something like this.
