Alerting

Create a general alert based on any item exceeding an 'out of the norm' threshold.

mdavis43
Path Finder

Is there a way to create a general alert that can trigger when anything suddenly experiences a significant amount of increased log messages? For example...

any_host experiences a large number of the same event (login failures, multipath errors, read-only file systems, etc.) in a given time period

It's basically a catch-all type of alert, in place of writing an alert for every possible 'large increase in activity'. We're writing alerts as we see things come in, and there is a lot of stuff falling through the cracks.


jtrucks
Splunk Employee

A method I'm working on for this involves doing the following:

  1. Perform a search over a longer period of time and count your events, such as: source="mystuff" | timechart span=1m count | collect index=mysummary
  2. Update the above search at regular, short intervals to keep the data in the summary index up to date.
  3. Perform a search against the summary index to do | timechart span=1h avg(count),stdev(count) ... as a subsearch ...
  4. ... inside a search that checks the count across a short, current interval of time.
  5. Then eval whether the current count is more than 3 standard deviations above or below the average.
  6. Alert if the above is true.

I'm still working out the exact technical details, but the above is the approach I'm currently experimenting with.
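
A minimal sketch of how steps 3-5 might be wired together (the index, source, and field names here are placeholders, and this assumes the per-minute counts from step 1 are stored in the summary index as a literal count field):

  source="mystuff" earliest=-1m@m latest=@m
  | stats count AS current_count
  | appendcols
      [ search index=mysummary earliest=-7d@d
        | stats avg(count) AS avg_count stdev(count) AS stdev_count ]
  | eval upper = avg_count + (3 * stdev_count), lower = avg_count - (3 * stdev_count)
  | where current_count > upper OR current_count < lower

The outer search counts the most recent one-minute bucket, the appendcols subsearch pulls the baseline average and standard deviation out of the summary index, and the alert is set to fire whenever the search returns a result.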

--
Jesse Trucks
Minister of Magic

jcoates_splunk
Splunk Employee

Or you could just use the predict command. It calculates a band of normalcy, and then you eval whether your real flow has left the band. I wrote a blog post showing how to do that here: http://blogs.splunk.com/2012/11/04/predict-detect/
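
One way to express that check in SPL (a rough sketch with placeholder names, not the exact search from the post; the upper95/lower95 field names follow the command's default naming and may differ depending on the options and version you use):

  source="mystuff"
  | timechart span=10m count
  | predict count AS pred
  | eval breach = if(count > 'upper95(pred)' OR count < 'lower95(pred)', 1, 0)
  | where breach = 1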

jcoates_splunk
Splunk Employee

Thanks for the follow-up... indeed, non-normal data basically leaves you with two choices: A) don't do that, or B) chop off the outliers. I hear that choice B plus an economics degree = profit! But seriously, if the consequences of a false positive are dire, it's best not to screw with algorithmic planning and detection unless you're going to add a layer of smarts to it. Bayesian smarts are fairly effective within a reasonable domain, but again you want to be thinking about the big picture of inputs and outcomes. For instance, SpamAssassin on your email is low impact, but HFT on your savings can leave a mark.


richcollier
Path Finder

Actually, 90 makes the band tighter - 99 is the widest (least sensitive). I also tried the trendline solution above - similar result (http://i.imgur.com/CVEsMzF.png). No worries - it's just that some data's behavior doesn't conform to a Gaussian distribution, so using averages and +/- standard deviations can give misleading results.


jcoates_splunk
Splunk Employee

I think that goes the other way, as in 90 would have lower sensitivity than 99... not sure though. Also, here's another way to do it which doesn't look into the future (yanked from a search Coccyx wrote):

... | trendline sma20(Sales) as trend | eventstats stdev(Sales) as stdev | eval trend=if(isnull(trend),Sales,trend) | eval "High Prediction"=trend+(2*stdev) | eval "Low Prediction"=if(trend-(1.5*stdev)>0,trend-(1.5*stdev),0) | fields - stdev, trend
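
(If you wanted an alertable result rather than a chart from that, one option - not part of the original search - would be to tack an out-of-band check onto the end:

  ... | eval outlier = if(Sales > 'High Prediction' OR Sales < 'Low Prediction', 1, 0)
  | where outlier = 1

and trigger the alert whenever any rows come back.)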

richcollier
Path Finder

Thanks for the suggestions! Yes, changing the algorithm to "LL" seems to be the best one for this kind of data, and I also changed the range to the 99th percentile (widest possible). The number of false alerts is much lower than before (no longer 50), but is still about 4-5 for a 4-hour window. (http://i.imgur.com/vcHfltD.png)


jcoates_splunk
Splunk Employee

Hi Rich,

Agreed -- luckily there are some tuning options for that command. Here's the manual for reference: http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Predict

My initial inclination is to broaden the range from the 95th percentile, using the upper and lower options. You might also try out different algorithms. I suspect that fare request response times have some periodicity, so LLP or LLT might work better. See the rough example below.
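
Roughly, both knobs sit right on the command; the metric and field names here are just placeholders:

  ... | timechart span=15m avg(response_time) AS rt
  | predict rt AS pred algorithm=LLP upper99=upper lower99=lower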

Jack


richcollier
Path Finder

Hmmm... I tried the predict command, as suggested in your blog, on some response time data, and during a 4-hour window of what I know are "normal" values, the upper95 and lower95 bands were crossed almost 50 times (http://i.imgur.com/IZIbgqY.png). That is a lot of false alerts.


richcollier
Path Finder

The Prelert Anomaly Detective app uses machine-learning algorithms to automatically learn the baseline rates of your events (or the values of performance metrics) and uses that information to detect anomalies in current data. It can auto-learn the baseline in 3 modes:

  • over a wide search period that you define
  • comparing two discrete time periods against each other
  • ongoing in real-time (by using summary indexes that are created for you)

Sounds like it would be useful for your use case!

tjensen
Explorer

I'm also interested in this topic. Did you find a solution already?


mdavis43
Path Finder

Digging on my own, I think I've figured out how to do it using summary indexes. I'll answer my own question once we've developed the procedures.


sneighbour
Engager

I'd also be interested in something like this.
