How to calculate component availability from Nagios data

auntyem
Explorer

I am trying to get component availability for the previous month from Nagios data coming into Splunk. Ultimately I want the percentage of time each component was available. I've developed the following search to calculate downtime for a single host (hardware names have been changed to protect the innocent):

index=nagios src_host=blahn1* HOST ALERT HARD NOT SERVICE* host="blah.umich.edu"|sort _time|delta _time AS duration p=1|where name="UP"|stats sum(duration) as total_down_time|table total_down_time

This appears to be working correctly. If I can then calculate total uptime, I can get my percentage. (I'd eventually like to exclude maintenance windows, but they aren't in Nagios yet, so I can't.)

My questions are as follows:

  • I'm sure I'm not the first person to do this. Is there a better way?
  • If not, I can't find an easy way to calculate the uptime for the month from the time span of the search. Again, I'm sure I'm missing something there.
  • My search above is for a single host (src_host) in Nagios. I'd like to do this for a group of hosts, using tags to select the group. I've tried the following, but it isn't quite working, though it's close:

index=nagios tag=mxhosts HOST ALERT HARD NOT SERVICE* host="blah.umich.edu"|sort src_host, _time|delta _time AS duration p=1|delta src_host AS hostdiff|where name="UP" and ~something with hostdiff~|stats sum(duration) as total_down_time by src_host

Since I can't use a 'by' in the delta command, I was trying to use the hostdiff field created in the search as a way to filter out durations where the src_host changed (i.e. the previous event was from a different host, so 'delta' shouldn't count it).

  • We have four Nagios hosts monitoring everything. I'd like to just pick one, but if a netsplit makes one of them register an 'outage', I don't want that to be the one I picked. I somehow want the 'best' host at any given point in time. Any suggestions here?

Thanks


lukeh
Contributor

Hi 🙂

Please upgrade to the latest release of Splunk for Nagios and let me know how you go 🙂

http://apps.splunk.com/app/352/

There are a number of new dashboards including:

Livestatus Host SLA

Livestatus Service SLA

All the best,

Luke 🙂
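
For anyone who wants to stay with a hand-written search, the per-host calculation asked about in the question could be sketched roughly as follows. This is an untested sketch, not something taken from the Splunk for Nagios app. It assumes the same fields as the searches in the question (src_host, name, and host for the Nagios server), a bounded time range such as earliest=-1mon@mon latest=@mon, and that each host was up at the start of that range. It uses streamstats with a by clause in place of delta, since delta has no by clause, and addinfo to obtain the length of the search window; the field names prev_time, period_secs and pct_available are made up for illustration.

```
index=nagios tag=mxhosts HOST ALERT HARD NOT SERVICE* host="blah.umich.edu"
| sort 0 _time
| streamstats current=f last(_time) AS prev_time by src_host
| eval duration = _time - prev_time
| where name="UP" AND isnotnull(prev_time)
| stats sum(duration) AS total_down_time by src_host
| addinfo
| eval period_secs = info_max_time - info_min_time
| eval pct_available = round(100 * (period_secs - total_down_time) / period_secs, 2)
| table src_host total_down_time pct_available
```

Because prev_time is tracked separately for each src_host, the first event of a host has no previous timestamp and is dropped, which is the filtering the hostdiff field was meant to do. Hosts with no recorded outage produce no row from stats, and an outage that is still open at the end of the window is not counted, so both cases would need separate handling.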
