
How to calculate component availability from Nagios data

auntyem

I am trying to get component availability for the previous month from Nagios data coming into Splunk. Ultimately I want the percentage of time each component was available. I've developed the following search to calculate downtime for a single host (hardware names have been changed to protect the innocent):

index=nagios src_host=blahn1* HOST ALERT HARD NOT SERVICE* host="blah.umich.edu"|sort _time|delta _time AS duration p=1|where name="UP"|stats sum(duration) as total_down_time|table total_down_time

This appears to work correctly. If I can then calculate the total uptime (eventually I'd like to exclude maintenance windows, but they aren't in Nagios yet, so I can't), I can get my percentage.
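To make the arithmetic concrete, here is a minimal Python sketch (not Splunk code) of what the single-host search computes, followed by the percentage step. It assumes each alert has been reduced to an (epoch seconds, state) pair and that `now` is timezone-aware. The function names are hypothetical, and `excluded_seconds` is only a placeholder for the maintenance windows mentioned above. Inside Splunk, `earliest=-1mon@mon latest=@mon` selects the previous calendar month, and the `addinfo` command exposes the search bounds as `info_min_time` and `info_max_time`. That is one way to get the window length without hard-coding it.

```python
from datetime import datetime, timezone


def downtime_seconds(events):
    """Model of `sort _time | delta _time AS duration p=1 | where name="UP"`.

    events: iterable of (epoch_seconds, state) tuples for one monitored host.
    Each UP event contributes the time elapsed since the event just before it,
    i.e. the length of the outage that the UP event ends.
    """
    total = 0
    prev_time = None
    for t, state in sorted(events):
        if prev_time is not None and state == "UP":
            total += t - prev_time
        prev_time = t
    return total


def previous_month_bounds(now):
    """(start, end) epoch seconds of the calendar month before `now`.

    `now` must be timezone-aware (e.g. UTC) so .timestamp() is unambiguous.
    """
    this_month = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if this_month.month == 1:
        prev_month = this_month.replace(year=this_month.year - 1, month=12)
    else:
        prev_month = this_month.replace(month=this_month.month - 1)
    return prev_month.timestamp(), this_month.timestamp()


def availability_percent(down_seconds, start, end, excluded_seconds=0):
    """Percent of the window [start, end) that the host was up.

    excluded_seconds (hypothetical maintenance time) is removed from the
    window; outages inside maintenance would also need to be left out of
    down_seconds for the ratio to stay meaningful.
    """
    window = (end - start) - excluded_seconds
    return 100.0 * (window - down_seconds) / window


if __name__ == "__main__":
    start, end = previous_month_bounds(datetime(2013, 3, 15, tzinfo=timezone.utc))
    down = downtime_seconds([(start + 100, "DOWN"), (start + 3700, "UP")])
    print(down, round(availability_percent(down, start, end), 4))
```

The `__main__` part shows the intended flow: pick the previous-month window, sum the outage gaps, then divide by the window length.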

My questions are as follows:

  • I'm sure I'm not the first person to do this. Is there a better way?
  • If not, I can't find an easy way to calculate the uptime for the month from the time span of the search. Again, I'm sure I'm missing something there.
  • My search above is for a single host (src_host) in Nagios. I'd like to do this for a group of hosts, using tags to define the group. I've tried the following, which is close but isn't quite working:

index=nagios tag=mxhosts HOST ALERT HARD NOT SERVICE* host="blah.umich.edu"|sort src_host, _time|delta _time AS duration p=1|delta src_host AS hostdiff|where name="UP" and ~something with hostdiff~|stats sum(duration) as total_down_time by src_host

Since I can't use a 'by' clause in the delta command, I was trying to use the hostdiff field created in the search to filter out durations where the src_host changed (i.e., if the previous event came from a different host, that 'delta' shouldn't count).

  • We have four Nagios hosts monitoring everything. I'd like to just pick one, but if a netsplit makes a host register an 'outage', I don't want that Nagios host to be the one I picked. Somehow I want the 'best' Nagios host at any given point in time. Any suggestions here?
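For the grouped search and the multiple-poller question, here is another minimal Python sketch under the same (epoch seconds, state) assumption; all function names are hypothetical. `downtime_by_host` only measures a gap when the previous event came from the same src_host, which is the effect the hostdiff filter is after. In SPL, `streamstats current=f last(_time) AS prev_time by src_host` (after sorting by `_time`) followed by `eval duration=_time-prev_time` is one way to get a per-host previous timestamp that may be worth testing. `agreed_downtime` swaps in one possible "best poller" policy, which is an assumption rather than anything Nagios or Splunk defines: a monitored host counts as down only while every Nagios server reports it down, so an outage seen by a single poller during a netsplit is ignored.

```python
def downtime_by_host(events):
    """events: iterable of (epoch_seconds, src_host, state) tuples.

    Returns {src_host: seconds down}. A gap is only measured between two
    events from the same src_host, so no duration spans a host change.
    """
    totals = {}
    prev_time = {}  # last event time seen for each src_host
    for t, host, state in sorted(events, key=lambda e: (e[1], e[0])):
        totals.setdefault(host, 0)
        if host in prev_time and state == "UP":
            totals[host] += t - prev_time[host]
        prev_time[host] = t
    return totals


def down_intervals(events):
    """(start, end) gaps ending in an UP event, for one host seen by one poller."""
    intervals = []
    prev_time = None
    for t, state in sorted(events):
        if prev_time is not None and state == "UP":
            intervals.append((prev_time, t))
        prev_time = t
    return intervals


def intersect(a, b):
    """Overlap of two sorted lists of non-overlapping (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def agreed_downtime(per_poller_events):
    """per_poller_events: {nagios_server: [(epoch_seconds, state), ...]} for ONE
    monitored host. Counts only the time every poller saw the host down."""
    lists = [down_intervals(evts) for evts in per_poller_events.values()]
    if not lists:
        return 0
    common = lists[0]
    for other in lists[1:]:
        common = intersect(common, other)
    return sum(end - start for start, end in common)
```

Requiring all pollers to agree is the most forgiving choice; a majority rule would be stricter and needs a different combining step.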

Thanks


lukeh

Hi 🙂

Please upgrade to the latest release of Splunk for Nagios and let me know how you go 🙂

http://apps.splunk.com/app/352/

There are a number of new dashboards, including:

  • Livestatus Host SLA
  • Livestatus Service SLA

All the best,

Luke 🙂
