Host UF uptime

scout29
Explorer

I'm looking to create a report showing the uptime of all hosts in a specific index that ingest data via a universal forwarder (UF). For the past 30 days, I would like to see the percentage of uptime per host in index=abc.

I am trying to create a metrics report showing how frequently each host is logging to Splunk.

 


Richfez
SplunkTrust

IMO this might get a lot more fiddly than you think. 

If the system is sending in logs all the time (for example, say they're Apache web servers), then even if you restart one, it'll come back online and catch up almost immediately. You can take that further: even if you turn off the forwarder for a week, when it gets turned back on it'll catch back up (possibly in just a couple of minutes!). Because the events keep their original timestamps, a report built on the events themselves shows no hole, and a week of "not reporting" just gets swept under the rug.

If you actually have *gaps* in your logs, this is ... an entirely separate issue and you should solve that, because that's not a thing that should really happen and it's generally fixable.

A bit better might be to use some shenanigans with the _indextime field, but it's also unlikely to be easily converted to a proper "wasn't sending in data" type report.  It might be closer, but it's still a lot of extra work.
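For what it's worth, a rough sketch of the _indextime idea might look like the following. It assumes the 30-day window and the index=abc from the question; the one-hour span and the 900-second threshold are arbitrary values picked for illustration. Large lag only hints that a forwarder was catching up (bad timestamp or time zone parsing produces lag too), so treat the output as a lead, not a downtime measurement.

```
index=abc earliest=-30d@d latest=@d
| eval lag_secs = _indextime - _time
| bin _time span=1h
| stats max(lag_secs) AS max_lag_secs by host _time
| where max_lag_secs > 900
| sort 0 host _time
```

Each row is an hour in which at least one event from that host was indexed more than 15 minutes after its own timestamp.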

Even better than that might be to read some of the _internal indexes (the metrics one comes to mind) to find out which 30-second periods the forwarder wasn't reporting in (or whatever the interval is). That would be more accurate.
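A sketch of that metrics approach, under some assumptions: it reads the indexers' metrics.log connection events (group=tcpin_connections), relies on the hostname and fwdType fields as they usually appear in those events (check your own data), and uses 5-minute buckets rather than 30-second ones to tolerate a little jitter. It is not restricted to hosts that write to index=abc, and if _internal retention is shorter than the search window the percentages will be understated.

```
index=_internal source=*metrics.log* group=tcpin_connections fwdType=uf earliest=-30d@d latest=@d
| bin _time span=5m
| stats dc(_time) AS buckets_seen by hostname
| addinfo
| eval total_buckets = floor((info_max_time - info_min_time) / 300)
| eval pct_reporting = round(100 * buckets_seen / total_buckets, 2)
| table hostname buckets_seen total_buckets pct_reporting
```

The addinfo command supplies info_min_time and info_max_time, so the denominator follows whatever time range the search actually runs over.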

But possibly best when it comes to that sort of information might be just getting the forwarder's messages that say "I'm restarting" and "I've restarted".  THOSE would be easier to calculate a proper "downtime" from.  You can start here:

index=_internal source=*splunkd.log* ("shutdown complete" OR "Splunkd starting")

As long as your internal retention is long enough, that should get you each stop/start sequence - you'll have to then eval some fields, do some stats with a 'by host' and so on, but it should get close. 
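One way the eval-and-stats step could go, as a sketch and not a finished report: tag each event as a stop or a start, pair every start with the stop just before it on the same host, and add up the gaps. The state, prev_time, prev_state, downtime_secs, and uptime_pct field names are made up for this example. Hosts that never restarted in the window won't show up at all (read that as 100%), and a forwarder that crashed without logging a shutdown is not counted.

```
index=_internal source=*splunkd.log* ("shutdown complete" OR "Splunkd starting") earliest=-30d@d latest=@d
| eval state = if(like(lower(_raw), "%splunkd starting%"), "start", "stop")
| sort 0 host _time
| streamstats current=f window=1 last(_time) AS prev_time last(state) AS prev_state by host
| where state="start" AND prev_state="stop"
| eval downtime_secs = _time - prev_time
| stats sum(downtime_secs) AS downtime_secs by host
| addinfo
| eval uptime_pct = round(100 * (1 - downtime_secs / (info_max_time - info_min_time)), 2)
| table host downtime_secs uptime_pct
```

This measures how long the forwarder process was stopped between a clean shutdown and the next start, which is not the same thing as the host itself being down.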

You could also use the above by host and | collect it to a summary index to keep just that information around for longer than retention on _internal would normally allow.
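A minimal sketch of the summary-index idea, assuming a summary index named uf_restart_summary (a hypothetical name; the index has to be created first) and a scheduled search that runs once a day over the previous day:

```
index=_internal source=*splunkd.log* ("shutdown complete" OR "Splunkd starting") earliest=-1d@d latest=@d
| fields _time host _raw
| collect index=uf_restart_summary
```

The downtime calculation can then be pointed at the summary index instead of _internal, with that index's retention set to however long you need to report on.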

It will NOT tell you if the host was actually offline though. If you disconnect the network cable... well, you should try it and see what logs show up for that! I don't think it can tell you it disconnected, only that it reconnected, but maybe there's a "seconds I was unable to talk to you" in that message too. (Don't think so.)

Anyway, I'm sure that's a lot more words than you were expecting, but I wanted to explain the pitfalls of some of the first attempts people make for this sort of answer, since they don't often work that well.  🙂

Also happy to continue helping once you've explored a bit on your own, so if you get stuck ... post again in here!
