What is the Optimal Solution for Monitoring Host Uptime using OS Data Over a Long Period of Time?

_gkollias · ‎04-22-2015

I would like to graph by month/day of the week how many times we have restarted two servers in particular.

Rather than using _internal, I want to use the index=os data coming in for these two hosts, so that the search/graph can be shared across a large distribution and not just with the VIP Splunkers who have admin access (only two of us :-)).

The search I have will let us know when the hosts aren't forwarding data:

index=_internal (host=host01* OR host=host02*) earliest=-15m
| eval Age = (now() - _time)
| stats first(Age) as Age, first(_time) as LastTime by host
| convert ctime(LastTime) as "Last Active On"
| eval Status = case(Age < 180,"Running",Age >= 180,"DOWN")
| table host, Age, "Last Active On", Status

I want to be able to bucket the time and visualize it via a timechart to show how many times the server was "restarted" over time.

Any recommendations on a search that could help with this would be greatly appreciated.

Thanks in advance!

hcbomb · ‎04-22-2015

timechart only takes one type of field/value and plots it against time.

Your requirement sounds like nothing more than a binary UP/DOWN status.

Why not just run a count of events from the host? If there are no events within a one-minute period, that may indicate an outage (since your time window is only 15 minutes).

 index=_internal host=host01* OR host=host02*  earliest=-15m | timechart span=1m count by host

If you wanted to set alerts on this, you'd need to set some thresholds and throttling around the count value. Presuming scripts similar to those found in the Windows and Linux Add-ons that poll every 30s, you could write a similar script to track this. For a pretty simple setup, this should hopefully be enough for your needs!
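One caveat when alerting on a count: a host that sends nothing at all produces no row, so "count < N" never fires for it. A minimal sketch of an alert search that compares the most recent event time with now() instead, assuming the data lives in index=os and reusing the 180-second threshold from the original search (both the index and the threshold are assumptions to adjust):

```spl
| tstats latest(_time) as last_seen where index=os (host=host01* OR host=host02*) by host
| eval age = now() - last_seen
| where age > 180
| convert ctime(last_seen) as "Last Active On"
| table host, age, "Last Active On"
```

Run over a window such as the last 24 hours, this returns one row per host that has been silent for more than three minutes. A host with no events anywhere in the window would still be missing, so the window needs to be longer than any outage you expect to catch.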

_gkollias · ‎04-23-2015

Thanks - I see where you are going with this, but I actually want to go back a lot further than this to see how many times the servers may have been "restarted". Also, I am trying to use OS data from the UNIX/Linux app for this so that I can share the searches with a broad distribution. The search above really truncates when I run it over a long time range because of the volume of data being pulled up.

hcbomb · ‎05-01-2015

Yes, the search I provided isn't intended for a wholesale, all-time type of analysis. It's a snapshot for operations.

You are heading down the correct path in trying to leverage as much of the *nix app as possible. There's some good stuff in there in terms of searches, tags, and event types that will make your work simpler where it counts.

If it's purely restarts and those are logged properly, you should be able to find them in the splunkd logs.
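As a rough sketch of that idea: splunkd normally writes a startup line to splunkd.log each time it starts. The phrase "Splunkd starting" below is an assumption about that message text and should be checked against your own splunkd.log first. Note also that this counts restarts of the Splunk process, not necessarily reboots of the host, and that it reads index=_internal, so it is subject to the same access limits mentioned in the question.

```spl
index=_internal sourcetype=splunkd (host=host01* OR host=host02*) "Splunkd starting"
| timechart span=1d count by host
```

Because only the startup events are retrieved, this stays cheap even over several months.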

To deal with the size of the data for aggregation, you'll need to build tighter filters. I merely gave you somewhere to start, but if you want something more aggressive that fits your needs, you should be able to build that over time. Look for the specific types of logs you need. My search would have helped with that: if there are gaps in the logs, then the Splunk instance was likely down. It seems you'll need to look specifically for startup logs in that case.
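The "log gaps" idea can also be run against index=os over a long period without pulling every event back. The sketch below is one possible approach, not a tested solution: it reduces the data to one row per host per minute, then flags any silence longer than 180 seconds as one outage. The 30-day range, the one-minute buckets, and the 180-second threshold are all assumptions to tune for your polling intervals.

```spl
index=os (host=host01* OR host=host02*) earliest=-30d
| bin _time span=1m
| stats count by _time host
| streamstats current=f window=1 global=f last(_time) as prev_time by host
| eval gap = _time - prev_time
| where gap > 180
| timechart span=1d count by host
```

Each remaining row marks the minute in which a host started sending data again after a gap, so the timechart shows the number of gaps per day per host. A gap means the forwarder stopped sending, which may or may not be an actual reboot.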
