Splunk Search

How to report on how long a field equaled a specific value, and show the result as a percentage of time? (aka true uptime)

dbray_sd
Path Finder

Here is the sample set of data, simplified:

Aug  8 11:00:00 host=host1 status_code=UP
Aug  8 12:20:00 host=host1 status_code=UP
Aug  8 14:15:00 host=host1 status_code=UP
Aug  8 15:00:02 host=host1 status_code=DOWN
Aug  8 15:05:02 host=host1 status_code=DOWN
Aug  8 15:10:02 host=host1 status_code=DOWN
Aug  8 15:15:02 host=host1 status_code=UP
Aug  8 16:20:00 host=host1 status_code=UP
Aug  8 16:50:00 host=host1 status_code=UP

Basically, a monitor checks whether the host is up or down. If it notices the host is down, it rechecks within 5 minutes, and keeps rechecking until the host is UP again. It logs UP at irregular times throughout the day. My original thought for calculating uptime was this:

status_code=DOWN OR status_code=UP host_name="host1" | 
sort host_name | 
eval HOSTUP=if(status_code="UP",1,0) |  
eval HOSTDOWN=if(status_code="DOWN",1,0) | 
eval UPTIME=(HOSTUP/(HOSTUP+HOSTDOWN))*100 | 
eval DOWNTIME=(HOSTDOWN/(HOSTUP+HOSTDOWN))*100 | 
stats avg(UPTIME) AS UPTIME avg(DOWNTIME) as DOWNTIME by host_name |  
eval UPTIME=round(UPTIME,2) |  eval DOWNTIME =round(DOWNTIME ,2)

However, that is not true uptime over the search time range. It just averages the number of UPs against the number of DOWNs, which misrepresents true uptime. Because the UP events are logged at irregular intervals, that will not work.
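To make the problem concrete, here is a minimal Python sketch (not SPL) of what the count-based search effectively computes for the nine sample events above. The names `STATUSES` and `count_based_uptime` are hypothetical and exist only for this illustration.

```python
# Status values of the nine sample events, in log order.
STATUSES = ["UP", "UP", "UP", "DOWN", "DOWN", "DOWN", "UP", "UP", "UP"]


def count_based_uptime(statuses):
    """Percentage of events that are UP, ignoring how far apart they are."""
    up_events = sum(1 for status in statuses if status == "UP")
    return round(up_events / len(statuses) * 100, 2)


print(count_based_uptime(STATUSES))  # 66.67
```

Three DOWN rechecks logged within about ten minutes carry the same weight as three UP entries spread over hours, so the result reflects logging frequency rather than elapsed time.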

So, I need to figure out how to calculate how long (within the search time range) status_code equaled DOWN. From that, I could calculate the percentage of time the host was down, and thereby the uptime percentage. I was attempting some transaction searches, but I cannot seem to get the syntax correct. The end result should be something like: the host was down for 10 minutes out of the search window. If the window was "Last 24 hours", that is only about 0.69% (10 / 1440 ≈ 0.0069444) of the total minutes (1440) in the day.
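The arithmetic being asked for can be sketched in Python (not SPL) against the sample events. This is only an illustration under one stated assumption: each status holds from its event until the next logged event. The names `EVENTS`, `to_seconds`, `downtime_seconds`, and `percent_down` are hypothetical. Under that assumption the sample data gives 15 minutes of downtime (15:00:02 to 15:15:02), not the round 10 minutes used in the example above.

```python
# Sample events as (time of day, status); all fall on the same day.
EVENTS = [
    ("11:00:00", "UP"), ("12:20:00", "UP"), ("14:15:00", "UP"),
    ("15:00:02", "DOWN"), ("15:05:02", "DOWN"), ("15:10:02", "DOWN"),
    ("15:15:02", "UP"), ("16:20:00", "UP"), ("16:50:00", "UP"),
]


def to_seconds(hms):
    """Convert 'HH:MM:SS' to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def downtime_seconds(events):
    """Sum the time spent DOWN, assuming a status holds until the next event."""
    ordered = sorted((to_seconds(ts), status) for ts, status in events)
    total = 0
    for (start, status), (end, _next_status) in zip(ordered, ordered[1:]):
        if status == "DOWN":
            total += end - start
    return total


def percent_down(events, window_seconds):
    """Downtime as a percentage of the whole search window."""
    return round(downtime_seconds(events) / window_seconds * 100, 3)


print(downtime_seconds(EVENTS) / 60)     # 15.0 minutes
print(percent_down(EVENTS, 24 * 3600))   # 1.042 (percent of a 24-hour window)
```

The denominator is the length of the search window, not the span between the first and last event, which matches the goal of reporting downtime relative to the searched time range.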

Any suggestions?

1 Solution

sundareshr
Legend

Try this

status_code=DOWN OR status_code=UP host_name="host1"  | streamstats range(_time) as delta count reset_on_change=true by status_code | where count=1 | reverse | delta _time as duration | where status_code="DOWN" | table _time status_code duration

dbray_sd
Path Finder

Thank you. The above, mixed with some extra evals, gives exactly what I needed: a true uptime (and downtime) for the device. Here is the complete end result:

status_code=DOWN OR status_code=UP host_name="host1" |
addinfo |
streamstats range(_time) as delta count reset_on_change=true by status_code | 
where count=1 | 
reverse | 
delta _time as Duration | 
where status_code="DOWN" | 
eval TotalTime=(info_max_time - info_min_time) |
eval PercDown=round((Duration / TotalTime * 100),3) | 
eval PercUp=(100.000 - PercDown) | 
table host_name PercUp PercDown

There might be a better way, and there are probably some more tweaks needed to make this work with multiple hosts. But at least I'm one step closer.
