Splunk Search

How to report on how long a field equaled a specific value, and show the result as a percentage of time? (aka true uptime)

dbray_sd
Path Finder

Here is the sample set of data, simplified:

Aug  8 11:00:00 host=host1 status_code=UP
Aug  8 12:20:00 host=host1 status_code=UP
Aug  8 14:15:00 host=host1 status_code=UP
Aug  8 15:00:02 host=host1 status_code=DOWN
Aug  8 15:05:02 host=host1 status_code=DOWN
Aug  8 15:10:02 host=host1 status_code=DOWN
Aug  8 15:15:02 host=host1 status_code=UP
Aug  8 16:20:00 host=host1 status_code=UP
Aug  8 16:50:00 host=host1 status_code=UP

Basically, a monitor checks whether the host is up or down. If it finds the host down, it rechecks every 5 minutes until the host is UP again. UP events are logged at irregular times throughout the day. My original idea for calculating uptime was this:

status_code=DOWN OR status_code=UP host_name="host1" | 
sort host_name | 
eval HOSTUP=if(status_code="UP",1,0) |  
eval HOSTDOWN=if(status_code="DOWN",1,0) | 
eval UPTIME=(HOSTUP/(HOSTUP+HOSTDOWN))*100 | 
eval DOWNTIME=(HOSTDOWN/(HOSTUP+HOSTDOWN))*100 | 
stats avg(UPTIME) AS UPTIME avg(DOWNTIME) as DOWNTIME by host_name |  
eval UPTIME=round(UPTIME,2) |  eval DOWNTIME =round(DOWNTIME ,2)

However, that is not true uptime over the search time range. It only compares the number of UP events with the number of DOWN events, which misrepresents uptime. Because the UP events arrive at irregular intervals, counting events will not work.

So, I need to figure out how to calculate how long (within the search time range) the status_code equaled DOWN. From that, I could calculate the percentage of time the host was down, and from that the uptime percentage. I tried some transaction searches, but I cannot seem to get the syntax right. The end result should be something like: the host was down for 10 minutes of the search window. If the window were "Last 24 hours", that is only about 0.69% (10 / 1440 = 0.0069444) of the day's 1440 minutes.
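
To make the difference between the two calculations concrete, here is a minimal sketch in Python (not SPL) run against the sample events above. It assumes a DOWN period lasts from the first DOWN event of a run to the last DOWN event of that run, which matches the 10-minute figure; counting up to the next UP event would give 15 minutes instead. The year and the 24-hour window are assumptions, since the sample lines carry no year and no search range.

```python
from datetime import datetime
from itertools import groupby

# Sample events copied from the question.
RAW = """\
Aug  8 11:00:00 host=host1 status_code=UP
Aug  8 12:20:00 host=host1 status_code=UP
Aug  8 14:15:00 host=host1 status_code=UP
Aug  8 15:00:02 host=host1 status_code=DOWN
Aug  8 15:05:02 host=host1 status_code=DOWN
Aug  8 15:10:02 host=host1 status_code=DOWN
Aug  8 15:15:02 host=host1 status_code=UP
Aug  8 16:20:00 host=host1 status_code=UP
Aug  8 16:50:00 host=host1 status_code=UP
"""

ASSUMED_YEAR = 2000  # assumption: the log lines do not include a year


def parse(line):
    """Return (timestamp, status_code) for one sample line."""
    parts = line.split()
    stamp = f"{ASSUMED_YEAR} {parts[0]} {parts[1]} {parts[2]}"
    ts = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
    fields = dict(p.split("=", 1) for p in parts[3:])
    return ts, fields["status_code"]


events = [parse(line) for line in RAW.splitlines() if line.strip()]

# Count-based figure (what the first query computes): share of UP events.
up_count = sum(1 for _, status in events if status == "UP")
count_based_uptime = round(up_count / len(events) * 100, 2)

# Time-weighted figure: add up the span of each consecutive run of DOWN events.
down_seconds = 0.0
for status, run in groupby(events, key=lambda e: e[1]):
    run = list(run)
    if status == "DOWN":
        down_seconds += (run[-1][0] - run[0][0]).total_seconds()

WINDOW_SECONDS = 24 * 60 * 60  # assumption: a "Last 24 hours" search window
perc_down = round(down_seconds / WINDOW_SECONDS * 100, 3)
perc_up = round(100 - perc_down, 3)

print(count_based_uptime, down_seconds, perc_down, perc_up)
```

The count-based number lands far from the time-weighted one because six UP events and three DOWN events say nothing about how much time each state covered.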

Any suggestions?

1 Solution

sundareshr
Legend

Try this

status_code=DOWN OR status_code=UP host_name="host1"  | streamstats range(_time) as delta count reset_on_change=true by status_code | where count=1 | reverse | delta _time as duration | where status_code="DOWN" | table _time status_code duration


dbray_sd
Path Finder

Thank you. The above, combined with some extra evals, gives exactly what I needed: a true uptime (and downtime) figure for the device. Here is the complete end result:

status_code=DOWN OR status_code=UP host_name="host1" |
addinfo |
streamstats range(_time) as delta count reset_on_change=true by status_code | 
where count=1 | 
reverse | 
delta _time as Duration | 
where status_code="DOWN" | 
eval TotalTime=(info_max_time - info_min_time) |
eval PercDown=round((Duration / TotalTime * 100),3) | 
eval PercUp=(100.000 - PercDown) | 
table host_name PercUp PercDown

There might be a better way, and there are probably some more tweaks needed to make this work with multiple hosts. But at least I'm one step closer.
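
For the multiple-host case, the missing piece appears to be summing every DOWN period per host before dividing by the window, since the search above emits one row per DOWN period. The following is only a sketch of that logic in Python (not SPL). The function name `percent_down_by_host`, the tuple layout, and the sample values are hypothetical, and a DOWN period is again assumed to run from the first to the last DOWN event of a consecutive run.

```python
from collections import defaultdict
from itertools import groupby


def percent_down_by_host(events, window_start, window_end):
    """Return {host: percent of the window spent DOWN}.

    events: iterable of (host, epoch_seconds, status) tuples (hypothetical layout).
    window_start / window_end: search window bounds in epoch seconds,
    playing the role of info_min_time / info_max_time from addinfo.
    """
    total = window_end - window_start
    by_host = defaultdict(list)
    for host, ts, status in events:
        by_host[host].append((ts, status))

    result = {}
    for host, rows in by_host.items():
        rows.sort()  # oldest first within each host
        down = 0
        # Each consecutive run of DOWN events counts as one outage.
        for status, run in groupby(rows, key=lambda r: r[1]):
            run = list(run)
            if status == "DOWN":
                down += run[-1][0] - run[0][0]
        result[host] = round(down / total * 100, 3)
    return result


# Made-up events: host1 has two outages (600 s and 300 s), host2 has none.
sample = [
    ("host1", 0, "UP"), ("host1", 300, "DOWN"), ("host1", 900, "DOWN"),
    ("host1", 1200, "UP"), ("host1", 2000, "DOWN"), ("host1", 2300, "DOWN"),
    ("host1", 2600, "UP"),
    ("host2", 0, "UP"), ("host2", 3000, "UP"),
]
report = percent_down_by_host(sample, 0, 86400)
print(report)
```

In SPL terms this would likely mean grouping the streamstats step by host_name as well and finishing with a stats sum(Duration) by host_name, but that translation is untested here.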
