Splunk Search

Service down time stats

tmarlette
Motivator

Ok guys, I'm trying to figure out how to basically create a report of service down time durations.

Let's say I run the report for the past 48 hours; the report would bring up each instance, and the columns of the table would look like this:
outage start,outage stop,total duration (in minutes),host,service name

Let's say there were two instances where services were down, at two different times of the day.

The report would pull up both of them as individual rows within the table.

I'm pretty sure I'm going to be using buckets somehow, but I'm searching for the easiest way to pull up each of the 'down' instances and their durations in a table for a period of time.

To give you some more information, I am just looking for a 'State' change, from 'up' to 'down' and the duration of it until the next 'up' change. I have the field extracted within each event already.


I have tried the answer below, and it's almost what I'm after.

I'm trying to keep it as simple as possible, so basically I'm looking for the following fields for each 'outage':

start time, stop time, duration, service name.

This is the query I am using:

sourcetype=WMI:Service Name=<servicename>
| streamstats current=false last(State) as last_service_status last(_time) as time_of_change by Name
| where State!="last_service_status"
| eval outage=now()-time_of_change
| eval duration=strftime(outage, "%H:%M")
| rename State as current_service_status
| table time_of_change, Name, last_service_status, current_service_status, duration

and this is an image of the results.


Is there a way to peel these fields out into a table of the 'outages' and durations by service name?


somesoni2
SplunkTrust

Try this (add a rename at the end as needed):

your base search
| streamstats current=false last(State) as last_service_status last(_time) as time_of_down by Name,host
| where State!=last_service_status AND NOT State="Down"
| streamstats current=false last(_time) as time_of_up by Name,host
| where isnotnull(time_of_up)
| eval duration=time_of_up - time_of_down
| convert ctime(time_of_*)
| table host, Name, time_of_*, duration
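For anyone who wants to sanity-check the pairing logic outside Splunk, here is a minimal plain-Python sketch of what the query above computes: walk the status-change events per host/service in time order, remember when each service went "Down", and emit one outage row when it comes back "Up". The sample events, field layout, and names are invented for illustration, not taken from the actual data.

```python
def outages(events):
    """events: iterable of (time, host, name, state) tuples, sorted by time."""
    down_since = {}  # (host, name) -> time of the first "Down" event
    rows = []
    for t, host, name, state in events:
        key = (host, name)
        if state == "Down":
            # keep the earliest Down time even if we poll the same state repeatedly
            down_since.setdefault(key, t)
        elif key in down_since:
            start = down_since.pop(key)
            rows.append((host, name, start, t, t - start))  # duration in seconds
    return rows

sample = [
    (100, "web01", "Spooler", "Down"),
    (160, "web01", "Spooler", "Up"),
    (300, "web01", "Spooler", "Down"),
    (360, "web01", "Spooler", "Down"),   # repeated poll while still down
    (420, "web01", "Spooler", "Up"),
]
print(outages(sample))
# → [('web01', 'Spooler', 100, 160, 60), ('web01', 'Spooler', 300, 420, 120)]
```

Each completed outage becomes exactly one row, which is what the table in the question asks for.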

MuS
SplunkTrust

Hi tmarlette,

this will be tricky to answer without knowing the real data of your events, but I'll show you an example. Here I assume that the events contain the following data:

time, service_name, service_status

You should have a time field, some service name field, and at least one status field indicating whether the service is up or down.

Now we start some streamstats-Fu:

yourBaseSearchHere
| streamstats current=false last(service_status) as last_service_status last(_time) as time_of_change by service_name
| where service_status!=last_service_status
| eval outage=now()-time_of_change
| eval duration=strftime(outage, "%H:%M")
| rename service_status as current_service_status
| table time_of_change, service_name, last_service_status, current_service_status, duration

This will show a table with the time of each status change per service and how long the time between the status changes was, so you would get not only down times but up times as well.
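As a rough plain-Python illustration of what that streamstats step computes (with invented sample data, not the actual Splunk implementation): for each service, compare every event's status with the previous event's status and keep only the rows where it changed, together with the previous event's timestamp.

```python
from itertools import groupby

def status_changes(events):
    """events: list of (time, service_name, service_status), sorted by time.
    Returns rows where the status differs from the previous event's status
    for the same service."""
    changes = []
    keyfn = lambda e: e[1]
    # sorted() is stable, so events stay time-ordered within each service
    for service, group in groupby(sorted(events, key=keyfn), key=keyfn):
        prev = None
        for t, _, status in group:
            if prev is not None and status != prev[1]:
                # time_of_change = timestamp of the *previous* event,
                # mirroring streamstats current=false last(_time)
                changes.append((prev[0], service, prev[1], status))
            prev = (t, status)
    return changes

sample = [
    (100, "Spooler", "Up"),
    (160, "Spooler", "Down"),
    (220, "Spooler", "Down"),   # unchanged status, filtered out
    (280, "Spooler", "Up"),
]
print(status_changes(sample))
# → [(100, 'Spooler', 'Up', 'Down'), (220, 'Spooler', 'Down', 'Up')]
```

Note how the repeated "Down" poll drops out, leaving only the transitions.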

Don't nail me on the two evals for the time operations; it's just an example and you would have to adapt it to match your real-world events.
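If all you need is the seconds-to-H:MM conversion that the strftime eval is approximating, a plain divmod does it without any date functions (Python here purely for illustration):

```python
def hhmm(seconds):
    """Format a duration given in seconds as H:MM."""
    hours, rem = divmod(int(seconds), 3600)
    minutes = rem // 60
    return f"{hours}:{minutes:02d}"

print(hhmm(5400))  # → 1:30
print(hhmm(60))    # → 0:01
```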

Hope this helps to get you started ...

cheers, MuS

tmarlette
Motivator

I would, but this isn't quite answered yet. This looks like it's giving me the duration of each minute (likely because we poll once a minute). I think I still have to massage this a bit in order to get what I'm looking for. Every minute is too much data for the suits to look at, and I'm attempting to appease them.


MuS
SplunkTrust

Sure, this is done by using stats or chart instead of table; use this at the end in place of the table command:

| stats values(time_of_change) AS time_of_change values(last_service_status) AS last_service_status values(current_service_status) AS current_service_status values(duration) AS duration by service_name
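In plain-Python terms (again just an illustrative sketch with invented data), values(X) by-field collapses the per-change rows into one row per service, collecting the distinct values seen in each column:

```python
def collapse_by_service(rows):
    """rows: list of dicts with a 'service' key plus other columns.
    Returns one entry per service whose columns hold the sorted distinct
    values seen for that service (roughly like stats values(X) by service)."""
    out = {}
    for row in rows:
        svc = out.setdefault(row["service"], {})
        for col, val in row.items():
            if col != "service":
                svc.setdefault(col, set()).add(val)
    # convert the sets to sorted lists for stable, readable output
    return {s: {c: sorted(v) for c, v in cols.items()} for s, cols in out.items()}

rows = [
    {"service": "Spooler", "duration": 60},
    {"service": "Spooler", "duration": 120},
    {"service": "W32Time", "duration": 30},
]
print(collapse_by_service(rows))
# → {'Spooler': {'duration': [60, 120]}, 'W32Time': {'duration': [30]}}
```

That collapsing is why the per-minute rows disappear: each service ends up as a single summary row.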

btw: you're welcome, please tick the checkmark to mark this as answered


tmarlette
Motivator

Thank you sir! I will give this a shot. Your assumption about the event data is accurate; those are the only fields of interest for this exercise.
