I'm trying to capture the amount of time a particular host is idle. I can do that in a timechart easily enough but I can't get a summary of that data to work. I have the following:
index=jobs sourcetype=active_jobs | timechart span=1s distinct_count(jobId) as jobCount by host
Which works fine but includes all results. If I add a | where jobCount=0
then I get no results. If I remove the by host
but add the
| where jobCount=0
then I get the correct count but with out the ability to group by host.
What I'm trying to do is then summarize that data into something to say:
host1 had 100 seconds with no events
host2 had 496 seconds with no events
host3 had 126 seconds with no events
I've tried buckets, xyseries, chart, stats and I can't seem to get it right. Seems to me a |stats count by host
should do the trick but it results in 0 events
You are trying to count the times a dog did not bark.
Try this - For each job that runs, produce a record showing start time and end time. Then break that into two records, a +1 record for start and a -1 record for end. Next, use streamstats
or accum
to count the +1s and -1s, and filter for records where the accumulated count is <=0.
(your search that produces every job's startTime and endTime in epoch format for every host)
| eval mydata=startTime.",+1 ".endTime.",-1"
| table host mydata
| makemv mydata
| mvexpand mydata
| makemv delim="," mydata
| eval _time=mvindex(mydata,0)
| eval jobcount=mvindex(mydata,1)
| table _time host jobcount
Records are normally returned most recent first, so we need to sort them into order before we run our accum
. We include host
in the sort fields because it will make the streamstats
slightly more efficient.
| sort 0 host _time
| streamstats sum(jobcount) as netcount by host
Now, if any records have netcount<0, then it means you started at a point where a job was already running and caught the end of the job but not the start. That should not happen, but if it does you just fix your base search to only put out jobs for both the start and end time records. You can start a bit before the time frame you are interested in, and end a bit after, but you need to make sure the starts and ends are paired. Any idle times will therefore have a netcount of exactly 0.
However, before we filter for that, we need to know how long the netcount of 0 condition lasted. The delta
command gives us the difference between a record and the previous record, so we need to reverse the order so that the more recent record, that ended the idle duration, is seen first.
| reverse
| delta _time as duration
| eval duration=-duration
| where netcount<=0
The above would work for a single host. However, delta
doesn't have a by *fieldname*
option, so here's my workaround. The streamstats
command allows window
and by *fieldname*
, and the average of the current _time
and the prior _time
is halfway between the two, so twice the difference is the duration. This is the earlier event, so _time
is less than avgtime
.
| reverse
| streamstats window=2 global=t avg(_time) as avgtime by host
| eval duration=2*(avgtime-_time)
And now we're ready to get your stats...
| where netcount<=0
| stats sum(duration) as idletime by host
You are trying to count the times a dog did not bark.
Try this - For each job that runs, produce a record showing start time and end time. Then break that into two records, a +1 record for start and a -1 record for end. Next, use streamstats
or accum
to count the +1s and -1s, and filter for records where the accumulated count is <=0.
(your search that produces every job's startTime and endTime in epoch format for every host)
| eval mydata=startTime.",+1 ".endTime.",-1"
| table host mydata
| makemv mydata
| mvexpand mydata
| makemv delim="," mydata
| eval _time=mvindex(mydata,0)
| eval jobcount=mvindex(mydata,1)
| table _time host jobcount
Records are normally returned most recent first, so we need to sort them into order before we run our accum
. We include host
in the sort fields because it will make the streamstats
slightly more efficient.
| sort 0 host _time
| streamstats sum(jobcount) as netcount by host
Now, if any records have netcount<0, then it means you started at a point where a job was already running and caught the end of the job but not the start. That should not happen, but if it does you just fix your base search to only put out jobs for both the start and end time records. You can start a bit before the time frame you are interested in, and end a bit after, but you need to make sure the starts and ends are paired. Any idle times will therefore have a netcount of exactly 0.
However, before we filter for that, we need to know how long the netcount of 0 condition lasted. The delta
command gives us the difference between a record and the previous record, so we need to reverse the order so that the more recent record, that ended the idle duration, is seen first.
| reverse
| delta _time as duration
| eval duration=-duration
| where netcount<=0
The above would work for a single host. However, delta
doesn't have a by *fieldname*
option, so here's my workaround. The streamstats
command allows window
and by *fieldname*
, and the average of the current _time
and the prior _time
is halfway between the two, so twice the difference is the duration. This is the earlier event, so _time
is less than avgtime
.
| reverse
| streamstats window=2 global=t avg(_time) as avgtime by host
| eval duration=2*(avgtime-_time)
And now we're ready to get your stats...
| where netcount<=0
| stats sum(duration) as idletime by host
I ended up solving this in a different way by essentially inserting a record basically saying there's no data. It was much simpler in the end.
Your solution is very cool and I'll mark is as the answer. I'm going to try it on some other data when I get time.
All of the above is just a method to calculate that record that you've inserted. If you have a simpler way to do that, then please post it as an answer to the question and accept it instead. I love learning new algorithms.
Hmmm. Okay, if you don't care about subsecond intervals, then here's another way...
index=jobs sourcetype=active_jobs
| bin _time span=1s
| stats dc(jobId) as jobCount by _time host
| appendpipe
[| stats values(host) as host
| addinfo
| eval TheStart=strftime(info_min_time,"%m/%d/%Y:%H:%M:%S")
| eval TheEnd=strftime(info_max_time,"%m/%d/%Y:%H:%M:%S")
| map search="| gentimes start=$TheStart$ end=$TheEnd$ increment=1s
| eval _time=starttime
| table _time
| eval jobCount=0
| mvexpand host "
]
| stats sum(jobCount) as jobCount by _time host
| where (jobCount = 0) AND (_time < info_max_time)
| stats count as IdleSeconds by host
The data comes from a scripted input calling an API at a set interval. I'm simply inserting a record akin to "no record" if nothing is found. So I'm totally cheating and not using a Splunk method. That's why I left your answer as is. I won't always be able to do that and calculating where data is NOT is something that's come up for me in the past as well.