Solved: Re: Count seconds with no events by host

jercra · ‎04-05-2017

I'm trying to capture the amount of time a particular host is idle. I can do that in a timechart easily enough but I can't get a summary of that data to work. I have the following:

index=jobs sourcetype=active_jobs | timechart span=1s distinct_count(jobId) as jobCount by host

Which works fine but includes all results. If I add a | where jobCount=0 then I get no results. If I remove the by host but add the | where jobCount=0 then I get the correct count but with out the ability to group by host.

What I'm trying to do is then summarize that data into something to say:
host1 had 100 seconds with no events
host2 had 496 seconds with no events
host3 had 126 seconds with no events

I've tried buckets, xyseries, chart, stats and I can't seem to get it right. Seems to me a |stats count by host should do the trick but it results in 0 events

DalJeanis · ‎04-05-2017

You are trying to count the times a dog did not bark.

Try this - For each job that runs, produce a record showing start time and end time. Then break that into two records, a +1 record for start and a -1 record for end. Next, use streamstats or accum to count the +1s and -1s, and filter for records where the accumulated count is <=0.

(your search that produces every job's startTime and endTime in epoch format for every host)
| eval mydata=startTime.",+1 ".endTime.",-1"
| table host mydata
| makemv mydata
| mvexpand mydata
| makemv delim="," mydata
| eval _time=mvindex(mydata,0)
| eval jobcount=mvindex(mydata,1)
| table _time host jobcount

Records are normally returned most recent first, so we need to sort them into order before we run our accum. We include host in the sort fields because it will make the streamstats slightly more efficient.

| sort 0 host _time
| streamstats sum(jobcount) as netcount by host

Now, if any records have netcount<0, then it means you started at a point where a job was already running and caught the end of the job but not the start. That should not happen, but if it does you just fix your base search to only put out jobs for both the start and end time records. You can start a bit before the time frame you are interested in, and end a bit after, but you need to make sure the starts and ends are paired. Any idle times will therefore have a netcount of exactly 0.

However, before we filter for that, we need to know how long the netcount of 0 condition lasted. The delta command gives us the difference between a record and the previous record, so we need to reverse the order so that the more recent record, that ended the idle duration, is seen first.

| reverse
| delta _time as duration
| eval duration=-duration
| where netcount<=0

The above would work for a single host. However, delta doesn't have a by *fieldname* option, so here's my workaround. The streamstats command allows window and by *fieldname*, and the average of the current _time and the prior _time is halfway between the two, so twice the difference is the duration. This is the earlier event, so _time is less than avgtime.

| reverse
| streamstats window=2 global=t avg(_time) as avgtime by host
| eval duration=2*(avgtime-_time)

And now we're ready to get your stats...

| where netcount<=0
| stats sum(duration) as idletime by host

View solution in original post

DalJeanis · ‎04-05-2017

You are trying to count the times a dog did not bark.

Try this - For each job that runs, produce a record showing start time and end time. Then break that into two records, a +1 record for start and a -1 record for end. Next, use streamstats or accum to count the +1s and -1s, and filter for records where the accumulated count is <=0.

(your search that produces every job's startTime and endTime in epoch format for every host)
| eval mydata=startTime.",+1 ".endTime.",-1"
| table host mydata
| makemv mydata
| mvexpand mydata
| makemv delim="," mydata
| eval _time=mvindex(mydata,0)
| eval jobcount=mvindex(mydata,1)
| table _time host jobcount

Records are normally returned most recent first, so we need to sort them into order before we run our accum. We include host in the sort fields because it will make the streamstats slightly more efficient.

| sort 0 host _time
| streamstats sum(jobcount) as netcount by host

Now, if any records have netcount<0, then it means you started at a point where a job was already running and caught the end of the job but not the start. That should not happen, but if it does you just fix your base search to only put out jobs for both the start and end time records. You can start a bit before the time frame you are interested in, and end a bit after, but you need to make sure the starts and ends are paired. Any idle times will therefore have a netcount of exactly 0.

However, before we filter for that, we need to know how long the netcount of 0 condition lasted. The delta command gives us the difference between a record and the previous record, so we need to reverse the order so that the more recent record, that ended the idle duration, is seen first.

| reverse
| delta _time as duration
| eval duration=-duration
| where netcount<=0

The above would work for a single host. However, delta doesn't have a by *fieldname* option, so here's my workaround. The streamstats command allows window and by *fieldname*, and the average of the current _time and the prior _time is halfway between the two, so twice the difference is the duration. This is the earlier event, so _time is less than avgtime.

| reverse
| streamstats window=2 global=t avg(_time) as avgtime by host
| eval duration=2*(avgtime-_time)

And now we're ready to get your stats...

| where netcount<=0
| stats sum(duration) as idletime by host

jercra · ‎04-06-2017

I ended up solving this in a different way by essentially inserting a record basically saying there's no data. It was much simpler in the end.

Your solution is very cool and I'll mark is as the answer. I'm going to try it on some other data when I get time.

DalJeanis · ‎04-06-2017

All of the above is just a method to calculate that record that you've inserted. If you have a simpler way to do that, then please post it as an answer to the question and accept it instead. I love learning new algorithms.

Hmmm. Okay, if you don't care about subsecond intervals, then here's another way...

index=jobs sourcetype=active_jobs 
| bin _time span=1s
| stats  dc(jobId) as jobCount by _time host 
| appendpipe 
    [| stats values(host) as host 
     | addinfo 
     | eval TheStart=strftime(info_min_time,"%m/%d/%Y:%H:%M:%S")
     | eval TheEnd=strftime(info_max_time,"%m/%d/%Y:%H:%M:%S") 
     | map search="| gentimes start=$TheStart$ end=$TheEnd$ increment=1s 
                   | eval _time=starttime 
                   | table _time 
                   | eval jobCount=0 
                   | mvexpand host "
     ]
| stats  sum(jobCount) as jobCount by _time host 
| where (jobCount = 0) AND (_time < info_max_time)
| stats count as IdleSeconds by host

jercra · ‎04-06-2017

The data comes from a scripted input calling an API at a set interval. I'm simply inserting a record akin to "no record" if nothing is found. So I'm totally cheating and not using a Splunk method. That's why I left your answer as is. I won't always be able to do that and calculating where data is NOT is something that's come up for me in the past as well.

Count seconds with no events by host

Detecting Remote Code Executions With the Splunk Threat Research Team

Observability | Use Synthetic Monitoring for Website Metadata Verification

More Ways To Control Your Costs With Archived Metrics | Register for Tech Talk