If I want to see whether an issue has been happening for at least a set period of time, how would I go about asking Splunk for at least X time between events?
For example, suppose I want to make sure the issue has been happening for at least 48 hours. If I had one event show up 50 hours ago and another in the last minute, that would be a hit.
This assumes that the search's time window is large enough to cover the difference (a 100-hour window run once an hour would catch the above...).
Can this be done?
I need to test both suggestions. I will probably do that over the weekend. Sorry for the delay.
Like this:
YOUR BASE SEARCH FOR EVENTS HERE
| streamstats current=f last(_time) AS next_time BY host
| eval delta=next_time - _time
| where delta < (48 * 60 * 60)
This took a list of 4 hosts and brought it down to a list of one host. I think that at least one of the hosts had been having issues for more than 48 hours. I will need to verify, though. All of the hosts have had a number of events.
Without the time check, there are 3 hosts that match the criteria. When I use the time check above, the list shrinks to just one, even though 2 hosts have events that are more than 48 hours apart. If the issue that causes the event is intermittent, could removing the current option cause the result to come back as a false negative? (I'm not sure how current plays into the equation.)
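For what it's worth, current=f tells streamstats to exclude the current event from its calculation, so last(_time) returns the previous event's timestamp rather than the current one; with the default current=t, next_time would just equal _time and delta would always be zero. A small run-anywhere sketch to see it in action (field names here are illustrative, not from your data):
| makeresults count=3
| streamstats count as recno
| eval _time=now() - 3600*recno
| streamstats current=f last(_time) as next_time
| eval delta=next_time - _time
| table _time next_time delta
The first row has no next_time (there is no prior event), and each later row shows delta=3600, the gap back to the event before it.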
Okay, the answer is going to depend on how well you can define two statements:
This event shows that an issue exists (for a particular host)
This event shows that an issue does NOT exist (for a particular host)
I'm assuming for the example that the issue is a problem with a host computer, but the concept is the same if the issue is for a user, an ice cream freezer, a traffic camera, or the oil level in your lawnmower. (Hey, it's the IoT; we're getting there.)
| makeresults count=24
| eval host=mvappend("Host1","Host2")
| mvexpand host
| eval status=if(random()%41+random()%29>20,"down","up")
| streamstats count as recno by host
| eval _time=relative_time(now(),"-2d@d")+7200*recno
| table _time host status
| rename COMMENT as "The above just generates random test data every 2 hours for the last 2 days"
| rename COMMENT as "with either host=Host1 or host=Host2, and either status=up or status=down"
| sort 0 _time host
| streamstats current=f last(status) as priorstatus last(_time) as priortime by host
| eval statuschange=if(status=coalesce(priorstatus,"nothing"),0,1)
| eval priortime=coalesce(priortime,_time)
| streamstats sum(statuschange) as statusgroup by host
| rename COMMENT as "Use this one if you want to assume the first reading in this group started immediately after the last reading in the prior group"
| stats count as readingcount max(_time) as _time min(priortime) as starttime by host status statusgroup
| eval duration=_time-coalesce(starttime,_time)
| rename COMMENT as "Use this one if you want to use the actual first reading in this group as the start time"
| rename COMMENT as "| stats count as readingcount max(_time) as _time min(_time) as starttime range(_time) as duration by host status statusgroup"
| eval starttime=strftime(starttime,"%Y-%m-%d %H:%M:%S")
| eval endtime=strftime(_time,"%Y-%m-%d %H:%M:%S")
| eval duration=tostring(duration,"duration")
| sort 0 _time host
| table _time host status starttime endtime duration
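If you want to wire the run-anywhere example above into the original 48-hour requirement, one option (just a sketch) is a where clause slotted in right after the duration eval, before duration gets reformatted into a display string:
| where status="down" AND duration >= (48 * 60 * 60)
| rename COMMENT as "keep only down stretches that have lasted at least 48 hours"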
Yes, the event is an intermittent error that happens at random intervals. I am attempting to make sure that it is a 'longer-term issue', so I want to make sure it has been happening for at least 48 hours before triggering the alert. (It is a minor issue, so it can wait.)
So, in this scenario, 'bad_thing' will happen on a host at random times throughout the day; it will be fine for a while, and then happen again. I am OK with this triggering for one event and then another happening 48.1 hours later, for a total of two events.
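Given that clarification (any two events at least 48 hours apart inside the search window should fire), a plain span check per host may be all you need. This is only a sketch; it assumes the events carry a host field and that the search runs over a window wider than 48 hours, such as the 100-hour window mentioned in the question:
YOUR BASE SEARCH FOR bad_thing EVENTS HERE
| stats earliest(_time) as first_seen latest(_time) as last_seen count by host
| where last_seen - first_seen >= (48 * 60 * 60)
Any host whose oldest and newest events in the window are at least 48 hours apart survives the where clause, which matches the "one event 50 hours ago plus one in the last minute" case.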