
Averages skewed by empty time buckets

seanel
Path Finder

When creating a report of the max count/minute and average count/minute by host for a specific error, there seems to be no easy way to get Splunk to calculate an average that includes all the minutes where count=0.

So the base search looks something like:

index=abcd Environment=PROD  "Error Code XXXX" | bucket _time span=1m | stats count latest(Cluster) as Cluster by _time,host

Let's keep it simple and say I choose the last 30 days as my timescale, and the only errors were on 1 host on 1 day for a few minutes - perhaps 50 errors per minute for 3 minutes.

Then a later calculation of max(count) will report correctly of course, but if I do stats avg(count) for the 30 days, the answer Splunk gives is 50/minute, which is totally wrong, as it only calculates using the minutes where it found any events. It should be more like 0.0035/minute (150 total / 43,200 minutes in 30 days).

So the problem is that avg does not include all the empty time buckets.
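
Put concretely, the full pipeline I am describing is something like this (the field names after the final stats are just illustrative), and over those 30 days it reports an avg of 50/minute instead of roughly 0.0035:

index=abcd Environment=PROD "Error Code XXXX" | bucket _time span=1m | stats count latest(Cluster) as Cluster by _time,host | stats max(count) as maxPerMin avg(count) as avgPerMin by host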

I could change the search and simply do a count by error type per minute for all hosts, but this is a MUCH heavier search (a few million events as opposed to a few thousand in my search, or 12 minutes to run as opposed to 8 seconds for my version).
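
(That heavier version would look roughly like the following, where ErrorCode stands in for whichever field actually carries the error type in my data:)

index=abcd Environment=PROD | bucket _time span=1m | stats count by _time,ErrorCode,host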

I have also tried using timechart with span=1m, but even this will only provide stats starting at the first minute where count>0, which could be the last day of the month and again skew the results badly.

It seems strange that the logic of the avg function does not take into account the range of the search.


woodcock
Esteemed Legend

Use stats to drop time intervals and use timechart to keep them:

index=abcd Environment=PROD  "Error Code XXXX" | timechart span=1m count latest(Cluster) AS Cluster BY host

sideview
SplunkTrust

bucket _time followed by stats count by _time foo followed by makecontinuous _time can come close. makecontinuous will create buckets to fill in internal gaps, but not if there are gaps on the end of the data. You could of course fabricate buckets and conditionally eval for them the starttime and the endtime of the timerange respectively. But I think that's overkill here.
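
For completeness, that fabrication idea might look roughly like this - an untested sketch that leans on addinfo for the timerange boundaries, assumes the timerange has a finite latest time, drops the per-host split to keep it short, and ignores the corner case where a padded row lands in an already-populated bucket:

index=abcd Environment=PROD "Error Code XXXX" | bucket _time span=1m | stats count by _time | appendpipe [ stats count | addinfo | eval _time=info_min_time, count=0 | table _time count ] | appendpipe [ stats count | addinfo | eval _time=info_max_time, count=0 | table _time count ] | bucket _time span=1m | makecontinuous _time span=1m | fillnull value=0 count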

Better I think is to use timechart and then untable to "stats-ify" the timechart output.

index=abcd Environment=PROD  "Error Code XXXX" | timechart span=1min count by host | untable _time host count | stats avg(count) as avgCountPerMin by host 

Timechart, not surprisingly, is really good at times. And bucket and stats and makecontinuous come kinda close, but in the end timechart is usually the best tool (sometimes with a bit of untable!)

You can even do this if timechart is carrying along both your count and some second computed statistic and still using a split-by field, although you need some eval to unpack the second computed statistic. Here's a somewhat nutty example. Run it pipe by pipe to see what's going on. You end up with a full set of buckets, it's in "stats form", it's split by host, and it has computed both the count and the max() of a particular field:

index=_internal | head 50000  | timechart count max(eps) by host | untable _time field_and_host value | eval field=mvindex(split(field_and_host,":"),0) | eval host=mvindex(split(field_and_host,":"),1) | eval {field}=value | fields - field_and_host field value | stats values(*) as * by host _time

seanel
Path Finder

Fantastic answers, but still there is a basic timechart problem.

If I search for the last 7 days, but the first example of a count>0 is 3am two days ago... then that's when I start to get zero values. There simply are no buckets created for the 5 days of data before that.

That's the example I am staring at right now, and the results of avg reflect the problem, with one server showing an avg of over 2 per minute when the total errors for that server are 483 over the entire week (so the actual avg per minute over the week should be more like 483 / 10,080 minutes ≈ 0.048).


sideview
SplunkTrust

That's very strange, and I think it may actually have to do with some specifics in your search. Can you post your entire search? The timechart command definitely should have the full set of empty buckets. Except in the "all time" case, where it has always had the "shrink-wrap" behavior, timechart will always create empty time buckets on the end as appropriate.

E.g. if you run this search, you should see a full 24 hours of 30-minute buckets:

index=_internal | head 1 | timechart count


jeffland
SplunkTrust

As a temporary solution, you could run a subsearch that creates dummy data (and thus timestamps for the missing data points). See the answer here.

I must admit I don't like the idea of a subsearch; I'm still trying to figure out a nicer way to do this.


seanel
Path Finder

@jeffland
Yeah - I have seen these kinds of answers too, but they quickly become too complex when a bunch of grouping needs to be done as well.

Seems like an oversight in the logic of the function to me. As I say in my comment above, perhaps the simplest fix is to calculate the number of minutes between earliest and latest, but I can't for the life of me find out how to refer to those in the search... I have tried earliest, earliestTime, searchEarliestTime, indexEarliestTime, etc., but nothing works.


jeffland
SplunkTrust

Quick update: a simpler way to do it is to use timechart, which magically does all you want. See this example:

index=_internal group=per_source_thruput | timechart span=1m count(group) AS c | stats avg(c)

No matter how many empty 1-minute timeslots there are, the average calculated at the end is the actual average over all of them. Now all that's left to do is figure out what timechart does that cannot be done manually. I've so far been able to create an empty event at a given earlier moment by using

| append [ | stats count | fields - count | eval _time=relative_time(now(), "-5h")]

This would create an event timestamped five hours ago. I still need to figure out why makecontinuous after that doesn't fill the gap until the first real event.


seanel
Path Finder

Been there too - even with timechart, the first result is not until the first minute where count>0.

For example, I just chose -30d and the first result comes from 5/5/2015.

It is the same with | makecontinuous _time, which does the same thing; and in the example I have, where I am also reporting by cluster, all those cluster values are blank where the count is not >0, meaning I can't, for example, get an average per cluster.


jeffland
SplunkTrust

Oh yeah, I also just noticed that timechart is limited to 1000 bins. You could of course increase that limit, but then again this is not really a solution to the actual problem.


jeffland
SplunkTrust

Phew. I just noticed that makecontinuous does work, but only when you place it at the end of the search (not before the bucket command, which I felt would be the right place). Maybe this should be mentioned in the docs somewhere, or am I the only one who would want to first make something continuous and then discretize/bucket it, not the other way round?

Anyway, see this example:

index=_internal group=per_source_thruput | append [ | stats count | fields - count | eval _time=relative_time(now(), "-30d")] | bucket _time span=10m | chart count(group) as c by _time | makecontinuous _time span=10m | fillnull c value=0

I just ran it over the last 30 days and there are indeed results from 30 days ago, and because that was a Sunday and this machine didn't run that day, it starts with 0 as the value for c.
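
From there, to get back to the original question, the average over every one of those buckets should presumably just be one more stats on the end, e.g.:

index=_internal group=per_source_thruput | append [ | stats count | fields - count | eval _time=relative_time(now(), "-30d")] | bucket _time span=10m | chart count(group) as c by _time | makecontinuous _time span=10m | fillnull value=0 c | stats avg(c) as avgPer10Min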

seanel
Path Finder

Additional question which may provide a solution... can I reference the earliest/latest input to the search in an eval, to calculate how many minutes are in my timescale? I tried all kinds of things like earliestTime/latestTime and various variations I found in the job inspector, but found nothing that worked.
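
Something along these lines is what I am imagining - purely a sketch, on the assumption that the addinfo command really does add info_min_time and info_max_time fields reflecting the search timerange (I have not got this working):

index=abcd Environment=PROD "Error Code XXXX" | stats count as total by host | addinfo | eval minutes=round((info_max_time - info_min_time) / 60) | eval avgPerMin=total / minutes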
