Incorrect per_second results on sparse data sets

beaumaris · ‎04-27-2011

We have a report that shows bandwidth over time. The data comes from a summary index that counts the total number of bytes (as NumBytes) transferred for each Server. The summary search runs every hour, so we wind up with 1 summary index entry for each Server, timestamped at the hour boundaries. The search that produces the timechart looks like this:

index="summary" report="bandwidth_by_service_hourly" | stats sum(NumBytes) as TotalBytes by _time,Server | eval TotalMbits=(TotalBytes*8)/1024/1024 |  timechart limit=0 per_second(TotalMbits) by Provider

The idea here is to convert bytes into megabits and then use per_second() to show the rate. This works well except on small datasets. The problem we uncovered today occurs when we run the report over only a 1-day interval. Splunk autosizes the x-axis to display 48 data points, one per 30-minute increment. However, because our data is summarized on 1-hour boundaries, Splunk takes 1 hour's worth of bytes and divides it by (30 min x 60 sec/min) instead of (60 min x 60 sec/min), which effectively doubles the real bandwidth we are trying to show.
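The doubling can be checked with plain arithmetic. The Python sketch below is only an illustration of the division described above, not of Splunk's internals; the byte count and the helper names (mbits, rate_per_second) are made up for the example.

```python
# Hypothetical figure for illustration; not taken from the thread.
HOURLY_BYTES = 450_000_000  # bytes one Server transferred in one hour (assumed)

def mbits(num_bytes):
    """Convert bytes to megabits the way the search does: *8 / 1024 / 1024."""
    return num_bytes * 8 / 1024 / 1024

def rate_per_second(total_mbits, span_seconds):
    """Divide a bucket's total by the bucket span, as described for per_second()."""
    return total_mbits / span_seconds

# The hourly total really covers 60 minutes of traffic.
true_rate = rate_per_second(mbits(HOURLY_BYTES), 60 * 60)

# On a 1-day chart the bucket span is 30 minutes, so the same total is
# divided by half as many seconds.
charted_rate = rate_per_second(mbits(HOURLY_BYTES), 30 * 60)

print(f"true rate:    {true_rate:.4f} Mbit/s")
print(f"charted rate: {charted_rate:.4f} Mbit/s")
print(f"ratio:        {charted_rate / true_rate:.1f}x")
```

Whatever byte count is plugged in, the ratio depends only on the two spans: 3600 s / 1800 s.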

We do not want to put a span=1h on the timechart because if we set the TimeRange to more than a single day, this will prevent Splunk from rolling up the data properly, resulting in a very spiky graph with points at 1-hour intervals (pretty ugly for 30/60/90 day reports, etc).

Looking for suggestions on how to handle both the small dataset and the full time ranges we need to support. Thanks,

Tom E

gkanapathy · ‎04-27-2011

Yeah. per_second (and per_hour and per_minute) are kind of bad that way.

Since you know that your data is grouped per hour (hourly summary), and there's an equal number in every time interval, I'd just calculate your per second number and just display it instead of having timechart per_second() (fail to) compute it:

index="summary" report="bandwidth_by_service_hourly" 
| stats sum(NumBytes) as TotalBytes by _time,Server 
| eval TotalMbits=(TotalBytes*8)/1024/1024 
| eval TotalMbitsPerSec=TotalMbits/3600
|  timechart limit=0 avg(TotalMbitsPerSec) as TotalMBitsPerSec by Server

Or more concisely:

index="summary" report="bandwidth_by_service_hourly" 
| timechart limit=0 avg(eval(NumBytes*8/1024/1024/3600)) as TotalMBitsPerSec by Server
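To see why averaging a precomputed rate is less sensitive to the chart's bucket size than a per-bucket total, here is a rough Python sketch. It assumes exactly one summary entry per Server per hour and a constant, invented byte count; chart_points is a hypothetical helper that only mimics bucketing and is not a Splunk API.

```python
from statistics import mean

SECONDS_PER_HOUR = 3600

def hourly_rate_mbits(num_bytes):
    """Mbit/s for one hourly summary entry: bytes*8/1024/1024/3600."""
    return num_bytes * 8 / 1024 / 1024 / SECONDS_PER_HOUR

def chart_points(rates, hours_per_bucket, agg):
    """Group consecutive hourly rates into buckets and aggregate each bucket."""
    return [
        agg(rates[i:i + hours_per_bucket])
        for i in range(0, len(rates), hours_per_bucket)
    ]

# Assumed data: 48 hourly entries for one Server, all with the same byte count.
hourly_bytes = [450_000_000] * 48
rates = [hourly_rate_mbits(b) for b in hourly_bytes]

for hours_per_bucket in (1, 4, 24):
    avg_point = chart_points(rates, hours_per_bucket, mean)[0]
    sum_point = chart_points(rates, hours_per_bucket, sum)[0]
    print(f"{hours_per_bucket:>2}h bucket: avg={avg_point:.4f}  sum={sum_point:.4f}")
```

Under these assumptions the averaged value stays the same as the bucket widens, while a summed value grows with the number of hourly entries that land in each bucket.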

beaumaris · ‎04-27-2011

That does not seem to work. I tried the more concise version, and instead of being 2x the correct value, the results are now 1/2 the correct value. I tried using sum() instead of avg(), and that does yield the correct value for this test dataset. Does that make sense, and if so, do you think it will hold up on a much larger dataset over a longer TimeRange?

gkanapathy · ‎04-27-2011

I suppose I'm assuming that NumBytes comes out as zero when there isn't a measurement; otherwise your average will be off. If that's not the case, you can make it zero by using coalesce(NumBytes,0).
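The effect of treating a missing measurement as zero can be sketched in Python. This is an illustration under assumptions: the rates are invented, None stands in for an entry whose NumBytes is missing, and the local coalesce helper only imitates the idea of coalesce(NumBytes,0). It says nothing about hours for which no summary entry exists at all.

```python
from statistics import mean

def coalesce(value, default):
    """Return value unless it is None, imitating the idea of coalesce(NumBytes,0)."""
    return default if value is None else value

# Hypothetical per-hour rates in one chart bucket; None marks a missing measurement.
rates = [1.0, None, 1.0, None]

# Averaging only the values that exist ignores the idle hours.
skipping_gaps = mean([r for r in rates if r is not None])

# Filling gaps with zero lets the idle hours pull the average down.
gaps_as_zero = mean([coalesce(r, 0.0) for r in rates])

print(f"average skipping gaps: {skipping_gaps}")
print(f"average with gaps = 0: {gaps_as_zero}")
```

Which of the two averages is the right one depends on whether an hour without a measurement really means no traffic.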
