At a set interval (30 minutes while testing), a scheduled search computes the min, max, and mean values of some perf counters. Those values are sent to a summary index, and this is where strange things start to happen.
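For reference, the setup is essentially the following. This is a minimal sketch, assuming Windows Perfmon inputs; the index, search, and field names are placeholders rather than our real ones:

    # savedsearches.conf (placeholder names)
    [perf_counter_summary]
    enableSched = 1
    cron_schedule = */30 * * * *
    dispatch.earliest_time = -30m@m
    dispatch.latest_time = @m
    action.summary_index = 1
    action.summary_index._name = perf_summary
    search = index=perf sourcetype=Perfmon:* | sistats min(Value) max(Value) avg(Value) by counter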
In the live data indexes, those perf counters keep coming in regularly; nothing is missing. If I run the summary search manually, I always get the right data, but when the Splunk scheduler runs it, data lands in the summary index in an erratic manner. Here is the disturbing "pattern", as measured with a lag search like the one sketched after this list:
- Initially, summary data was less than a day "late": the last 30-minute samples would show up in the summary index roughly 18 hours late.
- A few days later, it was around two days late.
- Still later, it was four days late.
- All of a sudden, with no change on our side, most data was 4+ days late, with an isolated "peak" that was only around 15 hours late.
- Now it is, at best, two days late again.
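To put numbers on that lateness, I compare each summary event's timestamp to the time it was actually indexed. A sketch, assuming the placeholder summary index name from above:

    index=perf_summary
    | eval lag_hours = round((_indextime - _time) / 3600, 1)
    | timechart span=1h max(lag_hours) avg(lag_hours)

In a healthy setup the lag should stay close to the 30-minute schedule; here it would show the multi-day delays described above.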
When I set this up in the lab environment, I had no issues, and it ran exactly as intended. However, with the exact same mechanism set up in production, we get this strange behavior.
To be clear, this is not a matter of backfilling older data; the "live" data is available.
I've seen several similar issues reported, including one that recommends to "delete the summary data for the time frame and then use back filling instead".
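That is not what I'm after here, but for completeness, the backfill being recommended there is (as far as I understand it) the fill_summary_index.py script shipped with Splunk, roughly along these lines, with placeholder app and search names:

    splunk cmd python fill_summary_index.py -app search -name perf_counter_summary -et -4d -lt now -dedup true -auth admin:changeme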
One last point: the scheduler itself also seems to behave erratically and does not respect the configured schedule, neither the frequency nor the time window in which it should run. Even when it does run roughly on schedule, the data it inserts into the summary index is still way late.
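To see what the scheduler actually did, I check its internal log; a sketch, again with the placeholder search name:

    index=_internal sourcetype=scheduler savedsearch_name="perf_counter_summary"
    | eval scheduled=strftime(scheduled_time, "%F %T"), dispatched=strftime(dispatch_time, "%F %T")
    | table _time, scheduled, dispatched, run_time, status

Comparing scheduled_time against dispatch_time should show whether runs are being skipped or deferred.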
This is happening on Splunk Enterprise 6.3.3.