With dc(mykey) as DC1, I can plot how many distinct values of mykey occur in each fixed time span. If values of mykey never repeated over time, accum DC1 as DC_accum would give me a cumulative count of distinct values of mykey over time. But that would be too trivial. In most practical cases, mykey values partially repeat. How can I plot a cumulative count of distinct values of mykey over time?
Following suggestions from "How do you chart a cumulative sum", I tried
| reverse
| streamstats dc(mykey) as DC_cumulative
| timechart max(DC_cumulative)
This gives me an "improved" result, meaning it gives a plateau that equals the total distinct count. But does this really do what I want? Since I'm not fully fluent in streamstats, this whole "stats" without _time makes me nervous.
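To see what this pipeline computes, here is a plain-Python sketch (not SPL) of the same logic, using a small hypothetical data set: after reverse puts events oldest-first, streamstats dc(mykey) is a running distinct count over events, and timechart max(DC_cumulative) takes the per-interval maximum of that running count.

```python
# Hypothetical events as (interval, mykey), already oldest-first
# as they would be after `| reverse`.
events = [
    (0, "a"), (0, "b"),
    (1, "b"), (1, "c"),
    (2, "a"), (2, "d"),
]

# streamstats dc(mykey) as DC_cumulative:
# running distinct count, updated event by event
seen = set()
running = []
for interval, key in events:
    seen.add(key)
    running.append((interval, len(seen)))

# timechart max(DC_cumulative):
# the maximum of the running count within each interval
per_interval = {}
for interval, dc_cum in running:
    per_interval[interval] = max(per_interval.get(interval, 0), dc_cum)

print(per_interval)  # {0: 2, 1: 3, 2: 4}
# The final plateau equals the total distinct count:
print(per_interval[2] == len({k for _, k in events}))  # True
```

So yes, this does produce a cumulative distinct count per interval, which is why the plateau matches the total DC.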
Try something like this
your base search | reverse | dedup mykey | timechart dc(mykey) as dc | streamstats sum(dc) as DC_cumulative
The dedup will keep only the first occurrence of mykey, so any overlap of mykey values gets eliminated. This might be expensive, since you're using both reverse and dedup.
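The reasoning above can be sketched in plain Python (not SPL) on the same kind of hypothetical data: once dedup keeps only each key's first occurrence, every remaining event in an interval is a key seen for the first time, so the per-interval distinct count is the count of new keys, and a running sum of those counts is the cumulative distinct count.

```python
# Hypothetical events as (interval, mykey), oldest-first as after `| reverse`.
events = [
    (0, "a"), (0, "b"),
    (1, "b"), (1, "c"),
    (2, "a"), (2, "d"),
]

# dedup mykey: keep only the first occurrence of each key
seen = set()
deduped = []
for interval, key in events:
    if key not in seen:
        seen.add(key)
        deduped.append((interval, key))

# timechart dc(mykey) as dc: distinct (all now first-seen) keys per interval
intervals = sorted({i for i, _ in events})
dc = {i: sum(1 for j, _ in deduped if j == i) for i in intervals}

# streamstats sum(dc) as DC_cumulative: running sum across intervals
total, cumulative = 0, {}
for i in intervals:
    total += dc[i]
    cumulative[i] = total

print(dc)          # {0: 2, 1: 1, 2: 1}
print(cumulative)  # {0: 2, 1: 3, 2: 4}
```

Note that the cumulative series matches what the streamstats-first recipe produces, while dc additionally reports how many new keys appear in each interval.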
As commented under @somesoni2's answer, running timechart before streamstats is more efficient than the original recipe. However, one non-obvious benefit of running streamstats before timechart (the original method) is that it allows a group-by clause, whereas the former doesn't. So, here is an alternative, nearly as efficient answer if you need to group by a field:
| dedup mykey | streamstats dc(mykey) as DC_cumulative by group_key | timechart max(DC_cumulative) by group_key
Furthermore, if you need to retain the side effect of obtaining an interval distinct count (which the other method has), you can do
| dedup mykey | streamstats dc(mykey) as DC_cumulative by group_key | timechart dc(mykey) max(DC_cumulative) by group_key
I initially thought that adding dedup would increase cost, while running timechart before streamstats would reduce the cost of streamstats. So I played these scenarios out against my original recipe over 5.5 million records: 155K cumulative distinct values, with 2K to 3K distinct values in each of 49 surveyed intervals. Amazingly, because dedup reduces the load on subsequent commands at such a duplication rate, adding dedup accelerates my original recipe, too. The recipes compared were:
| reverse | dedup mykey | timechart dc(mykey) as dc | streamstats sum(dc) as DC_cumulative
| dedup mykey | streamstats dc(mykey) as DC_cumulative | timechart max(DC_cumulative)
| streamstats dc(mykey) as DC_cumulative | timechart max(DC_cumulative)
The side effect of placing timechart before streamstats is an added series, dc. Not only can this be easily filtered out, but dc is also a useful metric that I previously had to go out of my way to add back, at even more cost.
Of course, the actual savings and cost will depend on data characteristics. The dataset in this comparison has an extreme duplication ratio, very different from my target data (chosen for the low cost of the raw search). But I believe that whenever duplication is significant, the savings are positive. Great job!