With dc(mykey) as DC1, I can plot how many distinct values of mykey occur in each fixed time span. If values of mykey never repeated over time, accum DC1 as DC_accum would give me a cumulative count of distinct values of mykey over time. But that would be too trivial. In most practical cases, mykey values partially repeat. How can I plot a cumulative count of distinct values of mykey over time?
Following suggestions from "How do you chart a cumulative sum", I tried
| reverse
| streamstats dc(mykey) as DC_cumulative
| timechart max(DC_cumulative)
This gives me an "improved" result, meaning it gives a plateau that equals the total distinct count. But does this really do what I want? Since I'm not fully fluent in streamstats, this whole "stats" without _time makes me nervous.
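To see what this pipeline computes, here is a plain-Python sketch (not SPL) of the same logic, using a small hypothetical data set: after reverse puts events oldest-first, streamstats dc(mykey) is a running distinct count over events, and timechart max(DC_cumulative) takes the per-interval maximum of that running count.

```python
# Hypothetical events as (interval, mykey), already oldest-first
# as they would be after `| reverse`.
events = [
    (0, "a"), (0, "b"),
    (1, "b"), (1, "c"),
    (2, "a"), (2, "d"),
]

# streamstats dc(mykey) as DC_cumulative:
# running distinct count, updated event by event
seen = set()
running = []
for interval, key in events:
    seen.add(key)
    running.append((interval, len(seen)))

# timechart max(DC_cumulative):
# the maximum of the running count within each interval
per_interval = {}
for interval, dc_cum in running:
    per_interval[interval] = max(per_interval.get(interval, 0), dc_cum)

print(per_interval)  # {0: 2, 1: 3, 2: 4}
# The final plateau equals the total distinct count:
print(per_interval[2] == len({k for _, k in events}))  # True
```

So yes, this does produce a cumulative distinct count per interval, which is why the plateau matches the total DC.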
Try something like this
your base search | reverse | dedup mykey | timechart dc(mykey) as dc | streamstats sum(dc) as DC_cumulative
The dedup will keep only the first occurrence of mykey, so any overlap of mykey values gets eliminated. This might be expensive, since you're using both reverse and dedup.
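The reasoning above can be sketched in plain Python (not SPL) on the same kind of hypothetical data: once dedup keeps only each key's first occurrence, every remaining event in an interval is a key seen for the first time, so the per-interval distinct count is the count of new keys, and a running sum of those counts is the cumulative distinct count.

```python
# Hypothetical events as (interval, mykey), oldest-first as after `| reverse`.
events = [
    (0, "a"), (0, "b"),
    (1, "b"), (1, "c"),
    (2, "a"), (2, "d"),
]

# dedup mykey: keep only the first occurrence of each key
seen = set()
deduped = []
for interval, key in events:
    if key not in seen:
        seen.add(key)
        deduped.append((interval, key))

# timechart dc(mykey) as dc: distinct (all now first-seen) keys per interval
intervals = sorted({i for i, _ in events})
dc = {i: sum(1 for j, _ in deduped if j == i) for i in intervals}

# streamstats sum(dc) as DC_cumulative: running sum across intervals
total, cumulative = 0, {}
for i in intervals:
    total += dc[i]
    cumulative[i] = total

print(dc)          # {0: 2, 1: 1, 2: 1}
print(cumulative)  # {0: 2, 1: 3, 2: 4}
```

Note that the cumulative series matches what the streamstats-first recipe produces, while dc additionally reports how many new keys appear in each interval.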
As commented under @somesoni2's answer, running timechart before streamstats is more efficient than the original recipe. However, one non-obvious benefit of running streamstats before timechart (the original method) is that it allows a group-by clause, whereas the former doesn't. So, here is an alternative, nearly as efficient answer if you need to group by a field:
| dedup mykey | streamstats dc(mykey) as DC_cumulative by group_key | timechart max(DC_cumulative) by group_key
Furthermore, if you need to retain the side effect of obtaining an interval distinct count (which the other method has), you can do
| dedup mykey | streamstats dc(mykey) as DC_cumulative by group_key | timechart dc(mykey) max(DC_cumulative) by group_key
I initially thought that adding dedup would increase cost, while running timechart before streamstats would reduce the cost of streamstats. So I played these scenarios out against my original recipe over 5.5 million records: 155K cumulative distinct values, with 2K to 3K distinct values in each of 49 surveyed intervals. Amazingly, because dedup reduces the load on subsequent commands at such a duplication rate, adding dedup accelerates my original recipe, too. The recipes compared were:
| reverse | dedup mykey | timechart dc(mykey) as dc | streamstats sum(dc) as DC_cumulative
| dedup mykey | streamstats dc(mykey) as DC_cumulative | timechart max(DC_cumulative)
| streamstats dc(mykey) as DC_cumulative | timechart max(DC_cumulative)
The side effect of placing timechart before streamstats is an added series, dc. Not only can this be easily filtered out, but dc is also a useful metric that I previously had to go out of my way to add back, at even more cost.
Of course, the actual savings and cost will depend on data characteristics. The dataset in this comparison has an extreme duplication ratio, very different from my target data (chosen for the low cost of the raw search). But I believe that whenever duplication is significant, the savings are positive. Great job!