We have a few busy indexes that can only retain about 20 days' worth of logs. The corner-office types want unique user counts for the last month. The first thing that popped into my head was doing a daily stats count by user and collecting that into a summary index. But I'm seeing upwards of 2m unique users each day, so that collect will be pretty massive.
Is there a better way? Maybe dumping that stats command into a kvstore? At what point will collect fall over from too many lines of input?
Other considerations?
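For context, the daily roll-up I had in mind would look something like this (the summary index name is just a placeholder):

```
sourcetype=my_sourcetype earliest=-1d@d latest=@d
| stats count as daily_hits by userid
| collect index=user_summary
```

With ~2m userids per day, that's ~2m summary rows per run, which is what worries me.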
My solution: outputlookup
I use the userid as the _key, but I needed a way to isolate each month. E.g., November ended on a Thursday; I don't want the data polluted or overwritten with December data before analysts have a chance to check it. So I'm prepending every userid with the month:
sourcetype=my_sourcetype |
stats latest(_time) as timestamp by userid |
eventstats last(timestamp) as lasttime |
eval _key=strftime(lasttime,"%b")."-".userid |
fields _key timestamp |
outputlookup users_kv_lookup append=true
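For this to work, the KV Store collection behind users_kv_lookup has to be defined first. A minimal setup would look something like this (stanza names are assumptions to match the lookup name used above):

```
# collections.conf
[users_kv]
field.timestamp = time

# transforms.conf
[users_kv_lookup]
external_type = kvstore
collection = users_kv
fields_list = _key, timestamp
```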
Now with a simple inputlookup command I can get uniques for the month of November:
| inputlookup users_kv_lookup where _key="Nov-*" | stats dc(_key) as uniques
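Once analysts have signed off on a month, the stale keys could be cleared out so the collection doesn't grow without bound. A sketch of one way to do it (note this rewrites the whole lookup, so append must NOT be set here):

```
| inputlookup users_kv_lookup
| where NOT match(_key, "^Nov-")
| outputlookup users_kv_lookup
```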
I included the timestamp for... reasons. Maybe it will be useful to see the last login time. I thought having multiple rows per user was too much, though. We have 5m users logging in multiple times per day, each session with 10s to 100s of hits. That could get ugly.
EDIT - Dawg, do u even optimize?!
It seems like overwriting existing uids with the same info is time-consuming / resource-intensive. Why not only put the new ones into the collection? Okay then. Duuuh!
sourcetype=my_sourcetype |
stats latest(_time) as TS by userid |
eventstats last(TS) as lasttime |
eval _key=strftime(lasttime,"%b")."-".userid |
lookup users_kv_lookup _key |
where isnull(timestamp) |
fields _key TS | rename TS as timestamp |
outputlookup users_kv_lookup append=true
Userids that already have records are excluded by the where isnull() clause. The speedup was at least 2 orders of magnitude. Woot woot.
Do you need the count for each user, or the distinct count of users? If the latter, have you considered using estdc? If the unique user count can be approximated (1-2% error), give it a try!
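Something along these lines, if the whole window were searchable (field name assumed from the searches above):

```
sourcetype=my_sourcetype earliest=-30d@d
| stats estdc(userid) as approx_uniques
```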
Lastly, if you want to report on data that is older than your index retention, a summary index is a good choice. It might be easier to use the KV Store as you mention; just make sure you remember to add append=t to your searches, or you'll overwrite your old lookup!
I can't use estdc() because the data is not there. I need uniques for the whole month but only have about 19 days of retention. I'll have to run some tests to see how collect and/or the kvstore handles millions of records.