Splunk Search

What's the best way to get unique counts from long-term data?

twinspop
Influencer

We have a few busy indexes that can only retain about 20 days' worth of logs. The corner-office types want unique user counts for the last month. The first thing that popped into my head was doing a daily stats count by user and collecting that into a summary index. But I'm seeing upwards of 2m unique users each day, so that collect will be pretty massive.
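
Roughly what I had in mind, as a sketch (the summary index name user_summary is made up):

sourcetype=my_sourcetype earliest=-1d@d latest=@d | 
stats count by userid | 
collect index=user_summary

And then the monthly report would be something like:

index=user_summary | stats dc(userid) as uniques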

Is there a better way? Maybe dumping that stats command into a kvstore?

At what point will collect fall over from too many lines of input?

Other considerations?

1 Solution

twinspop
Influencer

My solution:

  • Created a collection (KV store) with one defined field, timestamp
  • Created the lookup for that collection (rough config sketch below)
  • Created a saved search that runs every 4 hours...
  • ... and sends the results to outputlookup
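
For reference, the collection and lookup definitions look roughly like this (the collection name users_kv is a placeholder; the lookup name users_kv_lookup is what the searches below rely on):

# collections.conf
[users_kv]
field.timestamp = time

# transforms.conf
[users_kv_lookup]
external_type = kvstore
collection = users_kv
fields_list = _key, timestamp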

I use the userid as the _key, but I needed a way to isolate each month. E.g., November ended on a Thursday; I don't want that data polluted or overwritten with December data before analysts have had a chance to check it. So I'm prepending every userid with the month:

sourcetype=my_sourcetype | 
stats latest(_time) as timestamp by userid | 
eventstats last(timestamp) as lasttime | 
eval _key=strftime(lasttime,"%b")."-".userid | 
fields _key timestamp | 
outputlookup users_kv_lookup append=true

Now with a simple inputlookup command I can get uniques for the month of November:

| inputlookup users_kv_lookup where _key="Nov-*" | stats dc(_key) as uniques

I included the timestamp for... reasons. Maybe it will be useful to see the last login time. Keeping multiple records per user seemed like too much, though. We have 5m users logging in multiple times per day, each session with 10s-100s of hits. That could get ugly.
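
For example, something along these lines (just a sketch) would surface last-seen times for November:

| inputlookup users_kv_lookup where _key="Nov-*" | 
eval userid=replace(_key,"^Nov-",""), last_seen=strftime(timestamp,"%Y-%m-%d %H:%M:%S") | 
table userid last_seen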

EDIT - Dawg, do u even optimize?!

It seems like overwriting existing uids with the same info is time-consuming / resource-intensive. Why not only put the new ones into the collection? Okay then. Duuuh!

sourcetype=my_sourcetype | 
stats latest(_time) as TS by userid | 
eventstats last(TS) as lasttime | 
eval _key=strftime(lasttime,"%b")."-".userid | 
lookup users_kv_lookup _key | 
where isnull(timestamp) |
fields _key TS | rename TS as timestamp | 
outputlookup users_kv_lookup append=true

Userids that already have records are excluded by the where isnull() clause. The speedup was at least 2 orders of magnitude. Woot woot.


aljohnson_splun
Splunk Employee

Do you need the count for each user - or the distinct count of users?

If the latter, have you considered using estdc? If the unique user count can be approximated (1-2% error), give it a try!
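
Something like this (using the userid field from your searches):

sourcetype=my_sourcetype | stats estdc(userid) as approx_uniques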

Lastly, if you want to report on data that is older than your index retention, a summary index is a good choice. It might be easier to use KVStore as you mention; just make sure you remember to add append=t to your searches, or you'll overwrite your old lookup!


twinspop
Influencer

I can't use estdc() because the data is not there. I need uniques for the whole month, but I only have about 19 days of retention. I'll have to run some tests to see how collect and/or the kvstore handle millions of records.
