I'm using streamstats to pair up events by username so that timestamps, IPs, latitudes, and longitudes can be analyzed for land-speed violations as a possible indicator of account compromise. However, I'm running into an issue: this works perfectly on small datasets (thousands of events), and even on some large datasets (millions of events), but when the number of users represented in the event set climbs into the hundreds of thousands, streamstats
seems to be dropping events. I have a few conjectures, but I would like to understand what specific limitation it is running into, and the best way to work around it.
The relevant portion of the full query is as follows:
| streamstats current=t global=f window=2
    earliest(client_ip) as client_ip_1 latest(client_ip) as client_ip_2
    earliest(_time) as time_1 latest(_time) as time_2
    earliest(timestamp) as timestamp_1 latest(timestamp) as timestamp_2
    earliest(latlon) as latlon_1 latest(latlon) as latlon_2
    by username
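In case it helps, the pairing behavior I'm relying on can be modeled roughly like this in Python (a sketch, not Splunk itself; field names mirror the query above, and events are assumed to arrive in time order):

```python
from collections import defaultdict, deque

def pair_logins(events):
    """Rough model of `streamstats current=t window=2 ... by username`:
    for each event, report the earliest and latest values within that
    user's 2-event sliding window (previous login vs. current login)."""
    windows = defaultdict(lambda: deque(maxlen=2))  # per-user 2-event window
    pairs = []
    for ev in events:  # assumed sorted oldest-first, as streamstats sees them
        w = windows[ev["username"]]
        w.append(ev)
        first, last = w[0], w[-1]  # earliest()/latest() over the window
        pairs.append({
            "username": ev["username"],
            "time_1": first["_time"], "time_2": last["_time"],
            "client_ip_1": first["client_ip"], "client_ip_2": last["client_ip"],
        })
    return pairs

events = [
    {"username": "alice", "_time": 100, "client_ip": "1.1.1.1"},
    {"username": "bob",   "_time": 150, "client_ip": "2.2.2.2"},
    {"username": "alice", "_time": 200, "client_ip": "3.3.3.3"},
]
for p in pair_logins(events):
    print(p)
```

Note that one such window has to be kept alive per distinct username, which is presumably where the per-group memory cost comes from.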
I collected the following data, which may be of some use:
+ Can handle 100,000 events and 100,000 users
+ Can handle 200,000 events and 50,000 users
+ Can handle 500,000 events and 50,000 users
+ Can handle 1,000,000 events and 50,000 users
+ Can handle 3,000,000 events and 50,000 users
+ CANNOT handle 200,000 events and 150,000 users
The above leads me to believe that the number of groups the "by" clause forces streamstats to split the event set into is the biggest factor. However, any more concrete information you can provide would be greatly appreciated.
My suggestion would be as follows:
1) Create an hourly summary-index search that runs every hour and collects the latest login/location info for each user into the summary index. The search could look like this:
<your base search, selecting data for the last hour, i.e. earliest=-1h@h latest=@h>
| dedup username
| table _time username client_ip timestamp latlon
2) Run this search on the cron schedule 7 * * * * (every hour at the 7th minute) and save the results to a summary index.
3) Assuming you're OK with a minimum detection delay of 1 hour and 7 minutes, you can use the following query:
index=YourSummaryIndex source=YourSummaryIndexSearchName earliest=-2h@h latest=now
| stats earliest(client_ip) as client_ip_1 latest(client_ip) as client_ip_2
    earliest(_time) as time_1 latest(_time) as time_2
    earliest(timestamp) as timestamp_1 latest(timestamp) as timestamp_2
    earliest(latlon) as latlon_1 latest(latlon) as latlon_2
    by username
| <further calculation based on the _1 and _2 fields>
This way you're always looking at only two hours' worth of data, which is at most two events per user.
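For the "further calculation based on the _1 and _2 fields" step, the speed check could look something like this Python sketch (outside of Splunk; the haversine formula is standard, but the 900 km/h threshold and the function names are illustrative assumptions, not part of the original query):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def is_land_speed_violation(lat1, lon1, t1, lat2, lon2, t2, max_kmh=900.0):
    """Flag a pair of logins whose implied travel speed exceeds max_kmh
    (roughly airliner cruise speed; the threshold is an assumption)."""
    hours = abs(t2 - t1) / 3600.0
    if hours == 0:
        # Simultaneous logins from two different places are suspicious too.
        return (lat1, lon1) != (lat2, lon2)
    return haversine_km(lat1, lon1, lat2, lon2) / hours > max_kmh

# New York to London in one hour implies roughly 5,500+ km/h: a violation.
print(is_land_speed_violation(40.7, -74.0, 0, 51.5, -0.1, 3600))  # True
```

In SPL the same arithmetic would be done with eval after splitting latlon_1/latlon_2 into their latitude and longitude components.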
Many limits can be changed in (drumroll) limits.conf, but significant changes like adding a few zeroes are usually a bad idea.
Regarding the actual question, the big issue is high-cardinality bucketing over a large span of events - or, in English, keeping lots and lots of things in memory.
If you're still looking for a solution and - ideally - have a dummy AWS/Splunk Cloud instance to play with... 😄
I briefly considered a similar solution, but the problem with this approach is that it can only compare two logins per user, per hour. The algorithm cannot afford to have any data gaps/holes like that. Every login per user needs to be compared with each adjacent login, regardless of when they occur (could be many in one hour, or they could be separated by multiple hours).
Streamstats was the only method I've found for comparing every single pair of logins by user. The only other thing I've thought of is scheduling multiple hourly scans, each running the same algorithm but for a different partition of the user base (i.e., one scan looking at usernames starting with a-b, the next c-d, etc.). The obvious problem with that is that it's clunky: who wants to manage 13 different hourly reports?
To add to that, it would be good to know if the memory limit for these intensive commands can be increased. Does anyone know if that is possible and/or where those limits are defined?
The streamstats (and join and transaction) commands are memory-intensive and have certain limits on the resources available to them (I'm not sure of the exact values). If the amount of resources a search requires exceeds the limit, it drops results without showing any error. These commands should be used with smaller data sets only, for reliable results. Could you explain more about your requirement and current search, and see whether you could reduce the amount of data to be processed?
Thanks. The scenario is as follows: there are about 700,000 users, each of whom generates about 8 login events per day. The login events associated with each user account need to be paired off, geolocated, and checked for land-speed violations. Since two events that constitute a single violation for one user account could occur up to 18 hours apart, it's difficult to look at smaller partitions of the event set without losing the ability to detect violations.
Currently, I've been using a KV Store to cache the most recent event for each user that is more than an hour old, and running an hourly scan that uses the above streamstats call to examine all those cached events concurrently with the last hour's worth of activity. This avoids needing to look at an entire 18 hours of history, but it still doesn't solve the problem at proper scale.
That approach works like a charm until the user base grows into the hundreds of thousands. At that point, caching one event per user means the KV Store will always be contributing hundreds of thousands of events to every search (even if it were run every 15 minutes instead of every hour), and when streamstats needs to split the dataset into that many groups, it's definitely hitting memory limits.
Do you have any thoughts on how to reduce the event set before the streamstats call? Or make the whole process more scalable? It would be a shame if Splunk just isn't the right tool for this kind of analysis.
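For what it's worth, the per-batch logic I'm effectively trying to scale can be sketched outside of Splunk like this (a toy model of the KV Store approach, not my actual search; each new login is paired only with that user's cached previous login, so per-batch memory stays at one cached event per user):

```python
def scan_batch(cache, new_events):
    """Pair each new login with the same user's previous login using a
    last-event-per-user cache (standing in for the KV Store), instead of
    asking `streamstats ... by username` to group the whole event set."""
    pairs = []
    for ev in sorted(new_events, key=lambda e: e["_time"]):  # oldest first
        prev = cache.get(ev["username"])
        if prev is not None:
            pairs.append((prev, ev))   # adjacent-login pair to speed-check
        cache[ev["username"]] = ev     # newest event becomes the cached one
    return pairs

# One cached login from an earlier scan, plus this hour's batch.
cache = {"alice": {"username": "alice", "_time": 100, "latlon": "40.7,-74.0"}}
batch = [
    {"username": "alice", "_time": 200, "latlon": "51.5,-0.1"},
    {"username": "bob",   "_time": 250, "latlon": "35.7,139.7"},
]
pairs = scan_batch(cache, batch)
print(len(pairs))  # 1 - only alice has a previous login to pair with
```

The hard part is that in Splunk the equivalent of this loop still ends up as one big grouped-by-username pass over cache plus batch, which is exactly where the limits bite.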