I am looking for a way to do statistical sampling so I can get rapid insight from millions of events.
An example: we have login events from a specific sourcetype, with millions of logins per day. I want to look at how often users log in over a 7-day period (what % of users logged in once, what % twice, what % three times, etc.). Right now this is a very long query to run, even as a scheduled search.
< search string > | stats count as logins by userID | stats count as users by logins
This gives me a chart of how many users fall into each login-count bucket (1, 2, 3, 4, 5, etc.), but it takes a very long time to run. Any suggestions on how to take a statistical sample would be appreciated; one idea I've been toying with is sketched below.
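Specifically (and I'm not sure this is statistically sound), the thought is to sample users rather than raw events, since dropping individual events would undercount each user's logins. Assuming the md5(), substr(), and tonumber() eval functions are available in your version, hashing userID and keeping only userIDs whose hash starts with one particular hex digit should retain all events for roughly 1/16 of users:

< search string >
| where tonumber(substr(md5(userID), 1, 1), 16) = 0
| stats count as logins by userID
| stats count as users by logins
| eventstats sum(users) as total_users
| eval pct_of_users = round(100 * users / total_users, 2)

The percentages should be unbiased estimates, while the absolute user counts would need to be scaled up by 16. Is that a reasonable way to sample, or is there a better approach?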
<< EDIT >>
The suggestion has been to use summary indexing for this. How would you build a summary index to capture this kind of information? If, for example, we have 2 million unique users per day who log in about 5 million times, shrinking 5 million daily events down to 2 million daily summary events (stats count by userID) is the only approach I can think of that still lets me check for repeat visits on subsequent days.
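For concreteness, roughly what I mean (the summary index name summary_logins is just for illustration) is a daily scheduled search that rolls the raw logins up to one event per user per day:

< search string > earliest=-1d@d latest=@d
| stats count as daily_logins by userID
| collect index=summary_logins

and then a 7-day report that runs over the summary index instead of the raw events:

index=summary_logins earliest=-7d@d latest=@d
| stats sum(daily_logins) as logins by userID
| stats count as users by logins

(Presumably the same thing could be done by enabling summary indexing on the saved search itself rather than calling collect explicitly.)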
Even so, each run of the report still has to scan 14+ million events in the summary index (2 million per day times 7 days), which, it would seem, defeats the purpose of a summary index.
Any tips on how to optimize this kind of search?