Knowledge Management

What are the best practices for indexing MASSIVE proxylogs?

DalJeanis
Legend

At my organization, we often need to research older information in massive proxylogs - about a billion records a day - within data that goes back six months. Yes, that much data.

I'm trying to find reasonable ways to accelerate finding the answer when we are asked an urgent question like "What users or processes have accessed THIS IP in the last N days?" or "What IPs has THIS USER accessed in the last N days?" or "What users or processes accessed THIS IP in THIS DATE RANGE?"

Now, any given event might have three or four different associated IP addresses, and might have two or more different users or process names associated with it. From the SQL world, I'd be creating a covering index; from here it looks like I'm asking for advice on whether to use a bloom filter or a summary index, whether there are other similar options, and what the overall characteristics of each solution might be.

At the moment, it seems to me that a summary index would work and would not duplicate too much information. Each unique user found in an event and each unique IP found in that event would be (in essence) mvexpanded into all combinations, and one summary record per day would be created for each combination. I'd also keep the full timestamp of the first record for each combo for that day, and probably the last one as well. I don't know whether I care about the event count, but it wouldn't take much space.
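Roughly, the collecting search might look like this sketch - assuming, purely for illustration, that the raw events live in index=proxy with multivalue user and dest_ip fields, and that the summary goes to an index I'm calling summary_proxy_pairs (all of those names are hypothetical):

    index=proxy sourcetype=proxylog earliest=-1d@d latest=@d
    | mvexpand user
    | mvexpand dest_ip
    | stats earliest(_time) AS first_seen latest(_time) AS last_seen count BY user dest_ip
    | collect index=summary_proxy_pairs

The two mvexpand calls blow each event out into every user/IP pairing, stats collapses those into one row per combination with the first time, last time, and event count, and collect writes the rows into the summary index. Scheduled once a day over the previous day, that gives one summary record per combo per day; if the scheduled search has the summary indexing action enabled instead, the trailing collect isn't needed.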

It may be that sub-day granularity would be more effective and not much more costly. Maybe I'd do time-and-space trials to see what span would be most appropriate (1d, 12h, 8h, 6h, 4h, 3h, 2h, 1h, etc.) - my gut tells me that the 12h-4h range would be pretty efficient, even if the span happened to break in the middle of the workday somewhere in the world. The collection process might first use "transaction" to aggregate the records and then bin the start times into 4h to 8h chunks to give the best tradeoff among granularity, access time, and index space.
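For those trials, the only thing that has to change is a bin on the time field ahead of the stats; span is the knob to turn. A sketch at 6h, using the same hypothetical names as above:

    index=proxy sourcetype=proxylog earliest=-1d@d latest=@d
    | bin span=6h _time AS time_bucket
    | mvexpand user
    | mvexpand dest_ip
    | stats earliest(_time) AS first_seen latest(_time) AS last_seen count BY time_bucket user dest_ip
    | eval _time=time_bucket
    | collect index=summary_proxy_pairs

Binning into a separate time_bucket field (instead of overwriting _time) keeps the true first/last timestamps available to earliest() and latest(), and the final eval anchors each summary row to its bucket so the summary itself stays searchable by time.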

The same question might also be asked of us in terms of hosts or URLs, but I'm thinking that a daily many-to-many crosstab from baseURL to IP address would provide enough traction to make use of the above index strategy.
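That crosstab could be collected the same way - a sketch, again with hypothetical names, assuming a url field that can be stripped down to its base and a second summary index for the pairs:

    index=proxy sourcetype=proxylog earliest=-1d@d latest=@d
    | eval base_url=replace(url, "^(?:https?://)?([^/:]+).*", "\1")
    | stats earliest(_time) AS first_seen latest(_time) AS last_seen count BY base_url dest_ip
    | collect index=summary_proxy_urls

One row per baseURL/IP pair per day stays tiny next to the raw volume, and the same first_seen/last_seen/count fields answer the host- and URL-flavored versions of the questions.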

That's my high-level musing about this.

So,
(1) What are the caveats if we instituted such an index?
(2) Am I trying to reinvent the wheel, here?
(3) What else am I missing?

1 Solution

woodcock
Esteemed Legend

I agree that a summary index is the way to go, and I agree with your thinking on how to go about it; it should be very straightforward. I would use stats instead of sistats because the output is smaller and you do not need any complex aggregations like avg.
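Because plain stats writes ordinary fields into the summary (rather than the psrsvd_* partial-result fields that sistats emits), answering the original questions is just another stats over the summary index. A sketch, reusing the hypothetical names from the question, for "what users accessed THIS IP in the last 30 days":

    index=summary_proxy_pairs dest_ip="203.0.113.7" earliest=-30d@d
    | stats min(first_seen) AS first_seen max(last_seen) AS last_seen sum(count) AS events BY user
    | convert ctime(first_seen) ctime(last_seen)

This only touches the small summary rows, so it stays fast no matter how many raw proxy events sit behind them; swapping the filter to a user answers the reverse question the same way.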
