Solved: Repetitive queries of LARGE indexes

tlmayes · ‎07-27-2017

We have an environment that indexes approximately 600GB / day. I have been tasked with creating queries that correlate event hostname and IP information with static lookups to produce a dashboard of hosts/IPs that appear in the indexes. The static lookups will change weekly (or at some other interval) based on threats.

The challenge is that the indexes that must be crawled are extremely large, and each time I perform this crawl/query I must again crawl the entire history to ensure we correlate using the most recent lookup data. It is not practical to crawl this many events (~300 billion) on a regular basis, but we must also ensure the correlation is accurate.

I am interested in ideas on how to solve this and gain efficiencies. I am considering a way to identify the fields within each index that are considered interesting, capture these events at ingestion, and store that data in a summary index or KV Store (e.g. src_ip, name_query, dst_ip), with a pointer to the original event for ease of retrieval (or store the _raw field). My concern is taxing the UFs or HFs with additional work. The alternative might be to do the same, but with a query that runs across a day's worth of events.

Interested in how others address similar problems.

mattymo · ‎07-27-2017

Hi tlmayes!

The good news is many have ventured this path and there are a few really great options to achieve the efficiencies you are looking for!

Here is a quick list I would run through for optimizing this (I'll update if I find good how-tos or .conf talks on the matter):

1) tstats! tstats! tstats!

https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Tstats

https://answers.splunk.com/answers/186938/what-is-tstats-and-why-is-so-much-faster-than-stat.html

tstats at your ingestion rate may be all you need, but if you need even more optimization:

2) tstats to feed Summary indexing

You could schedule a tstats search that feeds a summary index. Check out the Meta Woot app, which is not only a great example of this but an overall great app: https://splunkbase.splunk.com/app/2949/

3) Accelerated Data models

http://docs.splunk.com/Documentation/Splunk/6.6.2/Knowledge/Aboutdatamodels

At the end of the day it's about letting the work be amortized in increments rather than done exactly when you need the results...
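To make the amortization idea concrete, here is a minimal Python sketch of the pattern (not Splunk code, and not how Meta Woot does it). It assumes each event is a dict with hypothetical `time`, `host`, and `src_ip` keys. Each day's events are reduced once to one row per distinct host/IP pair, and the weekly lookup is then matched against that compact summary instead of the raw history.

```python
def summarize_day(events):
    """Reduce one day's raw events to one row per distinct (host, src_ip) pair."""
    summary = {}
    for ev in events:
        key = (ev["host"], ev["src_ip"])
        row = summary.setdefault(
            key, {"count": 0, "first_seen": ev["time"], "last_seen": ev["time"]}
        )
        row["count"] += 1
        row["first_seen"] = min(row["first_seen"], ev["time"])
        row["last_seen"] = max(row["last_seen"], ev["time"])
    return summary


def merge_into(store, day_summary):
    """Fold a daily summary into the long-lived summary store (in place)."""
    for key, row in day_summary.items():
        if key not in store:
            store[key] = dict(row)
        else:
            kept = store[key]
            kept["count"] += row["count"]
            kept["first_seen"] = min(kept["first_seen"], row["first_seen"])
            kept["last_seen"] = max(kept["last_seen"], row["last_seen"])


def correlate(store, threat_ips):
    """Match the compact summary against the current lookup, not the raw events."""
    return sorted(
        (host, ip, row["count"])
        for (host, ip), row in store.items()
        if ip in threat_ips
    )


# Hypothetical sample data: two days of events, summarized incrementally.
store = {}
day1 = [
    {"time": 1, "host": "web01", "src_ip": "10.0.0.5"},
    {"time": 2, "host": "web01", "src_ip": "10.0.0.5"},
    {"time": 3, "host": "db01", "src_ip": "10.0.0.9"},
]
day2 = [{"time": 10, "host": "web01", "src_ip": "10.0.0.5"}]
merge_into(store, summarize_day(day1))
merge_into(store, summarize_day(day2))

# When the lookup changes, only the small summary is re-scanned.
hits = correlate(store, {"10.0.0.5"})
```

In Splunk terms, the daily reduction would roughly correspond to the scheduled search that feeds the summary index, and `correlate` to the dashboard search that runs against the summary with the latest lookup.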

- MattyMo

tlmayes · ‎07-27-2017

Thanks for the quick response. Meta Woot may provide great help on several fronts. Reading up on the use of Data Models and how this would help. Not much background but a great opportunity to implement.

Regarding the use of tstats, is this the right tool to use for the collection and storage (summary or KV) of strings (hostnames/IPs)? I suspect a deeper investigation of Meta Woot under the hood will answer this.

Agree with your final comment about amortizing, but I am struggling with the best way to get there without too much failure.

mattymo · ‎07-27-2017

tstats is definitely the first spell to cast. Once you have that in your toolbox, you have a few different avenues that will all work to further optimize. It uses index-time fields (host, sourcetype, source, etc.), is lightning fast on its own, and can be further enhanced by summarizing the results.
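As a rough illustration of why aggregating over index-time fields is cheaper than scanning raw events, here is a toy Python model. It is only an analogy, not how Splunk stores data; `ToyIndex` and its methods are hypothetical names. Counts per (host, sourcetype) are maintained at ingest, so a count-by-host query never has to touch the raw events.

```python
from collections import Counter


class ToyIndex:
    """Toy stand-in for an index that keeps metadata counts next to raw events."""

    def __init__(self):
        self.raw = []          # full events: expensive to scan at query time
        self.meta = Counter()  # (host, sourcetype) -> event count, kept at ingest

    def ingest(self, host, sourcetype, raw):
        self.raw.append((host, sourcetype, raw))
        self.meta[(host, sourcetype)] += 1

    def count_by_host_fast(self):
        """Answer from the small metadata table only."""
        out = Counter()
        for (host, _sourcetype), n in self.meta.items():
            out[host] += n
        return dict(out)

    def count_by_host_slow(self):
        """Answer by walking every raw event."""
        out = Counter()
        for host, _sourcetype, _raw in self.raw:
            out[host] += 1
        return dict(out)


# Hypothetical sample data.
idx = ToyIndex()
idx.ingest("web01", "access_combined", "GET /index.html 200")
idx.ingest("web01", "syslog", "sshd: session opened")
idx.ingest("db01", "syslog", "cron: job started")

fast = idx.count_by_host_fast()
slow = idx.count_by_host_slow()
```

Both calls return the same counts; the difference is that the fast path reads one entry per distinct (host, sourcetype) pair, while the slow path reads one entry per event.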

- MattyMo
