I've got some data in an index that has a retention time that is intentionally short, but some of the data in that index is of higher value and I want to retain it for a longer period. I've been looking at setting up a scheduled search that uses 'collect', but I don't see a mechanism to run a scheduled search such that there's a high level of fidelity in the data - no duplicates and no holes. Since this data is more valuable we want to make sure we get it all!
Is there a simple mechanism to do such a thing? I'm thinking I want to make the base search reach far enough back in time to not miss any data that has shown up since the last run, then deduplicate against the existing data in the target index (which might be complicated without _raw) and then 'collect' whatever is left into the target.
Now that I clarified the question, it occurred to me that the solution is simple: the subsearch needs to return a value for earliest.
This seems to do what I want:
[search index=target-index | head 1 | eval _time=_time + 0.001 | stats latest(_time) as earliest] latest=-1m@m index=source-index | collect index=target-index addtime=true
I had an issue where I'd get a single line of duplication every time it runs, since the event returned by the subsearch is included in the collect-ed search. Adding a bit of time seems to do the trick.
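To run this on a schedule, here's a savedsearches.conf sketch (the stanza name, cron interval, and dispatch window are my assumptions, not from the thread; the inline earliest supplied by the subsearch overrides the dispatch window):

```
[copy_high_value_events]
search = [search index=target-index | head 1 | eval _time=_time + 0.001 \
  | stats latest(_time) as earliest] latest=-1m@m index=source-index \
  | collect index=target-index addtime=true
cron_schedule = */1 * * * *
dispatch.earliest_time = -24h@m
dispatch.latest_time = now
enableSched = 1
```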
This suffers a bit when an indexer restarts. Hmm. It really needs to run with _index_earliest. No idea how to pass that as of yet.
The only downside here is that I was using _index_earliest, since that gives me some certainty about catching events that are delayed in reaching the indexers, for whatever reason. Since the latest event in the target index may or may not have been indexed at a time close to its _time, there's some slop there. Also, I can't seem to pass _index_earliest from a subsearch, although I can pass earliest. So there are some edge conditions where I might miss some events, but it should be pretty darn close. It might even be good enough.
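One untested way to approximate the _index_earliest behavior: stamp each copied event with its source _indextime and filter on that with where, since earliest/latest only constrain _time but a subsearch can still inject a literal value via return. The field name orig_indextime is my own invention, and the first run would need seeding by hand, since an empty subsearch here expands to invalid syntax:

```
index=source-index earliest=-24h
| eval orig_indextime=_indextime
| where orig_indextime > [search index=target-index earliest=-48h
      | stats max(orig_indextime) as t | return $t]
| collect index=target-index addtime=true
```

Summary-indexed (stash) events keep their key=value fields, so orig_indextime should remain searchable in the target index.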
So, time is still a complicated problem. I've got a saved search that runs once a minute that does something like:
_index_earliest=-2m@m _index_latest=-1m@m | collect index=target-index addtime=true
This works great as long as the search head that runs the search is up when the search is scheduled to run. If it's not (due to a restart, for example), there's a gap. What I really want is to be able to say something like:
earliest=[search index=target-index | head 1 | fields _time]
...but that's not valid syntax, of course. Still not really sure how to dynamically insert a time in the 'earliest' term that corresponds to the last entry in the target.
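If the concern is gaps after a restart, Splunk ships a backfill helper, fill_summary_index.py, for re-running a scheduled summary search over missed time ranges. A hedged sketch (app, search name, window, and credentials are all placeholders; as I understand it, -dedup true skips spans that already have summary data, so re-running is safe):

```
./splunk cmd python fill_summary_index.py -app search -name "my_collect_search" \
    -et -2d@d -lt @d -j 4 -dedup true -auth admin:changeme
```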
If you want to filter out some "noise" in one index and only keep the important stuff in a separate index (which also increases reporting speed), a summary index is perfect for this
https://wiki.splunk.com/Community:Summary_Indexing
*Summary indexes do NOT count against your license
So, is the standard practice to just wait a while (say, a day), and then do something like
search earliest=-2d@d latest=-1d@d | collect index=target-index
That seems a bit haphazard. I guess I'm looking for more insurance that I get exactly what I want without any possibility of a data problem, and collect doesn't really do any checking - it just moves data to a summary index of your choice.
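Since collect does no checking, one rough audit (a sketch; the span and window are arbitrary) is to look for empty minutes in the target index:

```
index=target-index earliest=-24h@m latest=@m
| timechart span=1m count
| where count = 0
```

timechart fills empty buckets with zeros, so any rows returned are candidate holes. A similar search ending in | stats count by _raw | where count > 1 would surface duplicates.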
It's worth noting that collect does NOT count against your license UNLESS you change the sourcetype of your data from the default (stash, I think), in which case it DOES count against your license. I think that's a bug, but I've verified with support that this is how it presently works.
Why not specify EXACTLY what you want to summarize? This way you will not miss anything
Example, say you have data from a single source /etc/xxx/logs/12_1_2016.log
and you want to use a stats command by a certain field. If you have a lot of "noise", it may take a while to filter out the noise and only return what you're looking for.
Your populating search will look like:
index=foo source="/etc/xxx/logs/12_1_2016.log" | stats count by FIELD
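Taking that one step further, the populating search can pipe straight into the summary index (summary_foo is a placeholder index name):

```
index=foo source="/etc/xxx/logs/12_1_2016.log"
| stats count by FIELD
| collect index=summary_foo addtime=true
```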
If you wanted the "insurance" of knowing you got everything, then why not just run the search first and verify you got everything as expected? Splunk is a great tool and does exactly what you ask it to do. If your query is not correct then obviously you will miss some data.
Also, you can run the populating search every 5 minutes or every day if you wanted to.
Lastly, why do you keep using the collect command? Just run a search which returns the results as expected and summarize it into a new index to increase reporting speed