Splunk Search

What is the best strategy for handling overlapping data?

Marinus
Communicator

Some sources will produce data that overlaps, i.e. you get some of the data you have already indexed. This can have quite a negative effect on search performance, especially if you have to dedup whole events. Is there a best practice for dealing with such a scenario?

highiqboy
Explorer

@cfrantsen - I think maverick is saying you can use Splunk's alerting feature, only instead of sending an email alert when the search runs and finds duplicates, you choose the last option in the scheduled saved search popup window and tell Splunk to add the dedup'd search results to a secondary index that you create on the index management page for exactly this purpose.
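The idea above can be sketched as a scheduled search ending in the `collect` command, which is what writes results into a summary index. The index and sourcetype names here are hypothetical; you would create `deduped_summary` yourself on the index management page first:

```
index=main sourcetype=my_source
| dedup _raw
| collect index=deduped_summary
```

Schedule this to run every few minutes over the most recent time window, and then point your ad-hoc searches at `index=deduped_summary` instead of `index=main`.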

0 Karma

cfrantsen
Explorer

Regarding option c, what is the best way to "save off the results into a new index"?

0 Karma

Marinus
Communicator

Thanks for the feedback, very useful. Let's assume that you can't control your source to remove duplicates. One of the ideas I've been toying with is to create a new search command, like dedup, that keeps one copy of each event and returns the rest so that you can delete them, i.e. * | dedup2 fieldx | delete. You could run that on a regular basis; I just don't know what it will actually do in the index, and whether over time it will actually help.
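For what it's worth, something close to the hypothetical dedup2 can be approximated with stock commands: `streamstats` can number repeated events so you keep only the duplicates (every occurrence after the first) and pipe those to `delete`. This is only a sketch and I haven't verified it at scale; note that `delete` just masks events from search results rather than reclaiming disk space, and it requires the can_delete role:

```
index=main
| streamstats count AS occurrence BY fieldx
| search occurrence > 1
| delete
```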

0 Karma

splunkedout
Explorer

my first thought was the same as maverick's option c above. this is what i would do if i was really constrained.

0 Karma

maverick
Splunk Employee

There are a couple of ways you can approach this type of scenario.

First is to address the root cause or source of the duplicate events and try to resolve that. Overlapping events, where one event is an exact replica of one or more other events generated elsewhere, are not typical, at least not in my experience. Perhaps you might post a separate question describing more details around that topic and we can help you resolve it.

Second is to assume that you cannot resolve the duplicate events issue and just optimize search performance in other ways, such as:

a) turning off auto key/value extraction

b) piping to the "fields" command and listing only the two or three fields you need in your search results (this avoids using auto key/value extraction by default)

c) setting up a scheduled saved search that dedups all of the events every few minutes in the background, saves the results (i.e. only the unique events) into a new index you create for this purpose, and then basing your actual ad-hoc searches on that new index instead of the main index.
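For option a), the usual knob is KV_MODE in props.conf on the search head. A minimal sketch, assuming a hypothetical sourcetype name:

```
# props.conf (sourcetype name is hypothetical)
[my_duplicating_sourcetype]
KV_MODE = none
```

This disables automatic key/value extraction at search time for that sourcetype, which is what options a) and b) are both trying to avoid paying for.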
