Splunk Search

What is the best strategy for handling overlapping data?

Marinus
Communicator

Some sources will produce data that overlaps, i.e. you get some of the data you have already indexed. This can have quite a negative effect on search performance, especially if you have to dedup whole events. Is there a best practice for dealing with such a scenario?

highiqboy
Explorer

@cfrantsen - I think maverick is saying you can use Splunk's alerting feature, only instead of sending an email alert when the search runs and finds duplicates, you choose the last option in the scheduled saved search popup window and tell Splunk to add the dedup'd search results to a secondary index that you create on the index management page for exactly this purpose.
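The idea above can be sketched as a scheduled search ending in the `collect` command, which is what writes results into a summary index. The index and sourcetype names here are hypothetical; you would create `deduped_summary` yourself on the index management page first:

```
index=main sourcetype=my_source
| dedup _raw
| collect index=deduped_summary
```

Schedule this to run every few minutes over the most recent time window, and then point your ad-hoc searches at `index=deduped_summary` instead of `index=main`.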

0 Karma

cfrantsen
Explorer

Regarding option c, what is the best way to "save off the results into a new index"?

0 Karma

Marinus
Communicator

Thanks for the feedback, very useful. Let's assume that you can't control your source to remove duplicates. One of the ideas I've been toying with is to create a new search command, like dedup, that keeps one copy of each event and returns the rest so that you can delete them, i.e. * | dedup2 fieldx | delete. You could run that on a regular basis; I just don't know what it will actually do in the index, and whether over time it will actually help.
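For what it's worth, something close to the hypothetical dedup2 can be approximated with stock commands: `streamstats` can number repeated events so you keep only the duplicates (every occurrence after the first) and pipe those to `delete`. This is only a sketch and I haven't verified it at scale; note that `delete` just masks events from search results rather than reclaiming disk space, and it requires the can_delete role:

```
index=main
| streamstats count AS occurrence BY fieldx
| search occurrence > 1
| delete
```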

0 Karma

splunkedout
Explorer

my first thought was the same as maverick's option c above. this is what i would do if i was really constrained.

0 Karma

maverick
Splunk Employee

There are a couple of ways you can approach this type of scenario.

First is to address the root cause or source of the duplicate events and try to resolve that. Overlapping events, where one event is an exact replica of one or more other events generated elsewhere, are not typical, at least not in my experience. Perhaps you might post a separate question describing more details around that topic and we can help you resolve it.

Second is to assume that you cannot resolve the duplicate events issue and just optimize search performance in other ways, such as:

a) turning off auto key/value extraction

b) piping to the "fields" command and listing only the two or three fields you need in your search results (this avoids using auto key/value extraction by default)

c) setting up a scheduled saved search that dedups all of the events every few minutes in the background, saves the results (i.e. only the unique events) into a new index you create for this purpose, and then basing your actual ad-hoc searches on that new index instead of the main index.
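For option a), the usual knob is KV_MODE in props.conf on the search head. A minimal sketch, assuming a hypothetical sourcetype name:

```
# props.conf (sourcetype name is hypothetical)
[my_duplicating_sourcetype]
KV_MODE = none
```

This disables automatic key/value extraction at search time for that sourcetype, which is what options a) and b) are both trying to avoid paying for.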
