Splunk Search

How do I check for the existence of an event before indexing to avoid duplicate events?

andrewtrobec
Motivator

Hello,

I'm busy trying to find a way to ensure that duplicate records are not indexed. So far all I've managed to do is find a search that will remove duplicate values once they have been indexed (and consumed license):

index="index_name"
| eval eid=_cd
| search [ search index="index_name"
 | streamstats count by _raw
 | search count>1
 | eval eid=_cd
 | fields eid ]
| delete

Is there any way in transforms.conf or props.conf to check for the existence of an event before deciding to index?

Thank you and best regards,

Andrew

gcusello
SplunkTrust

Hi andrewtrobec,
No, there isn't any way to configure Splunk for this. Splunk already checks whether it has indexed a given file before (the fishbucket), but if the same event appears in two different files, it gets indexed twice!

The only way I can think of (though I haven't tried it!) is to use the SDK to run a search that checks whether an event exists before indexing it. As you can imagine, this would make ingestion very slow. In addition, what time period would the check search cover: one minute, one hour, or more? There's a high risk of overloading your system, so the extra license consumed by the duplicates probably costs less than this overhead!

Also, deleting redundant logs that way is dangerous because you risk deleting good events! It would probably be better to dedup results at search time. Remember that the "delete" command doesn't free any disk space, because it's a logical deletion, not a physical one!

I think you should first check how much extra license the duplicates actually consume, which probably isn't much. After that, you should try to re-engineer your inputs.
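
If you do re-engineer your inputs, one option is to drop exact duplicates in a script before the data ever reaches Splunk. Below is a minimal sketch in Python, assuming your events pass through a script you control as raw text lines. The function name `dedup_events` and the `max_remembered` window are hypothetical, not part of Splunk or its SDK, and I haven't tested this against a real input.

```python
import hashlib
from collections import OrderedDict
from typing import Iterable, Iterator


def dedup_events(events: Iterable[str], max_remembered: int = 100_000) -> Iterator[str]:
    """Yield each raw event once, dropping exact repeats seen recently.

    Hypothetical helper: only the last `max_remembered` distinct events
    are remembered, so a duplicate arriving later than that still passes.
    """
    seen: "OrderedDict[str, None]" = OrderedDict()
    for raw in events:
        # Hash the raw text so memory use does not depend on event size.
        digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a remembered event: skip it
        seen[digest] = None
        if len(seen) > max_remembered:
            seen.popitem(last=False)  # forget the oldest remembered hash
        yield raw


if __name__ == "__main__":
    sample = ["a=1", "b=2", "a=1"]
    for event in dedup_events(sample):
        print(event)
```

The bounded window is the same trade-off as the time period of a check search: a larger window catches more duplicates but uses more memory, and anything older than the window is indexed again.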

Bye.
Giuseppe

andrewtrobec
Motivator

@gcusello Thanks for the information, very useful. Is there a way to physically delete logically deleted events?

gcusello
SplunkTrust

Hi andrewtrobec,
No. To my knowledge, the only way to physically delete events from an index is the "splunk clean eventdata -index index_name" command, but that deletes the entire index.

Otherwise you have to wait for the retention period to expire!
For this reason the delete command isn't a good way to remove events. It's better to keep the events and dedup them at search time!

Bye.
Giuseppe
