Events in Index are Getting Duplicated Even Though They're Exactly The Same

bmw417
New Member

I've been reading the docs and other questions, and from what I can tell, Splunk is supposed to take an MD5 hash of every incoming event and, if that hash matches an event already in the index, drop the event instead of duplicating it. However, I'm seeing the exact opposite, and it's very important for my project not to spend extra resources on unnecessary work such as reindexing the exact same events over and over. I've included a screenshot of what I'm talking about: I took an md5 of the incoming _raw field on the second run of the same Go program, which sends events over TCP to my index cve. As you can see, the hashes of the duplicated events are exactly the same, yet the events are still duplicated. Any help is appreciated.

Thanks!

richgalloway
SplunkTrust

You've been misinformed. Splunk does very little to prevent duplicate events. File monitors track their position in each file to avoid re-reading the same data, and a CRC is calculated on the first and last sections of the file to detect whether it has changed, but there is nothing like what you describe. Splunk does not hash each event, and it certainly does not search all existing events for a matching hash before indexing new data. It's up to the user to handle duplicates, either at index time or at search time.

If you'll share how this data is onboarded we can offer ways to avoid the duplicates.

---
If this reply helps you, Karma would be appreciated.

bmw417
New Member

My data is added via a TCP port (9400 for this index). A Go program streams the data in, with a single newline marking the end of each event. The data is collected from the MITRE CVE GitHub page by a web scraper/crawler that gathers the link to the CVE*.json file, the title of the file, the time the CVE was reported, and then the actual contents of the json file. All four fields are comma delimited, and the first three (called link, title, and contenttime respectively) have field extractions set. For example, the first CVE ever in their database (CVE-1999-0001.json) will output a stream that looks like this:

https://raw.githubusercontent.com/CVEProject/cvelist/master/1999/0xxx/CVE-1999-0001.json, CVE-1999-0001.json, 30 Dec 99 00:00 -0600, { "CVE_data_meta": { "ASSIGNER": "cve@mitre.org", "ID": "CVE-1999-0001", "STATE": "PUBLIC" }, "affects": { "vendor": { "vendor_data": [ { "product": { "product_data": [ { "product_name": "n/a", "version": { "version_data": [ { "version_value": "n/a" } ] } } ] }, "vendor_name": "n/a" } ] } }, "data_format": "MITRE", "data_type": "CVE", "data_version": "4.0", "description": { "description_data": [ { "lang": "eng", "value": "ip_input.c in BSD-derived TCP/IP implementations allows remote attackers to cause a denial of service (crash or hang) via crafted packets." } ] }, "problemtype": { "problemtype_data": [ { "description": [ { "lang": "eng", "value": "n/a" } ] } ] }, "references": { "reference_data": [ { "name": "http://www.openbsd.org/errata23.html#tcpfix", "refsource": "CONFIRM", "url": "http://www.openbsd.org/errata23.html#tcpfix" }, { "name": "5707", "refsource": "OSVDB", "url": "http://www.osvdb.org/5707" } ] }}
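For illustration, here is a minimal Go sketch of how one such event line might be assembled. The actual program is not shown in this thread, so the function name formatEvent and its parameters are hypothetical. It uses json.Compact to strip newlines and other insignificant whitespace from the JSON, so the only newline in the output is the one that ends the event. Note that this produces tighter spacing inside the JSON than the sample above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"
)

// formatEvent is a hypothetical helper: it joins link, title, and
// contentTime with the JSON body into one comma-delimited line.
// json.Compact removes insignificant whitespace (including newlines)
// from the JSON, so the trailing "\n" is the only newline in the event.
func formatEvent(link, title, contentTime string, rawJSON []byte) (string, error) {
	var body bytes.Buffer
	if err := json.Compact(&body, rawJSON); err != nil {
		return "", fmt.Errorf("compact JSON: %w", err)
	}
	return fmt.Sprintf("%s, %s, %s, %s\n", link, title, contentTime, body.String()), nil
}

func main() {
	// Shortened stand-in for the contents of a CVE*.json file.
	raw := []byte("{\n  \"data_type\": \"CVE\",\n  \"data_version\": \"4.0\"\n}")

	line, err := formatEvent(
		"https://raw.githubusercontent.com/CVEProject/cvelist/master/1999/0xxx/CVE-1999-0001.json",
		"CVE-1999-0001.json",
		"30 Dec 99 00:00 -0600",
		raw,
	)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(line)
}
```

Building the line in one place like this keeps the bytes of a given CVE identical from run to run, which matters if events are later compared by hash.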

Ideally, there would be something I could put in props.conf (or whichever settings file applies) that prevents duplicates, triplicates, etc. at index time when incoming data matches the hash of data that's already in the index. I know how to use dedup and such at search time, but there has to be a way (hopefully?) to prevent duplicates in the first place, so I don't waste IO indexing them and then CPU/MEM/IO going back over my data at the end of the day to run a search and delete the duplicate events myself.

Thanks again for the help
