Getting Data In

Why is my UDP-input data appearing duplicated in the index?

kavraja
Path Finder

I've run two searches and found that Splunk is indexing duplicate events that come from the same host, source, and sourcetype.

I know I can use dedup, but that only removes the duplicates from the search results; they still take up disk space in the index. Is it possible to stop Splunk from indexing the duplicates in the first place?
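For what it's worth, the search-time workaround mentioned here would look roughly like the sketch below (using "mysearch" as the base search, as in the examples further down); it only trims what comes back from the search, it does not change what is stored in the index.

  mysearch | dedup _raw host source sourcetype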

The two searches I ran to confirm that there were duplicates were:

 - mysearch | stats count values(host) values(source) values(sourcetype) values(index) by _raw | where count>1
 - mysearch | convert ctime(_indextime) AS indextime | table _time indextime _raw

Any help in the right direction would be greatly appreciated. Thanks


jrodman
Splunk Employee

If the data is coming in via UDP, then the most likely scenario is that duplicate data is actually arriving over UDP.

The other possible scenario is that the data is coming in over UDP to a forwarder, and something is going wrong in the chain of forwarders to your indexer. For example, if you have useACK enabled on your forwarder and communication with the indexer layer keeps failing, the forwarder will resend data to other indexers to ensure a complete record.
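For reference, useACK is a forwarder-side setting in outputs.conf. A minimal sketch follows; the output-group name and indexer addresses are placeholders, not taken from this thread.

  # outputs.conf on the forwarder (illustrative placeholders)
  [tcpout:primary_indexers]
  server = indexer1.example.com:9997, indexer2.example.com:9997
  # With indexer acknowledgement enabled, unacknowledged data is resent,
  # which can show up as duplicates if connections keep failing mid-stream
  useACK = true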

However, given how consistently the problem occurs, that seems unlikely.

As next investigative steps I would suggest:

  1. Review the _indextime value of a set of duplicates, i.e. <search that gets duplicates> | eval visibleindextime=strftime(_indextime, "%Y-%m-%d %H:%M:%S")
  2. Run tcpdump, Wireshark, or a similar tool on the incoming UDP stream to see whether the duplication is already present there (see the sketches after this list).
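To make these steps concrete, here is a sketch. The SPL uses "mysearch" to stand for whatever base search surfaces the duplicates (as in the question above), and the tcpdump command assumes the UDP input is on port 516, the port mentioned later in this thread; the interface and packet count are placeholders.

  mysearch
  | eval visibleindextime=strftime(_indextime, "%Y-%m-%d %H:%M:%S")
  | table _time visibleindextime host source sourcetype _raw

If the duplicates each carry their own _indextime, Splunk really did receive and index them as separate events, which points at the input rather than at search-time behaviour.

  # Capture the raw UDP stream on the receiving host to check whether the
  # duplicates already exist on the wire (interface and count are assumptions)
  tcpdump -i any -nn -A -c 200 udp port 516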

kavraja
Path Finder

I ran the first search and got the following results:

Top 10 Values          Count
2014-10-10 08:34:14    349
2014-10-10 08:34:58    220
2014-10-10 08:33:56    181
2014-10-10 08:35:46    174

and so on, and the duplication does appear on the other platforms as well. I'll keep investigating and see what solution I come up with.
Thanks


evinasco
Communicator

Were you able to resolve this issue?

Thanks


Ayn
Legend

You need to provide more information about the original source; otherwise it's very hard to say what is causing these duplicate events in your index. Splunk will not duplicate events on its own, but it will happily index an event that arrives twice.


kavraja
Path Finder

The original source of the logs is ContentKeeper. Whenever a user logs into the VPN, for example, the same log shows up over 150 times with the same timestamp, so roughly 150 events per second from a single user are filling up disk space. An example of the logs coming through:

10/9/14 8:36:17.000 AM Oct 9 08:36:17 xxx.xxx.xx.xx.143 09-10-2014; 08:36:15, 26, xxx.xxx.xx.xx, user, 54, 1, text/html, http//somevpn.net default
10/9/14 8:36:17.000 AM Oct 9 08:36:17 xxx.xxx.xx.xx.143 09-10-2014; 08:36:15, 26, xxx.xxx.xx.xx, user, 54, 1, text/html, http//somevpn.net default
10/9/14 8:36:17.000 AM Oct 9 08:36:17 xxx.xxx.xx.xx.143 09-10-2014; 08:36:15, 26, xxx.xxx.xx.xx, user, 54, 1, text/html, http//somevpn.net default

The IP and port are all the same.
source = udp:516
sourcetype = syslog
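For context, a UDP input that produces this source and sourcetype would typically be defined in inputs.conf along these lines. This is only a sketch based on the source/sourcetype shown above, not the actual configuration in use here.

  # inputs.conf on the Splunk instance receiving the syslog traffic (sketch)
  [udp://516]
  sourcetype = syslog
  # use the sender's IP address as the host field
  connection_host = ip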
