Getting Data In

Why is my UDP-input data appearing duplicated in the index?

kavraja
Path Finder

I've run two searches which show that Splunk is indexing duplicate events from the same host, source, and sourcetype.

I know I can use dedup, but that only removes duplicates from the search results; they still take up disk space in the index. Is it possible to stop Splunk from indexing duplicates in the first place?

The two searches I ran to confirm the duplicates were:

 - mysearch | stats count values(host) values(source) values(sourcetype) values(index) by _raw | WHERE count>1
 - mysearch | convert ctime(_indextime) AS indextime | table _time indextime _raw

Any help in the right direction would be greatly appreciated. Thanks


jrodman
Splunk Employee

If the data is coming in via UDP, then the most likely scenario is that the duplicate data is actually arriving over UDP more than once.

The other possible scenario is that this data is coming in over UDP to a forwarder, and something is going wrong in the chain of forwarders to your indexer. For example, if you have useACK enabled on your forwarder and the communication to the indexer layer keeps failing, the forwarder will resend data (possibly to other indexers) to ensure a complete record.

However, given how consistently the problem occurs, that seems unlikely.
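For reference, the resend behaviour above comes into play with a forwarder configuration along these lines. This is only a sketch; the output group name, hostnames, and ports are placeholders, not values from your environment:

    # outputs.conf on the forwarder (illustrative values only)
    [tcpout]
    defaultGroup = primary_indexers

    [tcpout:primary_indexers]
    server = indexer1.example.com:9997, indexer2.example.com:9997
    # With useACK enabled, unacknowledged data is resent, which can
    # produce duplicates if acknowledgements keep getting lost.
    useACK = true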

As next investigative steps I would suggest:

  1. Review the _indextime value of a set of duplicates, e.g. search that gets duplicates | eval visibleindextime=strftime(_indextime, "%Y-%m-%d %H:%M:%S") (see the sketch after this list).
  2. Run tcpdump, Wireshark, or similar on the incoming UDP stream to see whether the duplication is already present on the wire.
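As a rough sketch combining step 1 with the duplicate-finding search from the question (mysearch is a placeholder for whatever base search returns the duplicated events):

    mysearch
    | eval visibleindextime=strftime(_indextime, "%Y-%m-%d %H:%M:%S")
    | stats count values(visibleindextime) AS indextimes by _raw
    | where count > 1
    | sort - count

If the copies show clearly different index times, they were received at different moments, which points at resends somewhere in the chain; identical index times suggest the copies arrived together in the same stream.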

kavraja
Path Finder

I ran the first search and got the following results:

Top 10 values and counts:

2014-10-10 08:34:14 - 349
2014-10-10 08:34:58 - 220
2014-10-10 08:33:56 - 181
2014-10-10 08:35:46 - 174

and so on, and the duplication does appear on the other platforms as well. I'll keep investigating and see what solution I come up with.
Thanks


evinasco
Communicator

Were you able to resolve this issue?

Thanks


Ayn
Legend

You need to provide more information about the original source; otherwise it's very hard to say what is causing these duplicate events in your index. Splunk will not duplicate events by itself, but it will happily index the same event twice if it is received twice.


kavraja
Path Finder

The original source for the logs is ContentKeeper. Whenever a user logs into the VPN, for example, the same log shows up over 150 times with the same timestamp. So roughly 150 events every second from a single user are filling up disk space. An example of the logs coming through:

10/9/14 8:36:17.000 AM Oct 9 08:36:17 xxx.xxx.xx.xx.143 09-10-2014; 08:36:15, 26, xxx.xxx.xx.xx, user, 54, 1, text/html, http//somevpn.net default
10/9/14 8:36:17.000 AM Oct 9 08:36:17 xxx.xxx.xx.xx.143 09-10-2014; 08:36:15, 26, xxx.xxx.xx.xx, user, 54, 1, text/html, http//somevpn.net default
10/9/14 8:36:17.000 AM Oct 9 08:36:17 xxx.xxx.xx.xx.143 09-10-2014; 08:36:15, 26, xxx.xxx.xx.xx, user, 54, 1, text/html, http//somevpn.net default

The IP and port are all the same.
source = udp:516
sourcetype = syslog
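As a rough sketch, a search along these lines could quantify how many copies of each event arrive per second on that input (adjust the source value if yours differs):

    source="udp:516" sourcetype=syslog
    | bin _time span=1s
    | stats count AS copies by _time, _raw
    | where copies > 1
    | stats avg(copies) AS avg_copies max(copies) AS max_copies count AS duplicated_events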
