Getting Data In

Why is my UDP-input data appearing duplicated in the index?

kavraja
Path Finder

I've run two searches and found that Splunk is indexing duplicate events that come from the same host, source, and sourcetype.

I know I can use dedup, but that only removes the duplicates from the search results; they still take up disk space in the index. Is it possible to stop Splunk from indexing the duplicates in the first place?
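For what it's worth, the search-time workaround mentioned here would look roughly like the sketch below (using "mysearch" as the base search, as in the examples further down); it only trims what comes back from the search, it does not change what is stored in the index.

  mysearch | dedup _raw host source sourcetype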

The two searches I ran to confirm that there were duplicates were:

 - mysearch | stats count values(host) values(source) values(sourcetype) values(index) by _raw | where count>1
 - mysearch | convert ctime(_indextime) AS indextime | table _time indextime _raw

Any help in the right direction would be greatly appreciated. Thanks


jrodman
Splunk Employee

If the data is coming in via UDP, then the most likely scenario is that duplicate data is actually arriving over UDP.

The other possible scenario is that the data is coming in over UDP to a forwarder, and something is going wrong in the chain of forwarders to your indexer. For example, if you have useACK enabled on your forwarder and communication with the indexer layer keeps failing, the forwarder will resend data to other indexers to ensure a complete record.
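For reference, useACK is a forwarder-side setting in outputs.conf. A minimal sketch follows; the output-group name and indexer addresses are placeholders, not taken from this thread.

  # outputs.conf on the forwarder (illustrative placeholders)
  [tcpout:primary_indexers]
  server = indexer1.example.com:9997, indexer2.example.com:9997
  # With indexer acknowledgement enabled, unacknowledged data is resent,
  # which can show up as duplicates if connections keep failing mid-stream
  useACK = true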

However, given how consistently the problem occurs, that seems unlikely.

As next investigative steps I would suggest:

  1. Review the _indextime value of a set of duplicates, i.e. <search that gets duplicates> | eval visibleindextime=strftime(_indextime, "%Y-%m-%d %H:%M:%S")
  2. Run tcpdump, Wireshark, or a similar tool on the incoming UDP stream to see whether the duplication is already present there (see the sketches after this list).
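To make these steps concrete, here is a sketch. The SPL uses "mysearch" to stand for whatever base search surfaces the duplicates (as in the question above), and the tcpdump command assumes the UDP input is on port 516, the port mentioned later in this thread; the interface and packet count are placeholders.

  mysearch
  | eval visibleindextime=strftime(_indextime, "%Y-%m-%d %H:%M:%S")
  | table _time visibleindextime host source sourcetype _raw

If the duplicates each carry their own _indextime, Splunk really did receive and index them as separate events, which points at the input rather than at search-time behaviour.

  # Capture the raw UDP stream on the receiving host to check whether the
  # duplicates already exist on the wire (interface and count are assumptions)
  tcpdump -i any -nn -A -c 200 udp port 516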

kavraja
Path Finder

I ran the first search and got the following results:

Top 10 Values          Count
2014-10-10 08:34:14    349
2014-10-10 08:34:58    220
2014-10-10 08:33:56    181
2014-10-10 08:35:46    174

and so on, and the duplication does appear on the other platforms as well. I'll keep investigating and see what solution I come up with.
Thanks


evinasco
Communicator

Were you able to resolve this issue?

Thanks


Ayn
Legend

You need to provide more information about the original source; otherwise it's very hard to say what is causing these duplicate events in your index. Splunk will not duplicate events on its own, but it will happily index an event that arrives twice.


kavraja
Path Finder

The original source of the logs is ContentKeeper. Whenever a user logs into the VPN, for example, the same log shows up over 150 times with the same timestamp, so roughly 150 events per second from a single user are filling up disk space. An example of the logs coming through:

10/9/14 8:36:17.000 AM Oct 9 08:36:17 xxx.xxx.xx.xx.143 09-10-2014; 08:36:15, 26, xxx.xxx.xx.xx, user, 54, 1, text/html, http//somevpn.net default
10/9/14 8:36:17.000 AM Oct 9 08:36:17 xxx.xxx.xx.xx.143 09-10-2014; 08:36:15, 26, xxx.xxx.xx.xx, user, 54, 1, text/html, http//somevpn.net default
10/9/14 8:36:17.000 AM Oct 9 08:36:17 xxx.xxx.xx.xx.143 09-10-2014; 08:36:15, 26, xxx.xxx.xx.xx, user, 54, 1, text/html, http//somevpn.net default

The IP and port are all the same.
source = udp:516
sourcetype = syslog
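For context, a UDP input that produces this source and sourcetype would typically be defined in inputs.conf along these lines. This is only a sketch based on the source/sourcetype shown above, not the actual configuration in use here.

  # inputs.conf on the Splunk instance receiving the syslog traffic (sketch)
  [udp://516]
  sourcetype = syslog
  # use the sender's IP address as the host field
  connection_host = ip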
