I've run two searches that confirm Splunk is indexing duplicate events that come from the same host, source, and sourcetype.
I know I can use dedup, but that only removes duplicates from the search results; the duplicate events still take up disk space in the index. Is it possible to stop Splunk from indexing duplicates in the first place?
The two searches I ran to confirm that there were duplicates were:
- mysearch | stats count values(host) values(source) values(sourcetype) values(index) by _raw | where count>1
- mysearch | convert ctime(_indextime) AS indextime | table _time indextime _raw
Any pointers in the right direction would be greatly appreciated. Thanks!
If the data is coming in via UDP, then the most likely scenario is that the source itself is sending the data more than once.
The other possible scenario is that the data is coming in over UDP to a forwarder, and something is going wrong in the chain of forwarders between there and your indexer. For example, if you have useACK enabled on your forwarder and communication with the indexer layer keeps failing, the forwarder will resend the data (possibly to other indexers) to ensure a complete record — which can show up as duplicates.
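For reference, useACK lives in outputs.conf on the forwarder. A minimal sketch — the stanza name and indexer addresses here are placeholders, not from your environment:

```
# outputs.conf on the forwarder (hypothetical group name and servers)
[tcpout:my_indexers]
server = indexer1.example.com:9997, indexer2.example.com:9997
# With useACK enabled, any data the forwarder does not get an
# acknowledgement for is resent; if ACKs are being lost, the same
# events can be indexed more than once.
useACK = true
```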
However, given how consistently the problem occurs, that seems unlikely.
As a next investigative step, I would suggest:
search that gets duplicates | eval visibleindextime=strftime(_indextime, "%Y-%m-%d %H:%M:%S")
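To tie that together with the duplicate-finding search from the question, one option (just a sketch, building on the searches already shown above) is:

```
mysearch
| eval visibleindextime=strftime(_indextime, "%Y-%m-%d %H:%M:%S")
| stats count values(visibleindextime) AS indextimes by _raw
| where count > 1
```

If each duplicated _raw shows several distinct index times, the event genuinely arrived multiple times, rather than being duplicated inside Splunk.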
I ran the first search and got the following results:
Top 10 values by count:
2014-10-10 08:34:14 - 349
2014-10-10 08:34:58 - 220
2014-10-10 08:33:56 - 181
2014-10-10 08:35:46 - 174
and so on; the duplication also appears on the other platforms. I'll keep investigating and see what solution I come up with.
Thanks
Were you able to resolve this issue?
Thanks
You need to provide more information about the original source; otherwise it's very hard to say what is causing these duplicate events in your index. Splunk will not duplicate events on its own, but it will happily index the same event twice if it arrives twice.
The original source for the logs is ContentKeeper. Whenever a user logs into the VPN, for example, the same log shows up over 150 times with the same timestamp. So roughly 150 identical logs per second from a single user are filling up disk space. An example of the logs coming through:
10/9/14 8:36:17.000 AM Oct 9 08:36:17 xxx.xxx.xx.xx.143 09-10-2014; 08:36:15, 26, xxx.xxx.xx.xx, user, 54, 1, text/html, http//somevpn.net default
10/9/14 8:36:17.000 AM Oct 9 08:36:17 xxx.xxx.xx.xx.143 09-10-2014; 08:36:15, 26, xxx.xxx.xx.xx, user, 54, 1, text/html, http//somevpn.net default
10/9/14 8:36:17.000 AM Oct 9 08:36:17 xxx.xxx.xx.xx.143 09-10-2014; 08:36:15, 26, xxx.xxx.xx.xx, user, 54, 1, text/html, http//somevpn.net default
The IP and port are all the same:
source = udp:516
sourcetype = syslog