Getting Data In

Why is Splunk indexing duplicate data with my current universal forwarder configuration?

shahzadarif
Path Finder

I'm setting up Splunk infrastructure and one of the issues I'm facing is duplicate data. I've reproduced this in my test environment, which runs a Splunk Universal Forwarder that forwards data to an indexer.
The Universal Forwarder is set to monitor just one directory. I confirmed the duplication by clearing all the indexes, copying the log file to the monitored location, and noting the number of events that showed up in an indexer search. A few minutes later I copied the same file to the same location, and the event count was exactly 2x the original.
This is what my inputs.conf file looks like on the UF:

[splunk@splunk_universalforwarder local]$ cat inputs.conf 
[monitor:///apps/webdata/splunkdata]
host_regex = /apps/webdata/splunkdata/(\w+)
disabled = false
index = main
sourcetype = stblogs

My understanding is that Splunk, by default, does not index the same file twice. Why isn't that happening in my case? What could I do to fix this? Thanks
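For anyone wanting to reproduce the check, a search along these lines (using the index and sourcetype from the config above) should surface any events that were indexed more than once:

index=main sourcetype=stblogs | stats count AS copies BY _raw, source | where copies > 1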

shahzadarif
Path Finder

I've resolved the issue.
Our Python script was generating sub-directories under the monitored directory, and all the files were being saved under those sub-directories. Getting rid of the sub-directories has sorted the issue out.
Thanks for all your help everyone.
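For anyone who hits the same thing but can't change the script: an alternative sketch is to stop the monitor descending into sub-directories at all, using the standard recursive setting in inputs.conf. Note this stops the sub-directory files from being indexed entirely, so it only helps if those copies are redundant:

[monitor:///apps/webdata/splunkdata]
recursive = false
index = main
sourcetype = stblogs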

fdi01
Motivator

Hello Mr shahzadarif, for this issue see this link:

http://answers.splunk.com/answers/210739/why-is-my-forwarder-sending-data-from-monitored-fi.html

Or, while waiting for a better solution, you can also clean up after indexing:
1. Identify the duplicated events or files.
2. Build a query that fetches what you want to remove and pipe it to the delete command (see the sketch after this list).
3. Schedule that search to run periodically.
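A minimal sketch of step 2, assuming the duplicate copies can be pinned down by their source path (the path shown is hypothetical); note that delete requires the can_delete role and only hides events from search, it does not reduce licence usage or reclaim disk space:

index=main sourcetype=stblogs source="/apps/webdata/splunkdata/duplicate_copy.log" | delete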


Or you can use the dedup command to filter out duplicate events at search time, without adding every field to the dedup command, by deduplicating on _raw.
ex:

index=your_index_name sourcetype=stblogs ... | dedup _raw

shahzadarif
Path Finder

I don't have a macros.conf file in the location you've mentioned.
I'm afraid the second solution won't work in my case. I might be getting hundreds of duplicate files a day, and during the few days I ran Splunk, every single day we exceeded our Splunk licence limit of 100GB by some margin, due to all the duplicates Splunk had indexed.
I don't understand why Splunk's default behaviour of not indexing duplicate data isn't working in my case. In my tests nothing had changed in the files that were indexed more than once: exact same data, exact same file name. Is there good documentation on how Splunk figures out what has already been indexed?
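For sizing the impact, a rough sketch against Splunk's standard license_usage.log can show which sourcetypes are consuming the licence; st and b are the usual sourcetype and bytes fields there:

index=_internal source=*license_usage.log type=Usage | stats sum(b) AS bytes BY st | eval GB=round(bytes/1024/1024/1024,2) | sort - GB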


shahzadarif
Path Finder

Thanks for providing the link.
I don't think it applies to my scenario. The log files we're processing in Splunk are parsed by a Python script on the Splunk forwarder nodes, so we don't have Windows or Samba in our environment.
If it helps, this is what my corresponding props.conf file looks like:

bash-4.1$ cat ./etc/apps/search/local/props.conf
[stblogs]
NO_BINARY_CHECK = true
category = Custom
disabled = false
pulldown_type = true
TIME_FORMAT = %Y:%m:%d %H:%M:%S
TIME_PREFIX = date=[
description = LineBreak-Timestamp
SHOULD_LINEMERGE = true
BREAK_ONLY_BEFORE = date=
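Based on those TIME_PREFIX, TIME_FORMAT, and BREAK_ONLY_BEFORE settings, each event presumably begins with a line of this shape (hypothetical sample):

date=[2015:06:18 14:05:32] ...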


fdi01
Motivator

If you could, modify $SPLUNK_HOME/etc/apps/splunk_deployment_monitor/default/macros.conf and change this:

[forwarder_metrics]
definition = index="_internal" source="metrics.lo" group=tcpin_connections | eval sourceHost=if(isnull(hostname), sourceHost,hostname) | eval connectionType=case(fwdType=="uf","universal forwarder", fwdType=="lwf", "lightweight forwarder",fwdType=="full", "heavy forwarder", connectionType=="cooked" or connectionType=="cookedSSL","Splunk forwarder", connectionType=="raw" or connectionType=="rawSSL","legacy forwarder")| eval build=if(isnull(build),"n/a",build) | eval version=if(isnull(version),"pre 4.2",version) | eval guid=if(isnull(guid),sourceHost,guid) | eval os=if(isnull(os),"n/a",os)| eval arch=if(isnull(arch),"n/a",arch) | fields connectionType sourceIp sourceHost sourcePort destPort kb tcp_eps tcp_Kprocessed tcp_KBps splunk_server build version os arch guid

To this:
[forwarder_metrics]
definition = index="_internal" source="metrics.lo" group=tcpin_connections NOT eventType=* | eval sourceHost=if(isnull(hostname), sourceHost,hostname) | eval connectionType=case(fwdType=="uf","universal forwarder", fwdType=="lwf", "lightweight forwarder",fwdType=="full", "heavy forwarder", connectionType=="cooked" or connectionType=="cookedSSL","Splunk forwarder", connectionType=="raw" or connectionType=="rawSSL","legacy forwarder")| eval build=if(isnull(build),"n/a",build) | eval version=if(isnull(version),"pre 4.2",version) | eval guid=if(isnull(guid),sourceHost,guid) | eval os=if(isnull(os),"n/a",os)| eval arch=if(isnull(arch),"n/a",arch) | fields connectionType sourceIp sourceHost sourcePort destPort kb tcp_eps tcp_Kprocessed tcp_KBps splunk_server build version os arch guid



shahzadarif
Path Finder

Could someone suggest a possible fix for this? It's causing a delay in taking Splunk to the live/production environment.
Thanks


srinathd
Contributor
Add crcSalt = <SOURCE> to the inputs.conf configuration.
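A sketch of how that could look in the monitor stanza from the question; crcSalt = <SOURCE> is a standard inputs.conf setting that adds the full source path to the initial-bytes CRC, so files with identical beginnings at different paths are tracked as different files:

[monitor:///apps/webdata/splunkdata]
crcSalt = <SOURCE>
host_regex = /apps/webdata/splunkdata/(\w+)
disabled = false
index = main
sourcetype = stblogs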

shahzadarif
Path Finder

Just to confirm, I need to do this in the Splunk forwarders' inputs.conf, right?
