From a Windows box where the Universal Forwarder is installed, we're picking up a CSV extract (table.csv) every 24 hours.
Each CSV contains exactly ONE row, and that row carries an ID.
This is the forwarder configuration:
[monitor://D:\Splunk\Extract\table.csv]
disabled=0
followTail=0
index=alpha
sourcetype=alpha_sourcetype
Problem Statement
We're getting duplicates IF the CSV extract is posted twice. We REPLACE the OLD extract with the NEW one, but Splunk indexes both the OLD and the NEW because it is always monitoring the file and indexing.
Splunk therefore logs two entries for the same record (i.e. the SAME ID), one for EACH TIME the CSV was replaced, each with its own timestamp:
5/23/12 "2012-05-23 10:20:05.100000",
5/23/12 "2012-05-23 10:19:05.100000",
We only need the latest of these two.
Is there any way to configure the forwarder so that, every time a new version of the extract is posted, it only ever indexes the LATEST copy within a 24-hour period?
If a forwarder cannot be configured that way, how would we modify the following query to pick only the latest entry?
index=alpha ID=* | sort ID
The ID is identical between versions; there is only ever ONE RECORD.
The CSV gets replaced, which is why Splunk indexes it twice, so the same ID is indexed twice.
Seems to me the only possible way to do this is to wait until the end of the 24-hour period to see if a new version shows up, since I assume it's impossible to know ahead of time whether one is coming. This more or less defeats the point of having a forwarder monitor a file. If you're going to do that, write your own script and move the file into the batch directory once you've determined that it's safe to index it.
I guess you can also just use:
| dedup ID
since, with the default newest-first search order, that just returns the most recent entry for each ID, but you really haven't described your data enough to know if that would actually work.
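For what it's worth, the "write your own script and move the file into the batch directory" idea could look roughly like the sketch below. It is only an illustration under some assumptions of mine: the extract is dropped into a staging folder that Splunk does NOT monitor, "safe to index" is taken to mean "not modified for some quiet period", and the function name `promote_if_settled` and both folder paths are hypothetical.

```python
import shutil
import time
from pathlib import Path


def promote_if_settled(staging_file, batch_dir, quiet_seconds, now=None):
    """Move staging_file into batch_dir once it has sat unmodified for quiet_seconds.

    Returns the destination Path if the file was moved, otherwise None.
    """
    src = Path(staging_file)
    if not src.is_file():
        return None  # nothing has been delivered yet

    mtime = src.stat().st_mtime
    now = time.time() if now is None else now
    if now - mtime < quiet_seconds:
        return None  # a replacement may still arrive, so keep waiting

    dest_dir = Path(batch_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Put the file's modification time in the name so daily extracts never collide.
    stamp = time.strftime("%Y%m%d%H%M%S", time.localtime(mtime))
    dest = dest_dir / f"{src.stem}_{stamp}{src.suffix}"
    shutil.move(str(src), str(dest))
    return dest


if __name__ == "__main__":
    # Hypothetical paths; the staging folder is assumed NOT to be monitored by Splunk.
    moved = promote_if_settled(
        r"D:\Splunk\Staging\table.csv", r"D:\Splunk\Batch", quiet_seconds=3600
    )
    print(f"moved to {moved}" if moved else "not ready yet")
```

You would run something like this on a schedule (Task Scheduler on Windows). On the Splunk side it assumes a [batch://...] input with move_policy = sinkhole pointing at the batch folder, in place of the [monitor://...] stanza, so each promoted file is indexed once and then removed. Because a replaced extract simply overwrites the staging copy before it is promoted, only the last version of the day ever reaches Splunk.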
Thanks for your answer!
We were hoping not to use dedup, but rather to coerce Splunk into giving us only the latest (the appropriate term being LAST) set of timestamps for each record (i.e. each ID) and to ignore the FIRST or EARLIER timestamps it indexed.
Can you clarify: is this ID identical between the two versions of the extract, or does it change? Also, I assume you're getting duplicates of the entire file, not just one record; is that correct?