I'm indexing a bunch of CSV files provided by an external vendor over FTP (mapped or synched to my local drive). There may be duplicate rows across different files. The requirement is to take the row from the file with the latest timestamp. I can achieve this by either:
a) ensuring that the order in which Splunk indexes my data matches the order of the file timestamps. Can someone suggest how I can do this without having to rewrite, in a script, the entire 'scan directory for updated files' logic that Splunk nicely provides?
b) adding an extra field 'fileTimeStamp'. How would I specify this in my props.conf?
c) looking up the file timestamps via a 'lookup' at search time. But if a file has just been updated at search time and has not been indexed yet, I may see misleading results.
Any suggestions?
No, you cannot selectively ask Splunk to monitor part of a file, or control the order in which files are indexed.
A) The simple solution is to dedup the events:
source=mypath/to/my/folder/* | dedup _raw
See http://docs.splunk.com/Documentation/Splunk/5.0.3/SearchReference/dedup
B) No. The modification time of the file is not indexed. The closest you have is _indextime (the time the event is received at the indexer).
One solution is to index everything and use the timestamp of the events:
source=mypath/to/my/folder/* | stats latest(_raw) AS _raw by source
or the index time:
source=mypath/to/my/folder/* | eval oldtime=_time | eval _time=_indextime | stats latest(oldtime) AS oldtime latest(_raw) AS _raw by source
C) Use _indextime for the same purpose.
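For example, a sketch of deduplicating by index time: dedup keeps the first event it encounters in the result order, so sorting the duplicates by index time descending keeps the most recently indexed copy of each row (the field name itime is just an illustrative choice, since _indextime itself is a hidden field):

source=mypath/to/my/folder/* | eval itime=_indextime | dedup _raw sortby -itime | fields - itime

This assumes the vendor files are indexed roughly in the order they arrive; if files can be indexed out of order, the stats-based approaches above are safer.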