I'm indexing a bunch of CSV files provided by an external vendor over FTP (mapped or synched to my local drive). There may be duplicate rows across different files. The requirement is to take the row from the file with the latest timestamp. I can achieve this by either:
a) ensuring that the order in which Splunk indexes my data matches the order of the file timestamps. Can someone suggest how I can do this without having to rewrite, in a script, the entire 'scan directory for updated files' logic that Splunk nicely provides?
b) adding an extra field 'fileTimeStamp'. How would I specify this in my props.conf?
c) looking up the file timestamps via a 'lookup' at search time. But if a file has just been updated at search time and has not been indexed yet, I may see misleading results.
Any suggestions?
No, you cannot selectively ask Splunk to monitor part of a file, or control the order in which files are indexed.
A) The simple solution is to dedup the events:
source=mypath/to/my/folder/* | dedup _raw
See http://docs.splunk.com/Documentation/Splunk/5.0.3/SearchReference/dedup
B) No. The modification time of the file is not indexed. The closest you have is _indextime (the time the event is received at the indexer).
One solution is to index everything and use the timestamp of the events:
source=mypath/to/my/folder/* | stats latest(_raw) AS _raw by source
or the index time:
source=mypath/to/my/folder/* | eval oldtime=_time | eval _time=_indextime | stats latest(oldtime) AS oldtime latest(_raw) AS _raw by source
C) Use _indextime for the same purpose.
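For example, a sketch of deduplicating by index time: dedup keeps the first event it encounters in the result order, so sorting the duplicates by index time descending keeps the most recently indexed copy of each row (the field name itime is just an illustrative choice, since _indextime itself is a hidden field):

source=mypath/to/my/folder/* | eval itime=_indextime | dedup _raw sortby -itime | fields - itime

This assumes the vendor files are indexed roughly in the order they arrive; if files can be indexed out of order, the stats-based approaches above are safer.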