Universal Forwarders and indexing

asarolkar
Builder

From a Windows box where the Universal Forwarder is installed, we're picking up a CSV extract (table.csv) every 24 hours.

Each CSV contains exactly one row, and that row includes an ID.

This is the forwarder configuration:

[monitor://D:\Splunk\Extract\table.csv]
disabled=0
followTail=0
index=alpha
sourcetype=alpha_sourcetype

Problem Statement

We get duplicates if the CSV extract is posted twice in a day. We replace the old extract with the new one, but Splunk indexes both, because it is always monitoring the file and indexing whatever appears.

Splunk therefore logs two entries for the same record (that is, the same ID), one for each time the CSV was replaced, each with its own timestamp:

5/23/12 "2012-05-23 10:20:05.100000",
5/23/12 "2012-05-23 10:19:05.100000",

We only need the latest of these two.

Is there any way to configure the forwarder so that every time a new version of the extract is posted, it only ever indexes the latest copy within a 24-hour period?

If a forwarder cannot be configured that way, how would we modify the following query to pick only the latest entry?

index=alpha ID=* | sort ID

asarolkar
Builder

The ID is identical between the two versions; there is only ever one record.

The CSV gets replaced, which is why Splunk indexes it twice, and so indexes the same ID twice.

gkanapathy
Splunk Employee

Seems to me the only possible way to do this is to wait until the end of the 24-hour period to see if a new version shows up, since I assume it's impossible to know ahead of time whether one is coming. This more or less defeats the point of having a forwarder monitor a file. If you're going to do that, write your own script and move the file into the batch directory once you've determined that it's safe to index it.
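A minimal sketch of such a script, in Python, under stated assumptions: the function name `stage_if_settled`, the `quiet_seconds` threshold, and the dated file name are all hypothetical, and the destination is assumed to be a directory that the forwarder reads through a `[batch://...]` input with `move_policy = sinkhole` rather than the `[monitor://...]` stanza shown in the question. It only moves the extract once the file has gone unmodified for a chosen period, so an extract that was just replaced is left alone.

```python
import shutil
import time
from pathlib import Path


def stage_if_settled(extract, batch_dir, quiet_seconds, now=None):
    """Move `extract` into `batch_dir` once it has been unmodified for at
    least `quiet_seconds`. Returns the new path, or None if nothing moved."""
    extract = Path(extract)
    batch_dir = Path(batch_dir)
    if not extract.is_file():
        return None  # nothing to stage yet

    now = time.time() if now is None else now
    age = now - extract.stat().st_mtime
    if age < quiet_seconds:
        return None  # replaced too recently; a newer copy may still arrive

    batch_dir.mkdir(parents=True, exist_ok=True)
    # Date-stamp the staged copy so each day's extract gets its own name.
    stamp = time.strftime("%Y%m%d", time.localtime(now))
    target = batch_dir / f"{extract.stem}_{stamp}{extract.suffix}"
    shutil.move(str(extract), str(target))
    return target
```

Run from a scheduled task shortly before the end of the 24-hour window, this would hand Splunk one copy per day. It does not handle a staged file with the same name already sitting in the batch directory; that case would need its own policy.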

I guess you can also just use:

| dedup ID

since that just returns the most recent entry for each ID, but you really haven't described your data well enough for me to know whether that would actually work.
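To make the "most recent entry for each ID" behavior concrete, here is a rough Python model of it, not Splunk's implementation. It assumes events come back newest first, which is the default search order, and keeps the first event seen for each ID. The function name `dedup_latest` and the ID value "1" are made up for illustration; the timestamps are the two from the question.

```python
def dedup_latest(events, key="ID"):
    """Keep only the newest event for each distinct value of `key`."""
    seen = set()
    kept = []
    # Walk the events newest first, as a default search would return them.
    for event in sorted(events, key=lambda e: e["_time"], reverse=True):
        if event[key] not in seen:
            seen.add(event[key])
            kept.append(event)
    return kept


events = [
    {"ID": "1", "_time": "2012-05-23 10:19:05.100000"},  # first copy indexed
    {"ID": "1", "_time": "2012-05-23 10:20:05.100000"},  # replacement copy
]
latest = dedup_latest(events)
```

With these two events, only the 10:20:05 entry survives, which is the "latest of these two" the question asks for.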

asarolkar
Builder

Thanks for your answer!

We were hoping not to use dedup, but rather to coerce Splunk into giving us only the latest (or, more precisely, the LAST) set of timestamps for each record (that is, each ID) and to ignore the first, earlier timestamps it indexed.

gkanapathy
Splunk Employee

Can you clarify: is this ID identical between the two versions of the extract, or does it change? Also, I assume you're getting duplicates of the entire file, not just one record. Is that correct?
