CSV Monitoring issues

pkeller
Contributor

[monitor:///home/paul/training_status/]
whitelist = (.csv$|.CSV$)
blacklist = .filepart$
index=training_index
sourcetype=training_status
crcSalt = <SOURCE>

The file is updated once per week, and in many cases it is not fully consumed. The most recent update missed 19 records (which were consumed the last time the file was updated).

Splunkd.log shows:

04-06-2017 07:39:19.584 -0700 INFO WatchedFile - Will begin reading at offset=4234 for file='/home/paul/training_status/filename.csv

So, my uneducated guess would be that splunkd is seeing data that it's already consumed and thus ignoring those 19 records before it starts ingesting.

How do I prevent this? I thought setting crcSalt=<SOURCE> was supposed to handle this.

Thank you.

DalJeanis
Legend

crcSalt=<SOURCE> instructs Splunk to use the full file path and name, in addition to the first 256 bytes of the file, to determine whether it has already indexed that file. If you are not changing the filename, Splunk will resume indexing wherever it left off (or wherever the data has changed).

If you want the same records to be consumed again each time the file is updated, the easy ways are (A) put a timestamp in the file name, (B) add an update timestamp column to each row of the CSV, or (C) add a timestamp to the header in the file.

Alternatively, assuming the "source of record" for the file is someplace safe, you could have Splunk delete the file when it has finished indexing it, so that any file found will be "new".
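
A minimal sketch of that last option, assuming the weekly file can safely be removed from the directory once it has been indexed. It uses a batch input with the sinkhole move policy in inputs.conf, which deletes each file after reading it. The path, whitelist, blacklist, index, and sourcetype are simply carried over from the question, so adjust them to your environment.

```ini
# inputs.conf (sketch): intended to replace the [monitor://...] stanza
# from the question, not to run alongside it.
# WARNING: sinkhole deletes each file once it has been indexed,
# so keep the source of record somewhere else.
[batch:///home/paul/training_status/]
move_policy = sinkhole
whitelist = (.csv$|.CSV$)
blacklist = .filepart$
index = training_index
sourcetype = training_status
```

The intent is that each weekly drop arrives as a file Splunk has not seen in that location before. It would be worth confirming the behavior against a test index before pointing this at data you care about.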

pkeller
Contributor

Thank you. This makes things very clear. - Cheers
