We have a process that identifies, captures, and writes high-priority/urgent events to a CSV file, which gets overwritten every time the process executes. The contents may not change for days, yet Splunk is indexing the whole file on every run, even when the contents of the file haven't changed.
The program that creates the CSV file calls an external vendor SOAP web service. I could add a bunch of logic to the program to persist a timestamp and use it as a filter for new service responses, but we prefer to index data as received/logged. I'm not sure which CRC values Splunk is using to determine how to read the file; my next step is to see whether those values are available in a debug/log file.
Has anybody run into this and know of a solution? I don't rule out user error; I'm fairly new to Splunk.
Any help would be appreciated.
Problem solved. The issue was a bug in the program: it was opening and closing the file multiple times during the process, which I'm guessing is why the CRC wasn't matching. I refactored the code and all is well.
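For anyone hitting the same thing, a minimal sketch of the single-open pattern the fix implies: build the whole CSV in one pass and atomically replace the monitored file, so Splunk never sees a partially written file. The function name, path, and fields are hypothetical, not from the original program.

```python
import csv
import os
import tempfile

def write_events_csv(path, header, rows):
    """Write the CSV in a single pass: stage it in a temp file,
    then atomically swap it into place. The monitored file is
    opened and closed exactly once per run, so its head bytes
    (which Splunk uses for the CRC) stay stable.
    Hypothetical helper, not from the original post."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows)
        # os.replace is atomic on both POSIX and Windows
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise
```

The temp-file-plus-rename step also means a reader (or Splunk's tailing processor) can never observe a half-written file mid-run.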
By default, Splunk builds a CRC from the first 256 bytes of a file, regardless of whether it is a CSV or another format.
Is it possible that, even though the contents aren't changing, some other header information is changing?
Have a look at http://docs.splunk.com/Documentation/Splunk/latest/admin/inputsconf — you can change the number of bytes the CRC is built from by setting initCrcLength.
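For reference, that setting goes in the monitor stanza in inputs.conf. A sketch of what it might look like (the path and value here are illustrative assumptions, not from the original post):

```ini
# inputs.conf -- monitor stanza for the generated CSV
# Path and byte count below are illustrative, not from the thread.
[monitor://D:\fd\myfile.csv]
sourcetype = csv
# Build the initial CRC from more of the file than the default 256 bytes,
# so files that share a common header are still told apart.
initCrcLength = 1024
```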
Otherwise, you may be better off storing the CSV in a lookups folder and searching it as a lookup.
That way you could build a dashboard that uses | inputlookup to pull in the CSV and then run a search against it for certain criteria.
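For example, a search along these lines (the lookup file name and field names are hypothetical):

```spl
| inputlookup myfile.csv
| search severity="high"
| table id severity message
```

Because a lookup is read at search time rather than indexed, overwriting the file with identical contents can't create duplicate events.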
I'm also fairly new to Splunk and have been searching for an answer to this problem for the last two days. I see several similar questions with no clear answer. I am also using CSV files that are frequently overwritten with the same or new data, and each time Splunk re-indexes the data, creating duplicates. I'm getting the following in splunkd.log:
WatchedFile - Checksum for seekptr didn't match, will re-read entire file='D:\fd\myfile.csv'.
Not sure if this is related. Any help is appreciated!