Getting Data In

splunk indexing the same files again and again and again and ...

nathanh42
Explorer

I have a Splunk universal forwarder on a client machine. The deployed app's inputs.conf looks like this:

  [monitor:///export/home/storeadm/r*]
  disabled = false
  followTail = 0
  index = contentkeeper
  source = contentkeeper_passed
  sourcetype = contentkeeper_passed
  whitelist = (/r.*\.csv.gz$|/r.*\.csv$)

On the indexer there is a corresponding props.conf entry

  [contentkeeper_passed]
  REPORT-ckpassed = ckpassed_extractions

And a corresponding transforms.conf entry

  [ckpassed_extractions]
  DELIMS=","
  FIELDS="Time","Category","IP-Address","Username","Bytes","Status","Content-Type","Url","Policy","Category-Description"

The data files are all compressed (.csv.gz), so the second whitelist alternative is superfluous. There are a few months of data sitting in that directory.

The volume of data is quite small (only 10s of MB per day). PS: sorry about the timestamps. I touched the files as a test, but usually the files have an incrementing daily timestamp.

-rw-r--r--   1 storeadm storeadm    1.3M Mar 15 15:41 r29-12-2011.csv.gz
-rw-r--r--   1 storeadm storeadm     38M Mar 15 15:41 r30-01-2012.csv.gz
-rw-r--r--   1 storeadm storeadm    2.5M Mar 15 15:41 r30-10-2011.csv.gz
-rw-r--r--   1 storeadm storeadm     44M Mar 15 15:41 r30-11-2011.csv.gz
-rw-r--r--   1 storeadm storeadm    781K Mar 15 15:41 r30-12-2011.csv.gz

However, my license quota is often exceeded: typically more than 20GB (that's GIGABYTES!) is indexed per day. I don't think it's the months of data that's the problem, either; the entire directory is only 3.6GB.

 $ du -sh /export/home/storeadm/
 3.6G   /export/home/storeadm

I think the problem is that Splunk is re-indexing the same files.

  $  grep "reading path" splunkd.log | awk '{print $8}' | sort | uniq -c
  ...
   4 path=/export/home/storeadm/r30-10-2011.csv.gz
   2 path=/export/home/storeadm/r30-11-2011.csv.gz
   6 path=/export/home/storeadm/r30-12-2011.csv.gz
   2 path=/export/home/storeadm/r31-10-2011.csv.gz
   2 path=/export/home/storeadm/r31-12-2011.csv.gz

These are the kinds of entries I'm grepping over.

  03-17-2012 06:42:20.258 +1100 INFO  ArchiveProcessor - handling file=/export/home/storeadm/r09-03-2012.csv.gz
  03-17-2012 06:42:20.295 +1100 INFO  ArchiveProcessor - reading path=/export/home/storeadm/r09-03-2012.csv.gz (seek=0 len=50548338)
  03-17-2012 07:28:09.552 +1100 INFO  ArchiveProcessor - Finished processing file '/export/home/storeadm/r09-03-2012.csv.gz', removing from stats

What should I do to check whether Splunk is re-indexing the same files, contributing to my license problem? Is there some search I can run over the metrics index?
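
Something like this over the internal index is what I had in mind, if that's a valid way to check (assuming the per_source_thruput group in metrics.log is the right place to look):

  index=_internal source=*metrics.log* group=per_source_thruput series="*storeadm*"
  | timechart span=1d sum(kb) AS kb_indexed by series

And if the license master writes license_usage.log, a similar search over source=*license_usage.log* type=Usage, split by the s (source) field, should show what actually counted against the license.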


fgilain
Engager

I finally had to use a Splunk forwarder on my source server instead of remote-mounting the share with the logs... everything works now.


mdurkin
New Member

FG,

I'm having exactly the same problem as you and was wondering if you ever found a solution. I have a ticket open with Splunk, but they haven't been able to find a solution for me yet. Our situation is also similar in that the log files are not on local disk; mine are mounted over NFS.

Thanks,

-MD


sowings
Splunk Employee

You might want to use a CRC salt. The most common usage is to include the file path as part of the CRC Splunk uses to answer the question "Has this data already been indexed?" See the docs on inputs.conf here.
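
Applied to the stanza from the original post, that would look roughly like this (<SOURCE> is the literal string Splunk expects, not a placeholder; be aware that changing the salt makes existing files look new once, so expect a one-off re-index):

  [monitor:///export/home/storeadm/r*]
  # existing settings (index, sourcetype, whitelist, ...) stay as they are
  crcSalt = <SOURCE>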

I wonder also whether these files are being rotated daily, and if so, whether they are immediately compressed. If Splunk sees that the base file it's reading doesn't have the same CRC as the last time it looked, it will attempt to read forward in the file to find a line that matches the last one it saw. If it doesn't find one, it believes the file is completely new (these are the 'seekptr' messages). Furthermore, Splunk doesn't always do a good job of figuring out where it left off if the rotated version of the file has been compressed. It's best if you can leave the most-recently rotated version uncompressed, and then compress it on the next rotation cycle.

Some more information can be found here.


fgilain
Engager

No compression here.
The Filezilla FTP server creates a new log file every day, named:

fzs-YYYY-MM-DD.log

so my monitored directory, mounted on the Splunk server, contains files like:

fzs-2012-07-23.log
fzs-2012-07-22.log
fzs-2012-07-21.log
fzs-2012-07-20.log
...
..

My "/splunk/splunk/etc/apps/search/local/inputs.conf" file contains the following section :

[monitor:///mnt/s-ftpde-01/filezilla/*.log]
disabled = false
followTail = 0
sourcetype = LOGS-FILEZILLA-SERVEUR
index = index_de_filezilla
host = s-ftpde-01.de.lan
host_segment =
crcSalt =
whitelist = [^/]*.log$

NB: What should my "crcSalt" parameter be set to in order to avoid re-indexing?
NB2: Should I use the "ignoreOlderThan" parameter too (with a value of 1 day)?
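
From the inputs.conf docs it looks like the literal value <SOURCE> is what adds the full path to the CRC, so I'm guessing something like this (untested; I'd also drop the empty host_segment and crcSalt lines):

  [monitor:///mnt/s-ftpde-01/filezilla/*.log]
  # other settings as above
  crcSalt = <SOURCE>
  ignoreOlderThan = 1d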

FG


fgilain
Engager

Same problem here... Filezilla FTP server logs are mounted from a remote Windows system onto the local Splunk server, and Splunk keeps indexing the same files continuously...

Get the "WatchedFile - Checksum for seekptr didn't match, will re-read entire file" in my log too.

FG


cvajs
Contributor

Check to make sure you don't have multiple sources set to the same path, etc.

Maybe turn to strace, inotify, or lsof?

"watch lsof /export/home/storeadm"

Your du says 3.6GB of gzipped data, but you say it's indexing 20GB. Can you verify the total uncompressed size with "gzip -l *" in that dir?

Seems like others are also having a re-indexing problem; see http://splunk-base.splunk.com/answers/43076/why-are-my-logfiles-re-indexing-due-to-a-failed-seekptr-...

cvajs
Contributor

Do those .gz files change at all over time?

nathanh42
Explorer

Check to make sure you don't have multiple sources set to the same path, etc.

There's only a single source. Confirmed with the command "splunk list monitor".

Your du says 3.6GB of gzipped data, but you say it's indexing 20GB. Can you verify the total uncompressed size with "gzip -l *" in that dir?

Uncompressed size is 37GB.

  3915683331 37491057664 89.6% (totals)

However, the Splunk logs show multiple "reading path" statements for the same files. If it were only 37GB, I could live with that. The problem is that it keeps going back to files it has already indexed and indexing them again!
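
The same count can also be pulled from inside Splunk over the internal index, assuming the forwarder's _internal logs are reaching the indexer (they are by default):

  index=_internal sourcetype=splunkd component=ArchiveProcessor "reading path"
  | rex "path=(?<path>\S+)"
  | stats count AS reads by path
  | where reads > 1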
