Solved: Splunk reindexing files when using remote shared f...

abonuccelli_spl · ‎04-08-2014

Hi,

I've mounted some NFS and nfs locally to Splunk some files I want to monitor remotely.

Problem is files are continuously re-indexed and I am getting duplicated events.

In the logs I can see this is continuously re-indexed:

02-21-2014 14:16:17.914 +0200 INFO  WatchedFile - Will begin reading at offset=0 for file='/mnt/nfs-mounted-share/app1/logs/main.log'.
02-21-2014 14:16:38.557 +0200 INFO  BatchReader - Removed from queue file='/mnt/nfs-mounted-share/app1/logs/main.log'.
02-21-2014 14:23:52.077 +0200 INFO  WatchedFile - Checksum for seekptr didn't match, will re-read entire file='/mnt/nfs-mounted-share/app1/logs/main.log'.
02-21-2014 14:23:52.078 +0200 INFO  WatchedFile - Will begin reading at offset=0 for file='/mnt/nfs-mounted-share/app1/logs/main.log'.
02-21-2014 14:24:11.745 +0200 INFO  BatchReader - Removed from queue file='/mnt/nfs-mounted-share/app1/logs/main.log'.
02-21-2014 15:01:17.065 +0200 INFO  WatchedFile - Checksum for seekptr didn't match, will re-read entire file='/mnt/nfs-mounted-share/app1/logs/main.log'.
02-21-2014 15:01:17.065 +0200 INFO  WatchedFile - Will begin reading at offset=0 for file='/mnt/nfs-mounted-share/app1/logs/main.log'.
02-21-2014 15:01:37.700 +0200 INFO  BatchReader - Removed from queue file='/mnt/nfs-mounted-share/app1/logs/main.log'.
02-21-2014 15:07:51.908 +0200 INFO  WatchedFile - Checksum for seekptr didn't match, will re-read entire file='/mnt/nfs-mounted-share/app1/logs/main.log'.
02-21-2014 15:07:51.908 +0200 INFO  WatchedFile - Will begin reading at offset=0 for file='/mnt/nfs-mounted-share/app1/logs/main.log'.
02-21-2014 15:08:12.253 +0200 INFO  BatchReader - Removed from queue file='/mnt/nfs-mounted-share/app1/logs/main.log'.
02-21-2014 15:13:41.283 +0200 INFO  WatchedFile - Checksum for seekptr didn't match, will re-read entire file='/mnt/nfs-mounted-share/app1/logs/main.log'.
02-21-2014 15:13:41.284 +0200 INFO  WatchedFile - Will begin reading at offset=0 for file='/mnt/nfs-mounted-share/app1/logs/main.log'.

By typing 'mount' I can see /mnt/nfs-mounted-share is mounted with options:

//192.168.123.234/logs on /mnt/nfs-mounted-share/app1/logs type nfs (rw,relatime,username=john,uid=0,noforceuid,gid=0,noforcegid,addr=192.168.123.234,file_mode=0755,dir_mode=0755,nounix,rsize=61440,wsize=65536,actimeo=1)

I went on the server where the file is hosted and verified the file has't rotated!
(4ff7ceef43b51e85487a8f6bf6a00062 == 4ff7ceef43b51e85487a8f6bf6a00062)

$ while true; do date; echo 'InitCrc:'; head -1 /opt/app1/logs/main.log|md5sum; echo 'SeekCrc:'; tail -1 /opt/app1/logs/main.log | md5sum; echo ''; sleep 60; done
Tue Apr  8 15:00:48 BST 2014
InitCrc:
4ff7ceef43b51e85487a8f6bf6a00062  -
SeekCrc:
90e3811d7d2a94d16d1b43a6b470f407  -

Tue Apr  8 15:01:48 BST 2014
InitCrc:
4ff7ceef43b51e85487a8f6bf6a00062  -
SeekCrc:
b3837abba7bbda9c912b6c16fef20f23  -

Tue Apr  8 15:02:48 BST 2014
InitCrc:
4ff7ceef43b51e85487a8f6bf6a00062  -
SeekCrc:
dcffef3590881f5fbad551869d2d140d  -

Tue Apr  8 15:03:48 BST 2014
InitCrc:
4ff7ceef43b51e85487a8f6bf6a00062  -
SeekCrc:
87d93c46410f4dbf900e58ae28370983  -

Tue Apr  8 15:04:48 BST 2014
InitCrc:
4ff7ceef43b51e85487a8f6bf6a00062  -
SeekCrc:
6e90a42e914f3a07b967de219f26864b  -

Tue Apr  8 15:05:48 BST 2014
InitCrc:
4ff7ceef43b51e85487a8f6bf6a00062  -
SeekCrc:
6e90a42e914f3a07b967de219f26864b  -

Same thing is happening on another setup with CIFS mount!
What is happening?

abonuccelli_spl · ‎04-08-2014

Most likely filesystem attributes are caching file attributes which will tamper Splunk ability to detect actual CRCs values for initptr/seekptr/modtime.
In a healthy scenario, SeekPtr(scrc) matches what Splunk has saved in fishbucket:

"03-24-2014 17:08:05.226 +0000 DEBUG WatchedFile - seeking /mnt/nfs-mounted-share/app1/logs/main.log to off=37624"
"03-24-2014 17:08:05.227 +0000 DEBUG WatchedFile - Reached EOF: fname=/mnt/nfs-mounted-share/app1/logs/main.log fishstate=key=0xa8ad07cc1926ac37 **sptr**=37722 **scrc**=0x9a0b880ed85b3357 **fnamecrc**=0x794111686abf86a7 **modtime**=1395680878"
"03-24-2014 17:08:08.221 +0000 DEBUG TailingProcessor - Deferred notification for path='/mnt/nfs-mounted-share/app1/logs/main.log'."
"03-24-2014 17:08:08.221 +0000 DEBUG TailingProcessor -   Will attempt to read file: /mnt/nfs-mounted-share/app1/logs/main.log from existing fd."
"03-24-2014 17:08:08.226 +0000 DEBUG TailingProcessor - About to read data (Reusing existing fd for file='/mnt/nfs-mounted-share/app1/logs/main.log')."

In a fs-cached scenario, what happens is:

Application writes file until offset 100
Splunk reads the file until offset 100 – calculates and save X=MD5(seekptr(100))
Application writes another 50 bytes – last offset becomes 150
Splunk reads the file and NFS/CIFS tells that last offset is still 100 (cached information), no change
Splunk reads the file again, NFS has refreshed attributes (either server or client sied) and now tells Splunk last offset is 150.
Splunk compares X which is different from MD5(seekptr(150)),says:

"INFO WatchedFile - Checksum for seekptr didn't match, will re-read entire file"

and re-index the whole file.
Options specified in /etc/fstab are often ignored or not honoured depending on what version of various related components are in use.

The correct way to check what options are being used is to run mount on client side

$ mount | egrep logs
nfsserver:/logs on /mnt/nfs-mounted-share/logs type nfs (rw,noexec,nosuid,nodev,noac,vers=4,addr=192.168.32.23,clientaddr=0.0.0.0)

when using NFS,
Options like actimeo=1 should not be used clientside:

   acregmin=n     The minimum time (in seconds) that the NFS client caches
                  attributes  of  a  regular file before it requests fresh
                  attribute information from a server.  If this option  is
                  not specified, the NFS client uses a 3-second minimum.

   acregmax=n     The maximum time (in seconds) that the NFS client caches
                  attributes of a regular file before  it  requests  fresh
                  attribute  information from a server.  If this option is
                  not specified, the NFS client uses a 60-second  maximum.

   acdirmin=n     The minimum time (in seconds) that the NFS client caches
                  attributes of  a  directory  before  it  requests  fresh
                  attribute  information from a server.  If this option is
                  not specified, the NFS client uses a 30-second  minimum.

   acdirmax=n     The maximum time (in seconds) that the NFS client caches
                  attributes of  a  directory  before  it  requests  fresh
                  attribute  information from a server.  If this option is
                  not specified, the NFS client uses a 60-second  maximum.

   actimeo=n      Using  actimeo sets all of acregmin, acregmax, acdirmin,
                  and acdirmax to the same value.  If this option  is  not
                  specified,  the NFS client uses the defaults for each of
                  these options listed above.

http://manpages.ubuntu.com/manpages/hardy/man5/nfs.5.html

To disable caching nfs clientside 'noac' option instead should be enforced.

View solution in original post

abonuccelli_spl · ‎04-08-2014

Most likely filesystem attributes are caching file attributes which will tamper Splunk ability to detect actual CRCs values for initptr/seekptr/modtime.
In a healthy scenario, SeekPtr(scrc) matches what Splunk has saved in fishbucket:

"03-24-2014 17:08:05.226 +0000 DEBUG WatchedFile - seeking /mnt/nfs-mounted-share/app1/logs/main.log to off=37624"
"03-24-2014 17:08:05.227 +0000 DEBUG WatchedFile - Reached EOF: fname=/mnt/nfs-mounted-share/app1/logs/main.log fishstate=key=0xa8ad07cc1926ac37 **sptr**=37722 **scrc**=0x9a0b880ed85b3357 **fnamecrc**=0x794111686abf86a7 **modtime**=1395680878"
"03-24-2014 17:08:08.221 +0000 DEBUG TailingProcessor - Deferred notification for path='/mnt/nfs-mounted-share/app1/logs/main.log'."
"03-24-2014 17:08:08.221 +0000 DEBUG TailingProcessor -   Will attempt to read file: /mnt/nfs-mounted-share/app1/logs/main.log from existing fd."
"03-24-2014 17:08:08.226 +0000 DEBUG TailingProcessor - About to read data (Reusing existing fd for file='/mnt/nfs-mounted-share/app1/logs/main.log')."

In a fs-cached scenario, what happens is:

Application writes file until offset 100
Splunk reads the file until offset 100 – calculates and save X=MD5(seekptr(100))
Application writes another 50 bytes – last offset becomes 150
Splunk reads the file and NFS/CIFS tells that last offset is still 100 (cached information), no change
Splunk reads the file again, NFS has refreshed attributes (either server or client sied) and now tells Splunk last offset is 150.
Splunk compares X which is different from MD5(seekptr(150)),says:

"INFO WatchedFile - Checksum for seekptr didn't match, will re-read entire file"

and re-index the whole file.
Options specified in /etc/fstab are often ignored or not honoured depending on what version of various related components are in use.

The correct way to check what options are being used is to run mount on client side

$ mount | egrep logs
nfsserver:/logs on /mnt/nfs-mounted-share/logs type nfs (rw,noexec,nosuid,nodev,noac,vers=4,addr=192.168.32.23,clientaddr=0.0.0.0)

when using NFS,
Options like actimeo=1 should not be used clientside:

   acregmin=n     The minimum time (in seconds) that the NFS client caches
                  attributes  of  a  regular file before it requests fresh
                  attribute information from a server.  If this option  is
                  not specified, the NFS client uses a 3-second minimum.

   acregmax=n     The maximum time (in seconds) that the NFS client caches
                  attributes of a regular file before  it  requests  fresh
                  attribute  information from a server.  If this option is
                  not specified, the NFS client uses a 60-second  maximum.

   acdirmin=n     The minimum time (in seconds) that the NFS client caches
                  attributes of  a  directory  before  it  requests  fresh
                  attribute  information from a server.  If this option is
                  not specified, the NFS client uses a 30-second  minimum.

   acdirmax=n     The maximum time (in seconds) that the NFS client caches
                  attributes of  a  directory  before  it  requests  fresh
                  attribute  information from a server.  If this option is
                  not specified, the NFS client uses a 60-second  maximum.

   actimeo=n      Using  actimeo sets all of acregmin, acregmax, acdirmin,
                  and acdirmax to the same value.  If this option  is  not
                  specified,  the NFS client uses the defaults for each of
                  these options listed above.

http://manpages.ubuntu.com/manpages/hardy/man5/nfs.5.html

To disable caching nfs clientside 'noac' option instead should be enforced.

nmohammed · ‎04-07-2016

I am running in the same issue. But we have the logs written on the CIFS share which are mounted onto Linux servers. I am running Splunk UF on these servers, that are monitoring the logs from this mount. we are seeing a great delay in the events being searchable (~3 hours delay until the data is searchable)

Not sure where the delay is happening, either on Splunk UF monitoring the files or Indexing by indexers. can you please advise what can be done in this scenario?

CIFS ---> Mount on Linux ---> Splunk UF ---> indexers ---> Data Searchable in SH

Splunk reindexing files when using remote shared filesystem

Announcing Scheduled Export GA for Dashboard Studio

Extending Observability Content to Splunk Cloud

More Control Over Your Monitoring Costs with Archived Metrics GA in US-AWS!