Input monitor scans too many files and causes Splunk indexing trouble

guilmxm
SplunkTrust

Hi,

I have to monitor specific files on an NFS share that itself contains thousands of files. This causes trouble for Splunk, which seems to scan every file on the share and then stops indexing other inputs.

The files I want to monitor can be reached at:

/mnt/MYMOUNT/logs/*/*/exploit/prod/RQ_TB_*.res

where the first wildcard represents the day (YYYY_MM_DD) and the second the hostname (XXXXXX, alphanumeric).

The files to monitor are named like RQ_TB_XXXXX.res (XXXXX being the server hostname).

If I set the monitor like this, Splunk starts scanning the share and reports thousands and thousands of files, which causes Splunk to stop indexing other monitors (some file descriptor limit, I guess?).

I've tried setting a whitelist regex such as:

whitelist = RQ[\_]TB[_][a-zA-Z0-9]\.(res)$

OR:

whitelist = \.(res)$

But Splunk still reports thousands of files in the Manager, where I should have around 700 files.
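
A side note on the first pattern: `[a-zA-Z0-9]` without a quantifier matches exactly one character, so a name like RQ_TB_XXXXX.res would not match it. A minimal sketch of a variant with a quantifier follows. It is untested against this share, and it only narrows which files are selected; it is not claimed to reduce how many directories Splunk has to walk.

```ini
# Hypothetical corrected form of the first whitelist attempt.
# "+" lets the hostname part span one or more alphanumeric characters;
# "[\_]" and "[_]" are simplified to a plain underscore.
whitelist = RQ_TB_[a-zA-Z0-9]+\.res$
```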

How can I prevent Splunk from scanning all the files when only a few should match?

This also seems to generate needless system load.

Thank you very much for any help!

0 Karma
1 Solution

guilmxm
SplunkTrust

For anyone interested in a similar case: I could not find a proper pure-Splunk answer to this problem.

I ended up creating an rsync mirror workflow that rsyncs the files I want to monitor from my NFS share, and then created the required Splunk inputs.

Works perfectly.
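
For illustration, the Splunk side of such a setup might look like the stanza below. The mirror directory /opt/nfs_mirror is a hypothetical name (the post does not say where the files were mirrored to), and the rsync job itself is not shown.

```ini
# Assumption: an external rsync job copies only the RQ_TB_*.res files
# from the NFS share into this directory (hypothetical path).
[monitor:///opt/nfs_mirror]
# The whitelist regex is matched against the full file path.
whitelist = RQ_TB_[^/]+\.res$
```

Because the mirror contains only the wanted files, Splunk no longer has to walk the full NFS tree to find them.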


0 Karma

lguinn2
Legend

First - do not ask a Splunk indexer to monitor this many files, even if you can't install a Universal Forwarder on the remote system. The indexer is already doing the actual indexing of data and responding to searches.

Second - if you can, use a separate system to collect and forward the data. This separate system could even be a virtual machine. Its only job will be to scan the remote filesystem and forward the correct files to the indexer(s). This system will need very good network access and a fair amount of CPU and memory - but it won't need great disk I/O. On this machine, install and configure the Universal Forwarder.

Third - If you can't use a separate system to collect and forward the data, you could still run a separate Universal Forwarder (on the same machine as the indexer) and use it to collect the data. I am not sure how much this will help, but it should improve things somewhat, especially if you follow the fourth suggestion below.

Fourth - Run a regular script to remove older files. Otherwise, Splunk will continue to monitor files that will never be written to again. (Because, how does Splunk know they won't be updated?) This is a complete waste of resources. If you cannot do this, you can also add this setting to your inputs.conf:

ignoreOlderThan = 14d

Note that once a file becomes "ignored", it will never be examined again, even if it is subsequently updated! So be sure to pick a reasonable date. This setting alone might solve your problem.
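
Placed in context, the setting goes inside the monitor stanza it should apply to. A minimal sketch, reusing the path from the question; 14d is simply the value suggested above, not a tuned recommendation:

```ini
[monitor:///mnt/MYMOUNT/logs/*/*/exploit/prod/RQ_TB_*.res]
# Skip files whose modification time is older than 14 days.
ignoreOlderThan = 14d
```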

Fifth - If needed, you can run multiple Universal Forwarders, and have each forwarder monitor a section of the directory structure. So the first Universal Forwarder could have inputs for:

[monitor:///mnt/MYMOUNT/logs/*/A*/exploit/prod/RQ_TB_*.res]   
[monitor:///mnt/MYMOUNT/logs/*/B*/exploit/prod/RQ_TB_*.res]
etc.

Anything you do to keep Splunk from traversing unnecessary files and directories will help.

guilmxm
SplunkTrust

Hi, thanks for your answer and suggestions.

As previously mentioned, we're currently in a dev environment, but in production we will dedicate Splunk instances to this kind of job.

Because we need to retrieve data from various databases, we plan to build an instance as a heavy forwarder.

I think your suggestion to ignore older files may help; I will test this and report back. (After a first full run, once the files have been indexed, it is useless to keep monitoring every file anyway.)

Note that files in this NFS share are purged periodically, but the share concentrates many logs from numerous systems.

0 Karma

mic1024
Path Finder

I had a similar issue: essentially there were hundreds of thousands of files (within directory structures) on the way to the files I was actually interested in indexing, which were at the very bottom of the structure. I used a UF, but it just couldn't handle it; there were too many files to scan.

I finally gave up and changed the directory structure so that the files I was interested in were saved to another location.

guilmxm
SplunkTrust

Thanks for sharing your experience.
In my case I have to adapt my own configuration to the customer's; changing the file structure won't be possible.

0 Karma

MuS
SplunkTrust

Hi guilmxm,

No chance of using a Universal Forwarder here?

When you specify wildcards in a file input path, Splunk creates an implicit whitelist for that stanza. The longest fully qualified path becomes the monitor stanza, and the wildcards are translated into regular expressions.
This means your whitelist is being clobbered by your use of * expressions in the monitor stanza.
Try using ... instead, and see the docs for more detailed information.
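
As a sketch of what this could look like, here are two alternative stanzas built from the path in the question (use one or the other, not both). Neither is tested on this share, and whether either reduces the number of files Splunk traverses is not established here.

```ini
# Alternative 1: "..." recurses through any number of subdirectory levels.
[monitor:///mnt/MYMOUNT/logs/.../exploit/prod/RQ_TB_*.res]

# Alternative 2: no wildcards in the path, so there is no implicit whitelist
# and the explicit one below is the only filter. The regex is matched
# against the full file path.
[monitor:///mnt/MYMOUNT/logs]
whitelist = /exploit/prod/RQ_TB_[^/]+\.res$
```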

cheers, MuS

guilmxm
SplunkTrust

I've installed and configured a new instance running as a heavy forwarder, where I will concentrate all inputs.

Of course I see the same behavior, with many files being scanned. As a workaround, I've changed some scripted inputs that were generating files (which Splunk was monitoring) to stream to stdout instead.
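
For readers unfamiliar with this kind of workaround, a scripted input that Splunk runs itself and reads from stdout might be declared roughly as follows. The app name, script path, interval, and sourcetype are illustrative assumptions, not values from this thread.

```ini
# Hypothetical scripted input: Splunk runs the script on the given interval
# and indexes whatever it writes to stdout, so no intermediate file
# needs to be monitored.
[script://$SPLUNK_HOME/etc/apps/my_app/bin/collect_stats.sh]
interval = 60
sourcetype = my_stats
disabled = 0
```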

0 Karma

guilmxm
SplunkTrust

Hi MuS,

Thank you for answering. Changing from * to ... did not change the number of files scanned by Splunk (or the time it takes to find all the required files).

This is a dev environment running a few single Splunk instances. I can't use a UF because I'm also indexing data retrieved from various databases using DB Connect, but I can concentrate the inputs on a heavy forwarder.

I thought part of the answer might lie in the whitelist / blacklist settings.

0 Karma