We are currently running an evaluation of Splunk. Our current environment consists of one indexer and 105 Windows servers that have the Splunk Forwarder installed. All Forwarders use the same configuration.
All servers (that have a Forwarder installed) run Windows Server 2008 R2. On most systems, the Forwarder uses few resources, while on others, CPU usage constantly spikes to 100%.
It seems that only small servers are impacted by this. By small I mean one (virtual) CPU and not a lot of system activity. I have used Procmon to analyse what's going on and to compare the splunkd.exe process on a busy system (where it runs fine) and on a small system (where it uses a lot of CPU).
During the CPU spikes, Procmon shows a lot of QueryDirectory operations on the systems that have this issue. The directory being queried is the one in the monitor stanza. The operation occurs roughly 150,000 times on troubled systems compared to roughly 300 times on systems that run fine (over roughly the same monitoring period).
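For anyone who wants to reproduce this comparison, the counts can be taken from a Procmon trace saved as CSV. The following is only a minimal sketch: it assumes the export uses Procmon's usual column headers ("Process Name", "Operation", "Path"), and the function name and the inline sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_operations(csv_file, operation="QueryDirectory", process="splunkd.exe"):
    """Count how often `process` performed `operation` on each path.

    `csv_file` is any open text file or file-like object holding a Procmon
    CSV export (column names are assumed, see the note above).
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        same_process = row.get("Process Name", "").lower() == process.lower()
        if same_process and row.get("Operation") == operation:
            counts[row.get("Path", "")] += 1
    return counts


# Hypothetical sample rows, only here to make the sketch self-contained.
SAMPLE = r'''
"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"10:00:00.0000001","splunkd.exe","1234","QueryDirectory","D:\path\software name\logs","SUCCESS",""
"10:00:00.0000002","splunkd.exe","1234","QueryDirectory","D:\path\software name\logs","NO MORE FILES",""
"10:00:00.0000003","splunkd.exe","1234","CreateFile","D:\path\software name\logs","SUCCESS",""
"10:00:00.0000004","other.exe","4321","QueryDirectory","D:\path\software name\logs","SUCCESS",""
'''.strip()

if __name__ == "__main__":
    for path, hits in count_operations(io.StringIO(SAMPLE)).most_common():
        print(hits, path)
```

With a real export, open the file instead of using the sample, for example `open("trace.csv", newline="", encoding="utf-8-sig")` (the file name is a placeholder, and the encoding is an assumption about how the trace was saved). Running it against a trace from a spiking system and one from a healthy system gives the per-directory counts side by side.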
The configuration is the same, and the forwarders were all installed the same way, using the command line. What could cause the Forwarder to query that directory so often and use so much CPU?
You could try logging a support case and capturing a procdump as per:
https://answers.splunk.com/answers/5400/high-cpu-usage-on-splunk-forwarder.html
Also worth checking:
-Does your input stanza use a wildcard like * or ... (you said this was non-recursive)?
-Have you got AV on these systems (if so, what exclusions are in place)?
-What commonalities exist between the spiking systems versus the behaving ones?
The input stanza that seems to be causing the issue is the following one (I have "anonymized" the settings):
[monitor://D:\path\software name\logs]
disabled = false
index = index_name
sourcetype = sourcetype_name
whitelist = ^.*regex.*expression.*\.log$
recursive = false
So the monitor path itself does not contain ... or * but the whitelist option does.
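To illustrate how a whitelist like this one selects files, here is a rough sketch. It uses the anonymized placeholder pattern from the stanza above and assumes the whitelist is applied as a regular expression against the full file path. The file names are hypothetical, and Python's `re` module only approximates the PCRE engine Splunk uses.

```python
import re

# Anonymized placeholder pattern copied from the stanza above.
WHITELIST = re.compile(r"^.*regex.*expression.*\.log$")


def is_whitelisted(full_path):
    """Return True if the full path matches the whitelist pattern."""
    return WHITELIST.search(full_path) is not None


# Hypothetical file names inside the monitored directory.
CANDIDATES = [
    r"D:\path\software name\logs\regex_expression_1.log",
    r"D:\path\software name\logs\regex_expression_1.log.bak",
    r"D:\path\software name\logs\other.log",
]

if __name__ == "__main__":
    for path in CANDIDATES:
        print(is_whitelisted(path), path)
```

In this sketch only the first path matches. The whitelist decides which files are read, but finding those files presumably still requires listing the whole directory, which would fit the QueryDirectory operations seen in Procmon.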
I'm currently still waiting on the antivirus team to whitelist all Splunk processes, but the missing exclusions do not seem to cause issues on the other systems.
The spiking systems and the behaving ones all had the forwarder installed on the same day, in the same way. They also all use the same server class / deployed apps. The only difference is that the spiking systems have one CPU, have a lot less activity, and generate far fewer logs.
I'll generate a procdump during a spike.