Has anyone ever run into a situation where the forwarder opened hundreds of sockets on a system?
Here is what we have configured in the system's outputs.conf:
[tcpout]
maxQueueSize = auto
forwardedindex.0.whitelist = .*
forwardedindex.1.blacklist = _.*
forwardedindex.2.whitelist = (_audit|_internal|_introspection)
forwardedindex.filter.disable = false
indexAndForward = false
autoLBFrequency = 30
blockOnCloning = true
compressed = false
disabled = false
dropClonedEventsOnQueueFull = 5
dropEventsOnQueueFull = -1
heartbeatFrequency = 30
maxFailuresPerInterval = 2
secsInFailureInterval = 1
maxConnectionsPerIndexer = 2
forceTimebasedAutoLB = false
sendCookedData = true
connectionTimeout = 20
readTimeout = 300
writeTimeout = 300
tcpSendBufSz = 0
useACK = false
blockWarnThreshold = 100
sslQuietShutdown = false
[syslog]
type = udp
priority = <13>
dropEventsOnQueueFull = -1
maxEventSize = 1024
Admittedly, the admin before me set this forwarder up to pull logs from a directory containing over 2,000 log files that are updated regularly. To elaborate, they are all flat files that contain metric data from an application performance monitoring tool.
I know this is poor practice and I am working to correct it moving forward, but I wanted to use this as a learning opportunity. Would consuming a directory containing thousands of log files cause a forwarder to do something like that?
They were all stuck in a CLOSE_WAIT state as well:
tcp 1294 0 myappserver:tproxy indexerServerOne:25022 CLOSE_WAIT
tcp 1218 0 myappserver:tproxy indexerServerOne:25023 CLOSE_WAIT
tcp 1246 0 myappserver:tproxy indexerServerOne:25020 CLOSE_WAIT
tcp 1269 0 myappserver:tproxy indexerServerOne:25021 CLOSE_WAIT
tcp 1218 0 myappserver:tproxy indexerServerOne:icl-twobase9 CLOSE_WAIT
tcp 1207 0 myappserver:tproxy indexerServerOne:icl-twobase3 CLOSE_WAIT
tcp 1226 0 myappserver:tproxy indexerServerOne:icl-twobase4 CLOSE_WAIT
tcp 1269 0 myappserver:tproxy indexerServerOne:icl-twobase7 CLOSE_WAIT
tcp 1294 0 myappserver:tproxy indexerServerOne:icl-twobase8 CLOSE_WAIT
tcp 1194 0 myappserver:tproxy indexerServerOne:icl-twobase5 CLOSE_WAIT
tcp 1246 0 myappserver:tproxy indexerServerOne:icl-twobase6 CLOSE_WAIT
tcp 1218 0 myappserver:tproxy indexerServerOne:24992 CLOSE_WAIT
tcp 1207 0 myappserver:tproxy indexerServerOne:24986 CLOSE_WAIT
tcp 1226 0 myappserver:tproxy indexerServerOne:24987 CLOSE_WAIT
tcp 1269 0 myappserver:tproxy indexerServerOne:24990 CLOSE_WAIT
tcp 1294 0 myappserver:tproxy indexerServerOne:24991 CLOSE_WAIT
tcp 1194 0 myappserver:tproxy indexerServerOne:24988 CLOSE_WAIT
tcp 1246 0 myappserver:tproxy indexerServerOne:24989 CLOSE_WAIT
tcp 1294 0 myappserver:tproxy indexerServerOne:24978 CLOSE_WAIT
tcp 1218 0 myappserver:tproxy indexerServerOne:24979 CLOSE_WAIT
tcp 1246 0 myappserver:tproxy indexerServerOne:24976 CLOSE_WAIT
tcp 1269 0 myappserver:tproxy indexerServerOne:24977 CLOSE_WAIT
tcp 1226 0 myappserver:tproxy indexerServerOne:24974 CLOSE_WAIT
tcp 1194 0 myappserver:tproxy indexerServerOne:24975 CLOSE_WAIT
tcp 1207 0 myappserver:tproxy indexerServerOne:24973 CLOSE_WAIT
tcp 1269 0 myappserver:tproxy indexerServerOne:24962 CLOSE_WAIT
tcp 1294 0 myappserver:tproxy indexerServerOne:24960 CLOSE_WAIT
tcp 1218 0 myappserver:tproxy indexerServerOne:24961 CLOSE_WAIT
One thing we are trying these days is reducing the time_before_close parameter in inputs.conf. From the spec:
-- * Tells the system not to close files that have been updated in the past <integer> seconds. Defaults to 3.
We lowered it to 1.
In your case, such a change can significantly reduce the number of open files.
The inputs.conf spec explains it in more detail.
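For reference, here is a rough sketch of what that looks like in a monitor stanza (the path and sourcetype are placeholders, not your actual config):
[monitor:///opt/apm/metrics]
sourcetype = apm_metrics
# close file handles one second after the last write instead of three
time_before_close = 1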
@lguinn said in "Single forwarder to multiple indexers load balancing":
-- One of the things that can bog down a forwarder is monitoring "too many" files. I don't know if there is an absolute limit to the number of files that can be monitored, but my experience says that monitoring over 5000 files can bog down a forwarder, causing it to consume lots of memory and CPU and to run poorly. In this case, it might make sense to "split flows" by having two Splunk forwarder processes running - each monitoring a non-overlapping part of the log files. However, the best solution is often to simply move inactive log files from the monitored directory. The fewer files that Splunk must monitor, the better for performance.
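If physically moving the inactive files isn't practical right away, here is a sketch of how you might approximate it in inputs.conf, assuming the inactive files can be identified by age or by name (the path, age threshold, and pattern below are placeholders):
[monitor:///opt/apm/metrics]
# skip files that have not been modified in the last 7 days
ignoreOlderThan = 7d
# skip rolled or archived files by name (regex applied to the full path)
blacklist = \.(gz|zip|\d+)$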
Perfect, this is great information! Although this would only be a band-aid for the poor practice of monitoring a directory of thousands of files, it certainly can help. We are also in a clustered environment, so the second point makes a good deal of sense too. Thanks a bunch for this information.
No, that doesn't look like the behavior of a forwarder, even one monitoring a lot of files. It looks more like indexerServerOne is connecting to myappserver on tcp/8081: tproxy is the service name for port 8081, and it is the local port on every one of those sockets while the remote ports vary.
Consider running netstat -tnlp on myappserver to see what process is listening on that port. Perhaps it's a JMX port and the indexer has an app configured to poll it?
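For example, something along these lines (ss syntax varies slightly by version, and the PID is a placeholder):
netstat -tnlp | grep ':8081'
# or, with ss:
ss -tlnp 'sport = :8081'
# once you have the PID, count its open descriptors (1234 is a placeholder)
ls /proc/1234/fd | wc -l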
Adding my 2 cents here:
Check this answer https://answers.splunk.com/answers/114447/splunk-to-splunk-communication-stuck-in-close-wait.html
cheers, MuS
This is great!! Thanks
Thanks for the really fast response. Unfortunately I had to stop the forwarder, as it brought down our production server (apps were getting 'too many open files' errors across the board). Once I stopped the forwarder, all the connections finalized and cleared out.
This has happened in our test environment as well, but at the time I didn't correlate it to the forwarder. So I will be sure to run the additional netstat options if it happens again.
Good to know that isn't 'normal' behavior.
Sorry to hear about the downtime. If you're running an OS like CentOS 7, I think the defaults are something like a soft limit of 1024 and a hard limit of 4096 open files. It's possible that you're running dangerously close to the limit even without the forwarder running. I imagine it would be particularly risky to run the forwarder under the same account as the app if you're near the limit. Maybe check the number of open files without the forwarder to see how close you are.
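A rough way to check, assuming the forwarder and the app share an account (the user name below is a placeholder):
ulimit -Sn; ulimit -Hn          # soft and hard open-file limits for the current shell/user
lsof -u appuser | wc -l         # approximate count of open files for a (hypothetical) appuser account
cat /proc/<pid>/limits          # effective limits for a specific process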
You certainly got me looking in an interesting direction, @jtacy. It looks like the system has a ulimit set to 'unlimited', but after looking in /etc/security/limits.conf, it appears they never increased the per-user counts, so both root and our app user are set to 4096.
That sounds low to me when you factor in that our application performance monitoring tool is probably opening those flat files from time to time to write data, plus the app itself, the forwarder, and so on.
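If we do raise it, I'm thinking something along these lines in /etc/security/limits.conf (the user name and values are placeholders I'd still need to settle on):
splunk  soft  nofile  8192
splunk  hard  nofile  65536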
I hate to admit it, but the forwarder was running as root, which is something I am going back to remediate now that I own the Splunk infrastructure :(
None of the apps on the server run as root, so I wouldn't think root would be getting close to the limit.