Deployment Architecture

Linux Forwarder Opened Hundreds of Sockets

paimonsoror
Builder

Has anyone ever run into a situation where the forwarder opened hundreds of sockets on a system?

Here is what we have configured in the system's outputs.conf:

[tcpout]
maxQueueSize = auto
forwardedindex.0.whitelist = .*
forwardedindex.1.blacklist = _.*
forwardedindex.2.whitelist = (_audit|_internal|_introspection)
forwardedindex.filter.disable = false
indexAndForward = false
autoLBFrequency = 30
blockOnCloning = true
compressed = false
disabled = false
dropClonedEventsOnQueueFull = 5
dropEventsOnQueueFull = -1
heartbeatFrequency = 30
maxFailuresPerInterval = 2
secsInFailureInterval = 1
maxConnectionsPerIndexer = 2
forceTimebasedAutoLB = false
sendCookedData = true
connectionTimeout = 20
readTimeout = 300
writeTimeout = 300
tcpSendBufSz = 0
useACK = false
blockWarnThreshold = 100
sslQuietShutdown = false

[syslog]
type = udp
priority = <13>
dropEventsOnQueueFull = -1
maxEventSize = 1024

Admittedly, the admin before me set the forwarder up to pull logs from a directory that holds over 2,000 log files, all updated regularly. To elaborate, they are all flat files containing metric data from an application performance monitoring tool.

I know this is poor practice and I am working to correct it moving forward, but I wanted to use this as a learning opportunity. Could consuming a directory containing thousands of log files cause a forwarder to behave like that?

They were all stuck in the CLOSE_WAIT state as well:

tcp     1294      0 myappserver:tproxy         indexerServerOne:25022 CLOSE_WAIT
tcp     1218      0 myappserver:tproxy         indexerServerOne:25023 CLOSE_WAIT
tcp     1246      0 myappserver:tproxy         indexerServerOne:25020 CLOSE_WAIT
tcp     1269      0 myappserver:tproxy         indexerServerOne:25021 CLOSE_WAIT
tcp     1218      0 myappserver:tproxy         indexerServerOne:icl-twobase9 CLOSE_WAIT
tcp     1207      0 myappserver:tproxy         indexerServerOne:icl-twobase3 CLOSE_WAIT
tcp     1226      0 myappserver:tproxy         indexerServerOne:icl-twobase4 CLOSE_WAIT
tcp     1269      0 myappserver:tproxy         indexerServerOne:icl-twobase7 CLOSE_WAIT
tcp     1294      0 myappserver:tproxy         indexerServerOne:icl-twobase8 CLOSE_WAIT
tcp     1194      0 myappserver:tproxy         indexerServerOne:icl-twobase5 CLOSE_WAIT
tcp     1246      0 myappserver:tproxy         indexerServerOne:icl-twobase6 CLOSE_WAIT
tcp     1218      0 myappserver:tproxy         indexerServerOne:24992 CLOSE_WAIT
tcp     1207      0 myappserver:tproxy         indexerServerOne:24986 CLOSE_WAIT
tcp     1226      0 myappserver:tproxy         indexerServerOne:24987 CLOSE_WAIT
tcp     1269      0 myappserver:tproxy         indexerServerOne:24990 CLOSE_WAIT
tcp     1294      0 myappserver:tproxy         indexerServerOne:24991 CLOSE_WAIT
tcp     1194      0 myappserver:tproxy         indexerServerOne:24988 CLOSE_WAIT
tcp     1246      0 myappserver:tproxy         indexerServerOne:24989 CLOSE_WAIT
tcp     1294      0 myappserver:tproxy         indexerServerOne:24978 CLOSE_WAIT
tcp     1218      0 myappserver:tproxy         indexerServerOne:24979 CLOSE_WAIT
tcp     1246      0 myappserver:tproxy         indexerServerOne:24976 CLOSE_WAIT
tcp     1269      0 myappserver:tproxy         indexerServerOne:24977 CLOSE_WAIT
tcp     1226      0 myappserver:tproxy         indexerServerOne:24974 CLOSE_WAIT
tcp     1194      0 myappserver:tproxy         indexerServerOne:24975 CLOSE_WAIT
tcp     1207      0 myappserver:tproxy         indexerServerOne:24973 CLOSE_WAIT
tcp     1269      0 myappserver:tproxy         indexerServerOne:24962 CLOSE_WAIT
tcp     1294      0 myappserver:tproxy         indexerServerOne:24960 CLOSE_WAIT
tcp     1218      0 myappserver:tproxy         indexerServerOne:24961 CLOSE_WAIT
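
In case it helps, this is roughly how I was counting the stuck sockets and the descriptors held by splunkd (standard Linux commands; run as root or as the splunkd user so -p and /proc are readable):

# Count sockets stuck in CLOSE_WAIT
netstat -tn | grep -c CLOSE_WAIT

# Same view with the owning process (requires root for -p)
netstat -tnp | grep CLOSE_WAIT | head

# Rough count of file descriptors held by splunkd
ls /proc/$(pgrep -o splunkd)/fd | wc -l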

ddrillic
Ultra Champion

One thing we are trying these days is to reduce the time_before_close parameter.

-- * Tells the system not to close files that have been updated in the past <integer> seconds. Defaults to 3.

We lowered it to 1.

In your case, such a change can significantly reduce the number of open files.
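
As a rough sketch, something like this in the monitor stanza should apply it (the path, index, and sourcetype below are only placeholders for your APM metrics input):

[monitor:///opt/apm/metrics]
index = apm_metrics
sourcetype = apm:metrics
# close idle file handles after 1 second instead of the default 3
time_before_close = 1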

The following explains it - inputs.conf

@lguinn said in "Single forwarder to multiple indexers load balancing":

-- One of the things that can bog down a forwarder is monitoring "too many" files. I don't know if there is an absolute limit to the number of files that can be monitored, but my experience says that monitoring over 5000 files can bog down a forwarder, causing it to consume lots of memory and CPU and to run poorly. In this case, it might make sense to "split flows" by having two Splunk forwarder processes running - each monitoring a non-overlapping part of the log files. However, the best solution is often to simply move inactive log files from the monitored directory. The fewer files that Splunk must monitor, the better for performance.
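
If physically moving inactive files out of the directory is not practical, a config-level approximation (my suggestion, not from @lguinn's post) is ignoreOlderThan in the same monitor stanza, which skips files whose modtime falls outside the window; read the inputs.conf spec for its caveats before relying on it:

[monitor:///opt/apm/metrics]
# skip files not modified in the last 2 days (placeholder window)
ignoreOlderThan = 2d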

paimonsoror
Builder

Perfect! This is great information. Although this would only band-aid the poor practice of monitoring a directory of thousands of files, it certainly can help. We are also in a clustered environment, so the second point makes a good deal of sense too. Thanks a bunch.


jtacy
Builder

No, that doesn't look like the behavior of a forwarder, even one monitoring a lot of files. It looks more like indexerServerOne is connecting to myappserver on tcp/8081 (the port netstat is displaying by its /etc/services name, "tproxy").

Consider running netstat -tnlp on myappserver to see which process is listening on that port. Perhaps it's a JMX port and the indexer has an app configured to poll it?
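
Something along these lines should confirm the port mapping and the owning processes (flags can vary slightly by distro, and -p needs root):

# Confirm what "tproxy" resolves to on this host
grep -w tproxy /etc/services

# Which process is listening on 8081?
netstat -tnlp | grep ':8081 '

# Which process owns the CLOSE_WAIT sockets?
netstat -tnp | grep CLOSE_WAIT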

MuS
SplunkTrust

paimonsoror
Builder

This is great!! Thanks


paimonsoror
Builder

Thanks for the really fast response. Unfortunately I had to stop the forwarder, as it brought down our production server (apps were getting 'too many open files' errors across the board). Once I stopped the forwarder, all the connections finalized and cleared out.

This has happened in our test environment as well, but at the time I didn't correlate it to the forwarder. So I will be sure to run the additional netstat options if it happens again.

Good to know that isn't 'normal' behavior.


jtacy
Builder

Sorry to hear about the downtime. If you're running an OS like CentOS 7, I think you'll have a default per-process limit of something like 1024 open files (soft) and 4096 (hard). It's possible that you're running dangerously close to the limit even without the forwarder running. I imagine it would be particularly risky to run the forwarder under the same account as the app if you're near the limit. Maybe check the number of open files without the forwarder to see how close you are.
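
A quick, rough way to check (substitute the forwarder's or the app's PID as needed):

# Effective limits of a running process
grep 'open files' /proc/$(pgrep -o splunkd)/limits

# How many descriptors it currently holds
ls /proc/$(pgrep -o splunkd)/fd | wc -l

# Limits for your current shell/user
ulimit -Sn; ulimit -Hn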


paimonsoror
Builder

You certainly got me looking in an interesting direction, @jtacy ... Looks like the system has a ulimit set to 'unlimited', but after looking in /etc/security/limits.conf, it seems they never increased the per-user limits, so both root and our app user are set to 4096.

That sounds low to me when you factor in the application monitoring tool periodically opening those flat files to write data, the app itself, the forwarder, etc.
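
If we do raise it, I assume it would look something like this in /etc/security/limits.conf (the 'splunk' account name and the 65536 value are placeholders; Splunk's docs suggest a much higher nofile than the distro default):

splunk    soft    nofile    65536
splunk    hard    nofile    65536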


paimonsoror
Builder

I hate to admit it, but the forwarder was running as root, which is something I am going back to remediate now that I own the Splunk infrastructure :(.

None of the apps on the server run as root, so I wouldn't think that root would be getting close to the limit.
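
For the remediation itself, my understanding is that it's roughly this sequence on the forwarder host (the default install path and a dedicated 'splunk' account are assumptions):

/opt/splunkforwarder/bin/splunk stop
# give the dedicated account ownership of the install and its data
chown -R splunk:splunk /opt/splunkforwarder
# re-register boot-start so splunkd launches as that account
/opt/splunkforwarder/bin/splunk enable boot-start -user splunk
/opt/splunkforwarder/bin/splunk start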
