
Splunk to Splunk communication stuck in CLOSE_WAIT

rbal_splunk
Splunk Employee

A Splunk environment in one data center configured with multiple indexers became completely unresponsive to the data center's forwarders.
The indexers were confirmed to be running, but the ~2K forwarders could not connect consistently and events were being dropped.

splunkd.log on the forwarders reported: WARN TcpOutputProc - Cooked connection to ip=blah:port timed out

Examining the indexers turned up very few indications that anything was wrong, other than:
ERROR TcpInputProc - Error encountered for connection from src=blah:port. Broken pipe
ERROR TcpInputProc - Error encountered for connection from src=blah:port. Timeout

A telnet to TCP port 9997 from a forwarder host to the indexers does connect.
Using S.O.S. to examine the indexers revealed almost no load at all. The queues were empty and CPU use was minimal.
Each indexer's 'open files' ulimit, as reported in splunkd.log, was amply high (10K+).

A tcpdump shows repeated SYN packets to the indexers' port 9997, but mostly no replies.
Running netstat -an | grep 9997 | grep ESTABLISHED on the indexers showed an average of ~300 ESTABLISHED connections per indexer, with many hundreds more in CLOSE_WAIT.
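
For reference, the kind of commands behind those checks might look like the sketch below; the interface name (eth0) and the Linux netstat output format are assumptions, not details from this environment:

# Watch for inbound SYNs to the splunktcp port (interface name is an assumption)
tcpdump -nn -i eth0 'tcp port 9997 and (tcp[tcpflags] & tcp-syn != 0)'

# Summarize TCP connection states for port 9997 (state is the 6th column in Linux netstat -an output)
netstat -an | grep ':9997 ' | awk '{print $6}' | sort | uniq -c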

Restarting an indexer triggers a surge of ESTABLISHED connections, which very quickly (within ~2 minutes) drops back off to the low hundreds, with the majority of TCP connections back in CLOSE_WAIT.

What's going on with Splunk?

1 Solution

ekost
Splunk Employee

The symptoms above have been seen when reverse DNS is not functioning in the network environment. The Splunk indexers are unable to reverse-resolve the hosts connecting to their inputs and get stuck waiting on DNS.

A workaround is to change the inputs.conf setting "connection_host" to "none".
Details on setting "connection_host" can be found here: inputs.conf
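
As a minimal sketch, assuming the indexers receive forwarder traffic on the standard splunktcp input on port 9997, the stanza on each indexer would look like this:

[splunktcp://9997]
connection_host = none

A restart of splunkd on the indexer is generally needed for an inputs.conf change like this to take effect.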

The default setting of "connection_host" can vary depending upon Splunk version. Notably, each input stanza on the indexer that references a network port can have a different "connection_host" option set. Evaluate the current settings by using btool and change all or some as needed.
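
For example, the effective splunktcp settings and the file each value comes from can be listed with btool (the path assumes a default $SPLUNK_HOME):

# List effective splunktcp input settings; look for connection_host in the output
$SPLUNK_HOME/bin/splunk btool inputs list splunktcp --debug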

Caveat: metrics.log will no longer show hostnames for forwarder data, only IP addresses.

In all cases, once DNS use was bypassed or minimized, normal data ingestion on the indexers resumed.

In a future release, a message will be added to splunkd.log when a timeout threshold on DNS has been triggered.


ekost
Splunk Employee

The splunkd.log message was added in Splunk 6.1.3. An example is noted in this Answers post.
