Splunk to Splunk communication stuck in CLOSE_WAIT

rbal_splunk
Splunk Employee

A Splunk environment in one data center, configured with multiple indexers, became completely unresponsive to that data center's forwarders.
The indexers were confirmed to be running, but the ~2K forwarders could not connect consistently and events were being dropped.

The splunkd.log on the forwarders showed: WARN TcpOutputProc - Cooked connection to ip=blah:port timed out

Examining the indexers showed very few indications that anything was wrong, other than:
ERROR TcpInputProc - Error encountered for connection from src=blah:port. Broken pipe
ERROR TcpInputProc - Error encountered for connection from src=blah:port. Timeout

A telnet from a forwarder host to TCP port 9997 on the indexers does connect.
Using S.O.S. to examine the indexers revealed almost no load at all; the queues were empty and CPU use was minimal.
Each indexer's 'open files' ulimit, as reported in splunkd.log, was amply high (10K+).
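
A quick way to double-check those limits is to compare the shell limit against the ulimit lines splunkd writes at startup (the default log path under $SPLUNK_HOME is assumed here; adjust for your install):

# soft limit for open files in the current shell
ulimit -n
# ulimit values splunkd logged the last time it started
grep -i ulimit $SPLUNK_HOME/var/log/splunk/splunkd.log | tail -20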

A tcpdump shows repeated SYN packets to indexers port 9997, but mostly no replies.
Using netstat -an | grep 9997| grep ESTABLISHED on the indexers showed an average of ~300 ESTABLISHED per-indexer, with many hundreds of CLOSE_WAIT.
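
For a per-state breakdown of everything touching the receiving port (9997 assumed, matching the input above), something like the following can be run on each indexer:

# counts ESTABLISHED, CLOSE_WAIT, etc. for connections on port 9997
netstat -an | grep ':9997' | awk '{print $6}' | sort | uniq -c | sort -rn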

Restarting an indexer triggers a surge of ESTABLISHED connections, which very quickly (within a couple of minutes) drops back off to the low hundreds, with the majority of TCP connections again in CLOSE_WAIT.

What's going on with Splunk?

1 Solution

ekost
Splunk Employee

The symptoms above have been seen when reverse DNS is not functioning in the network environment. The Splunk indexers attempt to resolve the source host of each incoming connection, cannot, and get stuck waiting on DNS.
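
One quick way to confirm this from an indexer is to time a reverse (PTR) lookup against a forwarder address (the IP below is a placeholder):

dig -x 192.0.2.10 +time=2 +tries=1

A lookup that hangs or times out points at the reverse DNS problem described above.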

A workaround is to change the inputs.conf setting "connection_host" to "none".
Details on "connection_host" can be found in the inputs.conf documentation: inputs.conf

The default value of "connection_host" can vary with Splunk version. Notably, each input stanza on the indexer that references a network port can have its own "connection_host" setting. Evaluate the current settings using btool and change all or some as needed.
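
For reference, a minimal sketch of checking and applying the workaround on an indexer (the stanza name, port, and file location are assumptions; adjust to your deployment):

# show the connection_host values currently in effect and where they are set
$SPLUNK_HOME/bin/splunk btool inputs list splunktcp --debug | grep connection_host

Then, in the relevant inputs.conf (for example $SPLUNK_HOME/etc/system/local/inputs.conf), set:

[splunktcp://9997]
connection_host = none

and restart splunkd for the change to take effect.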

Caveat: with "connection_host" set to "none", metrics.log no longer shows hostnames for forwarder data, only IP addresses.

In all cases, once DNS use was bypassed or minimized, normal data ingestion on the indexers resumed.

In a future release, a message will be added to splunkd.log when a timeout threshold on DNS has been triggered.

ekost
Splunk Employee

The message was added in 6.1.3. An example is noted in this Answers post.
