Getting Data In

Intermediate forwarder connections timeout

mlindsey
Explorer

I have about 1,300 hosts configured with universal forwarders sending data to a single heavy forwarder. The heavy forwarder doesn't really do any processing (at least not yet). It simply turns around and forwards to the indexers in another data center. Essentially, it is functioning as a gateway.

Many of the universal forwarders are logging timeout errors when trying to connect to the intermediate forwarder. fs.max_files is ~793k, and the process file descriptor limit is 64k. Typically a dozen or more connections on the intermediate forwarder are stuck in SYN_RECV state. If I telnet to the port (9997 in this case), I get mostly timeouts with the occasional successful connection. It's acting like it is resource starved. It's running on a 4-core, 8 GB VM. I can't find any tunable Splunk connection limits except for queueSize, which I upped to 10MB with no apparent difference. splunkd seems to float around 60% of a single CPU while in this state.

Have I maxed out the capability of Splunk in this case? I suppose I can add more intermediate forwarders and load balance between them. Splunk, what are your recommendations? Thanks, guys!
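If it helps, load balancing across two intermediate forwarders would look roughly like this in outputs.conf on each universal forwarder (a sketch only; the hostnames are placeholders):

[tcpout]
defaultGroup = intermediate_forwarders

[tcpout:intermediate_forwarders]
# Splunk auto-load-balances across the servers in this list,
# switching targets every autoLBFrequency seconds (default 30)
server = hf1.example.com:9997, hf2.example.com:9997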

hajir_shiftehme
New Member

From memory, our issue turned out to be the maxKBps default setting.
By default, forwarder throughput is limited (to 256 KBps?), and when the buffer is full, TCP connections are left hanging.
Give it a go and see if it helps: 'maxKBps' under the [thruput] stanza in limits.conf.
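If you want to test that, a minimal limits.conf change on the universal forwarders would look like this (the default on a universal forwarder is 256 KBps; treat the value as an example, not a recommendation):

[thruput]
# 0 = unlimited; any positive number is a KB-per-second cap
maxKBps = 0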

bmacias84
Champion

I'm seeing the same thing with our VM indexers and intermediate forwarders. I've run Wireshark on some of my indexers and forwarders. What I've seen is that the intermediate forwarder and indexers are sending TCP resets immediately and TCP zero-windowing. Digging in further, I started to watch disk transfer rate, system processor queue, network bytes in/out, thread counts, thread status, and CPU % time. I recommend watching these stats to help diagnose your issue. Also try using a sniffer like Wireshark.
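If running Wireshark on the endpoints isn't practical, a capture filter along these lines on the intermediate forwarder (the interface name is an assumption) shows just the resets and zero-window advertisements on the splunktcp port:

# show only RST packets and zero-window advertisements on port 9997
tcpdump -i eth0 -nn 'tcp port 9997 and (tcp[tcpflags] & tcp-rst != 0 or tcp[14:2] = 0)'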

The TCP resets seem to occur in bulk when the system processor queue and disk transfer rates are high. A high processor queue usually means you have threads waiting to be processed. My assumption is that, in my case, during periods of high disk transfer I can't service network tasks and commit to disk fast enough, which causes TCP zero windowing and TCP resets.

Adding more processors to a VM may cause more problems, as all virtual processors have to be scheduled simultaneously, and the VM may wait too long to be scheduled on the physical resources.

If you have an antivirus application on your servers, exclude all Splunk processes and directories. This seems to help.
If you have a lot of threads queuing/waiting, try disabling non-critical services.

If you are on a Windows server, you can disable TCP auto-tuning.

Windows has three primary registry values that can help: TcpTimedWaitDelay (how long a TCP connection is held after a FIN or reset), MaxUserPort (maximum number of user ports available to applications), and TcpNumConnections. These settings don't help too much, as auto-tuning is fairly good in Windows Server 2008 and 2012.
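As a rough sketch only (the values are illustrative, not recommendations, and the registry changes need a reboot), auto-tuning and those values can be set from an elevated prompt:

:: disable TCP receive window auto-tuning
netsh interface tcp set global autotuninglevel=disabled

:: example registry tweaks under Tcpip\Parameters (REG_DWORD values)
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v TcpTimedWaitDelay /t REG_DWORD /d 30 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v MaxUserPort /t REG_DWORD /d 65534 /f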

For more information, try reading:
Windows TCP tuning

hajir_shiftehme
New Member

I'm getting identical symptoms on universal forwarder 5.0.2, but the workaround doesn't seem to work and nothing suspicious can be found in the log files (I even tried splunk start --debug).

Any idea what I should be looking at?

[splunktcp]
_rcvbuf = 1572864
acceptFrom = *
connection_host = ip
host = da02.int
index = default
route = has_key:tautology:parsingQueue;absent_key:tautology:parsingQueue
[splunktcp://9997]
_rcvbuf = 1572864
connection_host = none
disabled = 0
host = da02.int
index = default
queueSize = 10MB

yannK
Splunk Employee

You may have identified the issue: the reverse DNS lookup is taking too long.
Is your DNS quick to resolve an IP/hostname? Can it resolve all of your forwarders? Are they defined in your DNS server? Are you on Windows with NetBIOS resolution turned on?

A first workaround is to reduce the resolution time, for example by populating a local hosts file or changing the DNS settings.
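For example (addresses and names are made up), entries like these in /etc/hosts on the intermediate forwarder let it answer the reverse lookups locally:

10.1.2.11   fwd-web01.example.com   fwd-web01
10.1.2.12   fwd-web02.example.com   fwd-web02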

A second workaround is to disable DNS resolution in Splunk for your splunktcp inputs:

  • The resolution is generally useless, because the host field is already populated by the forwarder.
  • The only consequence is that in the Splunk metrics, the source field will be the IP instead of the hostname.
  • If needed, you can do this also for UDP/TCP syslog inputs, but it will change the format of your syslog logs.
  • To proceed, edit inputs.conf and add the parameter connection_host = none

see http://docs.splunk.com/Documentation/Splunk/4.3.1/admin/Inputsconf

[splunktcp://9997]
connection_host = none

yannK
Splunk Employee

FYI, in Splunk 4.3.1, to help you identify this issue, a new warning message has been added to splunkd.log to report slow DNS resolution.
