We have implemented a new design with DNS load balancing, and we are currently having issues with it.
DNS is configured with an A record that has 2 IPs defined. That LB hostname is defined in deploymentclient.conf on the UF.
However, as soon as we have more than 1 backend server active, the UF fails on the initial phonehome handshake. With Wireshark we can see that the traffic is split between our 2 IPs and the handshake never completes. It can keep trying for days.
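To double-check what a client actually gets back from DNS for the LB hostname, a small script along these lines could be run on the UF host. This is only a sketch: the function name `resolve_ipv4` is made up for illustration, and port 8089 is assumed to be the management port in use.

```python
import socket

def resolve_ipv4(hostname, port=8089):
    """Return the sorted, de-duplicated IPv4 addresses that DNS returns for hostname.

    The port is an assumption (8089 is assumed to be the management port);
    it does not affect which addresses are returned.
    """
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})

# Example call (replace with the real LB hostname):
# print(resolve_ipv4("my-ds-lb.example.com"))
```

If both backend IPs come back here, the client is free to open different connections to different servers, which would match the split traffic seen in the capture.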
01-30-2020 13:16:32.828 +0100 DEBUG DC:PhonehomeThread - Changed state from=Initial to=Initial
01-30-2020 13:16:32.828 +0100 DEBUG DC:PhonehomeThread - Attempting handshake
01-30-2020 13:16:32.828 +0100 DEBUG DC:DeploymentClient - Sending message <handshake/> to tenantService/handshake
01-30-2020 13:16:32.828 +0100 DEBUG HttpPubSubConnection - HttpClientPollingThread Woke up
01-30-2020 13:16:32.828 +0100 DEBUG HttpPubSubConnection - Not waiting as we have '1' requests in queue
01-30-2020 13:16:32.828 +0100 INFO HttpPubSubConnection - Running phone uri=/services/broker/phonehome/connection_10.11.12.13_48089_WIN2019.oneadr.net_WIN2019_7FDF738A-330F-42E1-BB34-B1EBCD881E67
01-30-2020 13:16:32.828 +0100 DEBUG HttpPubSubConnection - Will now wait for pollingInterval of 60.000 secs
01-30-2020 13:16:32.828 +0100 DEBUG DC:DeploymentClient - channel=tenantService/handshake Success sending handshake to DS.
01-30-2020 13:16:32.828 +0100 DEBUG DC:DeploymentClient - Changed state from=Initial to=HandshakeInProgress
01-30-2020 13:16:32.828 +0100 DEBUG DC:PhonehomeThread - Handshake not yet finished; will retry every 12.000sec
01-30-2020 13:16:32.828 +0100 DEBUG DC:PhonehomeThread - PhonehomeThread::main top-of-loop, DC state=HandshakeInProgress
01-30-2020 13:16:32.828 +0100 DEBUG DC:PhonehomeThread - Handshake not yet finished; will retry every 12.000sec
01-30-2020 13:16:32.828 +0100 DEBUG DC:PhonehomeThread - Phonehome thread will wait for 12.000sec (1)
01-30-2020 13:16:44.833 +0100 DEBUG DC:PhonehomeThread - PhonehomeThread::main top-of-loop, DC state=HandshakeInProgress
01-30-2020 13:16:44.833 +0100 DEBUG DC:PhonehomeThread - Handshake not yet finished; will retry every 12.000sec
01-30-2020 13:16:44.833 +0100 DEBUG DC:PhonehomeThread - Phonehome thread will wait for 12.000sec (1)
01-30-2020 13:16:56.832 +0100 DEBUG DC:PhonehomeThread - PhonehomeThread::main top-of-loop, DC state=HandshakeInProgress
01-30-2020 13:16:56.832 +0100 WARN DC:PhonehomeThread - No response to handshake for too long; starting over.
01-30-2020 13:16:56.832 +0100 DEBUG DC:PhonehomeThread - Changed state from=HandshakeInProgress to=Initial
01-30-2020 13:16:56.832 +0100 DEBUG DC:PhonehomeThread - Handshake not yet finished; will retry every 12.000sec
01-30-2020 13:16:56.832 +0100 DEBUG DC:PhonehomeThread - PhonehomeThread::main top-of-loop, DC state=Initial
01-30-2020 13:16:56.832 +0100 WARN DC:PhonehomeThread - No response to handshake for too long; starting over.
01-30-2020 13:16:56.832 +0100 DEBUG DC:PhonehomeThread - Changed state from=Initial to=Initial
01-30-2020 13:16:56.832 +0100 DEBUG DC:PhonehomeThread - Attempting handshake
01-30-2020 13:16:56.832 +0100 DEBUG DC:DeploymentClient - Sending message <handshake/> to tenantService/handshake
01-30-2020 13:16:56.832 +0100 DEBUG HttpPubSubConnection - HttpClientPollingThread Woke up
01-30-2020 13:16:56.832 +0100 DEBUG HttpPubSubConnection - Not waiting as we have '1' requests in queue
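For longer captures, the retry loop above can be summarized with a short script rather than read line by line. This is a rough sketch that only matches the message texts visible in the excerpt above; the function name `summarize_handshake` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "... Changed state from=Initial to=HandshakeInProgress"
STATE_CHANGE = re.compile(r"Changed state from=(\w+) to=(\w+)")
TIMEOUT_TEXT = "No response to handshake for too long"

def summarize_handshake(lines):
    """Count state transitions and handshake timeouts in splunkd.log lines."""
    transitions = Counter()
    timeouts = 0
    for line in lines:
        match = STATE_CHANGE.search(line)
        if match:
            transitions[(match.group(1), match.group(2))] += 1
        if TIMEOUT_TEXT in line:
            timeouts += 1
    return transitions, timeouts

# Example call (the log path is an assumption, adjust to the local install):
# with open("splunkd.log") as f:
#     print(summarize_handshake(f))
```

A client that keeps moving from HandshakeInProgress back to Initial and never reaches any other state is stuck in the same loop shown in the log.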
Deployment server running RHEL 7, Splunk 7.3.4
Deployment Client on Windows 2019, Splunk UF 7.3.3
@ejenson_splunk Have you experienced anything like this in your setup? I would much appreciate it if you could share details of your setup, as there seem to be some differences that I'm not able to find.