Of approximately 3200 UF agents, only 1400 have checked in to the deployment server. Many of the rest are still sending data, including their internal logs, so I have some idea of which systems these are.
I've attempted the following to drive the number up:
DS web.conf settings:
splunkdConnectionTimeout = 240
deploying deploymentclient.conf with the following:
phoneHomeIntervalInSecs = 365
handShakeRetryInterval = 89
The idea is that the handshake retry never coincides with the phone-home interval, and that neither value coincides with the timeout.
I've incrementally changed these values, but nothing has moved the needle in any measurable way. Any thoughts on how I can get to 100% check-in?
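As a rough check on that reasoning, here is a minimal Python sketch of when two periodic timers with these values would next fire in the same second. It assumes both timers start together and run strictly on schedule, which may not match how the deployment client actually schedules them; the function name is made up for illustration.

```python
from math import gcd

def first_coincidence(phone_home_secs: int, retry_secs: int) -> int:
    """Seconds until two periodic timers fire together again.

    Assumption: both timers start at t=0 and never drift. This is the
    least common multiple of the two intervals.
    """
    return phone_home_secs * retry_secs // gcd(phone_home_secs, retry_secs)

# Values from deploymentclient.conf above.
overlap = first_coincidence(365, 89)
print(overlap)                     # 32485 seconds
print(round(overlap / 3600, 1))    # roughly 9.0 hours
```

Because 365 and 89 share no common factor, the two timers would line up only about once every nine hours under these assumptions, so collisions between them are unlikely to explain the missing check-ins.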
I assume that the error message is appearing in the splunkd.log on a forwarder.
"Unable to send handshake message to deployment server. Error status is: not_connected"
I also assume that all the forwarders are sending their internal logs to the indexers, so there is nothing wrong with the forwarding, just the deployment process. Hopefully, the deployment server is also forwarding its internal logs to the indexers, so you can see the overall picture from one place. If so, here are some searches to try.
This search summarizes the "phone home" attempts of the forwarders:
index=_internal (*phonehome* component=DC*) OR (component=DC:HandshakeReplyHandler)
| bin span=1d _time
| stats count by host log_level _time
This search summarizes the "received phone homes" from the perspective of the DS:
index=_internal metrics group=deploy-server sourcetype=splunkd
| bin span=1d _time
| stats sum(nReceived) by host _time
In addition to looking at these statistics, take a look at the events themselves. You will probably see other reports that would help you see the problem. Some timecharts might help you determine if the DS is getting too many requests at once, for example.
How are you determining the "checked into the deployment server" status?
You have a lot of clients for one deployment server (DS), but the DS can handle about 1000 polls per minute if it is provisioned with adequate resources (network, CPU and memory). Does your DS seem resource-starved in any way? Your phoneHomeIntervalInSecs = 365 seems reasonably set on the clients.
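To put numbers on that, here is a small Python sketch of the average polling load. It assumes all 3200 clients phone home evenly spread across the interval, which will not hold if many clients restart at the same time; the function name is only illustrative.

```python
def polls_per_minute(clients: int, phone_home_secs: int) -> float:
    """Average phone-home requests per minute hitting the DS.

    Assumption: client polls are evenly distributed over the interval.
    """
    return clients * 60 / phone_home_secs

rate = polls_per_minute(3200, 365)
print(round(rate))  # 526
```

At roughly 526 polls per minute on average, the load sits well under the 1000 per minute figure above, so bursts or resource limits on the DS are more likely suspects than the steady-state rate.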
What do you find in the splunkd.log on the clients that have not checked in? Does it indicate that the client attempted to phone home, that the phonehome failed/succeeded etc.?
In the splunkd.log on the DS, do you see any indication that all the clients are connecting or that connections are failing?
Looking at the deployment client window in the DS gives me a count. I then do a meta search on hosts and subtract my syslog count, which gives me a much higher number. When I search the splunkd logs, I see several messages like the following:
"Unable to send handshake message to deployment server. Error status is: not_connected"