Deployment Architecture

deployment clients failing check in / Deployment server not responding

ltrand
Contributor

Of approximately 3200 UF agents, only 1400 are checked into the deployment server. I see that many are still sending data, including their internal logs, so I have some idea of who these systems are.

I've attempted the following to drive the number up:

DS web.conf settings:

splunkdConnectionTimeout = 240

deploying deploymentclient.conf with the following:
phoneHomeIntervalInSecs = 365
handShakeRetryInterval = 89

The idea is that the handshake retry never collides with the phone-home interval, and that neither value collides with the timeout.
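As a rough check of that reasoning, the sketch below computes when two of these timers would next land on the same second. It assumes all timers start at the same instant and fire on exact multiples of their interval, which is a simplification and not a description of how splunkd actually schedules them. The helper name `first_coincidence` is made up for illustration.

```python
from math import gcd


def first_coincidence(a: int, b: int) -> int:
    """Seconds until two timers started together fire at the same instant again (their LCM)."""
    return a * b // gcd(a, b)


# Values taken from the settings above.
phone_home = 365  # phoneHomeIntervalInSecs
retry = 89        # handShakeRetryInterval
timeout = 240     # splunkdConnectionTimeout

print(first_coincidence(retry, phone_home))    # 32485
print(first_coincidence(retry, timeout))       # 21360
print(first_coincidence(phone_home, timeout))  # 17520
```

Under that simplified model the retry and phone-home timers only line up about once every nine hours, so the values already do what they were chosen for.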

I've incrementally changed these values, but nothing has moved the needle in any measurable way. Any thoughts on how I can get to 100% check-in?


lguinn2
Legend

I assume that the error message is appearing in the splunkd.log on a forwarder.

"Unable to send handshake message to deployment server. Error status is: not_connected"

I also assume that all the forwarders are sending their internal logs to the indexers, and that there is nothing wrong with the forwarding, just the deployment process. Hopefully, the deployment server is also forwarding its internal logs to the indexers, so you can see the overall picture from one place. If so, here are some searches to play with.

This search summarizes the "phone home" attempts of the forwarders

index=_internal (*phonehome* component=DC*) OR (component=DC:HandshakeReplyHandler)
| bin span=1d _time
| stats count by host log_level _time

This search summarizes the "received phone homes" from the perspective of the DS

index=_internal metrics group=deploy-server sourcetype=splunkd 
| bin span=1d _time
| stats sum(nReceived) by host _time

In addition to looking at these statistics, take a look at the events themselves. You will probably see other reports that would help you see the problem. Some timecharts might help you determine if the DS is getting too many requests at once, for example.


lguinn2
Legend

How are you determining the "checked into the deployment server" status?

You have a lot of clients for one deployment server (DS), but the DS can handle about 1000 polls per minute if it is provisioned with adequate resources (network, CPU and memory). Does your DS seem resource-starved in any way? Your phoneHomeIntervalInSecs = 365 seems reasonably set on the clients.
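To put rough numbers on that, here is a small sketch of the average poll rate for the figures in this thread (about 3200 clients, a 365-second phone-home interval, and the approximate 1000 polls per minute ceiling). It assumes clients are spread evenly over the interval, which real fleets often are not, and the function names are made up for illustration.

```python
def polls_per_minute(clients: int, interval_secs: float) -> float:
    """Average phone-home requests per minute if clients are evenly spread."""
    return clients * 60.0 / interval_secs


def min_interval_secs(clients: int, max_polls_per_minute: float) -> float:
    """Shortest phone-home interval that keeps the average rate at or below the ceiling."""
    return clients * 60.0 / max_polls_per_minute


rate = polls_per_minute(3200, 365)
print(round(rate))                    # 526
print(min_interval_secs(3200, 1000))  # 192.0
```

On average that is a little over half of the approximate ceiling, so any overload would more likely come from clients bunching up (for example after a mass restart) than from the steady-state rate.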

What do you find in the splunkd.log on the clients that have not checked in? Does it indicate that the client attempted to phone home, that the phonehome failed/succeeded etc.?

In the splunkd.log on the DS, do you see any indication that all the clients are connecting or that connections are failing?


ltrand
Contributor

The deployment client window in the DS gives me a count. I then do a meta search on hosts and subtract my syslog count, which gets me a much higher number. When I search the splunkd logs, I see several messages like the following:

"Unable to send handshake message to deployment server. Error status is: not_connected"
