Deployment Architecture

Why am I getting "ConfReplicationThread - Error pulling configurations from captain" in my search head cluster?

bgaignon
Path Finder

Hi guys,

I have an issue with my search head cluster: replication does not seem to be working.

192.168.192.131 is SearchHead1

192.168.192.136 is SearchHead2

11-15-2014 12:42:32.993 +0100 WARN  ConfReplicationThread - Error pulling configurations from captain=https://192.168.192.131:8089, consecutiveErrors=966: Error in fetchFrom, at=: Non-200 status_code=500: refuse request without valid baseline; snapshot exists at op_id=1a4a26781bed0c57c325b1fd297fb07082eba435 for repo=https://192.168.192.131:8089
11-15-2014 12:42:32.990 +0100 ERROR HttpListener - Exception while processing request from 192.168.192.136 for /services/replication/configuration/commits?output_mode=json&at=: refuse request without valid baseline; snapshot exists at op_id=1a4a26781bed0c57c325b1fd297fb07082eba435 for repo=https://192.168.192.131:8089

Captain election is working: if I stop the captain, the other search head becomes the captain (according to the command "splunk show shcluster-status").

Here is my server.conf on Search Heads:

[shclustering]
# Deployer URL
conf_deploy_fetch_url = https://192.168.192.134:8089
disabled = 0
# IP of the current server
mgmt_uri = https://192.168.192.136:8089
pass4SymmKey = $1$oov1Lgj65W5z
replication_factor = 2
id = 6EFA87CF-8D4D-43D5-85D3-DE8BAD78403E

Does anyone see where my problem is?

rbal_splunk
Splunk Employee

A further update for errors like the one below:

08-01-2017 10:03:37.694 -0700 WARN ConfReplicationThread - Error pulling configurations from captain=https://:8089, consecutiveErrors=2 msg="Error in fetchFrom, at=ae823222d0607652969d338bb793469fb7de85cd: Network-layer error: Connect Timeout

Please note that a consecutiveErrors count of 10 or less is not considered a real issue. It can happen when the captain is busy and unable to respond in time.

Check your consecutiveErrors count with a search like this:

index=_internal ( host= OR host= OR host= OR host=) source="splunkd.log" "ConfReplicationThread - Error pulling configurations from captain" | stats max(consecutiveErrors) by host

It is not an issue if consecutiveErrors does not exceed 10. If the count goes above 10, log a case with Splunk Support.
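If you only have a copy of splunkd.log and no search access, the same check can be sketched offline. This is a minimal Python sketch, not a Splunk tool: the function names and the THRESHOLD constant are made up for illustration, and the value 10 is just the rule of thumb from this answer, not a Splunk setting.

```python
import re

# Matches the ConfReplicationThread warning and captures the error counter.
PATTERN = re.compile(
    r"ConfReplicationThread - Error pulling configurations from captain=\S*?"
    r",\s*consecutiveErrors=(\d+)"
)

# Rule of thumb from this thread (assumption, not a Splunk configuration value).
THRESHOLD = 10


def max_consecutive_errors(lines):
    """Return the highest consecutiveErrors value found, or None if no line matches."""
    highest = None
    for line in lines:
        match = PATTERN.search(line)
        if match:
            value = int(match.group(1))
            if highest is None or value > highest:
                highest = value
    return highest


def needs_attention(lines, threshold=THRESHOLD):
    """True when the highest counter seen is above the threshold."""
    highest = max_consecutive_errors(lines)
    return highest is not None and highest > threshold
```

To use it, pass the lines of one member's log, for example `needs_attention(open("splunkd.log"))`, where the path is whatever copy of the log you have. Run it per member, since the search above groups by host.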

rbal_splunk
Splunk Employee

Normally this error means that a search head cluster member has fallen behind in replication. It is a good idea to debug why configurations aren't syncing in the first place and address the root cause.

A destructive resync is only truly required if the member has fallen really far behind the captain -- i.e. 20000 changes behind (by default) -- or if local state is completely corrupted/invalid (e.g. corrupt filesystem).

For a search head cluster, please refer to the answer below to make sure the cluster members are configured as required.

http://answers.splunk.com/answers/242905/shc-troubleshooting-configurations-under-search-he.html#ans...

bgaignon
Path Finder

Found the problem,

http://docs.splunk.com/Documentation/Splunk/6.2.0/DistSearch/Handlememberfailure

splunk resync shcluster-replicated-config

bohanlon_splunk
Splunk Employee

I downvoted this post because it doesn't fix the underlying issue (i.e. identify the cause of the replication bottleneck). It just temporarily works around it.

rmorlen
Splunk Employee

Ok. But in the docs it states:

"Caution: This command causes an overwrite of the member's entire set of search-related configurations, resulting in the loss of any local changes."

What does "loss of local changes" mean with this? Any changes that have been made are lost? For all time? For the last hour?

Steve_G_
Splunk Employee

It means any changes that you have made to that search head alone, as opposed to those changes that get propagated (through either the deployer or automatic replication) across the set of cluster members.

bgaignon
Path Finder

I guess you lose all changes made since the last replication.
Without replication to the other search head members, your changes are local only.
That is how I understand it.
