I had a multi-site SHC with one search head at site1 and three at site2, although the documentation I've found recommends an odd number of members at each site. One of my three servers was marked "down" by the deployer, but the SHC service was still working. The failing server was rebuilt from a backup taken before it was configured as part of the SHC, so when it came back up it was no longer configured for the cluster and the SHC service broke. The single server at site1 was configured to be the captain, so I am trying to figure out why this failed. Why would restarting Splunk on the deployer and/or on the search heads cause the SHC service to fail if the server running the captain is not having any issues? And what can be done to prevent this from happening in the future?
In any clustered environment, Splunk or otherwise, you should run an odd number of cluster members so that a clear majority can always be formed when electing a leader; this is what prevents split-brain situations.
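The quorum arithmetic behind that recommendation can be sketched as follows (the function names are mine, not Splunk's; Splunk's captain election uses a RAFT-style majority vote, where a cluster of n members needs floor(n/2)+1 live members to elect a captain):

```python
# Majority-quorum sketch: a RAFT-style cluster needs floor(n/2) + 1 live
# members to elect a leader (the SHC captain). Note that going from an
# odd size to the next even size adds no extra failure tolerance, which
# is why odd member counts are recommended.

def quorum(n: int) -> int:
    """Minimum number of live members needed for a majority."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """How many members can fail while the cluster keeps a majority."""
    return n - quorum(n)

for n in (2, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

With 3 members you can lose 1 and still elect a captain; with 4 members you can still only lose 1, so the fourth member buys nothing but an extra vote that can deadlock.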
The SHC service is looking for three servers, one of which was broken. Search head replication for the remaining two was fine and search was not impacted.
splunk apply shcluster-bundle --answer-yes -target https://MY_SH2.mydomain.com:8089 -auth admin:XXXXXX
Error when issuing a rolling restart from the captain: Internal Server Error{"messages":[{"type":"ERROR","text":"Rolling restarted cannot be started without service_ready_flag = 1, check status through \"splunk show shcluster-status\". Reason :Waiting for 3 peers to register. (Number registered so far: 2)"}]}
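Since the rebuilt member came back without its SHC configuration, one recovery path is to re-initialize it and rejoin it to the cluster so all three peers register again. A sketch, assuming MY_SH3 is the rebuilt member; the hostnames, ports, and secret are placeholders from my environment and must match your existing cluster's settings (in particular the pass4SymmKey secret):

```shell
# On the rebuilt search head: recreate the SHC member configuration.
splunk init shcluster-config \
    -mgmt_uri https://MY_SH3.mydomain.com:8089 \
    -replication_port 9200 \
    -conf_deploy_fetch_url https://MY_DEPLOYER.mydomain.com:8089 \
    -secret MY_SHC_SECRET
splunk restart

# Still on the rebuilt member: rejoin the cluster by pointing at any
# healthy existing member.
splunk add shcluster-member -current_member_uri https://MY_SH2.mydomain.com:8089

# Verify that all peers register and service_ready_flag returns to 1
# before retrying the rolling restart.
splunk show shcluster-status
```

Once `splunk show shcluster-status` reports all three members up, the rolling restart should no longer fail with the service_ready_flag error.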