Search Head Cluster captain election fails with error -NOT_LEADER CURRENT_STATE = FOLLOWER

sat94541 · ‎01-06-2016

We have a 5-node search head cluster (SHC) on Splunk version 6.3. The captain election is not succeeding.
We followed the documented steps and cleared _raft, but that did not help.
The steps taken were:

1) Stop all SHC members.
2) Clean _raft on all nodes: $SPLUNK_HOME/var/run/splunk/_raft
3) Restart all members.
4) Attempt to bootstrap all members using the command:

splunk bootstrap shcluster-captain -servers_list "<URI>:<management_port>,<URI>:<management_port>,..." -auth <username>:<password>

This failed with the error: SHPRaftConsensus - NOT_LEADER CURRENT_STATE = FOLLOWER

The splunkd.log has the following entries:

01-05-2016 19:35:53.658 -0500 INFO ServerConfig - My server name is "test5421.xx.test.com".
01-05-2016 19:35:53.659 -0500 INFO ServerConfig - My hostname is "test5421".
01-05-2016 19:40:37.058 -0500 INFO SHPRaftConsensus - stepDown(1)
01-05-2016 19:40:37.058 -0500 INFO SHPRaftConsensus - Activating configuration 1:\n<configuration>\n<prev_configuration>\n<server>\n<server_id>https://test5421.xx.test.com:8089
01-05-2016 19:41:03.430 -0500 INFO SHPRaftConsensus - Running for election in term 2
01-05-2016 19:41:03.431 -0500 INFO SHPRaftConsensus - Now leader for term 2
01-05-2016 19:41:03.431 -0500 INFO SHPRaftConsensus - New commitIndex: 2
01-05-2016 19:41:03.431 -0500 INFO SHPoolingMgr - Making node the captain
01-05-2016 19:41:03.431 -0500 INFO SHPoolingMgr - makeOrChangeSlave - master_shp = https://test5421.xx.test.com:8089
01-05-2016 19:41:03.613 -0500 INFO SHPRaftConsensus - stepDown(7495)
01-05-2016 19:41:03.613 -0500 INFO SHPRaftConsensus - Activating configuration 1:\n<configuration>\n<prev_configuration>\n<server>\n<server_id>https://test5421.xx.test.com:8089</server_id>\n</server>\n</prev_configuration>\n&...
01-05-2016 19:41:03.613 -0500 INFO SHPRaftConsensus - Exiting and deleting server : https://test5422.xx.test.com:8089
01-05-2016 19:41:03.613 -0500 INFO SHPRaftConsensus - Exiting and deleting server : https://testa9437.xx.test.com:8089
01-05-2016 19:41:03.613 -0500 INFO SHPRaftConsensus - Exiting and deleting server : https://test9453.xx.test.com:8089
01-05-2016 19:41:03.613 -0500 INFO SHPRaftConsensus - Exiting and deleting server : https://test9454.xx.test.com:8089
01-05-2016 19:41:03.613 -0500 INFO SHPoolingMgr - makeOrChangeSlave - master_shp = ?
01-05-2016 19:41:03.613 -0500 INFO SHPRaftConsensus - NOT_LEADER CURRENT_STATE = FOLLOWER

Note that in the above log we see "stepDown(1)" and "stepDown(7495)", which does not seem right.
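
To compare these values across members, it may help to pull just the cluster-related entries out of each splunkd.log. The following is a minimal sketch, not an official Splunk tool: the function names `raft_events` and `stepdown_terms` are made up for illustration, and treating the number inside stepDown() as a Raft term is an assumption this thread does not confirm.

```shell
# Print every SHPRaftConsensus / SHPoolingMgr entry from a splunkd.log.
# $1: path to splunkd.log
raft_events() {
    grep -E ' (SHPRaftConsensus|SHPoolingMgr) - ' "$1"
}

# Print only the numbers passed to stepDown(), one per line, in log order.
# (Assumption: these numbers are Raft terms.)
# $1: path to splunkd.log
stepdown_terms() {
    grep -oE 'stepDown\([0-9]+\)' "$1" | tr -cd '0-9\n'
}
```

For example, `stepdown_terms $SPLUNK_HOME/var/log/splunk/splunkd.log` run on each member shows whether one node reports a much larger number than the others, as in the jump from 1 to 7495 above.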

rbal_splunk · ‎01-06-2016

It could be a network issue causing append entries to fail during bootstrapping. Check splunkd.log.
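
One rough way to rule out basic connectivity problems is to test, from each member, whether the management port of every other member accepts a TCP connection. This is only a sketch under assumptions: `port_open` and `check_members` are hypothetical helper names, it relies on bash's /dev/tcp and the `timeout` utility being available, and a successful TCP connect does not prove that append entries will succeed.

```shell
# Succeeds if a TCP connection to host:port opens within 3 seconds.
# Relies on bash's /dev/tcp and on the coreutils `timeout` command.
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Check each member given as host:port and print OK or FAIL per member.
# Returns non-zero if any member could not be reached.
check_members() {
    rc=0
    for member in "$@"; do
        if port_open "${member%:*}" "${member##*:}"; then
            echo "OK    $member"
        else
            echo "FAIL  $member"
            rc=1
        fi
    done
    return $rc
}
```

With the hosts from the log above, a call would look like `check_members test5422.xx.test.com:8089 test9453.xx.test.com:8089`, repeated from every member.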

sat94541 · ‎01-06-2016

Here is what worked:

1) Stop all 5 SHC members.
2) Clean _raft on all nodes: $SPLUNK_HOME/var/run/splunk/_raft. NOTE: It needs to be cleaned on every node.
3) Restart all 5 SHC members.
4) Bootstrap only one member initially.

Bootstrap one node using a command like the one below, then add the remaining members by running add peer on the bootstrapped captain.

splunk bootstrap shcluster-captain -servers_list "<URI>:<management_port>" -auth <username>:<password>

Here is the reference for adding a peer:

http://docs.splunk.com/Documentation/Splunk/6.2.0/DistSearch/Addaclustermember#Add_the_instance
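
As a sketch of the order of operations described above, the helper below only prints the commands to run on the member chosen as captain; it does not execute anything. The function name `shc_rebuild_plan` is made up for illustration, and the `<username>:<password>` placeholder is left for you to fill in. The `add shcluster-member -new_member_uri` form is the one given on the linked documentation page; check it against the docs for your version.

```shell
# Print the commands to run on the member that will become captain.
# $1: URI of the captain, e.g. https://host:8089
# remaining args: URIs of the other members, added one at a time.
shc_rebuild_plan() {
    captain=$1
    shift
    # Bootstrap with a servers_list that contains only the captain itself.
    echo "splunk bootstrap shcluster-captain -servers_list \"$captain\" -auth <username>:<password>"
    # Then add every other member from the bootstrapped captain.
    for member in "$@"; do
        echo "splunk add shcluster-member -new_member_uri $member -auth <username>:<password>"
    done
    # Finally, confirm the captain and member list.
    echo "splunk show shcluster-status -auth <username>:<password>"
}
```

For instance, `shc_rebuild_plan https://test5421.xx.test.com:8089 https://test5422.xx.test.com:8089` prints one bootstrap command, one add command, and a status check.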

rbal_splunk · ‎01-06-2016

When you clear _raft, make sure all the nodes are stopped and turned off.
Can you try bootstrapping just one member and then adding the remaining members with add peer on the bootstrapped captain?

