Getting a failed to join cluster error from an indexer after an inadvertent IP change (and change back), but the CM reports the indexer is joined and healthy. How to fix?

twinspop
Influencer

Indexer was running normally yesterday. We offlined it, and after maintenance, rebooted it. When it came back up, it had a new IP because reasons, and joined the cluster with the new IP. After realizing what happened, and much troubleshooting with my NOC, they got the right IP in place and I offlined/rebooted again. Everything looked normal, but I'm seeing this error today:

Search peer dc1prsplixap08 has the following message: Failed to register with cluster master reason: failed method=POST path=/services/cluster/master/peers/?output_mode=json master=DC1PRSPLDP01:8089 rv=0 gotConnectionError=0 gotUnexpectedStatusCode=1 actual_response_code=500 expected_response_code=2xx status_line="Internal Server Error" socket_error="No error" remote_error=Cannot add peer=11.1.136.166 mgmtport=8089 (reason: Peer with guid=C395587E-CB3A-4492-8662-71AFD3002A89 is already registered and UP). Make sure pass4SymmKey is matching if the peer is running well. [ event=addPeer status=retrying AddPeerRequest: { _id= active_bundle_id=F24FD19BC912B3FE530FB3917ED1B287 add_type=Initial-Add base_generation_id=0 batch_serialno=1 batch_size=20 forwarderdata_rcv_port=9997 forwarderdata_use_ssl=0 last_complete_generation_id=0 latest_bundle_id=F24FD19BC912B3FE530FB3917ED1B287 mgmt_port=8089 name=C395587E-CB3A-4492-8662-71AFD3002A89 register_forwarder_address= register_replication_address= register_search_address= replication_port=9000 replication_use_ssl=0 replications= server_name=dc1prsplixap08 site=default splunk_version=6.6.0 splunkd_build_number=e21ee54bc796 status=Up } ].

Linux, Splunk version 6.6.3
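
For anyone comparing a message like this against what the CM shows, the two useful pieces are the peer GUID and the reason text. A minimal sketch for pulling them out of a saved copy of the message, assuming it has the same layout as the one above; the function names `extract_peer_guid` and `extract_reason` are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Splunk): read a "Failed to register"
# message on stdin and print one field from it.

# Print the first guid=<36-character GUID> value found in the input.
extract_peer_guid() {
  grep -oE 'guid=[0-9A-Fa-f-]{36}' | head -n 1 | cut -d= -f2
}

# Print the text inside the "(reason: ...)" parentheses.
extract_reason() {
  sed -n 's/.*(reason: \([^)]*\)).*/\1/p' | head -n 1
}
```

With the message saved to a file, `extract_peer_guid < message.txt` prints the GUID the CM is complaining about, which can then be compared with the GUID of the peer the CM lists as UP.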


thambisetty
SplunkTrust

I faced the exact same issue in one of my multi-site indexer clusters when I upgraded the indexer cluster from version 9.0.x to 9.0.5.

After upgrading Splunk on the indexer, the virtual machine (VM) running the indexer unexpectedly went down. When I restarted the VM, I discovered that the Splunk service was already running, and the version displayed was the latest one. However, I failed to notice that it was experiencing problems connecting to the cluster manager.

I completed the upgrade, but after a few days (around 15 days), the vulnerability management team requested another Splunk version upgrade. When I checked the Splunk version using the command, it displayed version 9.0.5. However, upon inspecting the $SPLUNK_HOME/etc/splunk.version file, I found that it still had the old version, indicating an unsuccessful upgrade.

Realizing this, I put the cluster master in maintenance mode, stopped the Splunk service on the faulty indexer, and converted the standalone buckets to clustered buckets using the commands below. Unfortunately, the server went down again while the Splunk service on the faulty indexer was restarting.


# Find standalone buckets (directory names of the form db_<number>_<number>_<number>, with no GUID suffix)
find "$SPLUNK_DB/" -type d -name "db_*" | grep -P "db_\d*_\d*_\d*$"
# Convert standalone buckets to clustered buckets by appending a GUID to each directory name
# 5A0E298B-0AFB-4d56-9dD0-A64dfdfd19DA8 is the GUID of the cluster manager (master)
find "$SPLUNK_DB/" -type d -name "db_*" | grep -P "db_\d*_\d*_\d*$" | xargs -I {} mv {} {}_5A0E298B-0AFB-4d56-9dD0-A64dfdfd19DA8


I repeated this process two to three times, but it did not resolve the issue.
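
Because the rename commands above change bucket directories in place, it may be safer to print the planned moves first and review them. A rough dry-run sketch, assuming the same `db_<number>_<number>_<number>` naming; `plan_bucket_renames` is a hypothetical helper and it only prints `mv` lines, it does not run them:

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: list the renames that would be performed.
# $1 = root directory to search (for example "$SPLUNK_DB")
# $2 = GUID to append to each standalone bucket directory name
plan_bucket_renames() {
  local db_root="$1" guid="$2"
  find "$db_root" -type d -name 'db_*' \
    | grep -E '/db_[0-9]+_[0-9]+_[0-9]+$' \
    | sort \
    | while IFS= read -r bucket; do
        # Print the command instead of executing it.
        printf 'mv %s %s_%s\n' "$bucket" "$bucket" "$guid"
      done
}
```

Running `plan_bucket_renames "$SPLUNK_DB" <guid>` with Splunk stopped shows exactly which directories would be touched; buckets that already carry a GUID suffix are skipped.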

Finally, I cleared the $SPLUNK_HOME/etc/instance.cfg file on the faulty indexer and restarted the service. This time, the indexer successfully joined the cluster.
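
Clearing that file changes the GUID the indexer presents to the cluster, so it may be worth noting the old GUID and keeping a copy first. A small sketch, assuming the file contains a `guid = ...` line; `read_instance_guid` and `backup_file` are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for recording the current GUID before changing the file.

# Print the value of the first "guid = ..." line in the given file.
read_instance_guid() {
  awk -F'=' '/^[[:space:]]*guid[[:space:]]*=/ {
    gsub(/[[:space:]]/, "", $2)   # strip whitespace around the value
    print $2
    exit
  }' "$1"
}

# Keep a copy of the file next to the original, with a .bak suffix.
backup_file() {
  cp -p "$1" "$1.bak"
}
```

For example, run `read_instance_guid "$SPLUNK_HOME/etc/instance.cfg"` and `backup_file "$SPLUNK_HOME/etc/instance.cfg"` with Splunk stopped, before clearing the file.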


fz
Explorer

Hi,

My guess is that your indexer is trying to register with the CM using the encrypted pass4SymmKey, since the indexer was already added to the CM with the old IP and the pass4SymmKey was encrypted at that point.

Can you try re-entering the pass4SymmKey value on the indexer and then restarting it?
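
To see which form the key is currently stored in, something like the sketch below could help. It assumes the key lives in a server.conf-style file and that encrypted values start with `$`, a digit, and `$` (for example `$1$` or `$7$`); both the function name `pass4symmkey_state` and that prefix rule are assumptions, and only the first pass4SymmKey line in the file is checked:

```shell
#!/usr/bin/env bash
# Hypothetical helper: report how pass4SymmKey is stored in a conf file.
# Prints "missing", "encrypted", or "plaintext".
pass4symmkey_state() {
  local value
  # Take the value of the first "pass4SymmKey = ..." line.
  value=$(sed -n 's/^[[:space:]]*pass4SymmKey[[:space:]]*=[[:space:]]*//p' "$1" | head -n 1)
  case "$value" in
    '')           echo "missing" ;;
    '$'[0-9]'$'*) echo "encrypted" ;;   # assumed prefix of an encrypted value
    *)            echo "plaintext" ;;
  esac
}
```

For example, `pass4symmkey_state "$SPLUNK_HOME/etc/system/local/server.conf"` on the indexer; the file location is also an assumption and may differ if the setting is deployed through an app.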

Hope this helps!

s2_splunk
Splunk Employee

I would have expected the same error message when you added the peer with the "wrong" IP address. Looks like the CM did not get the memo when you offlined it to change it back to the correct IP address.
Have you restarted the CM yet to see if that clears the error?


twinspop
Influencer

The CM shows all healthy. It looks like the error (displayed on the SHC) was old. I have deleted it and it hasn't returned.
