Getting Data In

Rolling restart of Cluster puts peer in restart loop

rturk
Builder

Hi All,

After fresh installs of Splunk (Windows, v5.0.4) I had (note: had) a fully functioning cluster that was happily replicating, and life was good.

After updating an app on the cluster master (removing extraneous text files from a directory), I kicked off the bundle deployment:

.\splunk.exe apply cluster-bundle

I then checked the status with the following command:

.\splunk.exe show cluster-bundle-status

Output:

Guid: 71F63992-BD86-4935-932E-24258A6A3CDD
  ServerName: IDX-A
  Status: Up
  Bundle Validation Status: Validation successful
  Latest Bundle: 1d6134c6cab9fd5a720516d8881a01a8
  Active Bundle: 37b2f885aeac2bbe59bfa95a7a4202fc

Guid: BC734690-BACE-41CC-812D-254085234EE5
  ServerName: IDX-B
  Status: Restarting
  Bundle Validation Status: Validation successful
  Latest Bundle: 1d6134c6cab9fd5a720516d8881a01a8
  Active Bundle: 37b2f885aeac2bbe59bfa95a7a4202fc

All well and good, but when I checked again not long after:

Guid: 71F63992-BD86-4935-932E-24258A6A3CDD
  ServerName: IDX-A
  Status: Restarting
  Bundle Validation Status: Validation successful
  Latest Bundle: 1d6134c6cab9fd5a720516d8881a01a8
  Active Bundle:

Guid: BC734690-BACE-41CC-812D-254085234EE5
  ServerName: IDX-B
  Status: Up
  Latest Bundle: 1d6134c6cab9fd5a720516d8881a01a8
  Active Bundle: 1d6134c6cab9fd5a720516d8881a01a8

The impact of this is:

  • The Active Bundle for IDX-A is now blank
  • The app directories in /slave-apps are now empty
  • IDX-A is in a restart loop; and
  • The splunkd.log on IDX-A indicates that the process is repeatedly being told to gracefully shut down (see the sketch after this list).
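
In case it helps anyone reading along, this is roughly how those messages can be pulled out on the peer. A sketch only: the log path assumes the default Windows install location (the same one visible in the master log further down), and the exact message text may vary between versions.

# Run in PowerShell on IDX-A; lists the repeated shutdown-related entries in splunkd.log
Select-String -Path "C:\Program Files\Splunk\var\log\splunk\splunkd.log" -Pattern "graceful", "Shutting down"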

This is not the first time this has happened... this fresh install is the result of the same thing happening previously and me taking the default "Reinstall & hope for the best" path... dammit.

Any and all suggestions greatly appreciated!

RT

EDIT #1: 10 minutes later and it's still happening.

EDIT #2: splunkd.log on the cluster master has this over & over again:

...
CMMaster - event=handleInputsQuiesced guid=71F63992-BD86-4935-932E-24258A6A3CDD
ClusterMasterPeerHandler - Add peer info replication_address=IDX-A forwarder_address= search_address= mgmtPort=8089 rawPort=9887 useSSL=false forwarderPort=0 forwarderPortUseSSL=true serverName=IDX-A activeBundleId= status=Up type=Initial-Add baseGen=0
CMMaster - event=removeOldPeer guid=71F63992-BD86-4935-932E-24258A6A3CDD hostport=IDX-A:8089 status=success
CMMaster - event=addPeer guid=71F63992-BD86-4935-932E-24258A6A3CDD replication_address=IDX-A forwarder_address= search_address= mgmtPort=8089 rawPort=9887 useSSL=false forwarderPort=0 forwarderPortUseSSL=true serverName=SE02SPL01LP activeBundleId= status=Up type=Initial-Add baseGen=0 bucket_count=0 
CMPeer - peer=71F63992-BD86-4935-932E-24258A6A3CDD transitioning from=Down to=Up reason="addPeer successful."
CMMaster - event=addPeer msg='Bundle mismatch; restarting peer. '
CMMaster - committing gen=121 numpeers=2
CMMaster - event=addPeer guid=71F63992-BD86-4935-932E-24258A6A3CDD status=success initialized=1 npeers=2 basegen=121
CMPeer - peer=71F63992-BD86-4935-932E-24258A6A3CDD transitioning from=Up to=Restarting reason="restart peer"
CMBundleServer - event=streamingbundle status=success file=C:\Program Files\Splunk\var\run\splunk\cluster\remote-bundle\4a483d66a10ab4976b2d984c9361d040-1382573311.bundle totalBytesWritten=3317760 checksum=1d6134c6cab9fd5a720516d8881a01a8 Content-Length=3317760
ClusterSlaveControlHandler - Bundle validation success reported by [71F63992-BD86-4935-932E-24258A6A3CDD] successful for bundleid=1d6134c6cab9fd5a720516d8881a01a8
CMMaster - event=handleShutdown guid=71F63992-BD86-4935-932E-24258A6A3CDD status=Restarting
CMPeer - peer=71F63992-BD86-4935-932E-24258A6A3CDD has started master-initiated restart
...
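
To see the same flapping from the master's side without tailing the log, the cluster CLI on the master can be queried as well. A sketch, run from the Splunk bin directory on the master; output fields may differ between versions:

.\splunk.exe list cluster-peers

This should show each peer's status (Up / Restarting) and, depending on version, its bundle information, matching what show cluster-bundle-status reports above.
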
1 Solution

rturk
Builder

Found the cause and solution here: http://answers.splunk.com/answers/82275/why-is-my-windows-cluster-peer-node-continually-restarting

Essentially, the directory permissions on /slave-apps/ on the search peer had been lost (why?) and the directory had been set to read-only. As per the link above, resetting the permissions allowed the Cluster Master to once again populate the directory with the required apps.
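
For anyone who finds this later, the fix on the affected peer was roughly the following shape. This is a sketch only: the path assumes the default Windows install location, the exact commands depend on how the permissions were broken in the first place, and you should check what ACLs your environment expects before resetting anything.

# Run on the peer (IDX-A): clear the read-only attribute on slave-apps and its contents,
# then reset the ACLs to the inherited defaults so splunkd can write to the directory again.
attrib -R "C:\Program Files\Splunk\etc\slave-apps\*" /S /D
icacls "C:\Program Files\Splunk\etc\slave-apps" /reset /T /C

Once the directory was writable again, the Cluster Master repopulated it with the required apps on the next bundle push and the peer stopped cycling.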
