Hi All,
After a fresh install of Splunk (Windows, v5.0.4) I had a fully functioning cluster that was happily replicating, and life was good.
After updating an app on the cluster master (removing extraneous text files from a directory) I kicked off the bundle deployment:
.\splunk.exe apply cluster-bundle
I then checked the status with the following command:
.\splunk.exe show cluster-bundle-status
Output:
Guid: 71F63992-BD86-4935-932E-24258A6A3CDD
ServerName: IDX-A
Status: Up
Bundle Validation Status: Validation successful
Latest Bundle: 1d6134c6cab9fd5a720516d8881a01a8
Active Bundle: 37b2f885aeac2bbe59bfa95a7a4202fc
Guid: BC734690-BACE-41CC-812D-254085234EE5
ServerName: IDX-B
Status: Restarting
Bundle Validation Status: Validation successful
Latest Bundle: 1d6134c6cab9fd5a720516d8881a01a8
Active Bundle: 37b2f885aeac2bbe59bfa95a7a4202fc
All well and good, but when I checked again not long after:
Guid: 71F63992-BD86-4935-932E-24258A6A3CDD
ServerName: IDX-A
Status: Restarting
Bundle Validation Status: Validation successful
Latest Bundle: 1d6134c6cab9fd5a720516d8881a01a8
Active Bundle:
Guid: BC734690-BACE-41CC-812D-254085234EE5
ServerName: IDX-B
Status: Up
Latest Bundle: 1d6134c6cab9fd5a720516d8881a01a8
Active Bundle: 1d6134c6cab9fd5a720516d8881a01a8
The impact of this is: the Active Bundle for IDX-A is now blank, and /slave-apps/ is now empty.
This is not the first time this has happened... this fresh install is the result of it happening previously and me taking the default "Reinstall & hope for the best" path... dammit.
Any and all suggestions greatly appreciated!
RT
EDIT #1: 10 minutes later and it's still happening.
EDIT #2: splunkd.log on the cluster master has this over & over again:
...
CMMaster - event=handleInputsQuiesced guid=71F63992-BD86-4935-932E-24258A6A3CDD
ClusterMasterPeerHandler - Add peer info replication_address=IDX-A forwarder_address= search_address= mgmtPort=8089 rawPort=9887 useSSL=false forwarderPort=0 forwarderPortUseSSL=true serverName=IDX-A activeBundleId= status=Up type=Initial-Add baseGen=0
CMMaster - event=removeOldPeer guid=71F63992-BD86-4935-932E-24258A6A3CDD hostport=IDX-A:8089 status=success
CMMaster - event=addPeer guid=71F63992-BD86-4935-932E-24258A6A3CDD replication_address=IDX-A forwarder_address= search_address= mgmtPort=8089 rawPort=9887 useSSL=false forwarderPort=0 forwarderPortUseSSL=true serverName=SE02SPL01LP activeBundleId= status=Up type=Initial-Add baseGen=0 bucket_count=0
CMPeer - peer=71F63992-BD86-4935-932E-24258A6A3CDD transitioning from=Down to=Up reason="addPeer successful."
CMMaster - event=addPeer msg='Bundle mismatch; restarting peer. '
CMMaster - committing gen=121 numpeers=2
CMMaster - event=addPeer guid=71F63992-BD86-4935-932E-24258A6A3CDD status=success initialized=1 npeers=2 basegen=121
CMPeer - peer=71F63992-BD86-4935-932E-24258A6A3CDD transitioning from=Up to=Restarting reason="restart peer"
CMBundleServer - event=streamingbundle status=success file=C:\Program Files\Splunk\var\run\splunk\cluster\remote-bundle\4a483d66a10ab4976b2d984c9361d040-1382573311.bundle totalBytesWritten=3317760 checksum=1d6134c6cab9fd5a720516d8881a01a8 Content-Length=3317760
ClusterSlaveControlHandler - Bundle validation success reported by [71F63992-BD86-4935-932E-24258A6A3CDD] successful for bundleid=1d6134c6cab9fd5a720516d8881a01a8
CMMaster - event=handleShutdown guid=71F63992-BD86-4935-932E-24258A6A3CDD status=Restarting
CMPeer - peer=71F63992-BD86-4935-932E-24258A6A3CDD has started master-initiated restart
...
Found the cause and solution here: http://answers.splunk.com/answers/82275/why-is-my-windows-cluster-peer-node-continually-restarting
Essentially, the directory permissions on /slave-apps/ on the search peer had been lost (why?) and the directory was set to read-only. As per the link above, resetting the permissions allowed the cluster master to once again populate the directory with the required apps.
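For anyone hitting the same thing, here's a sketch of the permission reset on the peer. The path assumes a default Windows install location (adjust for your environment), and you should run this from an elevated prompt with splunkd stopped on the peer:

```shell
:: Reset ACLs on slave-apps to the inherited defaults, recursively,
:: continuing past any errors (/reset = replace ACLs with inherited,
:: /T = recurse, /C = continue on error)
icacls "C:\Program Files\Splunk\etc\slave-apps" /reset /T /C

:: Clear the read-only attribute on everything under slave-apps
:: (/S = recurse into subdirectories, /D = process folders as well)
attrib -R "C:\Program Files\Splunk\etc\slave-apps\*" /S /D
```

After restarting splunkd on the peer, the cluster master should be able to push the bundle down and repopulate the directory.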