We have a 9-node search head cluster on 6.5.3 where, at any given time, 2 members stop running scheduled searches, which is causing a scaling issue.
On the captain I see the following message for the nodes that stop running scheduled searches:
05-27-2017 12:16:08.110 -0500 WARN SHCMaster - did not schedule removal for peer='56B4C21B-0B26-4BA2-826C-148E069F5FD0', err='SHPMaster::scheduleRemoveArtifactFromPeer_locked: aid=scheduler_adminbatch_RMD5ebd970f44716db9c_at_1495904700_9854_2E1C054F-9A8B-4D4A-BBC0-29F0562C7AED peer="xxxxx", guid="56B4C21B-0B26-4BA2-826C-148E069F5FD0" is pending some change, status='PendingDiscard''
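To see which members the captain is complaining about, and how often, the warning above can be tallied per peer GUID. This is only a rough sketch: the function name count_pending_discard is made up for illustration, and it assumes the message format matches the sample line above exactly.

```shell
#!/bin/sh
# Hypothetical helper: count SHCMaster "PendingDiscard" warnings per peer GUID.
# Assumes the log lines look like the sample warning quoted in this thread.
count_pending_discard() {
    log_file="$1"
    grep 'SHCMaster' "$log_file" \
        | grep "status='PendingDiscard'" \
        | sed -n "s/.*did not schedule removal for peer='\([^']*\)'.*/\1/p" \
        | sort | uniq -c | sort -rn \
        | awk '{print $1, $2}'   # normalize to "<count> <guid>"
}
```

On the captain it could be run as count_pending_discard "$SPLUNK_HOME/var/log/splunk/splunkd.log", which prints one "count GUID" line per peer, highest count first.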
I restart these nodes, but they stop running scheduled searches again a couple of hours later.
I cannot find anything in the docs or on Answers about this message. Do I just need to resync the baseline?
Thanks!
Try removing the members, cleaning them, initializing them, and rejoining them:
Removal: https://docs.splunk.com/Documentation/Splunk/6.5.2/DistSearch/Removeaclustermember
Clean/initialize/join: http://docs.splunk.com/Documentation/Splunk/6.5.2/DistSearch/Addaclustermember#Add_a_member_that_was...
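The remove, clean, initialize, and rejoin sequence could be sketched roughly as below, run on each affected member. Treat it as an outline, not the documented procedure. The function names, the run wrapper, and every URI, port, and secret are placeholders invented for illustration, and the linked docs decide which steps actually apply to your case. By default the script only prints the commands. Note that splunk clean all wipes the instance's data and configuration.

```shell
#!/bin/sh
# DRY_RUN=1 (default) only prints each command; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"
SPLUNK="${SPLUNK_HOME:-/opt/splunk}/bin/splunk"

# Placeholder values (assumptions); replace with your own.
: "${MEMBER_URI:=https://sh3.example.com:8089}"          # this member
: "${CURRENT_MEMBER_URI:=https://sh1.example.com:8089}"  # a healthy member
: "${DEPLOYER_URI:=https://deployer.example.com:8089}"
: "${REPLICATION_PORT:=34567}"
: "${SHC_SECRET:=replace-me}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# 1. Remove this member from the cluster (run on the member itself).
remove_member() {
    run "$SPLUNK" remove shcluster-member
}

# 2. Stop, clean, and start the instance. 'clean all' is destructive.
clean_member() {
    run "$SPLUNK" stop
    run "$SPLUNK" clean all
    run "$SPLUNK" start
}

# 3. Initialize the instance as a cluster member, then restart it.
init_member() {
    run "$SPLUNK" init shcluster-config \
        -mgmt_uri "$MEMBER_URI" \
        -replication_port "$REPLICATION_PORT" \
        -conf_deploy_fetch_url "$DEPLOYER_URI" \
        -secret "$SHC_SECRET"
    run "$SPLUNK" restart
}

# 4. Rejoin the cluster by pointing at any current member.
join_member() {
    run "$SPLUNK" add shcluster-member -current_member_uri "$CURRENT_MEMBER_URI"
}
```

Calling remove_member, clean_member, init_member, and join_member in that order with DRY_RUN=1 prints the full command sequence for review before anything is executed.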
Did you try running
splunk resync shcluster-replicated-config
on the two affected SH cluster members? I found one known issue in 6.5.3, but I'm not sure it matches your problem. If a resync doesn't fix it, you may want to contact Support.
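A minimal sketch of the resync, run on each affected member, followed by a status check. The resync_member function and the DRY_RUN wrapper are made up for illustration, and the default install path is an assumption. Be aware that a resync makes the member drop its local replicated changes and take the captain's copy.

```shell
#!/bin/sh
# DRY_RUN=1 (default) only prints each command; set DRY_RUN=0 to execute it.
DRY_RUN="${DRY_RUN:-1}"
SPLUNK="${SPLUNK_HOME:-/opt/splunk}/bin/splunk"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: resync this member's replicated configuration
# from the captain, then show the cluster status to confirm it is healthy.
resync_member() {
    run "$SPLUNK" resync shcluster-replicated-config
    run "$SPLUNK" show shcluster-status
}
```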
Skalli
I tried that and also added executor_workers = 20 to server.conf, but nothing changed. I have an open case and am hoping for a response soon. Thanks!