We are now quite often seeing the Splunk Search Head simply stop responding. We try running splunk stop or splunk restart, and neither works. When we grep the process list, we see that this process never exits:
python -O /opt/splunk/lib/python2.6/site-packages/splunk/appserver/mrsparkle/root
We have to kill this process with kill -9 and then run splunk start, after which all is OK again. We are running Splunk 4.2.1 x86_64 on 64-bit RHEL 5.
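For anyone hitting the same thing, the manual recovery described above could be scripted roughly as follows. This is only a sketch under assumptions: the /opt/splunk default and the match pattern are taken from the path shown in this post, the function names (filter_pids, hung_pids, recover) are made up for illustration, and the 10-second wait is arbitrary. It works around the hang and does not address whatever causes it.

```shell
#!/bin/sh
# Sketch only, not an official Splunk procedure.
# Assumptions: SPLUNK_HOME defaults to /opt/splunk and PATTERN is the
# fragment of the command line quoted in the post above.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"
PATTERN="splunk/appserver/mrsparkle/root"

# Read "PID COMMAND..." lines on stdin and print the PID of every line
# whose text contains the given fixed string.
filter_pids() {
    awk -v pat="$1" 'index($0, pat) > 0 { print $1 }'
}

# List the PIDs of running processes that match PATTERN.
# The ps snapshot is taken before awk starts, so awk never matches itself.
hung_pids() {
    snapshot=$(ps -eo pid,args)
    printf '%s\n' "$snapshot" | filter_pids "$PATTERN"
}

# Send TERM first, fall back to KILL for anything still alive,
# then start Splunk again.
recover() {
    pids=$(hung_pids)
    if [ -z "$pids" ]; then
        echo "no matching process found"
        return 0
    fi
    kill $pids 2>/dev/null      # unquoted on purpose: one argument per PID
    sleep 10                    # arbitrary grace period
    for pid in $pids; do
        kill -0 "$pid" 2>/dev/null && kill -9 "$pid"
    done
    "$SPLUNK_HOME/bin/splunk" start
}
```

The script only defines functions. Running hung_pids on its own lists the candidate PIDs without touching anything, and recover would need to be called explicitly by a user allowed to signal the Splunk processes.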
Any idea what might be causing this? I have not seen anything interesting in splunkd.log, and no ERROR entries that relate to this.
I'm not sure whether this is related, but we've seen several instances (about 3 times in the past 6 months, so all on version 4.2.x) where our search head box has essentially hung completely. If we were fortunate enough to be logged into the box via SSH at the time, recovery was possible by stopping Apache and Splunk completely and restarting them. If not, we couldn't SSH into the box or even bring up the console via KVM, and a manual physical reset of the box was the only recourse. To the original poster: you aren't running single sign-on by any chance, are you? We have Splunk running behind a local Apache proxy using RSA SecurID for auth, hence the need to restart Splunk AND Apache. We have never been able to identify anything out of the ordinary after the reboot either, other than a segfault in the SecurID module about 20 minutes before the box became completely unresponsive.
We did not have any hardware issues or CPU/memory problems on the server; it was just that the process I mentioned hung and led to the problems I described. Splunk is still running but cannot be contacted. It also seems to occur on our newer search head running 4.3.