Monitoring Splunk

Why is splunkd exiting with return code 2 when run in the foreground (--nodaemon)?

luhadia_aditya
Path Finder

Scenario - We have two forwarders, both configured into an HA cluster using Heartbeat for failover. Both forwarders have syslog-ng listening on port 514, filtering and writing data into their respective log files, which are then monitored by Splunk. We have added the syslog-ng, splunkd, and splunkweb services to the Heartbeat resource configuration so that when any of these services goes down, Heartbeat issues a respawn and restarts it:

/etc/ha.d/ha.cf

### Start the Syslog Service ### 
respawn root /sbin/syslog-ng -F
### Start the Splunk Services ### 
respawn splunk /opt/splunk/bin/splunk start splunkd --nodaemon --no-prompt --answer-yes 
respawn splunk /opt/splunk/bin/splunk start splunkweb --nodaemon --no-prompt --answer-yes

The --nodaemon parameter runs the Splunk services in the foreground, so the starting processes never exit; Heartbeat can keep an eye on those processes to make sure the Splunk services are up, and respawn them whenever they exit.
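As an aside, the same respawn mechanism could be driven through a small wrapper script that also records each exit code, which makes it easier to correlate respawns with splunkd's own logs. A minimal sketch (the script name and exit-log path are placeholders, not part of the actual setup):

#!/bin/sh
# heartbeat-splunkd.sh - hypothetical wrapper, pointed at by the Heartbeat line:
#   respawn splunk /opt/splunk/heartbeat-splunkd.sh
# Runs splunkd in the foreground, then logs the exit code and time before
# exiting with the same code, so Heartbeat still sees the real return code.

EXITLOG=/opt/splunk/var/log/splunk/respawn-exits.log   # placeholder path

/opt/splunk/bin/splunk start splunkd --nodaemon --no-prompt --answer-yes
rc=$?
echo "$(date '+%b %d %H:%M:%S') splunkd exited with code $rc" >> "$EXITLOG"
exit $rc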

And in case of a network or node failure, traffic is switched over to the secondary forwarder.

This configuration works like a charm for syslog-ng; the service gets restarted by Heartbeat in less than a second.

Question/Problem -
But for the splunkd and splunkweb services, we see the start process exit with various return codes (at intervals of 3 to 4 minutes), after which Heartbeat respawns it. Although this does achieve restart-by-Heartbeat, it makes the solution unstable, since the Splunk services are being restarted regularly.

Here are the Heartbeat logs:

Nov 20 01:27:17 <hostname> heartbeat: [21234]: WARN: Managed /opt/splunk/bin/splunk start splunkweb --nodaemon --no-prompt --answer-yes process 21361 exited with return code 1.
Nov 20 01:27:17 <hostname> heartbeat: [21234]: ERROR: Respawning client "/opt/splunk/bin/splunk start splunkweb --nodaemon --no-prompt --answer-yes":
Nov 20 01:27:17 <hostname> heartbeat: [21234]: info: Starting child client "/opt/splunk/bin/splunk start splunkweb --nodaemon --no-prompt --answer-yes" (500,500)
Nov 20 01:27:18 <hostname> heartbeat: [21901]: info: Starting "/opt/splunk/bin/splunk start splunkweb --nodaemon --no-prompt --answer-yes" as uid 500  gid 500 (pid 21901) 

Nov 20 01:54:37 <hostname> heartbeat: [21234]: WARN: Managed /opt/splunk/bin/splunk start splunkd --nodaemon --no-prompt --answer-yes process 30173 exited with return code 2.
Nov 20 01:54:37 <hostname> heartbeat: [21234]: ERROR: Respawning client "/opt/splunk/bin/splunk start splunkd --nodaemon --no-prompt --answer-yes":
Nov 20 01:54:37 <hostname> heartbeat: [21234]: info: Starting child client "/opt/splunk/bin/splunk start splunkd --nodaemon --no-prompt --answer-yes" (500,500)
Nov 20 01:54:37 <hostname> heartbeat: [31569]: info: Starting "/opt/splunk/bin/splunk start splunkd --nodaemon --no-prompt --answer-yes" as uid 500  gid 500 (pid 31569)

These bursts keep occurring at intervals of 3 to 5 minutes.

Please help me understand if I am doing anything wrong. Any help is appreciated!
Thanks for your time.

jrodman
Splunk Employee

Splunkd's exit codes are not well-organized. You'd do better to consider all non-zero exit codes as "something was not good" and use the logs to investigate.

The heartbeat logs won't tell you anything, but splunkd.log may.
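For instance, assuming the default install path, something like this would show what splunkd said just before an exit (match against the timestamps in the Heartbeat log):

# Last messages splunkd wrote before the most recent exit:
tail -n 50 /opt/splunk/var/log/splunk/splunkd.log

# Or pull out error/fatal lines near the respawn timestamps:
grep -E "FATAL|ERROR" /opt/splunk/var/log/splunk/splunkd.log | tail -n 20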

cramasta
Builder

Hey Aditya,
I'm not experienced in setting up HA the way you are trying, or in running with --nodaemon (I was told to use --nodaemon once, but that was for troubleshooting purposes only).

If only one host is going to be collecting logs (whichever has syslog-ng running), you could do a cheap workaround by setting up a crontab job to try and start the Splunk forwarder every x minutes on both hosts. If the forwarder is already running it won't make a difference, as it knows it's already running. Only one host will be actively collecting logs via syslog-ng, so it shouldn't hurt anything if both forwarders are running at the same time.

A crontab entry like this will try to start the forwarder every 5 minutes if it is down. Even if the forwarder is down for up to 5 minutes, it will know where it left off on the files it was monitoring.
*/5 * * * * /opt/splunk/bin/splunk start
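If you'd rather not invoke splunk start unconditionally, a variant would be to gate it on splunk status. This assumes splunk status exits non-zero when splunkd is down, which is worth verifying on your version first:

# Hypothetical variant: only attempt a start when splunkd is not running.
*/5 * * * * /opt/splunk/bin/splunk status >/dev/null 2>&1 || /opt/splunk/bin/splunk start --no-prompt --answer-yes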

luhadia_aditya
Path Finder

Hey Joe, good to see you here!!

Your suggestion works, but we need to use Heartbeat itself to restart Splunk, due to design constraints.

I understand. If Splunk is down for 5-10 minutes, that's absolutely OK, as the received data is written into the log files anyway and will be ingested by Splunk once it's back up.

Do you have any clue as to how Splunk behaves differently in the foreground (--nodaemon) compared to the usual daemon mode?

Lucas_K
Motivator

What version of Splunk is that? I'd never seen the --nodaemon option listed anywhere before.

It also doesn't show up in a Google search.

luhadia_aditya
Path Finder

It's Splunk 6.0.4 (build 207768). This parameter is not documented, but it was given in an answer on the community - http://answers.splunk.com/answers/67442/can-run-splunkd-in-foreground.html

Lucas_K
Motivator

Nice. As for your question: are you sure the process isn't actually just dying due to some other "normal" Splunk issue?

I had similar issues with the prior version (specifically 6.0.3; we skipped 6.0.4), so it's possible this is unrelated to your high-availability configuration and occurs with a normal installation as well.

luhadia_aditya
Path Finder

This behaviour was observed only after configuring Heartbeat. Before that, we were using the same heavy forwarder without HA and never had any issue with the Splunk services exiting in this manner.
