Good unix way check if splunkd and splunkweb are running

jeffoptimizely · ‎11-16-2011

What's a good Unix-y way to check whether splunkd and splunkweb are running? (I know the bin/splunk command does this before restarting)

I want to run a cronjob that restarts splunkd and/or splunkweb if one of them goes down.

Maybe something combined with "ps -ef | grep splunk"? It'd be nice if someone from Splunk could just look in the code for how the binary "splunk" command checks whether a service is already running.

Thanks,
Jeff

fairje · ‎09-02-2015

I know this is an older question, but I recently had to tackle this issue because Splunk was randomly crashing on me (I still haven't gotten to the root of it). The box that keeps crashing is one of my heavy forwarders, so I needed a way to force Splunk to start back up after it died, before I started losing logs.

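#!/bin/bash
# bash rather than sh: the script uses the "function" keyword and (( )) arithmetic

## ############################################# ##
## Splunk Health Checking Script to run hourly   ##
## This will run some basic checks to ensure     ##
## splunk is running and restart those services  ##
## if it fails a check.                          ##
## ############################################# ##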

service=splunk

# Error handling function
function errorCheck {
        if [ $? -ne 0 ] ; then
                echo "Error occurred connecting on port 8089 for $service"
                /etc/init.d/$service start
        fi
}


# check for the processes to be running
if (( $(ps -ef | grep -v grep | grep $service | wc -l) > 0 )); then
       echo "$service is running!!!"
else
       /etc/init.d/$service start
fi

# check for the service itself to be running
# sometimes the service can crash leaving stale PID's running
if (( $(/etc/init.d/$service status | grep "splunkd is running" | wc -l) > 0 )) ; then
       echo "$service is running!!!"
else
       /etc/init.d/$service start
fi

# check if we can connect locally on port 8089
/usr/bin/curl -s -k -o "/dev/null" https://127.0.0.1:8089
errorCheck

Note that because I am running this on a heavy forwarder, it is not running Splunk Web, so I haven't tested the script on an instance that does run Splunk Web and can't say whether it would need to be tweaked there. Also, standard disclaimer: please test this script in a safe way, as I take no responsibility for any impact of using it. It worked for me, but YMMV.

So just to break it down, I took the recommendations from other people's answers here to go through the checks that would be needed to ensure it is fully working.

First:

$(ps -ef | grep -v grep | grep $service | wc -l) > 0

This runs the command recommended above with a twist: I pipe it into wc so it just tells me how many lines are returned. If there are no running splunk processes the count is 0, and in that case we need to start splunk.

Second:

$(/etc/init.d/$service status | grep "splunkd is running" | wc -l) > 0

Same concept, this time doing a service status check. Note that Splunk needs to be registered with your system as a service for this to work. That is done as part of the "enable boot-start" command on the splunk CLI; alternatively, you could point to the splunk binary directly instead. If splunkd isn't running, the count should be 0.

Lastly:

/usr/bin/curl -s -k -o "/dev/null" https://127.0.0.1:8089
 errorCheck

This uses curl; you could also use wget instead. I opted for curl because it was installed by default on my RHEL image whereas wget was not, and I wanted to avoid adding extra software to the system. The principle is that curl runs silently ("-s"). I added the "-k" option because you are calling Splunk by the loopback IP, and certificate validation will not like that and will error out on you. We aren't trying to validate the certificates, just see whether Splunk is running (wget has a similar option if you need to switch to that). Lastly, I send the output of the call to /dev/null so it doesn't get saved anywhere; I don't need the page itself, it is the exit code I care about.

Which brings me to the "errorCheck" function. It checks that curl exited with 0, meaning success. You can go find what the other codes mean; I didn't care, since anything other than 0 is a problem, which means we should start the splunk service.
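For anyone who does want to know why the curl check failed, here is a rough sketch that maps a few of curl's documented exit codes to a hint. The function names describe_curl_exit and probe_mgmt_port are made up for this example, and the 10-second --max-time value is only an assumption to keep a hung splunkd from blocking the cron job.

```shell
#!/bin/bash
# Sketch only: translate the curl exit codes most relevant to this check.
# describe_curl_exit and probe_mgmt_port are hypothetical helper names.
describe_curl_exit() {
    case "$1" in
        0)  echo "ok" ;;
        6)  echo "could not resolve host" ;;
        7)  echo "failed to connect" ;;
        28) echo "operation timed out" ;;
        35) echo "SSL connect error" ;;
        *)  echo "curl failed with exit code $1" ;;
    esac
}

probe_mgmt_port() {
    # $1 = URL to probe, e.g. https://127.0.0.1:8089
    local rc=0
    # --max-time (assumed value: 10s) bounds the whole request
    curl -s -k --max-time 10 -o /dev/null "$1" || rc=$?
    describe_curl_exit "$rc"
    return "$rc"
}
```

It could be used in place of the curl/errorCheck pair, for example: probe_mgmt_port https://127.0.0.1:8089 || /etc/init.d/splunk start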

Hope this helps!

pivotaltracker · ‎11-05-2014

FWIW, we hooked the splunk pid file and init script up to monit, and it worked fine (even though it contains two pids, monit seems to be smart enough to only grab the first one).

rooney · ‎12-16-2011

I'm using Nagios with the stock check_procs nagios plugin along with check_listen_tcp_udp. You can use NRPE to check from a Nagios server to any systems with Splunk instances. So in nrpe.cfg I have:

command[check_splunk_indexer_proc]=/apps/tools/nagios/libexec/check_procs -c 1:1 -C splunkd -u {USER} -s Ss -a '-p {MGMTPORT}'
command[check_splunk_indexer_mgmtport]=/apps/tools/nagios/libexec/custom/check_listen_tcp_udp.sh -p {MGMTPORT} -P tcp
command[check_splunk_indexer_webport]=/apps/tools/nagios/libexec/custom/check_listen_tcp_udp.sh -p {WEBPORT} -P tcp

Just replace {USER} with the user you run Splunk as and fill in your {MGMTPORT} and {WEBPORT}.

You could take it a step further and use an event handler with a simple script to automatically restart Splunk if it is found to not be running.
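A minimal sketch of what such an event handler script might look like. It assumes Nagios is configured to pass the standard $SERVICESTATE$ and $SERVICESTATETYPE$ macros as the first two arguments, and that Splunk is registered as a service named "splunk"; the function name decide_action is made up for this example.

```shell
#!/bin/bash
# Hypothetical Nagios event handler sketch.
# Assumed invocation: <this script> $SERVICESTATE$ $SERVICESTATETYPE$
decide_action() {
    # $1 = service state (OK, WARNING, UNKNOWN, CRITICAL)
    # $2 = state type (SOFT, HARD)
    # Restart only once Nagios has confirmed the failure (HARD state).
    if [ "$1" = "CRITICAL" ] && [ "$2" = "HARD" ]; then
        echo "restart"
    else
        echo "none"
    fi
}

# Act only when the script is called with both arguments.
if [ "$#" -ge 2 ] && [ "$(decide_action "$1" "$2")" = "restart" ]; then
    service splunk start   # assumes "splunk enable boot-start" was run
fi
```

Waiting for a HARD state avoids restarting Splunk on a single failed check; whether that is the right trade-off depends on your check interval.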

rooney · ‎12-21-2011

They all run with different mgmt ports, so you use the port they run on to differentiate. For example, here are two instances on the same host; one uses 8089 and the other 8092 for the management port:

$ ps x | grep -i splunkd
1630 ? Sl 69477:29 splunkd -p 8092 start
1631 ? Ss 10:42 splunkd -p 8092 start
7146 ? Sl 7200:38 splunkd -p 8089 restart
7147 ? Ss 13:03 splunkd -p 8089 restart

So with check_procs you use -a '-p 8089' for one instance and -a '-p 8092' for the other. Similarly, check_listen_tcp_udp.sh can be used to make sure splunkd is listening on the proper port.
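If you are not using Nagios, the same port-based matching can be sketched in plain shell. The helper below is hypothetical (count_splunkd_for_port is not a Splunk or Nagios command); it reads ps output on stdin and counts the lines for one management port, assuming the command line looks like the "splunkd -p PORT ..." lines shown above.

```shell
#!/bin/bash
# Sketch: count splunkd processes belonging to one management port.
# Reads ps output on stdin; count_splunkd_for_port is a hypothetical name.
count_splunkd_for_port() {
    # $1 = management port.
    # The needle is assembled inside awk, so awk's own command line never
    # contains "splunkd -p <port> " and cannot match itself in the ps output.
    # A space is appended to each line and to the needle so that port 8089
    # does not match 80891, even when the port is the last word on the line.
    awk -v port="$1" '
        index($0 " ", "splunkd -p " port " ") { n++ }
        END { print n + 0 }
    '
}
```

Usage would be something like: ps x | count_splunkd_for_port 8089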

mfrost8 · ‎12-16-2011

The problem I have with this and some of the other approaches is if you have more than one Splunk instance on the box. Say you've got a Splunk indexer and a Splunk deployment server on the machine. They all show up as splunkd and you can't really differentiate them with 'ps' or with check_procs.

I would like to go the route of reading the pids from the pidfiles (seems most direct), but the permissions on the default locations prevent all users except the splunk user or root from reading the dirs. The pid files are also not world-readable, making that hard too.

jsb22 · ‎11-21-2011

For linux, you could try:

service splunk status

Don't know if that works in Unix or not. For a forwarder in my case, it'll give you output along the lines of:

Splunk status:
Splunkd is running (PID: 7523).
splunk helpers are running (PIDs: 7524).
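To act on that output from a script, a small sketch like the following could be used. The helper name status_reports_running is made up here; it reads the status text on stdin and matches case-insensitively, because the capitalisation of "splunkd" differs between the outputs quoted in this thread.

```shell
#!/bin/bash
# Sketch: succeed if the status text on stdin says splunkd is running.
# status_reports_running is a hypothetical helper name.
status_reports_running() {
    # -q: no output, only an exit status; -i: ignore case
    grep -q -i "splunkd is running"
}
```

For example: service splunk status | status_reports_running || service splunk start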

MHibbin · ‎11-21-2011

If you use "ps -ef", you might be better off grepping for "splunkd", and depending on your system you may have to pipe to another grep to exclude the grep command itself from the output, e.g.:

ps -ef | grep splunkd | grep -v grep

And if you have multiple instances running on the same box, you may wish to explicitly define the port. e.g...

ps -ef | grep "splunkd -p 8091" | grep -v grep

Are you also aware you can use

$SPLUNK_HOME/bin/splunk status

Which will check the status of both the mgmt daemon and the web daemon. If the Splunk instance is not running, using this command will also clear the "splunkd.pid" file (mentioned above). The "wget" idea is quite good too (mentioned above). e.g...

wget http://127.0.0.1:8001

In relation to "jsb22"'s response... this may not work straight away; you may have to run the following command first:

$SPLUNK_HOME/bin/splunk enable boot-start

This will add it to init.d (rc#.d), or to remove simply replace enable with disable....

$SPLUNK_HOME/bin/splunk disable boot-start

Running "$SPLUNK_HOME/bin/splunk status" with no further arguments is the equivalent of doing both of the following (including stopping the Splunk helpers if they are still running, and clearing stale pid files)...

$SPLUNK_HOME/bin/splunk status splunkd
  and
$SPLUNK_HOME/bin/splunk status splunkweb

rajaguru2790 · ‎10-28-2019

Hi,

Can someone please help me? I ran the above script on a heavy forwarder, but Splunk is not starting when it is down.

I stopped Splunk and ran the script manually:
chmod +x filename
./filename

Then I get output like the below, starting with "Starting splunk (via systemctl):", but Splunk does not actually start. Please help me with this.

It is a Unix machine, and I tried $ service splunk status but it failed.

Starting splunk (via systemctl): [ OK ]
Starting splunk (via systemctl): [ OK ]
Error occurred connecting on port 8089 for splunk
Starting splunk (via systemctl): [ OK ]
Starting splunk (via systemctl): [ OK ]
Starting splunk (via systemctl): [ OK ]
Error occurred connecting on port 8089 for splunk
Starting splunk (via systemctl): [ OK ]

joshd · ‎11-16-2011

Do you want to check if it's running or responsive? Those are two different things...

If you just want to know whether it's running, write a quick script to grab the PIDs from $SPLUNK_HOME/var/run/splunk/splunkd.pid and make sure there are two instances running; if there's no file, that also means it's down.

You may also wish to include in your script a simple wget call to make sure the login page is returned, just to verify it's responding to requests while it's running.
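A rough sketch of the PID-file check described above. The helper name pidfile_alive is made up; it assumes the file holds one PID per line and that the script runs as the splunk user or root, since "kill -0" also fails for processes owned by another user.

```shell
#!/bin/bash
# Sketch: succeed only if the PID file exists and every PID in it is alive.
# pidfile_alive is a hypothetical helper name.
pidfile_alive() {
    local pidfile=$1 pid seen=0
    # A missing or unreadable file is treated as "down"
    [ -r "$pidfile" ] || return 1
    # "|| [ -n ... ]" also handles a last line without a trailing newline
    while read -r pid || [ -n "$pid" ]; do
        [ -n "$pid" ] || continue                # skip blank lines
        # Signal 0 sends nothing; it only tests that the process exists
        kill -0 "$pid" 2>/dev/null || return 1
        seen=1
    done < "$pidfile"
    # An empty file also counts as "down"
    [ "$seen" -eq 1 ]
}
```

For example: pidfile_alive "$SPLUNK_HOME/var/run/splunk/splunkd.pid" || "$SPLUNK_HOME/bin/splunk" start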
