Monitoring Splunk

(Troubleshooting) Indexer became unresponsive today; rebooting the server fixed it. A number of splunkd processes are dying and starting back up. Is this normal behavior?

dpanych
Communicator

One of our six indexers was unresponsive today. I couldn't log in through the web interface, and ssh'ing to the server was very slow. I figured it was an OS problem, rebooted the server, and things seem to be clear now. While looking at the logs, I noticed a number of splunkd processes dying. Is that normal? The server OS is RHEL 7.x.

Dec 28 10:56:21 PRDSRV01 systemd[1]: systemd-journald.service: got WATCHDOG=1
Dec 28 10:56:23 PRDSRV01 systemd[1]: Received SIGCHLD from PID 120270 (splunkd).
Dec 28 10:56:23 PRDSRV01 systemd[1]: Child 120270 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:25 PRDSRV01 systemd[1]: Received SIGCHLD from PID 120277 (splunkd).
Dec 28 10:56:25 PRDSRV01 systemd[1]: Child 120277 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:27 PRDSRV01 systemd[1]: Received SIGCHLD from PID 121218 (splunkd).
Dec 28 10:56:27 PRDSRV01 systemd[1]: Child 121218 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:29 PRDSRV01 systemd[1]: Received SIGCHLD from PID 121211 (splunkd).
Dec 28 10:56:29 PRDSRV01 systemd[1]: Child 121211 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:31 PRDSRV01 systemd[1]: Received SIGCHLD from PID 121697 (splunkd).
Dec 28 10:56:31 PRDSRV01 systemd[1]: Child 121697 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:32 PRDSRV01 systemd[1]: Received SIGCHLD from PID 121431 (splunkd).
Dec 28 10:56:32 PRDSRV01 systemd[1]: Child 121431 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:34 PRDSRV01 systemd[1]: Received SIGCHLD from PID 121228 (splunkd).
Dec 28 10:56:34 PRDSRV01 systemd[1]: Child 121228 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:36 PRDSRV01 systemd[1]: Received SIGCHLD from PID 122819 (splunkd).
Dec 28 10:56:36 PRDSRV01 systemd[1]: Child 122819 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:38 PRDSRV01 systemd[1]: Received SIGCHLD from PID 121324 (splunkd).
Dec 28 10:56:38 PRDSRV01 systemd[1]: Child 121324 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:40 PRDSRV01 systemd[1]: Received SIGCHLD from PID 120159 (splunkd).
Dec 28 10:56:40 PRDSRV01 systemd[1]: Child 120159 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:42 PRDSRV01 systemd[1]: Received SIGCHLD from PID 120296 (splunkd).
Dec 28 10:56:42 PRDSRV01 systemd[1]: Child 120296 (splunkd) died (code=exited, status=255/n/a)
Dec 28 10:56:44 PRDSRV01 systemd[1]: Received SIGCHLD from PID 123182 (splunkd).
Dec 28 10:56:44 PRDSRV01 systemd[1]: Child 123182 (splunkd) died (code=exited, status=255/n/a) 

Masa
Splunk Employee

It is difficult to say what caused the issue.

But, according to your description,

One of the six indexers we have were unresponsive today. I couldn't login
through the web interface and ssh'ing to the server was very slow. I figured
it's an OS problem, rebooted the server, and things seem to be clear.

I believe the Splunk processes were also affected by the system resource/performance issue. Potentially, the main splunkd and its child splunkd processes could not communicate at all and died.
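
To get a feel for how often the children were exiting, the systemd lines from the question can be tallied per minute and per exit status. This is only a rough sketch: the regular expression is written against the "Child ... (splunkd) died" lines pasted above, and the `summarize` helper is a made-up name, not anything that ships with Splunk or systemd.

```python
import re
from collections import Counter

# Matches lines shaped like the ones pasted in the question, e.g.
# "Dec 28 10:56:23 PRDSRV01 systemd[1]: Child 120270 (splunkd) died (code=exited, status=255/n/a)"
DIED = re.compile(
    r"^(?P<minute>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+systemd\[1\]:\s+"
    r"Child\s+(?P<pid>\d+)\s+\(splunkd\)\s+died\s+"
    r"\(code=(?P<code>\w+),\s+status=(?P<status>[^)]+)\)"
)


def summarize(lines):
    """Count splunkd child exits per minute and per exit status."""
    per_minute = Counter()
    per_status = Counter()
    for line in lines:
        match = DIED.search(line)
        if match:
            per_minute[match.group("minute")] += 1
            per_status[match.group("status")] += 1
    return per_minute, per_status


# A few lines copied from the excerpt in the question.
sample = [
    "Dec 28 10:56:21 PRDSRV01 systemd[1]: systemd-journald.service: got WATCHDOG=1",
    "Dec 28 10:56:23 PRDSRV01 systemd[1]: Received SIGCHLD from PID 120270 (splunkd).",
    "Dec 28 10:56:23 PRDSRV01 systemd[1]: Child 120270 (splunkd) died (code=exited, status=255/n/a)",
    "Dec 28 10:56:25 PRDSRV01 systemd[1]: Child 120277 (splunkd) died (code=exited, status=255/n/a)",
]

per_minute, per_status = summarize(sample)
print(per_minute)
print(per_status)
```

Comparing the per-minute counts from before and after the reboot would show whether the exits are tied to the incident or are part of the normal background pattern on this host.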

mattymo
Splunk Employee

What is the ulimit setting for the user running Splunk on this server?

Crashes caused by ulimits usually leave a crash file behind, and I believe you said there are none, but it is worth a look.

ulimit -a
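
Keep in mind that `ulimit -a` reports the limits of the shell you run it in, which may differ from what the running splunkd actually got. On Linux the effective limits of a live process are exposed in /proc/<pid>/limits. Here is a minimal sketch for reading that table; `parse_limits` and `read_limits` are hypothetical helper names, the sample values are illustrative and not taken from your server, and you would have to supply the PID of the main splunkd yourself.

```python
import re


def parse_limits(text):
    """Parse the table layout of /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def read_limits(pid):
    """Read the limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_limits(handle.read())


# Illustrative sample only, NOT values from the affected indexer.
sample = """\
Limit                     Soft Limit           Hard Limit           Units
Max open files            1024                 4096                 files
Max processes             4096                 4096                 processes
"""

limits = parse_limits(sample)
print(limits["Max open files"])
```

On the indexer itself you would call `read_limits(pid)` with the PID of the main splunkd process and compare "Max open files" and "Max processes" against what `ulimit -a` shows for the Splunk user.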

Also be sure to check splunkd.log for any errors or warnings.
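
If the log is large, a quick tally by level and component can narrow things down. The sketch below assumes each splunkd.log line starts with a date, a time, and a timezone offset, followed by the level and the component name; adjust the pattern if your lines look different. `count_problems` is a made-up helper name, and the sample lines are placeholders rather than real splunkd output.

```python
import re
from collections import Counter

# Assumed layout: "<date> <time> <tz> <LEVEL> <Component> - <message>"
LINE = re.compile(
    r"^\S+\s+\S+\s+\S+\s+(?P<level>ERROR|WARN|FATAL)\s+(?P<component>\S+)"
)


def count_problems(lines):
    """Count WARN/ERROR/FATAL lines, grouped by (level, component)."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:
            counts[(match.group("level"), match.group("component"))] += 1
    return counts


# Placeholder lines in the assumed layout, not real log output.
sample = [
    "01-01-2000 10:55:01.000 -0500 INFO  ExampleComponent - placeholder message",
    "01-01-2000 10:55:02.000 -0500 WARN  ExampleComponent - placeholder message",
    "01-01-2000 10:55:03.000 -0500 ERROR OtherComponent - placeholder message",
    "01-01-2000 10:55:04.000 -0500 WARN  ExampleComponent - placeholder message",
]

counts = count_problems(sample)
for (level, component), n in counts.most_common():
    print(level, component, n)
```

To run it against the real file, open $SPLUNK_HOME/var/log/splunk/splunkd.log and pass the file object to `count_problems`.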

- MattyMo

alemarzu
Motivator

Hi @dpanych,

Is there any crash log in $SPLUNK_HOME/var/log/splunk?

EDIT: path


dpanych
Communicator

I do not see anything at that location, but I found some crash logs dated from the beginning of 2016 (crash-2016-xxxx) in $SPLUNK_HOME/var/log/splunk; I think those are irrelevant. Is that behavior normal while the indexer processes and indexes data?


alemarzu
Motivator

Oh, my bad, /var/log/splunk it is. That's not normal for sure.
