I have several indexers in my cluster that are not producing introspection data, despite being identically configured to their cluster mates. This is causing my DMC to show unknown status (question marks) for these indexers on several DMC status panels. Why?
My indexers are running on RHEL 6.
(Warning: Medium length troubleshooting section follows, enjoy or skip to the Teal Deer at the end)
Normally, introspection data is collected by a helper process that splunkd launches as a child at startup. The helper is a copy of splunkd run with the argument "instrument-resource-usage". On some hosts we see this process:
$ ps -ef | grep splunk | grep -v search
splunk 18731 1 36 Feb26 ? 6-03:18:49 splunkd -p 8089 restart
splunk 18732 18731 0 Feb26 ? 01:28:01 [splunkd pid=18731] splunkd -p 8089 restart [process-runner]
splunk 18957 18732 0 Feb26 ? 00:39:53 /opt/splunk/bin/splunkd instrument-resource-usage -p 8089 --with-kvstore
On some we don't:
$ ps -ef | grep splunkd | grep -v "search"
splunk 30198 1 39 02:44 ? 03:07:52 splunkd -p 8089 restart
splunk 30199 30198 0 02:44 ? 00:02:00 [splunkd pid=30198] splunkd -p 8089 restart [process-runner]
And the missing process correlates exactly with the hosts where we are not getting introspection data. This process should be started every time Splunk starts, and we can see it being launched in splunkd.log during our last restart:
$ egrep instrument-resource-usage $SPLUNK_HOME/var/log/splunk/splunkd.log
03-14-2016 02:44:10.822 -0400 INFO ExecProcessor - New scheduled exec process: /opt/splunk/bin/splunkd instrument-resource-usage
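One quick way to look for it is to probe the process table directly. A simple sketch (the bracketed `[i]` in the pattern keeps grep from matching its own command line):

```shell
# Check the process table for the introspection helper.
# The [i] in the pattern prevents grep from matching itself.
if ps -ef | grep "[i]nstrument-resource-usage" > /dev/null; then
    helper_status="running"
else
    helper_status="not running"
fi
echo "introspection helper: $helper_status"
```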
But it doesn't stay running: it's already gone when we look only seconds later. We can try running it ourselves and see what output we get:
$ /opt/splunk/bin/splunkd instrument-resource-usage
INFO RU_main - I-data gathering (Resource Usage) starting; period=10s
Cond 'swapSize_free <= ru._swap.get()' false; line 443, collect_hostwide__Linux()
Oof. That looks like a failed assertion related to swap. Strange. Let's see what our swap looks like:
$ free -m
             total       used       free     shared    buffers     cached
Mem:         12921      12629        291          0        237      10158
-/+ buffers/cache:       2234      10686
Swap:        14855 1759218580     256081
Okay. Weird. We supposedly have far more swap free than we have total, and a nonsense "used" figure. How does free come up with those numbers? Time to dig out the source. (tick tock tick tock tick tock) It turns out free reads them from /proc/meminfo - and that is messed up too.
$ grep Swap /proc/meminfo
SwapCached:        19784 kB
SwapTotal:      15212520 kB
SwapFree:      262221008 kB
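The failed condition boils down to a sanity check that free swap can never exceed total swap. A minimal shell sketch of the same check, using the kB values from this host's /proc/meminfo (illustrative only, not Splunk's actual code):

```shell
# Sanity-check swap accounting: SwapFree must never exceed SwapTotal.
# Values in kB, as reported by /proc/meminfo on the broken host.
swap_total_kb=15212520
swap_free_kb=262221008

if [ "$swap_free_kb" -gt "$swap_total_kb" ]; then
    echo "BROKEN: SwapFree (${swap_free_kb} kB) > SwapTotal (${swap_total_kb} kB)"
else
    echo "OK: swap accounting is sane"
fi
```

On a healthy host the check passes; here it trips on exactly the condition the helper asserts on.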
(And here's the TL;DR)
Now we're getting somewhere. If /proc/meminfo is wrong, then it's most likely a kernel bug (whee!). A little googling finds https://access.redhat.com/solutions/1571043. If you can't get behind Red Hat's paywall, the gist is that kernel-2.6.32-573.1.1.el6.x86_64 had a bug that caused SwapFree to be counted wrong. A reboot fixes it, but only temporarily; the permanent fix is updating to a newer kernel. The same problem likely also occurs in CentOS at the same kernel level.