
"Fatal thread error: pthread_mutex_lock: " when persistent queue enabled. Crashing thread: typing

hrawat_splunk
Splunk Employee

A heavy forwarder or indexer crashes with a FATAL error on the typing thread.

 

Note: This issue is now fixed for the upcoming 9.2.2/9.1.5/9.0.10 patches.

 

Crashing thread: typing_0
Backtrace (PIC build):
  [0x00007F192F4C2ACF] gsignal + 271 (libc.so.6 + 0x4EACF)
  [0x00007F192F495EA5] abort + 295 (libc.so.6 + 0x21EA5)
  [0x000055E24388D6C0] ? (splunkd + 0x1A366C0)
  [0x000055E24388D770] ? (splunkd + 0x1A36770)
  [0x000055E2445D6D24] PipelineInputChannelReference::PipelineInputChannelReference(Str const**, PipelineInputChannelSet*, bool) + 388 (splunkd + 0x277FD24)
  [0x000055E2445BACC3] PipelineData::set_channel(Str const*, Str const*, Str const*) + 243 (splunkd + 0x2763CC3)
  [0x000055E2445BAF9E] PipelineData::recomputeConfKey(PipelineSet*, bool) + 286 (splunkd + 0x2763F9E)
  [0x000055E243E3689E] RegexExtractionProcessor::each(CowPipelineData&, PipelineDataVector*, bool) + 718 (splunkd + 0x1FDF89E)
  [0x000055E243E36BF3] RegexExtractionProcessor::executeMulti(PipelineDataVector&, PipelineDataVector*) + 67 (splunkd + 0x1FDFBF3)
  [0x000055E243BCD5F2] Pipeline::main() + 1074 (splunkd + 0x1D765F2)
  [0x000055E244C336FD] Thread::_callMainAndDiscardTerminateException() + 13 (splunkd + 0x2DDC6FD)
  [0x000055E244C345F2] Thread::callMain(void*) + 178 (splunkd + 0x2DDD5F2)
  [0x00007F192FF1F1CA] ? (libpthread.so.0 + 0x81CA)
  [0x00007F192F4ADE73] clone + 67 (libc.so.6 + 0x39E73)

 Last few lines of stderr (may contain info on assertion failure, but also could be old):

    Fatal thread error: pthread_mutex_lock: Invalid argument; 117 threads active, in typing_0

This crash happens if a persistent queue is enabled, and it has been reported for several years; I see one report going back to 2015 as well.
https://community.splunk.com/t5/Monitoring-Splunk/What-would-cause-a-Fatal-thread-error-in-thread-ty...
The bug has always existed, but the interesting part is that since 9.x the frequency of crashes has gone up and more customers are reporting them; the probability of hitting the race condition is now much higher.

We are fixing the issue (internal ticket SPL-251434) for the next patch. In the meantime, here are a few workarounds to consider, depending on what is feasible for your environment.

The reason for the higher frequency of crashes on 9.x instances with persistent queue enabled is that the forwarders (UF/HF/IUF/IHF) send data at a faster rate due to 9.x autoBatch, so the small in-memory part of the persistent queue (default 500KB) makes it nearly impossible to keep the on-disk part of the persistent queue out of play. In other words, a 9.x receiver with a persistent queue is writing to disk nearly all the time, even when downstream pipeline queues are not saturated. So the best way to bring the crash frequency back down to the level of 8.x or older is to increase the in-memory part of the persistent queue (so that, as long as no downstream queues are full, there are no disk writes to the persistent queue).

However, the fundamental bug is still there and will be fixed in a patch. The workarounds only reduce the likelihood of disk writes to the persistent queue, so have a look at the possible workarounds below and see which one works for you.

1. Turn off the persistent queue on the splunktcp input port (I'm sure this is not feasible for everyone). This eliminates the crash; a minimal sketch follows below.
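
For illustration only, here is a minimal inputs.conf sketch of what this looks like on the receiving instance (the port 9997 and the 10GB value are assumed examples, not values from this thread); removing or commenting out persistentQueueSize leaves the input with only its in-memory queue:

    # etc/system/local/inputs.conf on the receiving HF/indexer (example port)
    [splunktcp:9997]
    # To turn off the on-disk persistent queue, remove or comment out persistentQueueSize:
    # persistentQueueSize = 10GB
    # The in-memory queue remains (500KB by default):
    queueSize = 500KB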

2. Disable the `splunk_internal_metrics` app, as it does sourcetype cloning for metrics.log. Most of us are probably not aware that metrics.log is cloned and additionally indexed into the `_metrics` index. If you are not using the `_metrics` index, disable the app (one way to do this from the CLI is sketched after the two conditions below).
For the crash to happen, you need two conditions:
  a) a persistent queue
  b) sourcetype cloning
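
A sketch of disabling the app via the Splunk CLI, assuming a default $SPLUNK_HOME and admin credentials (editing the app's local app.conf to set state = disabled works just as well):

    # Disable the app, then restart splunkd for the change to take effect
    $SPLUNK_HOME/bin/splunk disable app splunk_internal_metrics -auth admin:<password>
    $SPLUNK_HOME/bin/splunk restart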


3. Apply the following configs to reduce the chance of crashes.

  • limits.conf 
    [input_channels]
    max_inactive=300001
    lowater_inactive=300000
    inactive_eligibility_age_seconds=120

  • inputs.conf: increase the in-memory queue size of the PQ (depending on SSL or non-SSL port)
    [splunktcp-ssl:<port>]
    queueSize=100MB
    [splunktcp:<port>]
    queueSize=100MB

  • Enable async forwarding on the HF/IUF/IHF (the crashing instance)

4. Slow down forwarders by setting `autoBatch=false` on all universal forwarders/heavy forwarders, as sketched below.
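
A minimal outputs.conf sketch for the forwarder side (the group name my_indexers and the <receiver>:9997 address are assumed placeholders):

    # etc/system/local/outputs.conf on each UF/HF that sends to the affected receiver
    [tcpout:my_indexers]
    server = <receiver>:9997
    # Disable 9.x auto-batching so the forwarder sends at the older (slower) rate
    autoBatch = false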


hrawat_splunk
Splunk Employee

> Also, why didn't the Splunk containers crash when this kind of failure happens?


It's a race condition that happens if all of the following are true:
1. A persistent queue is in use.
2. Splunk metrics/introspection etc. events received from the previous layer via a splunktcp input port are being cloned (for example, by the `splunk_internal_metrics` app).
3. Queues were blocked on the instance, which pushed the metrics/introspection events onto the PQ disk queue; those events are then read back from the PQ disk queue after a Splunk restart and cloned.
If an instance avoids at least one of the above conditions, it will avoid the crash.

Any event that hits the PQ disk, is read back from the PQ disk after a Splunk restart, and is then cloned will cause the crash.


hrawat_splunk
Splunk Employee

The good news is that this is a high-priority issue for us; it is fixed for the upcoming major release 9.3.0 (the .conf release).
It has been backported to the upcoming 9.0.x/9.1.x/9.2.x patches (9.2.2/9.1.5/9.0.10).


DG
Explorer

Thank you for the information, @hrawat_splunk!
Do I understand correctly that the "backports" are coming with the major release, as you said "conf release" - so around the Splunk conference, June 11-14?


hrawat_splunk
Splunk Employee

That's right.

rphillips_splk
Splunk Employee

How to identify if you are hitting this bug:

1. Do you have persistentQueueSize set to more than 20MB in inputs.conf on the forwarder? For example:

etc/system/local/inputs.conf

[splunktcp-ssl:9996]
persistentQueueSize = 120GB

 

2. Is Splunk crashing after a restart, with the crash log showing Crashing thread: typing_0?

3. Do you see lines like this in splunkd_stderr.log? (A quick way to check conditions 1 and 3 is sketched after the example below.)

2024-01-11 21:16:19.669 +0000 splunkd started (build 050c9bca8588) pid=2506904 Fatal thread error: pthread_mutex_lock: Invalid argument; 117 threads active, in typing_0
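
A sketch of those checks from the command line, assuming a default $SPLUNK_HOME installation:

    # List any persistentQueueSize settings currently in effect
    $SPLUNK_HOME/bin/splunk btool inputs list --debug | grep -i persistentQueueSize

    # Look for the fatal mutex error left behind by earlier crashes
    grep "pthread_mutex_lock" $SPLUNK_HOME/var/log/splunk/splunkd_stderr.log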


rphillips_splk
Splunk Employee

As mentioned by @hrawat_splunk, the workaround is to disable the splunk_internal_metrics app on the forwarder:

$SPLUNK_HOME/etc/apps/splunk_internal_metrics/local/app.conf

[install]
state = disabled

The bug (SPL-251434) will be fixed in the following releases: 9.2.2/9.1.5/9.0.10.
This app clones all metrics.log events into a metrics_log sourcetype; the metrics_log sourcetype does the necessary conversion from log events to metric events and redirects them to the _metrics index.
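
For context, this kind of sourcetype cloning is generally configured with a CLONE_SOURCETYPE transform; the following is an illustrative sketch of the mechanism only (the stanza names are made up here, this is not the app's exact configuration):

    # props.conf -- run a cloning transform on events from metrics.log
    [source::.../var/log/splunk/metrics.log]
    TRANSFORMS-clone_metrics = clone_to_metrics_log

    # transforms.conf -- each matching event is cloned into the new sourcetype
    [clone_to_metrics_log]
    REGEX = .
    CLONE_SOURCETYPE = metrics_log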


DG
Explorer

Hi @rphillips_splk, @hrawat_splunk

It's great to hear that it will finally be fixed, but when will you release those fixed versions? I can't find those tags on Docker Hub.

Also, why didn't the Splunk containers crash when this kind of failure happens? We are running the splunk/splunk images (as heavy forwarders) on K8s, and we only noticed the issue when we saw that the network throughput was low on a pod. K8s didn't restart the pod automatically because it didn't crash; the container stayed there as a zombie and didn't do any forwarding.

Thank you!

Regards,

DG
