
"Fatal thread error: pthread_mutex_lock: " when persistent queue enabled. Crashing thread: typing

hrawat_splunk
Splunk Employee

A heavy forwarder or indexer crashes with a FATAL error on the typing thread.

 

Note: This issue is now fixed for the upcoming 9.2.2/9.1.5/9.0.10 patches.

 

Crashing thread: typing_0
Backtrace (PIC build):
  [0x00007F192F4C2ACF] gsignal + 271 (libc.so.6 + 0x4EACF)
  [0x00007F192F495EA5] abort + 295 (libc.so.6 + 0x21EA5)
  [0x000055E24388D6C0] ? (splunkd + 0x1A366C0)
  [0x000055E24388D770] ? (splunkd + 0x1A36770)
  [0x000055E2445D6D24] PipelineInputChannelReference::PipelineInputChannelReference(Str const**, PipelineInputChannelSet*, bool) + 388 (splunkd + 0x277FD24)
  [0x000055E2445BACC3] PipelineData::set_channel(Str const*, Str const*, Str const*) + 243 (splunkd + 0x2763CC3)
  [0x000055E2445BAF9E] PipelineData::recomputeConfKey(PipelineSet*, bool) + 286 (splunkd + 0x2763F9E)
  [0x000055E243E3689E] RegexExtractionProcessor::each(CowPipelineData&, PipelineDataVector*, bool) + 718 (splunkd + 0x1FDF89E)
  [0x000055E243E36BF3] RegexExtractionProcessor::executeMulti(PipelineDataVector&, PipelineDataVector*) + 67 (splunkd + 0x1FDFBF3)
  [0x000055E243BCD5F2] Pipeline::main() + 1074 (splunkd + 0x1D765F2)
  [0x000055E244C336FD] Thread::_callMainAndDiscardTerminateException() + 13 (splunkd + 0x2DDC6FD)
  [0x000055E244C345F2] Thread::callMain(void*) + 178 (splunkd + 0x2DDD5F2)
  [0x00007F192FF1F1CA] ? (libpthread.so.0 + 0x81CA)
  [0x00007F192F4ADE73] clone + 67 (libc.so.6 + 0x39E73)

 Last few lines of stderr (may contain info on assertion failure, but also could be old):

    Fatal thread error: pthread_mutex_lock: Invalid argument; 117 threads active, in typing_0

This crash happens if a persistent queue is enabled, and it has been reported for several years; I see one report going back to 2015 as well.
https://community.splunk.com/t5/Monitoring-Splunk/What-would-cause-a-Fatal-thread-error-in-thread-ty...
The bug has always existed, but the interesting part is that since 9.x the frequency of crashes has gone up and more customers are reporting them; the probability of hitting the race condition is now much higher.

We are fixing the issue (internal ticket SPL-251434) for the next patch. In the meantime, here are a few workarounds to consider, depending on what is feasible for your environment.

The reason for the higher frequency of crashes on 9.x instances with persistent queue enabled is that the forwarders (UF/HF/IUF/IHF) send data at a faster rate due to 9.x autoBatch, so the small in-memory part of the persistent queue (default 500KB) makes it nearly impossible to keep the on-disk part of the persistent queue out of play. In other words, a 9.x receiver with a persistent queue is writing to disk nearly all the time, even when downstream pipeline queues are not saturated. So the best way to bring the crash frequency back down to the level of 8.x or older is to increase the in-memory part of the persistent queue (so that, as long as no downstream queues are full, there are no disk writes to the persistent queue).

However, the fundamental bug is still there and will be fixed in a patch. The workarounds only reduce the likelihood of disk writes to the persistent queue, so have a look at the possible workarounds below and see which one works for you.

1. Turn off the persistent queue on the splunktcp input port (I'm sure this is not feasible for everyone). This eliminates the crash; a minimal sketch follows below.
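
For illustration only, here is a minimal inputs.conf sketch of what this looks like on the receiving instance (the port 9997 and the 10GB value are assumed examples, not values from this thread); removing or commenting out persistentQueueSize leaves the input with only its in-memory queue:

    # etc/system/local/inputs.conf on the receiving HF/indexer (example port)
    [splunktcp:9997]
    # To turn off the on-disk persistent queue, remove or comment out persistentQueueSize:
    # persistentQueueSize = 10GB
    # The in-memory queue remains (500KB by default):
    queueSize = 500KB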

2. Disable the `splunk_internal_metrics` app, as it does sourcetype cloning for metrics.log. Most of us are probably not aware that metrics.log is cloned and additionally indexed into the `_metrics` index. If you are not using the `_metrics` index, disable the app (one way to do this from the CLI is sketched after the two conditions below).
For the crash to happen, you need two conditions:
  a) a persistent queue
  b) sourcetype cloning
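
A sketch of disabling the app via the Splunk CLI, assuming a default $SPLUNK_HOME and admin credentials (editing the app's local app.conf to set state = disabled works just as well):

    # Disable the app, then restart splunkd for the change to take effect
    $SPLUNK_HOME/bin/splunk disable app splunk_internal_metrics -auth admin:<password>
    $SPLUNK_HOME/bin/splunk restart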


3. Apply the following configs to reduce the chance of crashes.

  • limits.conf 
    [input_channels]
    max_inactive=300001
    lowater_inactive=300000
    inactive_eligibility_age_seconds=120

  • inputs.conf: increase the in-memory queue size of the PQ (depending on SSL or non-SSL port)
    [splunktcp-ssl:<port>]
    queueSize=100MB
    [splunktcp:<port>]
    queueSize=100MB

  • Enable async forwarding on the HF/IUF/IHF (the crashing instance)

4. Slow down forwarders by setting `autoBatch=false` on all universal forwarders/heavy forwarders, as sketched below.
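
A minimal outputs.conf sketch for the forwarder side (the group name my_indexers and the <receiver>:9997 address are assumed placeholders):

    # etc/system/local/outputs.conf on each UF/HF that sends to the affected receiver
    [tcpout:my_indexers]
    server = <receiver>:9997
    # Disable 9.x auto-batching so the forwarder sends at the older (slower) rate
    autoBatch = false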


hrawat_splunk
Splunk Employee

> Also, why didn't the Splunk containers crash when this kind of failure happens?


It's a race condition that happens if all of the following are true:
1. A persistent queue is in use.
2. Splunk metrics/introspection etc. events received from the previous layer via a splunktcp input port are being cloned (for example, by the `splunk_internal_metrics` app).
3. Queues were blocked on the instance, which pushed the metrics/introspection events onto the PQ disk queue; those events are then read back from the PQ disk queue after a Splunk restart and cloned.
If an instance avoids at least one of the above conditions, it will avoid the crash.

Any event that hits the PQ disk, is read back from the PQ disk after a Splunk restart, and is then cloned will cause the crash.


hrawat_splunk
Splunk Employee

The good news is that this is a high-priority issue for us; it is fixed for the upcoming major release 9.3.0 (the .conf release).
It has been backported to the upcoming 9.0.x/9.1.x/9.2.x patches (9.2.2/9.1.5/9.0.10).


DG
Explorer

Thank you for the information, @hrawat_splunk!
Do I understand correctly that the "backports" are coming with the major release, as you said "conf release" - so around the Splunk conference, June 11-14?


hrawat_splunk
Splunk Employee

That's right.

rphillips_splk
Splunk Employee

How to identify if you are hitting this bug:

1. Do you have persistentQueueSize set to more than 20MB in inputs.conf on the forwarder? For example:

etc/system/local/inputs.conf

[splunktcp-ssl:9996]
persistentQueueSize = 120GB

 

2. Is Splunk crashing after a restart, with the crash log showing Crashing thread: typing_0?

3. Do you see lines like this in splunkd_stderr.log? (A quick way to check conditions 1 and 3 is sketched after the example below.)

2024-01-11 21:16:19.669 +0000 splunkd started (build 050c9bca8588) pid=2506904 Fatal thread error: pthread_mutex_lock: Invalid argument; 117 threads active, in typing_0
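
A sketch of those checks from the command line, assuming a default $SPLUNK_HOME installation:

    # List any persistentQueueSize settings currently in effect
    $SPLUNK_HOME/bin/splunk btool inputs list --debug | grep -i persistentQueueSize

    # Look for the fatal mutex error left behind by earlier crashes
    grep "pthread_mutex_lock" $SPLUNK_HOME/var/log/splunk/splunkd_stderr.log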


rphillips_splk
Splunk Employee

As mentioned by @hrawat_splunk, the workaround is to disable the splunk_internal_metrics app on the forwarder:

$SPLUNK_HOME/etc/apps/splunk_internal_metrics/local/app.conf

[install]
state = disabled

The bug (SPL-251434) will be fixed in the following releases: 9.2.2/9.1.5/9.0.10.
This app clones all metrics.log events into a metrics_log sourcetype; the metrics_log sourcetype does the necessary conversion from log events to metric events and redirects them to the _metrics index.
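
For context, this kind of sourcetype cloning is generally configured with a CLONE_SOURCETYPE transform; the following is an illustrative sketch of the mechanism only (the stanza names are made up here, this is not the app's exact configuration):

    # props.conf -- run a cloning transform on events from metrics.log
    [source::.../var/log/splunk/metrics.log]
    TRANSFORMS-clone_metrics = clone_to_metrics_log

    # transforms.conf -- each matching event is cloned into the new sourcetype
    [clone_to_metrics_log]
    REGEX = .
    CLONE_SOURCETYPE = metrics_log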


DG
Explorer

Hi @rphillips_splk, @hrawat_splunk

It's great to hear that it will finally be fixed, but when will you release those fixed versions? I can't find those tags on Docker Hub.

Also, why didn't the Splunk containers crash when this kind of failure happens? We are running the splunk/splunk images (as heavy forwarders) on K8s, and we only noticed the issue when we saw that the network throughput was low on a pod. K8s didn't restart the pod automatically because it didn't crash; the container stayed there as a zombie and didn't do any forwarding.

Thank you!

Regards,

DG
