Deployment Architecture

Search Head Cluster member splunkd.log shows thousands of this error message: "SHPMaster - did not schedule removal for peer"

sat94541
Communicator

The splunkd.log file on a search head cluster member contains about 450K occurrences of this type of error:

05-01-2016 18:42:15.010 -0400 WARN SHPMaster - did not schedule removal for peer='E984FB44-0D11-4B0B-ABBF-95E9BAE41658', err='SHPMaster::scheduleRemoveArtifactFromPeer_locked: aid=scheduler__admin__search__RMD51fa2da2a876e07a8_at_1461834240_5_E984FB44-0D11-4B0B-ABBF-95E9BAE41658 peer=E984FB44-0D11-4B0B-ABBF-95E9BAE41658 is pending some change, status='PendingDiscard''

Clarification on this error message would be useful.


dineshraj9
Builder

Setting max_searches_per_process = 1 in limits.conf resolved the issue.

http://docs.splunk.com/Documentation/Splunk/latest/admin/limitsconf

max_searches_per_process = <int>
* On UNIX, specifies the maximum number of searches that each search process
  can run before exiting.
* After a search completes, the search process can wait for another search to
  start and the search process can be reused.
* When set to "0" or "1": The process is never reused.
* When set to a negative value: There is no limit to the number of searches
  that a process can run.
* Has no effect on Windows if search_process_mode is not "auto".
* Default: 500
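
For reference, a minimal sketch of how that setting might look in a local limits.conf; the [search] stanza placement follows the standard limits.conf layout, so verify against the spec file for your version:

  # $SPLUNK_HOME/etc/system/local/limits.conf
  [search]
  # Never reuse a search process; each search gets a fresh process (UNIX only).
  max_searches_per_process = 1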

sloshburch
Splunk Employee

I'm hearing that this is a workaround/band-aid that really diminishes the power of the platform, and that the stronger solution would be to open a support case with Splunk.


rbal_splunk
Splunk Employee

Thanks for sharing that max_searches_per_process=1 resolved it.


rbal_splunk
Splunk Employee

There are two scenarios in which this could happen:

This can happen if the /services/shcluster/member/artifacts//discard endpoint is failing due to network/REST issues; you can see those failures in splunkd_access.log on the members.
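
For example, you could grep splunkd_access.log on each member for failing discard calls. This is only a sketch; the log path and the non-200 filter are assumptions about a default install and the access-log format, not from the original post:

  # Run on each SHC member (path assumes a default $SPLUNK_HOME).
  grep "shcluster/member/artifacts" "$SPLUNK_HOME/var/log/splunk/splunkd_access.log" \
    | grep "discard" \
    | grep -v " 200 "   # surviving lines (non-200 responses) point to the REST/network issue above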

The other possibility is a large volume of messages between the captain and members, leading to delays in message processing.

For the first case, the recommendation is to transfer captaincy to a different node. (6.2.3 does not have a transfer-captaincy command, so transfer captaincy manually by bringing down the current captain.) That will stop the current flow of WARN messages.
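
A minimal sketch of that manual workaround, assuming standard Splunk CLI commands and a default $SPLUNK_HOME (credentials are placeholders; verify the commands against the docs for your version):

  # From any member, identify the current captain.
  $SPLUNK_HOME/bin/splunk show shcluster-status -auth <user>:<password>

  # On the current captain, stop splunkd so the remaining members elect a new captain.
  $SPLUNK_HOME/bin/splunk stop

  # After a new captain is elected, bring the old captain back as a regular member.
  $SPLUNK_HOME/bin/splunk start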

If you don't find any network issues or REST call failures, it could be a delay in jobs, and increasing executor_workers to 20 (default 10) in the [shclustering] stanza of server.conf might help.
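
For example, on each member's server.conf (a sketch based on the suggestion above; the change typically requires a restart to take effect):

  # $SPLUNK_HOME/etc/system/local/server.conf
  [shclustering]
  # Default is 10; more workers help when captain/member messages back up.
  executor_workers = 20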
