Deployment Architecture

Searches are delayed by 10s before any results are displayed.

drrushi_splunk
Splunk Employee

In a distributed environment with search-head pooling enabled, it takes about 10 seconds from search execution to events being displayed in the UI.
A cursory look at the search job inspector reveals that startup.handoff is taking an inexplicable ~10s.
The behavior is the same across different browsers, search queries, and timespans.

1 Solution

drrushi_splunk
Splunk Employee

The culprit for this condition in our case was a leftover pooling.ini.lock file under etc/pooling/ on the shared pool storage.
Essentially, when a pool member (search-head) validates itself with the pool, a check is performed against pooling.ini and a lock is created by the requesting search-head.
The 'lock' consists of two files under etc/pooling/:
1. pooling.ini.lock
2. pooling.ini

On occasion a search-head fails to clean up after itself and remove its lock, leaving these files behind in the location above. Another search-head attempting to execute a search then cannot validate itself against pooling.ini because of the existing lock; by default it waits/retries for 10s, at which point it proceeds to access pooling.ini regardless.
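
A quick way to look for a stale lock is to list the pooling directory on the shared storage and check the file timestamps. A minimal sketch, assuming a hypothetical shared-storage mount point of /mnt/pool (substitute the path your pool actually uses):

    # Hypothetical mount point; replace /mnt/pool with your pool's shared storage path
    ls -l /mnt/pool/etc/pooling/
    # A pooling.ini.lock whose timestamp is hours or days old, with no pool member
    # currently validating, suggests a search-head failed to clean up its lock.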

To confirm whether you are hitting this issue please check:

1) In splunkd.log or btool.log (the message may appear in either file, or both) there will be messages like the following:
ERROR SearchHeadPoolInfo - Error reading search head pool info: Failed to lock //****/pool/etc/pooling/pooling.ini with code 1, possible reason: No such file or directory
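
A quick way to check both logs from the shell, assuming the default $SPLUNK_HOME of /opt/splunk (adjust the path for your installation):

    # Look for the pooling lock error in both candidate log files
    grep "SearchHeadPoolInfo" /opt/splunk/var/log/splunk/splunkd.log
    grep "Failed to lock" /opt/splunk/var/log/splunk/btool.log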

2) The job inspector output for any search job in version 5+ will include, under Execution Costs, a measurement:
startup.handoff = 10000
Note: In pre-v5 Splunk the issue may be present, but startup.handoff is not calculated, so it can be harder to verify that you have hit the condition. Also, "total run time" in the Job Inspector does NOT include the startup.handoff time.
The value will be somewhere around 10s (10000 ms), matching the timeout in the current design. In version 5.0.6 the behavior will change (SPL-66563): Splunk will optimistically attempt to open pooling.ini first and only fall back to a file-based lock around pooling.ini.
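
The job inspector reads this data from the search job itself, so you can also inspect it over the management port. A rough sketch, assuming default host/port, placeholder credentials, and a placeholder SID (the exact layout of the returned XML may vary by version):

    # <sid> is a placeholder for a real search job ID from your search-head
    curl -sk -u admin:changeme \
        "https://localhost:8089/services/search/jobs/<sid>" | grep -A 3 "startup.handoff"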

3) Another simple check is to execute ANY search and measure the time against the wall clock. If it takes around 10s before you see anything in the UI, AND the above two checks are positive, then you have hit this behavior.
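
One way to do this reproducibly is to time a trivial search from the CLI; this is only a rough proxy for the UI path, but the ~10s startup penalty shows up the same way. A minimal sketch with placeholder credentials, assuming $SPLUNK_HOME of /opt/splunk:

    # Wall-clock a trivial search; a consistent ~10s overhead points at startup.handoff
    time /opt/splunk/bin/splunk search 'index=_internal | head 1' -auth admin:changeme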

If you have hit this condition, the workaround is simply to remove the leftover lock files:
1. pooling.ini.lock
2. pooling.ini
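
A command-line sketch of the cleanup, again assuming a hypothetical shared-storage mount of /mnt/pool (keeping a copy first is optional but cheap insurance):

    # Hypothetical pool path; run on a host that mounts the shared pooling storage
    cd /mnt/pool/etc/pooling/
    mkdir -p /tmp/pooling-backup && cp -p pooling.ini.lock pooling.ini /tmp/pooling-backup/
    rm -f pooling.ini.lock pooling.ini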
