
How to do an RCA on "The maximum number of concurrent running jobs ... on this cluster has been reached"?

Glasses2
Communicator

Hi,

When I inherited this deployment, there were a lot of skipped searches.

The 3-node SHC was under-resourced, but by skewing cron schedules, tuning the limits, removing zombie scheduled searches, and optimizing some searches, I reduced the skips considerably. However, some intensive apps were still causing skipped searches.

So we added a 4th node to the SHC, and it was running smoothly without a skipped search.

Recently, I started seeing a persistent skipped-search warning. Nothing new had been added (no new scheduled searches) and resource usage looked good, but I kept seeing "The maximum number of concurrent running jobs for this historical scheduled search on this cluster has been reached".

I could see the jobs that were skipped, but I am not finding a way to see which jobs piled up during the time interval and caused the skipped search and the warning.

I did notice that some of the skipped searches were throwing warnings and errors. I am wondering whether that caused a hanging job that added to the concurrency count and created a skipping loop.

If anyone has a way to see the scheduled searches that accumulate and cause this error and skipping, please advise.
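
(For context, spotting the skips themselves is easy; something like the rough search below surfaces them, where the status value is my assumption about the scheduler sourcetype. What I cannot see is what was occupying the concurrency slots at that moment.)

index="_internal" sourcetype="scheduler" status="skipped"
| timechart span=1h count BY savedsearch_name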

Thank you!


Glasses2
Communicator

Well, I understand your point about "this"... but that's the problem: I couldn't find an error with the skipped searches, unless I am missing something.

Since I did the rolling restart (reset), there have been no more skipped searches.

Previously I looked for the longest-running searches, and none were over-running their schedules that I could see. For example, one search took approximately an hour, but it only ran every 4 hours.

After some optimizing, only 3 scheduled searches produced the warning, which I identified with:

 

index="_internal" sourcetype="scheduler" 
            | eval scheduled=strftime(scheduled_time, "%Y-%m-%d %H:%M:%S") 
            | stats values(scheduled) as scheduled
                    values(savedsearch_name) as search_name
                    values(status) as status
                    values(reason) as reason
                    values(run_time) as run_time 
                    values(dm_node) as dm_node
                    values(sid) as sid
                    by _time,savedsearch_name |  sort -scheduled
            | table scheduled, search_name, status, reason, run_time
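
Something like the variant below is what I had in mind to see whether runs of one of those three searches ever overlap. It is only a sketch: I am assuming dispatch_time and run_time are populated on the success events, and the saved search name is a placeholder.

index="_internal" sourcetype="scheduler" status="success" savedsearch_name="<one of the three searches>"
| eval start_time=coalesce(dispatch_time, scheduled_time)
| sort 0 start_time
| concurrency duration=run_time start=start_time
| timechart span=5m max(concurrency) AS overlapping_runs

If overlapping_runs ever climbs above 1, that would at least be consistent with the "maximum number of concurrent running jobs" reason.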

 

 

When I looked back at those 3 specific searches, they were not over-running their schedules, so I was wondering how the scheduler got stuck thinking jobs were "piling up" vs "still running".

I am trying to understand/investigate: if a search is "skipped", then when the SHC scheduler runs that previously skipped search again at its next runtime, how can I see that the SHC captain thinks it is still running?

And looking back at the "skipped" events, they don't contain "run_time", so I looked back historically to find a day with a high value. But when these searches did run, they took at most 4 seconds (average of 2 seconds) to complete, which is why I thought the scheduled searches were piling up rather than over-running. Hope that makes sense.
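
The best I have come up with for that is asking each SHC member what it currently considers running, roughly like the sketch below. I am not sure this list matches what the captain actually counts against the scheduler quota, and the REST job fields here are assumptions on my part.

| rest /services/search/jobs count=0
| search isSavedSearch=1 dispatchState!="DONE"
| table splunk_server, label, dispatchState, runDuration, isZombie, published
| sort - runDuration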

The only other variable I can think of is that these searches use the "| dbxquery" command from the Splunk DB Connect app.

So did the SHC just get stuck?

Any further thoughts appreciated.

TY

 

 


acharlieh
Influencer

The key words there are "for this historical scheduled search"... so you're likely looking at a search job that takes longer than its scheduled period to execute. I'd start by looking at the runtimes of the skipping search you've already found.
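
For example, something along these lines (the same scheduler events already queried above; the search name is a placeholder) should show the max and average runtimes to compare against the schedule interval:

index="_internal" sourcetype="scheduler" status="success" savedsearch_name="<the skipping search>"
| stats count AS runs, max(run_time) AS max_runtime_sec, avg(run_time) AS avg_runtime_sec BY savedsearch_name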

(of course not ruling out something crazy like the job wasn't running but the SHC captain thought it was...)

 
