Well, I understand your point about "this", but that's the problem: I couldn't find an error with the skipped searches, unless I am missing something. Since I did the rolling restart (reset), there are no more skipped searches. Previously I looked for the longest-running searches, and none were over-running their schedules that I could see. For example, one search took approximately an hour, but it only ran every 4 hours. Since I did some optimizing, there were only 3 scheduled searches that produced the warning, which I identified with:

index="_internal" sourcetype="scheduler"
| eval scheduled=strftime(scheduled_time, "%Y-%m-%d %H:%M:%S")
| stats values(scheduled) as scheduled
values(savedsearch_name) as search_name
values(status) as status
values(reason) as reason
values(run_time) as run_time
values(dm_node) as dm_node
values(sid) as sid
by _time, savedsearch_name | sort -scheduled
| table scheduled, search_name, status, reason, run_time

When I looked back at those 3 specific searches, they were not over-running their schedules, so I was wondering how the scheduler got stuck thinking they were "piling up" vs. "still running". What I am trying to understand/investigate is this: if a search is "skipped", then when the SHC scheduler retries that previously skipped search at its next runtime, how can I see that the SHC captain thinks it is still running? Looking back at the "skipped" events, they don't contain "run_time", so I looked back historically to find a day with a high value. But when the searches were running, they took a max of 4 seconds, with an average of 2 seconds to complete, which is why I thought the scheduled searches were piling up. Hope that makes sense. The only other variable I can think of is that these searches use the "| dbxquery" command from the Splunk DB Connect app. So did the SHC just get stuck? Any further thoughts appreciated. TY
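
For the "still running" question, the next thing I plan to try is a REST check along these lines (just a rough sketch; I am assuming the rest command can reach all SHC members, that scheduled search sids start with "scheduler__", and that dispatchState/runDuration are the right field names in my version):

| rest /services/search/jobs splunk_server=*
| search sid="scheduler__*" dispatchState!="DONE"
| table splunk_server, sid, label, dispatchState, runDuration
| sort -runDuration

If one of the 3 searches showed up here as RUNNING long after its normal runtime, that would back up the "captain still thinks it is running" theory.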
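
And for the historical run_time check (since the skipped events carry no "run_time"), I have been looking at the successful runs, roughly like this (assuming status=success marks completed runs in scheduler.log; "<search name>" is a placeholder for one of the 3 searches):

index="_internal" sourcetype="scheduler" status=success savedsearch_name="<search name>"
| stats avg(run_time) as avg_runtime_sec, max(run_time) as max_runtime_sec, count by savedsearch_name

That lines up with the max 4 seconds / avg 2 seconds I mentioned above.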