Deployment Architecture

How to Calculate total Search Load for my Search Head Clustering Deployment?

sat94541
Communicator

Can you please help us understand how to calculate when our search concurrency limit is being hit in my Search Head Cluster deployment? We would like to investigate when we see a scheduled search being skipped.

rbal_splunk
Splunk Employee

efavreau
Motivator

Link is broken


rbal_splunk
Splunk Employee

The response to your question is not very simple. At a high level, Splunk runs the following types of searches:

- Ad-hoc searches
- Scheduled searches (running and delegated)
- Report acceleration searches (running and delegated)
- Data model acceleration searches (running and delegated)


To calculate the number of SHC-wide concurrent searches running at any given time, you need to add up ad-hoc searches + scheduled searches + report acceleration scheduled searches + data model acceleration scheduled searches + delegated searches.
Here are various logs and searches that can be leveraged to get some stats, but these searches won't provide you complete data. Splunk currently has an open enhancement request (SPL-125101: Comprehensive search concurrency metrics) to streamline these stats for reporting needs.
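
For illustration only (this is not from the original reply, and the numbers below are assumptions based on default limits.conf settings; check the [search] stanza in limits.conf on your own members): the per-member concurrency ceiling is roughly

max_searches_per_cpu x number_of_CPU_cores + base_max_searches

With the defaults (max_searches_per_cpu=1, base_max_searches=6, max_searches_perc=50) on a 16-core member, that is 1 x 16 + 6 = 22 concurrent searches, of which about 50% (around 11) are reserved for scheduled searches, and a 3-member SHC would then allow roughly 3 x 22 = 66 searches cluster-wide. Once the sum above (ad-hoc + scheduled + report acceleration + data model acceleration + delegated) approaches those numbers, the scheduler starts skipping or deferring jobs.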

1) The introspection log provides a snapshot of all searches running on the SHC members. This snapshot is taken every 10 seconds for scheduled searches, report acceleration, and data model acceleration. You can use the search below to get a trend of the searches being run in each category.

index=_internal ( host=<> ….) sourcetype=splunk_resource_usage component=PerProcess data.search_props.sid=*
| eval data.search_props.type = if(like('data.search_props.sid',"%_scheduler_%"),"scheduled",'data.search_props.type')
| bin _time span=10s
| stats dc(data.search_props.sid) AS distinct_search_count by _time, data.search_props.type
| timechart bins=200 max(distinct_search_count) AS "max search concurrency" by data.search_props.type
| addtotals

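If you want to see the same trend broken down per SHC member rather than cluster-wide, a minimal variation of the search above (the host names are placeholders you would replace with your own members) is:

index=_internal (host=<SHC_member_1> OR host=<SHC_member_2>) sourcetype=splunk_resource_usage component=PerProcess data.search_props.sid=*
| bin _time span=10s
| stats dc(data.search_props.sid) AS distinct_search_count by _time, host
| timechart bins=200 max(distinct_search_count) AS concurrency by host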

Stats from introspection data have the following challenges:
- Introspection data is sampled every 10 seconds, which means searches that start and finish within a 10-second window won't get accounted for.
- Introspection data also doesn't account for delegated searches.

Due to these challenges, introspection data can only be used to see the trend and may show stats below the actual search load.

2) To get the delegated searches, I have been researching this over the last few days, and development has provided useful tips, as published in https://answers.splunk.com/answers/449024/search-head-cluster-scheduled-searches-and-status.html

Based on this, the total number of scheduled searches calculated by the scheduler/captain can be derived from metrics (group=searchscheduler) as activeScheduledSearches.size + activeDelegatedSearch.size; below is a sample search, but these metrics are missing ad-hoc searches.
Another limitation with this search is that it is sampled (snapshotted) every 30 seconds, so even this data will miss searches that finished in between those 30-second samples.

Scheduler Activity (based on metrics.log) :

index=_internal sourcetype=splunkd source=*metrics.log* group=searchscheduler
| timechart span=3m sum(dispatched) as dispatched, sum(skipped) as skipped, sum(delegated) as delegated, max(delegated_waiting) as delegated_waiting, sum(delegated_scheduled) as delegated_scheduled, max(max_pending) as max_pending, max(max_running) as max_running
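
If the main symptom is skipped scheduled searches, the same metrics can also be reduced to a skip ratio over time. This is a small variation on the search above (it only uses the dispatched and skipped fields already shown there):

index=_internal sourcetype=splunkd source=*metrics.log* group=searchscheduler
| timechart span=10m sum(dispatched) as dispatched sum(skipped) as skipped
| eval skip_ratio_pct=round(100*skipped/(dispatched+skipped),1)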

3) Here is another search that can be used to get scheduled searches (running + skipped) from scheduler.log along with ad-hoc searches from _audit. To get meaningful data you need to run it over a long time period, such as 4 hours or more. This is also missing delegated searches. Another challenge is that the audit log is not always complete for ad-hoc searches, so the numbers may be a bit skewed.

Skipped searches vs concurrency:

host=<SHC_HOST_NAME>
(index=_internal source=*/scheduler.log*  (status=success run_time=*) OR status=skipped) OR
((index=_audit action=search info=completed) (NOT search_id='scheduler_*' NOT search_id='rsa_*'))

| eval type=if(status="skipped", "skipped", "completed")
| eval run_time=coalesce(run_time, total_run_time)
| eval counter=-1
| appendpipe [
    | eval counter=1
    | eval _time=_time - run_time
]

| sort 0 _time
| streamstats sum(counter) as concurrency by type
| table _time concurrency counter run_time type
| timechart partial=f sep=_ span=1m count min(concurrency) as tmin max(concurrency) as tmax by type
| rename count_skipped as skipped     tmin_completed as min_concurrency     tmax_completed as max_concurrency
| fields + _time skipped *_concurrency
| filldown *_concurrency
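
When the skipped count in this chart spikes while max_concurrency is sitting at your configured limit, the skips are most likely concurrency-related. To confirm the exact cause, scheduler.log also records a reason for each skip (the reason field name is an assumption based on recent Splunk versions; adjust if your version logs it differently):

host=<SHC_HOST_NAME> index=_internal source=*/scheduler.log* status=skipped
| timechart span=10m count by reason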

Delayed-minutes vs concurrency:

host=<SHC_HOST_NAME>
index=_audit
(action=search info=completed)
(NOT search_id='scheduler_*' NOT search_id='rsa_*')

| eval run_time=coalesce(run_time, total_run_time)
| eval counter=-1
| appendpipe [
    | eval counter=1
    | eval _time=_time - run_time
]

| sort 0 _time
| streamstats sum(counter) as concurrency
| timechart partial=f sep=_ span=1m min(concurrency) as min_concurrency max(concurrency) as max_concurrency
| filldown *_concurrency

| join _time [
    | search index=_internal host=<SHC_HOST_NAME>  source=*/scheduler.log* (status=success OR status=continued OR status=skipped)
    | eval dispatch_time =  coalesce(dispatch_time, _time)
    | eval scheduled_time = if(scheduled_time > 0, scheduled_time, "WTF")
    | eval window_time =    coalesce(window_time, "0")
    | eval execution_latency = max(dispatch_time - (scheduled_time + window_time), 0)
    | timechart partial=f sep=_ span=1m sum(execution_latency) as delayed_seconds
    | eval delayed_minutes=coalesce(delayed_seconds/60, 0)
    | fields + _time delayed_minutes
]

Due to these limitations, Splunk currently presents some challenges when you are trying to get comprehensive search concurrency metrics.
