Why are Splunk 7.x Monitoring Console alerts frequently reporting "DMC Alert - Search Peer Not Responding"?

kinaba_splunk (Splunk Employee)

Splunk 7.x.x Monitoring Console alerts frequently report that one of our indexers is "down" with a "DMC Alert - Search Peer Not Responding" alert. But I can see that the Splunk process on this server is running, and it has not been restarted. It seems to be a false positive.

Example:

scheduler.log:06-21-2018 04:31:01.957 +1000 INFO SavedSplunker - savedsearch_id="nobody;splunk_monitoring_console;DMC Alert - Search Peer Not Responding", search_type="scheduled", user="nobody", app="splunk_monitoring_console", savedsearch_name="DMC Alert - Search Peer Not Responding", priority=default, status=success, digest_mode=1, scheduled_time=1532268780, window_time=0, dispatch_time=1532268780, run_time=0.100, result_count=1, alert_actions="email", sid="scheduler_nobody_xxxx _RMDxxxx_at_xxxxx_xxxx", suppressed=0, thread_id="AlertNotifierWorker-0"
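
These alert executions can be traced in the scheduler log with a search like this (a minimal sketch):

index=_internal sourcetype=scheduler savedsearch_name="DMC Alert - Search Peer Not Responding"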

Could you tell me why?

Solution

kinaba_splunk (Splunk Employee)

Regarding the "DMC Alert - Search Peer Not Responding" alert: the DMC checks each search peer's status every 5 minutes, based on the status reported by the REST endpoint "/services/search/distributed/peers".

Unfortunately, this means the alert can be triggered not only when a peer is actually down, but also when the peer's reply does not arrive before the connection timeout expires.
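
To see what this endpoint reports, you can run an ad-hoc search on the DMC instance (a minimal sketch; it returns the same fields the default alert below filters on):

| rest splunk_server=local /services/search/distributed/peers/
| table peerName, status, disabled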
When the alert is triggered even though the peer is up, the workarounds below are recommended to reduce false positives. For reference, this is the default alert definition:

[DMC Alert - Search Peer Not Responding]
counttype = number of events
cron_schedule = 3,8,13,18,23,28,33,38,43,48,53,58 * * * *
description = One or more of your search peers is currently down.
quantity = 0
relation = greater than
search = | rest splunk_server=local /services/search/distributed/peers/ \
| where status!="Up" AND disabled=0 \
| fields peerName, status \
| rename peerName as Instance, status as Status

Workarounds:
There are two possible workarounds.

(1) Increase statusTimeout in distsearch.conf.
The timeout can be set higher; the side effect is simply that it takes longer for a search peer to be considered down.
https://answers.splunk.com/answers/321592/dmc-alert-search-peer-not-responding-how-to-make-t.html

From distsearch.conf.spec:

statusTimeout = <int, in seconds>
* Set connection timeout when gathering a search peer's basic info (/services/server/info).
* Note: Read/write timeouts are automatically set to twice this value.
* Defaults to 10.
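
For example, to triple the timeout, you could set the following in $SPLUNK_HOME/etc/system/local/distsearch.conf on the instance running the DMC and then restart Splunk (30 seconds is an illustrative value; tune it for your environment):

[distributedSearch]
statusTimeout = 30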

(2) Replace the existing search with the following SPL. Using mode=extended exposes each peer's health_status, which reflects the peer's health over the last 10 minutes rather than a single poll:

| rest splunk_server=local /services/search/distributed/peers/ mode=extended
| search health_status != Healthy
| fields peerName, status, status_details, health_status
| rename peerName as Instance, status as "Latest Status", status_details as "Latest Status Details", health_status as "Overall Health (last 10 mins)"

Before changing the alert, check that the SPL works:

1) Go to DMC > Run a Search and run the SPL above.
2) Confirm that results are shown.
3) Go to /opt/splunk/etc/apps/splunk_monitoring_console/default.
4) Open savedsearches.conf.
5) Copy the [DMC Alert - Search Peer Not Responding] stanza.
6) Go to /opt/splunk/etc/system/local.
7) Open savedsearches.conf there and paste the stanza copied in step 5, renaming it to something like [DMC Alert - Search Peer Not Responding2].
8) Replace the value of "search =" with the SPL above (the resulting stanza is shown after this list).
9) Restart Splunk.
10) Confirm the new alert appears under DMC > Settings > Alert Setup.
11) Change its status to Enabled.
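
Putting steps 5-8 together, the new stanza in /opt/splunk/etc/system/local/savedsearches.conf would look like this (a sketch based on the default definition above; adjust the name and schedule to taste):

[DMC Alert - Search Peer Not Responding2]
counttype = number of events
cron_schedule = 3,8,13,18,23,28,33,38,43,48,53,58 * * * *
description = One or more of your search peers is currently down.
quantity = 0
relation = greater than
search = | rest splunk_server=local /services/search/distributed/peers/ mode=extended \
| search health_status != Healthy \
| fields peerName, status, status_details, health_status \
| rename peerName as Instance, status as "Latest Status", status_details as "Latest Status Details", health_status as "Overall Health (last 10 mins)"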
