I have 2 servers, Splunk1 and Splunk2, set up as search peers. How can I use Splunk to monitor when one of the servers goes down or stops responding? I have received messages like the following:
-- Search generated the following messages --
Message Level: ERROR
1. Reading error while waiting for peer SPLUNK2. Search results might be incomplete!
I would like to be alerted when something like this happens. Does anyone have any ideas?
Back in version 3, the main search screen showed a note reading "x of y" servers, for example "5 of 5" servers. If one was not responding, you could pull down a tab and immediately see which one.
This was a good idea, and it meant your users would immediately see any issue. I would like to see it come back.
Here are 2 methods to detect whether a search peer is down or hasn't responded to a search.
First method: pick a search that should always return results, and count the number of servers that respond to it.
Then set up an email alert based on that number (the search head counts as one of the servers).
Schedule the search to run every 5 minutes over the last 2 hours, and use the alert condition:
if number of events is less than X
where X is the number of servers you expect to respond. The search below returns one row per responding server:
index=_internal splunk_server=* | stats count by splunk_server
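As a variation on the search above, the comparison can be moved into the search itself, so the alert only has to fire when any result is returned. This is a minimal sketch, assuming the asker's setup of two peers plus a search head that also indexes its own _internal data (so 3 servers are expected); the field name responding_servers and the threshold 3 are illustrative and need adjusting to your environment.

```
index=_internal earliest=-5m
| stats dc(splunk_server) AS responding_servers
| where responding_servers < 3
```

With this form, the alert condition becomes "if number of events is greater than 0". A shorter time window such as the 5 minutes used here makes the alert react sooner, but only works if every server writes to _internal at least that often.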
Second method: detect a failed connection to a peer after the fact, from the internal logs.
For example, schedule this search to run every 5 minutes over the last 5 minutes, and alert when it returns any events:
index=_internal source=*splunkd.log "Unable to connect to peer"
One remark: a search peer may also fail to respond because long-running searches are hitting the timeout settings. If that is the case, you can increase them.
See connectionTimeout, sendTimeout, and receiveTimeout in distsearch.conf:
http://www.splunk.com/base/Documentation/latest/Admin/Distsearchconf
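For illustration, raising these timeouts on the search head might look like the fragment below. This is only a sketch: the three setting names are the ones mentioned above, but the [distributedSearch] stanza placement and the values shown are assumptions to verify against the distsearch.conf spec for your version, not recommendations.

```
# distsearch.conf on the search head (for example in $SPLUNK_HOME/etc/system/local/)
# Values are in seconds and purely illustrative.
[distributedSearch]
connectionTimeout = 30
sendTimeout = 60
receiveTimeout = 900
```

A restart of Splunk on the search head is normally needed for changes to this file to take effect.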