Multi-site Indexer rolling restart - indexer fails to restart/timeout

DEAD_BEEF
Builder

Using Splunk 7.3.3, I initiated a rolling restart from the cluster master (multi-site indexer cluster) and the first indexer began to restart. It showed "batch adding", and then the Indexer Clustering: Master Node page showed that the indexer had failed to restart:

[Mon Feb  2 12:47:52 2020] Failed to restart peer=<GUID> peer_name=<hostname>. Moving to failed peer group and continuing.
[Mon Feb  2 12:47:52 2020] Failing peer=<GUID> peer_name=<hostname> timed out while trying to restart.

I pinged this indexer from the CM and it responded fine. Connectivity was not an issue before the rolling restart, and network connectivity still appears to be working.

  • Is there a timeout window or setting I can adjust to better accommodate network latency and give the CM more time to reach the peer?
  • What does this mean for my rolling restart? Will the remaining peers be restarted, leaving me to restart this one manually?
  • How can I list this "failed peer group" to see all systems that may fail to restart?
1 Solution

maraman_splunk
Splunk Employee

Hi, you are probably looking for the restart timeout setting on the CM:

[clustering]
restart_timeout = time_in_sec
# Default is 60. It is probably a good idea to increase this substantially (to keep the cluster from going into fix mode), but still adapt it to the time an idx usually takes to restart.
# Use something like 3600 if you really want to avoid that, but obviously, if an idx crashes in the middle of the restart, it will take longer to detect.

hmallett
Path Finder

From the error, if the indexer did restart without manual intervention, I would guess that the restart of the indexer took longer than the restart_timeout defined in the cluster master's server.conf. By default this is set to 60 seconds, and I have seen indexers take much longer than this to restart.

Can you see from splunkd.log on the indexer how long the restart actually took? If it's longer than 60 seconds, then you might want to extend your restart_timeout (https://docs.splunk.com/Documentation/Splunk/7.3.3/Indexer/Userollingrestart#Handle_slow_restarts)

DEAD_BEEF
Builder

Most indexers were taking 15-20 mins. I will try adjusting the restart_timeout value, but this is the first time I've seen these errors, and I have restarted this cluster many times with each indexer taking 15-20 mins, just like always. That's what prompted me to ask about this issue.

So this setting needs to be changed on the CM's server.conf, not the indexers themselves?

DEAD_BEEF
Builder

I will try adjusting this. Each idx takes 15-20 mins on average and my current timeout setting is 15 mins, so maybe I should just expand it to 30 mins to be safe?

maraman_splunk
Splunk Employee

Yes, it is a CM setting. 30 minutes (1800 s) seems appropriate for your environment.
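
For reference, the 30-minute value discussed here could look roughly like this in the [clustering] stanza of the CM's server.conf. This is only a sketch: the stanza and setting name come from the answer above, while the exact file location (for example a local server.conf under the CM's Splunk installation) is an assumption and depends on how your configuration is managed.

```
# server.conf on the cluster master (CM), not on the indexers
[clustering]
# 30 minutes expressed in seconds (30 * 60 = 1800)
restart_timeout = 1800
```

The value should stay above the longest restart you normally observe (15-20 mins in this thread), keeping in mind that a larger timeout also delays detection of a peer that really did fail mid-restart.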

DEAD_BEEF
Builder

Just finished a rolling restart with no errors after increasing the timeout to 30 mins. Thank you both for the assistance!
