Solved: Hard disk Failure on One Index in a Cluster

mark_wymer · ‎02-11-2020

Hi all,

Our environment consists of, amongst other things, a multisite (3) clustered environment. Each site has three indexers making a total of nine indexers. We also have a replication factor of 3. On each indexer the hot/warm and cold buckets are on separate filesystems.

On one of the indexers, the filesystem containing the cold buckets suffered a hard disk failure which has destroyed the entire FS.

My question is: when the disk/filesystem is repaired, will Splunk automatically rebuild the cold buckets from the replicated copies? If it does, will it do so when I start Splunk, or are there maintenance commands that I will need to issue?

Many thanks,
Mark.

nickhills · ‎02-11-2020

Hi Mark,

Once the filesystem is back (assuming it's just the index filesystem) and you can boot the peer as normal, it should rejoin the cluster.
When it joins, it will share its list of cold buckets (none) with the CM.
The CM will take any steps necessary to bring the cluster back to health; however, if the cluster has already become consistent (using the remaining 8 hosts), nothing will need to be replicated.

This means that your restored peer will initially have very few (if any) cold buckets. This is fine from a cluster health perspective, but it does mean that the indexer will not "pull its weight" for searches that include data in those cold buckets.

To restore an even distribution of buckets across all peers (recommended for optimum performance and tolerance), you should run a rebalance on the cluster, which will copy buckets to that host from the surviving 8 peers.

https://docs.splunk.com/Documentation/Splunk/8.0.1/Indexer/Rebalancethecluster
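
As a rough sketch of what that looks like on the command line (run on the CM, not on the restored peer): check cluster health first, then start the rebalance and poll its status. The `/opt/splunk` default for `SPLUNK_HOME`, the `run` helper, and the `DRY_RUN` switch are assumptions added for illustration; by default the script only prints the commands so you can review them before running anything.

```shell
#!/bin/sh
# Assumption: Splunk is installed under /opt/splunk unless SPLUNK_HOME is set.
SPLUNK="${SPLUNK_HOME:-/opt/splunk}/bin/splunk"

# Hypothetical helper: print the command when DRY_RUN=1 (the default),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rebalance_restored_peer() {
    # 1. Confirm the restored peer has rejoined and the cluster is healthy.
    run "$SPLUNK" show cluster-status
    # 2. Start a data rebalance so buckets are spread across all nine peers again.
    run "$SPLUNK" rebalance cluster-data -action start
    # 3. Check on progress (repeat as needed until it completes).
    run "$SPLUNK" rebalance cluster-data -action status
}

rebalance_restored_peer
```

Set `DRY_RUN=0` to actually execute the commands; Splunk will prompt for admin credentials if you are not already logged in. Rebalancing moves a lot of data, so it is probably best started outside peak search hours.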

If my comment helps, please give it a thumbs up!

mark_wymer · ‎02-11-2020

Thanks for the answer / confirmation Nick.
