Hard Disk Failure on One Indexer in a Cluster

mark_wymer
Path Finder

Hi all,

Our environment includes, amongst other things, a multisite indexer cluster with three sites. Each site has three indexers, making a total of nine indexers. We also have a replication factor of 3. On each indexer, the hot/warm and cold buckets are on separate filesystems.

On one of the indexers, the filesystem containing the cold buckets suffered a hard disk failure that destroyed the entire filesystem.

My question is: when the disk/filesystem is repaired, will Splunk automatically rebuild the cold buckets from the replicated copies? If it does, will it do so when I start Splunk, or are there maintenance commands that I will need to issue?

Many thanks,
Mark.

nickhills
Ultra Champion

Hi Mark,

Once the filesystem is back (assuming it's just the index filesystem) and you can boot the peer as normal, it should rejoin the cluster.
When it joins, it will share its list of cold buckets (none) with the cluster manager (CM).
The CM will take any steps necessary to bring the cluster back to health. However, if the cluster has already become consistent using the remaining 8 peers, nothing will need to be replicated.

This means that your restored peer will initially have few, if any, cold buckets. That is fine from a cluster health perspective, but it does mean that the indexer will not "pull its weight" for searches that include data in those cold buckets.

To restore an even distribution of buckets across all peers (recommended for optimum performance and fault tolerance), you should rebalance the cluster, which will redistribute bucket copies to that host from the surviving 8 peers.
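
To make the reasoning concrete, here is a toy Python model of the sequence described above: a peer loses its copies, the cluster fixes itself up on the 8 survivors, the peer returns empty, and a rebalance evens things out. This is only a sketch under loud assumptions: it ignores multisite placement rules, hot/warm buckets, and searchable versus non-searchable copies, and the greedy rebalance here is not Splunk's actual algorithm. The peer names (`idx1` to `idx9`) and all function names are made up for illustration.

```python
import random
from collections import Counter

PEERS = [f"idx{i}" for i in range(1, 10)]  # nine indexers; names are hypothetical
REPLICATION_FACTOR = 3


def place_buckets(num_buckets, peers, rng):
    """Give each bucket REPLICATION_FACTOR copies on distinct peers."""
    return {b: set(rng.sample(peers, REPLICATION_FACTOR)) for b in range(num_buckets)}


def lose_peer(buckets, failed):
    """Drop every copy that was held by the failed peer."""
    for holders in buckets.values():
        holders.discard(failed)


def fix_up(buckets, survivors, rng):
    """Re-create missing copies, using surviving peers only."""
    for holders in buckets.values():
        while len(holders) < REPLICATION_FACTOR:
            candidates = [p for p in survivors if p not in holders]
            holders.add(rng.choice(candidates))


def copies_per_peer(buckets, peers):
    """Count bucket copies per peer, including peers that hold none."""
    counts = Counter({p: 0 for p in peers})
    for holders in buckets.values():
        counts.update(holders)
    return counts


def rebalance(buckets, peers):
    """Toy rebalance: move copies from the fullest peer to the emptiest
    until the per-peer counts differ by at most one."""
    while True:
        counts = copies_per_peer(buckets, peers)
        fullest = max(peers, key=counts.__getitem__)
        emptiest = min(peers, key=counts.__getitem__)
        if counts[fullest] - counts[emptiest] <= 1:
            return
        # A bucket held by the fullest peer but not by the emptiest always
        # exists here, because the fullest peer holds at least two more copies.
        movable = next(
            b for b, h in buckets.items() if fullest in h and emptiest not in h
        )
        buckets[movable].discard(fullest)
        buckets[movable].add(emptiest)


def simulate(num_buckets=90, failed="idx9", seed=0):
    rng = random.Random(seed)
    buckets = place_buckets(num_buckets, PEERS, rng)
    lose_peer(buckets, failed)
    fix_up(buckets, [p for p in PEERS if p != failed], rng)
    after_fixup = copies_per_peer(buckets, PEERS)  # restored peer holds nothing
    rebalance(buckets, PEERS)
    after_rebalance = copies_per_peer(buckets, PEERS)
    return buckets, after_fixup, after_rebalance


if __name__ == "__main__":
    _, after_fixup, after_rebalance = simulate()
    print("after fix-up:   ", dict(after_fixup))
    print("after rebalance:", dict(after_rebalance))
```

In this model the cluster is fully healthy after fix-up (every bucket has three copies), yet the restored peer holds none of them until the rebalance step runs, which is the situation the rebalance is meant to correct.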

https://docs.splunk.com/Documentation/Splunk/8.0.1/Indexer/Rebalancethecluster
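
For reference, the linked page describes starting, checking, and stopping a data rebalance with the `splunk rebalance cluster-data` CLI command, run on the manager node. Below is a small, hedged Python sketch that just builds those command lines. The install path `/opt/splunk/bin/splunk` and the helper names are my assumptions, and you should check the flags against the documentation for your Splunk version before relying on them.

```python
import subprocess

SPLUNK_BIN = "/opt/splunk/bin/splunk"  # assumed install path; adjust for your host

VALID_ACTIONS = {"start", "stop", "status"}


def rebalance_command(action, index=None, max_runtime_minutes=None):
    """Build the argument list for `splunk rebalance cluster-data`.

    `index` and `max_runtime_minutes` are optional and only make sense
    together with the "start" action.
    """
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    if action != "start" and (index is not None or max_runtime_minutes is not None):
        raise ValueError("index and max_runtime_minutes apply only to 'start'")

    cmd = [SPLUNK_BIN, "rebalance", "cluster-data", "-action", action]
    if index is not None:
        cmd += ["-index", index]  # rebalance a single index only
    if max_runtime_minutes is not None:
        cmd += ["-max_runtime", str(max_runtime_minutes)]  # limit in minutes
    return cmd


def run_on_manager(cmd):
    """Run the command on the manager node and return its output.

    Hypothetical helper: the Splunk CLI may prompt for credentials,
    so in practice you may prefer to run the command by hand.
    """
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Print the command lines rather than executing them.
    print(" ".join(rebalance_command("start")))
    print(" ".join(rebalance_command("status")))
```

A rebalance moves a lot of data between peers, so it is usually worth running it in a quiet period and checking progress with the `status` action.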

If my comment helps, please give it a thumbs up!

mark_wymer
Path Finder

Thanks for the answer and confirmation, Nick.
