Deployment Architecture

What conditions will the cluster master wait for when scheduling restarts of indexer cluster peers with "splunk apply cluster-bundle"?

gavsdavs_GR
Path Finder

Can someone describe the conditions the cluster master waits for when scheduling restarts of cluster peers after I have run splunk apply cluster-bundle?

We have 8 peers in total.
3 in site1, 2 in site2, 3 in site3.

We have not varied the percent_peers_to_restart value from its default of 10 percent.
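
For reference, that setting lives in the [clustering] stanza of server.conf on the cluster master; if I have it right, the default looks like this:

[clustering]
percent_peers_to_restart = 10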

When we run splunk apply cluster-bundle and the CM calls a restart on the 8 cluster peers, we regularly see more than one indexer down at once, and often more than one down in the same site.

As I understand it, this should not happen - hence my wanting to understand what the CM waits for before starting the next restart.

I have increased the following setting from its default:

[clustering]
restart_timeout = 300

Of our 8 peers, 6 restart in 5-7 minutes, but 2 take up to 20 minutes.
By "start", I mean they come back up, check what buckets they have in place, and report them to the cluster master.

It does not look like the CM waits for the peers to complete that activity before kicking off the restart of the next peer, so we generally get people running searches and getting incomplete results warnings.
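
(For what it's worth, I've been watching the peer states from the CM while this happens with what I believe is the right command, run on the cluster master:

splunk show cluster-status

which lists each peer and whether it is Up, Down, or Restarting.)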

Thanks

nnmiller
Contributor

Currently, if the IDX takes longer than the restart_timeout to come back on-line, the CM marks the IDX as "down". Counter-intuitively, this frees up a slot in the CM's IDX restart queue, and it moves to the next IDX. Of course, this impacts the total number of IDXes available.
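
If you need more headroom, restart_timeout lives in the [clustering] stanza of server.conf on the CM; the value below is only an example, not a recommendation:

[clustering]
restart_timeout = 1200

I believe you can also change it from the CLI on the CM with splunk edit cluster-config -restart_timeout 1200, but verify the syntax against the docs for your version.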

In addition, the CM does not take into account SF or RF when doing a rolling restart. For example, it doesn't check whether the IDX it's about to restart holds the last searchable copy of a bucket while other IDXes are marked as down.

The only way to avoid this issue and keep the data 100% searchable throughout a maintenance window like this is to make a multi-site cluster (which could be in the same DC) and use the -site-by-site flag, as described in the link mbrown posted below.
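
From memory, the multisite-aware invocation is run on the CM and looks roughly like this (double-check the exact flag syntax in the rolling restart docs for your version):

splunk rolling-restart cluster-peers -site-by-site true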

If you have large numbers of buckets per IDX, this can increase the amount of time it takes an IDX to restart. Generally, you will see fewer issues if you keep the IDXes at under 100K buckets each. One of our primary index cluster developers gave a talk at our users conference last year and provided some recommendations for cluster tuning based on bucket counts. Slide 15 has a table with recommendations on tuning 'service_interval' (on the CM), 'heartbeat_period' (on IDXes), and 'heartbeat_timeout' (on the CM), as well as a few other settings.

You can read the slides here: https://conf.splunk.com/session/2015/conf2015_Dxu_Splunk__Deploying_IndexerClusteringTips.pdf
A recording of the talk is available here:
https://conf.splunk.com/session/2015/recordings/2015-splunk-68.mp4
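
If you want a rough idea of where you stand on bucket counts, a search along these lines (a sketch - the field names are from dbinspect output as I recall them) gives an approximate bucket count per indexer:

| dbinspect index=* | stats dc(bucketId) AS buckets BY splunk_server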

As I stated in my comment above, if you have large numbers of buckets with timestamp issues, this can cause problems with cluster rolling restarts. If a timestamp is far in the past compared to the current time, and you're using time-based retention, this will cause buckets to roll prematurely, creating many small buckets.

At a minimum, consider running splunk remove excess-buckets [index-name] periodically, particularly if you have had any IDX outages, as the CM does not remove excess buckets automatically.
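
If you want to see what would be affected first, I believe there is a companion command you can run on the CM before committing to the removal:

splunk list excess-buckets [index-name]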

gavsdavs_GR
Path Finder

Thanks for the links to the .conf slides, I'll take a look.

Your comments on the possible reasons why you might see many indexers down at once all seem alarmingly familiar.
1. An indexer taking longer than "restart_timeout" to restart permits the next indexer to be restarted. We have "restart_timeout" set to 600, which is fine for all but two of our indexers. I will increase it. This will help.
2. Indexers with more than 100,000 buckets. Check 😞
3. Our oldest indexers have crufty NTFS filesystems which appear to exhibit substantially slower IO than our newer indexers (stat-ing 100,000 buckets takes a lot longer on two of them than on the others).
4. Large numbers of buckets with timestamping issues - Check 😞 Now fixed, but the old buckets are still in there as they haven't expired out yet (see the search sketch after this list).
5. We still use the default restart percentage of 10 percent, and there is no sense in changing that. With 8 indexers spread across three sites, we should only ever see a single indexer down at a time, so we must be exceeding the "restart_timeout" on several indexers.
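
Here's a sketch of the kind of search I've been using to gauge point 4 - the field names come from dbinspect, so treat it as illustrative rather than definitive:

| dbinspect index=* | stats count AS buckets avg(sizeOnDiskMB) AS avg_bucket_mb BY index | sort - buckets

Indexes with very high bucket counts and a small average bucket size are the ones where the timestamping problems did their damage.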

With regards to "why aren't you using the -site-by-site flag" - well, I can't, because I'm not running rolling-restart, I'm running apply cluster-bundle, which doesn't let me use that flag and relies on the built-in heuristics - it invokes the rolling restart without giving me the chance to say "-site-by-site".

I keep on top of the surplus buckets, it doesn't seem to have got too bad in that respect.


nmiller_splunk
Splunk Employee

I suggest opening a support case requesting the site-by-site flag be added as an option to apply-cluster-bundle. In the case, ask for it to be assigned to me, and I'll submit it.


mbrown_splunk
Splunk Employee

To expand on @lohitkidu's answer: by default the rolling restart is not site aware and needs to be invoked with site awareness for multisite clusters.

Details of this can be located within the "Managing Indexers and Clusters of Indexers" documentation: http://docs.splunk.com/Documentation/Splunk/6.4.0/Indexer/Userollingrestart

nnmiller
Contributor

Another point: if you have large numbers of buckets with timestamp issues, this can cause problems with cluster rolling restarts. Reducing the overall number of buckets in the cluster can help reduce restart times. At a minimum, consider running:

splunk remove excess-buckets [index-name]

lohitkidu
Path Finder

This should help: http://docs.splunk.com/Documentation/Splunk/6.4.0/Indexer/Updatepeerconfigurations

Also, I think Splunk by default restarts 10 percent of the total peers at a time.

gavsdavs_GR
Path Finder

Does anyone have a comment on this?

I'm looking at our CM now and we have 6 of our 8 indexers down at once.

Under what conditions do these restarts actually occur?
