Deployment Architecture

apply cluster-bundle

mathu
Path Finder

Hi

Taking a cluster peer out of the cluster with the "offline" command works great (version 5.0.2). The entire cluster remains searchable during the shutdown process (replication factor = 3, search factor = 2).

However, it seems that the command ./splunk apply cluster-bundle, which updates the cluster configs on the peer nodes, does not restart the instances using the offline mechanism. During the apply process, I get some ugly error messages on the search head, like "failed to start search on peer ..." or "Failed to start the search process", etc.

What's the correct way to update the peer configuration without interrupting search commands?

Kind regards
Mathias

1 Solution

svasan_splunk
Splunk Employee

In the 5.0 release, rolling-restart, apply, and "rolling offline" - i.e. offlining all peers one after the other, one at a time - are all not search-safe. Updating the configuration cluster-wide via apply really does behave like a "maintenance mode": data is safe, but it may not be searchable during the rolling restart. After the rolling restart completes, the cluster should be searchable again (I believe the master commits a new generation afterwards). The docs don't seem to state this explicitly; I'll try to get them updated.
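
For context, the cluster-wide update is driven from the master: the changed configs go into the master's configuration bundle directory, and apply pushes them out to the peers before triggering the rolling restart. A minimal sketch of that invocation (a default $SPLUNK_HOME is assumed; the app name is made up for illustration):

# On the cluster master: stage the changed configs in the configuration bundle
cp props.conf transforms.conf $SPLUNK_HOME/etc/master-apps/my_parsing_app/local/

# Push the bundle to the peers; in 5.0 this also kicks off the (not search-safe) rolling restart
$SPLUNK_HOME/bin/splunk apply cluster-bundle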

Also, we are working to fix the limitations detailed below.

To explain what is going on a bit more:

Every peer is potentially both the source and the target of ongoing hot bucket replications: it originates some hot buckets that are replicated to other peers, and it is the target (and potentially the searchable target; this is the problematic case) for hot buckets originating on other peers. Each peer is also the primary for the hot buckets it originates. When we offline a peer - say peer A - it cleanly rolls the hot buckets it originates and transfers primary responsibility for those buckets (along with any other warm buckets it is primary for) to other peers. It doesn't worry about a hot bucket - say bucket B1 - for which it is the searchable streaming target but which originates on some other peer, because that source is still up and is responsible for searching that bucket. So offlining one peer works by fixing up the hot buckets the peer originates and not worrying about the hot buckets it is receiving. For a rolling restart, though, those receiving-side buckets do come into the picture.
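
To make the single-peer case concrete, the search-safe, one-peer-at-a-time procedure looks roughly like this from the command line (a sketch run on the peer itself, assuming a default install):

# On the peer being taken out: roll its hot buckets, hand off primaries, then shut down
$SPLUNK_HOME/bin/splunk offline

# ...do whatever maintenance is needed on that peer...

# Bring the peer back; it re-registers with the master, which then fixes up its buckets
$SPLUNK_HOME/bin/splunk start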

Now when peer A comes back, its copy of bucket B1 might be invalid. In the 5.0 release, we don't fix up the bucket mid-stream - i.e. catch up on the data that has already been indexed while also keeping track of data that is still flowing into that bucket. Instead, the source rolls the bucket at that point. We also cannot fix up the search metadata files mid-stream. The copy on the peer that just restarted is likely invalid and is discarded, and the master then fixes up the bucket. If the discarded copy was a searchable copy, another copy has to be made searchable, which can take a while depending on the size of the bucket. During this time, with SF=2, the source of B1 is the only peer with a valid searchable copy of B1. If the source of B1 also goes offline, then there is no searchable copy of the bucket online while the source is restarting. (Another copy is being made searchable, but it may not have finished yet; the source, which had the only complete searchable copy, has gone offline.) So: data is not lost, but there may be no searchable copy online at that point.

Since in a cluster every peer is likely the searchable target for some bucket, and every peer is going to go offline at some point or another, the above situation is likely true for one or more buckets throughout the rolling restart process. So the cluster itself won't be search-safe during the rolling restart.

Hope that helps explain what is going on. If you have more questions, ask away. And, hopefully, updating the config cluster-wide is infrequent enough for you to be able to treat it as downtime for searches. We are working to fix this going forward.


tprzelom
Path Finder

In some cases you'll be able to run

mysearch | extract reload=t

This reprocesses the props.conf file and begins extracting fields without needing a restart.


joebensimo
Path Finder

Has this been fixed?

What is the impact on search while the restart is going on? Are incomplete results returned?


Ricapar
Communicator

Has there been any update on this? Any plans to have a search-safe cluster restart?

Like @mathu, I have a similar use case where we're updating props.conf frequently. Ideally, we'd like to be able to do this on an ad-hoc basis so that people don't have to wait until the next day to start seeing their data parsed properly.

However, that conflicts with the people who are already using Splunk - we cannot simply go and break searches on the Search Heads while they're using them.


mathu
Path Finder

Very helpful answer, thanks a lot.

Unfortunately, we have a lot of different use cases in our configuration. That means we update props.conf and transforms.conf quite frequently, e.g. to configure correct TIMESTAMP and LINEMERGE behaviour (roughly the kind of stanza sketched below). Third-party software is not really aware of the CIM...
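
For illustration, the index-time settings we keep adjusting look roughly like this (the sourcetype name and values are made up):

[thirdparty:app:log]
TIME_PREFIX = ^\[
TIME_FORMAT = %Y-%m-%d %H:%M:%S,%3N
MAX_TIMESTAMP_LOOKAHEAD = 30
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)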

In addition, we have a lot of users who are implementing alerts based on real-time searches. They ask for a (near) 100% available search head.

These are some of the reasons I'm waiting impatiently for a search-safe cluster-wide restart.

Kind regards
Mathias

lcshared
Explorer

Thanks for this detailed answer!


svasan_splunk
Splunk Employee

mathu,

In the 5.0.x release, the apply command does the same thing as the rolling-restart command (after pushing the bundle), which is not search-safe.
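
For reference, the same rolling restart can also be triggered on its own from the master, which is a way to observe the restart behaviour separately from the bundle push (a sketch, assuming a default install):

# On the cluster master: restart all peers in a rolling fashion (not search-safe in 5.0.x)
$SPLUNK_HOME/bin/splunk rolling-restart cluster-peers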

That specific error message is a known issue (SPL-52430).

lcshared
Explorer

Bump - any news on this?


mathu
Path Finder

Is there a supported way to update the peer configuration gracefully without the "apply cluster-bundle" command? I.e., do it manually and then use the offline command peer by peer (roughly the sequence sketched below).
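
For what it's worth, the manual sequence I have in mind would be roughly the following (just a sketch of the idea, not a confirmed or supported procedure; the config location on the peers is an assumption):

# 1. Distribute the changed props.conf/transforms.conf to every peer by hand
#    (e.g. into an app under $SPLUNK_HOME/etc/slave-apps/ - assumed location)
# 2. Then, one peer at a time:
$SPLUNK_HOME/bin/splunk offline   # take the peer out gracefully
$SPLUNK_HOME/bin/splunk start     # bring it back and let the master fix up its buckets
# 3. Wait for the cluster to finish fix-up before moving on to the next peer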
