Thank you, adonio.
OK, sounds like /etc/* and the KV Store are all that need to be backed up. Can I assume the KV Store is the same on the SH and MC, so I should only back up one? And would a restore be to just one (and they will sync), or is it more complicated than that?
You mention using "splunk diag" for backup. I prefer to do differential backup (using s3cmd) and so do not plan on using "splunk diag". Do you think this is OK or did you mention "splunk diag" because it is necessary?
if you can avoid recovering a new VM it will be best
Sure, but it is a scenario that must be planned for. A corrupt machine cannot be recovered as-is, and restoring from a full machine backup is typically a human-speed recovery. I want a machine-speed recovery, which requires automation that replaces the machine with an equivalent. And in the case of a site failure, one cannot use the same machine with the same IP address (different sites of course have different subnet ranges, and even different regions in the case of AWS).
I have a License Master and all machines simply refer to the LM in their server.conf.
Really, this whole difficulty seems due to Splunk not using a database server for its configuration like other enterprise tools. But that's a different topic. 🙂
(D) can you elaborate?
This was in regards to the KV Store backup. I didn't know whether it is redundant with respect to backing up the /etc directory, or whether the KV Store lives outside /etc so one must also back it up using the export command.
Assuming it must also be backed up, my plan is to have a cron job run the KV Store export command with the output file being part of my backup. And upon restore, the automation would import the KV Store backup file using the corresponding CLI command before starting the Splunk service on the recovered machine.
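For the record, here is a minimal sketch of what that cron job could look like, assuming Splunk 7.1+ (which has the `splunk backup kvstore` CLI) and an s3cmd-reachable bucket; the bucket name and paths are hypothetical placeholders:

```shell
#!/bin/sh
# Hypothetical nightly KV Store backup (cron entry: 0 2 * * * /opt/scripts/kvbackup.sh)
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"
ARCHIVE="kvstore-$(date +%Y%m%d)"

# "splunk backup kvstore" writes an archive under
# $SPLUNK_HOME/var/lib/splunk/kvstorebackup/ (available in Splunk 7.1+).
if command -v "$SPLUNK_HOME/bin/splunk" >/dev/null 2>&1; then
  "$SPLUNK_HOME/bin/splunk" backup kvstore -archiveName "$ARCHIVE"
  # Ship the archive alongside the differential /etc backup.
  # Bucket name is a placeholder for illustration.
  s3cmd sync "$SPLUNK_HOME/var/lib/splunk/kvstorebackup/" \
    s3://my-splunk-backups/kvstore/
  # On restore, the automation would run the inverse before first start:
  #   splunk restore kvstore -archiveName "$ARCHIVE"
fi
echo "archive name: $ARCHIVE"
```

This keeps the KV Store archive on the same differential-backup path as /etc, so restore automation only has one place to pull from.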
(E) yes, maintain the pass4symmkey for connecting recovered CM.
The question was with regards to server.conf/sslConfig/sslPassword and not pass4symmkey. The backup/restore of /etc would recover the pass4symmkey value. Reference (2) implies that upon recovery one should not use the backed-up sslPassword value but instead the new value from the newly installed/generated replacement Splunk installation, which seems wrong to me since it would not match the backed-up/recovered certs.
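For clarity, this is the stanza in question; a hypothetical sketch with placeholder values, not my actual config:

```ini
# $SPLUNK_HOME/etc/system/local/server.conf (illustrative values only)
[sslConfig]
# Path to the backed-up server certificate. The sslPassword below is the
# passphrase for this cert's private key, which is why a restored /etc
# seemingly needs the matching backed-up value rather than the value from
# a freshly installed Splunk instance.
serverCert = $SPLUNK_HOME/etc/auth/server.pem
sslPassword = <passphrase-matching-the-backed-up-cert>
```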
On a general note, I would try to avoid changing VMs of the CM, DS, and LM (in your case the same machine) and rather recover the bad one if possible, as their "down" state does not affect Splunk's work:
CM down = no replication and no app pushing -> once up, resync and all good.
DS down = no app pushes -> once up, all good.
LM down = no counts and a warning (you have 72 hours to fix) -> once up, all good.
I have to disagree in the general case.
CM Down: Newly launched hosts (containing the SUF) cannot start sending data until the CM is recovered, assuming the "indexer discovery" model is in use. That is why I am not using that model, but instead using Dynamic DNS so that hosts connect directly to the indexers without the CM being required.
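A minimal sketch of that outputs.conf approach, assuming a round-robin DNS name (hypothetical here) that the Dynamic DNS automation keeps pointed at the live indexers:

```ini
# outputs.conf pushed to forwarders (illustrative; the DNS name is hypothetical)
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
# A single DNS name resolving to the current indexer IPs. The forwarder
# connects directly, so the CM is not on the data path and its outage
# does not block newly launched forwarders.
server = indexers.example.internal:9997
```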
DS Down: Newly launched hosts (containing the SUF) cannot obtain their apps and outputs.conf files, and so will not start sending the correct data until the DS is recovered. Another comment on this topic implies one can run multiple redundant DSes side-by-side behind a load balancer, although I've not found reference to that in the Splunk documentation, and in general the Splunk documentation indicates Splunk is not compatible with redundancy behind a load balancer. And if one has two DSes, how does one modify a deployment app or server classes? Manually on both? I don't see evidence of them auto-syncing such information.
The example case is a site failure. If the site containing the DS and some hosts fails, any hosts auto-recovering at an alternate site ahead of the DS will not start sending data. I'm sure they buffer, but that information will not be usable in Splunk, and if a host restarts before the DS comes up my guess is data is lost. That seems unacceptable in an HA enterprise environment.
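One partial mitigation I'm considering (a sketch only; the DNS alias is hypothetical): point clients at the DS through a DNS name rather than a fixed IP, so a DS rebuilt at a new address can take over once the recovery automation updates DNS. This doesn't close the window while the DS is down, but it does avoid re-baking client configs:

```ini
# deploymentclient.conf baked into the machine image (alias is hypothetical)
[deployment-client]

[target-broker:deploymentServer]
# DNS alias that recovery automation repoints at the current DS address.
targetUri = ds.example.internal:8089
```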
If the Splunk DS function were run like a normal web site/service (multiple stateless servers behind an LB sharing their state in a redundant back-end database server) then it would be HA.
Thank you for sharing your knowledge, which has provided me insight and gotten me farther. Any elaboration or clarification on the above would also be appreciated!
Regards,
Ryan