Multi-Site Cluster Administration Questions.

TheColorBlack · 09-28-2020

Evening Splunk Community,

I'm a new Splunk administrator (less than six months in) and I have a couple of questions I could really use your help with. Over the past several months my team and I have been working to lift and shift our production environments from physical data centers up to AWS.

We use AWS CloudFormation to spin up these environments from top to bottom, ideally with as little manual work as possible. I'm having trouble finding ways to keep my Splunk deployment and configurations as data-center agnostic as possible to achieve this goal.

We currently have the following multi-site architecture deployed in AWS. Each region consists of two primary VPCs: a production app VPC and a shared services VPC. With the exception of some heavy forwarders, the Splunk infrastructure in these environments is housed in the shared services VPC:

Region A (Active):
- 1x Standalone Search Head
- 1x Combined Deployment Server / License Server / Monitoring Console
- 1x Cluster Master
- 3x Clustered Indexers (Clustered with region B for event replication)
- 4x Load Balanced Syslog-NG Collectors / Heavy Forwarder Combo

Region B (Failover):
- 1x Standalone Search Head (Offline unless needed)
- 1x Combined Deployment Server / License Server / Monitoring Console (Offline unless needed)
- 1x Cluster Master (Offline unless needed)
- 3x Clustered Indexers (Clustered with region A for event replication)
- 4x Load Balanced Syslog-NG Collectors / Heavy Forwarder Combo

Here are my questions:

1) Our environment architect would prefer that there are no external dependencies in our production app tier. Currently the heavy forwarders in my app tier contact the primary cluster master to receive the list of available indexers to send their events to.

Would it be wise to write a custom outputs.conf that manually defines the indexers, eliminating the need for my HFs to contact the cluster master on startup? What would the cons of this approach be?

It's worth noting that my HFs are only able to forward events to the indexers within their own region.

2) In a similar vein to question one, is it possible to have more than one license server / deployment server active at a time? We would like to eliminate as many of Splunk's dependencies on the opposite region as possible. Ideally region A could have its own DS / LS / MC and never need to contact region B's DS / LS / MC, or vice versa.

3) Is there any clean way to handle index replication for data DR / HA purposes across indexers that are not in a multi-site cluster? My problems seem like they would be solved if I simply split our individual AWS regions into their own independent Splunk deployments.

Would something like having each region back its events up to SmartStore / AWS S3, then pulling the data into the other region's indexers, work?

Any advice you could provide me would be more than appreciated. I'm here to learn. Thank you all for your time.

richgalloway · 09-30-2020

It would take a major catastrophe to bring down an entire AWS region. Spreading an app across AZs in a single region is sufficient in most cases. It all depends on your risk tolerance, of course.

Indexer replication is key to data protection, but it doesn't have to replicate to another region. The copies can be in other zones.

I understand the goal of minimizing outside dependencies. Splunk, however, is not fully HA and doesn't try to be. For instance, there can be only one CM and there is no built-in mechanism for a hot CM to keep a cold CM current. Fortunately, that's not a problem since a fresh CM can easily rebuild its state with information supplied by the indexers. The indexers just need to know where the new CM is and that can be done using DNS (or other networking tricks).

Patient: Doctor, it hurts when I do this.
Doctor: Well, don't do that.

If a Splunk component cannot communicate with another, necessary Splunk component then there's something wrong with the architecture. Firewall rules or other changes need to be made so components can talk to each other as intended.

Requiring forwarders to send data only to local indexers is reasonable and commonplace. It works well if the local indexers can replicate data to remote indexers.

Requiring forwarders to talk only to a local DS/LM/CM is also common, mainly because most customers have only one. If the DS/LM/CM fails then the forwarder continues to function using the most recent configuration it has until the server is restored.

I like your idea #2.

Avoid using intermediate forwarders as in your idea #3. That adds complexity and can hamper performance.

Stick with your multi-site cluster for ensuring your data exists in two places.

---
If this reply helps you, Karma would be appreciated.

richgalloway · 09-29-2020

Have you considered the costs of sending all of your indexed data between regions (indexer replication)? It may make better economic sense to put the failover cluster in a different AZ within the same region.

I wouldn't say it's wise or not, but if it meets a need then go ahead and manually define the indexers in outputs.conf. Keep in mind you are committing yourself to maintaining the files whenever the indexers change.
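As a rough illustration of what manually defining the indexers involves, here is a minimal Python sketch that renders a static outputs.conf for one region's heavy forwarders. The group name, hostnames, and the choice to enable useACK are assumptions made for the example, not details from this thread; 9997 is only the conventional receiving port and may differ in your environment.

```python
from typing import Sequence


def render_outputs_conf(group: str, indexers: Sequence[str], port: int = 9997) -> str:
    """Render a static outputs.conf that lists one region's indexers explicitly."""
    if not indexers:
        raise ValueError("at least one indexer is required")
    # Splunk load-balances across the comma-separated server list.
    servers = ", ".join(f"{host}:{port}" for host in indexers)
    lines = [
        "[tcpout]",
        f"defaultGroup = {group}",
        "",
        f"[tcpout:{group}]",
        f"server = {servers}",
        # Assumption: indexer acknowledgment is wanted; drop this line if not.
        "useACK = true",
        "",
    ]
    return "\n".join(lines)


# Hypothetical hostnames for the three Region A indexers.
REGION_A_INDEXERS = [
    "idx1.region-a.example.internal",
    "idx2.region-a.example.internal",
    "idx3.region-a.example.internal",
]

conf = render_outputs_conf("region_a_indexers", REGION_A_INDEXERS)
print(conf)
```

Generating the file from a list keeps the maintenance burden in one place: when an indexer is added or replaced, only the list changes and the file is re-rendered and redeployed.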

You can have only 1 license master at a time. It's possible to have multiple deployment servers, but each forwarder will report to and be controlled by a single DS. Each DS will have similar, but different, configurations, of course.

Indexer clusters do not have to be multi-site to protect your data. A single-site cluster in a highly-reliable environment such as AWS should be fine. What is "unclean" about a single-site cluster?

SmartStore is a feature to help reduce storage costs. It is not a DR/HA feature. Each cluster needs to have its own SmartStore.


TheColorBlack · 09-29-2020

Hey Rich,

Thank you for taking the time to respond to my questions. Genuinely appreciated.

I have not considered the costs of sending all of our indexed data between regions using indexer replication. Like I said earlier, I'm very green to AWS and to Splunk. I've been drinking from the learning firehose for 12+ hours a day since early February, and all I know is that I know nothing. At the moment we're only indexing ~25GB of data a day, so nothing record-shattering; I don't believe the cost of replicating that data cross-region would be all that high.
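For a rough sense of scale, the arithmetic can be sketched as below. The per-GB transfer rate and the number of remote copies are placeholder assumptions, not quoted AWS pricing or our actual replication settings, and the sketch treats replicated volume as equal to raw ingest, which ignores compression and index overhead.

```python
def monthly_replication_cost(
    daily_ingest_gb: float,
    remote_copies: int,
    price_per_gb: float,
    days: int = 30,
) -> float:
    """Estimate the monthly cross-region transfer cost of indexer replication.

    Assumes each remote copy moves roughly the daily ingest volume across
    regions once; the real on-the-wire volume will differ.
    """
    return daily_ingest_gb * remote_copies * price_per_gb * days


# Placeholder inputs: 25 GB/day from this post, one remote copy,
# and a hypothetical inter-region rate of $0.02 per GB.
cost = monthly_replication_cost(daily_ingest_gb=25, remote_copies=1, price_per_gb=0.02)
print(f"Estimated monthly transfer cost: ${cost:.2f}")
```

Plugging in the current rate for the actual region pair and the real number of cross-site copies would give a more trustworthy figure.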

Our current indexer architecture is 3 indexers in each region, and each indexer lives within its own availability zone within that region. The indexers are clustered cross-region for the purposes of DR / HA in terms of data availability. If we ever lose an entire region in AWS, we still need to be able to search against events logged within that region. If you don't use indexer replication to replicate data, how would you ensure your data is highly available and searchable across all regions in the event you lost all of your indexers within an affected region?

With regard to hard-coding or dynamically creating the configurations for my heavy forwarders, I realize this solution may not be optimal. However, the indexers should remain static once they are created, and the forwarders within a specific region should only ever forward to the indexers within that region. The main problem I'm trying to solve here is reducing the number of Splunk dependencies that exist between regions.

For example, my current architecture requires all of my Splunk forwarders to check in with the deployment server / cluster master to pull their configs and retrieve a list of indexers to forward events on to (standard stuff AFAIK). However, the heavy forwarders that live within our production application VPCs cannot reach out cross-region to the shared services VPC, which is where the primary Splunk DS / LS / MC and CM live.

To make things concrete, if I'm operating production out of region B, I would fail over the primary DS / LS / MC from region A shared services to region B shared services. The heavy forwarders in my prod subnets in region B will be able to pull their configurations from the region B DS / LS / MC / CM, as that communication is directly allowed. However, any heavy forwarders still running in region A's prod subnets will lose their ability to speak to the DS / LS / MC / CM, since prod region A is not allowed to speak to shared services region B.

Could you recommend any solutions that would allow us to avoid this pitfall? I'm currently toying with the following ideas:

1) I move the Splunk HFs out of my production app tier and have them live solely in the shared services tier. Our prod apps (mainly Docker containers) would then send events via HEC straight to the shared services tier. The shared services tier will always be allowed to communicate cross-region, meaning the HFs will always be able to speak to the active DS / LS / MC / CM, ensuring they always have a list of active indexer peers and up-to-date configs from the deployment server.

2) For my prod tier HFs, I bootstrap these systems when they are created via CloudFormation with the configurations required for the HF to begin listening for incoming events over HEC, and the configs required for that forwarder to forward on to the indexers in that region.

3) In a similar vein to 2, I could create specific HF configurations for my prod tier that tell my prod tier HFs to forward events to the shared services VPC HFs. From there the shared services HFs could forward on to the necessary indexers. Again, the shared services HFs will always be able to speak to the DS / LS / MC / CM cross-region.

I almost wonder if it would be easier to re-architect our multi-site cluster into single standalone sites to eliminate each region's dependencies on the other. The question then becomes how to make the data from region A available in region B and vice versa.
