Getting Data In

Heavy forwarders are not auto load-balancing evenly

rjdargi
Explorer

I'm having a problem where I'm not seeing an even distribution of data across my indexers. I have 21 indexers (indexer04-indexer24) receiving data from six heavy forwarders.

My outputs.conf on my heavy forwarders looks like this:

[tcpout:myServerGroup]
autoLBFrequency=15
autoLB=true
disabled=false
forceTimebasedAutoLB=true
writeTimeout=30
maxConnectionsPerIndexer=20
server=indexer04:9996,indexer05:9996,indexer06:9996,<snip>,indexer24:9996

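For reference, one way to check whether a forwarder is actually rotating across its targets (roughly every autoLBFrequency seconds) is to chart the forwarder's own metrics.log output stats. A minimal sketch, assuming the standard group=tcpout_connections fields (kb, destIp), which can vary a bit by version:

index=_internal source=*metrics.log* group=tcpout_connections
| timechart span=5m limit=0 sum(kb) by destIp

If the per-indexer lines take turns rather than all moving together, the rotation itself is working.
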
However, when I run a simple test search, for example

index=main earliest=-1h@h latest=now | stats count by splunk_server | sort count desc

The event count is massively disproportionate across the indexers: indexer13 has twice the events of the next-busiest indexer, and the least busy indexers have only about a sixth of indexer13's events. Likewise, our external hardware monitoring shows indexer13 carrying a heavier load.

I've stopped indexer13 temporarily and the other indexers pick up the slack, but as soon as I bring indexer13 back up it becomes the king of traffic again.

I've broken it down by heavy forwarder, and every single one of them sends more events to indexer13 as well. I'm at a loss; indexer04-indexer24 all share the same configuration, though indexer13-indexer24 are beefier on the hardware side since they are newer builds.
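
A sketch of the kind of search that shows this breakdown, assuming the usual group=tcpin_connections metrics fields (hostname being the sending forwarder and host the receiving indexer; names may differ by version):

index=_internal source=*metrics.log* group=tcpin_connections
| stats sum(kb) AS total_kb by hostname, host
| sort hostname -total_kb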

Are there any settings I'm perhaps missing to get this evenly distributed to my indexers?

1 Solution

rjdargi
Explorer

The issue here ended up being that we were running a version of the heavy forwarders that had a bug: they would regularly pick a single indexer preferentially over all others. We're still in the Splunk 5 world, so we went forward a few releases and the problem was solved.
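
If anyone else runs into this, a sketch of one way to see which Splunk versions the forwarders are running from the search head is below; it assumes the version field that newer releases include in group=tcpin_connections metrics, so it may not work on very old builds. Running $SPLUNK_HOME/bin/splunk version on each forwarder is the fallback.

index=_internal source=*metrics.log* group=tcpin_connections
| stats latest(version) AS forwarder_version by hostname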

jofe
Explorer

The autoLB feature should function pretty well when looked at over a longer timespan, so there is probably some other factor at play here.

My question to you is: is there something about indexer13 that makes it capable of receiving more data in a shorter time than the others? Here are some suggestions.

Faster network cards (10 Gbit vs. 1 Gbit), or trunking/bonding on indexer13's network cards? Something like that?
Power-saving features disabled on that indexer but not on the others?
Are there routing differences or different VLANs for the indexers with different load?
Is there packet loss on some of the connections to the indexers?
Is there queue blocking going on on some of the indexers receiving little data? (See the example search below.)

This could have many different causes, but it is probably not related to the configuration on the heavy forwarders. 🙂
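
On the queue point, a sketch of the kind of search that can surface blocked queues (it assumes the usual blocked=true marker in metrics.log group=queue lines):

index=_internal source=*metrics.log* group=queue blocked=true
| stats count by host, name
| sort -count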
