Why are the queues being filled up on one indexer?

ddrillic
Ultra Champion

In the last day or two, all the queues on one indexer filled up. We bounced it, and now all the queues on another indexer are close to 100%. What could it be?

[screenshot]
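
For reference, the queue fill levels on each indexer can be approximated with a search against metrics.log (a sketch; the 5-minute span and the indexqueue name are just examples, and other queues such as parsingqueue or typingqueue can be substituted):

    index=_internal source=*metrics.log* group=queue name=indexqueue
    | eval fill_pct = round(current_size_kb / max_size_kb * 100, 1)
    | timechart span=5m avg(fill_pct) by host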

ddrillic
Ultra Champion

Normally, for months and months, at this point of the day all the queues would be quite empty. However, h2709 still looks pretty bad -

[screenshot]

I took h2709 out of the forwarder rotation (for the most part), and it took around 25 minutes for all the queues to clear.

[screenshot]

ddrillic
Ultra Champion

After 25 minutes, h2709 queues are fine...

ddrillic
Ultra Champion

Now all the traffic seems to go to another server, h8788 -

[screenshot]

ddrillic
Ultra Champion

The binding with h8788 remained throughout the night, and this server has already processed half a terabyte of data.

ddrillic
Ultra Champion

Thank you @mwirth for working with us!!! So, one forwarder sends us huge amounts of Hadoop/Flume data, and just yesterday we received 1 TB of data from this forwarder.

We end up with a forwarder-indexer binding. How can we avoid it?
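
One way to confirm which forwarder is sending the bulk of the data is to look at the tcpin_connections metrics on the indexers (a sketch; it assumes the hostname and sourceIp fields that tcpin_connections normally reports, and the time range is up to you):

    index=_internal source=*metrics.log* group=tcpin_connections
    | stats sum(kb) as total_KB by hostname, sourceIp
    | sort - total_KB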

mwirth_splunk
Splunk Employee

Usually there are three things that block up queues:

  1. Input volume
  2. Performance
  3. Configuration

In this case, it's pretty clear the indexer in question is getting 2x the instantaneous indexing rate of the other indexers. My question is: is this server usually that much higher than the others?
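
One way to compare the instantaneous indexing rate across the indexers is a search like the following (a sketch; it assumes the default 30-second metrics interval when converting the kb field to MB/s):

    index=_internal source=*metrics.log* group=thruput name=index_thruput
    | eval MBps = kb / 1024 / 30
    | timechart span=5m avg(MBps) by host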

ddrillic
Ultra Champion

Please keep in mind that the issue is now with h2709, but in the past 24 hours it was with h8789, until we bounced it and then it flipped to h2709.

Let me check the indexing rates...

Looking now, the indexing rate of h2709 is much lower, but its queues are almost filled up -

[screenshot]

mwirth_splunk
Splunk Employee
Splunk Employee

Okay, so that means the forwarder(s) in question are successfully sending to multiple indexers; that's great!

Now we need to find out what datasource is causing that indexing bandwidth. Go to the monitoring console and open the "Indexing Performance: Instance" dashboard. Scroll down to the "Estimated Indexing Rate Per Sourcetype" panel and see if there are any outliers.

EDIT:
That feast/famine cycle (where an instance has an enormous indexing rate with full queues, then drops to nearly none) is just the data load balancing to another server and the queues emptying to disk after backing up. Very normal in this circumstance.
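
If the Monitoring Console isn't handy, roughly the same view can be approximated straight from metrics.log (a sketch; per_sourcetype_thruput only records the top series per interval, which is why the panel is an estimate):

    index=_internal source=*metrics.log* group=per_sourcetype_thruput
    | timechart span=5m sum(kb) as KB_indexed by series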

ddrillic
Ultra Champion

That's what we see for h2709 -

[screenshot]

mwirth_splunk
Splunk Employee

Dark purple and dark green look like likely suspects; take note of those sourcetypes. Can you confirm the same spike in indexing load from those sourcetypes on the other hosts during the time windows when they had issues?

This normally happens because a single forwarder is sending a very large amount of bandwidth. It can be addressed in a couple of different ways:
1. Increase the number of threads on the forwarder, since each thread can send to a distinct indexer (see the config sketch below).
2. If the data is coming from a centralized data source (like syslog, etc.), spread the load out between hosts.

For some perspective, 6 MB/s over a 24-hour period (6 MB/s × 86,400 s ≈ 518 GB) comes to over 500 GB/day, which is well outside the recommended 200-250 GB/day per indexer. No wonder the poor servers are struggling!
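
A minimal sketch of the forwarder-side settings for point 1, assuming the "threads" above are the forwarder's parallel ingestion pipelines combined with time-based load balancing (the stanza name, hosts, ports, and values are illustrative only):

    # outputs.conf on the heavy forwarder -- switch indexers on a timer
    # instead of staying bound to a single connection
    # (group name, hosts, and ports below are examples)
    [tcpout:primary_indexers]
    server = h2709:9997, h8788:9997
    forceTimebasedAutoLB = true
    # seconds between load-balancing switches
    autoLBFrequency = 30

    # server.conf on the same forwarder -- extra ingestion pipelines,
    # each of which opens its own connection to an indexer
    [general]
    parallelIngestionPipelines = 2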
