Getting Data In

Purge queue on forwarder / indexer down

nocostk
Communicator

I had one of my indexers go down a couple weeks back. Since then, each of my forwarders has been trying to send events to that indexer but failing with errors like

WARN  TcpOutputFd - Connect to 10.1.4.183:9998 failed. Connection refused

So I modified my outputs.conf to remove that target indexer and restarted the (heavy) forwarder. However, I'm still seeing that error. I'm also seeing queueing errors on the forwarder:

INFO  TailingProcessor - Could not send data to output queue (parsingQueue), retrying...

I'm thinking that the queue has retained the old indexer and is continuing to attempt event delivery. As I noted, cycling splunkd on the forwarder doesn't make a difference. I also think this is causing delays in sending events to my other indexers (5-15 minutes will go by before any events show up).
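
For reference, here's roughly what my outputs.conf looks like after the edit (the group name and the surviving indexer addresses below are examples, not my real ones):

[tcpout]
defaultGroup = my_indexers

[tcpout:my_indexers]
# 10.1.4.183:9998 (the failed indexer) removed from this list
server = 10.1.4.181:9998, 10.1.4.182:9998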

How can I alleviate this problem (aside from standing up an indexer on the failed IP noted above)?

sloshburch
Splunk Employee

We found that a couple of things were causing such issues. These are not necessarily the same issue you're seeing.

I did some math and realized that we had some blocking because our Universal Forwarder was hitting its default limits.conf thruput limit:

[thruput]
maxKBps = 256

So we changed that to 0, which makes it unlimited. Keep in mind this impacts CPU on the host system where the forwarder lives.
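
In other words, the forwarder's limits.conf ends up looking like this:

[thruput]
# 0 = no thruput throttling; keep an eye on CPU on the forwarder host
maxKBps = 0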

This allowed the forwarder to catch up to itself. I was then able to analyze the metrics.log on the forwarder to see what thruput was actually required (the other option is to do the math up front and estimate how much thruput you need).
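
If it helps, one way to eyeball the actual thruput is to grep the thruput samples out of metrics.log (a sketch; the path assumes a default universal forwarder install location):

# run on the forwarder host; the install path is an assumption
grep "group=thruput, name=thruput" /opt/splunkforwarder/var/log/splunk/metrics.log | tail -20

Compare the instantaneous_kbps and average_kbps values in those lines against your maxKBps setting.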
The other thing was that we had to disable useACK in the forwarder's outputs.conf, so it's:

[tcpout:mygroup]
useACK = false

This was because the ACKs added even more thruput overhead and more pauses.

So in conclusion, check out the metrics.log and take a hard look at where the pipeline is backing up.
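
A quick way to spot where the pipeline is backing up is to look for blocked queues in that same metrics.log (again a sketch, assuming a default forwarder install path):

# queues reporting blocked=true are where events are getting stuck
grep "group=queue" /opt/splunkforwarder/var/log/splunk/metrics.log | grep "blocked=true" | tail -20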

Hopefully that helps you as well?!

sloshburch
Splunk Employee

Did you find a solution for this? I think I'm seeing the same problem.

tkiss
Path Finder

Any solution or comment on this? We are in the same situation.

aljohnson_splun
Splunk Employee

Voting the question up is one way of saying you think this is important.
