I had one of my indexers go down a couple of weeks back. Since then, each of my forwarders has been trying to send events to that indexer and failing with errors like:
WARN TcpOutputFd - Connect to 10.1.4.183:9998 failed. Connection refused
So I modified my outputs.conf to remove that target indexer and restarted the (heavy) forwarder. However, I'm still seeing that error. I'm also seeing queueing errors on the forwarder:
INFO TailingProcessor - Could not send data to output queue (parsingQueue), retrying...
I'm thinking that the queue has retained the old indexer and is continuing to attempt event delivery. As I noted, cycling splunkd on the forwarder doesn't make a difference. I also think this is causing delays in sending events to my other indexers (5-15 minutes will go by before any events show up).
How can I alleviate this problem (aside from standing up an indexer on the failed IP noted above)?
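For reference, the change I made to outputs.conf looks roughly like this (the group name and IP addresses below are placeholders, not my actual config):
[tcpout]
defaultGroup = my_indexers

[tcpout:my_indexers]
# 10.1.4.183:9998 removed from this list after that indexer went down
server = 10.1.4.184:9998, 10.1.4.185:9998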
We found a couple of things that were causing similar issues; they may not be exactly what you're seeing.
I did some math and realized that we had some blocking because our Universal Forwarder was hitting the default thruput limit in limits.conf:
[thruput]
maxKBps = 256
So we changed that to 0, which makes it unlimited. Keep in mind this impacts CPU on the host system where the forwarder lives.
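On our forwarder the change looked something like this (a sketch; the stanza goes in a local copy of limits.conf on the forwarder):
[thruput]
# 0 removes the KBps cap entirely; keep an eye on CPU on the forwarder host
maxKBps = 0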
This allowed the forwarder to catch up. I was then able to analyze metrics.log on the forwarder to see how much thruput was actually required (the other option is to do the math up front and estimate how much thruput you need).
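If the forwarder's _internal logs are reaching an indexer, a search along these lines shows the actual thruput over time (field names are based on the group=thruput lines in metrics.log; the host filter is a placeholder for your forwarder):
index=_internal source=*metrics.log* group=thruput host=<your_forwarder>
| timechart avg(instantaneous_kbps) AS avg_kbps max(instantaneous_kbps) AS peak_kbps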
The other thing was that we had to disable useACK in the forwarder's outputs.conf:
[tcpout:mygroup]
useACK = false
This was because the ACKs added thruput overhead of their own and introduced pauses while the forwarder waited for acknowledgements.
So, in conclusion: check metrics.log and take a hard look at where the pipeline is backing up.
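One quick way to see which queue is backing up (again assuming _internal is searchable) is to look for blocked queue lines in metrics.log, something like:
index=_internal source=*metrics.log* group=queue blocked=true
| stats count by host, name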
Hopefully that helps you as well!
Did you find a solution for this? I think I'm seeing the same problem.
Any solution or comment on this? We are in the same situation.
Voting the question up is one way of saying you think this is important.