Getting Data In

Why is my heavy forwarder repeatedly getting "WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 600 seconds."?

hagjos43
Contributor

We are seeing the following messages on our heavy forwarder:

09-05-2014 13:39:06.483 - 0400 INFO TcpOutputProc - Connected to idx= 23.42.214.219:9997
09-05-2014 13:39:06.484 - 0400 WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 600 seconds.
09-05-2014 13:39:36.493 - 0400 INFO TcpOutputProc - Connected to idx= 23.42.214.219:9997
09-05-2014 13:40:06.501 - 0400 INFO TcpOutputProc - Connected to idx= 23.42.214.219:9997
09-05-2014 13:40:36.509 - 0400 INFO TcpOutputProc - Connected to idx= 23.42.214.219:9997
09-05-2014 13:40:39.510 - 0400 WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 700 seconds.
09-05-2014 13:41:06.517 - 0400 INFO TcpOutputProc - Connected to idx= 23.42.214.219:9997
09-05-2014 13:41:36.524 - 0400 INFO TcpOutputProc - Connected to idx= 23.42.214.219:9997
09-05-2014 13:42:06.533 - 0400 INFO TcpOutputProc - Connected to idx= 23.42.214.219:9997
09-05-2014 13:42:19.536 - 0400 WARN TcpOutputProc - Forwarding to indexer group default-autolb-group blocked for 800 seconds.

This continues to repeat up to the current date. Has anyone else experienced this or have any suggestions?

1 Solution

masonmorales
Influencer

From my experience, this is usually due to blocked queues at the indexers. The most common cause is insufficient IOPS/throughput at the indexers' disk subsystem. When a queue is full for a certain length of time on the indexer, the indexer will start rejecting forwarder connections so that it can clear its full queue(s) before processing new events.

Here are some searches you can run against the _internal index of your indexers to find the bottleneck (plus a throughput cross-check after the last one):

View the current queue size:

index=_internal source=*metrics.log group=queue | timechart median(current_size) by name

Find blocked queue events:

index=_internal source=*metrics.log group=queue blocked

Blocked queues in the last 24 hours, by queue and Splunk server:

index=_internal source=*metrics.log sourcetype=splunkd group=queue | eval max=if(isnotnull(max_size_kb),max_size_kb,max_size) | eval curr=if(isnotnull(current_size_kb),current_size_kb,current_size) | eval fill_perc=round((curr/max)*100,2) | eval name=host.":".name | where fill_perc>=99.0 | timechart max(fill_perc) as MaxFillPerc by name useother=false limit=100 minspan=1h

Count how many times queues were >=99% full, by queue name and Splunk server:

index=_internal source=*metrics.log sourcetype=splunkd group=queue | eval max=if(isnotnull(max_size_kb),max_size_kb,max_size) | eval curr=if(isnotnull(current_size_kb),current_size_kb,current_size)  | eval fill_perc=round((curr/max)*100,2) | where fill_perc>=99.0 | stats count by name host  | eval name=case(name=="aggqueue","2 - Aggregation Queue",name=="indexqueue","4 - Indexing Queue",name=="parsingqueue","1 - Parsing Queue",name=="typingqueue","3 - Typing Queue", 1=1, name) 
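
If you want to cross-check whether indexing throughput actually dips while those queues are full, a search along these lines should work (just a sketch; the metrics.log thruput field names can vary slightly between Splunk versions):

index=_internal source=*metrics.log sourcetype=splunkd group=thruput name=index_thruput | timechart span=10m sum(kb) AS indexed_kb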

PGrantham
Path Finder

Try checking your metrics.log on both your HF and indexer.

Do you see any blocked queues (like the parsingqueue or aggqueue or tcpinqueue)?
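
If the HF is still forwarding its own _internal logs, a quick way to compare the two sides is something like this (a rough sketch; substitute your own forwarder and indexer hostnames):

index=_internal (host=<your_hf> OR host=<your_indexer>) source=*metrics.log group=queue blocked | stats count by host, name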

khourihan_splun
Splunk Employee

See this post for steps to troubleshoot: http://answers.splunk.com/answers/189238/how-to-troubleshoot-error-on-splunk-6-universal-fo.html

But in general, I'd use the Splunk on Splunk (SoS) app to diagnose where the bottleneck is. If you are running 6.3, you can use the DMC (Distributed Management Console) to do the same analysis: go to Settings and click the Distributed Management Console icon on the left.

cdupuis123
Path Finder

Hi inters

Yes, I've spent time on the Answers site with similar results, but after using/running Splunk for 3 years now, I've found that if I can't get the answer from Splunk Answers, I've either used the wrong search term, or, more often, I find something close and am able to backwards/sideways engineer it until it fixes my issue. Of course, if all else fails, I call my SE or Support. Good luck with your POC.

inters
Explorer

I am currently evaluating Splunk. I constantly run into errors like this, and answers.splunk.com has no answers, only other frustrated questioners.

Why does anyone use this software???

satishsdange
Builder

Please post your questions. I am sure you will get answers.

djfisher
Explorer

Same here, this just started happening. Is it due to bad bandwidth or too many seconds between collections? I use the *Nix app to collect audit logs using rlog.sh.
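
If it helps, the scripted input's own messages can be checked with something like this (a rough sketch; it just looks for splunkd ExecProcessor log lines that mention rlog.sh):

index=_internal sourcetype=splunkd component=ExecProcessor rlog.sh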

cdupuis123
Path Finder

I don't have the answer, but I've got the same issue!!!! Anyone????
