Hi,
I have universal forwarder monitoring a number of directories and forwarding to an indexer.
On the forwarder, there are repeating entries in the splunkd.log file:
03-04-2013 12:12:39.503 +0000 INFO TailingProcessor - Could not send data to output queue (parsingQueue), retrying...
03-04-2013 12:12:44.506 +0000 INFO TailingProcessor - ...continuing.
03-04-2013 12:12:54.543 +0000 INFO TailingProcessor - Could not send data to output queue (parsingQueue), retrying...
03-04-2013 12:13:09.551 +0000 INFO TailingProcessor - ...continuing.
03-04-2013 12:13:14.568 +0000 INFO TailingProcessor - Could not send data to output queue (parsingQueue), retrying...
03-04-2013 12:13:19.571 +0000 INFO TailingProcessor - ...continuing.
03-04-2013 12:13:29.607 +0000 INFO TailingProcessor - Could not send data to output queue (parsingQueue), retrying...
03-04-2013 12:13:34.609 +0000 INFO TailingProcessor - ...continuing.
03-04-2013 12:13:49.644 +0000 INFO TailingProcessor - Could not send data to output queue (parsingQueue), retrying...
03-04-2013 12:13:54.647 +0000 INFO TailingProcessor - ...continuing.
etc.
The main effect of this seems to be a delay of ~10 mins to data being searchable.
I do not believe the indexer is the bottleneck. I have Splunk on Splunk installed, and according to that the indexer's queues are pretty much zero.
I have increased the persistent queue size to 100 MB on the forwarder, but it still gets the error.
The metrics.log on the forwarder shows that the queues don't seem to be near full by event count (neither the parsingqueue nor the tcpout queue), although the bolded parsingqueue line below is nearly at max_size_kb:
03-04-2013 12:13:42.031 +0000 INFO Metrics - group=queue, name=tcpout_sec-mgr-01_9997, max_size=512000, current_size=65736, largest_size=65736, smallest_size=0
03-04-2013 12:13:42.031 +0000 INFO Metrics - group=queue, name=aeq, max_size_kb=500, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
03-04-2013 12:13:42.031 +0000 INFO Metrics - group=queue, name=aq, max_size_kb=10240, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
03-04-2013 12:13:42.031 +0000 INFO Metrics - group=queue, name=auditqueue, max_size_kb=500, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
03-04-2013 12:13:42.031 +0000 INFO Metrics - group=queue, name=fschangemanager_queue, max_size_kb=5120, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
03-04-2013 12:13:42.031 +0000 INFO Metrics - group=queue, name=indexqueue, max_size_kb=500, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
03-04-2013 12:13:42.031 +0000 INFO Metrics - group=queue, name=nullqueue, max_size_kb=500, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
**03-04-2013 12:13:42.031 +0000 INFO Metrics - group=queue, name=parsingqueue, max_size_kb=102400, current_size_kb=101811, current_size=2434, largest_size=2556, smallest_size=2417**
03-04-2013 12:13:42.031 +0000 INFO Metrics - group=queue, name=tcpin_queue, max_size_kb=500, current_size_kb=0, current_size=0, largest_size=0, smallest_size=0
CPU is low on both boxes.
On the forwarder, splunk list monitor | wc -l gives 14264.
The indexer's metrics.log has no instances of blocked.
The forwarder's metrics.log has a few instances of blocked=true, but current_size is always low compared to max_size, even though current_size_kb is right at max_size_kb. Example:
Metrics - group=queue, name=parsingqueue, blocked=true, max_size_kb=102400, current_size_kb=102399, current_size=1682, largest_size=1689, smallest_size=1662
Any ideas would be really appreciated. I don't know what's causing the slowness or how to fix it.
Indexer discovery used in Multisite clustering
There can be many reasons for this failure, including the ones listed above.
An additional cause of this message is indexer discovery with multisite clustering. When using multisite clustering, every forwarder must be assigned a site. If you wish to avoid site affinity, you may use site0.
The configuration looks like this:
# server.conf
[general]
site = site0
References:
1. http://docs.splunk.com/Documentation/Splunk/6.4.3/Indexer/indexerdiscovery#Use_indexer_discovery_in_...
"Important: When you use indexer discovery with multisite clustering, you must assign a site-id to all forwarders, whether or not you want the forwarders to be site-aware. If you want a forwarder to be site-aware, you assign it a site-id for a site in the cluster, such as "site1," "site2," and so on. If you do not want a forwarder to be site-aware, you assign it the special site-id of "site0". When a forwarder is assigned "site0", it will forward to peers across all sites in the cluster."
Wow, I am humbled to be so opinionated and yet so wrong. Still, I think that 14K files is a lot, and I am not sure why ignoreOlderThan = 2d wasn't working for you.
Could you be hitting the 256 KBps limit on the universal forwarder? The forwarder limits its use of the network to 256 KBps to avoid saturating the network on a production machine. You can change this by editing etc/system/local/limits.conf on the forwarder:
[thruput]
maxKBps = 0
# 0 means unlimited
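Before changing it, you can check whether the forwarder is actually pinned near the cap by looking at the thruput samples in its metrics.log; if instantaneous_kbps hovers around 256, the limit is the bottleneck. A quick sketch (the install path is an assumption; adjust SPLUNK_HOME to yours):

```shell
# Assumed default universal forwarder install path -- adjust if different.
SPLUNK_HOME=/opt/splunkforwarder
LOG="$SPLUNK_HOME/var/log/splunk/metrics.log"
# Show the most recent thruput samples, if the log exists.
[ -f "$LOG" ] && grep "group=thruput, name=thruput" "$LOG" | tail -5
```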
If you continue to have problems, a call to Splunk Support might be next. You have certainly done your homework!
If you are monitoring anywhere near 14,000 files on a forwarder - I'll bet that this is your problem. You can increase the file descriptors, etc. but you will probably still have performance issues. A ten minute delay in indexing is actually pretty darn good considering the work that Splunk is doing. I'll bet that the forwarder is consuming more CPU and memory than it should, too.
Even if only a portion of these files are actively being updated, Splunk will monitor ALL of them. This means that Splunk will examine the mod time of each file in round-robin fashion, over and over again, even though nothing has changed (and maybe never will), because Splunk can't know which files will or won't be updated.
This is obviously a huge waste of machine time if most of the files are not being updated. Here are some steps that you could take:
1. Set ignoreOlderThan = <time window> in inputs.conf - but BE CAREFUL. ignoreOlderThan causes the monitored input to stop checking a file for updates once its modtime passes the threshold. So if you set it to 14d, you can't ever add a file older than 2 weeks to the directory. (Well, you can, but Splunk will ignore it.)
2. If you must monitor this many files, consider installing 2 copies of the forwarder and splitting the monitoring between them by assigning them different directories. I would try to keep the total number of files monitored by one forwarder under 5,000 if possible.
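For reference, the ignoreOlderThan approach might look like this in inputs.conf (the monitored path here is a made-up example, not from the thread):

```
# inputs.conf on the forwarder -- hypothetical path
[monitor:///var/log/remote]
ignoreOlderThan = 14d
```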
OK. This still doesn't work.
There are now <1000 files monitored, and the parsingqueue is mostly full (~200 MB). CPU usage is under 20% and Splunk is hardly using it. Why can't it keep up?
Thanks for the tips. Ideally this server (the raw syslog server) will keep a full set of raw logs, so I don't really want to delete them.
I have already got ignoreOlderThan = 2d, BUT it is interesting to note that the file list in "list monitor" contains all the entries, including files from several days ago.
There are ~260 logs each for today, yesterday, and the day before, so approximately 800 logs should be monitored if it honours ignoreOlderThan. I guess it still scans them all to check whether they are older than the threshold...
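One way to sanity-check that count against the filesystem is to count files by modtime directly (the directory path is a guess; substitute your actual syslog directory):

```shell
# Hypothetical syslog directory -- substitute the real path.
LOGDIR=/var/log/remote
# Count files modified within the last 3 days, i.e. roughly what
# ignoreOlderThan = 2d should leave as actively monitored.
[ -d "$LOGDIR" ] && find "$LOGDIR" -type f -mtime -3 | wc -l
```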
Might have to go with option 2 then.
On the indexer I do not see any blocked=true.
On the forwarder there are a couple of entries over several days, but the numbers look odd: Metrics - group=queue, name=parsingqueue, blocked=true, max_size_kb=102400, current_size_kb=102399, current_size=1682, largest_size=1689, smallest_size=1662
On forwarder:
splunk list monitor | wc -l
14264
(I had to raise the ulimit on the OS and increase max_fd in limits.conf on the Splunk forwarder.)
Also, how many files is the forwarder monitoring? On the forwarder, run this command:
splunk list monitor
Out of interest, does blocked=true appear anywhere in the metrics.log on the indexer or forwarder?
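A quick way to check, assuming default install paths (/opt/splunkforwarder on the forwarder, /opt/splunk on the indexer -- adjust to yours):

```shell
# Path assumes a default forwarder install; on the indexer use
# /opt/splunk/var/log/splunk/metrics.log instead.
LOG=/opt/splunkforwarder/var/log/splunk/metrics.log
# Count blocked=true occurrences per queue name.
[ -f "$LOG" ] && grep "blocked=true" "$LOG" | grep -o "name=[^,]*" | sort | uniq -c
```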