AD Universal Forwarder stops forwarding

skalliger · ‎02-09-2017

Hi Splunkers,

We ran into a problem with our Universal Forwarder (version 6.5.0), which collects event logs from our root DC in the testing environment.
We had several issues but have narrowed them down to one: our forwarder stops forwarding Windows Security event log data. _internal data is coming in just fine.

We have read through many threads here and found no solution for this.
First of all, the latest inputs.conf:

[WinEventLog://Security]
disabled = 0
index = t_active_directory_60
sourcetype = windows_security
batch_size = 20
start_from = newest
evt_dc_name = xyz
evt_resolve_ad_obj = 0
checkpointInterval = 60

We tried different things here. Setting batch_size makes no difference. With evt_resolve_ad_obj set to 1, no data is sent at all (not even _internal).

Then, today, we finally got an interesting error we've never seen before:

02-09-2017 13:42:14.676 +0100 ERROR ExecProcessor - message from ""C:\Program Files\SplunkUniversalForwarder\bin\splunk-winevtlog.exe"" splunk-winevtlog - WinEventLogChannel::queryEvtChannel: Unable to set seek position to the given bookmark

And this one keeps coming up every time we restart the forwarder:

02-09-2017 13:41:54.231 +0100 ERROR Metrics - Metric with name thruput:idxSummary already registered

Also, we saw the following warning for the first time today:

02-09-2017 13:44:14.197 +0100 WARN TcpOutputProc - Pipeline data does not have indexKey. [_path] = C:\Program Files\SplunkUniversalForwarder\bin\splunk-winevtlog.exe\n[_raw] = \n[_stmid] = Pv7LDc2XW3JCugFC\n[MetaData:Source] = source::WinEventLog\n[MetaData:Host] = host::XYZ\n[MetaData:Sourcetype] = sourcetype::WinEventLog\n[_done] = _done\n[_conf] = source::WinEventLog|host::XYZ|WinEventLog|\n

Does anyone have any ideas on this one?

Our outputs.conf for reference:

[tcpout]
indexAndForward = false
defaultGroup = HEAVY_FORWARDER

[tcpout:HEAVY_FORWARDER]
server = HEAVY_FORWARDER:9997
sendCookedData = true
sslPassword = ...
clientCert = C:\Program Files\SplunkUniversalForwarder\etc\auth\abc.pem
sslRootCAPath = C:\Program Files\SplunkUniversalForwarder\etc\auth\abc.pem
sslVerifyServerCert = true
useClientSSLCompression = true
useACK = true

Also, a funny side note: useACK should have no effect here. But as soon as we set useACK to false, we get duplicate Windows Security events (the same record numbers three times). Setting sendCookedData to false also results in no data being sent at all.

Any help is appreciated.

Skalli

woodcock · ‎01-11-2019

It looks to me like there is a zombie Splunk process running. I would stop Splunk in the process manager, then manually kill any Splunk processes that you still find in Task Manager, then restart Splunk.

bstimely · ‎01-11-2019

I had a similar issue and found that the following had to be done:
Increase the TCP input queue on the indexers.
Increase the thruput setting on the UF.
Increase the TCP output queue on the UF.
Check for any other blocked queues in your deployment.
Check _indextime vs. _time for events and make sure the difference is a small, steady number of seconds.
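As a rough illustration of where the two queue settings in the list above live, the stanzas below are a sketch only. The sizes are placeholder values, not tested recommendations, and port 9997 is assumed from the outputs.conf posted earlier in the thread. In that setup the receiver is a heavy forwarder, so the first stanza would go there rather than on an indexer.

```ini
# inputs.conf on the receiving instance (indexer or heavy forwarder)
# queueSize sets the in-memory input queue for this receiving port.
# 10MB is an illustrative value.
[splunktcp://9997]
queueSize = 10MB

# outputs.conf on the UF
# maxQueueSize sets the forwarder's in-memory output queue.
# 10MB is an illustrative value.
[tcpout]
maxQueueSize = 10MB
```

Both instances need a restart before the new sizes take effect.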

You will also have to make sure the DC itself has enough performance headroom. If the DC is virtual, look at the CPU co-stop value to see whether your system is really getting CPU time scheduled.

I asked our Splunk rep whether parallelIngestionPipelines would help in this case, since all of the events are coming from one source, WinEventLog://Security. No answer yet.
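For reference, the setting is enabled in server.conf as sketched below. Whether it helps here is not established in this thread: a single input stream is generally handled by one pipeline set, so with only WinEventLog://Security feeding the forwarder, a second pipeline may bring little benefit. The value 2 is illustrative.

```ini
# server.conf on the forwarder
# Assumption: 2 is an example value, not a recommendation.
[general]
parallelIngestionPipelines = 2
```

Note that the maxKBps thruput limit applies per pipeline, so the total cap scales with the number of pipelines.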

koshyk · ‎02-09-2017

I wouldn't change the sourcetype on the UF, as the correct sourcetype will be assigned by the Windows TA on your indexer.
Can you try something like this?

[WinEventLog://Security]
disabled = 0
start_from = oldest
current_only = 0
evt_resolve_ad_obj = 1
checkpointInterval = 10
blacklist1 = EventCode="4662" Message="Object Type:\s+(?!groupPolicyContainer)"
blacklist2 = EventCode="566" Message="Object Type:\s+(?!groupPolicyContainer)"
index = t_active_directory_60
renderXml = false

koshyk · ‎02-09-2017

Which version of Windows is AD running on? I hope it is Windows Server 2012 or later, as 2008 support is gone for this version.

skalliger · ‎02-10-2017

Hi,

Sorry for the late answer, and thanks for your comments so far. Yes, we are using Windows Server 2012.

We have not modified limits.conf yet, but we will try that if we run into this issue again. For now, we have completely uninstalled the 6.5 forwarder on the root DC and installed a 6.4 forwarder on another DC, and there are no issues so far (without tuning any settings in limits.conf).

We have only modified the checkpointInterval because it was suggested in another thread. With our working installation, it is back to the standard value now.

However, thanks for the suggestions. With our root DC getting a fresh installation next week (which gets more events than the other DC), we will try to tune the settings in limits.conf if we run into those problems again.

Automatic event log backups should not be a problem; as far as I've seen, they don't run that often.

I will post an update next week if the problems are gone then.

Edit: And yes, the forwarder stops collecting event log data completely. It resends the data as soon as it is restarted.

Skalli

reedmohn · ‎03-07-2017

We're seeing this problem for a few of our servers in remote locations. Did you manage to resolve this with 6.5?

koshyk · ‎02-11-2017

Sure. Waiting for your update.

mattymo · ‎02-09-2017

Is it possible the AD logs are rolling off the server before Splunk reads them fully? What is the log retention like on your test AD?

Have you tuned the thruput limits on the forwarder? Generally, you need to ensure the forwarder can keep up with a busy machine. Make sure to raise this value in limits.conf: the UF defaults to 256 KBps, and you will certainly need something higher on AD. Maybe start with 1024? From limits.conf.spec:

[thruput]
maxKBps = <integer>
* If specified and not zero, this limits the speed through the thruput processor 
  in the ingestion pipeline to the specified rate in kilobytes per second.
* To control the CPU load while indexing, use this to throttle the number of
  events this indexer processes to the rate (in KBps) you specify.
* Note that this limit will be applied per ingestion pipeline. For more information 
  about multiple ingestion pipelines see parallelIngestionPipelines in the
  server.conf.spec file.
* With N parallel ingestion pipelines the thruput limit across all of the ingestion 
  pipelines will be N * maxKBps.
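A concrete stanza for the suggestion above might look like the following sketch. The file location is the usual local override path and is an assumption about this installation; 1024 is only the suggested starting point, not a measured requirement.

```ini
# %SPLUNK_HOME%\etc\system\local\limits.conf on the forwarder
[thruput]
# Raises the UF default of 256 KBps. A value of 0 removes the limit.
maxKBps = 1024
```

The forwarder needs a restart for the new limit to apply.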

Also, I see you changed the default checkpoint interval. What was the idea behind that? From inputs.conf.spec:

checkpointInterval = <integer>
* How often, in seconds, that the Windows Event Log input saves a checkpoint.
* Checkpoints store the eventID of acquired events. This lets the input
  continue monitoring at the correct event after a shutdown or outage.
* The default value is 5.

Also, when you say it stops, does it stop completely, or are there gaps in the collection?

- MattyMo
