Getting Data In

How to troubleshoot why our universal forwarder is not sending all events after a certain date?

TLAZO
Explorer

Good afternoon Splunk team, please could you help us with this?
We have this scenario: Splunk had been consistently logging our 60 events per hour, but since November 5th we have been missing events.

We are logging these events through a Universal Forwarder.
This is the log that we are trying to forward; it contains one event per minute.

But, if we search for this log's events in Splunk, we see that there are missing events.
We are afraid that we might be missing a lot of events that could be potential errors happening in production, so this should be treated as a critical issue.

TLAZO
Explorer

This suddenly started working normally last week (11/14/2016). We don't know what the issue was; possibly a Windows update that was interfering with the network. I really appreciate your help, guys, and I'm sorry I couldn't find the exact cause.

nekbote
Path Finder

@TLAZO Did you get a chance to check splunkd.log and metrics.log for the days when the events were missing? It is quite possible that one of the queues on the forwarder was full and could not send the data on, in which case you would see blocked=true messages in metrics.log. By default the queue sizes are 256 KB. If the size turns out to be the problem, you may want to raise it to a couple of MB, or even 1 GB. Even when configured as 1 GB, the queue only uses as much as it needs, and the setting can be lowered again once all the data has been sent.
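As a rough sketch of the metrics.log check described above, the following Python snippet counts blocked=true queue lines per queue name. The sample lines are illustrative and only approximate the usual group=queue layout, and the function name count_blocked_queues is hypothetical, not part of Splunk.

```python
import re
from collections import Counter

# Pulls the queue name out of a "group=queue, name=..., ..." metrics line.
QUEUE_NAME = re.compile(r"\bname=(\w+)")

def count_blocked_queues(lines):
    """Count lines that report blocked=true, grouped by queue name."""
    counts = Counter()
    for line in lines:
        if "group=queue" in line and "blocked=true" in line:
            match = QUEUE_NAME.search(line)
            counts[match.group(1) if match else "unknown"] += 1
    return counts

# Illustrative sample lines (assumed format, not taken from the thread).
sample = [
    "11-05-2016 00:01:31.000 -0600 INFO  Metrics - group=queue, "
    "name=parsingqueue, blocked=true, max_size_kb=512, current_size_kb=511",
    "11-05-2016 00:01:31.000 -0600 INFO  Metrics - group=queue, "
    "name=tcpout_primary, max_size_kb=500, current_size_kb=0",
    "11-05-2016 00:02:02.000 -0600 INFO  Metrics - group=queue, "
    "name=parsingqueue, blocked=true, max_size_kb=512, current_size_kb=511",
]

blocked = count_blocked_queues(sample)
for queue, hits in blocked.items():
    print(f"{queue}: {hits} blocked samples")
```

To run it against a real forwarder, pass an open file handle for metrics.log in $SPLUNK_HOME/var/log/splunk instead of the sample list.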

bmacias84
Champion

How does your log rotation work? Is it truncate-and-copy, move, etc.? Also, what does your monitor stanza look like?

Richfez
SplunkTrust

Have you checked RAM, CPU and disk activity on the UF box?

What is the date and time of the first item missed? Does that coincide or nearly coincide with any other happenings, either on servers or the network? Is it a "special" time, like it was fine before midnight but skipped its first event at 2 minutes after midnight? Or 11:00 PM?

While I have no idea how this would still be happening now, does it appear like it could be daylight savings related? If you search all time just to see the relative amounts of events over the long term, can you spot these events possibly out of order (like they're in 2017 or something, or maybe added to 2015's data?)

Lastly, do you have something unique (or nearly so) you can search that comes from inside one of the missing events? Maybe if you can search for that you'll find that event, and perhaps that will lead you to find the rest of them.
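One way to narrow this down is to list exactly which minutes are present in the source log but absent from the Splunk results. Below is a minimal sketch, assuming one event per minute and timestamps already parsed into datetime objects; missing_minutes is a hypothetical helper and the sample data is made up for illustration.

```python
from datetime import datetime, timedelta

def missing_minutes(source_times, indexed_times):
    """Return the minutes seen in the source log but not in the indexed set."""
    def to_minute(ts):
        # Drop seconds so events are compared at one-minute granularity.
        return ts.replace(second=0, microsecond=0)

    indexed = {to_minute(t) for t in indexed_times}
    return sorted({to_minute(t) for t in source_times} - indexed)

# Made-up example: five source events, two of which never reached the index.
start = datetime(2016, 11, 5, 0, 0)
source = [start + timedelta(minutes=i) for i in range(5)]
indexed = [source[0], source[1], source[4]]

gaps = missing_minutes(source, indexed)
for minute in gaps:
    print("missing:", minute.isoformat())
```

If the gaps cluster around fixed times (for example, right after a log rotation), that points at the input side rather than the network.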

TLAZO
Explorer

>> Have you checked RAM, CPU and disk activity on the UF box?
RAM, CPU and Disk Activity are fine

>> What is the date and time of the first item missed? Does that coincide or nearly coincide with any other happenings, either on servers or the network? Is it a "special" time, like it was fine before midnight but skipped its first event at 2 minutes after midnight? Or 11:00 PM?
Nov 5th midnight.

>> While I have no idea how this would still be happening now, does it appear like it could be daylight savings related? If you search all time just to see the relative amounts of events over the long term, can you spot these events possibly out of order (like they're in 2017 or something, or maybe added to 2015's data?)
It's not related to daylight savings. That was our first thought. This started on November 5th, two days before the daylight saving change.

>> Lastly, do you have something unique (or nearly so) you can search that comes from inside one of the missing events? Maybe if you can search for that you'll find that event, and perhaps that will lead you to find the rest of them.
I did, and the line I looked for was not registered in Splunk.

Richfez
SplunkTrust

All right, you definitely have a non-obvious problem then.

Though I changed my clocks on the evening of Saturday, November 5th, because between then and Sunday, November 6th the time changed by an hour. 🙂 Still, you may be somewhere where it changed on a different date, and in any case I don't think this is the problem.

Have you checked splunkd.log on your forwarder (and possibly other logs) in your $SPLUNK_HOME/var/log/splunk folder for errors or warnings? If you find nothing there, can you run tcpdump host <myhost> and make sure you see the packets from it, even including ones that "don't make it"? If it were me, I'd run a real-time search over the past 5 minutes targeting just those logs, and compare it with the TCP traffic. Indeed, you could temporarily disable all but that input on the UF for a few minutes and get an easy correlation (or not) between the two.
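For the splunkd.log check, a quick tally of warnings and errors by component can show where to look first. This is only a sketch: it assumes each line starts with a three-token timestamp followed by the level, the component, and " - ", and the sample lines and the name summarize_problems are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout: "<date> <time> <tz> <LEVEL> <component> - <message>".
LOG_LINE = re.compile(
    r"^\S+ \S+ \S+ (?P<level>WARN|ERROR)\s+(?P<component>\S+) - "
)

def summarize_problems(lines):
    """Count WARN and ERROR lines, grouped by (level, component)."""
    summary = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            summary[(match["level"], match["component"])] += 1
    return summary

# Made-up sample lines (assumed format, not taken from the thread).
sample = [
    "11-05-2016 00:00:03.123 -0600 WARN  TcpOutputProc - connection timed out",
    "11-05-2016 00:00:04.456 -0600 INFO  TailReader - finished reading file",
    "11-05-2016 00:00:09.789 -0600 ERROR TcpOutputFd - connection failed",
    "11-05-2016 00:01:03.123 -0600 WARN  TcpOutputProc - connection timed out",
]

summary = summarize_problems(sample)
for (level, component), hits in summary.most_common():
    print(f"{level:5} {component}: {hits}")
```

Components that only start appearing around the time of the first missing event are the ones worth reading in full.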

At this point I'd be telling a customer/client "Well, we're still trying to find out what the problem is. Once we actually find out specifically what's going wrong it'll probably only take a few more minutes to fix it, but the troubleshooting process can get a bit involved."

TLAZO
Explorer

I've also tried setting useACK=true in the inputs.conf file. No luck.

richgalloway
SplunkTrust

Are you sure the forwarder was running during the time of the missing events?
