Good afternoon Splunk team, please could you help us with this?
We have this scenario: Splunk has been constantly logging our 60 events per hour, but starting on November 5th we are missing events:
We are logging these events through a Universal Forwarder.
This is the log that we are trying to forward. As you can see, there is one event per minute.
But if we search for this log's events in Splunk, we see that events are missing.
We are afraid that we might be missing a lot of events that could be real errors happening in production, so this should be treated as a critical issue.
This suddenly started working normally again last week (11/14/2016). We don't know what the issue could have been; probably some Windows Update that was messing with the network. We really appreciate your help, guys; I'm sorry we couldn't find the exact issue.
@TLAZO Did you get a chance to check splunkd.log and metrics.log on the days when the events were missing? It is quite possible that one of the queues on the forwarder was full and it was not able to send the data down, in which case you would see blocked=true messages in metrics.log. By default the queue sizes are 256KB; if we are noticing issues with the size, we may want to bump the queue size up to a couple of MB, or even 1GB. Even if we configure it as 1GB, it would only use as much as needed, and it can be reset to a lower value once all the data has been sent out.
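For example, a search like this should surface any blocked queues reported by the forwarder (the host value is a placeholder, and the exact source path can vary by install):

    index=_internal host=<your-uf> source=*metrics.log* group=queue blocked=true
    | timechart span=1h count by name

And the queue bump would go on the forwarder, e.g. in outputs.conf (the group name here is a placeholder for your own):

    [tcpout:my_indexers]
    maxQueueSize = 512MB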
How does your log rotation work? Is it a truncate-and-copy, a move, etc.? Also, what does your monitor stanza look like?
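For comparison, a typical monitor stanza in inputs.conf on the UF looks something like this (the path, index, and sourcetype here are made-up examples):

    [monitor://D:\Logs\myapp\app.log]
    disabled = false
    index = main
    sourcetype = myapp_log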
Have you checked RAM, CPU and disk activity on the UF box?
What is the date and time of the first missed item? Does that coincide, or nearly coincide, with any other happenings, either on the servers or the network? Is it a "special" time, like it was fine before midnight but skipped its first event at 2 minutes after midnight? Or 11:00 PM?
While I have no idea how this would still be happening now, does it appear that it could be daylight-savings related? If you search All Time just to see the relative amounts of events over the long term, can you spot these events possibly out of order (like they're in 2017 or something, or maybe added to 2015's data)?
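A quick way to eyeball that is an all-time timechart, where misdated events would show up as counts in the wrong year (index and sourcetype are placeholders):

    index=main sourcetype=myapp_log earliest=0
    | timechart span=1d count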
Lastly, do you have something unique (or nearly so) you can search that comes from inside one of the missing events? Maybe if you can search for that you'll find that event, and perhaps that will lead you to find the rest of them.
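Something along these lines, searching all indexes over all time for a token you'd only expect inside one of the missing events (the token is obviously just an example):

    index=* earliest=0 "some-unique-token-from-a-missing-event"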
>> Have you checked RAM, CPU and disk activity on the UF box?
RAM, CPU, and disk activity are fine.
>> What is the date and time of the first missed item? Does that coincide, or nearly coincide, with any other happenings, either on the servers or the network? Is it a "special" time, like it was fine before midnight but skipped its first event at 2 minutes after midnight? Or 11:00 PM?
Nov 5th midnight.
>> While I have no idea how this would still be happening now, does it appear that it could be daylight-savings related? If you search All Time just to see the relative amounts of events over the long term, can you spot these events possibly out of order (like they're in 2017 or something, or maybe added to 2015's data)?
It's not related to daylight savings; that was our first thought. This started on November 5th, two days before the daylight saving change.
>> Lastly, do you have something unique (or nearly so) you can search that comes from inside one of the missing events? Maybe if you can search for that you'll find that event, and perhaps that will lead you to find the rest of them.
I did, and the line I looked for was not registered in Splunk.
All right, you definitely have a non-obvious problem then.
Though I changed my clocks on the evening of Saturday, November 5th, because the time changed an hour between then and Sunday, November 6th. 🙂 Still, you may be somewhere it changed on a different date, and in any case I don't think this is the problem.
Have you checked splunkd.log on your forwarder (and possibly the other logs) in your $SPLUNK_HOME/var/log/splunk folder for errors or warnings? Failing that, can you run tcpdump host <myhost>
and make sure you see the packets from it, even including ones that "don't make it"? If it were me, I'd run a real-time search over the past 5 minutes targeting just those logs and compare it with the TCP traffic. Indeed, you could temporarily disable all but that input on the UF for a few minutes and get an easy correlation (or not) between the two.
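A sketch of the capture, assuming the default receiving port 9997 (interface and host are placeholders for your own):

    tcpdump -nn -i eth0 host <myhost> and port 9997

On the Splunk side, the comparison search could be as simple as scoping to that one source over a real-time window (index and path are again placeholders):

    index=main source="D:\Logs\myapp\app.log" earliest=rt-5m latest=rt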
At this point I'd be telling a customer/client "Well, we're still trying to find out what the problem is. Once we actually find out specifically what's going wrong it'll probably only take a few more minutes to fix it, but the troubleshooting process can get a bit involved."
I've also tried setting useACK=true in the inputs.conf file. No luck.
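For anyone trying the same thing: useACK is normally enabled in outputs.conf on the forwarder rather than inputs.conf. A minimal sketch, with the group name and indexer address as placeholders:

    [tcpout:my_indexers]
    server = <indexer>:9997
    useACK = true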
Are you sure the forwarder was running during the time of the missing events?