OPSEC LEA R80 logging behind

mmoermans · ‎11-16-2017

Ever since the upgrade to R80, the logs from the OPSEC LEA app have been behind by about an hour (ranging from 30 to 90 minutes throughout the day). What could be the cause of this? Before the upgrade they were always indexed within seconds.

The opseclea:log:modinput logs don't show any errors so it's hard to pinpoint the issue.

Action01 · ‎06-08-2018

Hi,

We had the same problem, but with latencies ranging from minutes up to 15 hours, depending on the traffic load on the Check Point (this was pre-upgrade, on R77.30). It seemed that above 41000 events per minute we experienced a build-up of latency. Performance and resource usage of the HF, the indexers, and the Check Point management server were all fine: no obvious culprit, and no errors or warnings in opseclea:log:modinput.
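To illustrate why latency keeps climbing once the event rate passes what the input can pull, here is a rough back-of-the-envelope sketch. It is not taken from the add-on. It treats the 41000 events per minute observed above as a fixed capacity, and the 50000 events per minute arrival rate is a made-up number for the example.

```python
def simulate_backlog(arrival_epm, capacity_epm, minutes):
    """Estimate latency (in minutes) at the end of each minute.

    arrival_epm  -- events generated per minute (hypothetical input)
    capacity_epm -- events the input can retrieve per minute (assumed constant)
    """
    backlog = 0.0
    latencies = []
    for _ in range(minutes):
        # Events that could not be retrieved this minute carry over.
        backlog = max(0.0, backlog + arrival_epm - capacity_epm)
        # Time needed to drain the backlog at full capacity.
        latencies.append(backlog / capacity_epm)
    return latencies


# One hour at 50000 epm against an assumed capacity of 41000 epm.
latency_after_hour = simulate_backlog(50000, 41000, 60)[-1]
print(f"estimated latency after 60 minutes: {latency_after_hour:.1f} minutes")
```

As long as the arrival rate stays above capacity the backlog never drains, which would match latency that keeps growing under sustained load and only recovers when traffic drops.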

We have NMON running on the HF, so inspecting the CPU and memory usage was easy. Splunkd didn't do much (0.35 CPU cores on average), and lea_loggrabber used close to nothing (0.03). Python, however, used 1.6-1.7 CPU cores continuously. This was the maximum that I observed, and it always occurred above 41000 events per minute. Below 41000 epm the process used fewer resources.

After the upgrade to R80.10 it got even worse with latency running up to 22 hours and climbing.

(I think) I managed to solve this by just setting the log level to INFO, instead of DEBUG (which I assumed was necessary for "debugging" this problem...). The debugging resulted in half a million events per minute of _internal debug logging...

After changing this level, and setting the starttime on each input to a couple of hours earlier (thus skipping most of the 22 hours of latency), the CPU usage of python was only 0.7 CPU core, while fw1_loggrabber and splunkd spiked to levels not seen before (2 CPU cores each). Around the same moment the Metrics log reported that it indexed 1.3 million events per minute for a couple of minutes. It seems that the DEBUG log level (very) negatively impacted the maximum number of events that python/lea_loggrabber could retrieve and send to Splunk.

Some time later the resource usage came down; splunkd and lea_loggrabber both run at 0.1 core, python at 0.03. That is at around 70000 events per minute.

I'll be monitoring closely what happens under more load, but for now it seems all right.

Action01

dominiquevocat · ‎11-29-2018

The logs also have many new fields, so the size of an event is about 3x larger. On top of that, you cannot deselect the fields, since they are not exposed in the inputs definition of the app, and I have failed to blacklist the superfluous fields. 😕

hatalla · ‎03-27-2018

Hey mmoermans,

Did you figure out a solution for this time gap between _time and indextime? We are having the same issue, where an event's _time can lag its indextime by anywhere from a few minutes to up to 3 hours. I reduced the polling interval to 300 seconds (5 minutes) with no luck; we are still seeing the time gap. We are also using R80.

Thanks.
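For anyone who wants to put numbers on the gap, a minimal sketch of the calculation follows. It assumes you have exported some events with both `_time` and `_indextime` as epoch seconds; the `lag_summary` helper and the sample values are made up for illustration.

```python
from statistics import median


def lag_summary(events):
    """Summarise indexing lag (in seconds) as _indextime minus _time."""
    lags = [float(e["_indextime"]) - float(e["_time"]) for e in events]
    if not lags:
        raise ValueError("no events to summarise")
    return {"min": min(lags), "median": median(lags), "max": max(lags)}


# Hypothetical sample: lags of 5 minutes, 1 hour and 3 hours.
sample = [
    {"_time": 1522140000, "_indextime": 1522140300},
    {"_time": 1522140060, "_indextime": 1522143660},
    {"_time": 1522140120, "_indextime": 1522150920},
]
summary = lag_summary(sample)
print(summary)
```

Tracking the median and maximum over the day alongside the events-per-minute rate should show whether the gap follows load, as described earlier in the thread.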
