This response doesn't offer exact answers to the questions posed in the original post (we didn't use syslog-ng). Instead, it shares our approach to, and experience with, overcoming UDP packet drops.
My understanding of the architecture when sending syslog messages directly to a Splunk UDP collector (source) is:
Kernel UDP mem limits -> Kernel UDP socket buffer size -> Splunk receive queue (in memory) + Splunk persistent queue (on disk) -> Splunk indexer/forwarder
In our case, UDP drops occurred due to:
The kernel network buffers filled up. We ran Ubuntu Desktop as the base OS, so the default buffer values were tuned for desktop use rather than for a server handling high-volume UDP traffic.
When Splunk creates the UDP socket, the default receive buffer is 1.5MB. So even if your OS kernel config allows for larger buffers, Splunk sets a smaller UDP buffer for its socket (see the ss check after this list).
Logging bursts filled Splunk's receive queue for UDP sources when the queue was too small.
The Splunk indexer/forwarder fell behind due to system load, causing the UDP receiving queues to be filled.
Our collector and indexer were on the same system and complex searches caused load issues.
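As an aside, the buffer actually attached to Splunk's socket can be inspected with ss from iproute2 (the port below assumes the UDP input configured later in this post; the "rb" value in the skmem column is the receive buffer size in bytes):
# show socket memory details for the UDP listener on port 9000 (root may be needed for -p)
ss -ulmpn 'sport = :9000'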
We resolved our UDP drops in 5 ways:
Increasing Linux kernel network buffers (similar to posts above)
Increasing Splunk's queue size for UDP sources in inputs.conf
Enabling Splunk's persistent queue functionality in inputs.conf
Re-prioritising the Splunk collector process, which helped deal with load caused by searches
Turning off unnecessary/redundant messages at the source or, failing that, applying a filter on the source feed so that the indexer wasn't burdened with extra data.
(1) Increasing Linux kernel network buffers
Increase the UDP receive buffer size from 128K to 32MB
sysctl -w net.core.rmem_max=33554432
Increase other memory management options that moderate how much memory UDP buffers may use. E.g., the defaults on a system with 2GB of RAM:
net.ipv4.udp_mem = 192576 256768 385152
net.ipv4.udp_rmem_min = 4096
sysctl -w net.ipv4.udp_mem='262144 327680 393216'
Note that net.ipv4.udp_mem works in pages, so multiply the values by PAGE_SIZE, where PAGE_SIZE = 4096 (4K). The maximum udp_mem is then 385152 * 4096 = 1,577,582,592 bytes (~1.5GB).
Increase the queue size for incoming packets. On Ubuntu, the default appeared to be 1000.
sysctl -w net.core.netdev_max_backlog=2000
As per the posts above, make the changes persistent using sysctl.conf.
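For completeness, a sketch of the /etc/sysctl.conf entries using the values from above (tune these to your own memory size):
# UDP tuning for the Splunk collector
net.core.rmem_max = 33554432
net.ipv4.udp_mem = 262144 327680 393216
net.core.netdev_max_backlog = 2000
Run sysctl -p afterwards to load the file without a reboot.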
(2) Increasing Splunk's queue size for UDP sources & (3) Enabling Splunk's persistent queue functionality
Example from inputs.conf
[udp://9000]
connection_host = ip
index = main
# _rcvbuf defaults to 1572864 (1.5MB)
_rcvbuf = 16777216
# queueSize defaults to 500KB
queueSize = 16MB
# persistentQueueSize defaults to 0 (no persistent queue)
persistentQueueSize = 128MB
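After restarting Splunk, the effective settings can be double-checked with btool (the install path below matches our setup in the script further down; adjust to yours):
# list the effective UDP input settings and which file each one comes from
/home/splunk/splunk/bin/splunk btool inputs list udp://9000 --debug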
(4) Re-prioritising the Splunk collector process
We increased the CPU and IO priorities of the 'collector' process so that searches would not impact data collection. Typically, a high-volume Splunk deployment would not have this issue, as a separate search head would be used.
Here's a little bash script I wrote to do the work:
#!/bin/bash
# get PID of the splunk collector process
splunkd_pid="$(head -n 1 /home/splunk/splunk/var/run/splunk/splunkd.pid)"
echo "Splunk collector PID is $splunkd_pid"
# show current processes and listening ports
echo "Splunk ports:"
netstat -nlp4 | grep splunk
echo
# re-nice process to higher priority
splunkd_was_nice="$(ps -p $splunkd_pid --no-headers -o nice)"
splunkd_was_nice="$(expr match "$splunkd_was_nice" '[ ]*\([-0-9]*\)')"
if renice -n -5 -p $splunkd_pid; then
echo "Changed splunkd CPU priority from nice $splunkd_was_nice to -5 (meaner)"
else
echo "ERROR: failed to renice process" 1>&2
exit 1
fi
# re-nice IO priority
splunkd_io_was_nice="$(ionice -p $splunkd_pid)"
splunkd_io_was_nice_class="$(expr match "$splunkd_io_was_nice" '^\([a-zA-Z-]*\):')"
splunkd_io_was_nice_pri="$(expr match "$splunkd_io_was_nice" '.*: prio \([0-9]\)$')"
if ionice -c2 -n1 -p $splunkd_pid; then
echo "Changed splunkd IO class from $splunkd_io_was_nice_class to best-effort"
echo "Changed splunkd IO priority from $splunkd_io_was_nice_pri to 1"
else
echo "ERROR: failed to renice IO for process" 1>&2
exit 1
fi
echo
echo "Splunk collector prioritisation complete"
echo
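Note that the nice/ionice settings apply to the running PID only, so the script needs to be re-run whenever splunkd restarts. Something like the following would chain the two (the script path is hypothetical):
# restart Splunk, then re-apply CPU/IO priorities
/home/splunk/splunk/bin/splunk restart && /home/splunk/scripts/prioritise_splunk.sh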
(5) Apply a filter to the source to remove unwanted input
In our case, we wanted to filter out specific info-level messages, but not all info-level messages; the administrators were unable to selectively control which info-level messages were sent. Our example is provided below.
Specify a transform to enable filtering in props.conf
[udp:9000]
TRANSFORMS-filter=filter_dp
Specify the filter in transforms.conf
[filter_dp]
REGEX=DataPower (\[[^\]]+\]){2}\[info].+?: rule \(
DEST_KEY = queue
FORMAT = nullQueue
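Before deploying the transform, the regex can be sanity-checked against a sample of raw events using PCRE grep (the sample file path is hypothetical):
# print the events that would be routed to nullQueue
grep -P 'DataPower (\[[^\]]+\]){2}\[info].+?: rule \(' /tmp/datapower_sample.log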
Monitoring for UDP errors
We created a cron job to append the output of netstat to a file every 5 minutes:
crontab -l
# m h dom mon dow command
*/5 * * * * netstat -su >> /tmp/udp_status.txt
We then indexed /tmp/udp_status.txt and created a view to graph the errors.
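The inputs.conf monitor stanza for this looks something like the following (the sourcetype matches the searches below):
[monitor:///tmp/udp_status.txt]
sourcetype = udp_stat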
The delta command came in handy to compare time intervals.
To timechart the number of UDP packets received:
sourcetype=udp_stat | kv pairdelim="\n", kvdelim=": ", auto=f | rex "(?<p_rec>\w+)\spackets received" | delta p_rec as p_rec_delta | eval p_rec_delta=abs(p_rec_delta) | timechart span=5m sum(p_rec_delta)
To timechart the number of UDP packets dropped due to buffer errors (every 5 minutes):
sourcetype=udp_stat | kv pairdelim="\n", kvdelim=": ", auto=f | rex "RcvbufErrors: (?<rec_buf_err>\d+)" | delta rec_buf_err as rec_buf_err_delta | eval rec_buf_err_delta=abs(rec_buf_err_delta) | timechart span=5m sum(rec_buf_err_delta)
To timechart the number of UDP packets discarded due to the application not taking them from the buffer:
sourcetype=udp_stat | kv pairdelim="\n", kvdelim=": ", auto=f | rex "(?<rec_app_err>\d+) packets to unknown port received." | delta rec_app_err as rec_app_err_delta | eval rec_app_err_delta=abs(rec_app_err_delta) | timechart span=5m sum(rec_app_err_delta)
To tabulate deltas for the UDP stats:
sourcetype=udp_stat | kv pairdelim="\n", kvdelim=": ", auto=f | rex "(?<received>\w+)\spackets received" | rex "RcvbufErrors: (?<rec_buf_err>\d+)" | rex "(?<rec_app_err>\d+) packets to unknown port received." | delta received as received_delta | delta rec_buf_err as rec_buf_err_delta | delta rec_app_err as rec_app_err_delta | eval received_delta=abs(received_delta) | eval rec_buf_err_delta=abs(rec_buf_err_delta) | eval rec_app_err_delta=abs(rec_app_err_delta) | eval rec_buf_err_percent=round((rec_buf_err_delta / received_delta * 100),2) | eval rec_app_err_percent=round((rec_app_err_delta / received_delta * 100),2) | table _time received received_delta rec_buf_err rec_buf_err_delta rec_buf_err_percent rec_app_err rec_app_err_delta rec_app_err_percent