This response doesn't offer exact answers to the questions posed in the original post (we didn't use syslog-ng). Instead, it shares our approach to, and experience with, overcoming UDP packet drops.
My understanding of the architecture when sending syslog messages directly to a Splunk UDP collector (source) is:
Kernel UDP mem limits -> Kernel UDP socket buffer size -> Splunk receive queue (in memory) + Splunk persistent queue (on disk) -> Splunk indexer/forwarder
In our case, UDP drops occurred due to:
The kernel network buffers filled up. We ran Ubuntu Desktop as the base OS, so the default buffer values were tuned for desktop use rather than for a server handling high-volume UDP traffic.
When Splunk creates the UDP socket, the default receive buffer is 1.5MB. So even if your OS kernel config allows for larger buffers, Splunk sets a smaller UDP buffer for its socket (see the ss check after this list).
Logging bursts filled Splunk's receive queue for UDP sources when the queue was too small.
The Splunk indexer/forwarder fell behind due to system load, causing the UDP receiving queues to be filled.
Our collector and indexer were on the same system and complex searches caused load issues.
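As an aside, the buffer actually attached to Splunk's socket can be inspected with ss from iproute2 (the port below assumes the UDP input configured later in this post; the "rb" value in the skmem column is the receive buffer size in bytes):
# show socket memory details for the UDP listener on port 9000 (root may be needed for -p)
ss -ulmpn 'sport = :9000'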
We resolved our UDP drops in 5 ways:
Increasing Linux kernel network buffers (similar to posts above)
Increasing Splunk's queue size for UDP sources in inputs.conf
Enabling Splunk's persistent queue functionality in inputs.conf
Re-prioritising the Splunk collector process, which helped deal with load caused by searches
Turning off unnecessary/redundant messages at the source or, failing that, applying a filter on the source feed so that the indexer wasn't burdened with extra data.
(1) Increasing Linux kernel network buffers
Increase the UDP receive buffer size from 128K to 32MB
sysctl -w net.core.rmem_max=33554432
Increase other memory management options that moderate how much memory UDP buffers may use. E.g., the defaults on a system with 2GB of RAM:
net.ipv4.udp_mem = 192576 256768 385152
net.ipv4.udp_rmem_min = 4096
sysctl -w net.ipv4.udp_mem='262144 327680 393216'
Note that net.ipv4.udp_mem works in pages, so multiply the values by PAGE_SIZE, where PAGE_SIZE = 4096 (4K). The maximum udp_mem is then 385152 * 4096 = 1,577,582,592 bytes (~1.5GB).
Increase the queue size for incoming packets. On Ubuntu, the default appeared to be 1000.
sysctl -w net.core.netdev_max_backlog=2000
As per the posts above, make the changes persistent using sysctl.conf.
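For completeness, a sketch of the /etc/sysctl.conf entries using the values from above (tune these to your own memory size):
# UDP tuning for the Splunk collector
net.core.rmem_max = 33554432
net.ipv4.udp_mem = 262144 327680 393216
net.core.netdev_max_backlog = 2000
Run sysctl -p afterwards to load the file without a reboot.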
(2) Increasing Splunk's queue size for UDP sources & (3) Enabling Splunk's persistent queue functionality
Example from inputs.conf
[udp://9000]
connection_host = ip
index = main
# _rcvbuf defaults to 1572864 (1.5MB)
_rcvbuf = 16777216
# queueSize defaults to 500KB
queueSize = 16MB
# persistentQueueSize defaults to 0 (no persistent queue)
persistentQueueSize = 128MB
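After restarting Splunk, the effective settings can be double-checked with btool (the install path below matches our setup in the script further down; adjust to yours):
# list the effective UDP input settings and which file each one comes from
/home/splunk/splunk/bin/splunk btool inputs list udp://9000 --debug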
(4) Re-prioritising the Splunk collector process
We increased the CPU and IO priorities of the 'collector' process so that searches would not impact data collection. Typically, a high-volume Splunk deployment would not have this issue, as a separate search head would be used.
Here's a little bash script I wrote to do the work:
#!/bin/bash
# get PID of the splunk collector process
splunkd_pid="$(head -n 1 /home/splunk/splunk/var/run/splunk/splunkd.pid)"
echo "Splunk collector PID is $splunkd_pid"
# show current processes and listening ports
echo "Splunk ports:"
netstat -nlp4 | grep splunk
echo
# re-nice process to higher priority
splunkd_was_nice="$(ps -p $splunkd_pid --no-headers -o nice)"
splunkd_was_nice="$(expr match "$splunkd_was_nice" '[ ]*\([-0-9]*\)')"
if renice -n -5 -p $splunkd_pid; then
echo "Changed splunkd CPU priority from nice $splunkd_was_nice to -5 (meaner)"
else
echo "ERROR: failed to renice process" 1>&2
exit 1
fi
# re-nice IO priority
splunkd_io_was_nice="$(ionice -p $splunkd_pid)"
splunkd_io_was_nice_class="$(expr match "$splunkd_io_was_nice" '^\([a-zA-Z-]*\):')"
splunkd_io_was_nice_pri="$(expr match "$splunkd_io_was_nice" '.*: prio \([0-9]\)$')"
if ionice -c2 -n1 -p $splunkd_pid; then
echo "Changed splunkd IO class from $splunkd_io_was_nice_class to best-effort"
echo "Changed splunkd IO priority from $splunkd_io_was_nice_pri to 1"
else
echo "ERROR: failed to renice IO for process" 1>&2
exit 1
fi
echo
echo "Splunk collector prioritisation complete"
echo
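Note that the nice/ionice settings apply to the running PID only, so the script needs to be re-run whenever splunkd restarts. Something like the following would chain the two (the script path is hypothetical):
# restart Splunk, then re-apply CPU/IO priorities
/home/splunk/splunk/bin/splunk restart && /home/splunk/scripts/prioritise_splunk.sh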
(5) Apply a filter to the source to remove unwanted input
In our case, we wanted to filter out specific info-level messages, but not all info-level messages; the administrators were unable to selectively control which info-level messages were sent. Our example is provided below.
Specify a transform to enable filtering in props.conf
[udp:9000]
TRANSFORMS-filter=filter_dp
Specify the filter in transforms.conf
[filter_dp]
REGEX=DataPower (\[[^\]]+\]){2}\[info].+?: rule \(
DEST_KEY = queue
FORMAT = nullQueue
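Before deploying the transform, the regex can be sanity-checked against a sample of raw events using PCRE grep (the sample file path is hypothetical):
# print the events that would be routed to nullQueue
grep -P 'DataPower (\[[^\]]+\]){2}\[info].+?: rule \(' /tmp/datapower_sample.log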
Monitoring for UDP errors
We created a cron job to append the output of netstat to a file every 5 minutes:
crontab -l
# m h dom mon dow command
*/5 * * * * netstat -su >> /tmp/udp_status.txt
We then indexed /tmp/udp_status.txt and created a view to graph the errors.
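The inputs.conf monitor stanza for this looks something like the following (the sourcetype matches the searches below):
[monitor:///tmp/udp_status.txt]
sourcetype = udp_stat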
The delta command came in handy to compare time intervals.
To timechart the number of UDP packets received:
sourcetype=udp_stat | kv pairdelim="\n", kvdelim=": ", auto=f | rex "(?<p_rec>\w+)\spackets received" | delta p_rec as p_rec_delta | eval p_rec_delta=abs(p_rec_delta) | timechart span=5m sum(p_rec_delta)
To timechart the number of UDP packets dropped due to buffer errors (every 5 minutes):
sourcetype=udp_stat | kv pairdelim="\n", kvdelim=": ", auto=f | rex "RcvbufErrors: (?<rec_buf_err>\d+)" | delta rec_buf_err as rec_buf_err_delta | eval rec_buf_err_delta=abs(rec_buf_err_delta) | timechart span=5m sum(rec_buf_err_delta)
To timechart the number of UDP packets discarded due to the application not taking them from the buffer:
sourcetype=udp_stat | kv pairdelim="\n", kvdelim=": ", auto=f | rex "(?<rec_app_err>\d+) packets to unknown port received." | delta rec_app_err as rec_app_err_delta | eval rec_app_err_delta=abs(rec_app_err_delta) | timechart span=5m sum(rec_app_err_delta)
To tabulate deltas for the UDP stats:
sourcetype=udp_stat | kv pairdelim="\n", kvdelim=": ", auto=f | rex "(?<received>\w+)\spackets received" | rex "RcvbufErrors: (?<rec_buf_err>\d+)" | rex "(?<rec_app_err>\d+) packets to unknown port received." | delta received as received_delta | delta rec_buf_err as rec_buf_err_delta | delta rec_app_err as rec_app_err_delta | eval received_delta=abs(received_delta) | eval rec_buf_err_delta=abs(rec_buf_err_delta) | eval rec_app_err_delta=abs(rec_app_err_delta) | eval rec_buf_err_percent=round((rec_buf_err_delta / received_delta * 100),2) | eval rec_app_err_percent=round((rec_app_err_delta / received_delta * 100),2) | table _time received received_delta rec_buf_err rec_buf_err_delta rec_buf_err_percent rec_app_err rec_app_err_delta rec_app_err_percent