Getting Data In

rsyslog connection to splunk stalling

tcutts
New Member

Most of our systems use rsyslog for logging, and log their events over UDP to a central splunk server. This works fine.

One of our user groups has its own splunk server, and wants to log separately and in more detail to that for certain applications of their own. They have set up forwarding over TCP, not UDP, with the following rules in our rsyslog configuration:

:syslogtag, startswith, "psd_" @@psd1d:5140;RSYSLOG_ForwardFormat
:syslogtag, startswith, "psd_" ~

Most of the time, this works fine. Once in a blue moon, however, something very strange happens. The web servers which are creating these log events (the application is a Rails app) suddenly become very slow. On investigation, it turns out that the calls to the logger are what is taking the time. Worse than that, the events are no longer being logged anywhere.

Running an strace on the rsyslogd shows that it's doing nothing at all. Restarting the rsyslogd seems to wake things up and get things going. I suspect that restarting splunkd would also fix it, since that would also cause the TCP connection to drop, although I haven't tried that yet.

One thing we do know is that it tends to happen to several machines simultaneously, which points to a problem at the splunk end rather than at the rsyslog end.

We've never seen this with UDP forwarding, but the users don't want UDP forwarding because they're scared of losing events, or getting events in the wrong order.

My current theory is that something's causing the TCP connections to get into a wedged state, but I have no idea what. Has anyone seen anything like this before?

Thanks,

Tim


gkanapathy
Splunk Employee

Well, it's inherent in the nature of TCP vs UDP that a stall on the receiving end will cause the sending application to stall. They have to decide whether they're more scared of that or of losing data.

The way to minimize this problem is to have a very large buffer in case Splunk stalls, and the way to do that (and in general, the way we recommend you handle all incoming syslog data, UDP included) is to capture the data using a syslog (or rsyslog) daemon on the Splunk server, have that write to files (preferably splitting the files up by host) and then have Splunk tail/monitor those files. Then set the files to roll over up to however much space you want to use as buffer. Basically, you use the files themselves to buffer the output. This also makes it resistant to Splunk stalls, stops, restarts, etc., since the syslog daemon is doing less stuff than the Splunk server will be doing.
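
As a rough illustration of the setup described above, the rsyslog daemon on the Splunk server might look something like the sketch below. It uses the same legacy directive syntax as the rules in the question. The port 5140 comes from the original post; the directory /var/log/remote, the file name psd.log and the template name PerHostFile are made up for the example, so adjust them to your own layout.

```
# Sketch only: rsyslog on the Splunk server receives the TCP stream
# and writes one file per sending host. Paths and names are assumptions.

# Load the TCP input module and listen on the port the clients forward to.
$ModLoad imtcp
$InputTCPServerRun 5140

# Dynamic file name: one directory per sending host (hypothetical path).
$template PerHostFile,"/var/log/remote/%HOSTNAME%/psd.log"

# Write the psd_ events to the per-host file, then discard them so they
# do not also land in the local system logs.
:syslogtag, startswith, "psd_" ?PerHostFile
:syslogtag, startswith, "psd_" ~
```

Splunk would then be pointed at /var/log/remote with a file monitor input rather than listening on the network port itself. File rotation, which sets how much buffer you get, is not shown here and would be handled separately, for example with logrotate.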


jrwilk01
Explorer

Take a look at the rsyslog docs here: http://www.rsyslog.com/doc/rsyslog_reliable_forwarding.html

I've found forwarding to be flaky without the suggested tweaks.
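
For reference, the tweaks on that page boil down to giving the forwarding action its own disk-assisted queue, so that rsyslog buffers events instead of blocking when the remote end stops accepting data. Applied to the rules from the question, a sketch might look like this. The spool directory /var/spool/rsyslog, the queue file prefix psdfwd and the 1g disk limit are assumptions for the example, not values from the original post.

```
# Sketch only: client-side forwarding with a disk-assisted action queue.
# Spool directory, queue file prefix and size limit are assumptions.

# Where rsyslog keeps its queue spool files.
$WorkDirectory /var/spool/rsyslog

# The $ActionQueue* settings below apply to the next action only.
# In-memory linked-list queue dedicated to this action.
$ActionQueueType LinkedList
# Setting a file name enables spilling to disk.
$ActionQueueFileName psdfwd
# Cap the on-disk buffer.
$ActionQueueMaxDiskSpace 1g
# Keep queued events across an rsyslog restart.
$ActionQueueSaveOnShutdown on
# Retry forever instead of dropping the action.
$ActionResumeRetryCount -1

:syslogtag, startswith, "psd_" @@psd1d:5140;RSYSLOG_ForwardFormat
:syslogtag, startswith, "psd_" ~
```

With a queue like this in place, a stalled TCP connection should fill the queue rather than hold up the applications that are logging, at least until the disk limit is reached.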
