I have a forwarder (4.2, build 96430) set up on one server to forward logs to two indexers (4.3, build 115073). When I wasn't doing anything but forwarding the logs, everything was working normally. However, when I started assigning a sourcetype to a specific source through the following lines in props.conf, I started seeing issues:
[source::.../*.request_log]
sourcetype = access_webservices
When assigning a sourcetype at the forwarder, I see a variable number of null characters inserted into some events, usually at the end. Sometimes it's only 100 or so, sometimes it's thousands of them. Here are a couple of examples:
10.0.0.100 www.example.com - [01/Mar/2012:09:34:53 +0100] "GET /path/a.gif HTTP/1.1" 200 562 "https://www.example.com/path/index.html" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB7.3; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Tablet PC 2.0)" 5926 "https://127.0.0.1:8300" "cookie1=a2" \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...(followed by several more, 799 in total)...\x00\x00\x00\x00\x00\x00\x00test.html" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; AskTbMP3R7/5.12.2.16749)" 14600 "http://127.0.0.1:7300" "cookie1=a2"
10.0.0.100 www.example.com - [01/Mar/2012:09:34:53 +0100] "GET /path2/b.gif HTTP/1.1" 200 1977 "http://www.example2.com/path/test.html" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" 21767 "ajp://server3.internal.net:9201" \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...(followed by several more, 297 in total)...\x00\x00\x00\x00\x00\x00Windows NT 6.1; Trident/4.0; GTB7.3; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Tablet PC 2.0)" 7045 "ajp://server3.internal.net:9201" \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...(followed by several more, 7500 in total)...\x00\x00\x00\x00\x00\x00
10.0.0.100 www.example.com - [01/Mar/2012:09:34:53 +0100] "GET /path3/c.gif HTTP/1.1" 200 1947 "https://www.example.com/test/index.html" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB7.3; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Tablet PC 2.0)" 4170 "https://127.0.0.1:8300" "cookie1=a2" \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...(followed by several more, 2500 in total)...\x00\x00\x00\x00\x00\x0
It's only happening on a small percentage of events in a small percentage of files. I'm not doing anything with that sourcetype at the indexer or search head (also 4.3, build 115073) and I verified that the null characters are not occurring in the log file but are in the raw data in Splunk by piping the search to "table _raw".
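For anyone wanting to repeat that verification outside of Splunk, one way to count NUL bytes in the on-disk file is with `tr`. This is just a sketch; the path is a placeholder and the script writes its own sample file rather than touching a real log:

```shell
#!/bin/sh
# Sketch: count NUL (\x00) bytes in a log file to confirm whether the
# padding exists on disk or only in the indexed data.
# /tmp/sample.request_log stands in for the real log path.
printf 'GET /path/a.gif 200\0\0\0\n' > /tmp/sample.request_log
nulls=$(tr -cd '\000' < /tmp/sample.request_log | wc -c)
echo "NUL bytes found: $nulls"
```

A count of zero on the file while the indexed `_raw` shows runs of `\x00` would point at the reader (or the filesystem it reads through) rather than the writer.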
My outputs.conf on the forwarder looks like this:
[tcpout]
defaultGroup = indexer_A
indexAndForward = false
disabled = false
[tcpout:indexer_A]
autoLB = true
server = server01:9997,server02:9997
The inputs.conf on both indexers looks like this:
[splunktcp://9997]
I looked at this thread but I don't think it's relevant since neither Windows servers nor other character sets are at play. I also found this thread but I don't think it applies since I am using splunktcp in inputs.conf and I'm not seeing any other metacharacters or keywords.
Has anyone run into this issue before?
Seems likely that your application or filesystem is null-padding your files before writing them, or the filesystem is extending the file but not returning the contents immediately. It's unclear to me why it would do this. What application/server is creating these files, or is there some intermediate process capturing and writing the data? What filesystem is being used, and how is it connected (NFS, local, etc.) to both the writer and the reader/forwarder?
Had the same issue with a monitor input of a local CIFS mount (CentOS 6.4). Adding the option directio to the mount options resolved it.
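For reference, a CIFS mount entry with directio might look like this in /etc/fstab (the share path, mount point, and credentials file are placeholders):

```
//nas01/logs  /mnt/logs  cifs  ro,directio,credentials=/etc/cifs.cred  0 0
```

directio bypasses the client-side page cache, so reads go to the server rather than returning locally cached (and possibly stale or sparsely filled) pages.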
This mount option resolved the issue for me. I have a similar configuration, a CIFS share mounted on a CentOS 5.9 server.
Sure, I've edited my question above to include them.
@lsouzek : I would say that of the two possible actions we identified (taking NFS out of the picture or the multiple writers), you'll probably want to start with whichever is the least disruptive. My suggestion would be to pick one application whose logs are presenting this issue and have it write to a local disk instead of NFS, and then use a forwarder to monitor those files. But again, that depends on which change is the least disruptive.
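If you try that, the local forwarder's inputs.conf could carry the monitor stanza and the sourcetype assignment directly. A sketch, where the path is a placeholder for wherever the application writes locally:

```
[monitor:///var/log/webapp/*.request_log]
sourcetype = access_webservices
```

That keeps the sourcetyping on the input side while removing NFS from the read path entirely.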
I'm not the owner of all pieces of this process so I'm not sure what I can get changed but I would like to figure out what my priorities should be. If I could get a local forwarder on the app servers but it was still reading from NAS and there were multiple writers, would that make any difference? If I eliminate the multiple writers, is that likely to resolve the issue or is NFS always going to be a problem?
@gkanapathy
@hexx : I assigned a custom sourcetype (access_webservices) to the data. When I removed the sourcetyping, I wasn't seeing those characters anymore but I discovered today that our admin had configured the indexer to "manually" remove the null characters from the default sourcetype (request_log-too_small). You're probably right then, the null characters were probably occurring in both cases, I just wasn't seeing them for the default type.
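For what it's worth, "manually" stripping NUL characters at parse time on an indexer is typically done with a SEDCMD in props.conf. A sketch of what that admin config may resemble (the stanza name is taken from the default sourcetype mentioned above; the SEDCMD name is arbitrary):

```
# On the indexer, props.conf -- strip runs of NUL bytes at parse time
[request_log-too_small]
SEDCMD-strip_nulls = s/\x00+//g
```

Note that this masks the symptom in the indexed data; it doesn't address whatever is padding the files in the first place.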
I think that not setting the sourcetype might have slowed down the Splunk input process enough that maybe it didn't read ahead to the nulls. When you don't set a sourcetype, the input process/forwarder spends more time trying to guess a sourcetype as it reads a file. That's just my speculation.
Yes, the underlying filesystem doesn't/shouldn't matter. It might still be a Linux NFS client bug. It might be possible to resolve it via NFS settings, though I'm not sure.
@lsouzek : I really don't think that input-time sourcetyping is involved, but out of curiosity, what sourcetype did you assign to the data? A simple test to confirm that sourcetyping is not responsible for the breakage would be to simply remove the pertinent stanza from props.conf and let some data get indexed while monitoring for "\x00" characters.
Other references:
- http://comments.gmane.org/gmane.linux.nfs/21948
- https://lkml.org/lkml/1998/4/1/6
- http://www.zsh.org/mla/users/2005/msg00171.html
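A search along these lines can monitor for the characters as data comes in (the index name is a placeholder; adjust the sourcetype if you've removed the assignment):

```
index=main sourcetype=access_webservices
| regex _raw="\x00"
| table _time source _raw
```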
It definitely sounds similar. It's strange, though: the Red Hat thread says that the bug was resolved in a 5.2 update, and all of my servers involved here are 5.5. I didn't see when the issue in the Kernel Trap thread was resolved.
Is it not strange that I never saw this issue when Splunk was not adding a sourcetype to the events and that once I enabled the sourcetype in props.conf it popped up? Adding a sourcetype shouldn't change anything about the read operations for Splunk, should it?
Is it possible to set the sourcetype at the indexer instead? Maybe I could test that as a workaround.
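If you want to test that workaround, the same props.conf stanza should work on the indexers, since parsing happens there for data received over splunktcp from a non-parsing forwarder. A sketch, to be placed on both server01 and server02:

```
# $SPLUNK_HOME/etc/system/local/props.conf on each indexer
[source::.../*.request_log]
sourcetype = access_webservices
```

This assumes the forwarder is not parsing the data itself; if it were a heavy forwarder, parse-time settings on the indexer would no longer apply to that stream.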
@lsouzek : It sounds like you are indeed in the case that @gkanapathy describes above, specifically where your filesystem (here, NFS) is responsible for the NULL-padding at or near EOF. AFAIK, this is a bug with the NFS client of certain Linux kernel revisions. References:
- http://kerneltrap.org/index.php?q=mailarchive/linux-kernel/2008/9/22/3369584/thread
- https://bugzilla.redhat.com/show_bug.cgi?id=429755
I'm seeing this with two kinds of logs, one from Apache running on Linux and the other from WebSphere App Server on Linux. The logs are located on a NAS share which is NFS-mounted on both the application servers and the forwarder; the app servers write directly to it, and the forwarder reads from it. The underlying filesystem on the NAS shares is WAFL from NetApp.
Could you share a sample event or two showing the "\x00" string?