I have a forwarder (4.2, build 96430) set up on one server to forward logs to two indexers (4.3, build 115073). When I wasn't doing anything but forwarding the logs, everything was working normally. However, when I started assigning a sourcetype to a specific source through the following lines in props.conf, I started seeing issues:
[source::.../*.request_log]
sourcetype = access_webservices
When assigning a sourcetype at the forwarder, I see a variable number of null characters inserted into some events, usually at the end. Sometimes it's only 100 or so, sometimes it's thousands of them. Here are a couple of examples:
10.0.0.100 www.example.com - [01/Mar/2012:09:34:53 +0100] "GET /path/a.gif HTTP/1.1" 200 562 "https://www.example.com/path/index.html" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB7.3; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Tablet PC 2.0)" 5926 "https://127.0.0.1:8300" "cookie1=a2" \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...(followed by several more, 799 in total)...\x00\x00\x00\x00\x00\x00\x00test.html" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; AskTbMP3R7/5.12.2.16749)" 14600 "http://127.0.0.1:7300" "cookie1=a2"
10.0.0.100 www.example.com - [01/Mar/2012:09:34:53 +0100] "GET /path2/b.gif HTTP/1.1" 200 1977 "http://www.example2.com/path/test.html" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)" 21767 "ajp://server3.internal.net:9201" \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...(followed by several more, 297 in total)...\x00\x00\x00\x00\x00\x00Windows NT 6.1; Trident/4.0; GTB7.3; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Tablet PC 2.0)" 7045 "ajp://server3.internal.net:9201" \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...(followed by several more, 7500 in total)...\x00\x00\x00\x00\x00\x00
10.0.0.100 www.example.com - [01/Mar/2012:09:34:53 +0100] "GET /path3/c.gif HTTP/1.1" 200 1947 "https://www.example.com/test/index.html" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; GTB7.3; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Tablet PC 2.0)" 4170 "https://127.0.0.1:8300" "cookie1=a2" \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...(followed by several more, 2500 in total)...\x00\x00\x00\x00\x00\x0
It's only happening on a small percentage of events in a small percentage of files. I'm not doing anything with that sourcetype at the indexer or search head (also 4.3, build 115073) and I verified that the null characters are not occurring in the log file but are in the raw data in Splunk by piping the search to "table _raw".
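For anyone wanting to repeat that verification outside of Splunk, one way to count NUL bytes in the on-disk file is with `tr`. This is just a sketch; the path is a placeholder and the script writes its own sample file rather than touching a real log:

```shell
#!/bin/sh
# Sketch: count NUL (\x00) bytes in a log file to confirm whether the
# padding exists on disk or only in the indexed data.
# /tmp/sample.request_log stands in for the real log path.
printf 'GET /path/a.gif 200\0\0\0\n' > /tmp/sample.request_log
nulls=$(tr -cd '\000' < /tmp/sample.request_log | wc -c)
echo "NUL bytes found: $nulls"
```

A count of zero on the file while the indexed `_raw` shows runs of `\x00` would point at the reader (or the filesystem it reads through) rather than the writer.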
My outputs.conf on the forwarder looks like this:
[tcpout]
defaultGroup = indexer_A
indexAndForward = false
disabled = false
[tcpout:indexer_A]
autoLB = true
server = server01:9997,server02:9997
The inputs.conf on both indexers looks like this:
[splunktcp://9997]
I looked at this thread but I don't think it's relevant since neither Windows servers nor other character sets are at play. I also found this thread but I don't think it applies since I am using splunktcp in inputs.conf and I'm not seeing any other metacharacters or keywords.
Has anyone run into this issue before?
Seems likely that your application or filesystem is null-padding your files before writing them, or the filesystem is extending the file but not returning the contents immediately. It's unclear to me why it would do this. What application/server is creating these files, or is there some intermediate process capturing and writing the data? What filesystem is being used, and how is it connected (NFS, local, etc.) to both the writer and the reader/forwarder?
Had the same issue with a monitor input of a local CIFS mount (CentOS 6.4). Adding the option directio to the mount options resolved it.
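For reference, a CIFS mount entry with directio might look like this in /etc/fstab (the share path, mount point, and credentials file are placeholders):

```
//nas01/logs  /mnt/logs  cifs  ro,directio,credentials=/etc/cifs.cred  0 0
```

directio bypasses the client-side page cache, so reads go to the server rather than returning locally cached (and possibly stale or sparsely filled) pages.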
This mount option resolved the issue for me. I have a similar configuration, a CIFS share mounted on a CentOS 5.9 server.
Sure, I've edited my question above to include them.
@lsouzek : I would say that of the two possible actions we identified (taking NFS out of the picture or the multiple writers), you'll probably want to start with whichever is the least disruptive. My suggestion would be to pick one application whose logs are presenting this issue and have it write to a local disk instead of NFS, and then use a forwarder to monitor those files. But again, that depends on which change is the least disruptive.
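If you try that, the local forwarder's inputs.conf could carry the monitor stanza and the sourcetype assignment directly. A sketch, where the path is a placeholder for wherever the application writes locally:

```
[monitor:///var/log/webapp/*.request_log]
sourcetype = access_webservices
```

That keeps the sourcetyping on the input side while removing NFS from the read path entirely.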
I'm not the owner of all pieces of this process so I'm not sure what I can get changed but I would like to figure out what my priorities should be. If I could get a local forwarder on the app servers but it was still reading from NAS and there were multiple writers, would that make any difference? If I eliminate the multiple writers, is that likely to resolve the issue or is NFS always going to be a problem?
@gkanapathy
@hexx : I assigned a custom sourcetype (access_webservices) to the data. When I removed the sourcetyping, I wasn't seeing those characters anymore but I discovered today that our admin had configured the indexer to "manually" remove the null characters from the default sourcetype (request_log-too_small). You're probably right then, the null characters were probably occurring in both cases, I just wasn't seeing them for the default type.
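For what it's worth, "manually" stripping NUL characters at parse time on an indexer is typically done with a SEDCMD in props.conf. A sketch of what that admin config may resemble (the stanza name is taken from the default sourcetype mentioned above; the SEDCMD name is arbitrary):

```
# On the indexer, props.conf -- strip runs of NUL bytes at parse time
[request_log-too_small]
SEDCMD-strip_nulls = s/\x00+//g
```

Note that this masks the symptom in the indexed data; it doesn't address whatever is padding the files in the first place.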
I think that not setting the sourcetype might have slowed down the Splunk input process enough that maybe it didn't read ahead to the nulls. When you don't set a sourcetype, the input process/forwarder spends more time trying to guess a sourcetype as it reads a file. That's just my speculation.
Yes, the underlying filesystem doesn't/shouldn't matter. It might still be a Linux NFS client bug. It might be possible to resolve it via NFS settings, though I'm not sure.
@lsouzek : I really don't think that input-time sourcetyping is involved, but out of curiosity, what sourcetype did you assign to the data? A simple test to confirm that sourcetyping is not responsible for the breakage would be to simply remove the pertinent stanza from props.conf and let some data get indexed while monitoring for "\x00" characters.
Other references:
- http://comments.gmane.org/gmane.linux.nfs/21948
- https://lkml.org/lkml/1998/4/1/6
- http://www.zsh.org/mla/users/2005/msg00171.html
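A search along these lines can monitor for the characters as data comes in (the index name is a placeholder; adjust the sourcetype if you've removed the assignment):

```
index=main sourcetype=access_webservices
| regex _raw="\x00"
| table _time source _raw
```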
It definitely sounds similar. It's strange, though: the Red Hat thread says that the bug was resolved in a 5.2 update, and all of my servers involved here are 5.5. I didn't see when the issue in the Kernel Trap thread was resolved.
Is it not strange that I never saw this issue when Splunk was not adding a sourcetype to the events and that once I enabled the sourcetype in props.conf it popped up? Adding a sourcetype shouldn't change anything about the read operations for Splunk, should it?
Is it possible to set the sourcetype at the indexer instead? Maybe I could test that as a workaround.
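If you want to test that workaround, the same props.conf stanza should work on the indexers, since parsing happens there for data received over splunktcp from a non-parsing forwarder. A sketch, to be placed on both server01 and server02:

```
# $SPLUNK_HOME/etc/system/local/props.conf on each indexer
[source::.../*.request_log]
sourcetype = access_webservices
```

This assumes the forwarder is not parsing the data itself; if it were a heavy forwarder, parse-time settings on the indexer would no longer apply to that stream.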
@lsouzek : It sounds like you are indeed in the case that @gkanapathy describes above, specifically where your filesystem (here, NFS) is responsible for the NULL-padding at or near EOF. AFAIK, this is a bug with the NFS client of certain Linux kernel revisions. References:
- http://kerneltrap.org/index.php?q=mailarchive/linux-kernel/2008/9/22/3369584/thread
- https://bugzilla.redhat.com/show_bug.cgi?id=429755
I'm seeing this with two kinds of logs, one from Apache running on Linux and the other from WebSphere App Server on Linux. The logs are located on a NAS share which is NFS-mounted on both the application servers and the forwarder; the app servers write directly to it, and the forwarder reads from it. The underlying filesystem on the NAS shares is WAFL from NetApp.
Could you share a sample event or two showing the "\x00" string?