I'm trying to work out some sourcetype settings. The events look like this:
2015.07.13 08:38:47: system,DEBUG: <<SomeListener>> Thread-foo/host.something.Listener$Method$1@somewhere setting set to false
2015.07.13 08:38:47: system,DEBUG: <<SomeRule>> [ID(ID_aswell)] .method() config=[. fooDO: OID=digits
2015.07.13 08:38:47: system,DEBUG+ . GATEWAYID=. . GatewayDO: OID=digits
2015.07.13 08:38:47: system,DEBUG+ . . SHORTNAME=foo
2015.07.13 08:38:47: system,DEBUG+ . . LONGNAME=FOObar
These are multiline events, so I only want to break when there is a : after the DEBUG, which led me to the following regex:
(\r\n)*\d{4}\.\d{2}\.\d{2}\s\d{2}\:\d{2}\:\d{2}\:\s[^\,]+\,[^+:]+[\:]
When I use this with BREAK_ONLY_BEFORE, it works like a charm. However, I don't see the need to first break lines and then merge them if the same result can be achieved by line breaking alone. So I also tried the above regex with LINE_BREAKER, but then the data in the preview and after indexing becomes this:
015.07.13 08:38:47: system,DEBUG: <<SomeListener>> Thread-foo/host.something.Listener$Method$1@somewhere setting set to false
015.07.13 08:38:47: system,DEBUG: <<SomeRule>> [ID(ID_aswell)] .method() config=[. fooDO: OID=digits
015.07.13 08:38:47: system,DEBUG+ . GATEWAYID=. . GatewayDO: OID=digits
015.07.13 08:38:47: system,DEBUG+ . . SHORTNAME=foo
015.07.13 08:38:47: system,DEBUG+ . . LONGNAME=FOObar
with obviously wrong timestamps and some weird markup in the preview, but otherwise OK data (in particular, the multiline events are still preserved).
What's happening here?
On a side note, is my understanding correct that LINE_BREAKER is generally preferable to BREAK_ONLY_BEFORE from a processing load perspective? If yes, why does the "Add Data" wizard in Splunk always use the latter and not allow the user to set LINE_BREAKER under "Advanced" explicitly, so that you have to set it in props.conf directly if you want to use it?
I cannot speak to why the wizard does what it does, but I can explain what confuses most people about LINE_BREAKER. Once you redefine LINE_BREAKER from the default, it no longer has anything to do with newlines, which means that "line" doesn't mean what you think it means, and so SHOULD_LINEMERGE doesn't, either. Generally, use LINE_BREAKER= and SHOULD_LINEMERGE = false together.
Splunk processes every stream of input data as follows:
• Break the stream into "lines" using LINE_BREAKER. With the default LINE_BREAKER ([\r\n]+), a "line" can never contain a newline, but yours probably allows them.
• Check whether we are done (SHOULD_LINEMERGE=false) or whether we are merging multiple "lines" into one event using BREAK_ONLY_BEFORE, etc. At this point, Splunk treats each event as either multi-"line" or single-"line", as defined by LINE_BREAKER, not by a newline character boundary (as you are used to thinking).
So the problem you are specifically having is probably because you were using BOTH LINE_BREAKER= AND SHOULD_LINEMERGE=true (which is the default), which is why you needed to add in the BREAK_ONLY_BEFORE. If you use ONLY LINE_BREAKER= and SHOULD_LINEMERGE=false, then you should not need BREAK_ONLY_BEFORE. You should always set SHOULD_LINEMERGE explicitly so that you are not misremembering the default and to "comment" your explicit intent, so that people don't try to "help" you by "fixing" it later and adding it, which will break everything.
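To make the first step above concrete, here is a rough Python model of how the props.conf spec describes LINE_BREAKER: wherever the regex matches, the text of the first capturing group is discarded, what precedes it ends one event, and what follows it starts the next. This is only a sketch of the documented semantics, not Splunk's actual implementation; `break_events` and the sample data are made up for illustration, and the model assumes the capturing group always takes part in the match.

```python
import re

def break_events(stream, line_breaker):
    """Split `stream` at every match of `line_breaker`.

    Hypothetical helper: the text matched by capturing group 1 is dropped,
    everything before it closes the current event, everything after it
    opens the next one. Assumes group 1 participates in every match.
    """
    events = []
    start = 0
    for match in re.finditer(line_breaker, stream):
        events.append(stream[start:match.start(1)])
        start = match.end(1)
    events.append(stream[start:])
    return [event for event in events if event]

# Shortened sample modelled on the events in the question.
SAMPLE = (
    "2015.07.13 08:38:47: system,DEBUG: <<SomeRule>> config=[. fooDO: OID=digits\n"
    "2015.07.13 08:38:47: system,DEBUG+ . . SHORTNAME=foo\n"
    "2015.07.13 08:38:47: system,DEBUG: <<SomeListener>> setting set to false"
)

DEFAULT = r"([\r\n]+)"
CUSTOM = r"([\r\n]+)\d{4}\.\d{2}\.\d{2}\s\d{2}:\d{2}:\d{2}:\s[^,]+,[^+:]+:"

default_events = break_events(SAMPLE, DEFAULT)  # one event per physical line
custom_events = break_events(SAMPLE, CUSTOM)    # DEBUG+ line stays attached

print(len(default_events), len(custom_events))
```

With the default breaker every physical line becomes its own "line"; with the custom breaker the DEBUG+ continuation stays inside the first event, so no merging step is needed afterwards.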
Perhaps I didn't make it clear enough, but I used BREAK_ONLY_BEFORE and LINE_BREAKER exclusively - and I also added SHOULD_LINEMERGE = false to the LINE_BREAKER version, because that defaults to true if I'm not mistaken. So I had these two configurations in my props.conf:
# A
[sourcetype]
NO_BINARY_CHECK = true
BREAK_ONLY_BEFORE = (\r\n)?\d{4}\.\d{2}\.\d{2}\s\d{2}\:\d{2}\:\d{2}\:\s[^\,]+\,[^+:]+[\:]
and
# B
[sourcetype]
NO_BINARY_CHECK = true
LINE_BREAKER = (\r\n)?\d{4}\.\d{2}\.\d{2}\s\d{2}\:\d{2}\:\d{2}\:\s[^\,]+\,[^+:]+[\:]
SHOULD_LINEMERGE = false
with the former working like it should and the second somehow removing the initial "2" of the year in the timestamp, thus messing everything up.
I see what could be the problem: you have a string instead of a character class for your line breaks. Why are you using (\r\n) instead of ([\r\n]+)? And why did you make them optional with the question mark? I think you probably need this:
# A
[sourcetype]
NO_BINARY_CHECK = true
# BREAK_ONLY_BEFORE is tested against lines that are already split, so no newline group is needed here
BREAK_ONLY_BEFORE = ^\d{4}\.\d{2}\.\d{2}\s\d{2}\:\d{2}\:\d{2}\:\s[^\,]+\,[^+:]+[\:]
and
# B
[sourcetype]
NO_BINARY_CHECK = true
LINE_BREAKER = ([\r\n]+)\d{4}\.\d{2}\.\d{2}\s\d{2}\:\d{2}\:\d{2}\:\s[^\,]+\,[^+:]+[\:]
SHOULD_LINEMERGE = false
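For anyone comparing the two variants, the following Python snippet shows what each regex itself matches on data with bare \n line endings. That the file uses bare \n is an assumption; the thread does not state it. This only looks at the regex, it does not reproduce what Splunk does internally with an empty capture. The names `OPTIONAL`, `CHAR_CLASS` and `data` are made up for the example.

```python
import re

HEADER = r"\d{4}\.\d{2}\.\d{2}\s\d{2}:\d{2}:\d{2}:\s[^,]+,[^+:]+:"
OPTIONAL = re.compile(r"(\r\n)?" + HEADER)      # group may match nothing at all
CHAR_CLASS = re.compile(r"([\r\n]+)" + HEADER)  # group must consume a line ending

# Two events separated by a bare "\n" (assumed line ending).
data = (
    "2015.07.13 08:38:47: system,DEBUG: first event\n"
    "2015.07.13 08:38:47: system,DEBUG: second event"
)

newline_pos = data.index("\n")
m_opt = OPTIONAL.search(data, newline_pos)
m_cls = CHAR_CLASS.search(data, newline_pos)

# Optional group: the match starts at the "2" and group 1 captured nothing.
print(m_opt.start(), repr(m_opt.group(1)))
# Character class: the match starts at the newline and group 1 holds it.
print(m_cls.start(), repr(m_cls.group(1)))
```

In the optional variant the capturing group does not take part in the match at all, so there is no newline text for LINE_BREAKER to discard; in the character-class variant the group always holds the line ending.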
Ah! Yes, the issue was caused by not using a character class... well, I'd say, more precisely because in my initial settings, I required exactly a return and a newline. I feel pretty dumb for not noticing that myself. I don't exactly understand why this led to the described behavior though.
I made the capturing group optional because I've had it happen to me that two events weren't separated by a return/newline, and then making the capturing group optional still made splunk break them into two events. It kinda caught on; is there any downside to it?
By the way, I've further "simplified" the regex by using a group for the date and time pattern instead of the explicit two-digit-one-separator expression, so for anyone following this it now looks like this:
([\r\n]+)\d{2}(?:\d{2}.){6}\s[^\,]+\,[^+:]+[\:]
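A quick way to sanity-check the shortened pattern against the explicit one is to run both over the sample lines from the question. The sketch below does that in Python (whose regex syntax is compatible for these patterns); `EXPLICIT`, `COMPACT` and the test strings are names chosen for the example. Note that the unescaped . in the compact version matches any separator, so it is slightly more permissive.

```python
import re

EXPLICIT = re.compile(
    r"([\r\n]+)\d{4}\.\d{2}\.\d{2}\s\d{2}:\d{2}:\d{2}:\s[^,]+,[^+:]+:"
)
COMPACT = re.compile(r"([\r\n]+)\d{2}(?:\d{2}.){6}\s[^,]+,[^+:]+:")

lines = [
    "2015.07.13 08:38:47: system,DEBUG: <<SomeListener>> setting set to false",
    "2015.07.13 08:38:47: system,DEBUG: <<SomeRule>> [ID(ID_aswell)] .method()",
    "2015.07.13 08:38:47: system,DEBUG+ . GATEWAYID=. . GatewayDO: OID=digits",
    "2015.07.13 08:38:47: system,DEBUG+ . . SHORTNAME=foo",
    "2015.07.13 08:38:47: system,DEBUG+ . . LONGNAME=FOObar",
]

# Prefix each line with "\n" so the leading capturing group has something to match.
results = [
    (bool(EXPLICIT.match("\n" + line)), bool(COMPACT.match("\n" + line)))
    for line in lines
]
for line, (explicit_hit, compact_hit) in zip(lines, results):
    print(explicit_hit, compact_hit, line[:40])

# A made-up line with "x" separators: only the compact pattern accepts it.
odd = "\n2015x07x13x08x38x47x system,DEBUG: odd separators"
loose = (bool(EXPLICIT.match(odd)), bool(COMPACT.match(odd)))
print(loose)
```

Both patterns break before the two DEBUG: lines and leave the DEBUG+ continuation lines alone; they differ only on input whose separators are not the expected dots, space and colons.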