Understanding regex used in LINE_BREAKER

bshamsian · 12-17-2012

Using Splunk 4.3 - My data input file is in JSON format with multiple events in each file stored in an events array.

Sample Data (Greatly simplified):

{"events":[{"requests":[{"document":"4968435","requestheaders":{"Content-Type":"url",},"headers":{"Data":"123456",},}],"list":[{"type":"W8021X"},{"rssi":"97",}],"event":"ONE","systemdate":"2012-10-0910:33:39-0700"},{"systemdate":"2012-10-0910:35:30-0700","list":[{"rssi":"97",},{"id":"TWO",}],"event":"TWO"}]}

Every record is one of the elements in the "events" array in this JSON file. Every event has a date stored in the "systemdate" field.

To come up with a regex that can parse nested braces in a JSON array I used the method outlined at link text. Per this article I created the following regex:

((\{"events":\[)|,)\s*\{(?:[^{}]|\{[^{}]*(?:[^{}]|\{[^{}]*(?:[^{}]|\{[^{}]*\})*\})*\})*\}

The regex above supports braces nested up to three levels deep, which is enough for my data set. I tested it on several websites (such as link text) and in RegexBuddy; they all show the regex working correctly, and I get two extracted events from the data above.

Based on the above, I created a sourcetype with the following settings:

KV_MODE=json
LINE_BREAKER=((\{"events":\[)|,)\s*\{(?:[^{}]|\{[^{}]*(?:[^{}]|\{[^{}]*(?:[^{}]|\{[^{}]*\})*\})*\})*\}
SHOULD_LINEMERGE=true
TIME_PREFIX="systemdate":"
TRUNCATE=0

However, the regex does not work in Splunk: the Data Previewer does not show two records, nor does it pick up the date/time.

TRUNCATE is set to 0 because each event is more than 1000 characters, and the entire file has no line breaks and is many thousands of characters long. I tried SHOULD_LINEMERGE=false, but it made no difference.

How can I troubleshoot where this is going wrong? Is it failing in the regex or am I using the LINE_BREAKER incorrectly?

Any help is greatly appreciated.

jonuwz · 12-18-2012

This is a head scratcher..

After cleaning up the JSON (trailing commas are not allowed in arrays or hashes, unlike in Perl), your regex splits the sample data into 3 events:

{"requests":[{"document":"4968435","requestheaders":{"Content-Type":"url",},"headers":{"Data":"123456",},}],"list":[{"type":"W8021X"}

{"rssi":"97"}],"event":"ONE","systemdate":"2012-10-0910:33:39-0700"}

{"systemdate":"2012-10-0910:35:30-0700","list":[{"rssi":"97"}

So it doesn't handle the 1st event properly.

It seems that the part of the regular expression following the capture group acts as a lookahead assertion, which behaves differently than if it were its own capture group.
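The difference between the two readings can be reproduced outside Splunk. The following is a minimal Python sketch, not Splunk's implementation. It assumes a shortened, comma-cleaned version of the sample data (`sample`) and compares the original regex used as a whole-match pattern (`CONSUMING`) with the same regex whose tail is wrapped in a lookahead (`LOOKAHEAD`), so that only the capture group is consumed. All variable names are illustrative.

```python
import re

# Shortened sample with trailing commas removed (an assumption, for brevity).
sample = ('{"events":[{"list":[{"type":"W8021X"},{"rssi":"97"}],"event":"ONE"},'
          '{"list":[{"rssi":"97"},{"id":"TWO"}],"event":"TWO"}]}')

# The two halves of the regex from the original post.
HEAD = r'((\{"events":\[)|,)'
TAIL = r'\s*\{(?:[^{}]|\{[^{}]*(?:[^{}]|\{[^{}]*(?:[^{}]|\{[^{}]*\})*\})*\})*\}'

CONSUMING = re.compile(HEAD + TAIL)                 # tail is part of the match
LOOKAHEAD = re.compile(HEAD + '(?=' + TAIL + ')')   # tail is only asserted

# Reading 1: every whole match is treated as one event (what regex testers show).
consumed = [m.group(0) for m in CONSUMING.finditer(sample)]

# Reading 2: the text is cut at capture group 1 and the group itself is dropped.
chunks = []
start = None
for m in LOOKAHEAD.finditer(sample):
    if start is not None:
        chunks.append(sample[start:m.start(1)])
    start = m.end(1)
chunks.append(sample[start:])

print(len(consumed), "whole matches")
print(len(chunks), "chunks when cutting at group 1:")
for chunk in chunks:
    print("  ", chunk)
```

On this input the consuming pattern finds two matches, while cutting at the capture group yields four chunks. The reason is that a comma followed by a balanced {...} also occurs inside each "list" array, so those inner commas qualify as break points too.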

Rather than messing about in Splunk to figure it out, try this script and fiddle with the regex until it works:

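#!/usr/bin/perl
# Sample data as a single string (no line breaks, as in the input file).
my $string='{"events":[{"requests":[{"document":"4968435","requestheaders":{"Content-Type":"url",},"headers":{"Data":"123456",},}],"list":[{"type":"W8021X"},{"rssi":"97"}],"event":"ONE","systemdate":"2012-10-0910:33:39-0700"},{"systemdate":"2012-10-0910:35:30-0700","list":[{"rssi":"97"},{"id":"TWO"}],"event":"TWO"}]}';
my $end;  # offset where the previous break ended; undefined until the first match

# Only the capture group is consumed; the lookahead checks for a following {...} block.
while ($string=~/((\{"events":\[)|,)(?=\s*\{(?:[^{}]|\{[^{}]*(?:[^{}]|\{[^{}]*(?:[^{}]|\{[^{}]*\})*\})*\})*\})/g ) {
    # Print the text between the previous break and the start of this one.
    print substr($string,$end,$-[0]-$end),"\n" if $end;
    $end=$+[0];
}
# Note: the text after the final break is not printed.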

alacercogitatus · 12-18-2012

Firstly, I'd suggest using a JSON validator to make sure you are using correct syntax. I ran your JSON through a validator and it failed (http://jsonlint.com/). Once I corrected the syntax, Splunk began to automatically parse the JSON in the UI and auto extracted a lot of fields. I then noticed another issue. With the way the JSON is structured, the "event" array item may or may not have "event" listed first. This poses a problem with splitting using LINE_BREAKER. In this case, I would suggest using spath and xpath to parse the information you need.
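The validation step can also be reproduced locally. This is a minimal Python sketch under stated assumptions: `raw` is a shortened version of the posted sample that keeps its trailing commas, `is_valid_json` is a hypothetical helper, and the cleanup regex is a naive illustration that would corrupt string values containing a comma directly before a closing brace or bracket.

```python
import json
import re

# Shortened sample that keeps the trailing commas from the original post.
raw = ('{"events":[{"list":[{"type":"W8021X"},{"rssi":"97",}],"event":"ONE"},'
       '{"list":[{"rssi":"97",},{"id":"TWO",}],"event":"TWO"}]}')

def is_valid_json(text):
    """Return True if text parses as JSON with Python's strict parser."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Naive cleanup: drop a comma that directly precedes } or ].
cleaned = re.sub(r',\s*([}\]])', r'\1', raw)

print("raw valid:    ", is_valid_json(raw))
print("cleaned valid:", is_valid_json(cleaned))

# Once the JSON is valid, each element of "events" is one logical event.
events = json.loads(cleaned)["events"]
print([e["event"] for e in events])
```

A strict parser rejects the raw sample because of the trailing commas and accepts the cleaned one, which then yields the two expected elements of the "events" array.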

If you want the timestamps to correctly index each individual event, the JSON should probably be rewritten to allow a more specific extraction.

Example:

Rewrite the JSON to:
{ "events": [ { "event": { "list": [ { "type": "W8021X" }, { "rssi": "97" } ], "id": "ONE", "systemdate": "2012-10-0910:33:39-0700" } }, { "event": { "systemdate": "2012-10-0910:35:30-0700", "list": [ { "rssi": "97" }, { "id": "TWO" } ], "id": "TWO" } } ] }

This syntax allows you to use some props/transforms to break the events, without having to worry about which order the variables show up in the JSON. The MAX_TIMESTAMP_LOOKAHEAD needs to have enough characters to grab the systemdate variable. I'll have to play with it to get the props.conf right.
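To illustrate why the rewritten structure is easier to break, here is a hedged Python sketch that emulates the documented capture-group rule on the rewritten JSON. It is not a tested props.conf: the `BREAKER` pattern, the `break_events` helper, and the timestamp regex are assumptions made for illustration, and Splunk's own behaviour may differ.

```python
import re

# The rewritten JSON from the example above, on one line.
rewritten = ('{ "events": [ { "event": { "list": [ { "type": "W8021X" }, { "rssi": "97" } ], '
             '"id": "ONE", "systemdate": "2012-10-0910:33:39-0700" } }, '
             '{ "event": { "systemdate": "2012-10-0910:35:30-0700", '
             '"list": [ { "rssi": "97" }, { "id": "TWO" } ], "id": "TWO" } } ] }')

# Hypothetical breaker: group 1 is the array opening or a comma, but only
# when the next thing is an object that starts with the literal key "event".
BREAKER = re.compile(r'(\{\s*"events"\s*:\s*\[\s*|,\s*)(?=\{\s*"event"\s*:)')

def break_events(raw, pattern):
    """Cut raw at capture group 1 of each match and discard the group."""
    events, start = [], 0
    for m in pattern.finditer(raw):
        events.append(raw[start:m.start(1)])
        start = m.end(1)
    events.append(raw[start:])
    return [e for e in events if e]

events = break_events(rewritten, BREAKER)

# Each event should carry exactly one timestamp, wherever it sits in the object.
dates = [re.search(r'"systemdate":\s*"([^"]+)"', e).group(1) for e in events]

for event, date in zip(events, dates):
    print(date, "->", event[:40], "...")
```

Because every array element begins with the same "event" key, the commas inside the nested "list" arrays no longer qualify as break points, and the position of "systemdate" inside each event stops mattering for the split.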

bshamsian · 12-18-2012

Per the spec I quoted, the first capturing group tells Splunk how to break the events. My first capturing group is ((\{"events":\[)|,), which matches either the start of the data (the opening of the "events" array) or the "," between array elements. Per the same spec, "The contents of the first capturing group are discarded", so after running this expression you are left with the following two records:

{"requests":[...,"event":"ONE","systemdate":"2012-10-0910:33:39-0700"}

{"systemdate":"2012-10-0910:35:30-0700",...,"event":"TWO"}]}

Where each one is an individual event to be logged by Splunk.

bshamsian · 12-18-2012

Per the Splunk documentation:
LINE_BREAKER=
* Specifies a regex that determines how the raw text stream is broken into initial events, before line merging takes place
* The regex must contain a capturing group - a pair of parentheses which defines an identified subcomponent of the match
* Wherever the regex matches, Splunk considers the start of the first capturing group to be the end of the previous event, and considers the end of the first capturing group to be the start of the next event
* The contents of the first capturing group are discarded, and will not be present in any event ...
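The rule quoted above can be sketched in a few lines of Python. This is an emulation of the documented behaviour only, not Splunk's code: `break_events` is a hypothetical helper, and it assumes scanning resumes after each full match, which is how Python's `re.finditer` works.

```python
import re

def break_events(raw, line_breaker):
    """Emulate the documented rule: the start of capture group 1 ends the
    previous event, the end of group 1 starts the next one, and the text
    of group 1 itself is discarded."""
    events, start = [], 0
    for m in re.finditer(line_breaker, raw):
        events.append(raw[start:m.start(1)])
        start = m.end(1)
    events.append(raw[start:])
    return [e for e in events if e]  # drop empty leading/trailing pieces

# A newline-based breaker: the newlines are group 1, so they vanish.
print(break_events("first\n\nsecond", r'([\r\n]+)'))

# A JSON-array breaker: group 1 is the array opening or a comma. The "{"
# after group 1 is matched but not captured, so it stays in the next event.
raw = '{"events":[{"event":"ONE"},{"event":"TWO"}]}'
print(break_events(raw, r'((\{"events":\[)|,)\s*\{'))
```

In the second call the closing `]}` of the file stays attached to the last event, since nothing after it matches the breaker.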

jonuwz · 12-18-2012

LINE_BREAKER is not a regular expression that matches the event. It is a regular expression that says where the line should be broken. What you've written probably gobbles up the actual event and leaves you with the opening and closing JSON syntax.

You need a regex that matches '"events":' or a ',' that splits array elements.
