Solved: Re: Why is my regex not matching for a multivalue ...

spike021 · ‎01-20-2016

I looked through quite a few posts on here and couldn't find an appropriate answer, so please bare with me.

I have events coming into Splunk in JSON format. The top-level fields are extracted fine. However, a nested map/dictionary is giving me issues. When I run a search to get the values from that inner dictionary, it works in that I get a resulting table like:

 A       B
---     ---   
 x       y
         z
         y
         z

 s       m
         n

 u       -  (- means None)

So, the y and z both belong to x and occasionally there are more than 2 items per each x. This happens for any x in A.

Since the cell in the table makes the values in B look separated by a newline, I created a regular expression that I've verified to correctly grab the logical groups for each y and z, if, for instance, they were just in a text box like this:

y
z
y
z
y
z

So the regex would properly grab the two as many times necessary, separately.

What I want to do is pull out each pair and separate the two items into two new fields, say C and D, and then later have a table where I have C and D grouped to field A.

The regex part of the command:

rex field="A{}{}"  "(?<C>[\da-z\.-]+\.[a-z\.]{2,6})\n(?<D>\d{1,3})"

Note: the A{}{} together makes up the multivalue field, B, and A is just A as in the earlier part of my example.

The issue I'm running into is that when I pipe what should be the output from that statement into the table command, I don't get anything.. The regex is definitely confirmed working on a site like http://regexr.com/ just for sanity-checking.

So there must be something I'm missing. Maybe the initial table with my example just looks like newlines separate the two values into rows when it doesn't. In which case I tried using a \s as the separator rather than \n and it still doesn't work.

Or maybe there's a super simple explanation for an obvious mistake I'm making.

Regardless I would appreciate some help very much.

Thanks in advance.

alemarzu · ‎01-21-2016

Spike,

Are this results acceptable for you ?

http://postimg.org/image/5ofc2b29v/

View solution in original post

alemarzu · ‎01-21-2016

Spike,

Are this results acceptable for you ?

http://postimg.org/image/5ofc2b29v/

spike021 · ‎01-21-2016

I think so! Would you mind pasting the query? Some of it is a bit blurry in the screenshot.

alemarzu · ‎01-21-2016

Sure,

sourcetype="_json" | rex mode=sed "s/\.\.\.//g" | rex max_match=0 field=_raw "(?s)\s{9,}\"(?<KEYS>.*?)\":" | rex max_match=0 field=_raw "(?s)\s{9,}.*?(?:\:|\:\s+)(?<VALUES>\d+)" | table timestamp KEYS VALUES

spike021 · ‎01-21-2016

That worked! Processing took a bit longer than I expected but that definitely seems to have done the job!

Thank you so much, alemarzu! I know it was a bit back and forth the past few hours trying to figure it out.

alemarzu · ‎01-21-2016

Awesome spike, glad I could help!

spike021 · ‎01-21-2016

I might need to play around with the regex a bit more. Occasionally the Key field is getting data from other fields, probably since you used _raw.

alemarzu · ‎01-21-2016

If you need some help with it just let me know.

spike021 · ‎01-21-2016

Thanks! I think the issue might be related to "empty" dictionaries. Occasionally for that section I don't have results, so the json looks like:

{
      "timestamp": "2016-01-21T14:44:28", 
      "SOME_FIELD": "etc.",
      "ANOTHER_FIELD": "...", 
      "IMPORTANT": {}, 
      "test": [
          [
              "something", 
              1.0
          ]
      ]
  }

spike021 · ‎01-21-2016

Although sometimes it still happens even when there are elements in that dictionary, so I'm not totally sure.

alemarzu · ‎01-22-2016

Ok then, try with this one,

NOT "\"IMPORTANT\": {}," | rex mode=sed "s/...//g" | rex max_match=0 field=_raw "(?s)(?!\IMPORTANT\":\s{)\s{9,}\"(?.?)\":" | rex max_match=0 field=_raw "(?s)(?!\IMPORTANT\":\s{)\s{9,}.?(?::|:\s+)(?\d+)" | table timestamp KEYS VALUES
Let me know if it works.

EDIT: Improved. You have to scape quotation marks after the NOT operator, notice the doble quotes. This will filter events without "KEYS".

javiergn · ‎01-21-2016

Still don't get why values are empty but anyway, that's not the important thing.

Your field names are dynamic, so have you try to pipe your events to

 | spath input=yourjsonfieldname

Or:

 | spath input=IMPORTANT

Or maybe this in case all your event is in a Json format:

 | spath

Take a look at the command help. Spath is a brilliant command when you want to extract data from JSON or XML fields.

javiergn · ‎01-21-2016

Another alternative (copy and paste into your search GUI and let me know the result):

| stats count
| eval myjson = "{
    \"timestamp\": \"2016-01-21T14:44:28\", 
    \"SOME_FIELD\": \"etc.\",
    \"ANOTHER_FIELD\": \"...\", 
    \"IMPORTANT\": {
        \"a_string\": 3,
        \"another_strong\": 44,
        \"maybe_another...\":95
    }, 
    \"test\": [
        [
            \"something\", 
            1.0
        ]
    ]
}"
| spath input=myjson path=IMPORTANT
| spath input=IMPORTANT
| fields - count, IMPORTANT, myjson

spike021 · ‎01-21-2016

I think we're getting closer. The values are showing up now. The problem is I don't think having all the keys as field names works. Each event has its own set of keys. So I think it'd be better to have a "Keys" field that contains all possible keys for someevent.IMPORTANT. The ordering of those keys should match the ordering of their respective values in a "Values" field. So three columns. Something like "EVENT_NAME", "Keys", "Values".

Otherwise I get a list of probably over a hundred fields, where each possible key in IMPORTANT is a field and not every event has a value for that key.

Still hoping this makes at least some sense.

javiergn · ‎01-21-2016

What about?

index="myindex" 
| mvexpand A
| rex max_match=0 field="A" "(?mi)(?<C>[\da-z\.-]+\.[a-z\.]{2,6})\n(?<D>\d{1,3})" 
| table "A", "C", "D"

Or

index="myindex" 
| mvexpand A
| rex max_match=0 field="A" "(?si)(?<C>[\da-z\.-]+\.[a-z\.]{2,6})\s+(?<D>\d{1,3})" 
| table "A", "C", "D"

Note the (?mi) and the (?si). More info here: http://www.regular-expressions.info/modifiers.html

javiergn · ‎01-21-2016

Based on your code sample below why don't you try this (ignore the stats count and the eval as this is just for me to be able to generate an event):

| stats count
| eval myjson = "{
     \"timestamp\": \"2016-01-21T14:44:28\", 
     \"SOME_FIELD\": \"etc.\",
     \"ANOTHER_FIELD\": \"...\", 
     \"IMPORTANT\": {
         \"a_string\": 3,
         \"another_strong\": 44,
         \"maybe_another...\":95
     }, 
     \"test\": [
         [
             \"something\", 
             1.0
         ]
     ]
 }"
| spath input=myjson
| table IMPORTANT.*

spike021 · ‎01-21-2016

Hmm. If I do that, it grabs each key of the dictionary, and turns them into the fields of the table. So I get a bunch of fields called "IMPORTANT_key_" and the integer value isn't shown at all.

javiergn · ‎01-21-2016

What you do mean by "the integer value isn't shown at all".
If you copy and paste my code into your search box it should return the following:

IMPORTANT.a_string  IMPORTANT.another_strong    IMPORTANT.maybe_another...
3   44  95

Where you have field names and values in a table. You can then pipe it to "transpose" if you want to capture field names for instance.

Or maybe I'm completely misunderstanding your requirement.

Thanks,
J

spike021 · ‎01-21-2016

Well I do get:

IMPORTANT.a_string     IMPORTANT.another_strong     IMPORTANT.maybe_another...

But the values aren't populating, plus these fields should then be separated per events rather than being actual field names. I think what's happening right now is it sets all the keys from IMPORTANT in the first event as the field names, but there are many events, each with different keys in IMPORTANT. So just setting it the current way doesn't work since those keys will usually be different.

Usually I'll probably do something like | table "timestamp", IMPORTANT

So that I get three columns. First would be timestamp since each event has a unique timestamp. Second column would be the keys, and third would be values. Or something similar to that.

Does that make more sense? Hopefully it does.

alemarzu · ‎01-21-2016

Hi spike,

Did you try using max_match=0, which means unlimited matches ?

| rex max_match=0 field="A"  "(?<C>[\da-z\.-]+\.[a-z\.]{2,6})\n(?<D>\d{1,3})"

Notice that I've also removed the curly brackets.

spike021 · ‎01-21-2016

I hadn't before, so I just tried that but I still don't seem to be getting any results. My full search query winds up looking like: index="myindex" | rex max_match=0 field="A" "(?[\da-z\.-]+\.[a-z\.]{2,6})\n(?\d{1,3})" | table "A", "C", "D" so nothing absurdly complicated here just trying to get things working correctly.

Why is my regex not matching for a multivalue field?

Detecting Remote Code Executions With the Splunk Threat Research Team

Observability | Use Synthetic Monitoring for Website Metadata Verification

More Ways To Control Your Costs With Archived Metrics | Register for Tech Talk