Solved: can I extract a field with a regexed dynamic fieldname?

cphair · ‎02-05-2015

Hi. I have JSON-like events that come into my indexer like this:
{foo.field1: value,
foo.field2: value,
foo.field3: value}

I would like to extract field1, field2, and field3 as individual fields. The trouble is that their order is not fixed within the event. I can't use Splunk's default json extraction on this data for long and boring reasons, so I'm trying to handle it manually in props. What I'd like to do is something like the following, to extract to a dynamic field name based on the regex:

EXTRACT-foo = \{foo\.(\w+):\s*(?<\1>[^,\}]*),foo\.(\w+):\s*(?<\2>[^,\}]*),foo\.(\w+):\s*(?<\3>[^,\}]*)\}

Unfortunately this doesn't work, and aside from not knowing what Splunk considers to be capture groups in the extraction, I'm not even sure if this syntax is legal. Is there a way to solve this without sorting the JSON beforehand?

UPDATE: For anyone who doesn't feel like reading the comment chain, the $1::$2 format in the accepted answer doesn't just stop at the first match--it goes through the entire event and does pairwise extractions for everything it matches. Since all my field-value pairs have the same format, I don't have to make a regex to match the whole event--I just need to match one pair, and the extraction automatically finds all the pairs that match. I had tried to do a full-event match with the format $1::$2 $3::$4 $5::$6, which is supposed to work, but it didn't, and Splunk support never figured out why. Anyway, the $1::$2 format is simpler and is automatically extensible if I add fields to these events in future.

MuS · ‎02-05-2015

Hi cphair,

try something like this in your transforms.conf:

REGEX  = ([a-z]+)=([a-z]+)
FORMAT = $1::$2

This will create a field name from capturing group one and the value from capturing group two.

Hope this helps ...

cheers, MuS

masonmorales · ‎02-05-2015

You can also do this using EXTRACT in props.conf. Here's an example of extracting the same field from four different places in the event:

EXTRACT-foo1 = (?i)(?<foo>[^,]+),\d+\.\d+\.\d+\.\d+,[a-f0-9]+(?:\-[^\-]*){4}
EXTRACT-foo2 = ACCESS-REQUEST,[^,]+,[^,]+,[^,]+,[^,]+,(?<foo>\w+)
EXTRACT-foo3 = ACCESS-ACCEPT,[^,]+,[^,]+,[^,]+,(?<foo>\w+)
EXTRACT-foo4 = (DHCP_REQUEST|DHCP_ACK),[^,]+,(?<foo>\w+)

cphair · ‎02-05-2015

Wouldn't that make it an index-time extraction and hurt performance? All I need is the search-time extraction.

wpreston · ‎02-05-2015

No, this would stay a search-time extraction. transforms.conf just has a few more tricks up its sleeve when it comes to field extractions. See the transforms.conf docs on the REGEX and FORMAT attributes.
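
As a rough sketch of that distinction (the stanza and class names here are placeholders, not anything from your environment): a transform stays search-time when props.conf references it through a REPORT- class, and it would only become index-time if it were referenced through a TRANSFORMS- class instead.

```ini
# props.conf
[mySourceType]
# Search-time: REPORT-<class> points at a stanza in transforms.conf
REPORT-fooFields = myFieldTransform
# Index-time extraction would be wired through TRANSFORMS-<class> instead,
# which is not what is wanted here:
# TRANSFORMS-fooFields = myFieldTransform

# transforms.conf
[myFieldTransform]
REGEX  = ([a-z]+)=([a-z]+)
FORMAT = $1::$2
```

The REGEX and FORMAT lines are the ones from MuS's answer; only the wiring in props.conf decides when the extraction runs.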

cphair · ‎02-05-2015

Right, I forgot transforms.conf could be used at search time. But neither of the syntaxes described in the transforms.conf spec is working. I tried FORMAT = $1::$2 $3::$4 $5::$6, and I tried the _KEY_<n>/_VAL_<n> approach. Do I need to run three separate stanzas? That seems inefficient too.

wpreston · ‎02-05-2015

Did you add the name of your transform to a REPORT class in props.conf for your sourcetype/source/host?

transforms.conf

[myTransform]
REGEX = \{foo\.(\w+):\s*(?<\1>[^,\}]*),foo\.(\w+):\s*(?<\2>[^,\}]*),foo\.(\w+):\s*(?<\3>[^,\}]*)\}
FORMAT = $1::$2 $3::$4 $5::$6

props.conf

[mySourceType]
REPORT-myUniqueClassName = myTransform

cphair · ‎02-12-2015

Yes, I did. It doesn't work.

wpreston · ‎02-12-2015

I didn't check your regex earlier, but I checked it now and there are some things that need to be addressed:

1) "\" is an invalid character in the name of a named capturing group.
2) The regex needs to account for new lines.

I've modified the regex and it should work to capture what you're looking for. I would use the $1::$2 notation for this in transforms.conf. I don't know if there are new line characters in your event data or if you just added them for readability here. If there are no new line characters in your events, just remove the \n's from the regex below:

\{foo\.(\w+):\s*([^,\}]*),\nfoo\.(\w+):\s*([^,\}]*),\nfoo\.(\w+):\s*([^,\}]*)\}

cphair · ‎02-12-2015

1) I was trying to indicate that the name of the capturing group should match the string directly in front of it. I already knew the regex I mentioned didn't work. Your proposed format doesn't work for me either, though.
2) Events are strictly single-line.

wpreston · ‎02-12-2015

Let's make this easier. I indexed the following sample data:

{foo.field1: value, foo.field2: value, foo.field3: value}

I don't know if it matches your data or not, but going by your descriptions it is close. With the configuration I'm about to provide, Splunk extracts field1=value, field2=value, and field3=value. In transforms.conf, enter this:

[myTransform]
REGEX = foo\.([^,]+):\s+([^,\}]+)
FORMAT = $1::$2

In props.conf, enter this:

[MySourcetype]
REPORT-myUniqueClassName = myTransform

Splunk will apply the field transform to the events as many times as there are matches for the supplied regex. In my test, it extracts the field names and values from the event and the event is now searchable by the extracted fields.
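
For what it's worth, the transforms.conf spec also describes a variant of the same single-pair idea that needs no FORMAT line, using the special capture group names _KEY_1 and _VAL_1. A sketch under the same assumptions as the sample data above (I have not run it against your events, and the stanza name myTransformKeyVal is just a placeholder):

```ini
# transforms.conf
[myTransformKeyVal]
# _KEY_1 captures the field name and _VAL_1 its value, so FORMAT is omitted
REGEX = foo\.(?<_KEY_1>[^,]+):\s+(?<_VAL_1>[^,\}]+)

# props.conf
[MySourcetype]
REPORT-myUniqueClassName = myTransformKeyVal
```

Per the spec this should be equivalent to the $1::$2 version, so use whichever reads more clearly to you.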

wrangler2x · ‎05-26-2017

Can this sort of thing be done in a rex in search?

cphair · ‎05-31-2017

@wrangler2x: Yes, but I wanted to make the fields easily available for other users without telling them to run a rex in the middle of their search.

wrangler2x · ‎05-31-2017

Could you give me an example of that? I tried to emulate that in search and was unsuccessful.
