Solved: EXTRACT from specific field (using 'in' syntax) do...

Adam_Sealey · ‎03-14-2013

I've been trying to do a search time field extraction, using the EXTRACT- stanza in props.conf.

From the props.conf docs (http://docs.splunk.com/Documentation/Splunk/5.0.2/Admin/Propsconf), it appears that there are 2 ways to perform a search time extraction using EXTRACT; either on the _raw field, or on a specific field.

When I try to perform the field extraction on a specific field (using the 'in' syntax), the extraction doesn't run unless I specify '| extract reload=T'

EXTRACT-extractDomain = (?<domain>(?:(?:(?:[^\.]+\.)?(?<tld>(?:[^\.\s]{2})(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+)))))).$ in questionname

When I remove the 'in questionname' portion of the extraction (resulting in the extraction being run on _raw), the extraction runs all the time (doesn't require '| extract reload=T')

EXTRACT-extractDomain = (?<domain>(?:(?:(?:[^\.]+\.)?(?<tld>(?:[^\.\s]{2})(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+)))))).$

Has anyone else run into this problem? In this case, I can rewrite my extraction to work on _raw, but there are other cases that I'm also working with that it would be very convenient to have the regex be applied to only one field.

Ayn · ‎03-14-2013

The problem is most likely that your first extraction runs before the questionname field has been extracted, so there's nothing to extract from. When you run "| extract reload=T" separately that happens after all automatic extractions have already been applied so the questionname field exists in that case.

Extractions are done in alphabetical order, it might be per-sourcetype or globally, I forget which. Anyway EXTRACT-a will run before EXTRACT-b so if you have, for instance, EXTRACT-extractDomain and EXTRACT-questionname that will lead to the problems you're seeing.

View solution in original post

Adam_Sealey · ‎03-14-2013

Exactly correct!

Using btool, I was able to see the order that the extractions are applied, and confirmed what you said.

EXTRACT-extractDomain = (?<domain>(?:(?:(?:[^\.]+\.)?(?<tld>(?:[^\.\s]{2})(?:(?:\.[^\.\s][^\.\s])|(?:[^\.\s]+)))))).$ in questionname
EXTRACT-opcode = (?<operation>[ R]) (?<opcode>.) \[(?<hexflags>[0-9A-Fa-f]+) (?<flags>....) (?<response>[^\]]+)\]
EXTRACT-protocol = (?<packetid>[0-9A-Fa-f]*) (?<protocol>UDP|TCP) (?<direction>\w+) (?<src_ip>[0-9A-Fa-f\.\:]+)\s+
EXTRACT-question1 = \] (?<questiontype>\w+)\s+(?<questionname>.*)
EXTRACT-question2 = \] (?<questionname>[^\s]*)$
EXTRACT-threadid = (?<threadid>[0-9A-Fa-f]+)\s+(?<context>PACKET)

When I renamed to zzExtractDomain, it works great because the questionname has been filled at that point

Thanks!

Ayn · ‎03-14-2013

The problem is most likely that your first extraction runs before the questionname field has been extracted, so there's nothing to extract from. When you run "| extract reload=T" separately that happens after all automatic extractions have already been applied so the questionname field exists in that case.

Extractions are done in alphabetical order, it might be per-sourcetype or globally, I forget which. Anyway EXTRACT-a will run before EXTRACT-b so if you have, for instance, EXTRACT-extractDomain and EXTRACT-questionname that will lead to the problems you're seeing.

EXTRACT from specific field (using 'in' syntax) doesn't work without forcing an extract reload=T

Extending Observability Content to Splunk Cloud

More Control Over Your Monitoring Costs with Archived Metrics GA in US-AWS!

New in Observability Cloud - Explicit Bucket Histograms