How do I edit my regex to parse fields correctly if a field delimiter appears within a field?

markwymer
Path Finder

Hi,

Another regex problem I'm afraid.....

I've got a very long event with 37 fields, where every field is quoted and the fields are separated by commas. There are no key=value pairs.
For the most part my regex works nicely with the event data, but occasionally a quote also appears inside the actual field data, which breaks my regex because the quote is part of the separator.

Working example (extremely simplified regex and event):

^"(?P<dest_ip>[^"]+)","(?P<dest_port>[^"]+)","(?P<uri>[^"]+)","(?P<request>[^"]+)","(?P<response>[^\n]+)"$

Data:

"192.0.0.20","80","fl=city,name,code,group=true&group.field=city","GET /solr/lpbm/select?fl=city","Logging rate limit reached"

No problem with this; all the fields parse out OK. However, the next event fails. Note the additional " characters in the fourth field:

"192.0.0.20","80","fl=city,name,code,group=true&group.field=city","GET /solr/"lpbm"/select?fl=city","Logging rate limit reached"

This now breaks the [^"]+)"," part of my regex and distorts the field extractions.
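
A minimal sketch that reproduces the failure described above. It assumes Python's `re` module handles this pattern the same way Splunk's regex engine does (both accept the `(?P<name>...)` group syntax). The variable names are illustrative only.

```python
import re

# The simplified pattern from the question: each field is "anything but a quote".
STRICT = re.compile(
    r'^"(?P<dest_ip>[^"]+)","(?P<dest_port>[^"]+)","(?P<uri>[^"]+)",'
    r'"(?P<request>[^"]+)","(?P<response>[^\n]+)"$'
)

good_event = (
    '"192.0.0.20","80","fl=city,name,code,group=true&group.field=city",'
    '"GET /solr/lpbm/select?fl=city","Logging rate limit reached"'
)
bad_event = (
    '"192.0.0.20","80","fl=city,name,code,group=true&group.field=city",'
    '"GET /solr/"lpbm"/select?fl=city","Logging rate limit reached"'
)

good_match = STRICT.match(good_event)
bad_match = STRICT.match(bad_event)

print(good_match.group("request"))  # GET /solr/lpbm/select?fl=city
print(bad_match)                    # None: [^"]+ stops at the embedded quote
```

In this sketch the anchored pattern simply fails to match the second event. How the failure surfaces in Splunk may differ.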

Is there a way to do the equivalent of:-

......","(?P<request>[^","]+)",".......

I know that this is invalid, but I don't know what the alternative looks like 😞 !!

Thanks for any help,
Mark.


Sebastian2
Path Finder

Your problem should be solvable by using non-greedy (lazy) quantifiers instead of the [^"] syntax. The advantage is that you can use the whole pattern "," as the separator instead of just a single ". However, I'm not sure whether Splunk's regex engine behaves as I expect, so try something like this:

^"(?P<dest_ip>.+?)","(?P<dest_port>.+?)","(?P<uri>.+?)","(?P<request>.+?)","(?P<response>[^\n]+)"$

What's the difference:

  • I'd say the [^"] syntax is "old school": the parser simply consumes everything until a " is found.
  • Lazy quantifiers, however, consume as little as they can, extending the match one character at a time only until the rest of the pattern matches. In theory (I can't test that right now) a field can therefore contain a single " but will end at the first ",", because that is where the following separator matches. (It should also be a little bit slower, again in theory.)

/edit & just as info: a ? after a quantifier makes it lazy (here .+? means "lazily consume at least one character").
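
A minimal sketch of the lazy-quantifier pattern applied to the failing event from the question. It uses Python's `re` module as a stand-in and assumes Splunk's engine treats `.+?` the same way. The variable names are illustrative.

```python
import re

# Lazy version: each field grows only until the full separator "," can match.
LAZY = re.compile(
    r'^"(?P<dest_ip>.+?)","(?P<dest_port>.+?)","(?P<uri>.+?)",'
    r'"(?P<request>.+?)","(?P<response>[^\n]+)"$'
)

# The event from the question with extra quotes inside the fourth field.
event = (
    '"192.0.0.20","80","fl=city,name,code,group=true&group.field=city",'
    '"GET /solr/"lpbm"/select?fl=city","Logging rate limit reached"'
)

fields = LAZY.match(event).groupdict()
for name, value in fields.items():
    print(f"{name} = {value}")
```

The commas inside the uri field and the quotes inside the request field are both kept, because neither forms the three-character sequence "," on its own. A field that itself contained "," would still be split early.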

javiergn
Super Champion

Try the following:

^"(?P<dest_ip>[^"]+)","(?P<dest_port>[^"]+)","(?P<uri>[^"]+)","(?P<request>[^"][^,]+)","(?P<response>[^\n]+)"$

You can test it here: https://regex101.com/r/nD3sL1/2
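
A minimal sketch that checks this pattern in Python's `re` module, assuming it behaves like Splunk's engine here. The second event is hypothetical (it is not from the original post) and is included only to show what happens when the request field also contains a comma.

```python
import re

# The request field here is: one non-quote character, then a run of non-commas.
NO_COMMA = re.compile(
    r'^"(?P<dest_ip>[^"]+)","(?P<dest_port>[^"]+)","(?P<uri>[^"]+)",'
    r'"(?P<request>[^"][^,]+)","(?P<response>[^\n]+)"$'
)

# Failing event from the question: quotes inside the request field.
quoted_event = (
    '"192.0.0.20","80","fl=city,name,code,group=true&group.field=city",'
    '"GET /solr/"lpbm"/select?fl=city","Logging rate limit reached"'
)
# Hypothetical event: the request field contains a quote AND a comma.
comma_event = (
    '"192.0.0.20","80","fl=city",'
    '"GET /solr/"lpbm"/select?fl=city,name","Logging rate limit reached"'
)

quoted_match = NO_COMMA.match(quoted_event)
comma_match = NO_COMMA.match(comma_event)

print(quoted_match.group("request"))  # GET /solr/"lpbm"/select?fl=city
print(comma_match)                    # None: [^,]+ cannot cross the comma
```

So this variant handles the sample event, but it relies on the request field never containing a comma.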
