Using regex to extract a string where the following string may or may not exist

rhysjones
Path Finder

Hi,

I am trying to extract some fields which are generally bounded by other strings (e.g. Some Text 1 <my field 1> Some Text 2). I have a situation where a field may or may not have anything following it.

For example, with this data set :

1 Some Text 1 <my field 1> Some Text 2
2 Some Text 1 <my field 1>",
3 Some Text 1 <my field 1> Some Text 2
4 Some Text 1 <my field 1> Some Text 2
5 Some Text 1 <my field 1>",

This regex partly works, in that it correctly extracts items 1, 3, and 4:

Some Text 1\s+(?P<my field 1>.+)\s(Some Text 2|\",)

This regex partly works, in that it correctly extracts items 2 and 5, but it extracts the entirety of items 1, 3, and 4:

Some Text 1\s+(?P<my field 1>.+)(Some Text 2|\",)

The difference is the "\s". I can't seem to include that in the match group, only before it.

I am sure I am missing something obvious but can't seem to see it. Any help much appreciated.

Thank you.

jincy_18
Path Finder

Hi rhysjones,

Are you trying to extract these fields at search time (i.e. with the rex command), or in transforms at index time?
For a search query, you can try the regex below with the rex command:

|rex field=FieldName "(?:Some Text 1\s+)(?P<myfield1>.+)(?=\s+Some Text 2|\",)"

Make sure you specify field=FieldName if your event data is not in the _raw field, where FieldName is the name of the field that contains the string to be extracted.
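
For anyone who wants to try the pattern outside Splunk, here is a minimal sketch. It assumes Python's re module treats these constructs the same way as the PCRE engine behind rex, and the two events are invented for illustration. It shows that the lookahead (?=...) only checks for the delimiter without consuming it, so the match ends where the field ends.

```python
import re

# The pattern from the rex command above, unchanged.
PATTERN = re.compile(r'(?:Some Text 1\s+)(?P<myfield1>.+)(?=\s+Some Text 2|\",)')

# Hypothetical events; each one contains exactly one delimiter.
events = [
    'Some Text 1 alpha value Some Text 2',
    'Some Text 1 beta value",',
]

def extract(event):
    """Return (field value, text left unconsumed after the match), or None."""
    m = PATTERN.search(event)
    if m is None:
        return None
    return m.group('myfield1'), event[m.end():]

results = [extract(e) for e in events]
for r in results:
    print(r)
```

Because .+ is greedy, an event that contains more than one delimiter would still be captured up to the last one, so this sketch only covers the single-delimiter case.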

rhysjones
Path Finder

So effectively, I can get it working with either alternative on its own, but if I put both in a non-capturing group, only the second alternative is ever "hit". Items that are already at the end of the line are returned correctly, but items followed by "Some Text 2" are captured all the way until the ", combination is matched.

rhysjones
Path Finder

Hi jincy_18,

I did some more experimenting and unfortunately have the same issue. I can either extract "myfield1" when followed by ", or I can extract "myfield1" when followed by a space and then "Some Text 2".

If I try to have both in a match group I get the one followed by ", extracted correctly, and all the other rows extract until they get to a ", combination.

I might try a different tack.

Thanks again.

jincy_18
Path Finder

Hi rhys,

Have you checked if the space characters are actually spaces or tabs?
Also, in the sample you provided ("Some Text 1 <my field 1> Some Text 2"), is "Some Text 1" always present and always the same? Likewise, whenever "Some Text 2" is present, is it always the same?

rhysjones
Path Finder

Hi jincy_18,

Excellent question.

"Some Text 1" is always there. This works for records that do have text following the extracted field:

Some Text 1\s+(?P<my field 1>.+)\sSome Text 2

This works for records that do not have text following the extracted field:

Some Text 1\s+(?P<my field 1>.+)\",

This does not work:

Some Text 1\s+(?P<my field 1>.+)(?:\sSome Text 2|\",)

This last one returns correct extracts for records that do not have text following the extracted field. For records that do have text following the extracted field it returns all the following text up to the next instance of the ", combination rather than stopping before the "Some Text 2" literal string.
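
The behaviour described above can be reproduced outside Splunk. This is a minimal sketch, assuming Python's re module behaves like Splunk's PCRE engine for these constructs. The event text is invented for illustration, and the group is renamed myfield1 so that the pattern compiles.

```python
import re

# Greedy version from the post, with the capture group renamed to myfield1.
GREEDY = re.compile(r'Some Text 1\s+(?P<myfield1>.+)(?:\sSome Text 2|\",)')

# Hypothetical event: the field is followed by "Some Text 2", and a '",'
# combination appears later in the same event.
event = '"msg":"Some Text 1 alpha value Some Text 2 more text","next":"x"'

m = GREEDY.search(event)
captured = m.group('myfield1')
# The text that the trailing (?:...) group actually matched.
delimiter = event[m.end('myfield1'):m.end()]

print(repr(captured))   # 'alpha value Some Text 2 more text'
print(repr(delimiter))  # '",'
```

The greedy .+ first runs to the end of the event and then backs up one character at a time until one of the alternatives matches. Backing up from the end, it reaches the ", combination before it ever reaches "Some Text 2", so that is where it stops.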

Hope that makes sense.

cpetterborg
SplunkTrust

What about:

Some Text 1\s+(?P<my field 1>.+?)(?:\sSome Text 2|\",)

Making the .+ a lazy match ( .+? ) will help it to not include Some Text 2 as part of the match.
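
A minimal sketch of the lazy version, assuming Python's re module handles .+? the same way as Splunk's PCRE engine. The events are hypothetical and the group is named myfield1 (no spaces) so that the pattern compiles.

```python
import re

# Lazy quantifier: .+? grows one character at a time and stops at the
# first position where one of the delimiters matches.
LAZY = re.compile(r'Some Text 1\s+(?P<myfield1>.+?)(?:\sSome Text 2|\",)')

# Hypothetical events covering both cases from the thread.
events = [
    'Some Text 1 alpha value Some Text 2 more text","next":"x"',
    'Some Text 1 beta/value (2)","next":"y"',
    'Some Text 1 gamma value Some Text 2',
]

def extract(event):
    """Return the captured field, or None when the pattern does not match."""
    m = LAZY.search(event)
    return m.group('myfield1') if m else None

fields = [extract(e) for e in events]
print(fields)  # ['alpha value', 'beta/value (2)', 'gamma value']
```

One caveat of this approach: since the lazy match stops at the first delimiter it meets, a field value that itself contains ", or " Some Text 2" would be cut short.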

rhysjones
Path Finder

cpetterborg, that was the missing bit! Thank you!

This now appears to be pulling the field in correctly in both cases:

Some Text 1\s+(?P<myfield1>.+?)(?:\sSome Text 2|\",)

Thank you both for all your assistance. Very much appreciated!

rhysjones
Path Finder

Thank you, jincy_18. I will have a go when I get to the office tomorrow.

I was experimenting using the rex command, but mostly in the field extraction wizard. Effectively I am only trying to extract "my field 1" and I am identifying it based on the fact it is preceded by the literal string "Some Text 1" and a space, and followed immediately by either "Some Text 2" OR the ", combination.

In another extraction I was doing, I discovered that when the field was immediately followed by the combination

","text3

I had to use the following regex:

Some Text 1\s+(?P<my field 1>.+)\.{7}text3

This kind of made me think I had a Unicode issue.

Thank you for the hint. I'll check it out tomorrow.

cpetterborg
SplunkTrust

I'm a bit confused by what you want in the end. Is this what you want to see:

https://regex101.com/r/xkvSzf/1

rhysjones
Path Finder

Spot on. 5 Matches regardless of whether there is a string following, or a ", following.
That construct does not appear to be working in Splunk (or in my dataset). For example, if I put the \s inside the match brackets then it seems to be ignored and that side of the match fails.

cpetterborg
SplunkTrust

I don't know if you noticed, but the name I used in the capture group doesn't have spaces. That is a requirement - no spaces in capture group names. I don't know if that might be causing things to not work for you. You could also just try a space character instead of a \s. I'm not sure if either of those will help, but they are worth a try.
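
As a quick check of the naming rule, here is a small sketch using Python's re module, which also rejects spaces in group names. That Splunk's PCRE engine reports the problem in the same way is an assumption, and the compiles helper is hypothetical.

```python
import re

def compiles(pattern):
    """Return True if the pattern compiles, False if the engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Same pattern twice; only the capture group name differs.
with_spaces = r'Some Text 1\s+(?P<my field 1>.+?)(?:\sSome Text 2|\",)'
without_spaces = r'Some Text 1\s+(?P<myfield1>.+?)(?:\sSome Text 2|\",)'

print(compiles(with_spaces))     # False
print(compiles(without_spaces))  # True
```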

rhysjones
Path Finder

Thank you.

Yes, I discovered the requirement for no spaces (apologies, my "sample" didn't convey that). I did play around with just using the space character too. I think I will go home and start tomorrow with fresh eyes!

Thank you for the suggestions. You have started me on a couple of new paths of testing, so they are much appreciated. I'll update here if I find a solution.

rhysjones
Path Finder

I am partly wondering if the ".+" may be part of the issue. Given that the content of the field can vary and contain spaces and special characters, I am not sure how to get around that.
