Getting Data In

How do I extract multiple multi-value fields from a multi-line event at index time via regex?

jasonstanek
New Member

I need help extracting multiple multi-value fields from a multi-line event.

I expect the result of the following extraction to index each of the rowA values with each of the rowC identifiers, index each of the rowB values with each of the rowC identifiers, and extract the EndTime into the record timestamp(s).

An acceptable alternative to these associations is a record timestamped with EndTime with multivalue field rowA, multivalue field rowB, and multivalue field rowC.

RowNameA,1432,4363,6223,7543,19182,...
RowNameB,8383,2727,3221,...
RowNameC,NumericalIdentifierA,NumericalIdentifierB,...
RowNameD,TheDate,StartTime,EndTime,OtherNumbers,...

I am stuck at (,(?\d+)[^\S]+) for the regex to pull out the rowA values, which unfortunately matches across all lines. Adding a wildcard to the beginning of the regex apparently misses values, and the tokenizer-based approach apparently requires named columns. Can someone demonstrate that Splunk is expressive enough at index time to extract the information in the manner I'm requesting?

I am working with Splunk Cloud, with data files sourced via a Heavy Forwarder. I've been unable to get the MV_ADD feature to work in transforms.conf, but have been able to get a single multi-value field to extract via the transform+field extraction console.
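
Roughly the shape of what I have been attempting (the stanza, field, and sourcetype names below are placeholders, not my real ones; my understanding is that MV_ADD applies to search-time REPORT- extractions):

# transforms.conf (placeholder stanza name)
# Note: this regex matches numbers on every line, which is exactly the problem.
[rowA_values]
REGEX = ,(\d+)
FORMAT = rowA::$1
MV_ADD = true

# props.conf (placeholder sourcetype)
[my_sourcetype]
REPORT-rowA = rowA_values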


woodcock
Esteemed Legend

This is not possible with just Splunk. I suggest you talk to the folks at Cribl (@cgales):

http://cribl.com/


woodcock
Esteemed Legend

Lay out real events that cover all the variants, along with the field/value pairs you need to get out of them. That is the only way we can work through this.


efavreau
Motivator

You mentioned:

"An acceptable alternative to these associations is a record timestamped with EndTime with multivalue field rowA, multivalue field rowB, and multivalue field rowC."

Here's a run-anywhere example that gets that far. You can then mix and match as you need.

| makeresults
| eval logentry="
RowNameA,1432,4363,6223,7543,19182,...
RowNameB,8383,2727,3221,...
RowNameC,NumericalIdentifierA,NumericalIdentifierB,...
RowNameD,TheDate,StartTime,2019-01-04 12:00:00,OtherNumbers,...
"
| rex field=logentry "(?<RowNameA>.+)\n(?<RowNameB>.+)\n(?<RowNameC>.+)\n(RowNameD),(?<TheDate>[^,]+),(?<StartTime>[^,]+),(?<EndTime>[^,]+),(?<RowNameD_Remainder>.+)"
| eval A_timestamped=EndTime+","+RowNameA
| eval B_timestamped=EndTime+","+RowNameB
| eval C_timestamped=EndTime+","+RowNameC
| rex field=A_timestamped "(?<TimeA>[^,]+),(?<ARowLabel>[^,]+),(?<AValues>.+)"
| rex field=B_timestamped "(?<TimeB>[^,]+),(?<BRowLabel>[^,]+),(?<BValues>.+)"
| rex field=C_timestamped "(?<TimeC>[^,]+),(?<CRowLabel>[^,]+),(?<CValues>.+)"

Explanation:

  1. The first rex splits the rows
  2. Split row D into fields. I combined this into the regular expression used for step 1.
  3. For the *_timestamped fields, we concatenate the fields together under a new name each, giving you a new unique field that has what you are looking for. In the sample logentry, I gave EndTime a real value so that it is easier to follow.
  4. The example ends with timestamped fields that contain multivalue data. You can then further parse the multivalue fields using rex or split, as sketched below.
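
For example, appending something like this to the search above splits AValues into a true multi-value field (the field names AValueMV and firstAValue are my own):

| eval AValueMV=split(AValues, ",")
| eval firstAValue=mvindex(AValueMV, 0)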
###

If this reply helps you, an upvote would be appreciated.

jasonstanek
New Member

Thanks for the proposed answer. I was looking for an index-time solution. If I understand your proposal correctly, this is a search-time solution. Please correct me if that is not the case.


efavreau
Motivator

That's correct. I interpreted "alternative" differently.

###

If this reply helps you, an upvote would be appreciated.

woodcock
Esteemed Legend

Assuming that the record format is consistent, you can add (?ms)^ to the very beginning of the RegEx to tell Splunk to treat it as a multiline match; the ^ is a start-of-record anchor character. Then you build a RegEx that covers all 4 lines, as sketched below.
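
Something along these lines (a rough index-time sketch, not tested; the stanza, field, and sourcetype names are mine, and note that each row's values land in one comma-separated string field rather than a true multi-value field):

# transforms.conf
[abcd_record]
REGEX = (?ms)^RowNameA,([^\r\n]+)[\r\n]+RowNameB,([^\r\n]+)[\r\n]+RowNameC,([^\r\n]+)[\r\n]+RowNameD,([^,]+),([^,]+),([^,]+)
FORMAT = rowA::$1 rowB::$2 rowC::$3 TheDate::$4 StartTime::$5 EndTime::$6
WRITE_META = true

# props.conf
[my_sourcetype]
TRANSFORMS-abcd = abcd_record

You would likely also need to declare the new fields as indexed (INDEXED = true) in fields.conf.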


jasonstanek
New Member

Thank you for the reply to the question. I would like to understand the proposed answer better.

If by "record format is consistent" you mean the number of comma separated items (tokens) on each row, the number of items on rows A,B,&C varies. Row D is the only row that is predictably the same "format".

If you mean something else, can you please elaborate?


woodcock
Esteemed Legend

You can probably still do it, but the RegEx must accommodate all variants.


jasonstanek
New Member

Because the number of tokens on a line is unbounded, it is not apparent that a RegEx can accommodate all variants. My current understanding is that unbounded tokenization is exactly what the tokenizer feature is intended for; however, the tokenizer appears to work by columns only, not rows. If you think a regex can accommodate all variants, please provide specifics.


woodcock
Esteemed Legend

You capture what you need on a line, then end with [^\r\n]*[\r\n]+ to skip the rest of that line, then start capturing on the next line until you get what you need, and use that same pattern again to skip to the next line.
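
To illustrate the pattern (a run-anywhere sketch that only captures the first value on each of two lines; the field names are mine):

| makeresults
| eval _raw="RowNameA,1432,4363,6223
RowNameB,8383,2727,3221"
| rex field=_raw "RowNameA,(?<firstA>\d+)[^\r\n]*[\r\n]+RowNameB,(?<firstB>\d+)"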


jasonstanek
New Member

I need all of the comma-separated values on a line. I do not want to skip lines. How does one use a regex instead of tokenization to pull all the comma-separated values on a line?


woodcock
Esteemed Legend

You can use kv for that at search time.
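
For reference, here is a minimal search-time sketch that splits a whole line into a multi-value field. It uses makemv rather than kv, since the sample lines have no key=value pairs; the field name and values are made up:

| makeresults
| eval lineA="RowNameA,1432,4363,6223,7543,19182"
| makemv delim="," lineA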


jasonstanek
New Member

What I am hearing is that "How do I extract multiple multi-value fields from a multi-line event at index time via regex" is not possible.


woodcock
Esteemed Legend

Somewhat. Very few field extractions should be done at index-time anyway.


efavreau
Motivator

I'm not sure that's true, because I do a multi-value rex on a single multi-line event. I need more details. Is the text on any of these lines fixed? If so, those consistent words can be used as positioning anchors in the rex statement. For example: are RowNameA, RowNameB, RowNameC, and RowNameD all fixed text? Are there any other pieces that are consistently found from entry to entry? That will help.

###

If this reply helps you, an upvote would be appreciated.

jasonstanek
New Member

Yes, the RowNames are all fixed text. The only other pieces that are consistently found from entry to entry are the line breaks and comma separators.


efavreau
Motivator

Is the number of entries consistent? A real sample might help.

###

If this reply helps you, an upvote would be appreciated.

jasonstanek
New Member

The number of rows is consistent. The number of items separated by commas is not consistent.
