I need help extracting multiple multi-value fields from a multiline event.
I expect the result of the following extraction to index each of rowA's values with each of rowC's identifiers, to index each of rowB's values with each of rowC's identifiers, and to extract EndTime into the record timestamp(s).
An acceptable alternative to these associations is a record timestamped with EndTime with multivalue field rowA, multivalue field rowB, and multivalue field rowC.
RowNameA,1432,4363,6223,7543,19182,...
RowNameB,8383,2727,3221,...
RowNameC,NumericalIdentifierA,NumericalIdentifierB,...
RowNameD,TheDate,StartTime,EndTime,OtherNumbers,...
I am stuck at (,(?<AValue>\d+)[^\S]+) as the regex to pull out the rowA values; unfortunately it matches across all the lines. Adding a wildcard to the beginning of the regex apparently misses values, and the tokenizer-based approach apparently requires named columns. Can someone demonstrate that Splunk is expressive enough at index time to extract the information in the manner I'm requesting?
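To illustrate the problem, here is a small Python reproduction of the cross-line behavior (Python's re module; [^\S] is equivalent to \s, which matches newlines):

```python
import re

event = "RowNameA,1432,4363\nRowNameB,8383"

# [^\S] means "not non-whitespace", i.e. \s, and \s matches newlines,
# so the repeated match walks straight onto the RowNameB line.
matches = re.findall(r",(\d+)[^\S]*", event)
print(matches)  # ['1432', '4363', '8383']
```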
I am working with Splunk Cloud, with data files sourced via a heavy forwarder. I've been unable to get the MV_ADD feature to work in transforms.conf, but I have been able to extract a single multi-value field via the transform + field extraction console.
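For reference, the usual search-time MV_ADD recipe looks like the sketch below; the stanza, sourcetype, and field names are hypothetical. Note that MV_ADD only applies to search-time (REPORT-) extractions, not index-time (TRANSFORMS-) ones, which may be why it appeared not to work:

```
# props.conf (sourcetype name is hypothetical)
[my_multiline_sourcetype]
REPORT-row_values = extract_row_values

# transforms.conf
[extract_row_values]
REGEX = ,(\d+)
FORMAT = row_value::$1
MV_ADD = true
```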
This is not possible with just Splunk. I suggest you talk to the folks at Cribl (@cgales).
Lay out real events that cover all the variants, along with the field/value pairs you need to get out of them. That is the only way we can work through this.
You mentioned:
"An acceptable alternative to these associations is a record timestamped with EndTime with multivalue field rowA, multivalue field rowB, and multivalue field rowC."
Here's a run-anywhere example that gets that far. You can then mix and match as you need.
| makeresults
| eval logentry="
RowNameA,1432,4363,6223,7543,19182,...
RowNameB,8383,2727,3221,...
RowNameC,NumericalIdentifierA,NumericalIdentifierB,...
RowNameD,TheDate,StartTime,2019-01-04 12:00:00,OtherNumbers,...
"
| rex field=logentry "(?<RowNameA>.+)\n(?<RowNameB>.+)\n(?<RowNameC>.+)\nRowNameD,(?<TheDate>[^,]+),(?<StartTime>[^,]+),(?<EndTime>[^,]+),(?<RowNameD_Remainder>.+)"
| eval A_timestamped=EndTime+","+RowNameA
| eval B_timestamped=EndTime+","+RowNameB
| eval C_timestamped=EndTime+","+RowNameC
| rex field=A_timestamped "(?<TimeA>[^,]+),(?<ARowLabel>[^,]+),(?<AValues>.+)"
| rex field=B_timestamped "(?<TimeB>[^,]+),(?<BRowLabel>[^,]+),(?<BValues>.+)"
| rex field=C_timestamped "(?<TimeC>[^,]+),(?<CRowLabel>[^,]+),(?<CValues>.+)"
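As a sanity check outside Splunk, the same pattern can be exercised with Python's re module (Python writes named groups as (?P<name>…) where rex uses (?<name>…)):

```python
import re

logentry = """RowNameA,1432,4363,6223,7543,19182,...
RowNameB,8383,2727,3221,...
RowNameC,NumericalIdentifierA,NumericalIdentifierB,...
RowNameD,TheDate,StartTime,2019-01-04 12:00:00,OtherNumbers,..."""

# Same structure as the rex above; . does not cross newlines by default,
# so each .+ group captures exactly one line.
pattern = (r"(?P<RowNameA>.+)\n(?P<RowNameB>.+)\n(?P<RowNameC>.+)\n"
           r"RowNameD,(?P<TheDate>[^,]+),(?P<StartTime>[^,]+),"
           r"(?P<EndTime>[^,]+),(?P<Remainder>.+)")

m = re.search(pattern, logentry)
print(m.group("EndTime"))   # 2019-01-04 12:00:00
print(m.group("RowNameA"))  # RowNameA,1432,4363,6223,7543,19182,...
```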
Thanks for the proposed answer. I was looking for an index-time solution, though. If I understand your proposal correctly, this is a search-time solution; please correct me if that's not the case.
That's correct. I interpreted "alternative" differently.
Assuming that the record format is consistent, you can add (?ms)^ to the very beginning of the regex to tell Splunk to make it a multiline search; the ^ is a start-of-record anchor character. Then you build a regex that covers all four lines.
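That idea is sketched below with illustrative group names, checked in Python's re (in rex the groups would be written (?<AValues>…) etc.):

```python
import re

event = """RowNameA,1432,4363,6223,7543,19182,...
RowNameB,8383,2727,3221,...
RowNameC,NumericalIdentifierA,NumericalIdentifierB,...
RowNameD,TheDate,StartTime,EndTime,OtherNumbers,..."""

# (?ms): ^ matches at the start of each line, and the pattern as a
# whole may span lines. One regex covers all four rows, regardless of
# how many comma-separated items each row holds.
pattern = (r"(?ms)^RowNameA,(?P<AValues>[^\r\n]+)[\r\n]+"
           r"RowNameB,(?P<BValues>[^\r\n]+)[\r\n]+"
           r"RowNameC,(?P<CValues>[^\r\n]+)[\r\n]+"
           r"RowNameD,(?P<DValues>[^\r\n]+)")

m = re.search(pattern, event)
print(m.group("AValues"))  # 1432,4363,6223,7543,19182,...
```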
Thank you for the reply; I would like to understand the proposed answer better.
If by "record format is consistent" you mean the number of comma-separated items (tokens) on each row, then the number of items on rows A, B, and C varies. Row D is the only row that is predictably the same "format".
If you mean something else, can you please elaborate?
You can probably still do it, but the RegEx must accommodate all variants.
Because the number of tokens on a line is unbounded, it is not apparent that a regex can accommodate all the variants. My current understanding is that unbounded tokenization is exactly what the tokenization feature is intended for; however, that feature appears to work by columns only, not rows. If you think a regex can accommodate all the variants, please provide specifics.
You get what you need on a line, then end the line with [^\r\n]*[\r\n]+, then start capturing on the next line until you get what you need, and reuse that same pattern to skip to the next line.
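A minimal sketch of that skip idiom, checked with Python's re (the group names are made up):

```python
import re

event = """RowNameA,1432,4363,6223,7543,19182,...
RowNameB,8383,2727,3221,...
RowNameC,NumericalIdentifierA,NumericalIdentifierB,..."""

# Capture the first value on row A, skip the rest of that line with
# [^\r\n]*[\r\n]+, then start capturing again on row B.
pattern = (r"RowNameA,(?P<firstA>[^,]+)[^\r\n]*[\r\n]+"
           r"RowNameB,(?P<firstB>[^,]+)")

m = re.search(pattern, event)
print(m.group("firstA"), m.group("firstB"))  # 1432 8383
```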
I need all of the comma-separated values on a line; I do not want to skip lines. How does one use a regex, instead of tokenization, to pull all the comma-separated values on a line?
You can use kv for that at search time.
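If the goal is every comma-separated value on a line as one multivalue field, the usual search-time tool is rex with max_match=0, e.g. | rex field=lineA max_match=0 ",(?<val>\d+)" (field name illustrative). The equivalent repeated-match behavior in Python:

```python
import re

line_a = "RowNameA,1432,4363,6223,7543,19182"

# Repeated matching: every numeric token after a comma becomes one
# value, analogous to the multivalue field rex max_match=0 produces.
values = re.findall(r",(\d+)", line_a)
print(values)  # ['1432', '4363', '6223', '7543', '19182']
```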
What I am hearing is that extracting multiple multi-value fields from a multi-line event at index time via regex is not possible.
Somewhat. Very few field extractions should be done at index time anyway.
I'm not sure that's true, because I do a multi-value rex on a single multi-line event. I need more details, though. Is the text on any of these lines fixed? If so, those consistent words can be used as positioning anchors in the rex statement.
Are RowNameA, RowNameB, RowNameC, and RowNameD all fixed text? Are there any other pieces that are consistently found from entry to entry? That will help.
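Assuming the row labels do turn out to be fixed, here is a hypothetical illustration of the anchoring idea (group names invented, data shortened), verified with Python's re:

```python
import re

event = """RowNameA,1432,4363,6223
RowNameB,8383,2727,3221
RowNameC,IdA,IdB
RowNameD,TheDate,StartTime,EndTime,Other"""

# The fixed row labels act as anchors; each group captures the rest
# of its own line, however many comma-separated items it holds.
m = re.search(
    r"RowNameA,(?P<A>[^\r\n]+).*?RowNameC,(?P<C>[^\r\n]+)",
    event, re.S)

# Split a captured line into its individual values.
a_values = m.group("A").split(",")
print(a_values)      # ['1432', '4363', '6223']
print(m.group("C"))  # IdA,IdB
```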
Yes, the RowNames are all fixed text. The only other pieces that are consistently found from entry to entry are the line breaks and the comma separators.
Is the number of entries consistent? A real sample might help.
The number of rows is consistent. The number of items separated by commas is not consistent.