I need help extracting multiple multi-value fields from a multiline event.
I expect the result of the following extraction to index each of rowA's values with each of rowC's identifiers, to index each of rowB's values with each of rowC's identifiers, and to extract EndTime into the record timestamp(s).
An acceptable alternative to these associations is a record timestamped with EndTime with multivalue field rowA, multivalue field rowB, and multivalue field rowC.
RowNameA,1432,4363,6223,7543,19182,...
RowNameB,8383,2727,3221,...
RowNameC,NumericalIdentifierA,NumericalIdentifierB,...
RowNameD,TheDate,StartTime,EndTime,OtherNumbers,...
I am stuck at (,(?<AValue>\d+)[^\S]+) as the regex to pull out the rowA values; unfortunately it matches across all the lines. Adding a wildcard to the beginning of the regex apparently misses values, and the tokenizer-based approach apparently requires named columns. Can someone demonstrate that Splunk is expressive enough at index time to extract the information in the manner I'm requesting?
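To illustrate the problem, here is a small Python reproduction of the cross-line behavior (Python's re module; [^\S] is equivalent to \s, which matches newlines):

```python
import re

event = "RowNameA,1432,4363\nRowNameB,8383"

# [^\S] means "not non-whitespace", i.e. \s, and \s matches newlines,
# so the repeated match walks straight onto the RowNameB line.
matches = re.findall(r",(\d+)[^\S]*", event)
print(matches)  # ['1432', '4363', '8383']
```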
I am working with Splunk Cloud, with data files sourced via a heavy forwarder. I've been unable to get the MV_ADD feature to work in transforms.conf, but I have been able to extract a single multi-value field via the transform + field extraction console.
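For reference, the usual search-time MV_ADD recipe looks like the sketch below; the stanza, sourcetype, and field names are hypothetical. Note that MV_ADD only applies to search-time (REPORT-) extractions, not index-time (TRANSFORMS-) ones, which may be why it appeared not to work:

```
# props.conf (sourcetype name is hypothetical)
[my_multiline_sourcetype]
REPORT-row_values = extract_row_values

# transforms.conf
[extract_row_values]
REGEX = ,(\d+)
FORMAT = row_value::$1
MV_ADD = true
```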
This is not possible with just Splunk. I suggest you talk to the folks at Cribl (@cgales).
Lay out real events that cover all the variants, along with the field/value pairs you need to get out of them. That is the only way we can work through this.
You mentioned:
"An acceptable alternative to these associations is a record timestamped with EndTime with multivalue field rowA, multivalue field rowB, and multivalue field rowC."
Here's a run-anywhere example that gets that far. You can then mix and match as you need.
| makeresults
| eval logentry="
RowNameA,1432,4363,6223,7543,19182,...
RowNameB,8383,2727,3221,...
RowNameC,NumericalIdentifierA,NumericalIdentifierB,...
RowNameD,TheDate,StartTime,2019-01-04 12:00:00,OtherNumbers,...
"
| rex field=logentry "(?<RowNameA>.+)\n(?<RowNameB>.+)\n(?<RowNameC>.+)\nRowNameD,(?<TheDate>[^,]+),(?<StartTime>[^,]+),(?<EndTime>[^,]+),(?<RowNameD_Remainder>.+)"
| eval A_timestamped=EndTime+","+RowNameA
| eval B_timestamped=EndTime+","+RowNameB
| eval C_timestamped=EndTime+","+RowNameC
| rex field=A_timestamped "(?<TimeA>[^,]+),(?<ARowLabel>[^,]+),(?<AValues>.+)"
| rex field=B_timestamped "(?<TimeB>[^,]+),(?<BRowLabel>[^,]+),(?<BValues>.+)"
| rex field=C_timestamped "(?<TimeC>[^,]+),(?<CRowLabel>[^,]+),(?<CValues>.+)"
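As a sanity check outside Splunk, the same pattern can be exercised with Python's re module (Python writes named groups as (?P<name>…) where rex uses (?<name>…)):

```python
import re

logentry = """RowNameA,1432,4363,6223,7543,19182,...
RowNameB,8383,2727,3221,...
RowNameC,NumericalIdentifierA,NumericalIdentifierB,...
RowNameD,TheDate,StartTime,2019-01-04 12:00:00,OtherNumbers,..."""

# Same structure as the rex above; . does not cross newlines by default,
# so each .+ group captures exactly one line.
pattern = (r"(?P<RowNameA>.+)\n(?P<RowNameB>.+)\n(?P<RowNameC>.+)\n"
           r"RowNameD,(?P<TheDate>[^,]+),(?P<StartTime>[^,]+),"
           r"(?P<EndTime>[^,]+),(?P<Remainder>.+)")

m = re.search(pattern, logentry)
print(m.group("EndTime"))   # 2019-01-04 12:00:00
print(m.group("RowNameA"))  # RowNameA,1432,4363,6223,7543,19182,...
```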
Thanks for the proposed answer. I was looking for an index-time solution, though. If I understand your proposal correctly, this is a search-time solution; please correct me if that's not the case.
That's correct. I interpreted "alternative" differently.
Assuming that the record format is consistent, you can add (?ms)^ to the very beginning of the regex to tell Splunk to make it a multiline search; the ^ is a start-of-record anchor character. Then you build a regex that covers all four lines.
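That idea is sketched below with illustrative group names, checked in Python's re (in rex the groups would be written (?<AValues>…) etc.):

```python
import re

event = """RowNameA,1432,4363,6223,7543,19182,...
RowNameB,8383,2727,3221,...
RowNameC,NumericalIdentifierA,NumericalIdentifierB,...
RowNameD,TheDate,StartTime,EndTime,OtherNumbers,..."""

# (?ms): ^ matches at the start of each line, and the pattern as a
# whole may span lines. One regex covers all four rows, regardless of
# how many comma-separated items each row holds.
pattern = (r"(?ms)^RowNameA,(?P<AValues>[^\r\n]+)[\r\n]+"
           r"RowNameB,(?P<BValues>[^\r\n]+)[\r\n]+"
           r"RowNameC,(?P<CValues>[^\r\n]+)[\r\n]+"
           r"RowNameD,(?P<DValues>[^\r\n]+)")

m = re.search(pattern, event)
print(m.group("AValues"))  # 1432,4363,6223,7543,19182,...
```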
Thank you for the reply; I would like to understand the proposed answer better.
If by "record format is consistent" you mean the number of comma-separated items (tokens) on each row, then the number of items on rows A, B, and C varies. Row D is the only row that is predictably the same "format".
If you mean something else, can you please elaborate?
You can probably still do it, but the RegEx must accommodate all variants.
Because the number of tokens on a line is unbounded, it is not apparent that a regex can accommodate all the variants. My current understanding is that unbounded tokenization is exactly what the tokenization feature is intended for; however, that feature appears to work by columns only, not rows. If you think a regex can accommodate all the variants, please provide specifics.
You get what you need on a line, then end the line with [^\r\n]*[\r\n]+, then start capturing on the next line until you get what you need, and reuse that same pattern to skip to the next line.
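A minimal sketch of that skip idiom, checked with Python's re (the group names are made up):

```python
import re

event = """RowNameA,1432,4363,6223,7543,19182,...
RowNameB,8383,2727,3221,...
RowNameC,NumericalIdentifierA,NumericalIdentifierB,..."""

# Capture the first value on row A, skip the rest of that line with
# [^\r\n]*[\r\n]+, then start capturing again on row B.
pattern = (r"RowNameA,(?P<firstA>[^,]+)[^\r\n]*[\r\n]+"
           r"RowNameB,(?P<firstB>[^,]+)")

m = re.search(pattern, event)
print(m.group("firstA"), m.group("firstB"))  # 1432 8383
```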
I need all of the comma-separated values on a line; I do not want to skip lines. How does one use a regex, instead of tokenization, to pull all the comma-separated values on a line?
You can use kv for that at search time.
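If the goal is every comma-separated value on a line as one multivalue field, the usual search-time tool is rex with max_match=0, e.g. | rex field=lineA max_match=0 ",(?<val>\d+)" (field name illustrative). The equivalent repeated-match behavior in Python:

```python
import re

line_a = "RowNameA,1432,4363,6223,7543,19182"

# Repeated matching: every numeric token after a comma becomes one
# value, analogous to the multivalue field rex max_match=0 produces.
values = re.findall(r",(\d+)", line_a)
print(values)  # ['1432', '4363', '6223', '7543', '19182']
```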
What I am hearing is that extracting multiple multi-value fields from a multi-line event at index time via regex is not possible.
Somewhat. Very few field extractions should be done at index time anyway.
I'm not sure that's true, because I do a multi-value rex on a single multi-line event. I need more details, though. Is the text on any of these lines fixed? If so, those consistent words can be used as positioning anchors in the rex statement.
Are RowNameA, RowNameB, RowNameC, and RowNameD all fixed text? Are there any other pieces that are consistently found from entry to entry? That will help.
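Assuming the row labels do turn out to be fixed, here is a hypothetical illustration of the anchoring idea (group names invented, data shortened), verified with Python's re:

```python
import re

event = """RowNameA,1432,4363,6223
RowNameB,8383,2727,3221
RowNameC,IdA,IdB
RowNameD,TheDate,StartTime,EndTime,Other"""

# The fixed row labels act as anchors; each group captures the rest
# of its own line, however many comma-separated items it holds.
m = re.search(
    r"RowNameA,(?P<A>[^\r\n]+).*?RowNameC,(?P<C>[^\r\n]+)",
    event, re.S)

# Split a captured line into its individual values.
a_values = m.group("A").split(",")
print(a_values)      # ['1432', '4363', '6223']
print(m.group("C"))  # IdA,IdB
```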
Yes, the RowNames are all fixed text. The only other pieces that are consistently found from entry to entry are the line breaks and the comma separators.
Is the number of entries consistent? A real sample might help.
The number of rows is consistent. The number of items separated by commas is not consistent.