CSV Field Contains Full XML Event

coltwanger
Contributor

I have an ugly log format with pipe-separated values, where one of the fields in each event is a full XML document tied to that event. Here's an example of this terrible-looking format:

"Application"|""|"123456789"|"12"|""|""|"456"|"lkasjdf3lkajsxzkjfa"|"2016-04-14 23:59:52.487000000"|""|"127.0.0.1"|"HostName"|"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 "|""|""|""|""|""|"123456789"|"2016-04-15 00:20:15.057000000"|"321"|"321654987654"|"2016-04-15 00:20:15.060000000"|"<XMLEvent xmlns="http://www.domain.com" xmlns:abc="http://www.domain.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" field="v1.0">
<Attachment="0">
<TimeStamp>2016-04-15T00:20:08Z</Timestamp>
<Id>000000000</Id>
<Version>1.0</Version>
<Parents>
<Parent1>
<FirstName>John</FirstName>
<LastName>Doe</LastName>
<DOB>03-13-1979</DOB>
</Parent1>
<Parent2>
<FirstName>Jane</FirstName>
<LastName>Doe</LastName>
<DOB>10-30-1980</DOB>
</Parent2>
</Parents>
<Children>
<Child1>
<FirstName>Johnny</FirstName>
<LastName>Doe</LastName>
<DOB>06-30-2009</DOB>
</Child1>
<Child2>
<FirstName>Jenny</FirstName>
<LastName>Doe</LastName>
<DOB>01-01-2010</DOB>
</Child2>
</Children>"|"END"

Here are my settings:

props.conf

[xml_hell]
DATETIME_CONFIG = 
FIELD_DELIMITER = |
HEADER_FIELD_DELIMITER = |
INDEXED_EXTRACTIONS = psv
KV_MODE = none
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = false
category = Custom
description = Pipe-separated value format. Set header and other settings in "Delimited Settings"
disabled = false
pulldown_type = true

# Extract XML Data into KV Pairs
REPORT-01_extract_xml_fields = 01_extract_xml_fields

transforms.conf

# Breaks out the content from the xmlstring field generated in the psv file
[01_extract_xml_fields]
SOURCE_KEY = xmlstring
REGEX = <([^>]*)>([^<]*)</\1>
FORMAT = $1::$2
MV_ADD = true

So what I've got in Splunk are correctly identified character-separated fields (using | as the delimiter), and the full XML event in a field called "xmlstring". I need to parse this "xmlstring" field into its own KV pairs, and I'd like to do this in a way that mimics KV_MODE=XML, if possible.

The problem I'm currently facing is that the same field name appears multiple times throughout the XML. Without MV_ADD, I only get one "FirstName" value for the event above. With MV_ADD, I get:

  • FirstName=
  • John
  • Jane
  • Johnny
  • Jenny

However, each of these FirstName values means something different -- "John" belongs to "Parent1", "Jane" belongs to "Parent2", etc. I'd like to be able to extract them as:

  • Parents.Parent1.FirstName = John
  • Parents.Parent2.FirstName = Jane
  • Children.Child1.FirstName = Johnny
  • Children.Child2.FirstName = Jenny

I'm able to capture each grouping using Regex, but I think Splunk is failing on the assignment because not ALL of the capture groups contain data for each line. My Regex for this is:

(?<Tag1><(?<Tag2>[^>\/]*)>\n |<(?<Tag3>[^>\/]*)>)(?<Tag4>[^<]*)

So when I hit on "Parents", Tag1 is populated with "<Parents>", but none of the other tags are. When I hit on "Parent1", Tag1 is populated with "<Parent1>", Tag2 is populated with "Parent1", and Tag3 and Tag4 are empty. When I hit on "FirstName", Tag1 gets "<FirstName>", Tag3 is populated with "FirstName", and Tag4 gets "John", but Tag2 is empty.

I tried FORMAT=$1.$2.$3::$4, but I think Splunk is failing when one of the groups is null. Is there any way to perform something similar to KV_MODE=XML on a single field?


Masa
Splunk Employee

Splunk itself does not have a feature to do this in a configuration file, and I doubt a single regex can parse every possible tree structure. If you need this, you would have to create a separate regex for each tree, and for that you probably need to know your tag names in advance.

Assuming your main goal is to parse the tree structure of the xmlstring field, you can use an ugly workaround with spath.
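
To make the per-tree regex idea concrete, here is a minimal sketch that assumes the tag names from the sample event (Parent1, Parent2, FirstName) are known ahead of time. The stanza names and output field names are hypothetical. Underscores are used in the field names instead of dots because, by default, search-time key cleaning (CLEAN_KEYS in transforms.conf) replaces non-alphanumeric characters with underscores.

```
# props.conf (added under the existing sourcetype stanza)
[xml_hell]
REPORT-02_extract_parent_names = 02_parent1_firstname, 02_parent2_firstname

# transforms.conf
# One transform per branch of the tree; names below are illustrative only.
[02_parent1_firstname]
SOURCE_KEY = xmlstring
# Lazily skip from <Parent1> to the first <FirstName> that follows it.
REGEX = <Parent1>[\s\S]*?<FirstName>([^<]*)</FirstName>
FORMAT = Parent1_FirstName::$1

[02_parent2_firstname]
SOURCE_KEY = xmlstring
REGEX = <Parent2>[\s\S]*?<FirstName>([^<]*)</FirstName>
FORMAT = Parent2_FirstName::$1
```

One caveat with this sketch: if a Parent1 block has no FirstName, the lazy match would run on into the next block and pick up that one instead. It also needs one stanza per path, which is why it does not scale to arbitrary trees.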

<Your base search> | eval orig_raw=_raw | eval _raw=xmlstring |  spath | table *

You should then see the XML paths as field names.
If this is acceptable, you might want to create a macro so you don't have to type this extra field extraction every time.
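
A macro for this could look roughly like the following sketch. The macro name xmlstring_fields is hypothetical, and the definition simply wraps the search pipeline shown above.

```
# macros.conf
# The macro name is illustrative; pick whatever suits your app.
[xmlstring_fields]
# Keep the original event in orig_raw, point _raw at the XML,
# then let spath extract the XML paths as fields.
definition = eval orig_raw=_raw | eval _raw=xmlstring | spath
```

It would then be invoked as: <Your base search> | `xmlstring_fields` | table *

spath also accepts an input=<field> argument, so a definition of "spath input=xmlstring" may give the same fields without overwriting _raw. That variant is worth testing against your own data.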

coltwanger
Contributor

This is really interesting -- you are right that I'm getting the fields out as xml paths now. I'll see what I can do with this.

Thank you!


Masa
Splunk Employee

Glad to see it helped.
