Is there a way to use XML / spath for files that aren't pure XML?

jordansamuels_h
Explorer

I'm using Splunk 6.1.3 (build 220630) on RH 6.5, and I've read much about parsing XML into Splunk. Nonetheless, I think I've found an issue that others have not yet documented or asked about (but if I'm wrong, please tell me). I have legacy input that is mostly XML, but the timestamps are on a separate line outside of the XML (corresponding to the bad_xml type in the example below). I cannot seem to get Splunk to recognize the input as XML, at least insofar as spath doesn't work with it.

Here is a distilled version of my situation. I set up this in props.conf:

[good_xml]
BREAK_ONLY_BEFORE = <\?xml
DATETIME_CONFIG = CURRENT 
NO_BINARY_CHECK = 1
pulldown_type = 1

[bad_xml]
NO_BINARY_CHECK = 1
pulldown_type = 1

And have two input files, good:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <child id="1"/>
  <child id="2"/> 
</root>

and bad:

DATE = 2014-09-24 11:25:47
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <child id="1"/>
  <child id="2"/>
</root>

I then index good and bad with sourcetype good_xml and bad_xml respectively. When I search sourcetype=good_xml | spath I get beautiful extraction of root.child{@id}; however, searching sourcetype=bad_xml | spath yields no such extraction. Throwing | xmlkv in the middle changes nothing (but to be honest, I've never seen hard evidence of what xmlkv does anyway). Note that in both cases, the events are properly extracted from the appropriate multiple lines. I also tried adding kv_mode = XML in props.conf but that didn't work either.

Is there a way to get this to work?

Bonus question: is there a way to get more insight into what Splunk is doing here? When spath works, what is really different about the indexing/searching results and intermediate processing, and does Splunk offer some sort of metadata/transparency on it? Does it consider bad_xml to be valid XML? (It does actually parse some key-value pairs, but it doesn't properly handle duplicate values like spath does.) I tried things like searching on index=_internal and digging around, but I didn't see anything revealing.

1 Solution

trsavela
Path Finder

Put the xml into a field with a field extraction, then

| spath input=your_new_field
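
One way to set up that field extraction is a search-time EXTRACT in props.conf, so the field exists without an inline rex in every search. This is only a sketch under assumptions: the field name xml_payload and the class name after EXTRACT- are made up for illustration, and the regex assumes the XML always starts at the <?xml declaration and runs to the end of the event.

```
# props.conf (search-time settings for the bad_xml sourcetype)
# xml_payload is a hypothetical field name; pick whatever suits you.
[bad_xml]
# (?s) lets . match newlines, so the capture spans the whole multi-line XML body
EXTRACT-xml_payload = (?s)(?<xml_payload><\?xml.*)
```

With that in place, the search would be along the lines of sourcetype=bad_xml | spath input=xml_payload.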

jordansamuels_h
Explorer

Wow, that actually works. I was able to use something like:

 | rex field=_raw "(?s)(?<xxx>..xml version=.*)" | spath input=xxx

and now I see all the fields. I'm not sure how performant it will be compared to extraction upstream, but it's a huge win compared to not having spath at all. Thanks!
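
If only some of the values are needed, spath also accepts path= and output= arguments, so it can pull a single location instead of extracting everything. A rough sketch, reusing the rex above and the root.child{@id} path from the original question; child_id is just a hypothetical output field name:

```
sourcetype=bad_xml
| rex field=_raw "(?s)(?<xxx>..xml version=.*)"
| spath input=xxx path=root.child{@id} output=child_id
| table _time child_id
```

Because the sample event has two child elements, child_id would presumably come back as a multivalue field. Whether this is any faster than a plain spath on the same field is not something I have measured.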

jordansamuels_h
Explorer

To clarify, spath on good_xml extracts all fields properly; on bad_xml, it extracts nothing.
