Solved: Is there a way to use XML / spath for files that aren't pure XML?

jordansamuels_h · ‎09-24-2014

I'm using Splunk 6.1.3 (build 220630) on RH 6.5, and I've read much about parsing XML into Splunk. Nonetheless, I think I've found an issue that others have not yet documented or asked about (but if I'm wrong, please tell me). I have legacy input that is mostly XML, but the timestamps are on a separate line outside of the XML (corresponding to the bad_xml type in the example below). I cannot seem to get Splunk to recognize the input as XML, at least insofar as spath doesn't work with it.

Here is a distilled version of my situation. I set up this in props.conf:

[good_xml]
BREAK_ONLY_BEFORE = <\?xml
DATETIME_CONFIG = CURRENT 
NO_BINARY_CHECK = 1
pulldown_type = 1

[bad_xml]
NO_BINARY_CHECK = 1
pulldown_type = 1

And have two input files, good:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <child id="1"/>
  <child id="2"/> 
</root>

and bad:

DATE = 2014-09-24 11:25:47
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <child id="1"/>
  <child id="2"/>
</root>

I then index good and bad with sourcetype good_xml and bad_xml respectively. When I search sourcetype=good_xml | spath I get beautiful extraction of root.child{@id}; however, searching sourcetype=bad_xml | spath yields no such extraction. Throwing | xmlkv in the middle changes nothing (but to be honest, I've never seen hard evidence of what xmlkv does anyway). Note that in both cases, the events are properly extracted from the appropriate multiple lines. I also tried adding kv_mode = XML in props.conf but that didn't work either.

Is there a way to get this to work?

Bonus question: is there a way to get more insight into what Splunk is doing here? When spath works, what is really different about the indexing/searching results and intermediate processing, and does Splunk offer some sort of metadata/transparency on it? Does it consider bad_xml as valid XML? (It does actually parse some key-value pairs, but it doesn't properly handle duplicate values like spath does.) I tried things like searching on index=_internal and digging around, but I didn't see anything revealing.

trsavela · ‎10-08-2014

Put the XML portion of the event into its own field with a field extraction, then point spath at that field:

| spath input=your_new_field
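
One way to set up such a field extraction is a search-time EXTRACT setting in props.conf on the search head. The fragment below is only a sketch under a few assumptions: the class name and field name xml_payload are made up for illustration, every bad_xml event is assumed to contain the <?xml declaration, and it has not been tested against 6.1.3.

```ini
# props.conf (search-time extraction; "xml_payload" is a hypothetical name)
[bad_xml]
# (?s) lets "." match newlines, so the capture spans the multi-line event.
# The named group keeps everything from the XML declaration to the end of
# the event and leaves out the leading "DATE = ..." line.
EXTRACT-xml_payload = (?s)(?<xml_payload><\?xml\s.*)
```

With that in place, a search such as sourcetype=bad_xml | spath input=xml_payload should be able to read the XML from the new field. Since EXTRACT is applied at search time, it is a convenience over typing the regex in every search rather than an index-time optimization.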

jordansamuels_h · ‎10-09-2014

Wow, that actually works. I was able to use something like:

 | rex field=_raw "(?s)(?<xxx>..xml version=.*)" | spath input=xxx

and now I see all the fields. I'm not sure how performant it will be compared to extraction upstream, but it's a huge win compared to not having spath at all. Thanks!

jordansamuels_h · ‎09-24-2014

To clarify, spath on good_xml extracts all fields properly; on bad_xml, it extracts nothing.
