I'm using Splunk 6.1.3 (build 220630) on RH 6.5, and I've read much about parsing XML into Splunk. Nonetheless, I think I've found an issue that others have not yet documented or asked about (but if I'm wrong, please tell me). I have legacy input that is mostly XML, but the timestamps are on a separate line outside of the XML (corresponding to the bad_xml type in the example below). I cannot seem to get Splunk to recognize the input as XML, at least insofar as spath doesn't work with it.
Here is a distilled version of my situation. I set this up in props.conf:
[good_xml]
BREAK_ONLY_BEFORE = <\?xml
DATETIME_CONFIG = CURRENT
NO_BINARY_CHECK = 1
pulldown_type = 1
[bad_xml]
NO_BINARY_CHECK = 1
pulldown_type = 1
And I have two input files, good:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<child id="1"/>
<child id="2"/>
</root>
and bad:
DATE = 2014-09-24 11:25:47
<?xml version="1.0" encoding="UTF-8"?>
<root>
<child id="1"/>
<child id="2"/>
</root>
I then index good and bad with sourcetype good_xml and bad_xml respectively. When I search sourcetype=good_xml | spath, I get beautiful extraction of root.child{@id}; however, searching sourcetype=bad_xml | spath yields no such extraction. Throwing | xmlkv in the middle changes nothing (but to be honest, I've never seen hard evidence of what xmlkv does anyway). Note that in both cases, the events are properly extracted from the appropriate multiple lines. I also tried adding kv_mode = XML in props.conf, but that didn't work either.
Is there a way to get this to work?
Bonus question: is there a way to get more insight into what Splunk is doing here? When spath works, what is really different about the indexing/searching results and intermediate processing, and does Splunk offer some sort of metadata/transparency on it? Does it consider bad_xml as valid XML? (It does actually parse some key-value pairs, but it doesn't properly handle duplicate values like spath does.) I tried things like searching on index=_internal and digging around ... but I didn't see anything revealing.
Put the xml into a field with a field extraction, then
| spath input=your_new_field
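One way to make that field extraction permanent is a search-time EXTRACT in props.conf for the bad_xml sourcetype. This is a sketch rather than a tested config: the class and field name xml_payload are placeholders, and the regex assumes the payload always begins at the XML declaration and that everything after it in the event belongs to the document.

```
# props.conf (search-time extraction; xml_payload is a hypothetical field name)
[bad_xml]
# (?s) lets . match newlines so the capture spans the whole multi-line document
EXTRACT-xml_payload = (?s)(?<xml_payload><\?xml.*)
```

With something like that in place, the search would become sourcetype=bad_xml | spath input=xml_payload.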
Wow, that actually works. I was able to use something like:
| rex field=_raw "(?s)(?<xxx>..xml version=.*)" | spath input=xxx
and now I see all the fields. I'm not sure how performant it will be compared to extraction upstream, but it's a huge win compared to not having spath at all. Thanks!
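If only the child ids are of interest, spath can also be pointed at a single path instead of extracting everything. A rough sketch, assuming the same sample events as in the question; xml_payload and child_id are made-up field names, and mvexpand is only there to split the multivalue result into one row per id:

```
sourcetype=bad_xml
| rex field=_raw "(?s)(?<xml_payload><\?xml.*)"
| spath input=xml_payload path=root.child{@id} output=child_id
| mvexpand child_id
```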
To clarify, spath on good_xml extracts all fields properly; on bad_xml, it extracts nothing.