Transform XML log data before indexing - How can I get this to work?

sc0tt
Builder

I have a semi-formatted XML log that Splunk currently indexes without problems. However, it indexes a lot of additional data that is not needed. To save space, I would like to transform this log before indexing and keep only what is needed. I'm using the "Anonymize data" technique as an example. Below are a sample of the log data, the desired indexed version, and my current configuration.

log data

2013-10-23 12:22:17,286 INFO  ==== Outgoing ==== xml version="1.0" encoding="UTF-8"?> <INTERFACE_API><UserId>55555555555</UserId><MsgType>Response</MsgType><Key>001</Key><SessionID>1000</SessionID></INTERFACE_API><EOM>
2013-10-23 12:22:17,274 INFO  ==== Incoming ==== <INTERFACE_API><UserId>55555555555</UserId><MsgType>Request</MsgType><Item>5</Item><Internal>INTERNAL_VALUE</Internal><SessionID>999999999999</SessionID></INTERFACE_API>

desired indexed data

2013-10-23 12:22:17,286 UserId=55555555555 Key=001
2013-10-23 12:22:17,274 UserId=55555555555 Item=5 SessionID=999999999999

props.conf

[test_io]
NO_BINARY_CHECK = 1
pulldown_type = 1
TRANSFORMS-xml = test-io-xml

transforms.conf

[test-io-xml]
REGEX = <(.*?)(?:\s[^>]*)?>([^<]*)</\\1>
FORMAT = $1=$2
DEST_KEY = _raw

I see no difference in my indexed data. I used the regex from the xmlkv.py script because it works in a search, so I assumed it extracts the XML values correctly.

How can I get this to work?
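For anyone wanting to check the pattern outside Splunk, here is a minimal Python sketch run against the sample Incoming line. It assumes Python's re module behaves like Splunk's regex engine for this pattern, which may not hold in every detail. It compares a single-backslash backreference (what a Python string literal such as the one in xmlkv.py would produce from \\1) with the doubled backslash as typed in the transforms.conf stanza above, which standard regex syntax reads as a literal backslash followed by 1.

```python
import re

# Sample Incoming event copied from the log data above.
incoming = (
    '2013-10-23 12:22:17,274 INFO  ==== Incoming ==== '
    '<INTERFACE_API><UserId>55555555555</UserId><MsgType>Request</MsgType>'
    '<Item>5</Item><Internal>INTERNAL_VALUE</Internal>'
    '<SessionID>999999999999</SessionID></INTERFACE_API>'
)

# Backreference form: \1 must repeat the opening tag name.
XML_KV = re.compile(r"<(.*?)(?:\s[^>]*)?>([^<]*)</\1>")

# Doubled-backslash form as written in the stanza: matches a literal "\1".
XML_KV_DOUBLED = re.compile(r"<(.*?)(?:\s[^>]*)?>([^<]*)</\\1>")

pairs = XML_KV.findall(incoming)
formatted = " ".join(f"{key}={value}" for key, value in pairs)

print(formatted)
print(XML_KV_DOUBLED.findall(incoming))
```

In this sketch the backreference form yields every leaf element as a key=value pair, while the doubled-backslash form finds nothing in the sample. Whether Splunk reads the stanza the same way is an assumption, not something confirmed in this thread.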

update:
I've also tried modifying my props.conf to use SEDCMD. I added a sed script that removes all tags, to see whether I can alter the indexed data at all, but it still has no effect.

[test_io]
NO_BINARY_CHECK = 1
pulldown_type = 1
SEDCMD-testio = s/<[^<>]\{1,\}>//g

update 2:
The sed script above doesn't appear to work, but I was able to get another test script to work. Now that I know props.conf is being applied correctly, is there a way to remove the XML tags and create key=value pairs with sed?

sc0tt
Builder

I was able to use a sed script like the one below to extract and format the fields I wanted.

SEDCMD-testio = s/(.*) INFO.*<UserId>(.*)<\/UserId>.*<Item>(.*)<\/Item>.*<SessionID>(.*)<\/SessionID>.*/\1 UserId=\2 Item=\3 SessionID=\4/

The indexed data is then formatted as

2013-10-23 12:22:17,274 UserId=55555555555 Item=5 SessionID=999999999999
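
The substitution can be tried against both sample events outside Splunk. The following Python sketch reuses the pattern from the SEDCMD above (with \/ written as /) and assumes Python's re module is close enough to Splunk's sed-style replacement for this case. The second rule, for Outgoing events, is hypothetical and not part of the original answer; it is included because the Outgoing sample has no Item element, so the first pattern does not match it.

```python
import re

# Sample events copied from the log data in the question.
outgoing = (
    '2013-10-23 12:22:17,286 INFO  ==== Outgoing ==== '
    'xml version="1.0" encoding="UTF-8"?> '
    '<INTERFACE_API><UserId>55555555555</UserId><MsgType>Response</MsgType>'
    '<Key>001</Key><SessionID>1000</SessionID></INTERFACE_API><EOM>'
)
incoming = (
    '2013-10-23 12:22:17,274 INFO  ==== Incoming ==== '
    '<INTERFACE_API><UserId>55555555555</UserId><MsgType>Request</MsgType>'
    '<Item>5</Item><Internal>INTERNAL_VALUE</Internal>'
    '<SessionID>999999999999</SessionID></INTERFACE_API>'
)

# Rule from the answer: needs UserId, Item and SessionID, in that order.
INCOMING_RULE = (
    re.compile(
        r"(.*) INFO.*<UserId>(.*)</UserId>.*<Item>(.*)</Item>"
        r".*<SessionID>(.*)</SessionID>.*"
    ),
    r"\1 UserId=\2 Item=\3 SessionID=\4",
)

# Hypothetical second rule (an assumption, not from the thread): UserId and Key.
OUTGOING_RULE = (
    re.compile(r"(.*) INFO.*<UserId>(.*)</UserId>.*<Key>(.*)</Key>.*"),
    r"\1 UserId=\2 Key=\3",
)


def rewrite(event, rules):
    """Apply each (pattern, replacement) rule in turn, like stacked sed scripts."""
    for pattern, replacement in rules:
        event = pattern.sub(replacement, event, count=1)
    return event


only_answer_rule = [INCOMING_RULE]
both_rules = [INCOMING_RULE, OUTGOING_RULE]

print(rewrite(incoming, only_answer_rule))
print(rewrite(outgoing, only_answer_rule) == outgoing)  # unchanged by the answer's rule
print(rewrite(outgoing, both_rules))
```

In Splunk the second rule would presumably be a separate SEDCMD setting with its own class name in the same stanza; that part is untested here.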