Solved: Transform XML log data before indexing - How can I get this to work?

sc0tt · ‎10-24-2013

I have a semi-formatted XML log that Splunk currently indexes with no problems. However, it indexes a lot of additional data that is not needed. To save space, I would like to transform this log and index only what is needed. I'm using the "Anonymize data" topic as an example. Below are a sample of the log data, the desired formatted version, and my current configuration.

log data

2013-10-23 12:22:17,286 INFO  ==== Outgoing ==== <?xml version="1.0" encoding="UTF-8"?> <INTERFACE_API><UserId>55555555555</UserId><MsgType>Response</MsgType><Key>001</Key><SessionID>1000</SessionID></INTERFACE_API><EOM>
2013-10-23 12:22:17,274 INFO  ==== Incoming ==== <INTERFACE_API><UserId>55555555555</UserId><MsgType>Request</MsgType><Item>5</Item><Internal>INTERNAL_VALUE</Internal><SessionID>999999999999</SessionID></INTERFACE_API>

desired indexed data

2013-10-23 12:22:17,286 UserId=55555555555 Key=001
2013-10-23 12:22:17,274 UserId=55555555555 Item=5 SessionID=999999999999

props.conf

[test_io]
NO_BINARY_CHECK = 1
pulldown_type = 1
TRANSFORMS-xml = test-io-xml

transforms.conf

[test-io-xml]
REGEX = <(.*?)(?:\s[^>]*)?>([^<]*)</\\1>
FORMAT = $1=$2
DEST_KEY = _raw

I see no difference in my indexed data. I used the regex that is included in the xmlkv.py script because that does work in a search, so I figured that it is able to extract the xml values correctly.

How can I get this to work?
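The regex can be checked outside Splunk with a short Python sketch. It runs the pattern against the Incoming sample line twice: once with a single-backslash backreference (`\1`), and once with the doubled backslash exactly as it appears in the transforms.conf stanza. The assumption here, not confirmed anywhere in this thread, is that `\\1` in xmlkv.py is a Python string escape that reaches the regex engine as `\1`, whereas in a .conf file the same characters would be read as a literal backslash followed by `1`. The names `WITH_BACKREF`, `AS_IN_CONF`, and `extract_pairs` are made up for illustration, and the sketch does not model how Splunk applies index-time transforms.

```python
import re

# Incoming sample line from the question.
SAMPLE = (
    "2013-10-23 12:22:17,274 INFO  ==== Incoming ==== "
    "<INTERFACE_API><UserId>55555555555</UserId><MsgType>Request</MsgType>"
    "<Item>5</Item><Internal>INTERNAL_VALUE</Internal>"
    "<SessionID>999999999999</SessionID></INTERFACE_API>"
)

# Closing tag must repeat the captured opening tag name (backreference \1).
WITH_BACKREF = re.compile(r"<(.*?)(?:\s[^>]*)?>([^<]*)</\1>")

# Same text as the REGEX line in transforms.conf: the doubled backslash
# makes the engine look for a literal "\" followed by "1".
AS_IN_CONF = re.compile(r"<(.*?)(?:\s[^>]*)?>([^<]*)</\\1>")


def extract_pairs(pattern, event):
    """Return (tag, value) tuples for every element that holds only text."""
    return pattern.findall(event)


if __name__ == "__main__":
    pairs = extract_pairs(WITH_BACKREF, SAMPLE)
    print(" ".join(f"{tag}={value}" for tag, value in pairs))
    print(extract_pairs(AS_IN_CONF, SAMPLE))
```

With the single backslash the pattern returns every leaf element, including `MsgType` and `Internal`, so it would not filter out unwanted fields by itself. With the doubled backslash it finds nothing in this sample.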

update:
I've also tried modifying my props.conf to use SEDCMD. I added a sed script that should remove all tags, to see whether I'm able to alter the indexed data at all, but there is still no effect.

[test_io]
NO_BINARY_CHECK = 1
pulldown_type = 1
SEDCMD-testio = s/<[^<>]\{1,\}>//g

update 2:
The above sed script doesn't appear to work but I was able to get another test script to work. Now that I know the props.conf is working correctly, is there a way to remove the xml tags and create a key=value pair with sed?
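One possible explanation for the tag-stripping script having no effect, offered as an assumption rather than a confirmed diagnosis: the props.conf specification describes the SEDCMD regex as a Perl-style regular expression, and in that syntax `\{1,\}` means literal brace characters, not the "one or more" quantifier it is in POSIX basic sed. The Python sketch below illustrates the difference. Python's `re` module is not the engine Splunk uses, but it treats escaped braces the same way. The names `ESCAPED_BRACES`, `PLAIN_QUANTIFIER`, and `strip_tags` are hypothetical.

```python
import re

# Incoming sample line from the question.
SAMPLE = (
    "2013-10-23 12:22:17,274 INFO  ==== Incoming ==== "
    "<INTERFACE_API><UserId>55555555555</UserId><MsgType>Request</MsgType>"
    "<Item>5</Item><Internal>INTERNAL_VALUE</Internal>"
    "<SessionID>999999999999</SessionID></INTERFACE_API>"
)

# The expression as written in props.conf: the escaped braces are
# treated as literal "{" and "}" characters.
ESCAPED_BRACES = re.compile(r"<[^<>]\{1,\}>")

# The same intent written with an unescaped "one or more" quantifier.
PLAIN_QUANTIFIER = re.compile(r"<[^<>]+>")


def strip_tags(pattern, event):
    """Delete every match of pattern, like s/pattern//g."""
    return pattern.sub("", event)


if __name__ == "__main__":
    # True: the escaped-brace form leaves the event untouched.
    print(strip_tags(ESCAPED_BRACES, SAMPLE) == SAMPLE)
    print(strip_tags(PLAIN_QUANTIFIER, SAMPLE))
```

Stripping the tags alone runs the values together with no separators, so producing key=value pairs needs capture groups in the substitution rather than plain tag removal.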

sc0tt · ‎10-30-2013

I was able to use a sed script like the one below to extract and format the fields I wanted.

SEDCMD-testio = s/(.*) INFO.*<UserId>(.*)<\/UserId>.*<Item>(.*)<\/Item>.*<SessionID>(.*)<\/SessionID>.*/\1 UserId=\2 Item=\3 SessionID=\4/

The indexed data is then formatted as

2013-10-23 12:22:17,274 UserId=55555555555 Item=5 SessionID=999999999999
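Before deploying a substitution like this, it can help to try it against sample events offline. The sketch below emulates the expression with Python's `re.sub`, which is an approximation of Splunk's engine, not the engine itself. `PATTERN`, `REPLACEMENT`, and `rewrite` are illustrative names, and the two sample lines are taken from the question.

```python
import re

# Same pattern and replacement as the SEDCMD above. The "\/" escapes are
# only needed inside s/.../.../ and are dropped here.
PATTERN = re.compile(
    r"(.*) INFO.*<UserId>(.*)</UserId>.*<Item>(.*)</Item>"
    r".*<SessionID>(.*)</SessionID>.*"
)
REPLACEMENT = r"\1 UserId=\2 Item=\3 SessionID=\4"

INCOMING = (
    "2013-10-23 12:22:17,274 INFO  ==== Incoming ==== "
    "<INTERFACE_API><UserId>55555555555</UserId><MsgType>Request</MsgType>"
    "<Item>5</Item><Internal>INTERNAL_VALUE</Internal>"
    "<SessionID>999999999999</SessionID></INTERFACE_API>"
)
OUTGOING = (
    "2013-10-23 12:22:17,286 INFO  ==== Outgoing ==== "
    '<?xml version="1.0" encoding="UTF-8"?> '
    "<INTERFACE_API><UserId>55555555555</UserId><MsgType>Response</MsgType>"
    "<Key>001</Key><SessionID>1000</SessionID></INTERFACE_API><EOM>"
)


def rewrite(event):
    """Apply the substitution once per event (no g flag in the original)."""
    return PATTERN.sub(REPLACEMENT, event, count=1)


if __name__ == "__main__":
    print(rewrite(INCOMING))
    print(rewrite(OUTGOING))
```

In this sketch the Incoming line comes out in the format shown above. The Outgoing line is left unchanged because it has no `<Item>` element, so producing the `Key=001` form from the question would need a further expression that this thread does not show.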

