I have a semi-formatted XML log that Splunk currently indexes with no problems. However, it indexes a lot of additional data that is not needed. To save space, I would like to transform this log and index only what is needed. I'm using the "Anonymize data" documentation as an example. Below are a sample of the log data, the desired indexed version, and my current configuration.
log data
2013-10-23 12:22:17,286 INFO ==== Outgoing ==== <?xml version="1.0" encoding="UTF-8"?> <INTERFACE_API><UserId>55555555555</UserId><MsgType>Response</MsgType><Key>001</Key><SessionID>1000</SessionID></INTERFACE_API><EOM>
2013-10-23 12:22:17,274 INFO ==== Incoming ==== <INTERFACE_API><UserId>55555555555</UserId><MsgType>Request</MsgType><Item>5</Item><Internal>INTERNAL_VALUE</Internal><SessionID>999999999999</SessionID></INTERFACE_API>
desired indexed data
2013-10-23 12:22:17,286 UserId=55555555555 Key=001
2013-10-23 12:22:17,274 UserId=55555555555 Item=5 SessionID=999999999999
props.conf
[test_io]
NO_BINARY_CHECK = 1
pulldown_type = 1
TRANSFORMS-xml = test-io-xml
transforms.conf
[test-io-xml]
REGEX = <(.*?)(?:\s[^>]*)?>([^<]*)</\\1>
FORMAT = $1=$2
DEST_KEY = _raw
I see no difference in my indexed data. I used the regex that is included in the xmlkv.py script because it does work in a search, so I figured that it is able to extract the XML values correctly.
How can I get this to work?
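One thing worth checking is the `\\1` at the end of the REGEX. In xmlkv.py the pattern presumably sits inside a Python string literal, where `\\1` is just an escaped `\1` backreference. Copied verbatim into transforms.conf, it would instead mean a literal backslash followed by `1`. The sketch below uses Python's `re` module as a stand-in for Splunk's regex engine (an assumption; the two are not identical) to compare both spellings against the sample Incoming line.

```python
import re

# Sample "Incoming" event from the log data above.
line = ('2013-10-23 12:22:17,274 INFO ==== Incoming ==== '
        '<INTERFACE_API><UserId>55555555555</UserId><MsgType>Request</MsgType>'
        '<Item>5</Item><Internal>INTERNAL_VALUE</Internal>'
        '<SessionID>999999999999</SessionID></INTERFACE_API>')

# Pattern text exactly as typed in transforms.conf.
# The raw string keeps both backslashes, so this looks for a literal "\1".
as_in_conf = r'<(.*?)(?:\s[^>]*)?>([^<]*)</\\1>'

# Same pattern with a single-backslash backreference to group 1.
with_backref = r'<(.*?)(?:\s[^>]*)?>([^<]*)</\1>'

conf_pairs = re.findall(as_in_conf, line)
backref_pairs = re.findall(with_backref, line)

print(conf_pairs)      # no tag/value pairs are found
print(backref_pairs)   # one (tag, value) tuple per leaf element
```

This only checks the pattern itself. It says nothing about how `FORMAT` and `DEST_KEY = _raw` behave when the regex matches several times in one event.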
update:
I've also tried modifying my props.conf to use SEDCMD. I added a script that will remove all tags to see if I'm able to alter the indexed data, but there is still no effect.
[test_io]
NO_BINARY_CHECK = 1
pulldown_type = 1
SEDCMD-testio = s/<[^<>]\{1,\}>//g
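A possible reason this script has no effect: `\{1,\}` is POSIX basic-regex syntax for "one or more", but in Perl-style syntax an escaped brace is a literal brace. The sketch below uses Python's `re` module as a stand-in (an assumption; SEDCMD runs in Splunk's own regex engine, not Python) to compare the pattern as written with a `+` quantifier. The `sample` string is a shortened, hypothetical event.

```python
import re

# Shortened, hypothetical event body used only for this check.
sample = '<UserId>55555555555</UserId> <Item>5</Item>'

# Interval written in POSIX basic-regex style, as in the SEDCMD above.
# In Perl-style syntax \{ and \} match literal braces, so no tag matches.
bre_style = r'<[^<>]\{1,\}>'

# Perl-style way to say "one or more characters that are not < or >".
perl_style = r'<[^<>]+>'

unchanged = re.sub(bre_style, '', sample)
stripped = re.sub(perl_style, '', sample)

print(unchanged)  # same text as sample
print(stripped)   # tags removed, values kept
```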
update 2:
The above sed script doesn't appear to work, but I was able to get another test script to work. Now that I know the props.conf is being applied correctly, is there a way to remove the XML tags and create key=value pairs with sed?
I was able to use a sed script like below to extract and format the fields I wanted.
SEDCMD-testio = s/(.*) INFO.*<UserId>(.*)<\/UserId>.*<Item>(.*)<\/Item>.*<SessionID>(.*)<\/SessionID>.*/\1 UserId=\2 Item=\3 SessionID=\4/
The indexed data then is formatted as
2013-10-23 12:22:17,274 UserId=55555555555 Item=5 SessionID=999999999999
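The substitution above can be sanity-checked outside Splunk. The sketch below replays it with Python's `re` module (an assumption; it only approximates how SEDCMD evaluates the expression) against both sample events. Because the pattern requires an `<Item>` element, the Outgoing event is left untouched. The `outgoing_pattern` with `<Key>` is a hypothetical second expression, not something from my working configuration.

```python
import re

incoming = ('2013-10-23 12:22:17,274 INFO ==== Incoming ==== '
            '<INTERFACE_API><UserId>55555555555</UserId><MsgType>Request</MsgType>'
            '<Item>5</Item><Internal>INTERNAL_VALUE</Internal>'
            '<SessionID>999999999999</SessionID></INTERFACE_API>')
outgoing = ('2013-10-23 12:22:17,286 INFO ==== Outgoing ==== '
            '<?xml version="1.0" encoding="UTF-8"?> '
            '<INTERFACE_API><UserId>55555555555</UserId><MsgType>Response</MsgType>'
            '<Key>001</Key><SessionID>1000</SessionID></INTERFACE_API><EOM>')

# Same expression as the working SEDCMD, minus the sed delimiter escapes.
incoming_pattern = (r'(.*) INFO.*<UserId>(.*)</UserId>.*<Item>(.*)</Item>'
                    r'.*<SessionID>(.*)</SessionID>.*')
incoming_format = r'\1 UserId=\2 Item=\3 SessionID=\4'

# Hypothetical counterpart for Outgoing events, which carry <Key> not <Item>.
outgoing_pattern = r'(.*) INFO.*<UserId>(.*)</UserId>.*<Key>(.*)</Key>.*'
outgoing_format = r'\1 UserId=\2 Key=\3'

incoming_result = re.sub(incoming_pattern, incoming_format, incoming, count=1)
outgoing_untouched = re.sub(incoming_pattern, incoming_format, outgoing, count=1)
outgoing_result = re.sub(outgoing_pattern, outgoing_format, outgoing, count=1)

print(incoming_result)
print(outgoing_result)
```

If both event types need rewriting, each would likely need its own SEDCMD class name in props.conf, since a single expression with a fixed set of tags only matches events that contain all of them.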