I'm using a sed script to clean up some events before they are indexed by Splunk, in order to reduce license usage. My raw data contains XML tags, and prior to indexing I reformat these tags as key=value pairs. The sed script below was working correctly. However, a change to the log introduced a left angle bracket character (<) inside a value, and the data is no longer indexed as desired.
Sed script in props.conf
s/<([^\s\>]*)[^\>]*\>([^<].*?)\<\/\1\>/ \1="\2"/g
Sample data
2014-03-20 09:35:46,193 Outgoing UserSessionLog <UserId>55555555555</UserId><MsgType>Menu</MsgType><Title>My Title</Title><MenuId>1</MenuId><Text>This is some text</Text><MenuId>2</MenuId><Text><This is text with an angle bracket</Text><Internal>User Menu</Internal><IsActive>true</IsActive><SessionID>1000</SessionID>
The above sample data is indexed as:
2014-03-20 09:35:46,193 Outgoing UserSessionLog UserId="55555555555" MsgType="Menu" Title="My Title" MenuId="1" Text="This is some text" MenuId="2"<Text><This is text with an angle bracket</Text> Internal="User Menu" IsActive="true" SessionID="1000"
As you can see, the regular expression does not match the second Text key because of the angle bracket (<), so the value is not assigned properly. It should be Text="<This is text with an angle bracket". I have been unable to modify the regular expression to handle this scenario.
Any help or suggestions would be greatly appreciated!
Try one of these sed expressions.
To remove the "<" from the value of Text:
s/<([^\s\>]*)[^\>]*\>[<]*([^<].*?)\<\/\1\>/ \1=\"\2\"/g
To keep the "<" in the value of Text:
s/<([^\s\>]*)[^\>]*\>(.*?)\<\/\1\>/ \1=\"\2\"/g
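If you want to check the behaviour outside Splunk, here is a minimal sketch in Python. It assumes Python's re module treats these constructs the same way as the engine Splunk uses for SEDCMD, which should hold for the lazy quantifier and the backreference but is not verified against Splunk itself. The names ORIGINAL, DROP_BRACKET, KEEP_BRACKET and to_key_value are made up for illustration, and the patterns are the ones from this thread with the redundant backslashes removed.

```python
import re

# Pattern from the question: the value must start with a non-"<" character.
ORIGINAL = re.compile(r'<([^\s>]*)[^>]*>([^<].*?)</\1>')
# First suggestion: skip any leading "<" characters, then capture the value.
DROP_BRACKET = re.compile(r'<([^\s>]*)[^>]*>[<]*([^<].*?)</\1>')
# Second suggestion: capture lazily up to the matching closing tag.
KEEP_BRACKET = re.compile(r'<([^\s>]*)[^>]*>(.*?)</\1>')

# Shortened version of the sample event from the question.
SAMPLE = ('2014-03-20 09:35:46,193 Outgoing UserSessionLog '
          '<MenuId>2</MenuId>'
          '<Text><This is text with an angle bracket</Text>'
          '<Internal>User Menu</Internal>')


def to_key_value(event, pattern):
    """Rewrite each <Tag>value</Tag> pair matched by pattern as  Tag="value"."""
    return pattern.sub(r' \1="\2"', event)


for name, pattern in (('original', ORIGINAL),
                      ('drop bracket', DROP_BRACKET),
                      ('keep bracket', KEEP_BRACKET)):
    print(name, '->', to_key_value(SAMPLE, pattern))
```

The original pattern leaves the second Text element untouched, the first suggestion yields Text without the bracket, and the second keeps it. One difference to be aware of: because the second pattern no longer requires a first character, it also matches empty elements such as <Text></Text> and rewrites them as Text="".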
Try the second option; it keeps the "<" as part of the value.
Thanks! Is there a way to keep the "<" if it is part of the value? Other than that, this seems to work so I may use it anyways and just discard the bracket.