Hi Splunker,
I would like to learn how I can rex out these field names, but I don't want to rex out startTimestamp and endTimestamp.
<activityName>TubeSales<activityName>
<activityStatus>Play<activityStatus>
<startTimestamp>Do not want to extract<startTimestamp>
<endTimestamp>Do not want to extract<endTimestamp>
<JourneyID>3DF62A1191152ED064B039AFD2C6A81E.node-app-1<JourneyID>
<startID>C3FE7047-E9EA-78DE-D719-8D3D66EF4A1F<startID>
<JourneyOrderPointsByProductCode>
<ProductCode>16<ProductCode>
<JourneyOrderPoints>130<JourneyOrderPoints>
<JourneyOrderPointsByProductCode>
<success>
<GetRequiredJourneyOrderPointsend>
</S:Body>
Thanks in advance
Two things -
1) To be proper XML or HTML, the second time the field is named, to close the tag, it must have a slash in front of it. Example:
<activityName>TubeSales</activityName>
I'm going to assume that is the case, because otherwise you have much bigger problems than how to write the rex.
This one here will extract all the individual fields, including the two timestamps you don't want, but not including the multi-line JourneyOrderPointsByProductCode...
\<(?<fieldname>\w+)\>(?<fieldvalue>[^\<]+)\<\/?\1\>
Here it is, built up with a negative assertion to ignore the two Timestamps...
\<(?!startTimestamp|endTimestamp)(?<fieldname>\w+)\>(?<fieldvalue>[^\<]+)\<\/?\1\>
Both of those regexes will work for any tags that are opened and closed, even if they lack the slash in the end tag. If you verify that your markup language has the proper slashes on the close tags, then remove the very last question mark from both regexes.
Now, that all being said, you are much better off using @nikeynilay's advice and using the spath command.
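To see what the second regex picks up outside of Splunk, here is a minimal sketch in Python. It is only an illustration under some assumptions: Python's re module writes named groups as (?P<name>...) and the backreference as (?P=name), whereas Splunk's rex accepts (?<name>...) and \1, and the sample event below is the one from the question with the closing slashes assumed to be present. The function name extract_fields is made up for this example.

```python
import re

# Sample event from the question, assuming the closing tags carry a slash.
EVENT = """<activityName>TubeSales</activityName>
<activityStatus>Play</activityStatus>
<startTimestamp>Do not want to extract</startTimestamp>
<endTimestamp>Do not want to extract</endTimestamp>
<JourneyID>3DF62A1191152ED064B039AFD2C6A81E.node-app-1</JourneyID>
<startID>C3FE7047-E9EA-78DE-D719-8D3D66EF4A1F</startID>
<JourneyOrderPointsByProductCode>
<ProductCode>16</ProductCode>
<JourneyOrderPoints>130</JourneyOrderPoints>
</JourneyOrderPointsByProductCode>
<success>
"""

# Python spelling of the rex pattern above:
#   (?!...)          negative lookahead that skips the two timestamp tags
#   (?P<fieldname>)  the tag name
#   (?P<fieldvalue>) everything up to the next "<"
#   </?(?P=fieldname)>  closing tag with an optional slash and the same name
PATTERN = re.compile(
    r"<(?!startTimestamp|endTimestamp)(?P<fieldname>\w+)>"
    r"(?P<fieldvalue>[^<]+)"
    r"</?(?P=fieldname)>"
)

def extract_fields(text):
    """Return a dict of tag name -> value for every matching tag pair."""
    return {m.group("fieldname"): m.group("fieldvalue")
            for m in PATTERN.finditer(text)}

fields = extract_fields(EVENT)
for name, value in fields.items():
    print(name, "=", value)
```

The two timestamps, the multi-line JourneyOrderPointsByProductCode wrapper and the lone <success> tag are skipped. Note that the lookahead also skips any tag whose name merely begins with startTimestamp or endTimestamp. In Splunk the same pattern would go into rex, most likely with max_match=0 so that every tag pair in the event is captured rather than only the first one.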
Hi DalJeanis,
It was great stuff, the query worked absolutely fine.
Just wanted to ask one question
<(?!startTimestamp|endTimestamp)(?<fieldname>\w+)>(?<fieldvalue>[^<]+)<\/?\1>
--> <\/?\1> <----
What is this part actually doing? I am able to understand the whole query except for this last part and the \1 which you have written at the end.
Thanks again, it was really awesome stuff.
Regards,
Tarun Malhotra
That whole thing is to find the closing tag for the same opening tag. That is how we avoid picking up the <success> keyword, because it is not followed by a close tag, so it is not calling out a field name and value.
\< means "match only the opening < of the next html/xml tag"
\/? means "match an optional slash \/ if it is there, but due to the ? if it is not there then that's okay too."
\1 means "match another copy of the first group that was previously matched... in this case that would be the group called fieldname"
\> means "match only the ending > of the html/xml tag"
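The behaviour of that closing part can be checked with a small sketch. This is Python rather than Splunk, so take it as an approximation: the groups are left unnamed so that \1 can be used exactly as in the rex, and the helper name match_pair is invented for the example.

```python
import re

# <(\w+)>   opening tag, group 1 holds the tag name
# ([^<]+)   the value, group 2
# </?\1>    "<", an optional slash, the same name as group 1, then ">"
PAIR = re.compile(r"<(\w+)>([^<]+)</?\1>")

def match_pair(text):
    """Return (name, value) for the first matching tag pair, or None."""
    m = PAIR.search(text)
    return (m.group(1), m.group(2)) if m else None

samples = [
    "<ProductCode>16</ProductCode>",  # proper closing tag
    "<ProductCode>16<ProductCode>",   # slash missing, still matched due to the ?
    "<ProductCode>16</JourneyID>",    # closing name differs from group 1
    "<success>\n</S:Body>",           # no closing tag with the same name
]

for s in samples:
    print(repr(s), "->", match_pair(s))
```

The first two samples produce a name and value, while the last two produce None, because \1 only accepts a repeat of the exact tag name captured by the opening tag.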
Hi DalJeanis,
Thanks for the explanation, it was really helpful. 🙂
@m7787580, you should use spath (which is meant to parse XML or JSON data) to output the fields you need. (http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Spath)
You should also look into the feasibility of extracting the XML data at search time by setting KV_MODE = xml when defining the sourcetype. (http://docs.splunk.com/Documentation/Splunk/latest/Admin/Propsconf)
Hi,
How about this one?
(?:\<activityName\>)(?<activityName>[^\<]+)(?:\<activityName\>)[\r\n](?:\<activityStatus\>)(?<activityStatus>[^\<]+)(?:\<activityStatus\>)[\r\n].+[\r\n].+[\r\n](?:\<JourneyID\>)(?<JourneyID>[^\<]+)(?:\<JourneyID\>)[\r\n](?:\<startID\>)(?<startID>[^\<]+)(?:\<startID\>)[\r\n].+[\r\n](?:\<ProductCode\>)(?<ProductCode>[^\<]+)(?:\<ProductCode\>)[\r\n](?:\<JourneyOrderPoints\>)(?<JourneyOrderPoints>[^\<]+)(?:\<JourneyOrderPoints\>)
Hi Pyro_wood,
Thanks for the solution, I understood it.
But what if I don't want to write all the field names again and again?
We can see that all fields start with < and end with </
Would it be possible to write a single rex command like
rex field=_raw starting from <(capturing Name)>(Capturing Value)</
As we can see, all the fields follow the same format shown below, starting with < and ending with </
<ProductCode>16</ProductCode>
<JourneyOrderPoints>130</JourneyOrderPoints>
If I can have a single standard rex query, then I can run it on any service irrespective of the field name and value.
Thanks in advance