Splunk Search

XML input line-breaking and field extraction - how?

Justin_Grant
Contributor

I am trying to index an XML file which looks like this:

 <?xml version="1.0" encoding="utf-8" ?> 
 <Posts2Votes>
  <row>
   <Id>1</Id> 
   <PostId>7</PostId> 
   <UserId>2</UserId> 
   <VoteTypeId>2</VoteTypeId> 
   <CreationDate>2009-11-06T02:22:37.063</CreationDate> 
   <TargetUserId>7</TargetUserId> 
   <TargetRepChange>10</TargetRepChange> 
   <IPAddress>64.127.105.60</IPAddress> 
  </row>
  <row>
   <Id>2</Id> 
   <PostId>6</PostId> 
   <UserId>2</UserId> 
   <VoteTypeId>2</VoteTypeId> 
   <CreationDate>2009-11-06T02:22:38.25</CreationDate> 
   <TargetUserId>31</TargetUserId> 
   <TargetRepChange>10</TargetRepChange> 
   <IPAddress>64.127.105.60</IPAddress> 
  </row>
  <!-- more "row" elements go here -->
 </Posts2Votes>

Splunk's default parser recognizes the timestamps correctly, but it does not split the events on each <row> element, and no fields are extracted by default. How do I break the events correctly and extract these fields? Any ideas?

1 Solution

gkanapathy
Splunk Employee

props.conf

TIME_PREFIX = \<CreationDate\>
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%3N
SHOULD_LINEMERGE = false
LINE_BREAKER = \>\s*(?=\<row\>)
REPORT-xmlext = xml-extr

transforms.conf

[xml-extr]
REGEX = \<(\w+)\>([^\>]*)\<\1\>
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true

should do it.
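
For anyone who wants to check where these patterns match before editing props.conf, here is a minimal Python sketch. It is not how Splunk works internally. It assumes Python's re module handles these simple patterns the same way Splunk's regex engine does, it treats the whitespace between `>` and `<row>` as the discarded delimiter, and it uses `%f` as a stand-in for Splunk's `%3N`. The names SAMPLE, split_events and event_time are made up for illustration.

```python
import re
from datetime import datetime

# Trimmed copy of the XML from the question (hypothetical test input).
SAMPLE = """<?xml version="1.0" encoding="utf-8" ?>
<Posts2Votes>
 <row>
  <Id>1</Id>
  <CreationDate>2009-11-06T02:22:37.063</CreationDate>
 </row>
 <row>
  <Id>2</Id>
  <CreationDate>2009-11-06T02:22:38.25</CreationDate>
 </row>
</Posts2Votes>
"""

# Same patterns as LINE_BREAKER and TIME_PREFIX in the answer.
BREAK = re.compile(r"\>\s*(?=\<row\>)")
TIME_PREFIX = re.compile(r"\<CreationDate\>")


def split_events(text):
    """Split text wherever BREAK matches, keeping the '>' and dropping the whitespace."""
    events, start = [], 0
    for m in BREAK.finditer(text):
        events.append(text[start:m.start() + 1])  # previous event ends with '>'
        start = m.end()                           # next event starts at '<row>'
    events.append(text[start:])
    return events


def event_time(event):
    """Parse the timestamp that follows the TIME_PREFIX match, or return None."""
    prefix = TIME_PREFIX.search(event)
    if prefix is None:
        return None
    stamp = re.match(r"[^<]+", event[prefix.end():])
    if stamp is None:
        return None
    return datetime.strptime(stamp.group(0), "%Y-%m-%dT%H:%M:%S.%f")


events = split_events(SAMPLE)
for e in events:
    print(event_time(e), repr(e[:20]))
```

On this trimmed sample the split yields three events: the XML header, the first row, and the second row with the closing </Posts2Votes> still attached. One caveat: the props.conf documentation describes LINE_BREAKER as discarding the text matched by the first capturing group, and the pattern above contains only a lookahead. If events do not break as expected, writing the whitespace as `(\s*)` may be worth trying.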


charlie_park2
Explorer

Thanks. This is a very helpful post. The documentation really should be a lot more newbie-friendly. Thanks.


woodcock
Esteemed Legend

This is tested and working:

REGEX = <([^>]+)>([^<]*)<\/\1>

gljiva
Path Finder

There is a small error in the regex above: the closing tag needs a slash, and the value should stop at the next "<". The correct one is:

REGEX = \<(\w+)\>([^\<]*)\</\1\>
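
To see why the slash in the closing tag matters, the three patterns posted in this thread can be compared on one sample row. This is a minimal Python sketch, assuming Python's re module behaves like Splunk's regex engine for these simple patterns. The names ROW, ORIGINAL, CORRECTED and WOODCOCK are illustrative only, and MV_ADD and REPEAT_MATCH are approximated by collecting every match with findall.

```python
import re

# One event from the question's sample data (trimmed, hypothetical test input).
ROW = """<row>
 <Id>1</Id>
 <PostId>7</PostId>
 <CreationDate>2009-11-06T02:22:37.063</CreationDate>
 <IPAddress>64.127.105.60</IPAddress>
</row>"""

# Pattern from the accepted answer: expects "<Id>" again instead of "</Id>".
ORIGINAL = re.compile(r"\<(\w+)\>([^\>]*)\<\1\>")
# Pattern from gljiva's correction: closing tag has a slash, value stops at "<".
CORRECTED = re.compile(r"\<(\w+)\>([^\<]*)\</\1\>")
# Pattern from woodcock's reply.
WOODCOCK = re.compile(r"<([^>]+)>([^<]*)<\/\1>")

# findall returns one (name, value) tuple per match, like FORMAT = $1::$2.
print("original: ", ORIGINAL.findall(ROW))
print("corrected:", CORRECTED.findall(ROW))
print("woodcock: ", WOODCOCK.findall(ROW))

fields = dict(CORRECTED.findall(ROW))
```

On this row the original pattern finds nothing, while the corrected pattern and woodcock's pattern both return the four leaf elements as name/value pairs. The enclosing <row> element is not extracted by any of them, because its content contains other tags.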

BunnyHop
Contributor

Were you able to get this to work? I tried it, but it does not break the events from one another cleanly.

My data does have nested elements inside the top group: after the row group there is a subrow that contains data for the row group, so that might be what is throwing me off.
