Solved: Fixed-width field extraction but removing trailing...

rturk · ‎07-12-2012

I am currently defining some sourcetypes for some db2 SMF logs (oh joy). Luckily, the fields are well defined and are fixed length, so extracting the values at the correct boundaries is not an issue. As an example, using this regex:

rex "(?m)^0(?<primauth>.{8})\s(?<connect>.{8})\s(?<instance>\S+)\s(?<end_user>.{16})\s(?<ws_name>.{29})\s(?<transact>.{50})\s\n0(?<origauth>.{8})\s(?<corrname>.{8})\s(?<conntype>.{12})\s"

On this event:

0=======================================================================================================
 PRIMAUTH CONNECT  INSTANCE       END_USER      WS_NAME                       TRANSACT                                               
 ORIGAUTH CORRNAME CONNTYPE       RECORD TIME   DESTNO     IFC DESCRIPTION    DATA                                                   
 PLANNAME CORRNMBR                TCB CPU TIME             ID                                                                        
 -------- -------- ------------ -------------------------- --- -------------- --------------------------
0A1B2C3   SERVER   X'123456789012' A12345           ABCD123                       SQLA.exe                                           
0Z9Y8X7   N/A      REMOTE     M 15:46:05        1234567890 140 Audit Auth Failures                                                   
0DISTSERV 'BLANK'

Gives me the following fields and their values (the quotes are mine obviously):

primauth = "A1B2C3  "
connect = "SERVER  "
instance = "X'123456789012'"
end_user = "A12345          "
ws_name = "ABCD123                      "
transact = "SQLA.exe                                          "
...

You get the idea, and you probably see where I'm going with this.

How can I maintain the fixed field widths for extractions, but disregard trailing spaces in the values (something that you might do with "chomp" in perl). I would rather not have to append a wildcard every time I do a search for a specific field value just because I'm unsure how many spaces there are.

I very much suspect regex is the solution to my problems here but I've been scratching my head about this for a while.

Hoping someone can help 🙂

rturk · ‎07-12-2012

Ahhh... so once again I answer my own question.

Turns out that specifying the regular expression in your search query will capture spaces in the extracted values, but putting that same regex in your props.conf or transforms.conf will "chomp" the values for you while keeping your fixed-width field definitions.

props.conf:

[db2_system_parameters]
KV_MODE                 = none
SHOULD_LINEMERGE        = true
BREAK_ONLY_BEFORE       = ^0==
TIME_FORMAT             = %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 800
REPORT-header_fields    = db2_header_fields

transforms.conf:

[db2_header_fields]
REGEX = (?m)^0(?<primauth>.{8})\s(?<connect>.{8}) ...

Hope this helps someone else, as it wasn't behaviour I was aware of 🙂

View solution in original post

marcoscala · ‎11-20-2014

beware that that solution works until you have AT LEAST one blank between two fields. It you have adjacent fields, with no space in between, then Splunk has problems in handling the two fields as two different strings.

I had that problem in a case where the log had fixed lenght fields but with adjacent fields(no separators).
For instance, envent il something like

xxxxxyyyyyyyyy....... where field1 are the first 5 chars, then field2 and so on.

In that case, fields extractions works, but if you try a search line "field1=xxxxx", that Splunk returns no result. To have some result you should write "field1=xxxxx*".
This is because Splunk indexes just one string made of "xxxxxyyyyyyy...." And if you have to search for field2, it's event worst: you must write "field2=yyyyyyyyy". As you can imagine, performance implications are terrible!

The solution I tried (and it worked perfectly) was to use a transform during the parsing phase and write the _raw adding a separator (. or | or a blank) between fields. Then Splunk indexes correctly the fields.

Marco

Saurabhsood · ‎05-08-2017

@Marcoscala
U wrote
"The solution I tried (and it worked perfectly) was to use a transform during the parsing phase and write the _raw adding a separator (. or | or a blank) between fields. Then Splunk indexes correctly the fields."

Can you please guide me how to do this? ..I am new in splunk.

rturk · ‎07-12-2012

Ahhh... so once again I answer my own question.

Turns out that specifying the regular expression in your search query will capture spaces in the extracted values, but putting that same regex in your props.conf or transforms.conf will "chomp" the values for you while keeping your fixed-width field definitions.

props.conf:

[db2_system_parameters]
KV_MODE                 = none
SHOULD_LINEMERGE        = true
BREAK_ONLY_BEFORE       = ^0==
TIME_FORMAT             = %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 800
REPORT-header_fields    = db2_header_fields

transforms.conf:

[db2_header_fields]
REGEX = (?m)^0(?<primauth>.{8})\s(?<connect>.{8}) ...

Hope this helps someone else, as it wasn't behaviour I was aware of 🙂

Drainy · ‎07-12-2012

Why not just use a [^\s]+ to capture everything up until a space character? you can then use a \s+ to move ahead to the start of the next character set.

rturk · ‎07-12-2012

That was how I approached it initially, but unfortunately not all fields are mandatory so the fixed-width definitions have to be kept for data integrity. Thanks for the suggestion though 🙂

Fixed-width field extraction but removing trailing spaces

.conf24 | Registration Open!

ICYMI - Check out the latest releases of Splunk Edge Processor

Introducing the 2024 SplunkTrust!