I'm trying to build 1 regex to capture multiple sets of data. Below is a sample:
1. 20110221124637|21410|SENT:0.646861|51B11A011801830658
2. 20110221124854|21411|RECVD|00345251B1
3. 20110221124854|362|003452|SENT: 3.198847|51B11A011801830658
4. 20110221124854|362|003452|RECVD|00345251B1
5. 20110221160534|431|011867|RECVD|01186751B1
6. 20110221160534|431|011867|SENT: 0.278782|51B11A011801830658
Basically these lines are from different versions of a piece of software, but all contain the same information in different places. These are the fields I'd like to extract: Date, BinNumber, ClaimTrackNum, Direction, RespTime and ClaimType.
Here is the date:
1. 20110221124637 |21410|SENT:0.646861|51B11A011801830658
2. 20110221124854 |21411|RECVD|00345251B1
3. 20110221124854 |362|003452|SENT: 3.198847|51B11A011801830658
4. 20110221124854 |362|003452|RECVD|00345251B1
5. 20110221160534 |431|011867|RECVD|01186751B1
6. 20110221160534 |431|011867|SENT: 0.278782|51B11A011801830658
ClaimTrackNum:
1. 20110221124637 |21410| SENT:0.646861|51B11A011801830658
2. 20110221124854 |21411| RECVD|00345251B1
3. 20110221124854 |362| 003452|SENT: 3.198847|51B11A011801830658
4. 20110221124854 |362| 003452|RECVD|00345251B1
5. 20110221160534 |431| 011867|RECVD|01186751B1
6. 20110221160534 |431| 011867|SENT: 0.278782|51B11A011801830658
BinNumber (May exist or not, or be in 2 spots):
1. 20110221124637|21410|SENT:0.646861|51B11A011801830658
2. 20110221124854|21411|RECVD| 003452 51B1
3. 20110221124854|362| 003452 |SENT: 3.198847|51B11A011801830658
4. 20110221124854|362| 003452 |RECVD| 003452 51B1
5. 20110221160534|431| 011867 |RECVD| 011867 51B1
6. 20110221160534|431| 011867 |SENT: 0.278782|51B11A011801830658
RespTime (Only exists after a "SENT:" (old style) or "SENT: " (new)):
1. 20110221124637|21410|SENT: 0.646861 |51B11A011801830658
2. 20110221124854|21411|RECVD|00345251B1
3. 20110221124854|362|003452|SENT: 3.198847 |51B11A011801830658
4. 20110221124854|362|003452|RECVD|00345251B1
5. 20110221160534|431|011867|RECVD|01186751B1
6. 20110221160534|431|011867|SENT: 0.278782 |51B11A011801830658
ClaimType:
1. 20110221124637|21410|SENT:0.646861| 51B1 1A011801830658
2. 20110221124854|21411|RECVD|003452 51B1
3. 20110221124854|362|003452|SENT: 3.198847| 51B1 1A011801830658
4. 20110221124854|362|003452|RECVD|003452 51B1
5. 20110221160534|431|011867|RECVD|011867 51B1
6. 20110221160534|431|011867|SENT: 0.278782| 51B1 1A011801830658
And lastly the direction is either the word "SENT" or "RECVD"
I've come up with this regex, which fully matches everything:
(?'SW_Date'\d{14})\|(?'SW_ClaimTrackNum'\d+)\|(?>(?'SW_BinNumber'[0-9]{6})\||)(?>([A-z: ]+(?'SW_RespTime'[0-9.]+))|([A-z: ]+))\|(?>(?'SW_BinNumber'[0-9]{6})(?'SW_ClaimType'[0-9A-z]{4})|(?'SW_ClaimType'[0-9A-z]{4}))
Which splunk doesn't like... If I simply rename any duplicate field to different names, splunk doesn't have an issue.
(?'SW_Date'\d{14})\|(?'SW_ClaimTrackNum'\d+)\|(?>(?'SW_NewBinNumber'[0-9]{6})\||)(?>([A-z: ]+(?'SW_RespTime'[0-9.]+))|([A-z: ]+))\|(?>(?'SW_OldBinNumber'[0-9]{6})(?'SW_RecvdClaimType'[0-9A-z]{4})|(?'SW_SentClaimType'[0-9A-z]{4}))
but naturally splunk now thinks that "newbinnumber" and "oldbinnumber" are different fields when they are the exact same.
How can I write this regex to always extract those fields as "BinNumber" and "ClaimType"?
Use coalesce
like this:
... (?'SW_Date'd{14})|(?'SW_ClaimTrackNum'd+)|(?>(?'temp_NewBinNumber'[0-9]{6})||)(?>([A-z: ]+(?'SW_RespTime'[0-9.]+))|([A-z: ]+))|(?>(?'temp_OldBinNumber'[0-9]{6})(?'temp_RecvdClaimType'[0-9A-z]{4})|(?'temp_SentClaimType'[0-9A-z]{4})) | eval SW_BinNumber=coalesce(temp_NewBinNumber, temp_OldBinNumber) |eval SW_ClaimType=coalesce(temp_RecvdClaimType, temp_SentClaimType) | fields - temp*