Splunk Search

regex expression works sometimes and not at others

bsizemore
Path Finder

Hello,

I am working with Omniture Data contained in a csv in a scripted data input. I fetch a file using curl, chop the header of the file off using sed. The data looks like this before it is sent to the indexers: http://pastebin.com/HfjuT0Nq

I use the following query to regex out the data into the following fields:

index=entapps sourcetype=seoAdHocAnalysis | rex "^(?P<Dimension>[^,]+)[^,\n]*,(?P<Item>[^,]+),(?P<Visits>[^x]+),(?P<ChannelImageSearch>.+)[^,\n]*,(?P<ChannelOrganicSearch>.+)" | table _time Dimension, Item, Visits, ChannelImageSearch, ChannelOrganicSearch | rename ChannelImageSearch as "Channel Image Search" ChannelOrganicSearch as "Channel Organic Search"

Some of the results are totally correct, while others are wrong (denoted with pink highlighting):
https://tmpdmp.com/bdd93316f0fc85f7/a63faaad56b830b1

I know why: some csv rows have numbers with quotations around them that contains commas as a part of the value. This situation makes it impossible to use the field extraction feature of 6.x in my experience thusfar. I assume there is a better way, and I am looking at sed/awk as a method for grooming k/v pairs before ingestion, but that has its own bugaboos.

Tags (3)
0 Karma
1 Solution

bsizemore
Path Finder

I solved this with the following regex:

^(?P<Dimension>[^,]+),(?P<Item>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<Visits>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelImageSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelOrganicSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b))$

The field extraction looks like this:

index=entapps sourcetype=seoAdHocAnalysis | eval Visits=trim(Visits,"\"") | eval ChannelImageSearch=trim(ChannelImageSearch,"\"") | eval ChannelOrganicSearch=trim(ChannelOrganicSearch,"\"") | table Dimension, Item, Visits, ChannelImageSearch, ChannelOrganicSearch | rename ChannelImageSearch as "Channel Image Search" ChannelOrganicSearch as "Channel Organic Search"

The outcome of this regex, and turning it into a field extraction yielded the following results: https://tmpdmp.com/5242b8819245b48d/ade0131ff0693292

View solution in original post

bsizemore
Path Finder

I solved this with the following regex:

^(?P<Dimension>[^,]+),(?P<Item>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<Visits>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelImageSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelOrganicSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b))$

The field extraction looks like this:

index=entapps sourcetype=seoAdHocAnalysis | eval Visits=trim(Visits,"\"") | eval ChannelImageSearch=trim(ChannelImageSearch,"\"") | eval ChannelOrganicSearch=trim(ChannelOrganicSearch,"\"") | table Dimension, Item, Visits, ChannelImageSearch, ChannelOrganicSearch | rename ChannelImageSearch as "Channel Image Search" ChannelOrganicSearch as "Channel Organic Search"

The outcome of this regex, and turning it into a field extraction yielded the following results: https://tmpdmp.com/5242b8819245b48d/ade0131ff0693292

Get Updates on the Splunk Community!

Stay Connected: Your Guide to May Tech Talks, Office Hours, and Webinars!

Take a look below to explore our upcoming Community Office Hours, Tech Talks, and Webinars this month. This ...

They're back! Join the SplunkTrust and MVP at .conf24

With our highly anticipated annual conference, .conf, comes the fez-wearers you can trust! The SplunkTrust, as ...

Enterprise Security Content Update (ESCU) | New Releases

Last month, the Splunk Threat Research Team had two releases of new security content via the Enterprise ...