Splunk Search

regex expression works sometimes and not at others

bsizemore
Path Finder

Hello,

I am working with Omniture Data contained in a csv in a scripted data input. I fetch a file using curl, chop the header of the file off using sed. The data looks like this before it is sent to the indexers: http://pastebin.com/HfjuT0Nq

I use the following query to regex out the data into the following fields:

index=entapps sourcetype=seoAdHocAnalysis | rex "^(?P<Dimension>[^,]+)[^,\n]*,(?P<Item>[^,]+),(?P<Visits>[^x]+),(?P<ChannelImageSearch>.+)[^,\n]*,(?P<ChannelOrganicSearch>.+)" | table _time Dimension, Item, Visits, ChannelImageSearch, ChannelOrganicSearch | rename ChannelImageSearch as "Channel Image Search" ChannelOrganicSearch as "Channel Organic Search"

Some of the results are totally correct, while others are wrong (denoted with pink highlighting):
https://tmpdmp.com/bdd93316f0fc85f7/a63faaad56b830b1

I know why: some csv rows have numbers with quotations around them that contains commas as a part of the value. This situation makes it impossible to use the field extraction feature of 6.x in my experience thusfar. I assume there is a better way, and I am looking at sed/awk as a method for grooming k/v pairs before ingestion, but that has its own bugaboos.

Tags (3)
0 Karma
1 Solution

bsizemore
Path Finder

I solved this with the following regex:

^(?P<Dimension>[^,]+),(?P<Item>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<Visits>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelImageSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelOrganicSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b))$

The field extraction looks like this:

index=entapps sourcetype=seoAdHocAnalysis | eval Visits=trim(Visits,"\"") | eval ChannelImageSearch=trim(ChannelImageSearch,"\"") | eval ChannelOrganicSearch=trim(ChannelOrganicSearch,"\"") | table Dimension, Item, Visits, ChannelImageSearch, ChannelOrganicSearch | rename ChannelImageSearch as "Channel Image Search" ChannelOrganicSearch as "Channel Organic Search"

The outcome of this regex, and turning it into a field extraction yielded the following results: https://tmpdmp.com/5242b8819245b48d/ade0131ff0693292

View solution in original post

bsizemore
Path Finder

I solved this with the following regex:

^(?P<Dimension>[^,]+),(?P<Item>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<Visits>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelImageSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b)),(?P<ChannelOrganicSearch>(?:\"[^,]+,[^,]+\"|\b[^,]+\b))$

The field extraction looks like this:

index=entapps sourcetype=seoAdHocAnalysis | eval Visits=trim(Visits,"\"") | eval ChannelImageSearch=trim(ChannelImageSearch,"\"") | eval ChannelOrganicSearch=trim(ChannelOrganicSearch,"\"") | table Dimension, Item, Visits, ChannelImageSearch, ChannelOrganicSearch | rename ChannelImageSearch as "Channel Image Search" ChannelOrganicSearch as "Channel Organic Search"

The outcome of this regex, and turning it into a field extraction yielded the following results: https://tmpdmp.com/5242b8819245b48d/ade0131ff0693292

Get Updates on the Splunk Community!

Introducing the 2024 SplunkTrust!

Hello, Splunk Community! We are beyond thrilled to announce our newest group of SplunkTrust members!  The ...

Introducing the 2024 Splunk MVPs!

We are excited to announce the 2024 cohort of the Splunk MVP program. Splunk MVPs are passionate members of ...

Splunk Custom Visualizations App End of Life

The Splunk Custom Visualizations apps End of Life for SimpleXML will reach end of support on Dec 21, 2024, ...