Splunk Search

Regex Question - empty field?

msarro
Builder

Hey everyone. I am trying to assemble a regular expression to pull fields out of a set of CSV files. The issue is that some of the fields are often empty, but other times they aren't. I need to parse through them because some values are important and others aren't, and I need the ability to send unimportant things to the nullQueue.

Here is an example:

1,blah@blah.com,3,4,5,6
1,blar@blar.com,,4,5,6
1,fnyuh@fnyuh.com,3,4,,6

How can I use regular expression capture groups to still match rows like these?
Right now I'm building the expression like this:

(?P<digit1>[^,]+)(?:,)(?P<contact_id>[^,]+)(?:,)(?P<digit3>[^,]+)(?:,)(?P<digit4>[^,]+)(?:,)(?P<digit5>[^,]+)(?:,)(?P<digit6>[^,]+)(?:,)
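To see what's going wrong, here's a quick Python sketch of the same pattern (with the trailing `(?:,)` dropped so complete rows can match at all):

```python
import re

# The pattern from the question: [^,]+ requires at least one character
# per column, so any row with an empty column fails to match.
plus = re.compile(
    r"(?P<digit1>[^,]+),(?P<contact_id>[^,]+),(?P<digit3>[^,]+),"
    r"(?P<digit4>[^,]+),(?P<digit5>[^,]+),(?P<digit6>[^,]+)"
)

line_full = "1,blah@blah.com,3,4,5,6"
line_empty = "1,blar@blar.com,,4,5,6"

print(bool(plus.match(line_full)))   # True
print(bool(plus.match(line_empty)))  # False - column 3 is empty

# Changing + to * lets each group match zero characters, so empty
# columns are captured as empty strings instead of breaking the match.
star = re.compile(
    r"(?P<digit1>[^,]*),(?P<contact_id>[^,]*),(?P<digit3>[^,]*),"
    r"(?P<digit4>[^,]*),(?P<digit5>[^,]*),(?P<digit6>[^,]*)"
)
m = star.match(line_empty)
print(repr(m.group("digit3")))  # ''
```

In other words, switching each `[^,]+` to `[^,]*` is enough to accept empty columns.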

Any help would be appreciated. This is the first of multiple data sources I have to go through and do this for. Splunk's handling of structured data sucks when you need to selectively keep or throw away data.

1 Solution

hazekamp
Builder

msarro,

When writing regular expressions for TRANSFORMS statements, you don't necessarily need to capture anything. Simply write a regular expression that identifies things you want to nullQueue. For instance, if you wanted to nullQueue whenever there is a '3' in column 3, you would specify:

## props.conf
[<spec>]
TRANSFORMS-nullQueue_for_spec = nullQueue_for_spec

## transforms.conf
[nullQueue_for_spec]
## Here we use * to specify 0 or more
REGEX = [^,]*,[^,]*,3,
DEST_KEY = queue
FORMAT = nullQueue
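A rough Python sketch of what that transform does (my illustration, not Splunk itself; the `^` anchor is my addition — Splunk applies REGEX as an unanchored search, so without it a '3,' in a later column could also match):

```python
import re

# Events matching the REGEX are routed to nullQueue (dropped);
# everything else continues on to be indexed.
null_regex = re.compile(r"^[^,]*,[^,]*,3,")

events = [
    "1,blah@blah.com,3,4,5,6",   # '3' in column 3 -> sent to nullQueue
    "1,blar@blar.com,,4,5,6",    # column 3 empty  -> indexed
]
kept = [e for e in events if not null_regex.search(e)]
print(kept)  # ['1,blar@blar.com,,4,5,6']
```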

We can add field extraction to events that make it past the nullQueue by adding the following props/transforms:

## props.conf
[<spec>]
KV_MODE = none
REPORT-kv_for_spec = kv_for_spec

## transforms.conf
[kv_for_spec]
DELIMS = ","
FIELDS = column1,column2,column3...
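Conceptually, DELIMS/FIELDS just splits each event on the delimiter and pairs the values with the configured names — something like this Python analogue (the `column*` names are the placeholders from the transforms stanza above):

```python
# Rough analogue of DELIMS = "," / FIELDS = column1,column2,...:
# split the event on the delimiter and zip with the field names.
fields = ["column1", "column2", "column3", "column4", "column5", "column6"]
event = "1,blar@blar.com,,4,5,6"

extracted = dict(zip(fields, event.split(",")))
print(extracted["column2"])        # 'blar@blar.com'
print(repr(extracted["column3"]))  # '' - empty columns still yield a field
```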

If you really don't want certain columns extracted, then instead of specifying FIELDS/DELIMS in your REPORT-kv_for_spec property, use a regular expression similar to the nullQueue one, with capture groups around the columns you want to pull out. The point is to separate the nullQueue process from the field extraction process.

Update:

If you were interested in capturing all events, but only certain columns, try:

## props.conf
[<spec>]
TRANSFORMS-make_raw_for_spec = make_raw_for_spec

## transforms.conf
[make_raw_for_spec]
DEST_KEY = _raw
REGEX = ([^,]*),([^,]*),([^,]*),
## Drop column 2
FORMAT = $1,$3
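To see what that FORMAT does to _raw, here's a small Python simulation (my sketch, not Splunk itself). One caveat worth noting: FORMAT rebuilds _raw from the capture groups alone, so with the REGEX above columns 4-6 vanish along with column 2 — the alternate REGEX/FORMAT pair at the end is my suggestion for keeping the tail:

```python
import re

# DEST_KEY=_raw with FORMAT=$1,$3: _raw is rebuilt from the capture
# groups alone, so anything not captured disappears.
raw = "1,blah@blah.com,3,4,5,6"
m = re.match(r"([^,]*),([^,]*),([^,]*),", raw)
new_raw = f"{m.group(1)},{m.group(3)}"
print(new_raw)  # '1,3' - column 2 is gone, but so are columns 4-6

# To keep the trailing columns, capture the remainder as well, e.g.
# REGEX = ([^,]*),([^,]*),(.*)  with  FORMAT = $1,$3
m2 = re.match(r"([^,]*),([^,]*),(.*)", raw)
print(f"{m2.group(1)},{m2.group(3)}")  # '1,3,4,5,6'
```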


msarro
Builder

hazekamp, will that completely stop column 2 from being indexed/counting against license usage? If so, I'm mailing you a beer 🙂


hazekamp
Builder

I provided an update on how to set _raw based on selective capture groups. If you don't need to worry about commas in the data itself you can use "([^,]*)," for each column.


msarro
Builder

For the first go-through I was just trying to break everything down into a single regular expression where I could then include/exclude capture groups as desired. This is the first time I've ever had to write a regex for anything this large, so I'm still fumbling with the best way to go about it. Sorry if I'm not the clearest - I'm under a pressing deadline.


hazekamp
Builder

I think I missed what you were trying to do in terms of indexing partial events as opposed to only indexing certain events. You ultimately want to construct _raw based on selective capture groups?


msarro
Builder

Sadly that isn't an option. We need to strip out a large portion of the data since it is junk. Because the files move after being generated, we need Splunk's ability to track them if we're going to get anywhere close to real-time tracking of the data. Taking in the data whole vs. taking it in and throwing 70% of it into the nullQueue is a difference of a few hundred thousand dollars in Splunk licensing costs.


bwooden
Splunk Employee

hazekamp's answer will send columns to the null queue per your question - but it is a better practice to keep the event as-is, extract the fields you are currently interested in, and do any filtering at search/report time.



msarro
Builder

Awesome, thank you so much!


hazekamp
Builder

That's correct. You are altering the _raw message before the data hits the index queue. In this case the alteration removes data, resulting in a smaller message and less indexed volume.


msarro
Builder

This may work? I'm not sure. The issue isn't field extraction for certain fields. The issue is that a massive portion of our data is junk that will cause our Splunk licensing costs to skyrocket if we leave it in, thanks to the sheer volume of data. Maybe I'm still going at this wrong? The transforms statement above would work, but I'd need to remove column 3 in its entirety, and do that for 100+ other fields.
