Ignoring events during indexing vs. ignoring events during search

strive
Influencer

Hi,

We use Splunk 5.0.4. We have customers with daily log volumes ranging from 10GB to 50GB.

Our customers do not want URLs ending in *.jpg, *.png, etc. to show up in charts and reports.

We have two options:
1. Filter out these events from indexing.
2. Ignore these events while creating the summary index.

Given their log volumes, I would like to know which option is the more performance-intensive operation. We do not want to compromise on performance. Which option is better with respect to performance?

Also, I have this stanza in my transforms.conf:

[strip_images_header]
REGEX = (?i)^(?:[^ ]*( {1,2})){6}(?P<URL>[^ ]*)(?= )=(*.net|*vod|*.jpeg)
DEST_KEY = queue
FORMAT = nullQueue

I have referenced this stanza in my props.conf, but the events are not getting filtered.

Thanks

Strive

martin_mueller
SplunkTrust

Filtering these events during indexing should only be done if you're 100% certain you will not need those events for anything in the future.
If you really are certain you will never need those events, filtering at index time is a good choice because it reduces storage, search, and license load... until you discover you do need the events after all.

Personally I'd define an eventtype "charting_URLs" or similar that defines what you want to see in such a chart, and then use that as a search time filter for all your charts (basically option 2). That way you have a single configuration to adapt if the charting requirements change, and you still have the option of using the events in the future.
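
As a rough sketch of that approach, an event type could be defined in eventtypes.conf along these lines. The sourcetype name access_custom is a placeholder, the stanza assumes the URL field is already extracted at search time, and the extension list is taken from the question and would need adjusting to the real requirements.

```ini
# eventtypes.conf (sketch; the sourcetype name is hypothetical)
[charting_URLs]
# Keep only events whose URL does not end in one of the image extensions
search = sourcetype=access_custom NOT (URL="*.jpg" OR URL="*.png")
```

Charts and summary-index searches would then begin with eventtype=charting_URLs, so a change in the charting requirements means editing only this one stanza.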

martin_mueller
SplunkTrust

Up until that point - the lookahead for a space - the regex appears fine to me. However, the part after that - =(*.net|*vod|*.jpeg) - is what looks wrong to me.

strive
Influencer

Our log events use a space as the separator. The regex I mentioned, (?i)^(?:[^ ]*( {1,2})){6}(?P<URL>[^ ]*)(?= ), is used to extract the URL field at search time, and it works perfectly to fetch the 7th field (the URL) in the log events. My use case is to take the 7th field and check whether it ends with .net, vod, or jpeg.

martin_mueller
SplunkTrust

Looking further into the regex, this can't be right:

(?= )=

That's a contradiction in and of itself. "Look ahead for a space, don't consume a char, match for an equals sign"... the char can't be both a space and an equals sign.

martin_mueller
SplunkTrust

Well, what does the asterisk apply to? There's nothing in front of it that could be matched zero or more times.

Same question for the asterisks before .net and .jpeg
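
Putting the observations in this thread together, one possible corrected version is sketched below. It assumes the goal stated by the original poster: send an event to nullQueue when its 7th space-separated field ends with .net, vod, or .jpeg. The stray = is dropped, the dots are escaped, and the dangling asterisks are replaced by [^ ]* in front of the alternation. The sourcetype name in props.conf and the TRANSFORMS class name are placeholders, and the regex has not been tested against the actual log format.

```ini
# transforms.conf
[strip_images_header]
# Skip six space-delimited fields, then require the 7th to end in .net, vod, or .jpeg
REGEX = (?i)^(?:[^ ]* {1,2}){6}[^ ]*(?:\.net|vod|\.jpeg)(?= |$)
DEST_KEY = queue
FORMAT = nullQueue

# props.conf (the stanza name is a hypothetical sourcetype)
[access_custom]
TRANSFORMS-strip_images = strip_images_header
```

Index-time transforms like this take effect on the instance that parses the data (an indexer or heavy forwarder), typically after a restart, and only for events indexed afterwards.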

strive
Influencer

It is *vod only. I kept it that way on purpose.

martin_mueller
SplunkTrust

There likely is a typo in the last capturing group near the asterisk before "vod".

martin_mueller
SplunkTrust

Filtering at index time affects all charts and all searches. If the data has been ditched during indexing no search can find it, regardless of realtime or not.

martin_mueller
SplunkTrust

Filtering does affect indexing performance negatively because it has to test the filter for many events - but it also affects indexing performance positively because fewer events need to be indexed and written to disk.

Which effect prevails depends on the complexity of the filter and the ratio of events tossed into nullQueue. If you test a million events to move one into nullQueue, you're going to have worse performance; if you move half a million into nullQueue with a simple filter, you may even improve performance.

strive
Influencer

Completely agree with your points on reduced storage, search, and license load.
Filtering events at index time in high-log-volume systems affects indexing performance, right? It has to check each and every event. I am assuming that this may affect other charts that depend on the summary indexes created every 5 minutes, as well as real-time charts.
