Hi,
We use Splunk 5.0.4. We have customers with daily log volumes ranging from 10GB to 50GB.
Our customers do not want URLs ending in *.jpg, *.png, etc. to show up in charts and reports.
We have two options:
1. Filter these events out at index time.
2. Ignore these events when creating the summary index.
Given these log volumes, I would like to know which option is more performance-intensive, since we do not want to compromise on performance. Which option is better with respect to performance?
Also, I have this stanza in my transforms.conf:
[strip_images_header]
REGEX = (?i)^(?:[^ ]*( {1,2})){6}(?P<URL>[^ ]*)(?= )=(*.net|*vod|*.jpeg)
DEST_KEY = queue
FORMAT = nullQueue
I have referenced this stanza in my props.conf, but the events are not getting filtered.
Thanks
Strive
Filtering these events during indexing should only be done if you're 100% certain you will not need those events for anything in the future.
If you really are certain you will never need those events, filtering at index time is a good choice because it reduces storage, search, and license load... until you discover you do need the events after all.
Personally I'd define an eventtype "charting_URLs" or similar that defines what you want to see in such a chart, and then use that as a search time filter for all your charts (basically option 2). That way you have a single configuration to adapt if the charting requirements change, and you still have the option of using the events in the future.
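As a rough sketch of what that eventtype could look like, assuming the URL is available at search time in a field called URL (the field name and the extension list are placeholders, so adjust them to your data):

```ini
# eventtypes.conf
# Assumption: a search-time field named URL exists for these events.
[charting_URLs]
search = NOT (URL="*.jpg" OR URL="*.jpeg" OR URL="*.png")
```

Your charting and summary-populating searches would then start with eventtype=charting_URLs, so the image filter lives in one place.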
Up until that point - the lookahead for a space - the regex appears fine to me. However, the part after that, =(*.net|*vod|*.jpeg), is what looks wrong to me.
We have log events with a space as the separator. The regex I have mentioned (?i)^(?:[^ ]( {1,2})){6}(?P
Looking further into the regex, this can't be right:
(?= )=
That's a contradiction in and of itself. "Look ahead for a space, don't consume a char, match for an equals sign"... the char can't be both a space and an equals sign.
Well, what does the asterisk apply to? There's nothing in front of it that could be matched zero or more times.
The same question applies to the asterisks before .net and .jpeg.
It is *vod only. I have kept it purposefully.
There is likely a typo in the last capturing group, near the asterisk before "vod".
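For comparison, here is a hedged sketch of how the stanza might be written if the goal is to drop events whose URL ends in an image extension. It keeps your assumption that the URL is the seventh space-separated field; the extension list and the sourcetype name your_sourcetype are placeholders, not taken from your setup:

```ini
# transforms.conf
[strip_images_header]
# Skip six fields (each followed by one or two spaces), then require the
# seventh field to end in an image extension. Dots are escaped, and no
# asterisk appears without something in front of it to repeat.
REGEX = (?i)^(?:[^ ]* {1,2}){6}[^ ]*\.(?:jpe?g|png)(?= |$)
DEST_KEY = queue
FORMAT = nullQueue

# props.conf
# your_sourcetype is a placeholder for the real sourcetype name.
[your_sourcetype]
TRANSFORMS-strip_images = strip_images_header
```

Both files need to be on the instance that parses the data (indexer or heavy forwarder), and a restart is typically needed for the change to take effect. If your URLs can carry a query string, the pattern would need to allow for that as well.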
Filtering at index time affects all charts and all searches. If the data has been ditched during indexing no search can find it, regardless of realtime or not.
Filtering does affect indexing performance negatively because it has to test the filter for many events - but it also affects indexing performance positively because fewer events need to be indexed and written to disk.
Which effect prevails depends on the complexity of the filter and the ratio of events tossed into nullQueue. If you test a million events to move one into nullQueue, you're going to have worse performance; if you move half a million into nullQueue with a simple filter, you may even improve performance.
Completely agree with your points on reduced storage, search, and license load.
Filtering events at index time on high-volume systems affects indexing performance, right? It has to check each and every event. I am assuming that this may also affect other charts that depend on the summary indexes created every 5 minutes, as well as the real-time charts.