I'm using SEDCMD to clean up (and reduce) IIS logs:
# remove all path info but the (unique) file name
SEDCMD-uritrim = s% /commonurlbase[^ ]*/% ./%
# reduce chrome version (chrome mentions safari, so separate sed needed?)
SEDCMD-chrome = s% Mozilla[^ ]*Chrome.([0-9.]*)[^ ]*% Chrome-\1%
# reduce agent name version
SEDCMD-agents = s% Mozilla[^ ]*(Safari|Firefox|MSIE).([0-9.]*)[^ ]*% \1-\2%
# trim sid query string from referral url
SEDCMD-reftrim = s%.aspx\?s[iI][dD]=[^ ]*%.aspx%
# trim portal sign-on shenanigans from referral
SEDCMD-portalreftrim = s%/\!ut[^ ]*%%
I was thinking of ways to combine (some of) these, and/or maybe try to come up with a more efficient regex on some of them. What are some options for testing the performance effect of the changes? With 200+ GB of logs passing through daily, I want to be sure we're as efficient as we can be -- allowing that the logs need to be 'cleaned' as outlined above.
Thanks,
jon
This might be rather difficult to measure. Assuming that your daily indexing volume remains mostly flat from day to day, you might be able to come up with a measurement based on the CPU seconds used by the indexing process, compared day over day. The main issue is that these regexes fire as events come in, potentially changing the raw value of the event. Each test of "does this regex match?" uses a minuscule amount of CPU time, and each substitution, if it does match, uses only a little more.
Your most accurate bet (which is a lot of work) would be to implement a simple regex profiler. We know that Splunk uses PCRE, which is open source. You could build a test harness to evaluate the use of each of these regexes, over a sample of several hundred thousand events, in a controlled fashion. No, not easy at all - but it would be more accurate than trying to measure it in-situ in a running indexer.
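As a rough starting point for such a harness, here is a minimal sketch in Python. It assumes Python's `re` engine is a close enough stand-in for PCRE to compare regexes against each other; absolute timings will not match what the indexer sees. The rules are copied from the question, `count=1` mirrors a sed `s` command without the `g` flag, and the names `RULES`, `load_events`, `time_rule`, and `profile` are made up for this example.

```python
import re
import timeit

# Rules copied from the question as (pattern, replacement) pairs.
# Assumption: Python's re is only an approximation of Splunk's PCRE.
RULES = {
    "uritrim": (r" /commonurlbase[^ ]*/", " ./"),
    "chrome": (r" Mozilla[^ ]*Chrome.([0-9.]*)[^ ]*", r" Chrome-\1"),
    "agents": (r" Mozilla[^ ]*(Safari|Firefox|MSIE).([0-9.]*)[^ ]*", r" \1-\2"),
    "reftrim": (r".aspx\?s[iI][dD]=[^ ]*", ".aspx"),
    "portalreftrim": (r"/\!ut[^ ]*", ""),
}


def load_events(path):
    """Read one event per line from a sample log file (path is hypothetical)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if not line.startswith("#")]


def time_rule(pattern, repl, events, repeat=3):
    """Return the best wall-clock time (seconds) to apply one rule to all events."""
    compiled = re.compile(pattern)

    def run():
        for event in events:
            # count=1 replaces only the first match, like sed's s without g.
            compiled.sub(repl, event, count=1)

    return min(timeit.repeat(run, number=1, repeat=repeat))


def profile(rules, events):
    """Time every rule separately and return {rule_name: seconds}."""
    return {name: time_rule(p, r, events) for name, (p, r) in rules.items()}
```

To use it, something like `profile(RULES, load_events("iis_sample.log"))` on a few hundred thousand representative events gives one number per rule; rerun with a candidate regex swapped in and compare. Taking the minimum of several repeats reduces noise from other processes on the machine.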
Well, I'm not so much interested in specific regex tips as in how to evaluate whether a new regex is helpful or hurtful. Log example:
2012-02-27 21:57:00 172.20.90.43 POST /websiterooturi/subfolder/somepage.aspx sID=abcdef1234567890ABCDEF 80 - 172.20.176.20 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1;+Trident/4.0;+.NET+CLR+2.0.50727;+.NET+CLR+3.0.4506.2152;+.NET+CLR+3.5.30729) https://snazzywebsitedotcom/websiterooturi/subfolder/referringpage.aspx?sID=abcdef1234567890ABCDEF 200 1146 4870 375
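One way to check the "hurtful" side before worrying about speed is to confirm that a candidate rule set produces the same cleaned events as the current one. The sketch below is in Python and makes several assumptions: Python's `re` approximates PCRE, the rules are applied in the order listed in the question, and `CANDIDATE` is a hypothetical merge of the chrome and agents rules, not a recommendation. The MSIE event is the sample line from this thread; the Chrome event is made up for illustration.

```python
import re

# Current chrome and agents rules from the question, applied in listed order
# (the order Splunk actually uses is an assumption here).
ORIGINAL = [
    (r" Mozilla[^ ]*Chrome.([0-9.]*)[^ ]*", r" Chrome-\1"),
    (r" Mozilla[^ ]*(Safari|Firefox|MSIE).([0-9.]*)[^ ]*", r" \1-\2"),
]

# Hypothetical candidate: one rule, with a lazy prefix so the first browser
# token in the user agent wins (Chrome appears before Safari in Chrome's UA).
CANDIDATE = [
    (r" Mozilla[^ ]*?(Chrome|Safari|Firefox|MSIE).([0-9.]*)[^ ]*", r" \1-\2"),
]

EVENTS = [
    # Sample line posted in this thread.
    "2012-02-27 21:57:00 172.20.90.43 POST /websiterooturi/subfolder/somepage.aspx "
    "sID=abcdef1234567890ABCDEF 80 - 172.20.176.20 "
    "Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1;+Trident/4.0;+.NET+CLR+2.0.50727;"
    "+.NET+CLR+3.0.4506.2152;+.NET+CLR+3.5.30729) "
    "https://snazzywebsitedotcom/websiterooturi/subfolder/referringpage.aspx"
    "?sID=abcdef1234567890ABCDEF 200 1146 4870 375",
    # Made-up Chrome line, for illustration only.
    "2012-02-27 21:57:01 172.20.90.43 GET /websiterooturi/index.aspx - 80 - 172.20.176.21 "
    "Mozilla/5.0+(Windows+NT+6.1)+AppleWebKit/535.11+(KHTML,+like+Gecko)"
    "+Chrome/17.0.963.56+Safari/535.11 - 200 1146 4870 375",
]


def apply_rules(rules, event):
    """Apply each (pattern, replacement) once, in order, like sed s without g."""
    for pattern, repl in rules:
        event = re.sub(pattern, repl, event, count=1)
    return event


def differences(old, new, events):
    """Return (event, old_output, new_output) for events where the outputs differ."""
    found = []
    for event in events:
        before, after = apply_rules(old, event), apply_rules(new, event)
        if before != after:
            found.append((event, before, after))
    return found
```

Running `differences(ORIGINAL, CANDIDATE, events)` over a large sample and getting an empty list is a reasonable gate before timing the candidate. Any non-empty result shows exactly which events would be rewritten differently.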
You would need to provide sample logs if you are really looking for a better regex.