About markgo

markgo · ‎07-13-2011

I've had the misfortune of feeding 30K input files from Amazon S3 CloudFront logs into my live Splunk instance without specifying a sourcetype. This has created a serious problem: the bizarre headers that Amazon likes to use have produced thousands of automatically created variants of sourcetype-too-small (note that the REAL data does not cause this issue). As a result, performance has slowed to a crawl. I've deleted the "bad" events, but is there something I can do about the bad automatically created sourcetypes? As to why I didn't notice this: it didn't become a problem until the number of sourcetypes grew to a prodigious value. And since my searches excluded the bad events, I never noticed the sourcetypes.

markgo · ‎03-07-2011

Thanks for the answers! I'll add my own final solution, which is much along the lines of Stephen's in that I perform one primary search and then slice and dice the results. The really tricky part is that I want this answer for more than one build at a time, so the only way I could figure it was to use the autoregress command to peek back at events of the "other" type and then calculate the answer only where the requisite data was available. It's pretty awesome that the Splunk query language can express this, as it's really an iterative process:

    host=myhost
    | extract reload=T
    | search (script="vercheck.cgi" OR script="crashreport.cgi")
    | stats count as startupCount by bld, script
    | autoregress startupCount as crashCount
    | autoregress script as prevScript
    | autoregress bld as prevBld
    | eval BSI=if(script="vercheck.cgi" AND prevScript="crashreport.cgi",100*(startupCount-crashCount)/startupCount,0)
    | search BSI>0 AND startupCount>100
    | fields + bld, BSI, startupCount, crashCount

which produces the nice, clean output:

       bld     BSI        startupCount  crashCount
    1  350447  70.967742  434           126
    2  350352  75.700935  107           26

In the end, I realized that the output data is sensitive to the timescale of the search, but it's not really bucketed the way a time chart is. Thanks again for the community support!
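For anyone who wants to sanity-check the eval arithmetic outside Splunk, here is a minimal Python sketch. It assumes the per-build counts have already been pulled out by hand; the function name `bsi` and the `counts` dictionary are hypothetical and not part of Splunk.

```python
def bsi(startup_count, crash_count):
    """Same formula as the eval step: 100*(startupCount-crashCount)/startupCount."""
    if startup_count <= 0:
        raise ValueError("startup_count must be positive")
    return 100 * (startup_count - crash_count) / startup_count


# Counts copied from the output in the post: build -> (startupCount, crashCount)
counts = {350447: (434, 126), 350352: (107, 26)}

results = {}
for bld, (startups, crashes) in counts.items():
    # Mirror the "startupCount>100" filter from the search
    if startups > 100:
        results[bld] = round(bsi(startups, crashes), 6)

print(results)
```

The rounded values match the BSI column in the post, which suggests the autoregress pairing lined up the startup and crash counts as intended for these two builds.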

markgo · ‎03-05-2011

Here's the situation: I have one set of web log events that represent people using my app, which I generally display in a timescaled graph split by the version they are using. I have another set of web log events that represent crash reports, which I display the same way. I'd like to calculate a "crash incidence / usage" figure, which would require taking the count of crashes for a particular time bucket and dividing it by the count of usage events over the same time period. Short of running extractions using the CLI and doing my own math (and injecting the results back in as an input), is there any way to express this kind of query in native Splunk?
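To make the calculation concrete, this is roughly the "do my own math" version, written as a small Python sketch. It is not a Splunk feature: the function name `crash_incidence`, the `(epoch_seconds, version)` event layout, and the one-hour bucket size are all assumptions made for illustration.

```python
from collections import Counter


def crash_incidence(usage_events, crash_events, bucket_seconds=3600):
    """Return {(bucket_start, version): crashes / usage} for each bucket with usage.

    Events are (epoch_seconds, version) tuples; this layout is hypothetical.
    """
    def bucket(ts):
        # Start of the time bucket that contains ts
        return ts - ts % bucket_seconds

    usage = Counter((bucket(ts), ver) for ts, ver in usage_events)
    crashes = Counter((bucket(ts), ver) for ts, ver in crash_events)
    # Counter returns 0 for buckets that had usage but no crashes
    return {key: crashes[key] / n for key, n in usage.items()}


# Made-up sample data: four usage events in the first hour, two in the second
usage_events = [(0, "1.0"), (10, "1.0"), (20, "1.0"), (30, "1.0"),
                (3600, "1.0"), (3700, "1.0")]
crash_events = [(15, "1.0"), (3650, "1.0")]

incidence = crash_incidence(usage_events, crash_events)
print(incidence)
```

Buckets with crashes but no usage events are dropped here to avoid dividing by zero; whether that is the right choice depends on how the two logs are collected.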

Posts	3
Solutions	0
Karma Given	0
Karma Received	7
Member Since	‎03-05-2011

Online Status	Offline
Date Last Visited	‎06-05-2020 02:02 AM

How can I get rid of thousands of automatically cr...

How can I perform math on the counts from two quer...
