I'm new-ish to Splunk, so forgive me if I'm not sure of the best way to do this.
Basically, I'm trying to find out two things:
I currently have the following search:
sourcetype=ihs_log "GET /shop/product/" | rex field=url "/shop/product/(?<productname>.+?)\?ID=(?<productid>.+?)\&CategoryID=(?<categoryid>.+?)[\&\#$]" | stats count by productid | sort count | reverse
which gives me the total number of hits which include both ID and CategoryID parameters (in that order, one after another), but if I run the same search without the categoryID bit, e.g.:
sourcetype=ihs_log "GET /shop/product/" | rex field=url "/shop/product/(?<productname>.+?)\?ID=(?<productid>.+?)[\&\#$]" | stats count by productid | sort count | reverse
I get the same result. I would expect the second query to give a higher count, since it should include those cases where ID is passed, but where CategoryID is not passed. Or am I misunderstanding?
At any rate, what do I specify to get only the URLs which don't include the CategoryID parameter (no matter whether it appears before or after ID in the query parameters), and then sort by productid?
Basically what's the regex for does-not-include?
Like I said, it's probably a trivial question...
Thanks!
Try this for the first query:
sourcetype=ihs_log "GET /shop/product/" | rex field=url "/shop/product/(?<productname>.+?)\?ID=(?<productid>.+?)\&CategoryID=(?<categoryid>.+?)[\&\#$]" | stats count by productid,categoryid | sort count | reverse
And leave the second as it is, and let me know if the results are still the same.
Hmmm. They do give different values. So how come?
And how can I get the number of URLs which don't include CategoryID (or is it simpler just to look at the difference between the two queries)?
No. Just to say: to get the total number of hits which include both the ID and CategoryID parameters, you must count by both productid and categoryid: | stats count by productid,categoryid
But if you just want the total number per product ID, you must count only by productid: | stats count by productid
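To illustrate the difference, here is a small self-contained sketch (not from the original thread) that can be pasted into a search bar. It generates three dummy events with makeresults and gives only the first two a value for a made-up field a; the field names n and a are placeholders. stats leaves out any event in which a by field is missing, so the result should be a single row for a=x with a count of 2, not 3:

```
| makeresults count=3
| streamstats count as n
| eval a=if(n<3, "x", null())
| stats count by a
```

The same thing happens with real data: an event only contributes to "stats count by productid,categoryid" if both fields were extracted from it.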
Aha - so "count by a,b,c" will return only those which include a, b and c.
So is there an example of how to use either regex or rex to return URL's which explicitly don't include a particular query parameter?
You use the regular expression to extract fields (parameters) from your events. For example, when you do something like this: | rex field=url "/shop/product/(?<productname>.+?)\?ID=(?<productid>.+?)\&CategoryID=(?<categoryid>.+?)[\&\#$]"
, you have just extracted three fields (productname, productid and categoryid). Those fields do not have any effect on the search criteria; you decide which fields to use in your search criteria only after the extraction. That is why, when you write | stats count by productid | sort count | reverse
, you are not taking productname or categoryid into account in your search criteria. You have extracted them, but you haven't used them in your search criteria.
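One way to turn that into an actual search, as a sketch rather than a tested answer: extract each parameter with its own rex so that neither depends on the other's position, then filter on the extracted fields with where. This assumes the url field contains the query string (as the searches above imply) and that parameter values never contain & or #. Events where a rex does not match are kept, just without that field, which is what makes the isnull() test possible. Note also that inside a character class such as [\&\#$] the $ is a literal dollar sign, not an end-of-string anchor, so that pattern never matches a URL that ends right after the ID value; [^&#]+ avoids needing a terminator at all.

```
sourcetype=ihs_log "GET /shop/product/"
| rex field=url "[?&]ID=(?<productid>[^&#]+)"
| rex field=url "[?&]CategoryID=(?<categoryid>[^&#]+)"
| where isnotnull(productid) AND isnull(categoryid)
| stats count by productid
| sort - count
```

Swapping isnull(categoryid) for isnotnull(categoryid) gives the opposite set, that is, hits which carry both parameters in either order.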
So I admit that I'm still a bit lost 😞
I get that the basic query bit is just the stuff before the first pipe, and then I'm trying to get specific data out from that. That seems to be why the total number of matching events is the same.
What if I don't care about the specifics at all? What if I simply want a count of all events which do (or don't) match certain URL formats (without caring about what the actual values are)? Do I even need to get stats for this?
Basically, I want a total count of all URL's which match this regex
/shop/product/any-value?ID=any-value
which don't include the CategoryID parameter. Can I include that in the basic query?
So I don't need to know what values the ID or CategoryID parameters have, just whether the CategoryID parameter exists in the URL. Am I being too complicated?
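If only a total is needed, a possible sketch (an assumption-laden one, not a confirmed answer from this thread) is to filter with the regex command instead of extracting anything. regex keeps events whose field matches the pattern, and with != it keeps those that do not match, which is the closest thing to a "does-not-include" operator here. It is assumed that url holds the path plus the query string:

```
sourcetype=ihs_log "GET /shop/product/"
| regex url="/shop/product/[^?]+\?(.*&)?ID="
| regex url!="[?&]CategoryID="
| stats count
```

The regex command is a separate command, so it has to come after the first pipe and is not part of the base search. Dropping the third line gives the count of all product URLs with an ID, with or without CategoryID.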