I am using this search to produce a monthly report ranking the top pages in a section of a site. My date range always starts on Sep 1, 2016 and runs to the end of each month. What I am finding is that the count through Feb 28 is actually lower than the count through Jan 31 for some URLs. It seems like such a simple search. What other factor could be reducing the effective number of events over time?
source="/home/suchandso/access-logs/website.com" AND uri_path="/sectionpath*" | stats count by uri_path
Thanks!
Okay, here's a candidate. Please run the following two items and see which one matches your numbers for Sept only.
earliest="09/01/2016:00:00:00" latest="09/30/2016:00:00:00" source="/home/suchandso/access-logs/website.com" AND uri_path="/sectionpath*" | stats count by uri_path
earliest="09/01/2016:00:00:00" latest="10/01/2016:00:00:00" source="/home/suchandso/access-logs/website.com" AND uri_path="/sectionpath*" | stats count by uri_path
NEVER MIND THE ABOVE.
Here's a quick and useful way to look at your data. Try this, modifying the URLs to one that was consistent and two that were inconsistent.
earliest="09/01/2016:00:00:00" latest="10/01/2016:00:00:00"
source="/home/suchandso/access-logs/website.com" AND
(uri_path="/sectionpath/mytestA" OR uri_path="/sectionpath/mytestB" OR uri_path="/sectionpath/mytestC" )
| bin _time span=1month
| stats count as eventcount by _time uri_path
| chart sum(eventcount) as eventcount over _time by uri_path
| addtotals row=t col=t labelfield="Month" label="Total for Period" fieldname="Total for Month"
Giving you results that look somewhat like this:

Month              /sectionpath/mytestA  /sectionpath/mytestB  /sectionpath/mytestC  Total for Month
2016-09                              70                    31                    46              147
2016-10                              95                    49                   102              246
2016-11                              95                    48                   107              250
2016-12                             105                    45                   106              256
2017-01                              89                    51                   106              246
2017-02                              90                    40                    91              221
2017-03                              14                     6                    18               38
Total for Period                    558                   270                   576             1404
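For anyone who wants to replicate that bucketing-and-totals logic outside Splunk, here is a minimal Python sketch of the same pipeline (the events and paths below are made-up illustrations, not real data):

```python
from collections import defaultdict
from datetime import datetime

# Toy events: (date string, uri_path). These values are hypothetical.
events = [
    ("2016-09-03", "/sectionpath/mytestA"),
    ("2016-09-15", "/sectionpath/mytestB"),
    ("2016-10-02", "/sectionpath/mytestA"),
    ("2016-10-20", "/sectionpath/mytestA"),
]

# Equivalent of: bin _time span=1month | stats count by _time uri_path
counts = defaultdict(int)
for ts, path in events:
    month = datetime.strptime(ts, "%Y-%m-%d").strftime("%Y-%m")
    counts[(month, path)] += 1

# Equivalent of: chart sum(eventcount) over _time by uri_path | addtotals row=t col=t
months = sorted({m for m, _ in counts})
paths = sorted({p for _, p in counts})
table = {m: {p: counts.get((m, p), 0) for p in paths} for m in months}
for m in months:
    table[m]["Total for Month"] = sum(table[m][p] for p in paths)
table["Total for Period"] = {
    p: sum(table[m][p] for m in months) for p in paths + ["Total for Month"]
}
```

The point of the row and column totals is the same as in the SPL version: the grand total must equal both the sum of the monthly totals and the sum of the per-URL totals, which makes discrepancies jump out immediately.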
Thanks for the query logic, much faster than exporting one month at a time.
As to your two candidate queries, the second one is the exact query I ran for Sept only. What was your hypothesis in comparing the two?
Follow-up after a long time... that particular comparison was just trying to eliminate the possibility that some records on the last day of the month were being dropped because of their event time.
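For the record, the difference between those two candidate searches comes down to Splunk treating the time window as half-open: `earliest` is inclusive, `latest` is exclusive. A quick Python sketch of that boundary behavior:

```python
from datetime import datetime

def in_range(event_time, earliest, latest):
    # Splunk-style half-open window: earliest is inclusive, latest is exclusive.
    return earliest <= event_time < latest

# A hypothetical event at noon on the last day of September.
sep30_event = datetime(2016, 9, 30, 12, 0, 0)

# latest="09/30/2016:00:00:00" stops at the midnight that *begins* Sep 30,
# so the entire last day of the month falls outside the window...
assert not in_range(sep30_event, datetime(2016, 9, 1), datetime(2016, 9, 30))

# ...while latest="10/01/2016:00:00:00" covers all of September.
assert in_range(sep30_event, datetime(2016, 9, 1), datetime(2016, 10, 1))
```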
With regard to the overall issue, it seems like the subsearch limits could cause numbers for each individual URL to go either direction, since I don't believe there is a guarantee which events from which indexers will arrive first. Nonetheless, I don't see any way that the described symptoms could be generated based on the code and comments above. Sure wish he had posted what he found.
My guess, based on the fact that the OP disappeared shortly after we posted the chart ... over _time code, is that it enabled him to effectively find and stomp the bug, since it automatically provides a summary row and column for the individual data.
By the way, are you comparing the results that you saved from last month, or are you comparing the results that you get NOW for a search that ends on Jan 31? If records had been deleted in the meantime, a later search is obviously not going to find the missing records.
I am comparing both the results I saved from last month AND the results I get now for a search that ends on Jan 31. For some URLs, the Sep-Jan number is higher than the Sep-Feb number when I run both reports now. AND the Sep-Jan number when I ran it last month was higher than the Sep-Jan number when I run it now. See the comments I will post below; it gets more bizarre...
Okay, I'd limit the query to a single one of the URLs that differs and see if the behavior can be replicated at a more granular level. Most of the ways that Splunk inherently limits search results disappear once the volume of results gets under the throttle. Thus, if the problem goes away when you limit the query to a small selection, that tells you where to look for the glitch.
If I'm understanding what you're suggesting, then I would expect to see lower totals for larger aggregate time ranges. But in this case, I'm seeing the opposite: lower numbers when I manually total up smaller time ranges (e.g. 1 month increments) than when I use larger time ranges (e.g. 6 months). And just now I tried one URL and got a smaller number for Sep-Jan than when I pull that same time range for all URLs, which would seem to be a much bigger chunk for Splunk to process.
So could the glitch or limit you're suggesting be that some events are being counted more than once?
Depends on how you are using it. There are some places where Splunk has hidden and counter-intuitive limits on what it will return. Is that search in a panel? Are you retrieving the results through loadjob?
Do the same search from the search head, specifying these parameters, and send to background...
earliest="09/01/2016:00:00:00" latest="02/01/2017:00:00:00" source="/home/suchandso/access-logs/website.com" AND uri_path="/sectionpath*" | stats count by uri_path
earliest="09/01/2016:00:00:00" latest="03/01/2017:00:00:00" source="/home/suchandso/access-logs/website.com" AND uri_path="/sectionpath*" | stats count by uri_path
If the results don't look reasonable for the second one relative to the first, then find out what is falling out by breaking out Feb independently...
earliest="09/01/2016:00:00:00" latest="03/01/2017:00:00:00" source="/home/suchandso/access-logs/website.com" AND uri_path="/sectionpath*" | eval month=strftime(_time,"%m")| eval month=case(month==2,2,true(),0)| stats count by month uri_path
or break every month out independently and see if anything jumps out at you...
earliest="09/01/2016:00:00:00" latest="03/01/2017:00:00:00" source="/home/suchandso/access-logs/website.com" AND uri_path="/sectionpath*" | eval month=strftime(_time,"%m")| stats count by month uri_path
updated to put quotes around the time format string "%m"
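The month-breakdown step has a rough Python equivalent, shown below. Note that the format string has to be quoted in SPL for the same reason it is a string literal here; note also that "%m" alone folds the same month of different years together, which is harmless for a Sep-Mar range but would make "%Y-%m" the safer choice over longer spans. The event dates are made up for illustration:

```python
from datetime import datetime

# Mimics: eval month=strftime(_time, "%m") | stats count by month
events = [datetime(2016, 9, 5), datetime(2016, 10, 1), datetime(2017, 2, 14)]

counts = {}
for t in events:
    month = t.strftime("%m")  # month number as a zero-padded string
    counts[month] = counts.get(month, 0) + 1
```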
I am not running the search in a panel, I am entering it as a raw search and I save those searches as reports. To my knowledge I am not using loadjob.
Your search parameters eval month=strftime(_time,%m)| eval month=case(month==2,2,true(),0)| stats count by month uri_path
and eval month=strftime(_time,%m)| stats count by month uri_path
both threw errors for me.
So I ran the report for each month, limiting the date range to only that month, e.g. Sep 1 - Sep 30, Oct 1 - Oct 31, etc. Then I manually totaled each month of data for five selected URLs. For each URL, my manually totaled number for Sep-Feb exactly matches what I get when I run the report with the full Sep-Feb date range.
However, my manually totaled numbers for Sep-Jan match the full Sep-Jan date range reports exactly for only two of the five URLs. The manual totals are lower for the other three. And for Sep-Dec, the data for only one of the five URLs matches.
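That reconciliation is worth automating: if the single-month windows tile the full range exactly (half-open, back to back), the sum of the monthly counts must equal the full-range count, and any gap points at events being dropped, deleted, or double-counted somewhere. A toy Python check, borrowing the mytestA numbers from the example table earlier in the thread and assuming a matching full-range result:

```python
# Per-URL counts taken from single-month reports (mytestA, Sep-Jan).
monthly_counts = {
    "2016-09": 70,
    "2016-10": 95,
    "2016-11": 95,
    "2016-12": 105,
    "2017-01": 89,
}

# Hypothetical count reported by a single Sep 1 - Feb 1 search for the same URL.
full_range_count = 454

# If the month windows tile the range exactly, the two numbers must agree.
discrepancy = full_range_count - sum(monthly_counts.values())
assert discrepancy == 0
```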
I doubt this would cause the problem, but you don't need the AND.
Have you tried taking out the uri_path="/sectionpath*"
and checking? I also doubt that's it, but you never know. It seems simple enough...
You are correct: I didn't need the AND. I have not yet tried taking out the uri_path="/sectionpath*" because there would be millions of events and it would take way too long to run.