Searching a table with 252092 events for the number of distinct ORDERID values, I get different results from "dedup" and "dc". The search "(index=swbdlogs sourcetype=shopdownloadlogs) | chart dc(ORDERID)" returns 71908, while the search "(index=swbdlogs sourcetype=shopdownloadlogs) | dedup ORDERID | chart count" returns 66785. In my opinion the results should be the same. Sorting by ORDERID first gives a value in between: "(index=swbdlogs sourcetype=shopdownloadlogs) | sort 300000 ORDERID | chart dc(ORDERID)" returns e.g. 71383.
Which value can I trust?
Splunk 6.1.1 on RHEL
In my view, each of these values can be correct. The difference can come from two things:
- The time range picker was not set to the same range when you ran the dc search and the dedup search.
- Your data is being indexed continuously. If a large volume of data is still arriving while you search, it is quite possible that Splunk returns different results from one run to the next.
For the following searches, run over “All time” (note: I do not index my data continuously), dc and dedup return the same result, as expected:
1- I have a search (index=tuto sourcetype=access_combined_wcookie) that initially returns 39532 events.
2- When I run “index=tuto sourcetype=access_combined_wcookie | chart dc(categoryId)”, it returns 39532 events and a statistics table like this:
dc(categoryId)
8
This is because chart computes the distinct count of categoryId over all of the matching events.
3- When I run “index=tuto sourcetype=access_combined_wcookie | dedup categoryId | chart count”, I get 8 events and the following statistics table:
count
8
This means the events are deduplicated by categoryId before the count is done.
4- When I run “index=tuto sourcetype=access_combined_wcookie | sort 40000 categoryId | chart dc(categoryId)”, I get the same result as in step 2.
I actually have a case open with Splunk where I get a different event count from the same query when running it several times... So it could be possible.
To clarify what @Strive says: Are you searching for the exact same period of time? Not like 'last 4 hours', which is essentially a sliding window.
Have you tested this with earliest and latest, e.g. earliest=-3h@h latest=@h, to ensure that the exact same underlying events are being returned to your calculation?
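A minimal sketch of what that could look like for the two searches in the question, assuming the index and sourcetype names from the original post; the three-hour window is only an example. Both searches are pinned to the same snapped window, so they read the same time range no matter when they are run. Events that get indexed late with timestamps inside the window could still change the numbers between runs.

```
index=swbdlogs sourcetype=shopdownloadlogs earliest=-3h@h latest=@h
| stats dc(ORDERID) AS distinct_orders

index=swbdlogs sourcetype=shopdownloadlogs earliest=-3h@h latest=@h
| dedup ORDERID
| stats count AS distinct_orders
```

If the two numbers agree on a fixed window, the original mismatch most likely came from the time range or from data still arriving.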
/K
To make things even more interesting you could also do this:
base search | stats count by ORDERID
and look at the number of rows returned.
Are there any null ORDERIDs?
Are you choosing the same time range for all these searches?
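One way to check for missing ORDERIDs in a single pass, as a rough sketch: the index and sourcetype come from the question, the time window is an example, and the names total_events, events_with_orderid and distinct_orders are just labels chosen here. count(ORDERID) only counts occurrences where the field has a value, so if it comes out smaller than total_events, some events carry no ORDERID.

```
index=swbdlogs sourcetype=shopdownloadlogs earliest=-3h@h latest=@h
| stats count AS total_events count(ORDERID) AS events_with_orderid dc(ORDERID) AS distinct_orders
```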