Splunk Search

Difference between dedup and dc counting?

aan_gst_dk
New Member

Searching 252092 events for the number of distinct ORDERID values, I get different results from "dedup" and "dc".

The search
(index=swbdlogs sourcetype=shopdownloadlogs) | chart dc(ORDERID)
returns 71908, while
(index=swbdlogs sourcetype=shopdownloadlogs) | dedup ORDERID | chart count
returns 66785. In my opinion the results should be the same. Sorting by ORDERID first gives a value in between:
(index=swbdlogs sourcetype=shopdownloadlogs) | sort 300000 ORDERID | chart dc(ORDERID)
returns e.g. 71383.

Which value can I trust?

Splunk 6.1.1 on RHEL

0 Karma

ngatchasandra
Builder

In my view any of these values can be correct, for two possible reasons:
- The time range picker was not set to the same range when you ran the dc search and the dedup search.
- Your data is being indexed continuously. If events are still arriving while you search, and the data set is large, it is quite possible for Splunk to return different results from one run to the next.
For the searches below, run over “All time” on an index that is not being continuously updated, dc and dedup give the same result:
1- I have a search (index=tuto sourcetype=access_combined_wcookie) that initially returns 39532 events.
2- When I run “index=tuto sourcetype=access_combined_wcookie | chart dc(categoryId)”, it returns 39532 events and this statistics table:

dc(categoryId)
8

This is because chart only computes the distinct count of categoryId over all the events.
3- When I run “index=tuto sourcetype=access_combined_wcookie | dedup categoryId | chart count”, I get 8 events and the following statistics table:

count
8

This means the events are deduplicated on categoryId before the count is taken.
4- When I run “index=tuto sourcetype=access_combined_wcookie | sort 40000 categoryId | chart dc(categoryId)”, I get the same result as in step 2.

0 Karma

sbsbb
Builder

I actually have a case open with Splunk, because I get a different event count from the same query when I run it a couple of times... So it could be possible.

0 Karma

kristian_kolb
Ultra Champion

To clarify what @Strive says: Are you searching for the exact same period of time? Not like 'last 4 hours', which is essentially a sliding window.

Have you tested this with earliest and latest, e.g. earliest=-3h@h latest=@h, to ensure that the exact same underlying events are being returned to your calculation?
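As a rough sketch of that test, the two searches from the question could be pinned to the same snapped window like this. The index and sourcetype are taken from the question; the three-hour window and the output field names (distinct_orderids, deduped_events) are only illustrative assumptions. The block holds two separate searches, to be run one after the other:

```spl
index=swbdlogs sourcetype=shopdownloadlogs earliest=-3h@h latest=@h
| stats dc(ORDERID) AS distinct_orderids

index=swbdlogs sourcetype=shopdownloadlogs earliest=-3h@h latest=@h
| dedup ORDERID
| stats count AS deduped_events
```

Because both ends of the window are snapped to the hour, both searches should see the same events as long as no late-arriving data is still being indexed into that window. If the two numbers still differ, the time range is probably not the cause.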

/K

0 Karma

martin_mueller
SplunkTrust

To make things even more interesting you could also do this:

base search | stats count by ORDERID

and look at the number of rows returned.
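One possible way to get that row count as a number instead of reading it off the results table is a second stats. This is only a sketch, assuming the base search from the question and an example fixed time window; the field name rows is made up for illustration:

```spl
index=swbdlogs sourcetype=shopdownloadlogs earliest=-3h@h latest=@h
| stats count by ORDERID
| stats count AS rows
```

The first stats produces one row per distinct ORDERID value, and events without an ORDERID produce no row, so rows gives a third figure to compare against the dc and dedup results.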

0 Karma

strive
Influencer

Are there any null ORDERIDs?

Are you choosing the same time range for all these searches?
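A quick way to check for missing ORDERIDs might look like the following. It is a sketch only, using the index and sourcetype from the question with an example fixed window; the field name orderid_state and its two labels are invented for illustration:

```spl
index=swbdlogs sourcetype=shopdownloadlogs earliest=-3h@h latest=@h
| eval orderid_state=if(isnull(ORDERID), "missing", "present")
| stats count by orderid_state
```

By default dc ignores events that have no ORDERID, and dedup drops them unless keepempty=true is set, so on an identical set of events missing values alone should not make the two counts disagree. A large "missing" count is still worth knowing about when comparing against the total number of events.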
