Solved: equivalence of sql join of two different group bys

eyaler · ‎03-30-2015

i have data of the form:
day, hour, seller, buyer

i want to find all instances where a seller appears only on a single day and sells to more than 10 different buyers in the same hour.

in sql i would do:
select a.* from (select day, hour, seller, count(distinct buyer) as cnt_buy from sales group by day, hour, seller) a join (select seller, count(distinct day) as cnt_date from sales group by seller) b on a.seller=b.seller where cnt_buy>10 and cnt_date=1

clarification: I would like to combine the two following queries:

(1) find all sellers which appear only on single day:
| stats dc(day) as cnt_date by seller| where cnt_date=1
(2) find all instances where one seller sold to more than 10 buyers in
one hour:
| stats dc(buyer) as cnt_buy by day, hour, seller | where cnt_buy>10
now i want to join the above two so that only the sellers in (1) are used in (2). what would be an efficient
splunk way to do so?

somesoni2 · ‎04-09-2015

Try this

base search | stats dc(buyer) as buyers by seller, day, hour | eventstats dc(day) as days by seller | where buyers>10 AND days=1

View solution in original post

somesoni2 · ‎04-09-2015

Try this

base search | stats dc(buyer) as buyers by seller, day, hour | eventstats dc(day) as days by seller | where buyers>10 AND days=1

eyaler · ‎04-09-2015

looks good. it would however seem more efficient to first filter sellers by the second condition (days=1) and only then do the hourly counts on those sellers. could you provide a way to do it?

rsennett_splunk · ‎04-09-2015

@somesoni2: I was actually thinking sub search given the clarification... but I'm stuck on the time factor at the moment...

With Splunk... the answer is always "YES!". It just might require more regex than you're prepared for!

rsennett_splunk · ‎03-30-2015

This may seem oversimplified... but if your data in Splunk holds day,hour,seller,buyer fields for each event, all you would have to do is group the fields, so that you were counting the "buy" and sorting by the other stuff so it narrowed down by seller:

|stats count by day hour seller|where count>=10

In this case, the "buyer" is present on all of them... and how you order the fields after the 'by' will determine what exactly you are counting. In other words, the last field in the list:seller is counted per hour, per day

the transition from thinking in terms of rows and columns to thinking in terms of events and how to manipulate them, does take a bit of acclimation...

If your data is really in two physical data sources, you need to show the data more clearly so we can give you the syntax but unless I'm mistaken, the join was recursive so you could get the count. (my sql is quite rusty at this point). In this case, you are counting the events so you don't have to create something to represent the "count". It's more of a by product of the calculation...

With Splunk... the answer is always "YES!". It just might require more regex than you're prepared for!

eyaler · ‎04-09-2015

not sure we understand each other. essentially i want to combine the following queries:
1. find all sellers which appear only on single day:
| stats dc(day) as cnt_date by seller| where cnt_date=1
2. find all instances where one seller sold to more than 10 buyers in one hour:
| stats dc(buyer) as cnt_buy by day, hour, seller | where cnt_buy>10
now i want to join the above two so that only the sellers in (1) are used in (2).
what would be an efficient splunk way to do so?

equivalence of sql join of two different group bys

Announcing Scheduled Export GA for Dashboard Studio

Extending Observability Content to Splunk Cloud

More Control Over Your Monitoring Costs with Archived Metrics GA in US-AWS!