We are trying to see whether the out-of-the-box join command works well in Hunk. We tried the following:
index="claims" | join prov_tin [search index=provider | eval prov_tin = tin]
In claims we have two billion events and in provider we have a couple of million events. The above command runs for some time and returns some results. However, it never even launches a MapReduce job, so it's very slow and it just can't traverse an entire index.
Any ideas?
Yeah, this spins and spins because the subsearch, which retrieves everything from index=provider, has to finish before the main search can start.
Read the two links I provided. A possible solution would be
index=claims OR index=provider
| eval ID = if(index="claims", prov_tin, tin)
| stats values(field1) as field1 values(field2) as field2 ... by ID
Written that way, retrieving data from both indexes and the prestats computation are map-able.
If you're just searching for one subset of claims you could do this:
(index="claims" clm_aud_nbr="585478461201*") OR (index=provider [search index="claims" clm_aud_nbr="585478461201*" | stats count by prov_tin | rename prov_tin as tin | fields tin])
| eval ID = if(index="claims", prov_tin, tin)
| stats values(field1) as field1 values(field2) as field2 ... by ID
The subsearch will retrieve only the tins that match your claims filter, allowing the search for index=provider to be scoped narrowly to just the tins you need. If you throw billions of events at this it'll still be slow, of course.
I don't see why it shouldn't. Supply three OR'd sources of data, compute a "join field", run stats, filter.
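For three tables, a sketch of that same pattern might look like the following. This is an assumption about the client's schema, not something from the thread: index=member and its fields mbr_tin and mbr_name are hypothetical stand-ins for the third table.
(index=claims calc_pd_amt!=0.0) OR (index=provider cos_prov_spcl_cd=13) OR (index=member)
| eval ID = case(index="claims", prov_tin, index="provider", tin, index="member", mbr_tin)
| stats sum(calc_pd_amt) as total values(cos_prov_spcl_cd) as cos_prov_spcl_cd values(mbr_name) as mbr_name by ID
| search total != 0.0 cos_prov_spcl_cd = 13
The key point is the same as with two indexes: each additional table just becomes another OR'd source, one more branch in the case() that computes the join field, and more values() aggregations in the stats, so everything stays map-able.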
Thank you Martin!!!
Based on what you suggested we did -
(index=claims calc_pd_amt!=0.0) OR (index=provider cos_prov_spcl_cd = 13)
| eval prov_tin = coalesce(tin, prov_tin)
| stats sum(calc_pd_amt) as total, values(cos_prov_spcl_cd) as cos_prov_spcl_cd by prov_tin
| search total != 0.0 cos_prov_spcl_cd = 13
It works amazingly well with the two billion claims.
The client is now asking whether the solution would work with three tables as well. Their use case is based on three tables...
This inherently has to be either slow or impossible, both in Hunk and Splunk. You're asking for two entire indexes to be returned to the search head and then joined - that's just asking for trouble.
Instead, consider more Splunky ways of joining, see https://answers.splunk.com/answers/129424/how-to-compare-fields-over-multiple-sourcetypes-without-jo... or http://wiki.splunk.com/Virtual_.conf (March 2016).
What are you actually trying to achieve?
I'm trying to see whether the join command works at all in Hunk at this scale.
The following works for one claim and returns results immediately -
index="claims" clm_aud_nbr="585478461201*"
However,
index="claims" clm_aud_nbr="585478461201*" | join prov_tin [search index=provider | eval prov_tin = tin]
Spins and spins... after a quarter billion events the query processing just stops. Apparently, the first part of the query, up to the first pipe, tries to retrieve all results from the two billion claims. How can we avoid this lengthy process and return, let's say, just the first match?