Splunk Search

Is there a way to use the join command effectively in Hunk?

ddrillic
Ultra Champion

We are trying to see whether the out-of-the-box join command works well in Hunk. We tried the following:

index="claims" | join prov_tin [search index=provider | eval prov_tin = tin]

In claims we have two billion events and in provider we have a couple of million events. The above command runs for some time and returns some results. However, it doesn't even produce a MapR job and therefore it's very slow and it just can't traverse and entire index.

Any ideas?

Tags (2)
0 Karma
1 Solution

martin_mueller
SplunkTrust
SplunkTrust

Yeah, this spins and spins because the subsearch has to finish before the main search can start - retrieving all the things from index=provider.

Read the two links I provided. A possible solution would be index=claims OR index=provider | eval ID = if(index="claims", prov_tin, tin) | stats values(field1) as field1 values(field2) as field2 ... by ID
Written that way, retrieving data from both indexes and the prestats computation are map-able.

If you're just searching for one subset of claims you could do this:

(index="claims" clm_aud_nbr="585478461201*") OR (index=provider [search index="claims" clm_aud_nbr="585478461201*" | stats count by prov_tin | rename prov_tin as tin | fields tin]) |  | stats values(field1) as field1 values(field2) as field2 ... by ID

The subsearch will retrieve all tins, allowing the search for index=provider to be scoped narrowly for just the tins you need. If you throw billions of events at this it'll still be slow of course.

View solution in original post

0 Karma

martin_mueller
SplunkTrust
SplunkTrust

Yeah, this spins and spins because the subsearch has to finish before the main search can start - retrieving all the things from index=provider.

Read the two links I provided. A possible solution would be index=claims OR index=provider | eval ID = if(index="claims", prov_tin, tin) | stats values(field1) as field1 values(field2) as field2 ... by ID
Written that way, retrieving data from both indexes and the prestats computation are map-able.

If you're just searching for one subset of claims you could do this:

(index="claims" clm_aud_nbr="585478461201*") OR (index=provider [search index="claims" clm_aud_nbr="585478461201*" | stats count by prov_tin | rename prov_tin as tin | fields tin]) |  | stats values(field1) as field1 values(field2) as field2 ... by ID

The subsearch will retrieve all tins, allowing the search for index=provider to be scoped narrowly for just the tins you need. If you throw billions of events at this it'll still be slow of course.

0 Karma

martin_mueller
SplunkTrust
SplunkTrust

I don't see why it shouldn't. Supply three OR'd sources of data, compute a "join field", run stats, filter.

0 Karma

ddrillic
Ultra Champion

Thank you Martin!!!

Based on what you suggested we did -

(index=claims calc_pd_amt!=0.0) OR (index=provider cos_prov_spcl_cd = 13)
| eval prov_tin = coalesce(tin, prov_tin)
| stats sum(calc_pd_amt) as total, values(cos_prov_spcl_cd) as cos_prov_spcl_cd by prov_tin 
| search total != 0.0 cos_prov_spcl_cd = 13

It works amazingly well with the two billion claims.

The client is now asking whether the solution would work with three tables as well. Their use case, is based on three tables...

0 Karma

martin_mueller
SplunkTrust
SplunkTrust

This inherently has to be either slow or impossible, both in Hunk and Splunk. You're asking for two entire indexes to be returned to the search head to then be joined - that's literally looking for trouble.

Instead, consider more Splunky ways of joining, see https://answers.splunk.com/answers/129424/how-to-compare-fields-over-multiple-sourcetypes-without-jo... or http://wiki.splunk.com/Virtual_.conf (March 2016).

What are you actually trying to achieve?

ddrillic
Ultra Champion

I'm trying to see whether the join command works at all in Hunk at this scale.

The following works for one claim and returns results immediately -

 index="claims"  clm_aud_nbr="585478461201*"

However,

index="claims"  clm_aud_nbr="585478461201*" | join prov_tin [search index=provider | eval prov_tin = tin]

Spins and spins.... after 1/4 billion events the query processing just stops. Apparently, the first query part, until the first pipe tries to retrieve all results from the two billion claims. How can we avoid this lengthy process and return, let's say, just the first match?

0 Karma
Get Updates on the Splunk Community!

Extending Observability Content to Splunk Cloud

Watch Now!   In this Extending Observability Content to Splunk Cloud Tech Talk, you'll see how to leverage ...

More Control Over Your Monitoring Costs with Archived Metrics GA in US-AWS!

What if there was a way you could keep all the metrics data you need while saving on storage costs?This is now ...

New in Observability Cloud - Explicit Bucket Histograms

Splunk introduces native support for histograms as a metric data type within Observability Cloud with Explicit ...