Solved: Merging 3 interlinked large data sets with differe...

splunk4now · ‎08-17-2017

I have 3 data sets (say src1, src2, sr3), with merged resultsets of single merge greater than the 50k limit - hence normal are ruled out. In this case src1 references a field in src2 and src2 references a different field in src3. The objective is to merge across all three data sets based on certain conditions in two individual merges (src1 & src2 ; sr2 & src3). I managed to get two sources merged, however merging the third seems perplexing due to limitations on sub-search or join over the stats results. Any thoughts ?

Code for merging two sources:

sourcetype="srctype:source1" OR sourcetype="srctype:source2"
 | eval new_gid = coalesce(ref_src1,ref_src2)
 | stats list(field1_src1) as field1_src1 
         list(field2_src1) as field2_src1
         list(field3_src1) as field3_src1
         list(field1_src2) as field1_src2 
         list(field2_src2) as field2_src2
   by new_gid
 | where field3_src1="cat1" OR field3_src1="cat2"

A single stats result row for the above search would look something like this.

new_gid               field1_src1    field2_src1    field3_src1    field1_src2    field2_src2
03759542db9fb2404     3958294        5              cat1           69.13          c2bf762edb7b1600ab
                                                                   266.7          7f4f9f7adb44e600b1
                                                                   0.01           7f4f9f7adb44e600b1

This result-set would have to be merged with src_3 using field2_sr2, so that only one value remains instead of three (based on values of certain fields in src3).

DalJeanis · ‎08-18-2017

Start with getting ONLY the relevant records from all files. Do the limits on source 3 up front.

Use eventstats to roll the data from the farthest-right file and then kill the file, one file at a time, until you're ready for stats.

   (sourcetype="S1" (field3_src1="cat1" OR field3_src1="cat2") )
OR  sourcetype="S2" 
OR (sourcetype="S3" (field1_src3="bar" OR field2_src3="baz") )

| rename COMMENT as "now we limit to the relevant fields"
| fields sourcetype ref_src1 field1_src1 field2_src1 field3_src1 ref_src2 field1_src2 ref_src3 field2_src2 field1_src3 field2_src3

| rename COMMENT as "roll the data from source3 onto source2, then drop all source3 and any source2 that have no match "
| eval ref_src3=coalesce(ref_src3,field2_src2)
| eventstats values(field1_src3) as field1_src3, values(field2_src3) as field2_src3 by ref_src3
| where sourcetype="S1" OR ( sourcetype="S2" AND isnotnull(field1_src3))

| rename COMMENT as "roll all the remaining data together now"
| eval new_gid = coalesce(ref_src1,ref_src2)
| stats list(*) as * by new_gid

View solution in original post

DalJeanis · ‎08-18-2017

Start with getting ONLY the relevant records from all files. Do the limits on source 3 up front.

Use eventstats to roll the data from the farthest-right file and then kill the file, one file at a time, until you're ready for stats.

   (sourcetype="S1" (field3_src1="cat1" OR field3_src1="cat2") )
OR  sourcetype="S2" 
OR (sourcetype="S3" (field1_src3="bar" OR field2_src3="baz") )

| rename COMMENT as "now we limit to the relevant fields"
| fields sourcetype ref_src1 field1_src1 field2_src1 field3_src1 ref_src2 field1_src2 ref_src3 field2_src2 field1_src3 field2_src3

| rename COMMENT as "roll the data from source3 onto source2, then drop all source3 and any source2 that have no match "
| eval ref_src3=coalesce(ref_src3,field2_src2)
| eventstats values(field1_src3) as field1_src3, values(field2_src3) as field2_src3 by ref_src3
| where sourcetype="S1" OR ( sourcetype="S2" AND isnotnull(field1_src3))

| rename COMMENT as "roll all the remaining data together now"
| eval new_gid = coalesce(ref_src1,ref_src2)
| stats list(*) as * by new_gid

splunk4now · ‎08-19-2017

This works ! Thank for the new perspective. I will post the final results more once I do a further analysis on Monday

richgalloway · ‎08-18-2017

I'm confused about your "limitations on sub-search or join" comment since your query contains neither a subsearch nor a join.
What happens when you run this query?

sourcetype="srctype:source1" OR sourcetype="srctype:source2" OR sourcetype="srctype:source3"
  | eval new_gid = coalesce(ref_src1,ref_src2,ref_src3)
  | stats list(field1_src1) as field1_src1 
          list(field2_src1) as field2_src1
          list(field3_src1) as field3_src1
          list(field1_src2) as field1_src2 
          list(field2_src2) as field2_src2
          list(field1_src3) as field1_src3
          list(field2_src3) as field2_src3
    by new_gid
  | where field3_src1="cat1" OR field3_src1="cat2" OR field3_src1="cat3"

---
If this reply helps you, Karma would be appreciated.

splunk4now · ‎08-18-2017

I meant the imposed limitations on join/sub-search resultset and I was trying out a mechanism to merge "resultset of sr1 and src2" with src3. The above search query (which i had already tried earlier) does not provide any values for field1/field2 from src3, since I assume that coalesce function is expecting for a same key value in all 3 sources. In this case src1/src2 share a common reference key and src2/sr3 share a different common reference key.

Merging 3 interlinked large data sets with different ref keys in two individual merges

Welcome to the Splunk Community!

Tech Talk | Elevating Digital Service Excellence: The Synergy of Splunk RUM & APM

Adoption of RUM and APM at Splunk