Splunk Search

Merging 3 interlinked large data sets with different ref keys in two individual merges

splunk4now
Explorer

I have 3 data sets (say src1, src2, sr3), with merged resultsets of single merge greater than the 50k limit - hence normal are ruled out. In this case src1 references a field in src2 and src2 references a different field in src3. The objective is to merge across all three data sets based on certain conditions in two individual merges (src1 & src2 ; sr2 & src3). I managed to get two sources merged, however merging the third seems perplexing due to limitations on sub-search or join over the stats results. Any thoughts ?

Code for merging two sources:

sourcetype="srctype:source1" OR sourcetype="srctype:source2"
 | eval new_gid = coalesce(ref_src1,ref_src2)
 | stats list(field1_src1) as field1_src1 
         list(field2_src1) as field2_src1
         list(field3_src1) as field3_src1
         list(field1_src2) as field1_src2 
         list(field2_src2) as field2_src2
   by new_gid
 | where field3_src1="cat1" OR field3_src1="cat2"

A single stats result row for the above search would look something like this.

new_gid               field1_src1    field2_src1    field3_src1    field1_src2    field2_src2
03759542db9fb2404     3958294        5              cat1           69.13          c2bf762edb7b1600ab
                                                                   266.7          7f4f9f7adb44e600b1
                                                                   0.01           7f4f9f7adb44e600b1

This result-set would have to be merged with src_3 using field2_sr2, so that only one value remains instead of three (based on values of certain fields in src3).

0 Karma
1 Solution

DalJeanis
SplunkTrust
SplunkTrust

Start with getting ONLY the relevant records from all files. Do the limits on source 3 up front.

Use eventstats to roll the data from the farthest-right file and then kill the file, one file at a time, until you're ready for stats.

   (sourcetype="S1" (field3_src1="cat1" OR field3_src1="cat2") )
OR  sourcetype="S2" 
OR (sourcetype="S3" (field1_src3="bar" OR field2_src3="baz") )

| rename COMMENT as "now we limit to the relevant fields"
| fields sourcetype ref_src1 field1_src1 field2_src1 field3_src1 ref_src2 field1_src2 ref_src3 field2_src2 field1_src3 field2_src3

| rename COMMENT as "roll the data from source3 onto source2, then drop all source3 and any source2 that have no match "
| eval ref_src3=coalesce(ref_src3,field2_src2)
| eventstats values(field1_src3) as field1_src3, values(field2_src3) as field2_src3 by ref_src3
| where sourcetype="S1" OR ( sourcetype="S2" AND isnotnull(field1_src3))

| rename COMMENT as "roll all the remaining data together now"
| eval new_gid = coalesce(ref_src1,ref_src2)
| stats list(*) as * by new_gid

View solution in original post

0 Karma

DalJeanis
SplunkTrust
SplunkTrust

Start with getting ONLY the relevant records from all files. Do the limits on source 3 up front.

Use eventstats to roll the data from the farthest-right file and then kill the file, one file at a time, until you're ready for stats.

   (sourcetype="S1" (field3_src1="cat1" OR field3_src1="cat2") )
OR  sourcetype="S2" 
OR (sourcetype="S3" (field1_src3="bar" OR field2_src3="baz") )

| rename COMMENT as "now we limit to the relevant fields"
| fields sourcetype ref_src1 field1_src1 field2_src1 field3_src1 ref_src2 field1_src2 ref_src3 field2_src2 field1_src3 field2_src3

| rename COMMENT as "roll the data from source3 onto source2, then drop all source3 and any source2 that have no match "
| eval ref_src3=coalesce(ref_src3,field2_src2)
| eventstats values(field1_src3) as field1_src3, values(field2_src3) as field2_src3 by ref_src3
| where sourcetype="S1" OR ( sourcetype="S2" AND isnotnull(field1_src3))

| rename COMMENT as "roll all the remaining data together now"
| eval new_gid = coalesce(ref_src1,ref_src2)
| stats list(*) as * by new_gid
0 Karma

splunk4now
Explorer

This works ! Thank for the new perspective. I will post the final results more once I do a further analysis on Monday

richgalloway
SplunkTrust
SplunkTrust

I'm confused about your "limitations on sub-search or join" comment since your query contains neither a subsearch nor a join.
What happens when you run this query?

sourcetype="srctype:source1" OR sourcetype="srctype:source2" OR sourcetype="srctype:source3"
  | eval new_gid = coalesce(ref_src1,ref_src2,ref_src3)
  | stats list(field1_src1) as field1_src1 
          list(field2_src1) as field2_src1
          list(field3_src1) as field3_src1
          list(field1_src2) as field1_src2 
          list(field2_src2) as field2_src2
          list(field1_src3) as field1_src3
          list(field2_src3) as field2_src3
    by new_gid
  | where field3_src1="cat1" OR field3_src1="cat2" OR field3_src1="cat3"
---
If this reply helps you, Karma would be appreciated.
0 Karma

splunk4now
Explorer

I meant the imposed limitations on join/sub-search resultset and I was trying out a mechanism to merge "resultset of sr1 and src2" with src3. The above search query (which i had already tried earlier) does not provide any values for field1/field2 from src3, since I assume that coalesce function is expecting for a same key value in all 3 sources. In this case src1/src2 share a common reference key and src2/sr3 share a different common reference key.

0 Karma
Get Updates on the Splunk Community!

Welcome to the Splunk Community!

(view in My Videos) We're so glad you're here! The Splunk Community is place to connect, learn, give back, and ...

Tech Talk | Elevating Digital Service Excellence: The Synergy of Splunk RUM & APM

Elevating Digital Service Excellence: The Synergy of Real User Monitoring and Application Performance ...

Adoption of RUM and APM at Splunk

    Unleash the power of Splunk Observability   Watch Now In this can't miss Tech Talk! The Splunk Growth ...