I have 3 data sets (say src1, src2, sr3), with merged resultsets of single merge greater than the 50k limit - hence normal are ruled out. In this case src1 references a field in src2 and src2 references a different field in src3. The objective is to merge across all three data sets based on certain conditions in two individual merges (src1 & src2 ; sr2 & src3). I managed to get two sources merged, however merging the third seems perplexing due to limitations on sub-search or join over the stats results. Any thoughts ?
Code for merging two sources:
sourcetype="srctype:source1" OR sourcetype="srctype:source2"
| eval new_gid = coalesce(ref_src1,ref_src2)
| stats list(field1_src1) as field1_src1
list(field2_src1) as field2_src1
list(field3_src1) as field3_src1
list(field1_src2) as field1_src2
list(field2_src2) as field2_src2
by new_gid
| where field3_src1="cat1" OR field3_src1="cat2"
A single stats result row for the above search would look something like this.
new_gid field1_src1 field2_src1 field3_src1 field1_src2 field2_src2
03759542db9fb2404 3958294 5 cat1 69.13 c2bf762edb7b1600ab
266.7 7f4f9f7adb44e600b1
0.01 7f4f9f7adb44e600b1
This result-set would have to be merged with src_3 using field2_sr2, so that only one value remains instead of three (based on values of certain fields in src3).
Start with getting ONLY the relevant records from all files. Do the limits on source 3 up front.
Use eventstats
to roll the data from the farthest-right file and then kill the file, one file at a time, until you're ready for stats
.
(sourcetype="S1" (field3_src1="cat1" OR field3_src1="cat2") )
OR sourcetype="S2"
OR (sourcetype="S3" (field1_src3="bar" OR field2_src3="baz") )
| rename COMMENT as "now we limit to the relevant fields"
| fields sourcetype ref_src1 field1_src1 field2_src1 field3_src1 ref_src2 field1_src2 ref_src3 field2_src2 field1_src3 field2_src3
| rename COMMENT as "roll the data from source3 onto source2, then drop all source3 and any source2 that have no match "
| eval ref_src3=coalesce(ref_src3,field2_src2)
| eventstats values(field1_src3) as field1_src3, values(field2_src3) as field2_src3 by ref_src3
| where sourcetype="S1" OR ( sourcetype="S2" AND isnotnull(field1_src3))
| rename COMMENT as "roll all the remaining data together now"
| eval new_gid = coalesce(ref_src1,ref_src2)
| stats list(*) as * by new_gid
Start with getting ONLY the relevant records from all files. Do the limits on source 3 up front.
Use eventstats
to roll the data from the farthest-right file and then kill the file, one file at a time, until you're ready for stats
.
(sourcetype="S1" (field3_src1="cat1" OR field3_src1="cat2") )
OR sourcetype="S2"
OR (sourcetype="S3" (field1_src3="bar" OR field2_src3="baz") )
| rename COMMENT as "now we limit to the relevant fields"
| fields sourcetype ref_src1 field1_src1 field2_src1 field3_src1 ref_src2 field1_src2 ref_src3 field2_src2 field1_src3 field2_src3
| rename COMMENT as "roll the data from source3 onto source2, then drop all source3 and any source2 that have no match "
| eval ref_src3=coalesce(ref_src3,field2_src2)
| eventstats values(field1_src3) as field1_src3, values(field2_src3) as field2_src3 by ref_src3
| where sourcetype="S1" OR ( sourcetype="S2" AND isnotnull(field1_src3))
| rename COMMENT as "roll all the remaining data together now"
| eval new_gid = coalesce(ref_src1,ref_src2)
| stats list(*) as * by new_gid
This works ! Thank for the new perspective. I will post the final results more once I do a further analysis on Monday
I'm confused about your "limitations on sub-search or join" comment since your query contains neither a subsearch nor a join.
What happens when you run this query?
sourcetype="srctype:source1" OR sourcetype="srctype:source2" OR sourcetype="srctype:source3"
| eval new_gid = coalesce(ref_src1,ref_src2,ref_src3)
| stats list(field1_src1) as field1_src1
list(field2_src1) as field2_src1
list(field3_src1) as field3_src1
list(field1_src2) as field1_src2
list(field2_src2) as field2_src2
list(field1_src3) as field1_src3
list(field2_src3) as field2_src3
by new_gid
| where field3_src1="cat1" OR field3_src1="cat2" OR field3_src1="cat3"
I meant the imposed limitations on join/sub-search resultset and I was trying out a mechanism to merge "resultset of sr1 and src2" with src3. The above search query (which i had already tried earlier) does not provide any values for field1/field2 from src3, since I assume that coalesce function is expecting for a same key value in all 3 sources. In this case src1/src2 share a common reference key and src2/sr3 share a different common reference key.