Solved: How can I join two huge datasets (millions of reco...

thambisetty · ‎08-10-2017

Hi Splunkers,

I have tried stats dc(sourcetype) as count by commonfield | where count > 1. I assume this search is fetching all records to SH and doing statistics as a results its consuming space in Gigs in search head. its very time consuming. if you have done some thing to two compare two big data sets please guys advice me.

Note: These two data sets don't have a timestamp. The size of the two data sets is 300 million records x 1 field and 20 million records x 3 fields.

————————————
If this helps, give a like below.

somesoni2 · ‎08-10-2017

First of all, no sir.

You can do something like this (for inner join, data available in both indexes). It would be better if you can also provide your expected output.

index=a OR index=b | fields index field1 field2 field3 field4 | eval commonfield=coalesce(field3,field4) | stats dc(index) as indexes values(field1) as field1 values(field2) as field2 by commonfield | where indexes=2

View solution in original post

somesoni2 · ‎08-10-2017

First of all, no sir.

You can do something like this (for inner join, data available in both indexes). It would be better if you can also provide your expected output.

index=a OR index=b | fields index field1 field2 field3 field4 | eval commonfield=coalesce(field3,field4) | stats dc(index) as indexes values(field1) as field1 values(field2) as field2 by commonfield | where indexes=2

thambisetty · ‎08-11-2017

Thanks for your answers. fields and coalesce did good job.

————————————
If this helps, give a like below.

DalJeanis · ‎08-11-2017

@thambisetty - you are welcome! We've moved the comment to answer. Please accept somesoni2's answer so that the question will show as solved.

thambisetty · ‎08-10-2017

first of all thanks a lot for your swift response..
Expected output:

field1,field2,field3

but, my concern is its taking more storage space while calculating. I have not used coalesce.. i used stats after renaming the field4 to field3.

this is what i have done.
index=a OR index=b | rename field4 as field3 | stats dc(index) as count values(field1) values(field2) by field3 | where count>1

its killing my storage. is there any difference using rename and coaleasce.

————————————
If this helps, give a like below.

DalJeanis · ‎08-10-2017

The fields command is the important one for reducing storage. If you don't put that one in, then your | stats list(*) is keeping literally EVERY field in both files.

Try this -

index=a OR index=b | fields field1 field2 field3 field4 
| rename field4 as field3 
| stats dc(index) as indexcount, values(field1) as field1, values(field2) as field2 by field3 
| where indexcount>1

thambisetty · ‎08-10-2017

Thanks for your answers..

I have tried renaming field and i dont have more than 3 fields in one dataset and one field in another dataset.

I am much interested to know if there is any way to achieve this .easily

————————————
If this helps, give a like below.

somesoni2 · ‎08-10-2017

I'm afraid it's because of too many rows. Sometime including a fields or table command to limit the number of fields helps.

index=a OR index=b | table field1 field2 field3 field3 | eval field3=coalesce(field3,field4)  | stats dc(index) as count values(field1) values(field2) by field3 | where count>1

thambisetty · ‎08-10-2017

@somesoni2,

Sir, I have been told that we can use coalesce to join two big data sets.

I have seen that you have used coalesce in post like below,

but, I am not sure how to use that.

please help me.

————————————
If this helps, give a like below.

thambisetty · ‎08-10-2017

sample data:

index a:
field1,field2,field3

index b:
field4

field3 and field4 are same. but there is not timestamp in those fields.

————————————
If this helps, give a like below.

How can I join two huge datasets (millions of records) with different numbers of fields (and no timestamp)?

Adoption of RUM and APM at Splunk

Routing logs with Splunk OTel Collector for Kubernetes

Welcome to the Splunk Community!