Splunk Search

How can I join two huge datasets (millions of records) with different numbers of fields (and no timestamp)?

thambisetty
SplunkTrust
SplunkTrust

Hi Splunkers,

I have tried stats dc(sourcetype) as count by commonfield | where count > 1. I assume this search is fetching all records to SH and doing statistics as a results its consuming space in Gigs in search head. its very time consuming. if you have done some thing to two compare two big data sets please guys advice me.

Note: These two data sets don't have a timestamp. The size of the two data sets is 300 million records x 1 field and 20 million records x 3 fields.

————————————
If this helps, give a like below.
0 Karma
1 Solution

somesoni2
Revered Legend

First of all, no sir.

You can do something like this (for inner join, data available in both indexes). It would be better if you can also provide your expected output.

index=a OR index=b | fields index field1 field2 field3 field4 | eval commonfield=coalesce(field3,field4) | stats dc(index) as indexes values(field1) as field1 values(field2) as field2 by commonfield | where indexes=2

View solution in original post

somesoni2
Revered Legend

First of all, no sir.

You can do something like this (for inner join, data available in both indexes). It would be better if you can also provide your expected output.

index=a OR index=b | fields index field1 field2 field3 field4 | eval commonfield=coalesce(field3,field4) | stats dc(index) as indexes values(field1) as field1 values(field2) as field2 by commonfield | where indexes=2

thambisetty
SplunkTrust
SplunkTrust

Thanks for your answers. fields and coalesce did good job.

————————————
If this helps, give a like below.

DalJeanis
Legend

@thambisetty - you are welcome! We've moved the comment to answer. Please accept somesoni2's answer so that the question will show as solved.

0 Karma

thambisetty
SplunkTrust
SplunkTrust

first of all thanks a lot for your swift response..
Expected output:

field1,field2,field3

but, my concern is its taking more storage space while calculating. I have not used coalesce.. i used stats after renaming the field4 to field3.

this is what i have done.
index=a OR index=b | rename field4 as field3 | stats dc(index) as count values(field1) values(field2) by field3 | where count>1

its killing my storage. is there any difference using rename and coaleasce.

————————————
If this helps, give a like below.
0 Karma

DalJeanis
Legend

The fields command is the important one for reducing storage. If you don't put that one in, then your | stats list(*) is keeping literally EVERY field in both files.

Try this -

index=a OR index=b | fields field1 field2 field3 field4 
| rename field4 as field3 
| stats dc(index) as indexcount, values(field1) as field1, values(field2) as field2 by field3 
| where indexcount>1
0 Karma

thambisetty
SplunkTrust
SplunkTrust

Thanks for your answers..

I have tried renaming field and i dont have more than 3 fields in one dataset and one field in another dataset.

I am much interested to know if there is any way to achieve this .easily

————————————
If this helps, give a like below.
0 Karma

somesoni2
Revered Legend

I'm afraid it's because of too many rows. Sometime including a fields or table command to limit the number of fields helps.

index=a OR index=b | table field1 field2 field3 field3 | eval field3=coalesce(field3,field4)  | stats dc(index) as count values(field1) values(field2) by field3 | where count>1
0 Karma

thambisetty
SplunkTrust
SplunkTrust

@somesoni2,

Sir, I have been told that we can use coalesce to join two big data sets.

I have seen that you have used coalesce in post like below,

index=abc OR index=def | eval commonfield=coalesce(field1,field4) | makemv commonfield delim="," | mvexpand commonfield | stats list() as * by commonfield | where isnotnull(field4) | eval temp=mvzip(field4,field5,"#") | mvexpand temp | rex field=temp "(?.)#(?.*)" | fields - temp commonfield

but, I am not sure how to use that.

please help me.

————————————
If this helps, give a like below.
0 Karma

thambisetty
SplunkTrust
SplunkTrust

sample data:

index a:
field1,field2,field3

index b:
field4

field3 and field4 are same. but there is not timestamp in those fields.

————————————
If this helps, give a like below.
0 Karma
Get Updates on the Splunk Community!

Adoption of RUM and APM at Splunk

    Unleash the power of Splunk Observability   Watch Now In this can't miss Tech Talk! The Splunk Growth ...

Routing logs with Splunk OTel Collector for Kubernetes

The Splunk Distribution of the OpenTelemetry (OTel) Collector is a product that provides a way to ingest ...

Welcome to the Splunk Community!

(view in My Videos) We're so glad you're here! The Splunk Community is place to connect, learn, give back, and ...