Splunk Search

Join is not working correctly for certain values

bshega
Explorer

I'm trying to join 3 types of data, we have recordings which belong to a shower which belongs to a user. For some users the join works perfectly but for others, it isn't returning anything.

A simplified version of the join looks something like this:

index="iot-production" trigger="NewShowerEvent" | spath output=serial_number path=object.serial_number | join serial_number 
max=0 [ search index="iot-production" trigger="NewRecordingEvent" | spath output=serial_number path=object.serial_number ] | 
search serial_number=1004285

But I am getting 0 results for this. If I run the 2 queries separate:

index="iot-production" trigger="NewShowerEvent" | search object.serial_number=1004285

index="iot-production" trigger="NewRecordingEvent" | search object.serial_number=1004285

I get results in both with the serial numbers matching.

What could be the reasoning behind this?

Tags (2)
0 Karma

woodcock
Esteemed Legend

In such cases, I try to make a RegEx-based field extraction that ignores the XML/JSON and pulls out the fields/values that I need for this sourcetype to do a proper base search with filters to keep only the stuff that I know that I need and then pass the relevant subset of events to spath to do the fine-detail work that requires the other pile of fields.

0 Karma

somesoni2
Revered Legend

Try this (you missed object. in your search condition)

 index="iot-production" trigger="NewShowerEvent" | spath output=serial_number path=object.serial_number | join serial_number 
 max=0 [ search index="iot-production" trigger="NewRecordingEvent" | spath output=serial_number path=object.serial_number ] |     search object.serial_number=1004285

Also, join command is expensive to think about changing your query to use stats, similar to this

 index="iot-production" trigger="NewShowerEvent" OR trigger="NewRecordingEvent" 
| spath output=serial_number path=object.serial_number  | search object.serial_number=1004285
| stats values(fieldyouwant1) as fieldyouwant1 values(fieldyouwant2) as fieldyouwant2.. by object.serial_number
0 Karma

DalJeanis
Legend

Try each of these

 index="iot-production" trigger="NewShowerEvent" 
 | spath output=serial_number path=object.serial_number 
 | search serial_number=1004285


 | index="iot-production" trigger="NewRecordingEvent" 
 | spath output=serial_number path=object.serial_number
 | search serial_number=1004285

If you get valid results from both of those, then the issue is probably that you have more than 50K results in the right side of the join, or you have run into a timeout situation. In other words, that right side of the join does not i practice include every result that it should, in theory.

This is one reason that we prefer to avoid join when other verbs will do the job.


If you are only looking for a single returned value, then restrict each side of the search., so the number of results for the subsearch becomes irrelevant.

  index="iot-production" trigger="NewShowerEvent" "1004285"
 | spath output=serial_number path=object.serial_number 
 | join serial_number  max=0 [ search 
      index="iot-production" trigger="NewRecordingEvent"  "1004285"
     | spath output=serial_number path=object.serial_number ] 
 | search serial_number=1004285

Note - the final search may not be redundant, based on your data. If there are any other serial numbers that happen to have the number "1004285" as a single unit somewhere in their data fields, they would be allowed back from the initial search(es).


The stats version of your join is going to look something like this...

  index="iot-production" (trigger="NewShowerEvent" OR trigger="NewRecordingEvent" )
  | fields (... list every field you need to retain from either kind of event ...)
  | spath output=serial_number path=object.serial_number 
  | stats values(foo) as foo values(bar) as bar values(baz) as baz by serial_number

if you want literally every field that is remaining on the events on both sides, then the last command can be simplified as ...

  | stats values(*) as * by serial_number

I do not know precisely how Splunk handles JSONs in that stats clause, so you will need to experiment to verify you are getting everything you need. Also, this will lump both sides together, so if there is a variable named "foo" on each side, you may need to use one of these constructs...

  index="iot-production" (trigger="NewShowerEvent" OR trigger="NewRecordingEvent" )
  | fields (... list every field you need to retain from either kind of event ...)
  | spath output=serial_number path=object.serial_number 
  | eval foo_shower = case(trigger="NewShowerEvent",foo)
  | eval foo_record = case(trigger="NewRecordingEvent",foo)
  | stats values(foo*) as foo* values(bar) as bar values(baz) as baz by serial_number
0 Karma

bshega
Explorer

Yes, I have both of those in my question, they are both returning valid results. What can be used instead of a join then?

0 Karma

bshega
Explorer

So I think I'm not conveying my issue good enough here. We're doing our initial search on the email address of the user, this is tied to a NewUserEvent then we're trying to get a value from the last recording for this user's showers.

So our support team types in an email address and we search for all of the showers for that email.

This is a join between NewUserEvent and NewShowerEvent but then we need information for the recording.

So we have an email address that gives us a user, this user has an ID. We join where the ID of the user matches the user_id of the shower. The shower has a serial_number we use this to then join the shower with the recording on serial_number. I hope this makes sense and that I'm explaining it well enough

in psuedocode I basically need SELECT * FROM Users JOIN Showers on Users.id = Showers.user_id JOIN Recordings on Showers.serial_number = Recordings.serial_number where Users.email = "blah@blah.com"

0 Karma

DalJeanis
Legend

@bshega - the code you posted for each side didn't include the spath, so I wanted to be doubly sure. You should be able to adapt the code I've given you to your use case, based on what information you want to keep from each side of the "join".

0 Karma

bshega
Explorer

Hmm, you might be on to something with the timeout, the spath really hurts us for the recordings because we have millions of them so I think it's scanning them sequentially with the spath in there. There has to be a way to search/join without doing spath on each object, right?

0 Karma
Get Updates on the Splunk Community!

Introducing the 2024 SplunkTrust!

Hello, Splunk Community! We are beyond thrilled to announce our newest group of SplunkTrust members!  The ...

Introducing the 2024 Splunk MVPs!

We are excited to announce the 2024 cohort of the Splunk MVP program. Splunk MVPs are passionate members of ...

Splunk Custom Visualizations App End of Life

The Splunk Custom Visualizations apps End of Life for SimpleXML will reach end of support on Dec 21, 2024, ...