Splunk Search

Is it possible to join on one OR another field with a single join command?

redc
Builder

I have two sets of data that I'm trying to join. Both data sets have a SystemMessageId field, but in the second data set, its value may be empty (when the value of another field is set to "Domain" instead of "Message"). Is it possible to do a join on SystemMessageId or another field, as opposed to SystemMessageId and another field?

Sample data set A:

_time SystemMessageId User
2015-01-28 08:49:31 6e00000018fb1d user@domain3.com
2015-01-28 08:15:47 6e0000001263cf user@domain2.com
2015-01-28 07:50:07 6e000000117402 user@domain1.com

Sample data set B:

_time Message_or_Domain SystemMessageId User_or_Domain Message
2015-01-28 15:48:05 Message 6e000000117402 user@domain1.com Try+again+later
2015-01-28 08:15:47 Domain - domain2.com DNS+fail
2015-01-28 07:50:07 Message 6e000000117402 user@domain1.com Try+again+later

Desired result of join:

_time SystemMessageId User Message_or_Domain User_or_Domain Message Count
2015-01-28 08:49:31 6e00000018fb1d user@domain3.com - - - -
2015-01-28 08:15:47 6e0000001263cf user@domain2.com Domain domain2.com DNS+fail 1
2015-01-28 07:50:07 6e000000117402 user@domain1.com Message user@domain1.com Try+again+later 2

For event 3 in data set A, I would want to join on SystemMessageId, but for event 2 in data set A, I'd want to do an eval to extract the domain from the User field and join that against User_or_Domain in data set B (and I don't want to drop event 1 in data set A; I just want to populate "-" in the empty fields for it). Because SystemMessageId may be empty in data set B, I can't join on both SystemMessageId and User_or_Domain.

The search I'm imagining would look something like this...

index="datasetA" | 
eval User_or_Domain=mvindex(split(User, "@"), 1) | 
join type=outer (SystemMessageId OR User_or_Domain) [search index="datasetB" | 
stats values(Message_or_Domain) as Message_or_Domain, values(Message) as Message, count as Count by SystemMessageId, User_or_Domain] | 
table _time, SystemMessageId, User, Message_or_Domain, User_or_Domain, Message, Count | 
fillnull value="-" Message_or_Domain, User_or_Domain, Message, Count

I could, of course, do two joins: one to match on SystemMessageId, then eval the User_or_Domain field in data set A and, if that's not empty, join again to data set B...but that seems like a lot of extra overhead.

Example search with two joins:

index="datasetA" | 
join type=outer SystemMessageId [search index="datasetB" | 
stats values(Message_or_Domain) as Message_or_Domain, values(Message) as Message, values(User_or_Domain) as UOD, count as Count by SystemMessageId] | 
eval User_or_Domain=if(isnull(UOD), mvindex(split(User, "@"), 1), "") | 
join type=outer User_or_Domain [search index="datasetB" | 
stats values(Message_or_Domain) as Message_or_Domain, values(Message) as Message, count as Count by User_or_Domain] | 
table _time, SystemMessageId, User, Message_or_Domain, User_or_Domain, Message, Count | 
fillnull value="-" Message_or_Domain, User_or_Domain, Message, Count

Runals
Motivator

My brain isn't quite awake and I'm doing this without the data in front of me in Splunk. There are a couple of tricky elements to your data, given the time element, that might or might not factor in. If it were just sourcetype B, it would be easy. Both of those come into play given line 2 from both data set examples - how do you know it was THAT user from domain2 unless you are correlating on time? So with either a high-speed series of stats commands OR a join/append, you are probably going to end up with multivalue fields, given you want to preserve the user from domain3. I'd try something like this as a start, but I'm not sure how it would play out IRL, and there would be more work to do. I'm guessing you are using 6.x; otherwise you would need to break out the foreach into separate evals.

sourcetype=a OR sourcetype=b | 
rename User_or_Domain as User | 
rex field=User "(?:(?<user>[^@]+)@)?(?<domain>\S+)" | 
foreach SystemMessageId user [eval <<FIELD>> = if(isnotnull(<<FIELD>>), <<FIELD>>, "-")] | 
table _time user domain SystemMessageId Message_or_Domain Message | 
stats last(_time) as first_time first(_time) as last_time values(Message_or_Domain) as MoD values(Message) as Message by user domain SystemMessageId | 
fillnull value=0 count
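
(Pre-6.x, that foreach would have to be broken out into separate evals, something like:)

eval SystemMessageId=if(isnotnull(SystemMessageId), SystemMessageId, "-") | 
eval user=if(isnotnull(user), user, "-")

(Or just a fillnull value="-" SystemMessageId user, which does the same thing.)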

The table thrown in there isn't needed, but it would allow you to cut out the pipes after it to see what the data looks like before moving to the next step. I think that would give you a line item for domain3, but probably a count that is too high for domains 1 & 2. Might need to do this as an append:

sourcetype=a | 
rex field=User "(?:(?<user>[^@]+)@)?(?<domain>\S+)" | 
table _time SystemMessageId user domain | 
append [search sourcetype=b | 
rex field=User_or_Domain "(?:(?<user>[^@]+)@)?(?<domain>\S+)" | 
foreach user SystemMessageId [eval <<FIELD>> = if(isnotnull(<<FIELD>>), <<FIELD>>, "-")] | 
stats count by SystemMessageId user domain Message]

What you might run into, though, is that appends, by default, will stop running after 60s or 50k results. Hopefully you don't have that many DNS issues, but /shrug. To merge the data after the append, you probably won't be able to get away with a | stats values(*) as * by <field list>, as that will get the users/messages out of whack. You are probably looking at doing a | stats sum or | stats count by several fields. I didn't monkey with time in this second one, but you will want to preserve/use that from one or the other dataset (or both if you really need it, but that gets fun). Remember that Splunk searches in reverse chronological order, which is why the first time seen is really the most recent or last event. If you use min/max on time in stats, that is reversed, since epoch time is counted forward from a time in the past (max(_time) as last, min(_time) as first). You could join on time if you wanted, if your margin of error is within the same minute or hour. Hopefully the date_hour etc. fields are present; otherwise you could create them from the time elements in the data.
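
(As a rough, untested sketch of that merge step, where ... is the append search above, keying on the extracted domain since it exists on both sides; with multiple users per domain the values() calls would smear results together, which is the multivalue issue I mentioned:)

... | stats max(_time) as last_time min(_time) as first_time values(SystemMessageId) as SystemMessageId values(user) as user values(Message) as Message sum(count) as count by domain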

Anyway, hopefully that is enough to get you going on the full dataset. Interesting issue.

redc
Builder

You might be onto something here...

The data I'm working with IRL is a LOT more complex than the example I presented (and the use case for the data imposes some requirements that I'm not presenting), so I can't just do index="datasetA" OR index="datasetB", but I can grab one and then append the other...so I did that. In the append for datasetB, I created a new column called "Count" and set it equal to "1", then after the append, set a fillnull to "0" (zero) for that column (which made all the datasetA data have "0"). That allowed me to throw a "sum(Count) as Count" in the stats command.
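
(In other words, something like this sketch, using the simplified index names from my example; the real search has more going on, and the by-clause here is just illustrative:)

index="datasetA" | 
append [search index="datasetB" | eval Count=1] | 
fillnull value=0 Count | 
stats values(Message_or_Domain) as Message_or_Domain values(Message) as Message sum(Count) as Count by SystemMessageId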

I'm still trying to figure out how to make the "Domain" type events from datasetB match the events from datasetA... Right now, the "Domain" datasetB entries never match any entries from datasetA because datasetA ALWAYS has a non-"-" value for User and for SystemMessageId (datasetB doesn't). I think with the start/end times, I should be able to do something like, "if a datasetA event's timestamp falls within a datasetB event's start/end window, then match." Still working on that part...
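
(Something like this is what I have in mind for the window test; start_time and end_time are hypothetical fields I'd still have to derive from the datasetB events, since they aren't in the sample data:)

... | eval window_match=if(_time>=start_time AND _time<=end_time, 1, 0) | where window_match=1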

Just a side note: there's a typo in your foreach; should be:

if(isnotnull(<<FIELD>>), <<FIELD>>, "-")

I'm also not sure where you intended for the "count" column to be created in the stats command, but the way the search is currently written, it causes "count" to always be 0.

Runals
Motivator

I usually forget the trailing parens even while creating searches in Splunk, so it's no surprise I forgot them in this context; updated =). The intent of the stats command within the append of the second query was to get the count of events in the second dataset prior to linking it back up with the data from the parent search; that way your user@domain1 would show a count of 2. Guess it isn't working /shrug. Your method of adding a 0 and then doing a sum is a great solution.

One thing to think about is that sometimes the facets of the data are complex enough to warrant multiple views - perhaps a dashboard with a couple of panels is the better way to go. If you are after alerts, you could create saved searches and link them into your dashboard (so they could have individual scheduled alerts), or just make the queries all inline and schedule the dashboard to be delivered as a PDF. I guess if you went down that route you could make all the searches saved searches, but that adds clutter imo /shrug.

At any rate my motto is solve for 80% and move to the next thing. Having multiple panels on a dashboard that gets reviewed once a week (whatever) via emailed PDF or that someone can reference after getting an alert solves for a number of use cases. It won't solve for one query to rule them all though 😉

somesoni2
Revered Legend

Is there any relation between the timestamps of the two data sources?

redc
Builder

There can be, but it's not reliable (I'd say in maybe 10-20% of lines from data set B, the timestamps are within a few seconds of each other, but the rest of the time, they could be minutes, hours, or even days apart).

I guess I didn't show that very well in the sample data...

MuS
Legend

redc
Builder

The problem isn't with using different sourcetypes (95% of what I do uses multiple sourcetypes). The problem is that the two sourcetypes don't have a consistent/reliable related field (or a consistent/reliable way of generating one).

I'm actually probably going to be stuck dropping the "Domain" type messages, because I'm starting to find there are a lot of complications in relating that data to the data in data set A. Without going directly to the originating database (which is a path I'm not eager to go down), it may simply not be possible to match "Domain" messages with their corresponding messages in data set A.
