Solved: Define results when duplicate events have dissimil...

dorgra · ‎12-11-2019

HR data I'm working with has multiple entries for the same user. The hr_id always starts with an Alpha character
followed by from 5-7 numeric characters. The Lan_name varies with no discernable structure. Sometimes, the
Lan_name matches the hr_id. The hr_id is always consistent for each user. The Lan_name value is not always
consistent. Unfortunately, some users show both the hr_id and Lan_name in these multiple entries.
The following is an example of my data:

full_name Job_Title Email Lan_name hr_id
Smith,Tom job1 tsmith@domain.com ts004 S12345
Smith,Tom job1 tsmith@domain.com ts004 S12345
Smith,Tom job1 tsmith@domain.com S12345 S12345
Smith,Tom job1 tsmith@domain.com S12345 S12345
Jones,Jill job2 jjones@domain.com j723b2 j1234567
Jones,Jill job2 jjones@domain.com j1234567 j1234567

Because I dedup both the Lan_name and the hr_id, sometimes I get the preferred

full_name Job_Title Email Lan_name hr_id
Smith,Tom job1 tsmith@domain.com ts004 S12345

But sometimes I get the row that the Lan_name is the same as the hr_id.

full_name Job_Title Email Lan_name hr_id
Smith,Tom job1 tsmith@domain.com S12345 S12345

Keep in mind, sometimes the Lan_name and the hr_id is the same due to users that are newer to the organization.
Can someone could show me how use the Lan_name for Lan_name when it is dissimilar to the hr_id? I can't request
HR make corrections due to the sheer volume of entries.

Below is the sample sample search. I removed most of the noise leaving the full_name evals and regex because of
other HR data input that is not accurate. I can't use subsearches because I'm using a REST call from a windows
application. I would like to avoid lookup tables if at all possible. Finally, I can't make any changes to any of
configuration files on the Splunk servers.

index="hr_index" sourcetype="hr_user_accounts" 
| rename job_title AS Job_Title, email_address AS Email, hr_id as HR_ID
| eval full_name=replace(full_name,"\."," ")
| eval full_name=replace(full_name,"   "," ")
| eval full_name=replace(full_name,"  "," ")
| rex field=full_name "^(?P<full_name>[\d|\s]+)"
| eval tmp=split(full_name,",")
| eval Last_Name=mvindex(tmp,0),First_Name=mvindex(tmp,1) 
| table Last_Name, First_Name, Job_Title, Email, Lan_name, HR_ID

aberkow · ‎12-11-2019

Interesting problem! The answer I came up with was essentially taking all values of Lan_name and hr_id (since dedup has a lot of the same functionality as stats at it's base use-case), expanding all options (when there are more than 1), and then filtering those that are both duplicates and have the same hr_id as lan_name. Example below:

| makeresults count=4              
| streamstats count                 
| eval Lan_name=case(count=1 OR count=2, "ts004", count=3, "j723b2", count=4, "j1234567")        
| eval hr_id=case(count=1 OR count=2, "S12345", count=3 OR count=4, "j1234567")       
| eval full_name=case(count=1 OR count=2, "TSmith", count=3 OR count=4, "JJones")        
| stats values(Lan_name) as Lan_name, values(hr_id) as hr_id by full_name 
| eval countOfLan_name=mvcount(Lan_name) 
| mvexpand Lan_name 
| eval duplicate=if(countOfLan_name>1 AND Lan_name=hr_id, 1, 0) 
| search duplicate=0

The first 5 lines are for data creation after which, each line does:
6. Gets unique list of Lan_name/hr_id by "group term", in this case full_name, in your case, whatever is unique
7. Figure out if there are multiple Lan_names
8. Split out Lan_names, keeping everything else consistent
9. Determine if something is a duplicate by saying "are there two rows in my entry? Yes? Am i the same Lan_name and hr_id?"
10. Filter duplicates out
I made a few assumptions (that in the case where there are multiple Lan_names, you always want the one that is different than hr_id), and that Lan_names is always populated. If Lan_names is not always populated, mvexpand will filter that row out, so add in a fillnull Lan_names value="null".

Hope this helps!

View solution in original post

aberkow · ‎12-11-2019

Interesting problem! The answer I came up with was essentially taking all values of Lan_name and hr_id (since dedup has a lot of the same functionality as stats at it's base use-case), expanding all options (when there are more than 1), and then filtering those that are both duplicates and have the same hr_id as lan_name. Example below:

| makeresults count=4              
| streamstats count                 
| eval Lan_name=case(count=1 OR count=2, "ts004", count=3, "j723b2", count=4, "j1234567")        
| eval hr_id=case(count=1 OR count=2, "S12345", count=3 OR count=4, "j1234567")       
| eval full_name=case(count=1 OR count=2, "TSmith", count=3 OR count=4, "JJones")        
| stats values(Lan_name) as Lan_name, values(hr_id) as hr_id by full_name 
| eval countOfLan_name=mvcount(Lan_name) 
| mvexpand Lan_name 
| eval duplicate=if(countOfLan_name>1 AND Lan_name=hr_id, 1, 0) 
| search duplicate=0

The first 5 lines are for data creation after which, each line does:
6. Gets unique list of Lan_name/hr_id by "group term", in this case full_name, in your case, whatever is unique
7. Figure out if there are multiple Lan_names
8. Split out Lan_names, keeping everything else consistent
9. Determine if something is a duplicate by saying "are there two rows in my entry? Yes? Am i the same Lan_name and hr_id?"
10. Filter duplicates out
I made a few assumptions (that in the case where there are multiple Lan_names, you always want the one that is different than hr_id), and that Lan_names is always populated. If Lan_names is not always populated, mvexpand will filter that row out, so add in a fillnull Lan_names value="null".

Hope this helps!

dorgra · ‎12-11-2019

the 3 eval rows |\
| eval Lan_name=case(count=1 OR count=2, "ts004", count=3, "j723b2", count=4, "j1234567")

| eval hr_id=case(count=1 OR count=2, "S12345", count=3 OR count=4, "j1234567")

| eval full_name=case(count=1 OR count=2, "TSmith", count=3 OR count=4, "JJones")

aberkow · ‎12-11-2019

Oh. That's just my data creation to recreate what you had done above to get to the state you said your output was. If you slap on lines 6-10 to your query that should solve your problem.

dorgra · ‎12-12-2019

This has been very helpful. I can't figure out why this search leaves Job_Title and Email blank. It also shows full_name instead of Last_Name and First_Name. I'm going to go ahead and award the points. You've earned them and you replied so quickly. Very much appreciated. I'm sure the remaining problem is something basic that I'm missing.

dorgra · ‎12-12-2019

Unfortunately, I don't have enough Karma points to award points. Odd that.

aberkow · ‎12-12-2019

Accepting and upvoting the answer is more than enough 🙂 The answer to why that is happening is one of the most common Splunk bugs people run into - transforming commands (such as stats are used by Splunk to optimize raw information (row logs) into tabled information). Therefore, you need to pass any fields you want later in your search through that line. Check out some of the functions available https://docs.splunk.com/Documentation/Splunk/8.0.0/SearchReference/Stats#Stats_function_options. If it's a single value field (1 to 1 mapping of thing to unique bucket in your split by), list and values return the same thing. Otherwise, list returns the full list, and values, the unique set of. I used values because I wanted all unique values, and you probably want to do the same! Try out the syntax and let me know if you run into issues.

dorgra · ‎12-11-2019

Any suggestion on how to pass the values for Lan_name and hr_id to rows 3, 4 and 5? I'm liking this direction. Thanks for getting to it so quickly.

aberkow · ‎12-11-2019

I don't follow your question. What are rows 3, 4 and 5 to you?

Define results when duplicate events have dissimilar field values

Welcome to the Splunk Community!

Tech Talk | Elevating Digital Service Excellence: The Synergy of Splunk RUM & APM

Adoption of RUM and APM at Splunk