Solved: How to search and find all sender email addresses that are similar, but not exact matches to any domain in a list?

packet_hunter · ‎02-01-2016

Scenario: I want to find all sender email addresses whose domain is not an exact match to any domain on a list, but is "similar" to one of them (or contains any part of a domain on the list).

For example, a correct sender address could be sender@company.com, while an incorrect one could be sender@company.org, sender@company-corp.net, sender@companycorporation.us, etc.
Sample code:

index=mail sourcetype=xemail
    [search index=mail sourcetype=xemail subject="Blah" | stats count by UID | fields UID]
| stats list(subject) as subj list(sender) as sender list(recipient) as recp by UID

Please provide an example using correct_domain.csv as the good domain list.

Thank you

aljohnson_splun · ‎02-01-2016

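Have you considered using the cluster command? You can use match=ngramset to compare subcomponents of the sender value, then tweak the t threshold to make the clusters more or less strict (a higher t requires events to be more similar to share a cluster). Ideally you want to do some fuzzy matching, which I don't know how to do natively in Splunk. The nice thing about cluster with match=ngramset is that it compares sets of 3-character substrings (trigrams), so domains that share most of their characters end up grouped together.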

index=mail sourcetype=xemail
    [search index=mail sourcetype=xemail subject="Blah" | stats count by UID | fields UID]
| stats count by sender subject recipient UID
| cluster field=sender match=ngramset labelonly=t t=0.8
| stats values(sender) by cluster_label
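
To get a feel for what comparing trigram sets means, here is a rough Python sketch. It is only an illustration, not Splunk's implementation: scoring the overlap with Jaccard similarity is an assumption, and the helper names and sender values are made up for the example.

```python
def trigrams(text):
    """Return the set of 3-character substrings of text (lowercased)."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def ngram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, between 0.0 and 1.0.

    Assumption: Jaccard is used here for illustration only; the thread
    does not say how cluster scores the overlap internally.
    """
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical sender values, not taken from real data.
senders = ["bob@yahoo.com", "bob@yahoo.co.uk", "bob@example.org"]
for other in senders[1:]:
    print(other, round(ngram_similarity(senders[0], other), 2))
```

The two yahoo addresses share most of their trigrams and score much higher than the unrelated one, which is the same intuition behind raising or lowering t.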

Here is an example of using cluster with a list of email addresses, where it correctly groups domains that contain yahoo (even if they end in a different TLD): [screenshot not included]

packet_hunter · ‎02-01-2016

Thanks, I will give it a shot with cluster.

packet_hunter · ‎02-04-2016

Before I accept your answer, I need a bit more advice.

Let's say I have a large number of whitelisted email domains (around 500K in my correct_domain.csv) and I need to check the sender values for variations of them.

Cluster seems to be rather resource-intensive. Is there a way to optimize the comparison, or another way to do this domain variation check?

Thank you

aljohnson_splun · ‎02-04-2016

Make sure you run cluster on the smallest possible table, built from the smallest subset of data that you can manage. The cluster command does not have any memory/resource control options.

If you do not want to use cluster, the next best option would be to use a custom Python command. There is one already written and explained here. Best of luck.
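
As a rough idea of what the core of such a Python check could look like (this is not the command linked above, and it leaves out the wiring needed to run it as a Splunk custom search command), here is a minimal sketch. Everything in it is an assumption for illustration: the function names, the "domain" column in correct_domain.csv, the 0.8 threshold, and the choice to score by trigram overlap on the domain with its TLD removed. The trigram index is there so each sender is only compared against whitelist entries that share at least one trigram, rather than against all 500K.

```python
import csv
from collections import Counter, defaultdict


def name_trigrams(domain):
    """Trigrams of the domain with its final label (the TLD) removed."""
    name = domain.rsplit(".", 1)[0]
    return {name[i:i + 3] for i in range(len(name) - 2)}


def load_domains(path, column="domain"):
    """Read known-good domains from a CSV; the column name is an assumption."""
    with open(path, newline="") as handle:
        return {row[column].strip().lower() for row in csv.DictReader(handle)}


def build_index(good_domains):
    """Map each trigram to the known-good domains that contain it."""
    index = defaultdict(set)
    for domain in good_domains:
        for gram in name_trigrams(domain):
            index[gram].add(domain)
    return index


def find_lookalikes(senders, good_domains, threshold=0.8):
    """Yield (sender, closest_good_domain, score) for near-miss domains."""
    index = build_index(good_domains)
    for sender in senders:
        domain = sender.rsplit("@", 1)[-1].strip().lower()
        if domain in good_domains:
            continue  # exact match: nothing to report
        grams = name_trigrams(domain)
        # Count shared trigrams, touching only whitelist entries that overlap.
        shared = Counter()
        for gram in grams:
            for good in index.get(gram, ()):
                shared[good] += 1
        best_score, best_domain = 0.0, None
        for good, count in shared.items():
            # Overlap coefficient: 1.0 when one name's trigrams contain the other's.
            score = count / min(len(grams), len(name_trigrams(good)))
            if score > best_score:
                best_score, best_domain = score, good
        if best_domain is not None and best_score >= threshold:
            yield sender, best_domain, round(best_score, 2)


# Hypothetical data; in practice the set would come from
# load_domains("correct_domain.csv").
good = {"company.com", "example.net"}
senders = ["a@company.com", "b@company.org", "c@company-corp.net", "e@unrelated.io"]
for hit in find_lookalikes(senders, good):
    print(hit)
```

Names shorter than three characters produce no trigrams and are skipped by this sketch, and very common trigrams will still pull in many candidates, so the threshold and the indexing would need tuning against real data.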

packet_hunter · ‎02-05-2016

Thank you!!!
