Splunk Search

How to find all sender email addresses whose domains are similar to, but not exact matches of, any domain in a list?

packet_hunter
Contributor

Scenario: I want to find all sender email addresses whose domains are not exact matches to a domain on a list, but are "similar" to one (or contain any part of a domain on the list).

For example, the correct sender address could be sender@company.com, while incorrect sender addresses could be sender@company.org, sender@company-corp.net, sender@companycorporation.us, etc.
Sample code:

index=mail sourcetype=xemail
    [search index=mail sourcetype=xemail subject="Blah" | stats count by UID | fields UID]
| stats list(subject) as subj list(sender) as sender list(recipient) as recp by UID

Please provide an example using correct_domain.csv as the good domain list.

Thank you

1 Solution

aljohnson_splun
Splunk Employee

Have you considered using the cluster command? With match=ngramset it compares subcomponents of the domain, and you can tweak the threshold t to control the grouping: raise it to require closer matches, lower it to allow looser ones. Ideally you would do true fuzzy matching, which I don't know how to do natively in Splunk. The nice thing about cluster with match=ngramset is that it compares sets of 3-character substrings (trigrams).

index=mail sourcetype=xemail
    [search index=mail sourcetype=xemail subject="Blah" | stats count by UID | fields UID]
| stats count by sender subject recipient UID
| cluster field=sender match=ngramset labelonly=t t=0.8
| stats values(sender) by cluster_label

I tried it on a list of email addresses, and cluster correctly grouped the domains that contain yahoo, even when they end in a different TLD.

[screenshot of the clustered results]
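To make the 3-character-substring idea concrete, here is a minimal Python sketch of comparing two domains by their trigram sets. It is an illustration only: the Jaccard score used here is an assumption, and the exact scoring Splunk applies for match=ngramset may differ. The function names are hypothetical.

```python
def trigrams(text):
    """Return the set of 3-character substrings of text, lowercased."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of two trigram sets: shared trigrams / all trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0  # strings shorter than 3 characters have no trigrams
    return len(ta & tb) / len(ta | tb)


# Compare the known-good domain with a few candidates.
for candidate in ("company.org", "company-corp.net", "example.com"):
    print(candidate, round(trigram_similarity("company.com", candidate), 2))
```

A lookalike such as company.org shares most of its trigrams with company.com, so it scores well above an unrelated domain. A threshold like t plays the role of the cutoff on that score.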

packet_hunter
Contributor

Thanks, I will give cluster a shot.

packet_hunter
Contributor

Before I accept your answer, I need a bit more advice.

Let's say I have a large number of whitelisted email domains (around 500K in my correct_domain.csv), and I need to check the sender values for variations of them.

Cluster seems to be rather resource-intensive. Is there a way to optimize the comparison, or another way to do this domain variation check?

Thank you

aljohnson_splun
Splunk Employee

Make sure you run cluster on the smallest possible table, built from the smallest subset of data that you can manage. The cluster command does not have any memory/resource control options.

If you do not want to use cluster, the next best option would be to use a custom Python command. There is one already written and explained here. Best of luck.
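As a rough idea of what the matching logic inside such a command could look like, here is a standalone Python sketch. It is not the command linked above and is not wired into Splunk's custom command interface. It assumes correct_domain.csv has a header column named domain (an assumption), strips the final label as a naive TLD (suffixes like co.uk are not handled), and uses difflib from the standard library for the fuzzy score. All function names are hypothetical.

```python
import csv
import difflib
from collections import defaultdict


def load_domains(path, column="domain"):
    """Read the allow-list CSV into a set; the column name is an assumption."""
    with open(path, newline="") as handle:
        return {row[column].strip().lower() for row in csv.DictReader(handle)}


def stem(domain):
    """Drop the final label (naive TLD strip)."""
    return domain.rsplit(".", 1)[0]


def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(domains):
    """Map each trigram of a domain's stem to the domains containing it."""
    index = defaultdict(set)
    for domain in domains:
        for gram in trigrams(stem(domain)):
            index[gram].add(domain)
    return index


def find_lookalikes(senders, domains, index, threshold=0.8):
    """Yield (sender, closest_domain, score) for near-miss sender domains."""
    for sender in senders:
        domain = sender.rsplit("@", 1)[-1].strip().lower()
        if domain in domains:  # exact match is legitimate, skip it
            continue
        name = stem(domain)
        # Score only allow-listed domains that share at least one trigram.
        candidates = set()
        for gram in trigrams(name):
            candidates |= index.get(gram, set())
        best, best_score = None, 0.0
        for candidate in candidates:
            score = difflib.SequenceMatcher(None, name, stem(candidate)).ratio()
            if score > best_score:
                best, best_score = candidate, score
        if best is not None and best_score >= threshold:
            yield sender, best, best_score


# Small in-memory example; with a real file use load_domains("correct_domain.csv").
good = {"company.com", "yahoo.com"}
index = build_index(good)
senders = ["sender@company.com", "sender@company.org",
           "sender@company-corp.net", "sender@unrelated.net"]
for sender, match, score in find_lookalikes(senders, good, index, threshold=0.7):
    print(sender, match, round(score, 2))
```

The exact-match set lookup removes legitimate senders cheaply, and the trigram index means each remaining sender is scored only against domains that share a substring with it instead of all 500K. Feeding it a deduplicated list of senders keeps the work small, in the same spirit as clustering on the smallest possible table. Stems shorter than three characters produce no trigrams, so they are never matched by this sketch.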

packet_hunter
Contributor

Thank you!!!
