Splunk Search

How to write the regex to extract the domains from URLs?

ccsfdave
Builder

I have been through the field extractor, answers.splunk.com, and the interwebs looking for help on this one. So our Palo Alto will give us the URLs of sites visited - here is a sample:

crl.microsoft.com/pki/crl/products/MicRooCerAut2011_2011_03_22.crl
safebrowsing-cache.google.com/
p4-a2lp5grl52xoy-qpo2s4ky6vs36rpb-794312-s1-v6exp3-v4.metric.gstatic.com/
de.tynt.com/deb/v2?id=dZxfWCGner46jsacwqm_6l&r=lyricstranslate.com/en/l039amour-c039est-pour-rien-love-nothing.html
a248.e.akamai.net/

I would like to be able to extract the domains e.g.

microsoft or microsoft.com
google or google.com
gstatic or gstatic.com
tynt or tynt.com
akamai or akamai.net

I would think the way to go about it is to look for the FIRST .com, .net, .org, etc., and then work back to the previous "." to grab the domain, but that is beyond me.

Can anyone help?

1 Solution

somesoni2
Revered Legend

Try this run-anywhere sample:

| gentimes start=-1 | eval URL="crl.microsoft.com/pki/crl/products/MicRooCerAut2011_2011_03_22.crl safebrowsing-cache.google.com/ p4-a2lp5grl52xoy-qpo2s4ky6vs36rpb-794312-s1-v6exp3-v4.metric.gstatic.com/ de.tynt.com/deb/v2?id=dZxfWCGner46jsacwqm_6l&r=lyricstranslate.com/en/l039amour-c039est-pour-rien-love-nothing.html a248.e.akamai.net/" | table URL | makemv URL | mvexpand URL | rex field=URL "(?<domain>\w+\.\w+)\/"
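
For anyone who wants to see why that pattern works, here is a rough Python sketch of the same regex run against the five sample URLs from the question. It is only an illustration and not part of the original answer: the names DOMAIN_RE, SAMPLE_URLS, and extract_domain are made up for the example, and Python spells the named group (?P<domain>...) where rex accepts (?<domain>...). The pattern asks for "word characters, a dot, word characters, then a slash", and the leftmost place where that fits is the last two host labels before the first "/".

```python
import re

# Same pattern as the rex above; only the named-group spelling differs.
DOMAIN_RE = re.compile(r"(?P<domain>\w+\.\w+)/")

# The sample URLs from the question.
SAMPLE_URLS = [
    "crl.microsoft.com/pki/crl/products/MicRooCerAut2011_2011_03_22.crl",
    "safebrowsing-cache.google.com/",
    "p4-a2lp5grl52xoy-qpo2s4ky6vs36rpb-794312-s1-v6exp3-v4.metric.gstatic.com/",
    "de.tynt.com/deb/v2?id=dZxfWCGner46jsacwqm_6l&r=lyricstranslate.com/en/l039amour-c039est-pour-rien-love-nothing.html",
    "a248.e.akamai.net/",
]


def extract_domain(url):
    """Return the first 'label.label' that sits directly before a slash, or None."""
    # search() returns only the leftmost match, much like rex with its
    # default max_match=1, so the later "lyricstranslate.com/" is ignored.
    match = DOMAIN_RE.search(url)
    return match.group("domain") if match else None


domains = [extract_domain(url) for url in SAMPLE_URLS]
for url, domain in zip(SAMPLE_URLS, domains):
    print(domain, "<-", url)
```

Two limits worth knowing: a URL with no "/" after the host yields no match at all, and because \w does not match a hyphen, a hyphenated name such as my-site.com/ would come back as site.com.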

ccsfdave
Builder

@somesoni2

You have it, but help me understand it so that I may apply it to my search. As @Rhin0Crash stated the Palo Altos see the field as "url" so my base search is: index=pan_logs sourcetype=pan* src_ip=x.x.x.x url=*

Rhin0Crash
Path Finder

@ccsfdave :

index=pan_logs sourcetype=pan* src_ip=x.x.x.x url=* | rex field=url "(?<domain>\w+\.\w+)\/" | table domain _raw

ccsfdave
Builder

Yup you got it!

| rex field=url "(?<domain>\w+\.\w+)\/"

Rhin0Crash
Path Finder
search | rex field=_raw "(?<domain>\w+)\.(com|net|gov|edu|co)"

I think that should work.

You can replace _raw with whichever field the PA gives you for the URL. That might be URL, misc, or uri.
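
As a hedged illustration of how this TLD-list pattern behaves, the Python sketch below runs it next to a slightly stricter variant. The stricter variant is an assumption of this example, not something proposed in the thread: it adds a lookahead so the TLD must be followed by "/", ":" or the end of the string. The names LOOSE_RE, STRICT_RE, and first_domain are hypothetical, and www.comcast.net/ is an invented test URL.

```python
import re

# The pattern from the post above, with Python's (?P<...>) group spelling.
LOOSE_RE = re.compile(r"(?P<domain>\w+)\.(com|net|gov|edu|co)")

# Assumed variant (not from the thread): the lookahead requires the TLD to
# end the host name, so ".com" inside ".comcast" no longer counts as a TLD.
STRICT_RE = re.compile(r"(?P<domain>\w+)\.(com|net|gov|edu|co)(?=[/:]|$)")


def first_domain(pattern, url):
    """Return the 'domain' group of the leftmost match, or None."""
    match = pattern.search(url)
    return match.group("domain") if match else None


samples = [
    "crl.microsoft.com/pki/crl/products/MicRooCerAut2011_2011_03_22.crl",
    "a248.e.akamai.net/",
    "www.comcast.net/",  # invented URL whose host label starts with "com"
]

loose = [first_domain(LOOSE_RE, url) for url in samples]
strict = [first_domain(STRICT_RE, url) for url in samples]
print(loose)
print(strict)
```

On the thread's own samples both patterns agree. They differ only on hosts like the invented one, where the loose pattern stops at the ".com" inside ".comcast" and captures "www". The same lookahead can be appended to the rex pattern, since rex uses PCRE-style regular expressions.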