Splunk Search

What's the best search method to remove web crawlers or bots from download logs?

mistydennis
Communicator

A few years ago, I was given a search string to filter web crawlers/bots from showing up in our download reports. I'm curious what other people use to make sure bots are not counted in their downloads... are there better methods?

This is the string I inherited:

eval agentType=if(match(http_user_agent,"(?i).*(bot|crawler|spider).*"),"Bot",if(match(http_user_agent,"^.*Mozilla/.*"),"Browser","Unknown")) | search agentType!="Bot" | search agentType!="Unknown"

Does anyone know of a more exact or better method to filter out crawlers?

0 Karma
1 Solution

bmacias84
Champion

I don't use eval statements to figure out if something is a bot. I have a collection of 74 transforms applied against the useragent field. The regex patterns are highly tuned to match in the fewest possible steps. The reason I have so many is that our SEO team uses this data; however, this does not account for any bot that does a good job of impersonating a browser. We also do a CIDR match against the cip (client IP) field and assume any address coming from the AWS, Google Cloud, Digital Ocean, and Azure address blocks is a bot.

Here is a link to a gist I created - https://gist.github.com/httpstergeek/5fd08b9bc750e2d1954de78b063a092a
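If you can't open the gist, here is a minimal sketch of the general approach rather than the exact transforms from the gist. The stanza name, sourcetype, field names, and regex below are placeholders; adjust them to your own sourcetype and useragent field, and this assumes http_user_agent is already extracted at search time for that sourcetype.

# transforms.conf -- illustrative only, not the actual gist contents
[ua_bot_generic]
SOURCE_KEY = http_user_agent
REGEX = (?i)(bot|crawler|spider|slurp|curl|wget|python-requests)
FORMAT = agentType::Bot

# props.conf -- wire the transform to your web sourcetype
[access_combined]
REPORT-agenttype = ua_bot_generic

Then filter in search with something like ... | search NOT agentType="Bot" (using NOT rather than != so events that never got the field are kept). For the cloud-address idea, cidrmatch() in an eval works at search time; the ranges below are documentation placeholders, not real provider blocks:

... | eval src_is_cloud=if(cidrmatch("203.0.113.0/24", cip) OR cidrmatch("198.51.100.0/24", cip), "yes", "no") | search src_is_cloud="no"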

Hope this helps, and if it does, don't forget to accept and vote up. Cheers.

rwesolowski
New Member

Hi!

Could you explain how to correctly implement this configuration in Splunk? I've copied transforms.conf, but nothing has changed.
I also want to exclude all bots from my analysis.

Thanks in advance!

0 Karma

mistydennis
Communicator

Wow, so that's a totally different method 🙂 Is it safe to say there isn't a definitive way to identify bots with 100% accuracy?

0 Karma

bmacias84
Champion

You can get fairly close, but definitely not 100%. We also use Google Analytics, and our numbers match up fairly closely. Our SEO team uses Splunk for quick analysis and granularity, since I think GA reports hourly.

0 Karma

mistydennis
Communicator

Are you able to share any hints on how you created your set of 74 transforms? I can't find anything anywhere on making sure what I'm using gives accurate results. When I compare Splunk and GA, the numbers vary greatly, and I'm trying to figure out whether it's my eval that's the problem or whether GA is misbehaving somehow.

0 Karma

bmacias84
Champion

I've converted my comment to an answer with a link to my transforms as a gist.

0 Karma

mistydennis
Communicator

You are AMAZING! Thank you so much! 🙂

0 Karma