Looking at the posts about searching on the User-Agent HTTP header, one thing they have in common is that the poster was told to change their log format to Combined Log Format. Unfortunately I cannot do that, but I am still being asked to create dashboard reports showing the most common OS and the most common browser. Here is a sample log entry:
XX.XX.XX.XX - - [30/Jul/2013:15:16:40 -0700] 0 "GET /portal-web/images/denied.png HTTP/1.1" 200 882 "htps://ABC.ABC.com/portal-web/stuff/stuff.action" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.0)"
Ultimately I want separate count columns for browser type and OS type. How do I go about extracting the info I want? I believe I need to use a regex, but I am unsure how to proceed, especially since both the client and browser strings vary in length.
You need to either build a lookup table or use a custom command to parse the user agent string. Looks like this might do the trick:
A pure regex is not going to do it alone. If you are new to this, the interactive field extraction creator can help you get started. It is one of the options in the per-record drop-down.
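To be fair to regex, it can at least isolate the User-Agent string; it is classifying what is inside that needs more. Here is a minimal Python sketch, assuming the User-Agent is always the last double-quoted field on the line, as in the sample log above. The function name is made up for illustration and this is not Splunk configuration.

```python
import re

# Assumption: the User-Agent is the last double-quoted field on the line.
UA_AT_END = re.compile(r'"([^"]*)"\s*$')


def extract_user_agent(line):
    """Return the last quoted field of a log line, or None if there is none."""
    match = UA_AT_END.search(line)
    return match.group(1) if match else None


# The sample log entry from the question, split only for readability.
sample = (
    'XX.XX.XX.XX - - [30/Jul/2013:15:16:40 -0700] 0 '
    '"GET /portal-web/images/denied.png HTTP/1.1" 200 882 '
    '"htps://ABC.ABC.com/portal-web/stuff/stuff.action" '
    '"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.0)"'
)

print(extract_user_agent(sample))
```

That only gets you the raw string. Turning it into "Windows" and "Internet Explorer" is the part a single pattern struggles with.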
The difficulty is that there is no defined order or format for sub fields of the UA. I just tried myself with the following sample list culled from recent access logs for the generator to weave its magic on:
Windows NT 5.1
Linux x86_64
Windows NT 6.0
Android 4.1.2
Windows Phone OS 7.5
Windows NT 6.1
The resulting sample extractions it offered were:
Linux x86_64
Windows NT 5.1
+http://yandex.com/bots)" RU
Windows NT 5.1)" US
http://www.majestic12.co.uk/bot.php?+)" US
rv:17.0) Gecko/20130626 Firefox/17.0 Iceweasel/17.0.7" FR
+http://www.exabot.com/go/robot)" FR
Windows NT 6.2
Mail.RU_Bot/2.0
Windows NT 6.0)" JP
Windows NT 6.1
Windows NT 6.0)" CN
+http://www.google.com/bot.html)" US
Android 4.1.2
+http://www.bing.com/bingbot.htm)" US
+http://www.baidu.com/search/spider.html)" CN
Windows Phone OS 7.5
Even after some manual refinement it continues to miss the mark more than hit it.
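One way to picture the lookup-table idea is an ordered list of substring rules checked one after another, so the position of a token inside the UA stops mattering. The Python sketch below is only an illustration under assumptions: the rule tables and `classify` are made up here, the tables are far from complete, and apart from the first one the sample agents are invented for the example.

```python
from collections import Counter

# Hypothetical lookup tables of (substring, label). Order matters:
# Android agents also mention "Linux", and Chrome agents also mention
# "Safari", so the more specific token has to be checked first.
OS_RULES = [
    ("Windows Phone", "Windows Phone"),
    ("Windows NT", "Windows"),
    ("Android", "Android"),
    ("Linux", "Linux"),
]
BROWSER_RULES = [
    ("MSIE", "Internet Explorer"),
    ("Firefox", "Firefox"),
    ("Chrome", "Chrome"),
    ("Safari", "Safari"),
]


def classify(user_agent, rules):
    """Return the label of the first rule whose substring occurs in the UA."""
    for needle, label in rules:
        if needle in user_agent:
            return label
    return "Other"


agents = [
    # From the sample log in the question.
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.0)",
    # Invented strings, for illustration only.
    "Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130626 Firefox/17.0",
    "Mozilla/5.0 (Linux; U; Android 4.1.2) AppleWebKit/534.30 Safari/534.30",
]

os_counts = Counter(classify(ua, OS_RULES) for ua in agents)
browser_counts = Counter(classify(ua, BROWSER_RULES) for ua in agents)
print(os_counts.most_common())
print(browser_counts.most_common())
```

Bots and anything else the tables do not cover fall into "Other", which at least makes the gaps visible so you can keep adding rules as new agents show up.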