How to edit my regex to extract all expected fields from my sample Blue Coat log?

jwertheim
Explorer

I'm using the following regular expression:

(?<timestamp>:"(\d{1,4}\-\d{1,2}\-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2})"|(\d{1,4}\-\d{1,2}\-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2}))\s+(?<time_taken>:"([^"]+)"|(\S+))\s+(?<c_ip>:"([^"]+)"|(\S+))\s+(?<cs_username>:"([^"]+)"|(\S+))\s+(?<cs_auth_group>:"([^"]+)"|(\S+))\s+(?<x_exception_id>:"([^"]+)"|(\S+))\s+(?<sc_filter_result>:"([^"]+)"|(\S+))\s+(?<cs_categories>:"([^"]+)"|(\S+))\s+(?<cs_referrer>:"([^"]+)"|(\S+))\s+(?<sc_status>:"([^"]+)"|(\S+))\s+(?<s_action>:"([^"]+)"|(\S+))\s+(?<cs_method>:"([^"]+)"|(\S+))\s+(?<rs_content_type>:"([^"]+)"|(\S+))\s+(?<cs_uri_scheme>:"([^"]+)"|(\S+))\s+(?<cs_host>:"([^"]+)"|(\S+))\s+(?<cs_uri_port>:"([^"]+)"|(\S+))\s+(?<cs_uri_path>:"([^"]+)"|(\S+))\s+(?<cs_uri_query>:"([^"]+)"|(\S+))\s+(?<cs_uri_extension>:"([^"]+)"|(\S+))\s+(?<cs_user_agent>:"([^"]+)"|(\S+))\s+(?<s_ip>:"([^"]+)"|(\S+))\s+(?<sc_bytes>:"([^"]+)"|(\S+))\s+(?<cs_bytes>:"([^"]+)"|(\S+))\s+(?<x_virus_id>:"([^"]+)"|(\S+))\s+(?<x_bluecoat_application_name>:"([^"]+)"|(\S+))\s+(?<x_bluecoat_application_operation>:"([^"]+)"|(\S+))\s+(?<cs_auth_type>:"([^"]+)"|(\S+))\s*

On the following example log file:

2016-07-28 23:37:32 240144 1.1.1.1 - - - OBSERVED "Social Networking" -  200 TCP_TUNNELED CONNECT - tcp plus.google.com 443 / - - "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36" 1.1.1.1 1135 2522 - "GooglePlus" "none" - 

There should be 28 fields in that example log file when date and time are separate fields (I combined them into one field).

With my regular expression, I'm finding that the space in the "cs_categories" field is being used to end the regex match, which doesn't make sense to me since when I try it out on a regex simulator it matches just fine. Example: http://regexr.com/3dtdr

It's obvious that the space in the cs_categories field is somehow throwing off the parser. However, I'm not sure why. I'm not a regex master, so I'm leaning more toward it being a Splunk specific difference in regex engine, but I could be entirely wrong.

I would really appreciate any kind of help.

Thanks.

aholzel
Communicator

Blue Coat logs are a pain in the *** to extract, but I think this regex should do the trick:

(?<timestamp>[0-9-:\s]{19})\s+(?<time_taken>[^\s]+)\s+(?<c_ip>[^\s]+)\s+(?<cs_username>[^\s]+)\s+(?<cs_auth_group>[^\s]+)\s+(?<x_exception_id>[^\s]+)\s+(?<sc_filter_result>[^\s]+)(?:\s+\"|\s+)(?<cs_categories>[^\"]+)(?:\"\s+|\s+)(?<cs_referrer>[^\s]+)\s+(?<sc_status>[^\s]+)\s+(?<s_action>[^\s]+)\s+(?<cs_method>[^\s]+)\s+(?<rs_content_type>[^\s]+)\s+(?<cs_uri_scheme>[^\s]+)\s+(?<cs_host>[^\s]+)\s+(?<cs_uri_port>[^\s]+)\s+(?<cs_uri_path>[^\s]+)\s+(?<cs_uri_query>[^\s]+)\s+(?<cs_uri_extension>[^\s]+)(?:\s+\"|\s+)(?<cs_user_agent>[^\"]+)(?:\"\s+|\s+)(?<s_ip>[^\s]+)\s+(?<sc_bytes>[^\s]+)\s+(?<cs_bytes>[^\s]+)(?:\s+\"|\s+)(?<x_virus_id>[^\"\s]+)(?:\"\s+\"|\"\s+|\s+\"|\s+)(?<x_bluecoat_application_name>[^\s\"]+)(?:\"\s+\"|\"\s+|\s+\"|\s+)(?<x_bluecoat_application_operation>[^\"\s]+)(?:\"\s+|\s+)(?<cs_auth_type>[^\s]+)
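
For anyone who wants to check this pattern against the sample event outside Splunk, here is a rough Python sketch. It rests on a few assumptions: Python's `re` module needs `(?P<name>...)` instead of `(?<name>...)`, `\"` is written as a plain `"`, and the two application field names are spelled as in the original question. The pattern is otherwise meant to be the same, only split across lines. Python's engine is not Splunk's PCRE, so treat a match here as a sanity check rather than proof.

```python
import re

# The sample event from the original question (note the double space before 200).
SAMPLE = (
    '2016-07-28 23:37:32 240144 1.1.1.1 - - - OBSERVED "Social Networking" -  '
    '200 TCP_TUNNELED CONNECT - tcp plus.google.com 443 / - - '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/44.0.2403.130 Safari/537.36" 1.1.1.1 1135 2522 - "GooglePlus" "none" - '
)

# One field per line; adjacent raw strings are concatenated by Python.
PATTERN = re.compile(
    r'(?P<timestamp>[0-9-:\s]{19})\s+'
    r'(?P<time_taken>[^\s]+)\s+'
    r'(?P<c_ip>[^\s]+)\s+'
    r'(?P<cs_username>[^\s]+)\s+'
    r'(?P<cs_auth_group>[^\s]+)\s+'
    r'(?P<x_exception_id>[^\s]+)\s+'
    r'(?P<sc_filter_result>[^\s]+)(?:\s+"|\s+)'
    r'(?P<cs_categories>[^"]+)(?:"\s+|\s+)'
    r'(?P<cs_referrer>[^\s]+)\s+'
    r'(?P<sc_status>[^\s]+)\s+'
    r'(?P<s_action>[^\s]+)\s+'
    r'(?P<cs_method>[^\s]+)\s+'
    r'(?P<rs_content_type>[^\s]+)\s+'
    r'(?P<cs_uri_scheme>[^\s]+)\s+'
    r'(?P<cs_host>[^\s]+)\s+'
    r'(?P<cs_uri_port>[^\s]+)\s+'
    r'(?P<cs_uri_path>[^\s]+)\s+'
    r'(?P<cs_uri_query>[^\s]+)\s+'
    r'(?P<cs_uri_extension>[^\s]+)(?:\s+"|\s+)'
    r'(?P<cs_user_agent>[^"]+)(?:"\s+|\s+)'
    r'(?P<s_ip>[^\s]+)\s+'
    r'(?P<sc_bytes>[^\s]+)\s+'
    r'(?P<cs_bytes>[^\s]+)(?:\s+"|\s+)'
    r'(?P<x_virus_id>[^"\s]+)(?:"\s+"|"\s+|\s+"|\s+)'
    r'(?P<x_bluecoat_application_name>[^\s"]+)(?:"\s+"|"\s+|\s+"|\s+)'
    r'(?P<x_bluecoat_application_operation>[^"\s]+)(?:"\s+|\s+)'
    r'(?P<cs_auth_type>[^\s]+)'
)

match = PATTERN.match(SAMPLE)
fields = match.groupdict() if match else {}

# Print a few of the fields that are awkward to extract.
for name in ("cs_categories", "x_bluecoat_application_name", "cs_auth_type"):
    print(name, "=", fields.get(name))
# cs_categories = Social Networking
# x_bluecoat_application_name = GooglePlus
# cs_auth_type = -
```

This only covers the one sample event. Events where cs_categories is an unquoted `-` would need their own check, because `[^"]+` can run across several fields when no quotes are present.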

gabriel_vasseur
Contributor

I was confused by this phrase: "With my regular expression, I'm finding that the space in the "cs_categories" field is being used to end the regex match". After some experimenting, I understood you to mean that if the category in your data is "Social Networking", then the extracted cs_categories is just "Social (cut off at the space). That's not what I would expect to happen, but I was able to reproduce it, so I'm guessing that's what you meant.

So in this regex:

(?<cs_categories>:"([^"]+)"|(\S+))

the "([^"]+)" is supposed to match (because it's first and the quotes are there), but the (\S+) alternative is also a potential match and seems to be preferred by the regex engine in this instance. I believe this alternative is there to match cases where there is no category and the data just has a single -. There might be other cases too, but the point is that they won't have double quotes.
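
One explanation worth checking for why the second alternative wins: in the regex as posted, the `:` directly after `(?<cs_categories>` is a literal colon. It resembles the `?:` of a non-capturing group, but after a group name it has no special meaning, so the first alternative can only match text that starts with `:"`, which never occurs in the sample. A minimal Python sketch of that behaviour follows. It assumes Python's `re` treats this construct the same way as Splunk's PCRE, and it uses Python's `(?P<name>...)` spelling for named groups.

```python
import re

# Fragment of the sample event around the cs_categories field.
fragment = 'OBSERVED "Social Networking" -'

# The group as posted, translated to Python's (?P<name>...) syntax.
# Note the literal ':' right after the group name.
as_posted = re.compile(r'OBSERVED\s+(?P<cs_categories>:"([^"]+)"|(\S+))')

# Same group with the stray ':' removed, so the quoted alternative can match.
without_colon = re.compile(r'OBSERVED\s+(?P<cs_categories>"([^"]+)"|(\S+))')

posted_value = as_posted.search(fragment).group("cs_categories")
fixed_value = without_colon.search(fragment).group("cs_categories")

print(posted_value)  # "Social
print(fixed_value)   # "Social Networking"
```

In the second pattern the named group still includes the surrounding quotes; moving the quotes outside the capture is a separate choice.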

So I would suggest you replace the \S+ with [^"]\S* to prevent that alternative from being used when quotes are present. I think that should work. The idea is that [^"] means the first character cannot be a " and of course we replace the + (which means 1 or more) with the * (which means zero or more) so that we still match instances where the match is a single character long.
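
To try this suggestion outside Splunk, here is a small Python sketch. It assumes the quoted alternative is written as `"[^"]+"` with no `:` in front of it (with the colon in place, the quoted alternative can never match the sample). The pattern is shortened to three fields for illustration, and `extract_category` is a hypothetical helper name.

```python
import re

# Shortened, illustrative pattern: sc_filter_result, cs_categories, cs_referrer.
# The bare alternative is [^"]\S* so it cannot start on a double quote.
PATTERN = re.compile(
    r'(?P<sc_filter_result>\S+)\s+'
    r'(?P<cs_categories>"[^"]+"|[^"]\S*)\s+'
    r'(?P<cs_referrer>\S+)'
)

def extract_category(text):
    """Return the cs_categories value, or None if the pattern does not match."""
    m = PATTERN.match(text)
    return m.group("cs_categories") if m else None

quoted = extract_category('OBSERVED "Social Networking" -')
bare = extract_category('OBSERVED - -')

print(quoted)  # "Social Networking"
print(bare)    # -
```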

Hope it helps.

sbrant_splunk
Splunk Employee

Are you using the Splunk Blue Coat TA or do you have custom log formats you're dealing with? If you're not using the TA, this should help:

https://splunkbase.splunk.com/app/2758/

jwertheim
Explorer

Not exactly.

I'm testing my own add-on using data generated from SA-Eventgen, and that data happens to be based off of Blue Coat logs. Those logs are pretty custom I think (the one in the original post is a decent example).

I can try the Add-on for Blue Coat ProxySG though and see what happens.

JDukeSplunk
Builder

I doubt it will work for everything, since cs_user_agent changes every time. But this one works on your sample event, and it works for cs_categories. I just used the built-in field extractor to get that one field, and then inserted the result into your regex in the (?<cs_categories> group.

If possible, I'd set Blue Coat to insert delimiters like | into your logs, and just use a delimited extraction. cs_user_agent is the bane of web logs.

(?<timestamp>:"(\d{1,4}\-\d{1,2}\-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2})"|(\d{1,4}\-\d{1,2}\-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2}))\s+(?<time_taken>:"([^"]+)"|(\S+))\s+(?<c_ip>:"([^"]+)"|(\S+))\s+(?<cs_username>:"([^"]+)"|(\S+))\s+(?<cs_auth_group>:"([^"]+)"|(\S+))\s+(?<x_exception_id>:"([^"]+)"|(\S+))\s+(?<sc_filter_result>:"([^"]+)"|(\S+))\s+(?<cs_categories>"\w+\s+\w+"|(\S+))\s+(?<cs_referrer>:"([^"]+)"|(\S+))\s+(?<sc_status>:"([^"]+)"|(\S+))\s+(?<s_action>:"([^"]+)"|(\S+))\s+(?<cs_method>:"([^"]+)"|(\S+))\s+(?<rs_content_type>:"([^"]+)"|(\S+))\s+(?<cs_uri_scheme>:"([^"]+)"|(\S+))\s+(?<cs_host>:"([^"]+)"|(\S+))\s+(?<cs_uri_port>:"([^"]+)"|(\S+))\s+(?<cs_uri_path>:"([^"]+)"|(\S+))\s+(?<cs_uri_query>:"([^"]+)"|(\S+))\s+(?<cs_uri_extension>:"([^"]+)"|(\S+))\s+(?<cs_user_agent>("\w+/\d+\.\d+\s+\(\w+\s+\w+\s+\d+\.\d+;\s+\w+\)\s+\w+/\d+\.\d+\s+\(\w+,\s+\w+\s+\w+\)\s+\w+/\d+\.\d+\.\d+\.\d+\s+\w+/\d+\.\d+")|(\S+))\s+(?<sc_bytes>:"([^"]+)"|(\S+))\s+(?<cs_bytes>:"([^"]+)"|(\S+))\s+(?<x_virus_id>:"([^"]+)"|(\S+))\s+(?<x_bluecoat_application_name>:"([^"]+)"|(\S+))\s+(?<x_bluecoat_application_operation>:"([^"]+)"|(\S+))\s+(?<cs_auth_type>:"([^"]+)"|(\S+))\s*

jwertheim
Explorer

That's really odd..

I just tried that expression you have and it's somewhat working, but it turns out there are many variations of how the values for that field can appear, so I still get off-by-one type issues where the wrong field's value is recorded as a cs_categories value.

Not sure if there's a better way to find them all..

I don't actually have access to the source Blue Coat system so I don't have a way to set delimiters like that, though I wish that I could...
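
Without control over the log format, one alternative to a single long regex is to tokenize the line first, treating either a quoted string or a run of non-whitespace as one token, and then pair the tokens with the field names in order. This is a different technique from the regexes in this thread, and it is sketched in Python rather than as Splunk configuration. The `FIELDS` list follows the field order in the original question, and `parse_event` is a hypothetical helper name.

```python
import re

SAMPLE = (
    '2016-07-28 23:37:32 240144 1.1.1.1 - - - OBSERVED "Social Networking" -  '
    '200 TCP_TUNNELED CONNECT - tcp plus.google.com 443 / - - '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/44.0.2403.130 Safari/537.36" 1.1.1.1 1135 2522 - "GooglePlus" "none" - '
)

# Field order as in the original question (date and time merged into timestamp).
FIELDS = [
    "timestamp", "time_taken", "c_ip", "cs_username", "cs_auth_group",
    "x_exception_id", "sc_filter_result", "cs_categories", "cs_referrer",
    "sc_status", "s_action", "cs_method", "rs_content_type", "cs_uri_scheme",
    "cs_host", "cs_uri_port", "cs_uri_path", "cs_uri_query", "cs_uri_extension",
    "cs_user_agent", "s_ip", "sc_bytes", "cs_bytes", "x_virus_id",
    "x_bluecoat_application_name", "x_bluecoat_application_operation",
    "cs_auth_type",
]

# One token is either a double-quoted string or a run of non-whitespace.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')

def parse_event(line):
    """Return a dict of field values, or None if the token count is unexpected."""
    tokens = [
        m.group(1) if m.group(1) is not None else m.group(2)
        for m in TOKEN.finditer(line)
    ]
    # Date and time arrive as two tokens; merge them into one timestamp.
    tokens[:2] = [" ".join(tokens[:2])]
    if len(tokens) != len(FIELDS):
        # Refuse to guess: a mismatch would otherwise shift values into the wrong fields.
        return None
    return dict(zip(FIELDS, tokens))

event = parse_event(SAMPLE)
print(event["cs_categories"])  # Social Networking
print(event["sc_status"])      # 200
```

Returning `None` on a token-count mismatch makes the off-by-one cases visible instead of silently recording the wrong field's value.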

JDukeSplunk
Builder

My hand-written regex is rusty, but if the values are always enclosed in quotes, rebuild that part of the line to capture anything between the quotes.

|(\S+))\s+(?<cs_categories>"([^"]+)")

jwertheim
Explorer

Yeah, that's what I've tried and then it all just breaks down.

JDukeSplunk
Builder

The closest thing to this that I've dealt with is IIS logs. I followed this guide:

http://blogs.splunk.com/2013/10/18/iis-logs-and-splunk-6/

IIS logs have a header row in the file that defines the fields, and the data is whitespace-delimited. The guide comes with examples of props.conf that worked for me. The app I put in place works about 95% of the time, which is good enough for what we need. Where it breaks is cs_user_agent.
