Splunk Search

Field extraction is off

banderson7
Communicator

I'm forwarding logs via syslog UDP to a box and ingesting them locally through Splunk. I don't think that contributes to my issue, but I wanted to get that out there.
Running any of the saved searches in the Bluecoat app shows the external host IP (the one the client is accessing) as the client IP, which doesn't work well. The username field has the same issue.
Here's a sample of one of my bluecoat log entries:

2015-11-05 16:05:14 763 10.80.64.129 cajones - us-ads.openx.net 173.241.244.221 None - - PROXIED "Web Ads/Analytics" http://lagrangenews.com/news/5895/students-strive-to-serve  200 TCP_NC_MISS GET application/json http us-ads.openx.net 80 /w/1.0/acj ?o=5040266117&callback=OX_5040266117&ju=http%3A//lagrangenews.com/news/5895/students-strive-to-serve&jr=http%3A//lagrangenews.com/&auid=538038002&dims=1419x731&adxy=0%2C0&res=1440x900x32&plg=pm&ch=utf-8&tz=300&ws=1419x731&ifr=0&tws=1419x731&vmt=1&bi=66daec33-f482-4821-b52e-7f07e884dfe3&sd=29 - "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E; Media Center PC 6.0; BRI/2)" 10.75.95.91 1906 2550 - "none" "none" none

In the entry above, the cs_username field is extracted as "unknown", but in the following entry cs_username is correct:

2015-11-05 16:01:53 54 10.0.19.44 hammlx1 - - OBSERVED "Web Ads/Analytics" http://bcp.crwdcntrl.net/5/c=1226/rand=328960996/pv=y/int=%23OpR%2358075%23DailyMail%20%3A%20Time%20of%20Day%20%3A%2010AM%C2%A0/int=%23OpR%2358689%23Dailymail%20%3A%20Weather-current-description%20%3A%20Cloudy%20with%20outbreaks%20of%20Rain/int=%23OpR%2358690%23Dailymail%20%3A%20Weather-current-temperature%20%3A%2043%C2%B0F/int=%23OpR%2358691%23Dailymail%20%3A%20Weather-upcoming-description%20%3A%20Scattered%20Showers/int=%23OpR%2358692%23Dailymail%20%3A%20Weather-upcoming-temperature%20%3A%2046%C2%B0F/med=%23OpR%2350629%23DailyMail%20%3A%20Home%20Page%20Date%20%3A%20Thursday%2C%20Nov%205th%202015/seg=%23OpR%2350561%23Date%20%3A%20Thursday%2C%20Nov%205th%202015/ug=%23OpR%2350557%23GrapeShot%20%3A%20Channel%20%3A%20gv_weightwatchers/ug=%23OpR%2350558%23GrapeShot%20%3A%20Channel%20%3A%20gv_weightwatchers/ug=%23OpR%2350559%23GrapeShot%20%3A%20US%20Channel%20%3A%20us_negative_crime/ug=%23OpR%2350560%23GrapeShot%20%3A%20US%20Channel%20%3A%20us_negative_crime/genp=%23OpR%2330426%23Site%20Section%20%3A%20index/genp=%23OpR%2330427%23Site%20Section%20%3A%20ushome/rt=ifr  204 TCP_NC_MISS GET image/png;charset=UTF-8 http su.addthis.com 80 /red/usync ?pid=11127&puid=ce5754badf674f9ba73d138adc3e8e1a - "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko" 10.0.2.248 451 2394 - "none" "none"

and the regex in use is:

[bcreportermain_v1]
REGEX = (?P<date>[^\s]+)\s+(?P<time>[^\s]+)\s+(?P<time_taken>[^\s]+)\s+(?P<c_ip>[^\s]+)\s+(?P<cs_username>[^\s]+)\s+(?P<cs_auth_group>[^\s]+)\s+(?P<x_exception_id>[^\s]+)\s+(?P<filter_result>[^\s]+)\s+\"(?P<category>[^\"]+)\"\s+(?P<http_referrer>[^\s]+)\s+(?P<sc_status>[^\s]+)\s+(?P<action>[^\s]+)\s+(?P<cs_method>[^\s]+)\s+(?P<http_content_type>[^\s]+)\s+(?P<cs_uri_scheme>[^\s]+)\s+(?P<cs_host>[^\s]+)\s+(?P<cs_uri_port>[^\s]+)\s+(?P<cs_uri_path>[^\s]+)\s+(?P<cs_uri_query>[^\s]+)\s+(?P<cs_uri_extension>[^\s]+)\s+\"(?P<http_user_agent>[^\"]+)\"\s+(?P<s_ip>[^\s]+)\s+(?P<sc_bytes>[^\s]+)\s+(?P<cs_bytes>[^\s]+)\s+\"?(?P<x_virus_id>[^\"]+)\"?\s+\"(?P<x_bluecoat_application_name>[^\"]+)\"\s+\"(?P<x_bluecoat_application_operation>[^\"]+)\"

Any help you can provide would be appreciated.


gcato
Contributor

Hi banderson7,

The regex is not anchored at the beginning of the line (^), and because the two entries have a different number of whitespace-separated fields, it starts matching at the wrong field. Note how the cajones entry has the extra "None - - PROXIED" fields (this causes the initial mismatch).

2015-11-05 16:05:14 763 10.80.64.129 cajones - us-ads.openx.net 173.241.244.221 None - - PROXIED "Web Ads/Analytics"
2015-11-05 16:01:53 54 10.0.19.44 hammlx1 - - OBSERVED "Web Ads/Analytics"

I'm not sure which fields are important (check with Bluecoat), but if it's okay to ignore the "None - - PROXIED" fields, then the following regex will work better.

^(?P<date>[^\s]+)\s+(?P<time>[^\s]+)\s+(?P<time_taken>[^\s]+)\s+(?P<c_ip>[^\s]+)\s+(?P<cs_username>[^\s]+)\s+(?P<cs_auth_group>[^\s]+)\s+(?P<x_exception_id>[^\s]+)\s+(?P<filter_result>[^\s]+).*?\"(?P<category>[^\"]+)\"\s+(?P<http_referrer>[^\s]+)\s+(?P<sc_status>[^\s]+)\s+(?P<action>[^\s]+)\s+(?P<cs_method>[^\s]+)\s+(?P<http_content_type>[^\s]+)\s+(?P<cs_uri_scheme>[^\s]+)\s+(?P<cs_host>[^\s]+)\s+(?P<cs_uri_port>[^\s]+)\s+(?P<cs_uri_path>[^\s]+)\s+(?P<cs_uri_query>[^\s]+)\s+(?P<cs_uri_extension>[^\s]+)\s+\"(?P<http_user_agent>[^\"]+)\"\s+(?P<s_ip>[^\s]+)\s+(?P<sc_bytes>[^\s]+)\s+(?P<cs_bytes>[^\s]+)\s+\"?(?P<x_virus_id>[^\"]+)\"?\s+\"(?P<x_bluecoat_application_name>[^\"]+)\"\s+\"(?P<x_bluecoat_application_operation>[^\"]+)\"

There's an initial anchor (^) to match the line start, and the \s+ between filter_result and category has been changed to .*? (a lazy match of anything up to the "). You can play with it here if you want to modify the regex further: https://regex101.com/r/bK5hR5/1

Hope this helps.
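
To see what the anchor changes, here is a minimal Python sketch (not part of the Bluecoat app) that runs only the leading part of the regex, up to the quoted category, against the start of both sample entries. The names HEAD, UNANCHORED, ANCHORED, long_entry and short_entry are made up for illustration, and Python's re module is assumed to behave like Splunk's PCRE for this simple pattern.

```python
import re

# Leading fields of the thread's regex, up to (but not including) the category.
HEAD = (
    r'(?P<date>[^\s]+)\s+(?P<time>[^\s]+)\s+(?P<time_taken>[^\s]+)\s+'
    r'(?P<c_ip>[^\s]+)\s+(?P<cs_username>[^\s]+)\s+(?P<cs_auth_group>[^\s]+)\s+'
    r'(?P<x_exception_id>[^\s]+)\s+(?P<filter_result>[^\s]+)'
)
# Original style: no anchor, exactly one whitespace run before the quote.
UNANCHORED = re.compile(HEAD + r'\s+"(?P<category>[^"]+)"')
# Suggested style: anchored at line start, lazy skip up to the first quote.
ANCHORED = re.compile(r'^' + HEAD + r'.*?"(?P<category>[^"]+)"')

# Start of the two sample entries from the question.
long_entry = ('2015-11-05 16:05:14 763 10.80.64.129 cajones - us-ads.openx.net '
              '173.241.244.221 None - - PROXIED "Web Ads/Analytics"')
short_entry = ('2015-11-05 16:01:53 54 10.0.19.44 hammlx1 - - OBSERVED '
               '"Web Ads/Analytics"')

for entry_name, entry in (("long", long_entry), ("short", short_entry)):
    for label, rx in (("unanchored", UNANCHORED), ("anchored", ANCHORED)):
        m = rx.search(entry)
        print(entry_name, label, m.group("c_ip"), m.group("cs_username"))
```

On the long entry the unanchored version slides right until eight tokens sit directly before the quote, so c_ip becomes 173.241.244.221 and cs_username becomes None. The anchored version gets c_ip and cs_username right on both entries, but on the long entry x_exception_id and filter_result pick up us-ads.openx.net and 173.241.244.221, so those two fields are still off for that format.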


mreynov_splunk
Splunk Employee

The issue here is that the 2 samples do not follow the same format. Things break down at the category field extraction (category is the first field enclosed in quotes, which requires a different capture than the other fields). The extraction expects it to be the 9th field, but it is in the 13th place in the broken message (splitting the fields by spaces).
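
A quick way to check those positions is to split each entry on whitespace and find the first token that opens a quote. This is only a rough Python sketch; first_quoted_position and the two entry variables are illustrative names, and the entries are cut off right after the category.

```python
def first_quoted_position(entry: str) -> int:
    """Return the 1-based position of the first whitespace-separated token that starts with a quote."""
    for position, token in enumerate(entry.split(), start=1):
        if token.startswith('"'):
            return position
    raise ValueError("no quoted field found")


# Start of the two sample entries from the question.
long_entry = ('2015-11-05 16:05:14 763 10.80.64.129 cajones - us-ads.openx.net '
              '173.241.244.221 None - - PROXIED "Web Ads/Analytics"')
short_entry = ('2015-11-05 16:01:53 54 10.0.19.44 hammlx1 - - OBSERVED '
               '"Web Ads/Analytics"')

print(first_quoted_position(long_entry))   # category position in the broken message
print(first_quoted_position(short_entry))  # category position in the working message
```

This reports 13 for the cajones entry and 9 for the hammlx1 entry.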

banderson7
Communicator

Yeah, so I see. After researching, I have 2 bluecoats sending with this field list:

Fields: date time time-taken c-ip cs-username cs-auth-group s-supplier-name s-supplier-ip s-supplier-country s-supplier-failures x-exception-id sc-filter-result cs-categories cs(Referer) sc-status s-action cs-method rs(Content-Type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query cs-uri-extension cs(User-Agent) s-ip sc-bytes cs-bytes x-virus-id x-bluecoat-application-name x-bluecoat-application-operation cs-threat-risk

and a third sending these fields

    date time time-taken c-ip cs-username cs-auth-group x-exception-id sc-filter-result cs-categories cs(Referer) sc-status s-action cs-method rs(Content-Type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query cs-uri-extension cs(User-Agent) s-ip sc-bytes cs-bytes x-virus-id x-bluecoat-application-name x-bluecoat-application-operation
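
To make the difference between the two lists explicit, a small Python sketch can compare them. The variable names are made up for illustration; the field names are copied from the two lists above.

```python
# Field list sent by the two Bluecoats with the longer format.
long_format = (
    "date time time-taken c-ip cs-username cs-auth-group s-supplier-name "
    "s-supplier-ip s-supplier-country s-supplier-failures x-exception-id "
    "sc-filter-result cs-categories cs(Referer) sc-status s-action cs-method "
    "rs(Content-Type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query "
    "cs-uri-extension cs(User-Agent) s-ip sc-bytes cs-bytes x-virus-id "
    "x-bluecoat-application-name x-bluecoat-application-operation cs-threat-risk"
).split()

# Field list sent by the third Bluecoat.
short_format = (
    "date time time-taken c-ip cs-username cs-auth-group x-exception-id "
    "sc-filter-result cs-categories cs(Referer) sc-status s-action cs-method "
    "rs(Content-Type) cs-uri-scheme cs-host cs-uri-port cs-uri-path cs-uri-query "
    "cs-uri-extension cs(User-Agent) s-ip sc-bytes cs-bytes x-virus-id "
    "x-bluecoat-application-name x-bluecoat-application-operation"
).split()

# Fields that appear only in the longer list, in log order.
extra = [field for field in long_format if field not in short_format]
print(extra)

# 1-based position of the category field in each format.
print(long_format.index("cs-categories") + 1, short_format.index("cs-categories") + 1)
```

The only differences are the four s-supplier-* fields inserted after cs-auth-group and the trailing cs-threat-risk, which is why cs-categories moves from position 9 to position 13.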

I need to figure out how to import both of these correctly. Will be playing w/ that site some more me-thinks.
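
One possible way to handle both field lists with a single expression is to wrap the four supplier fields in an optional group. The Python sketch below only covers the leading fields up to the category; the rest of the regex would follow unchanged. The group names s_supplier_name, s_supplier_ip, s_supplier_country and s_supplier_failures are assumptions derived from the field list, not names the Bluecoat app is known to use, and \S+ is used as shorthand for [^\s]+. This has not been tried in transforms.conf.

```python
import re

BOTH_FORMATS = re.compile(
    r'^(?P<date>\S+)\s+(?P<time>\S+)\s+(?P<time_taken>\S+)\s+'
    r'(?P<c_ip>\S+)\s+(?P<cs_username>\S+)\s+(?P<cs_auth_group>\S+)\s+'
    # Optional block: these four fields exist only in the longer field list.
    r'(?:(?P<s_supplier_name>\S+)\s+(?P<s_supplier_ip>\S+)\s+'
    r'(?P<s_supplier_country>\S+)\s+(?P<s_supplier_failures>\S+)\s+)?'
    r'(?P<x_exception_id>\S+)\s+(?P<filter_result>\S+)\s+'
    r'"(?P<category>[^"]+)"'
)

# Start of the two sample entries from the question.
long_entry = ('2015-11-05 16:05:14 763 10.80.64.129 cajones - us-ads.openx.net '
              '173.241.244.221 None - - PROXIED "Web Ads/Analytics"')
short_entry = ('2015-11-05 16:01:53 54 10.0.19.44 hammlx1 - - OBSERVED '
               '"Web Ads/Analytics"')

for entry in (long_entry, short_entry):
    m = BOTH_FORMATS.match(entry)
    print(m.group("cs_username"), m.group("s_supplier_ip"),
          m.group("x_exception_id"), m.group("filter_result"))
```

Because the quote must follow filter_result directly, the regex engine backtracks and drops the optional block when the supplier fields are absent, so filter_result is PROXIED for the first entry and OBSERVED for the second. The trailing cs-threat-risk field is simply left unmatched, since nothing anchors the end of the line.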


mreynov_splunk
Splunk Employee

You need to look in the bluecoat console and reset the third log to bcreportermain_v1.


banderson7
Communicator

I guess I could go the easy way 🙂


banderson7
Communicator

That's great, and thanks so much for the regex site. I do think that'll do the trick.


MuS
SplunkTrust

And here is another tip: you can test / verify it in Splunk by running this command:

/opt/splunk/bin/splunk cmd pcregextest mregex="^(?P<date>[^\s]+)\s+(?P<time>[^\s]+)\s+(?P<time_taken>[^\s]+)\s+(?P<c_ip>[^\s]+)\s+(?P<cs_username>[^\s]+)\s+(?P<cs_auth_group>[^\s]+)\s+(?P<x_exception_id>[^\s]+)\s+(?P<filter_result>[^\s]+).*?\"(?P<category>[^\"]+)\"\s+(?P<http_referrer>[^\s]+)\s+(?P<sc_status>[^\s]+)\s+(?P<action>[^\s]+)\s+(?P<cs_method>[^\s]+)\s+(?P<http_content_type>[^\s]+)\s+(?P<cs_uri_scheme>[^\s]+)\s+(?P<cs_host>[^\s]+)\s+(?P<cs_uri_port>[^\s]+)\s+(?P<cs_uri_path>[^\s]+)\s+(?P<cs_uri_query>[^\s]+)\s+(?P<cs_uri_extension>[^\s]+)\s+\"(?P<http_user_agent>[^\"]+)\"\s+(?P<s_ip>[^\s]+)\s+(?P<sc_bytes>[^\s]+)\s+(?P<cs_bytes>[^\s]+)\s+\"?(?P<x_virus_id>[^\"]+)\"?\s+\"(?P<x_bluecoat_application_name>[^\"]+)\"\s+\"(?P<x_bluecoat_application_operation>[^\"]+)\"" test_str="2015-11-05 16:01:53 54 10.0.19.44 hammlx1 - - OBSERVED \"Web Ads/Analytics\" http://bcp.crwdcntrl.net/5/c=1226/rand=328960996/pv=y/int=%23OpR%2358075%23DailyMail%20%3A%20Time%20...  204 TCP_NC_MISS GET image/png;charset=UTF-8 http su.addthis.com 80 /red/usync ?pid=11127&puid=ce5754badf674f9ba73d138adc3e8e1a - \"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko\" 10.0.2.248 451 2394 - \"none\" \"none\""

and this will be the result:

#### Capturing group data ##### 
Group |            Name | Value
--------------------------------------
    1 |            date | 2015-11-05
    2 |            time | 16:01:53
    3 |      time_taken | 54
    4 |            c_ip | 10.0.19.44
    5 |     cs_username | hammlx1
    6 |   cs_auth_group | -
    7 |  x_exception_id | -
    8 |   filter_result | OBSERVED
    9 |        category | Web Ads/Analytics
   10 |   http_referrer | http://bcp.crwdcntrl.net/5/c=1226/rand=328960996/pv=y/int=%23OpR%2358075%23DailyMail%20%3A%20Time%20...
   11 |       sc_status | 204
   12 |          action | TCP_NC_MISS
   13 |       cs_method | GET
   14 | http_content_type | image/png;charset=UTF-8
   15 |   cs_uri_scheme | http
   16 |         cs_host | su.addthis.com
   17 |     cs_uri_port | 80
   18 |     cs_uri_path | /red/usync
   19 |    cs_uri_query | ?pid=11127&puid=ce5754badf674f9ba73d138adc3e8e1a
   20 | cs_uri_extension | -
   21 | http_user_agent | Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
   22 |            s_ip | 10.0.2.248
   23 |        sc_bytes | 451
   24 |        cs_bytes | 2394
   25 |      x_virus_id | -
   26 | x_bluecoat_application_name | none
   27 | x_bluecoat_application_operation | none

cheers, MuS
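
If no Splunk install is at hand, roughly the same check can be sketched with Python's re module. This assumes Python's regex flavor treats this particular pattern the same way PCRE does; PATTERN and TEST_STR are illustrative names, the regex is the anchored one from this thread split across lines for readability, and the test string is the same shortened one (including the "..." in the URL) passed to pcregextest above.

```python
import re

# Anchored regex from the thread; adjacent raw strings are concatenated unchanged.
PATTERN = re.compile(
    r'^(?P<date>[^\s]+)\s+(?P<time>[^\s]+)\s+(?P<time_taken>[^\s]+)\s+(?P<c_ip>[^\s]+)\s+'
    r'(?P<cs_username>[^\s]+)\s+(?P<cs_auth_group>[^\s]+)\s+(?P<x_exception_id>[^\s]+)\s+'
    r'(?P<filter_result>[^\s]+).*?\"(?P<category>[^\"]+)\"\s+(?P<http_referrer>[^\s]+)\s+'
    r'(?P<sc_status>[^\s]+)\s+(?P<action>[^\s]+)\s+(?P<cs_method>[^\s]+)\s+'
    r'(?P<http_content_type>[^\s]+)\s+(?P<cs_uri_scheme>[^\s]+)\s+(?P<cs_host>[^\s]+)\s+'
    r'(?P<cs_uri_port>[^\s]+)\s+(?P<cs_uri_path>[^\s]+)\s+(?P<cs_uri_query>[^\s]+)\s+'
    r'(?P<cs_uri_extension>[^\s]+)\s+\"(?P<http_user_agent>[^\"]+)\"\s+(?P<s_ip>[^\s]+)\s+'
    r'(?P<sc_bytes>[^\s]+)\s+(?P<cs_bytes>[^\s]+)\s+\"?(?P<x_virus_id>[^\"]+)\"?\s+'
    r'\"(?P<x_bluecoat_application_name>[^\"]+)\"\s+'
    r'\"(?P<x_bluecoat_application_operation>[^\"]+)\"'
)

# Same shortened test string as in the pcregextest command.
TEST_STR = (
    '2015-11-05 16:01:53 54 10.0.19.44 hammlx1 - - OBSERVED "Web Ads/Analytics" '
    'http://bcp.crwdcntrl.net/5/c=1226/rand=328960996/pv=y/int=%23OpR%2358075'
    '%23DailyMail%20%3A%20Time%20...  204 TCP_NC_MISS GET image/png;charset=UTF-8 '
    'http su.addthis.com 80 /red/usync ?pid=11127&puid=ce5754badf674f9ba73d138adc3e8e1a - '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko" '
    '10.0.2.248 451 2394 - "none" "none"'
)

match = PATTERN.match(TEST_STR)
if match is None:
    print("no match")
else:
    # Print groups in group-number order, similar to the pcregextest table.
    for name, index in sorted(PATTERN.groupindex.items(), key=lambda item: item[1]):
        print(f"{index:5d} | {name:>32} | {match.group(name)}")
```

The Splunk command stays the more faithful test, since it uses the same regex engine that applies the extraction at search time.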


gcato
Contributor

Yes, it is a very useful site. Please accept the answer if it has helped.
