Website input app: Python error "LookupError: unknown encoding: 3Dutf-8="

moseisleydk
Path Finder

I get the error:

13/12/2017 20:51:38.141
2017-12-13 20:51:38,141 ERROR An exception occurred when attempting to retrieve information from the web-page, stanza=web_input://www_mos_eisley_dk
Traceback (most recent call last):
  File "/splunk/etc/apps/website_input/bin/web_input.py", line 349, in run
    https_only=self.is_on_cloud(input_config.session_key))
  File "/splunk/etc/apps/website_input/bin/website_input_app/web_scraper.py", line 710, in scrape_page
    additional_fields=additional_fields, **kw)
  File "/splunk/etc/apps/website_input/bin/website_input_app/web_scraper.py", line 446, in get_result_single
    content_decoded = content.decode(encoding=encoding, errors='replace')
LookupError: unknown encoding: 3Dutf-8=

LukeMurphey
Champion

This is a confirmed bug. I was able to reproduce it using the unit test framework, which simulates a web server providing an invalid encoding. See the bug report here: https://lukemurphey.net/issues/2190.

I have updated the app so that it is forgiving when it sees an encoding it doesn't recognize. The fix is currently working and will go out in version 4.5.2 (ETA: early next week).
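
A minimal sketch of this kind of forgiving decode, not the app's actual code: `decode_forgiving` and its `fallback` parameter are hypothetical names. It validates the detected encoding with `codecs.lookup` and falls back to a default instead of letting `LookupError` escape.

```python
import codecs


def decode_forgiving(content, encoding, fallback="utf-8"):
    """Decode bytes, using `fallback` when the encoding name is unknown.

    Hypothetical helper for illustration; the fallback default is an assumption.
    """
    try:
        # codecs.lookup raises LookupError for names such as "3Dutf-8="
        # and TypeError when the name is not a string (e.g. None).
        codecs.lookup(encoding)
    except (LookupError, TypeError):
        encoding = fallback
    # errors="replace" substitutes U+FFFD for undecodable bytes
    # rather than raising UnicodeDecodeError.
    return content.decode(encoding=encoding, errors="replace")
```

With this approach a bad encoding name no longer aborts the input, but the content may still be decoded with the wrong codec, so later parsing can fail.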


LukeMurphey
Champion

@moseisleydk: thanks for the report.

Incidentally, I was unable to reproduce this on http://www.mos-eisley.dk today. Not sure if something changed.

This was still a valid bug report, though, since I was able to reproduce it by recreating the scenario from the stack trace you provided.


moseisleydk
Path Finder

Excellent - looking forward to it. I still get the error on 4.5.1:

2018-01-27 07:29:58,776 ERROR An exception occurred when attempting to retrieve information from the web-page, stanza=web_input://www_mos_eisley_dk
Traceback (most recent call last):
  File "/splunk/etc/apps/website_input/bin/web_input.py", line 349, in run
    https_only=self.is_on_cloud(input_config.session_key))
  File "/splunk/etc/apps/website_input/bin/website_input_app/web_scraper.py", line 710, in scrape_page
    additional_fields=additional_fields, **kw)
  File "/splunk/etc/apps/website_input/bin/website_input_app/web_scraper.py", line 446, in get_result_single
    content_decoded = content.decode(encoding=encoding, errors='replace')
LookupError: unknown encoding: 3Dutf-8=


LukeMurphey
Champion

@moseisleydk: Would you mind testing 4.5.2? You can get the app here: https://github.com/LukeMurphey/splunk-web-input/releases/tag/4.5.2-rc.1

I want to make sure that this fixes the issue since I wasn't able to reproduce the issue on 4.5.1 with your website.


moseisleydk
Path Finder
02/07/2018 21:14:00.529 INFO    The content could not be parsed, it doesn't appear to be valid HTML, url="http://www.mos-eisley.dk/dashboard/\"
02/07/2018 21:14:00.529 INFO    The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/dashboard/\", encoding="cp1252"
02/07/2018 21:12:12.258 INFO    The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/feeds/network.action?username=bnp&max=40&publicFeed=false&os_authType=basic&rssType=atom", encoding="UTF-8"
02/07/2018 21:12:08.922 ERROR   An exception occurred when attempting to retrieve information from the web-page, stanza=web_input://www_mos_eisley_dk
02/07/2018 21:12:08.922 ERROR   A general exception was thrown when executing a web request
02/07/2018 21:12:08.921 ERROR   A general exception was thrown when executing a web request
02/07/2018 21:11:28.858 INFO    The content could not be parsed, it doesn't appear to be valid HTML, url="http://www.mos-eisley.dk/plugins/inlinetasks/\"
02/07/2018 21:11:28.858 INFO    The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/plugins/inlinetasks/\", encoding="cp1252"
02/07/2018 21:11:27.651 INFO    The content could not be parsed, it doesn't appear to be valid HTML, url="http://www.mos-eisley.dk/users/\"
02/07/2018 21:11:27.651 INFO    The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/users/\", encoding="cp1252"
02/07/2018 21:11:25.590 INFO    The content could not be parsed, it doesn't appear to be valid HTML, url="http://www.mos-eisley.dk/spaces/\"
02/07/2018 21:11:25.590 INFO    The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/spaces/\", encoding="cp1252"
02/07/2018 21:11:12.047 WARNING Detected encoding was not recognized and the content will be evaluated (possibly with the wrong encoding), encoding_detected="Shift_JIS"
02/07/2018 21:11:08.724 WARNING Detected encoding was not recognized and the content will be evaluated (possibly with the wrong encoding), encoding_detected="3Dutf-8="
02/07/2018 21:10:59.968 INFO    The content could not be parsed, it doesn't appear to be valid HTML, url="http://www.mos-eisley.dk/\"
02/07/2018 21:10:59.968 INFO    The content is going to be parsed without decoding because the parser refused to parse it with the detected encoding (http://goo.gl/4GRjJF), url="http://www.mos-eisley.dk/\", encoding="cp1252"
02/07/2018 21:10:59.117 INFO    Running web input, url="http://www.mos-eisley.dk"

LukeMurphey
Champion

A little background on what is going on here: the encoding is not being detected properly. I set up the input to deal better with a bad encoding. However, since the input doesn't know the proper encoding, it fails to parse the output.

What is really weird is that I'm not getting the same repro. I have tried several times, but it never quite reproduces the same way.
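
One possible explanation for a value like `3Dutf-8=` (an assumption, not something confirmed in this thread): in quoted-printable text, `=3D` encodes `=` and a trailing `=` marks a soft line break, so a charset pattern run over undecoded quoted-printable markup would capture exactly that string. The sketch below uses a hypothetical `detect_charset` helper and a made-up sample; it is not the app's detection logic.

```python
import quopri
import re

# Naive charset pattern (hypothetical): grab word characters, "=" and "-"
# that follow "charset=".
CHARSET_RE = re.compile(rb"charset=([\w=-]+)", re.IGNORECASE)


def detect_charset(data):
    """Return the charset name found in `data`, or None if there is none."""
    match = CHARSET_RE.search(data)
    return match.group(1).decode("ascii") if match else None


# Made-up quoted-printable sample: "=3D" stands for "=", and the "=" at the
# end of the first line is a soft line break.
raw = b'<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dutf-8=\n">'

print(detect_charset(raw))                       # value taken from undecoded text
print(detect_charset(quopri.decodestring(raw)))  # value after quoted-printable decoding
```

If this is what happens, decoding the quoted-printable layer before detection would yield a usable encoding name; whether it applies to the site in question is untested.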


LukeMurphey
Champion

@moseisleydk: could you provide some more details? I'm sorry for the back-and-forth; I'm just struggling to get a solid repro. I tried today and got only a partial repro.

Here are some questions:

Are results coming through for any of the URLs?
You can try running the following search to check:
source="web_input://www_mos_eisley_dk" | table _time url match*

In my case, I am finding that I get results for everything but "http://www.mos-eisley.dk/dashboard/\\". That URL seems to just do a redirect to "http://www.mos-eisley.dk/dashboard/" which I do get results for.

What platform and version of Splunk is this running on?
I'm wondering if I cannot get an identical repro because I'm not on the same platform.


moseisleydk
Path Finder

Hi,

If needed, I can give you full access, mail me at npn@netic.dk or bnp@mos-eisley.dk

BR,

Normann


LukeMurphey
Champion

Ok, I'll hit you up on email.


LukeMurphey
Champion

Could you share the URL that you are using if it is a publicly available one? I would like to reproduce this myself. It looks like the website is providing an invalid encoding and the Website Input app doesn't handle that yet. I want to update the app to handle it more gracefully.


moseisleydk
Path Finder

It's http://www.mos-eisley.dk - feel free 🙂

Splunk 7.0.1

And feel free to ask for further info!


moseisleydk
Path Finder

BTW, it's Confluence from Atlassian.
