urllib/http fail to sanitize a non-ascii url #64758
Comments
The following code raises a UnicodeEncodeError because the URL contains a non-ASCII character:

    from urllib import request, parse, error
    url = 'http://en.wikipedia.org/wiki/Antonio Vallejo-Nájera'
    req = request.Request(url)
    response = request.urlopen(req)

This fails as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/urllib/request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.3/urllib/request.py", line 469, in open
    response = self._open(req, data)
  File "/usr/lib/python3.3/urllib/request.py", line 487, in _open
    '_open', req)
  File "/usr/lib/python3.3/urllib/request.py", line 447, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.3/urllib/request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/usr/lib/python3.3/http/client.py", line 1067, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.3/http/client.py", line 1095, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.3/http/client.py", line 959, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 27: ordinal not in range(128)

I examined the library code in question. Line 958 of http/client.py, the line just before the one that fails, contains the following comment:

    # Non-ASCII characters should have been eliminated earlier

I added a print statement to the library code:

    print(request)
    self._output(request.encode('ascii'))

This prints the following:

    >>> response = request.urlopen(req)
    GET /wiki/Antonio Vallejo-Nájera HTTP/1.1
    Traceback (most recent call last): ...

I confirmed that the character at position 27 mentioned in the traceback is in fact the á in the last name. Clearly either urllib or http is not properly sanitizing the URL. Unfortunately, I don't know the code well enough to determine where the actual error is; hopefully this report contains enough detail to make it easy to track down.
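For anyone who wants to check the position without touching the network or the library source, here is a minimal sketch that encodes the same request line by hand. The variable names are made up for illustration; only the request-line string comes from the output above.

```python
# The request line printed above, with the á written as an escape
# so the source file stays pure ASCII.
request_line = "GET /wiki/Antonio Vallejo-N\u00e1jera HTTP/1.1"

failing_index = None
failing_char = None
try:
    # http.client does the equivalent of this in putrequest().
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    # exc.start is the zero-based index of the first unencodable character.
    failing_index = exc.start
    failing_char = request_line[exc.start]

print(failing_index, repr(failing_char))
```

Note that the position is a zero-based index into the whole request line, including the leading "GET ", not into the URL path alone.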
Follow-up: I need to use urllib.parse.quote to safely encode a URL. If I may be so bold, though, I submit that since much of the goal of Python 3 was to make Unicode "just work", I, the (stupid) user, shouldn't have to remember to safely encode Unicode URLs. A reasonable way to do it would be to change line 469 of urllib/request.py (which is in OpenerDirector.open()), so that

    response = self._open(req, data)

would become

That's untested of course, but hopefully it'll encourage discussion.
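As a caller-side workaround, and not the library change proposed above, the path can be percent-encoded before the URL is handed to urlopen. The sketch below is one way to do that; `quote_url_path` is a hypothetical helper name, and it assumes only the path component needs quoting.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url_path(url):
    """Percent-encode the path of url, leaving the other components alone.

    Hypothetical helper for illustration; it is not part of urllib.
    """
    parts = urlsplit(url)
    # Keep "/" as a separator and "%" so that escapes already present
    # are not encoded a second time. quote() uses UTF-8 by default.
    safe_path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, safe_path, parts.query, parts.fragment)
    )


url = "http://en.wikipedia.org/wiki/Antonio Vallejo-N\u00e1jera"
encoded = quote_url_path(url)
print(encoded)
```

Treating "%" as safe is a trade-off: it avoids double-encoding a URL that is already quoted, but it also lets a stray literal "%" through unchanged.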
Even if Python 3’s text model is based on Unicode, some data formats have their own rules. There’s a long-running debate about whether URIs should be bytes or text. Unlike web browsers, urllib/httplib don’t try to be smart with the URIs they are given; they just require them to be properly formatted, i.e. containing no spaces and no non-ASCII characters that are not %-encoded. Is the documentation clear about this behaviour? If not, it would probably be simpler to improve the documentation than to change the behaviour.
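To make "properly formatted" concrete, here is a rough check a caller could run before passing a URL on. `needs_quoting` is a hypothetical name, and the rule it applies (reject spaces, control characters and anything outside printable ASCII) is a simplification of the actual URI grammar, not what urllib itself checks.

```python
def needs_quoting(url):
    """Return True if url contains a space, a control character,
    or any character outside printable ASCII.

    Hypothetical helper; a simplified stand-in for full URI validation.
    """
    return any(ord(ch) <= 32 or ord(ch) > 126 for ch in url)


raw = "http://en.wikipedia.org/wiki/Antonio Vallejo-N\u00e1jera"
quoted = "http://en.wikipedia.org/wiki/Antonio%20Vallejo-N%C3%A1jera"

print(needs_quoting(raw))
print(needs_quoting(quoted))
```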
See also bpo-3991 with proposals for handling non-ASCII as new features.
Reproduced on 3.11.
Should this issue be closed? It seems like a duplicate of #48241, which was closed last year with documentation updates.