urllib/http fail to sanitize a non-ascii url #64758

Closed
dubslow mannequin opened this issue Feb 8, 2014 · 6 comments
Labels
3.9 (only security fixes), 3.10 (only security fixes), 3.11 (only security fixes), stdlib (Python modules in the Lib dir), topic-unicode, type-bug (An unexpected behavior, bug, or error)

Comments

@dubslow
Mannequin

dubslow mannequin commented Feb 8, 2014

BPO 20559
Nosy @ezio-melotti, @merwok, @vadmium, @dubslow, @iritkatriel

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2014-02-08.04:34:23.956>
labels = ['type-bug', '3.9', '3.10', '3.11', 'library', 'expert-unicode']
title = 'urllib/http fail to sanitize a non-ascii url'
updated_at = <Date 2021-12-11.00:31:42.105>
user = 'https://github.com/Dubslow'

bugs.python.org fields:

activity = <Date 2021-12-11.00:31:42.105>
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)', 'Unicode']
creation = <Date 2014-02-08.04:34:23.956>
creator = 'Dubslow'
dependencies = []
files = []
hgrepos = []
issue_num = 20559
keywords = []
message_count = 5.0
messages = ['210587', '210590', '211235', '285717', '408270']
nosy_count = 5.0
nosy_names = ['ezio.melotti', 'eric.araujo', 'martin.panter', 'Dubslow', 'iritkatriel']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue20559'
versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

@dubslow
Mannequin Author

dubslow mannequin commented Feb 8, 2014

The following code will produce a UnicodeEncodeError about a character being non-ascii:

    from urllib import request, parse, error
    url = 'http://en.wikipedia.org/wiki/Antonio Vallejo-Nájera'
    req = request.Request(url)
    response = request.urlopen(req)

This fails as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.3/urllib/request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.3/urllib/request.py", line 469, in open
    response = self._open(req, data)
  File "/usr/lib/python3.3/urllib/request.py", line 487, in _open
    '_open', req)
  File "/usr/lib/python3.3/urllib/request.py", line 447, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.3/urllib/request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.3/urllib/request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/usr/lib/python3.3/http/client.py", line 1067, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.3/http/client.py", line 1095, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.3/http/client.py", line 959, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 27: ordinal not in range(128)

I examined the library code in question: line 958 in http/client.py, the line before the one that barfs, contains the following comment:

    # Non-ASCII characters should have been eliminated earlier

I added a print statement to the library code:

    print(request)
    self._output(request.encode('ascii'))

This prints the following:

>>> response = request.urlopen(req)
GET /wiki/Antonio Vallejo-Nájera HTTP/1.1
Traceback (most recent call last): ...

I confirmed that the character at position 27 mentioned in the traceback is in fact the á in the last name. Clearly either urllib or http is not properly sanitizing the URL. Unfortunately, I don't know enough to determine where the actual error is; hopefully this report contains enough detail to make that easy.
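For readers who want to check the position without patching the library or touching the network, here is a minimal sketch. It assumes the request line is exactly the string printed above; the variable names are only for illustration.

```python
# The request line printed by the added print() call in the report.
# '\u00e1' is the precomposed character "á".
request_line = "GET /wiki/Antonio Vallejo-N\u00e1jera HTTP/1.1"

bad_index = None
bad_char = None
try:
    # This mirrors the request.encode('ascii') call shown in the traceback.
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character that cannot be encoded.
    bad_index = exc.start
    bad_char = request_line[bad_index]

print(bad_index, repr(bad_char))
```

The reported position is a zero-based index into the whole request line, including the leading `GET `.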

@dubslow (mannequin) added the stdlib, topic-unicode, and type-bug labels Feb 8, 2014
@dubslow
Mannequin Author

dubslow mannequin commented Feb 8, 2014

Follow up -- I need to use urllib.parse.quote to safely encode a url -- though if I may be so bold, I submit that since much of the goal of Python 3 was to make unicode "just work", I the (stupid) user shouldn't have to remember to safely encode unicode urls...

A reasonable way to do it would be to change urllib/request.py line 469 (in OpenerDirector.open()):

    response = self._open(req, data)

would become

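    try:
        response = self._open(req, data)
    except UnicodeEncodeError:
        # http.client could not ASCII-encode the request line;
        # keep ':' safe so the scheme separator survives quoting
        req.full_url = quote(req.full_url, safe=':/%')
        response = self._open(req, data)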

That's untested of course, but hopefully it'll encourage discussion.
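The quoting step can also be done on the caller's side before the request is built. The following is only a sketch, not what urllib does internally: `quote_url` is a hypothetical helper name, and it assumes the host name is already ASCII (it makes no attempt at IDNA encoding). It keeps `%` safe so that input which is already percent-encoded is not encoded a second time.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url(url):
    """Percent-encode the path and query of url (hypothetical helper).

    Existing %XX escapes are left alone because '%' is treated as safe,
    which also means a stray literal '%' is not escaped.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%/?:+")
    # Scheme, host and fragment are passed through unchanged.
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


quoted = quote_url("http://en.wikipedia.org/wiki/Antonio Vallejo-N\u00e1jera")
print(quoted)
# http://en.wikipedia.org/wiki/Antonio%20Vallejo-N%C3%A1jera
```

The result can then be passed to `request.Request` or `request.urlopen` as usual. Quoting only the path and query, rather than the whole URL, avoids escaping the `:` after the scheme.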

@merwok
Member

merwok commented Feb 14, 2014

Even if Python 3’s text model is based on Unicode, some data formats have their own rules. There’s a long debate about whether URIs should be bytes or text; it looks like, unlike web browsers, urllib/httplib don’t try to be smart with the URIs they are given but just require them to be properly formatted, i.e. not containing any spaces or non-ASCII characters that are not %-encoded.
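As a rough illustration of that requirement, a caller could check a URL before handing it to urllib. This is a sketch under a simplifying assumption: "properly formatted" is approximated as "printable ASCII with no spaces", which is much weaker than full RFC 3986 validation. `is_wire_ready` is a hypothetical name, not a stdlib function.

```python
def is_wire_ready(url):
    """Return True if url contains only printable, non-space ASCII characters.

    Hypothetical helper: an approximation, not a URI validator.
    """
    return all(0x21 <= ord(ch) <= 0x7E for ch in url)


raw = "http://en.wikipedia.org/wiki/Antonio Vallejo-N\u00e1jera"
encoded = "http://en.wikipedia.org/wiki/Antonio%20Vallejo-N%C3%A1jera"

print(is_wire_ready(raw))      # False: contains a space and a non-ASCII character
print(is_wire_ready(encoded))  # True
```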

Is the documentation clear about this behaviour? If not, it would probably be simpler to improve the documentation rather than change the behaviour.

@vadmium
Member

vadmium commented Jan 18, 2017

See also bpo-3991 with proposals for handling non-ASCII as new features.

@iritkatriel
Member

Reproduced on 3.11.

@iritkatriel added the 3.9, 3.10, and 3.11 labels Dec 10, 2021
@ezio-melotti transferred this issue from another repository Apr 10, 2022
@crazymerlyn

Should this issue be closed? It seems like a duplicate of #48241, which was closed last year with documentation updates.
Using urllib.parse.quote on the non-ASCII part of the URL before calling urlopen works fine on my machine (Python 3.12 on Linux).

@merwok closed this as not planned Aug 31, 2024