urllib should fsdecode percent-encoded parts of file URIs on Unix #85168
On Unix, file names are bytes. Python mostly prefers to use unicode for file names. On the Python <-> system boundary, os.fsencode() / os.fsdecode() are used.

In URIs, bytes can be percent-encoded. On Unix, most applications pass the percent-decoded bytes in file URIs to the file system unchanged. The remainder of this issue description is about Unix, except for the last paragraph.

Pathlib fsencodes the path when making a file URI, round-tripping the bytes (e.g. those passed as an argument). curl, given such a URL, passes the percent-decoded bytes through unchanged, and Python 2’s urllib works the same way. However, Python 3’s urllib fails:
% python3 -c 'from urllib.request import urlopen; print(repr(urlopen("file:///tmp/a%E4").read()))'
Traceback (most recent call last):
  File "/usr/lib/python3.8/urllib/request.py", line 1507, in open_local_file
    stats = os.stat(localfile)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/a�'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1485, in file_open
    return self.open_local_file(req)
  File "/usr/lib/python3.8/urllib/request.py", line 1524, in open_local_file
    raise URLError(exp)
urllib.error.URLError: <urlopen error [Errno 2] No such file or directory: '/tmp/a�'>

urllib.request.url2pathname() is the function converting the path of the file URI to a file name. On Unix, it uses urllib.parse.unquote() with the default settings (UTF-8 encoding and the "replace" error handler).

I think that on Unix, the settings from os.fsdecode() should be used, so that it roundtrips with pathlib.Path.as_uri() and so that the percent-decoded bytes are passed to the file system as-is.

On Windows, I couldn’t do experiments, but using UTF-8 seems like the right thing (according to https://en.wikipedia.org/wiki/File_URI_scheme#Windows_2). I’m not sure that the "replace" error handler is a good idea. I prefer "errors should never pass silently" from the Zen of Python, but I don’t have a strong opinion on this.
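The difference between the two decoding strategies can be sketched in a few lines. This is only an illustration, assuming a Unix system where os.fsdecode() uses the "surrogateescape" error handler; the variable names are made up for the example, and the file /tmp/a%E4 does not need to exist.

```python
import os
from pathlib import Path
from urllib.parse import unquote, unquote_to_bytes

url_path = "/tmp/a%E4"  # path component of file:///tmp/a%E4

# Default unquote(): UTF-8 with errors="replace". The lone byte 0xE4 is not
# valid UTF-8, so it is replaced by U+FFFD and the original byte is lost.
lossy = unquote(url_path)

# Alternative: percent-decode to raw bytes first, then apply the filesystem
# encoding and error handler through os.fsdecode().
raw = unquote_to_bytes(url_path)
name = os.fsdecode(raw)

print(repr(lossy))               # contains U+FFFD instead of the byte 0xE4
print(repr(raw))                 # b'/tmp/a\xe4'
print(os.fsencode(name) == raw)  # the bytes survive the round trip (Unix)

# pathlib fsencodes the path before quoting, so it produces the same URI
# that the name was decoded from (assumption: run on a POSIX system).
print(Path(name).as_uri())
```

The lossy result is what reaches os.stat() in the traceback above, which is why the file is not found even when a file with the byte 0xE4 in its name exists.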
…` URIs Adjust `urllib.request.url2pathname()` and `pathname2url()` to use the filesystem encoding when quoting and unquoting file URIs, rather than forcing use of UTF-8. No changes are needed in the `nturl2path` module because Windows always uses UTF-8, per PEP 529.
…` URIs (pythonGH-126852) Adjust `urllib.request.url2pathname()` and `pathname2url()` to use the filesystem encoding when quoting and unquoting file URIs, rather than forcing use of UTF-8. No changes are needed in the `nturl2path` module because Windows always uses UTF-8, per PEP 529. (cherry picked from commit c9b399f) Co-authored-by: Barney Gale <[email protected]>
…e` URIs (GH-126852) (#127040) GH-85168: Use filesystem encoding when converting to/from `file` URIs (GH-126852) Adjust `urllib.request.url2pathname()` and `pathname2url()` to use the filesystem encoding when quoting and unquoting file URIs, rather than forcing use of UTF-8. No changes are needed in the `nturl2path` module because Windows always uses UTF-8, per PEP 529. (cherry picked from commit c9b399f) Co-authored-by: Barney Gale <[email protected]>
…e` URIs (GH-126852) (#127039) GH-85168: Use filesystem encoding when converting to/from `file` URIs (GH-126852) Adjust `urllib.request.url2pathname()` and `pathname2url()` to use the filesystem encoding when quoting and unquoting file URIs, rather than forcing use of UTF-8. No changes are needed in the `nturl2path` module because Windows always uses UTF-8, per PEP 529. (cherry picked from commit c9b399f) Co-authored-by: Barney Gale <[email protected]>
Fix pushed to the 3.12, 3.13 and 3.14 branches. Thank you for reporting, @manueljacob!
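The approach described in the commit message, quoting and unquoting with the filesystem encoding instead of a fixed UTF-8, can be sketched roughly as follows. This is not the actual patch: fs_quote() and fs_unquote() are hypothetical helper names chosen for illustration, and the round trip shown assumes a Unix system where the filesystem error handler is "surrogateescape".

```python
import sys
from urllib.parse import quote, unquote


def fs_unquote(url_path: str) -> str:
    """Percent-decode a URL path using the filesystem encoding (hypothetical helper)."""
    return unquote(
        url_path,
        encoding=sys.getfilesystemencoding(),
        errors=sys.getfilesystemencodeerrors(),
    )


def fs_quote(path: str) -> str:
    """Percent-encode a file path using the filesystem encoding (hypothetical helper)."""
    return quote(
        path,
        encoding=sys.getfilesystemencoding(),
        errors=sys.getfilesystemencodeerrors(),
    )


# The undecodable byte 0xE4 is preserved in both directions.
name = fs_unquote("/tmp/a%E4")
print(repr(name))
print(fs_quote(name))
```

With these settings the decoded name matches what os.fsdecode() would return for the same bytes, so it can be handed to os.stat() or open() and still refer to the original file.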
Linked PRs
`file` URIs #126852
`file` URIs (GH-126852) #127039
`file` URIs (GH-126852) #127040