urllib should fsdecode percent-encoded parts of file URIs on Unix #85168

Closed
manueljacob mannequin opened this issue Jun 17, 2020 · 2 comments
Labels
stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error

Comments

manueljacob mannequin commented Jun 17, 2020

BPO 40996
Nosy @ezio-melotti, @manueljacob

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2020-06-17.00:19:10.953>
labels = ['type-bug', 'library', 'expert-unicode']
title = 'urllib should fsdecode percent-encoded parts of file URIs on Unix'
updated_at = <Date 2020-06-17.10:14:42.633>
user = 'https://github.com/manueljacob'

bugs.python.org fields:

activity = <Date 2020-06-17.10:14:42.633>
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)', 'Unicode']
creation = <Date 2020-06-17.00:19:10.953>
creator = 'mjacob'
dependencies = []
files = []
hgrepos = []
issue_num = 40996
keywords = []
message_count = 1.0
messages = ['371702']
nosy_count = 2.0
nosy_names = ['ezio.melotti', 'mjacob']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue40996'
versions = []

manueljacob mannequin commented Jun 17, 2020

On Unix, file names are bytes. Python mostly prefers to use unicode for file names. On the Python <-> system boundary, os.fsencode() / os.fsdecode() are used.
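
To make the boundary conversion concrete, here is a minimal sketch of the round trip through os.fsdecode() and os.fsencode(). It assumes a Unix system, where the filesystem error handler is "surrogateescape"; the byte string is only an illustration and no file is touched.

```python
import os

# A file name containing a byte that is not valid UTF-8 (illustrative only).
raw_name = b"/tmp/a\xe4"

# Assumption: Unix, where undecodable bytes are mapped to lone surrogates
# by the "surrogateescape" error handler instead of raising an error.
text_name = os.fsdecode(raw_name)

# Encoding the str again restores the original bytes.
round_tripped = os.fsencode(text_name)

print(ascii(text_name))
print(round_tripped == raw_name)
```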

In URIs, bytes can be percent-encoded. On Unix, most applications pass the percent-decoded bytes in file URIs to the file system unchanged. The remainder of this issue description is about Unix, except for the last paragraph.

pathlib fsencodes the path when making a file URI, so the bytes of a file name (e.g. one passed as a command-line argument) round-trip into the URI:
% python3 -c 'import pathlib, sys; print(pathlib.Path(sys.argv[1]).as_uri())' /tmp/a$(echo -e '\xE4')
file:///tmp/a%E4

Example with curl using this URL:
% echo 'Hello, World!' > /tmp/a$(echo -e '\xE4')
% curl file:///tmp/a%E4
Hello, World!

Python 2’s urllib works the same:
% python2 -c 'from urllib import urlopen; print(repr(urlopen("file:///tmp/a%E4").read()))'
'Hello, World!\n'

However, Python 3’s urllib fails:
% python3 -c 'from urllib.request import urlopen; print(repr(urlopen("file:///tmp/a%E4").read()))' 
Traceback (most recent call last):
  File "/usr/lib/python3.8/urllib/request.py", line 1507, in open_local_file
    stats = os.stat(localfile)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/a�'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1485, in file_open
    return self.open_local_file(req)
  File "/usr/lib/python3.8/urllib/request.py", line 1524, in open_local_file
    raise URLError(exp)
urllib.error.URLError: <urlopen error [Errno 2] No such file or directory: '/tmp/a�'>

urllib.request.url2pathname() is the function converting the path of the file URI to a file name. On Unix, it uses urllib.parse.unquote() with the default settings (UTF-8 encoding and the "replace" error handler).
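
The effect of those defaults can be seen without touching the file system. The sketch below compares urllib.parse.unquote() with urllib.parse.unquote_to_bytes() on the path from the example above; the variable names are only illustrative.

```python
from urllib.parse import unquote, unquote_to_bytes

uri_path = "/tmp/a%E4"

# Default settings: UTF-8 with the "replace" error handler. The lone
# byte 0xE4 is not valid UTF-8, so it becomes U+FFFD and the original
# byte value is lost.
lossy = unquote(uri_path)

# Decoding to bytes keeps the percent-decoded byte as-is.
exact = unquote_to_bytes(uri_path)

print(ascii(lossy))  # '/tmp/a\ufffd'
print(exact)         # b'/tmp/a\xe4'
```

Once the byte has been replaced by U+FFFD, no later step can recover the name that exists on disk, which is why os.stat() fails in the traceback.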

I think that on Unix, the settings from os.fsdecode() should be used, so that it roundtrips with pathlib.Path.as_uri() and so that the percent-decoded bytes are passed to the file system as-is.
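
A rough sketch of what this proposal could look like, assuming Unix. The helper names uri_path_to_filename() and filename_to_uri_path() are hypothetical and are not part of urllib; they only illustrate percent-decoding to bytes and then applying the os.fsdecode() settings.

```python
import os
from urllib.parse import quote_from_bytes, unquote_to_bytes


def uri_path_to_filename(uri_path):
    """Hypothetical helper: percent-decode to bytes, then fsdecode."""
    return os.fsdecode(unquote_to_bytes(uri_path))


def filename_to_uri_path(filename):
    """Hypothetical inverse: fsencode, then percent-encode the bytes."""
    return quote_from_bytes(os.fsencode(filename))


filename = uri_path_to_filename("/tmp/a%E4")

# The bytes handed to the file system match the percent-decoded bytes,
# and converting back gives the original URI path.
print(os.fsencode(filename))
print(filename_to_uri_path(filename))
```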

On Windows, I couldn’t run experiments, but using UTF-8 seems like the right thing (according to https://en.wikipedia.org/wiki/File_URI_scheme#Windows_2). I’m not sure that the "replace" error handler is a good idea. I prefer "errors should never pass silently" from the Zen of Python, but I don’t have a strong opinion on this.

@manueljacob manueljacob mannequin added stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error labels Jun 17, 2020
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
barneygale added a commit to barneygale/cpython that referenced this issue Nov 15, 2024
…` URIs

Adjust `urllib.request.url2pathname()` and `pathname2url()` to use the
filesystem encoding when quoting and unquoting file URIs, rather than
forcing use of UTF-8.

No changes are needed in the `nturl2path` module because Windows always
uses UTF-8, per PEP 529.
barneygale added a commit to barneygale/cpython that referenced this issue Nov 19, 2024
barneygale added a commit that referenced this issue Nov 19, 2024
…#126852)

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Nov 19, 2024
…` URIs (pythonGH-126852)

(cherry picked from commit c9b399f)

Co-authored-by: Barney Gale <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Nov 19, 2024
…` URIs (pythonGH-126852)

(cherry picked from commit c9b399f)

Co-authored-by: Barney Gale <[email protected]>
barneygale added a commit that referenced this issue Nov 19, 2024
…e` URIs (GH-126852) (#127040)

GH-85168: Use filesystem encoding when converting to/from `file` URIs (GH-126852)

(cherry picked from commit c9b399f)

Co-authored-by: Barney Gale <[email protected]>
barneygale added a commit that referenced this issue Nov 20, 2024
…e` URIs (GH-126852) (#127039)

GH-85168: Use filesystem encoding when converting to/from `file` URIs (GH-126852)

(cherry picked from commit c9b399f)

Co-authored-by: Barney Gale <[email protected]>
@barneygale
Contributor

Fix pushed to the 3.12, 3.13 and 3.14 branches. Thank you for reporting, @manueljacob!

ebonnal pushed a commit to ebonnal/cpython that referenced this issue Jan 12, 2025
…` URIs (python#126852)
