-
-
Notifications
You must be signed in to change notification settings - Fork 32.2k
gh-86094 Add support for Unicode Path Extra Field in ZipFile #102566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Lib/zipfile/__init__.py
Outdated
# Unicode Path Extra Field | ||
try: | ||
up_version, up_name_crc = unpack('<BL', data[:5]) | ||
if up_version == 1 and up_name_crc == self.orig_filename_crc: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check that it will be enough to calculate crc32 on the fly.
if up_version == 1 and up_name_crc == self.orig_filename_crc: | |
if up_version == 1 and up_name_crc == crc32(self.filename): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, orig_filename_crc
is necessary. filename
in ZipInfo is the decoded value from ZipFile. But orig_filename_crc
is the result of the CRC on filename
before decoding. So it should be stored separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, how about passing the orig_filename_crc to _decodeExtra?
cpython/Lib/zipfile/__init__.py
Lines 1436 to 1463 in c75e016
filename = fp.read(centdir[_CD_FILENAME_LENGTH]) | |
orig_filename_crc = crc32(filename) | |
flags = centdir[_CD_FLAG_BITS] | |
if flags & _MASK_UTF_FILENAME: | |
# UTF-8 file names extension | |
filename = filename.decode('utf-8') | |
else: | |
# Historical ZIP filename encoding | |
filename = filename.decode(self.metadata_encoding or 'cp437') | |
# Create ZipInfo instance to store file information | |
x = ZipInfo(filename) | |
x.extra = fp.read(centdir[_CD_EXTRA_FIELD_LENGTH]) | |
x.comment = fp.read(centdir[_CD_COMMENT_LENGTH]) | |
x.header_offset = centdir[_CD_LOCAL_HEADER_OFFSET] | |
(x.create_version, x.create_system, x.extract_version, x.reserved, | |
x.flag_bits, x.compress_type, t, d, | |
x.CRC, x.compress_size, x.file_size) = centdir[1:12] | |
if x.extract_version > MAX_EXTRACT_VERSION: | |
raise NotImplementedError("zip file version %.1f" % | |
(x.extract_version / 10)) | |
x.volume, x.internal_attr, x.external_attr = centdir[15:18] | |
# Convert date/time code to (year, month, day, hour, min, sec) | |
x._raw_time = t | |
x.date_time = ( (d>>9)+1980, (d>>5)&0xF, d&0x1F, | |
t>>11, (t>>5)&0x3F, (t&0x1F) * 2 ) | |
x.orig_filename_crc = orig_filename_crc | |
x._decodeExtra() |
Looks like _decodeExtra
is only used at this moment. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea! I will update. :)
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
try: | ||
up_version, up_name_crc = unpack('<BL', data[:5]) | ||
if up_version == 1 and up_name_crc == self.orig_filename_crc: | ||
up_unicode_name = data[5:].decode('utf-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the spec, please check the TSize if possible.
As UnicodeName is defined to be UTF-8, no UTF-8
byte order mark (BOM) is used. The length of this field is determined
by subtracting the size of the previous fields from TSize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TSize is stored in the ln
variable. I don't think we need to check it because we are storing data of the size defined by ln
. If there is a problem, an exception would be thrown.
Lib/zipfile/__init__.py
Outdated
except struct.error: | ||
import warnings | ||
warnings.warn("Corrupt unicode path extra field (0x7075)", stacklevel=2) | ||
except UnicodeDecodeError: | ||
import warnings | ||
warnings.warn('Corrupt unicode path extra field (0x7075): invalid utf-8 bytes', stacklevel=2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please raise BadZipFile error to follow other conventions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified it in a way that throws an exception, but given that this is an extra
field, wouldn't it be okay to ignore it and let the 'original' filename field be used if an exception is thrown?
I have made the requested changes; please review again |
Thanks for making the requested changes! @corona10: please review the changes made to this pull request. |
I have made the requested changes; please review again |
Thanks for making the requested changes! @corona10: please review the changes made to this pull request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@yeojin-dev, thank you very much! |
Uh oh!
There was an error while loading. Please reload this page.