Skip to content

gh-86094 Add support for Unicode Path Extra Field in ZipFile #102566

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 5, 2023

Conversation

yeojin-dev
Copy link
Contributor

@yeojin-dev yeojin-dev commented Mar 9, 2023

@corona10 corona10 self-requested a review March 9, 2023 23:42
# Unicode Path Extra Field
try:
up_version, up_name_crc = unpack('<BL', data[:5])
if up_version == 1 and up_name_crc == self.orig_filename_crc:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check that it will be enough to calculate crc32 on the fly.

Suggested change
if up_version == 1 and up_name_crc == self.orig_filename_crc:
if up_version == 1 and up_name_crc == crc32(self.filename):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, orig_filename_crc is necessary. filename in ZipInfo is the decoded value from ZipFile. But orig_filename_crc is the result of the CRC on filename before decoding. So it should be stored separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yeojin-dev

Okay, how about passing the orig_filename_crc to _decodeExtra?

filename = fp.read(centdir[_CD_FILENAME_LENGTH])
orig_filename_crc = crc32(filename)
flags = centdir[_CD_FLAG_BITS]
if flags & _MASK_UTF_FILENAME:
# UTF-8 file names extension
filename = filename.decode('utf-8')
else:
# Historical ZIP filename encoding
filename = filename.decode(self.metadata_encoding or 'cp437')
# Create ZipInfo instance to store file information
x = ZipInfo(filename)
x.extra = fp.read(centdir[_CD_EXTRA_FIELD_LENGTH])
x.comment = fp.read(centdir[_CD_COMMENT_LENGTH])
x.header_offset = centdir[_CD_LOCAL_HEADER_OFFSET]
(x.create_version, x.create_system, x.extract_version, x.reserved,
x.flag_bits, x.compress_type, t, d,
x.CRC, x.compress_size, x.file_size) = centdir[1:12]
if x.extract_version > MAX_EXTRACT_VERSION:
raise NotImplementedError("zip file version %.1f" %
(x.extract_version / 10))
x.volume, x.internal_attr, x.external_attr = centdir[15:18]
# Convert date/time code to (year, month, day, hour, min, sec)
x._raw_time = t
x.date_time = ( (d>>9)+1980, (d>>5)&0xF, d&0x1F,
t>>11, (t>>5)&0x3F, (t&0x1F) * 2 )
x.orig_filename_crc = orig_filename_crc
x._decodeExtra()

Looks like _decodeExtra is only used at this moment. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea! I will update. :)

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

try:
up_version, up_name_crc = unpack('<BL', data[:5])
if up_version == 1 and up_name_crc == self.orig_filename_crc:
up_unicode_name = data[5:].decode('utf-8')
Copy link
Member

@corona10 corona10 Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the spec, please check the TSize if possible.


As UnicodeName is defined to be UTF-8, no UTF-8
byte order mark (BOM) is used.  The length of this field is determined
by subtracting the size of the previous fields from TSize.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TSize is stored in the ln variable. I don't think we need to check it because we are storing data of the size defined by ln. If there is a problem, an exception would be thrown.

Comment on lines 533 to 538
except struct.error:
import warnings
warnings.warn("Corrupt unicode path extra field (0x7075)", stacklevel=2)
except UnicodeDecodeError:
import warnings
warnings.warn('Corrupt unicode path extra field (0x7075): invalid utf-8 bytes', stacklevel=2)
Copy link
Member

@corona10 corona10 Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please raise BadZipFile error to follow other conventions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified it in a way that throws an exception, but given that this is an extra field, wouldn't it be okay to ignore it and let the 'original' filename field be used if an exception is thrown?

@yeojin-dev
Copy link
Contributor Author

I have made the requested changes; please review again

@bedevere-bot
Copy link

Thanks for making the requested changes!

@corona10: please review the changes made to this pull request.

@bedevere-bot bedevere-bot requested a review from corona10 March 16, 2023 12:37
@arhadthedev arhadthedev added the stdlib Python modules in the Lib dir label Mar 22, 2023
@yeojin-dev
Copy link
Contributor Author

I have made the requested changes; please review again

@bedevere-bot
Copy link

Thanks for making the requested changes!

@corona10: please review the changes made to this pull request.

@corona10 corona10 self-assigned this Apr 5, 2023
Copy link
Member

@corona10 corona10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@agiudiceandrea
Copy link

@yeojin-dev, thank you very much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants