Skip to content

email: get_payload(decode=True) doesn't handle Content-Transfer-Encoding with trailing white space #98188

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rpcross opened this issue Oct 11, 2022 · 2 comments
Labels
stdlib Python modules in the Lib dir topic-email type-bug An unexpected behavior, bug, or error

Comments

@rpcross
Copy link

rpcross commented Oct 11, 2022

If the Content-Transfer-Encoding header field of a message part has trailing whitespace, for example "base64 ", get_payload(decode=True) does not return the properly decoded payload.

Here is a minimal code example. Sample message file attached.

import email
from email import policy

with open('msg.txt', 'rb') as f:
    msg = email.message_from_binary_file(f, policy=policy.default)

parts = list(msg.walk())
parts[1].get_payload(decode=True)
> b'SGVsbG8uIFRlc3Rpbmc=\n'

The parsed content-transfer-encoding header "cte" value is truncated, but it's string value is not.

>>> header = parts[1].get('content-transfer-encoding')
>>> header.cte
'base64'
>>> str(header)
'base64 '

Which is what appears to be used in the decode attempt
https://github.com/python/cpython/blob/main/Lib/email/message.py#L289

  • CPython versions tested on: 3.9.13, 3.10.7
  • Operating system and architecture: macOS 12.6 Intel

msg.txt

Linked PRs

@rpcross rpcross added the type-bug An unexpected behavior, bug, or error label Oct 11, 2022
@rruuaanng
Copy link
Contributor

rruuaanng commented Sep 10, 2024

If the Content-Transfer-Encoding header field of a message part has trailing whitespace, for example "base64 ", get_payload(decode=True) does not return the properly decoded payload.

Here is a minimal code example. Sample message file attached.

import email
from email import policy

with open('msg.txt', 'rb') as f:
    msg = email.message_from_binary_file(f, policy=policy.default)

parts = list(msg.walk())
parts[1].get_payload(decode=True)
> b'SGVsbG8uIFRlc3Rpbmc=\n'

The parsed content-transfer-encoding header "cte" value is truncated, but it's string value is not.

>>> header = parts[1].get('content-transfer-encoding')
>>> header.cte
'base64'
>>> str(header)
'base64 '

Which is what appears to be used in the decode attempt https://github.com/python/cpython/blob/main/Lib/email/message.py#L289

* CPython versions tested on: 3.9.13, 3.10.7

* Operating system and architecture: macOS 12.6 Intel

msg.txt

This issuer, I have commit a PR

RanKKI added a commit to RanKKI/cpython that referenced this issue Dec 3, 2024
Fix `email.message.EmailMessage.get_payload` failing
to decode data when there is a trailing whitespace
following the `<mechanism>`.

For backward compatibility, `str(cte_header)` still
returns the original value; `get_payload` uses `cte_header.cte`
to retrieve the parsed CTE.
bitdancer pushed a commit that referenced this issue Jan 6, 2025
…has extra text (#127547)

Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding.  By Postel's Rule we should go ahead and decode as long as we can recognize that first token.  We have not thought of any security or backward compatibility concerns with this fix.

This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that.  This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support.  This seems like a reasonable thing to do.  It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute.

The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API.  compat32 is only fixed to the extent that it now ignores trailing spaces.  Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text.

Co-authored-by: Bénédikt Tran <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jan 6, 2025
…value has extra text (pythonGH-127547)

Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding.  By Postel's Rule we should go ahead and decode as long as we can recognize that first token.  We have not thought of any security or backward compatibility concerns with this fix.

This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that.  This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support.  This seems like a reasonable thing to do.  It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute.

The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API.  compat32 is only fixed to the extent that it now ignores trailing spaces.  Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text.

(cherry picked from commit a62ba52)

Co-authored-by: RanKKI <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Jan 6, 2025
…value has extra text (pythonGH-127547)

Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding.  By Postel's Rule we should go ahead and decode as long as we can recognize that first token.  We have not thought of any security or backward compatibility concerns with this fix.

This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that.  This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support.  This seems like a reasonable thing to do.  It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute.

The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API.  compat32 is only fixed to the extent that it now ignores trailing spaces.  Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text.

(cherry picked from commit a62ba52)

Co-authored-by: RanKKI <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this issue Jan 6, 2025
…value has extra text (python#127547)

Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding.  By Postel's Rule we should go ahead and decode as long as we can recognize that first token.  We have not thought of any security or backward compatibility concerns with this fix.

This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that.  This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support.  This seems like a reasonable thing to do.  It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute.

The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API.  compat32 is only fixed to the extent that it now ignores trailing spaces.  Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text.

Co-authored-by: Bénédikt Tran <[email protected]>
@picnixz picnixz added the stdlib Python modules in the Lib dir label Jan 6, 2025
bitdancer pushed a commit that referenced this issue Jan 7, 2025
… value has extra text (GH-127547) (#128528)

gh-98188: Fix EmailMessage.get_payload to decode data when CTE value has extra text (GH-127547)

Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding.  By Postel's Rule we should go ahead and decode as long as we can recognize that first token.  We have not thought of any security or backward compatibility concerns with this fix.

This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that.  This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support.  This seems like a reasonable thing to do.  It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute.

The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API.  compat32 is only fixed to the extent that it now ignores trailing spaces.  Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text.

(cherry picked from commit a62ba52)

Co-authored-by: RanKKI <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
bitdancer pushed a commit that referenced this issue Jan 7, 2025
… value has extra text (GH-127547) (#128529)

gh-98188: Fix EmailMessage.get_payload to decode data when CTE value has extra text (GH-127547)

Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding.  By Postel's Rule we should go ahead and decode as long as we can recognize that first token.  We have not thought of any security or backward compatibility concerns with this fix.

This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that.  This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support.  This seems like a reasonable thing to do.  It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute.

The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API.  compat32 is only fixed to the extent that it now ignores trailing spaces.  Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text.

(cherry picked from commit a62ba52)

Co-authored-by: RanKKI <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
@bitdancer
Copy link
Member

Fix committed to all three branches. Thanks all.

srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this issue Jan 8, 2025
…value has extra text (python#127547)

Up to this point message handling has been very strict with regards to content encoding values: mixed case was accepted, but trailing blanks or other text would cause decoding failure, even if the first token was a valid encoding.  By Postel's Rule we should go ahead and decode as long as we can recognize that first token.  We have not thought of any security or backward compatibility concerns with this fix.

This fix does introduce a new technique/pattern to the Message code: we look to see if the header has a 'cte' attribute, and if so we use that.  This effectively promotes the header API exposed by HeaderRegistry to an API that any header parser "should" support.  This seems like a reasonable thing to do.  It is not, however, a requirement, as the string value of the header is still used if there is no cte attribute.

The full fix (ignore any trailing blanks or blank-separated trailing text) applies only to the non-compat32 API.  compat32 is only fixed to the extent that it now ignores trailing spaces.  Note that the HeaderRegistry parsing still records a HeaderDefect if there is extra text.

Co-authored-by: Bénédikt Tran <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir topic-email type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

5 participants