Variants of CJK encodings do not match variants specified by WHATWG, even when a more similar Python codec exists. #26

Open
@harjitmoe

This is a confusing topic, since most people who learn that Shift JIS exists do not want to also learn about multiple competing versions of it.

However:

  • WHATWG's index jis0208 includes "formerly proprietary extensions from IBM and NEC".  Python's codec for Shift JIS including these extensions is "cp932", aka "ms-kanji".  Python's "shift_jis" codec excludes these extensions.  Sadly, Python does not offer EUC-JP or ISO-2022-JP codecs including these extensions.
  • WHATWG's index Big5 includes "the Hong Kong Supplementary Character Set and other common extensions".  Python's "big5" codec follows BIG5.TXT, which does not include these extensions but does include a less common extension for hiragana and katakana; that extension is incompatible with (and actually collides with) the hiragana and katakana extension included by the ETEN, IBM and WHATWG versions of Big5.  Python's "big5hkscs" codec is not exactly the same as WHATWG's Big5, owing to a small number of edge cases and because it does not treat codes with lead bytes below 0xA1 as decode-only, but it is much closer to the WHATWG behaviour than the "big5" codec.  This is especially true of the decoders, even though Python's "big5hkscs" decoder does not accept every code that WHATWG's does.  The encoders still differ considerably in which codes they exclude, but the output of Python's "big5hkscs" encoder will basically always be correctly interpreted by WHATWG's "big5" decoder, which cannot be said of the output of Python's "big5" encoder.
  • WHATWG's index EUC-KR consists of "the KS X 1001 standard and the Unified Hangul Code, more commonly known together as Windows Codepage 949".  Python's codec for exactly this is "cp949", aka "uhc".  By contrast, Python's "euc-kr" codec does not include the Unified Hangul Code extensions, and instead transforms the characters in question to and from KS X 1001 combining sequences (which work differently to Unicode combining sequences; hence, the characters in question do not exhibit combining behaviour when decoded one-by-one to Unicode).  The WHATWG decoder for EUC-KR does not recognise or transform back these sequences.

Some illustrative examples where differences occur:

>>> webencodings.decode(b'\x87\x82\x87@ \xedB', "windows-31j") # Should be "№① 鍈"
('�g@ �B', <Encoding shift_jis>)
>>> webencodings.decode(b'\xc7g\xc6\xf1\xc6\xfd\xc7g\xc6\xf1\xc6\xfd', "big5-hkscs") # Should be "むかしむかし"
('ハろウハろウ', <Encoding big5>)
>>> webencodings.decode(b'\x8cc\xb9\xe6\xb0\xa2\xc7\xcf', "windows-949") # Should be "똠방각하"
('�c방각하', <Encoding euc-kr>)
>>> 
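For comparison, the same byte strings can be decoded with Python's built-in codecs directly, without going through webencodings. The following is a minimal sketch; the `SAMPLES` table and the `compare` helper are names made up for illustration, and the expected strings are the "Should be" values from the session above.

```python
# Byte strings and expected text are copied from the examples above.
# Each entry: (data, extended Python codec, basic Python codec, expected text).
SAMPLES = [
    (b'\x87\x82\x87@ \xedB', "cp932", "shift_jis", "№① 鍈"),
    (b'\xc7g\xc6\xf1\xc6\xfd\xc7g\xc6\xf1\xc6\xfd', "big5hkscs", "big5",
     "むかしむかし"),
    (b'\x8cc\xb9\xe6\xb0\xa2\xc7\xcf', "cp949", "euc_kr", "똠방각하"),
]


def compare(data, extended, basic):
    """Decode the same bytes with the extended and the basic codec."""
    wide = data.decode(extended)
    # The basic codec may reject some bytes, so substitute U+FFFD
    # instead of raising, as the session above does.
    narrow = data.decode(basic, errors="replace")
    return wide, narrow


if __name__ == "__main__":
    for data, extended, basic, expected in SAMPLES:
        wide, narrow = compare(data, extended, basic)
        print(f"{extended}: {wide!r}    {basic}: {narrow!r}")
```

In each case the extended codec should yield the expected text, while the basic codec yields either replacement characters or different characters.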

A number of other differences exist, and a fully conformant implementation of the WHATWG Encoding Standard is not possible in Python without re-implementing several of the encodings (including most of the CJK ones, as well as e.g. KOI8-U).  However, conformance with it, and in particular compatibility with it, would be considerably improved for much less effort by:

  • Using Python's "ms-kanji" codec for WHATWG's Shift JIS, not Python's "shift_jis" codec.
  • Using Python's "big5hkscs" codec for WHATWG's Big5, not Python's "big5" codec.
  • Using Python's "uhc" codec for WHATWG's EUC-KR, not Python's "euc-kr" codec.
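As a rough sketch of what these three substitutions amount to, the mapping could be expressed as a small override table. This is not how webencodings is actually structured; `PYTHON_CODEC_FOR_WHATWG`, `python_codec_for` and `decode_as` are hypothetical names used only for illustration, and only the standard library `codecs` module is assumed.

```python
import codecs

# Hypothetical override table: WHATWG encoding name -> Python codec name.
# Only the three encodings discussed above are listed.
PYTHON_CODEC_FOR_WHATWG = {
    "shift_jis": "ms-kanji",   # alias of Python's cp932
    "big5": "big5hkscs",
    "euc-kr": "uhc",           # alias of Python's cp949
}


def python_codec_for(whatwg_name):
    """Return the CodecInfo to use for a WHATWG encoding name.

    Names without an override are passed to codecs.lookup() unchanged.
    """
    key = whatwg_name.lower()
    return codecs.lookup(PYTHON_CODEC_FOR_WHATWG.get(key, key))


def decode_as(data, whatwg_name, errors="replace"):
    """Decode bytes using the overridden codec for a WHATWG name."""
    text, _consumed = python_codec_for(whatwg_name).decode(data, errors)
    return text


if __name__ == "__main__":
    print(decode_as(b'\x8cc\xb9\xe6\xb0\xa2\xc7\xcf', "EUC-KR"))
```

This only swaps which Python codec is used; it does not address the remaining edge cases mentioned above.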
