
Incorrect byte shift when interpreting 32-bit utf-8 codepoints #743


Merged

Conversation


@tjanc tjanc commented Feb 14, 2018

While reviewing my own JSON string escaping code against popular implementations, I noticed an inaccuracy in yours. Codepoints for 4-byte UTF-8 sequences are calculated incorrectly:

    unsigned int calculated = ((firstByte & 0x07) << 24)
      | ((static_cast<unsigned int>(s[1]) & 0x3F) << 12)
      | ((static_cast<unsigned int>(s[2]) & 0x3F) << 6)
      |  (static_cast<unsigned int>(s[3]) & 0x3F);

Notice the shift of 24 applied to the leading byte, even though only 3×6 bits are stored in the lower-significance continuation bytes, so the correct shift is 18.

@cdunn2001 cdunn2001 merged commit 313a0e4 into open-source-parsers:master Feb 14, 2018
@cdunn2001 (Contributor)

I wish we had a regression test for this case, but thank you!

@cdunn2001 cdunn2001 added the bug label Feb 14, 2018