
Incorrect byte shift when interpreting 32-bit utf-8 codepoints #743


Merged

Conversation


@tjanc tjanc commented Feb 14, 2018

While reviewing my own JSON string escaping code against popular implementations, I noticed an inaccuracy in yours. Codepoints for 4-byte UTF-8 sequences are calculated incorrectly:

    unsigned int calculated = ((firstByte & 0x07) << 24)
      | ((static_cast<unsigned int>(s[1]) & 0x3F) << 12)
      | ((static_cast<unsigned int>(s[2]) & 0x3F) << 6)
      |  (static_cast<unsigned int>(s[3]) & 0x3F);

Notice the shift of 24 applied to the leading byte, even though only 3×6 bits are stored in the lower-significance continuation bytes, so the correct shift is 18.

@cdunn2001 cdunn2001 merged commit 313a0e4 into open-source-parsers:master Feb 14, 2018
@cdunn2001 (Contributor)

I wish we had a regression test for this case, but thank you!

@cdunn2001 cdunn2001 added the bug label Feb 14, 2018