Skip to content

Conversation

@masakielastic
Copy link
Contributor

This pull request improves the value of subsitute charahcter when the value of third argument of mb_convert_encoding is different from the value of mb_internal_encoding.

@jpauli jpauli added the Feature label Feb 20, 2015
php-pulls pushed a commit that referenced this pull request Aug 10, 2016
Fix bug #69086 enhancement for mb_convert_encoding
@php-pulls
Copy link

Comment on behalf of yohgaki at php.net:

Merged. Thank you for PR.

@php-pulls php-pulls closed this Aug 10, 2016
@nikic
Copy link
Member

nikic commented Aug 3, 2017

Reviewing more code related to #1094 (comment)... there are some problems here as well. Again the check is being made against the source encoding of the string, not the target encoding, which is where the substitution character has to be mapped. For example:

<?php
mb_internal_encoding("UTF-8");
mb_substitute_character(0xfffd);
var_dump(bin2hex(mb_convert_encoding("\x80", "UTF-8", "EUC-JP-2004")));

This will result in U+3F, even though UTF-8 clearly supports U+FFFD.

However, even if the target encoding is checked instead of the source encoding, the check would still be too strict in the case where the target encoding is a "non-Unicode" encoding and does not match the internal encoding. There are many encodings that support large ranges of non-ASCII Unicode codepoints, but with the current logic they would always fall back to using U+3F.

@nikic
Copy link
Member

nikic commented Aug 3, 2017

This is now fixed by fb9bf5b. No upfront check is performed anymore, instead mbfl_convert will simply try to use the character and if that fails, fall back to ?. This way all characters supported by the target encoding should be usable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants