Skip to content

Move utf8_encode and utf8_decode to ext/standard #2160

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 17, 2016

Conversation

hikari-no-yume
Copy link
Contributor

These are generic string functions which have a use outside of processing XML, and don't have any dependency on libxml. Therefore, this patch moves them to ext/standard.

This patch doesn't touch NEWS and UPGRADING currently. I can do those and merge this myself, if it's approved.

@smalyshev
Copy link
Contributor

looks good! 👍

zend_string *str;
unsigned char c;

str = zend_string_safe_alloc(len, 4, 0, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 4?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 2 should be enough here actually, since from iso-8859-1 to UTF-8 you can't get more than 2 bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. The original function was (needlessly!) more generic and supported encoding 3- and 4-byte characters, which I stripped out, but I missed this.

@hikari-no-yume
Copy link
Contributor Author

When I copied across the internal encode/decode functions to ext/standard, I simplified the code a little (it's less generic than the ext/xml code). I might want to assess whether ext/xml could possibly share these functions, or at least backport the improvements. I'm not sure how much of their feature set is actually used.

@hikari-no-yume
Copy link
Contributor Author

hikari-no-yume commented Oct 14, 2016

On that line of thought: though it's trivial, there's now at least four places in the PHP source where UTF-8 generation is done: the lexer (\u{} literals, this duplication is my fault), ext/xml, ext/standard/string.c (with this pull request) and ext/standard/html.c (php_utf32_utf8). This would be a good opportunity to maybe use the latter function in all these cases, though it might be best to move it to a header for inlining.

(It's interesting that PHP core has a bunch of multibyte stuff in html.c, actually. Should it really live there?…)

@krakjoe
Copy link
Member

krakjoe commented Oct 14, 2016

Can I suggest that this PR stays focused on just moving these string functions to a sensible place, and try to make it "nice" in another PR that addresses the duplication issue ?

@hikari-no-yume
Copy link
Contributor Author

Yes, that's fair.

@hikari-no-yume
Copy link
Contributor Author

@yohgaki Do you have any thoughts on this?

@yohgaki
Copy link
Contributor

yohgaki commented Oct 16, 2016

It seems these functions aimed to support various encodings when it is implemented. It will not happen, I guess. It may be better soft deprecate them. However, I don't have strong opinion on this.

@krakjoe
Copy link
Member

krakjoe commented Oct 17, 2016

I don't think an RFC is required here, would like to see this merged.

Am I wrong about that, do we need an RFC ?

@hikari-no-yume
Copy link
Contributor Author

Well, it doesn't break backwards-compatibility, and it doesn't even introduce a new feature, it just moves something from one extension to another. It's not even a “bug fix” really.

So… I don't think it requires an RFC? I could just merge this right away if there's no objections.

@hikari-no-yume
Copy link
Contributor Author

hikari-no-yume commented Oct 17, 2016

Another change that maybe should be done is renaming the functions and making these names be aliases, because utf8_encode and utf8_decode are not very descriptive names. But of course, aliasing is only meaningful insofar as it means the manual says something different, so we can just edit the manual instead. Aliases would only make much sense if we introduced counterpart functions for other character sets, but I don't think that will happen (these two functions are relics and have better alternatives).

@krakjoe
Copy link
Member

krakjoe commented Oct 17, 2016

I think it's okay to merge the patch as it is, just to move the functions out of xml.

I think if you wanted to tidy up some stuff and create aliases and remove duplication, then that needs an RFC.

I'm happy for this to be merged into master whatever, I'd wait for someone else to +1 that before doing it.

Please remember a news and upgrading entry when you merge.

@nikic
Copy link
Member

nikic commented Oct 17, 2016

+1

@hikari-no-yume hikari-no-yume force-pushed the move_utf8_functions_to_standard branch from 9332d5e to 1a512ee Compare October 17, 2016 14:39
@php-pulls php-pulls merged commit 8585316 into php:master Oct 17, 2016
@hikari-no-yume
Copy link
Contributor Author

Merged into master, so this will be in PHP 7.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants