Re: utf-8 filenames in phar files.

From: Yasuo Ohgaki
Date: Fri, 14 Feb 2014 23:56:36 +0000

Subject: Re: utf-8 filenames in phar files.

Groups: php.internals


Hi Dan,

On Sat, Feb 15, 2014 at 1:11 AM, Dan Ackroyd <[email protected]> wrote:

> That is not an issue as:
>
> i) Phar files produced on a windows machine should be identical to
> those produced on a Linux or OSX box.
>
> ii) There is a test in the phar code, so that if you do have filenames
> that are degenerate after normalising, the extraction throws an error.
> e.g. for the files
>
>     $filename1 = "Am\xC3\xA9lie.txt";
>     $filename2 = "Am\x65\xCC\x81lie.txt";
>
> If you add both to a phar archive and then attempt to extract them
> both you get the error:
>
>     "Cannot extract "Amélie.txt" to "output/Amélie.txt", path already exists"
>

I suppose there is no normalization code in phar, so it is your system (OS /
file system) that normalizes the file names.

Depending on the system's normalization is not a good idea:

 - File names could be NFC or NFD
 - File names in a phar may differ by system
 - Systems that do not normalize Unicode at all do exist
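
To make the NFC/NFD point concrete with the two names from your example, here
is a minimal sketch. It is Python only because its standard library ships
Unicode normalization without needing ICU; it is not phar code, and the
variable names are mine.

```python
import unicodedata

# The two byte sequences from the quoted example, decoded as UTF-8.
precomposed = b"Am\xC3\xA9lie.txt".decode("utf-8")     # uses U+00E9
decomposed = b"Am\x65\xCC\x81lie.txt".decode("utf-8")  # "e" followed by U+0301

# Compared as-is, these are two different names.
print(precomposed == decomposed)

# After normalization to the same form, they are one name.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed)
print(nfd == decomposed)
```

Which of the two forms ends up on disk depends on who wrote the file, which is
why the archive contents can differ by system.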

I do see file name normalization issues on my Linux/Windows and OSX machines
with git (core.precomposeunicode=true is required for correct operation on
OSX). I suggest applying NFC normalization to avoid the issue, as git does.

core.precomposeunicode
This option is only used by Mac OS implementation of Git. When
core.precomposeunicode=true, Git reverts the unicode decomposition of
filenames done by Mac OS. This is useful when sharing a repository between
Mac OS and Linux or Windows. (Git for Windows 1.7.10 or higher is needed,
or Git under cygwin 1.7). When false, file names are handled fully
transparent by Git, which is backward compatible with older versions of Git.
http://git-scm.com/docs/git-config
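
A rough sketch of what I mean by applying NFC when a file name enters the
archive, so that the stored name no longer depends on the system that built
it. This is Python for illustration only; `add_entry` and the dict standing in
for the archive are hypothetical, not the phar API.

```python
import unicodedata


def add_entry(archive: dict, name: str, data: bytes) -> str:
    """Store data under the NFC form of name; refuse colliding names."""
    key = unicodedata.normalize("NFC", name)
    if key in archive:
        raise ValueError(f"entry already exists after NFC normalization: {key!r}")
    archive[key] = data
    return key


archive = {}
add_entry(archive, "Am\u00e9lie.txt", b"first")       # precomposed form
try:
    add_entry(archive, "Ame\u0301lie.txt", b"second")  # decomposed form
except ValueError as exc:
    print("rejected:", exc)
```

With this approach the collision is reported when the archive is built, rather
than when it is extracted on a system that happens to normalize.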

As Rowan pointed out, although ICU is always detected by acinclude.m4, #if
should be used for ICU/intl related code. (intl uses ICU, so using intl means
using ICU. I think it's better not to rely on intl: it may be disabled or
built as a dynamically loaded (DL) module. There are also systems without ICU.)

Regards,

--
Yasuo Ohgaki
[email protected]

