mb_detect_encoding fails to detect ASCII for certain words #18410

Closed
himikeb1 opened this issue Apr 23, 2025 · 8 comments

@himikeb1

Description

The following code:

php > $encs = ["ASCII", "ISO-8859-1", "UCS-2BE", "UCS-2LE", "UCS-2", "UTF-16", "UTF-8"];
php > $msg="Hello world";
php > var_dump(mb_detect_encoding($msg, $encs));
string(5) "ASCII"
php > var_dump(mb_detect_encoding($msg, $encs));
string(5) "ASCII"
php > $msg="Hello";
php > var_dump(mb_detect_encoding($msg, $encs));
string(5) "ASCII"
php > $msg="Stop";
php > var_dump(mb_detect_encoding($msg, $encs));
string(7) "UCS-2BE"               // <-- PROBLEM: should be "ASCII"
php > $msg="stop";
php > var_dump(mb_detect_encoding($msg, $encs));
string(5) "ASCII"
php > $msg="Stop IT";
php > var_dump(mb_detect_encoding($msg, $encs));
string(5) "ASCII"

Resulted in this output:

"Stop"  NOT being detected as ASCII, but UCS-2BE instead

But I expected this output instead:

"Stop" should be detected as ASCII.

Note that:

  var_dump(mb_check_encoding("Stop", "ASCII"));

does result in true.

Bonus Info:

PHP 8.2 on CentOS:
$ ea-php82 -a
Interactive shell

php > var_dump(mb_detect_encoding("Stop.", $encs));
string(5) "ASCII"
php > var_dump(mb_detect_encoding("Stop", $encs));
string(7) "UCS-2BE"

PHP 7.1 on CentOS:

$ ea-php71 -a
Interactive shell
php > $encs = ["ASCII", "ISO-8859-1", "UCS-2BE", "UCS-2LE", "UCS-2", "UTF-16", "UTF-8"];
php > var_dump(mb_detect_encoding("Stop.", $encs));
string(5) "ASCII"
php > var_dump(mb_detect_encoding("Stop", $encs));
string(5) "ASCII"

PHP Version

$ php8.3 -v
PHP 8.3.20 (cli) (built: Apr 10 2025 21:33:24) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.3.20, Copyright (c) Zend Technologies
with Zend OPcache v8.3.20, Copyright (c), by Zend Technologies

Also tested on 8.3.19 on CentOS Linux release 7.9.2009 (broken)
Also tested on 8.2.28 on CentOS Linux release 7.9.2009 (broken)
Also tested on 7.1.33 on CentOS Linux release 7.9.2009 (works)

Operating System

Ubuntu 22.04.4 LTS, CentOS Linux release 7.9.2009

@nielsdos
Member

nielsdos commented Apr 23, 2025

I think I get why this happens.
The resulting demerits for "Stop" are 4 under ASCII, but 2.something under UCS-2BE.
Because ASCII always consumes 1 byte per codepoint, we end up with 4 codepoints, each adding 1 to the demerits score, for a total of 4. UCS-2BE, however, consumes 2 bytes per iteration, so we end up with a score of 2 plus some penalty.
To solve this, we should probably normalize by width of the codepoint (e.g. 1 for ASCII, 2 for UCS-2BE).
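The asymmetry described above can be sketched with a toy model. This is illustrative only: the costs are not mbstring's actual demerit tables, and real decoding is stateful, but it shows why the narrower encoding loses on equal-length input.

```python
# Toy model of the scoring asymmetry: each decoded codepoint adds a
# cost, so a 1-byte encoding decodes more codepoints from the same
# input and accumulates more demerits. Costs are illustrative, not
# mbstring's real tables.
def demerits(data, bytes_per_codepoint, cost_per_codepoint=1):
    n_codepoints = len(data) // bytes_per_codepoint
    return n_codepoints * cost_per_codepoint

def demerits_normalized(data, bytes_per_codepoint, cost_per_codepoint=1):
    # The proposed fix: weight each codepoint's cost by the encoding's
    # (minimum) width, putting narrow and wide encodings on an equal footing.
    return demerits(data, bytes_per_codepoint, cost_per_codepoint) * bytes_per_codepoint

msg = b"Stop"
print(demerits(msg, 1))             # ASCII:   4 codepoints -> score 4
print(demerits(msg, 2))             # UCS-2BE: 2 codepoints -> score 2, so it "wins"
print(demerits_normalized(msg, 1))  # 4
print(demerits_normalized(msg, 2))  # 4 -- now a tie, broken by other signals
```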

@nielsdos
Member

nielsdos commented Apr 23, 2025

Something like this would work (but is incomplete). Also, the assumption that UTF-16 always occupies 2 bytes per code point isn't right, since surrogate pairs take 4; but 2 is a correct minimum, so I guess for a "guessing" function it's acceptable.

diff --git a/ext/mbstring/mbstring.c b/ext/mbstring/mbstring.c
index dec565707fa..b5023cb2643 100644
--- a/ext/mbstring/mbstring.c
+++ b/ext/mbstring/mbstring.c
@@ -3063,6 +3063,13 @@ static size_t count_demerits(struct candidate *array, size_t length, bool strict
 			/* Do we still have more input to process for this candidate encoding? */
 			if (array[i].in_len) {
 				const mbfl_encoding *enc = array[i].enc;
+				unsigned int char_len = enc->flag & (MBFL_ENCTYPE_SBCS | MBFL_ENCTYPE_WCS2 | MBFL_ENCTYPE_WCS4);
+				if (enc == &mbfl_encoding_utf16be || enc == &mbfl_encoding_utf16le || enc == &mbfl_encoding_utf16) {
+					// XXX
+					char_len = 2;
+				} else {
+					// XXX: mblen_table???
+				}
 				size_t out_len = enc->to_wchar((unsigned char**)&array[i].in, &array[i].in_len, wchar_buf, 128, &array[i].state);
 				ZEND_ASSERT(out_len <= 128);
 				/* Check this batch of decoded codepoints; are there any error markers?
@@ -3082,7 +3089,7 @@ static size_t count_demerits(struct candidate *array, size_t length, bool strict
 							array[i].demerits += 1000;
 						}
 					} else {
-						array[i].demerits += estimate_demerits(w);
+						array[i].demerits += estimate_demerits(w) * MAX(1, char_len);
 					}
 				}
 				if (array[i].in_len == 0) {
@@ -3095,6 +3102,7 @@ try_next_encoding:;
 
 	for (size_t i = 0; i < length; i++) {
 		double demerits = array[i].demerits * (double) array[i].multiplier;
+		// fprintf(stderr,"demerits %s %f\n",array[i].enc->name,demerits);
 		array[i].demerits = demerits < (double) UINT64_MAX ? (uint64_t) demerits : UINT64_MAX;
 	}
 

cc @alexdowad @youkidearitai what do you think?
EDIT: perhaps a "minimum codepoint width" field would work well?

@alexdowad
Contributor

@nielsdos I haven't looked at this in detail yet, but first let me just say: In the routine for rating which candidate encoding(s) are more likely to be the intended one, encodings with a small number of bytes per codepoint are deliberately "penalized", in that with more codepoints for a given input byte sequence, they have more chances to accumulate "demerits".

One reason for this is because mbstring supports some legacy 1-byte text encodings where every byte decodes to some codepoint (i.e. every byte sequence is valid), and all the resulting codepoints fall within our set of "common" codepoints (for encoding detection purposes).

If we normalized the "demerit" calculation to put 1-byte and 2-byte encodings on an even playing field, as you are suggesting, then such 1-byte encodings would always be chosen when they are included in the $encodings parameter.

Also, there are lots of cases where you have a string with, e.g., two UTF-16 Chinese characters, but those 2 Chinese characters can be interpreted as 4 Latin characters if the string is interpreted as ASCII instead. If the 2 Chinese characters in question are obscure, ancient characters which are not used in modern Chinese or Japanese, then guessing that the string is ASCII is reasonable. But if they are frequently used Chinese characters, then should we guess that the string is UTF-16, or ASCII? Currently, we favor UTF-16 in that case.

In the case which @himikeb1 has shown here, it appears "obvious" that the string is ASCII, because "Stop" is an English word. But what if it was a sequence of 4 gibberish Latin characters, which don't form a word in any modern language? I think anyone would agree that it's not so clear-cut in that case. But does that mean mbstring needs to include a dictionary of all words in all modern languages? (Remember, although the development of PHP is carried out in English, mbstring doesn't favor English over other languages; a large part of the purpose of mbstring is to help process text which is written in any world language.)

Instead of including such a dictionary and attempting to do word lookups in the input text, which I think is a non-starter in many ways, another approach would be to analyze a reference corpus for digram frequency, build a little database of the most common digrams in all the most frequently used world languages, and use digram frequency as a signal to help mb_detect_encoding figure out what the intended text encoding is.

Then we would be able to tell that "Stop" is ASCII, but "Tpqz" may not be.
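The digram idea can be sketched in a few lines. This is a minimal illustration with a tiny, hand-picked set of common English digrams; a real implementation would build its table from a multilingual reference corpus, as described above.

```python
# Minimal sketch of the digram-frequency signal. COMMON_DIGRAMS is a
# hypothetical, hand-picked English-only set for illustration; a real
# table would be derived from a corpus covering many languages.
COMMON_DIGRAMS = {"st", "to", "op", "th", "he", "in", "er", "an", "on"}

def digram_score(text):
    """Fraction of adjacent character pairs that are 'common' digrams."""
    pairs = [text[i:i + 2].lower() for i in range(len(text) - 1)]
    if not pairs:
        return 0.0
    return sum(p in COMMON_DIGRAMS for p in pairs) / len(pairs)

print(digram_score("Stop"))  # 1.0 -- every pair (st, to, op) is common
print(digram_score("Tpqz"))  # 0.0 -- no pair is common
```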

Another approach would be to use an ML model; with a suitable corpus to train the model, it might potentially take advantage of all kinds of features of the input text to guess the intended encoding more accurately.

However, we shouldn't forget another point: mb_detect_encoding is intended to be used on strings which are at least a few dozen characters long. It will never be very accurate on strings with just a few bytes, no matter what we do. A lot of users misunderstand this.

@alexdowad
Contributor

alexdowad commented Apr 23, 2025

@nielsdos Just expanding on the above, after thinking a bit more.

I'm going to say that it's a blanket "no" for any changes to the heuristics used by mb_detect_encoding, unless the proposed change has been tested on a corpus of at least several tens of thousands of examples, covering strings of varying length and content, in all major world languages. When that testing is done, come back with statistics and we'll talk.

The current heuristics have been tested on many examples, with varying lengths, with varying content, and in many different languages. Getting mb_detect_encoding as accurate as it is today required striking a careful balance. I tried a number of tweaks and variations, and in all cases, there were examples where mb_detect_encoding would return the "wrong" answer. The current weights have been tweaked to make the "wrong" answers as rare as I could manage. If you adjust the algorithm again based on just one example, in all likelihood, you will make it work better on that one input and worse on 100 others.

If there is someone who is willing to do most of the work involved in adopting more powerful techniques, like analyzing digram frequencies, or using ML, I'm game for that, and would lend a hand.

For @himikeb1, if you expect that your input strings will usually be ASCII, and you want to prioritize treating them as ASCII when they are valid as such, then you should tell PHP to do that.

if (mb_check_encoding($str, 'ASCII')) {
    $enc = 'ASCII';
} else {
    $enc = mb_detect_encoding($str, $encs);
}

@youkidearitai
Contributor

I agree with @alexdowad, because "Stop" can also be read as UTF-16 (U+5374, U+6F70).
https://3v4l.org/s0j4o
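This can be checked directly at the byte level. Python is used here purely for illustration, since the point is about the bytes themselves rather than PHP:

```python
# The four ASCII bytes of "Stop" are also a valid UCS-2BE/UTF-16BE
# sequence: 0x53 0x74 -> U+5374, 0x6F 0x70 -> U+6F70.
data = b"Stop"
decoded = data.decode("utf-16-be")
print([f"U+{ord(c):04X}" for c in decoded])  # ['U+5374', 'U+6F70']
```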

I think this is a duplicate of #16566.

@nielsdos
Member

Makes sense, thanks for the thorough insights!

nielsdos closed this as not planned Apr 24, 2025
@himikeb1
Author

Thank you very much for your quick response on this.
I was reading all the comments in email and decided I'd try to change the ticket to "documentation", but I see you've already updated the page with a can't-miss-it warning, so thank you for that! I did not realize that it was a "guess".
I would request that @alexdowad's comment
"mb_detect_encoding is intended to be used on strings which are at least a few dozen characters long. It will never be very accurate on strings with just a few bytes"
also be included in the documentation somewhere, for clarity.
And, before submitting the ticket, I had already implemented the exact check/fallback suggested by @alexdowad, and it does appear to be working as intended now; at least for my use case, Alex is spot-on.

@alexdowad
Contributor

@himikeb1 Just went to update the mb_detect_encoding docs, and discovered that someone else already did it yesterday 🤣
