mb_detect_encoding fails to detect ASCII for certain words #18410
I think I get why this happens.
Something like this would work (but is incomplete). Also, the assumption that UTF-16 occupies 2 bytes isn't quite right, since surrogate pairs exist; but a minimum of 2 is right, so I guess for a "guessing" function it's acceptable.

```diff
diff --git a/ext/mbstring/mbstring.c b/ext/mbstring/mbstring.c
index dec565707fa..b5023cb2643 100644
--- a/ext/mbstring/mbstring.c
+++ b/ext/mbstring/mbstring.c
@@ -3063,6 +3063,13 @@ static size_t count_demerits(struct candidate *array, size_t length, bool strict
 		/* Do we still have more input to process for this candidate encoding? */
 		if (array[i].in_len) {
 			const mbfl_encoding *enc = array[i].enc;
+			unsigned int char_len = enc->flag & (MBFL_ENCTYPE_SBCS | MBFL_ENCTYPE_WCS2 | MBFL_ENCTYPE_WCS4);
+			if (enc == &mbfl_encoding_utf16be || enc == &mbfl_encoding_utf16le || enc == &mbfl_encoding_utf16) {
+				// XXX
+				char_len = 2;
+			} else {
+				// XXX: mblen_table???
+			}
 			size_t out_len = enc->to_wchar((unsigned char**)&array[i].in, &array[i].in_len, wchar_buf, 128, &array[i].state);
 			ZEND_ASSERT(out_len <= 128);
 			/* Check this batch of decoded codepoints; are there any error markers?
@@ -3082,7 +3089,7 @@ static size_t count_demerits(struct candidate *array, size_t length, bool strict
 					array[i].demerits += 1000;
 				}
 			} else {
-				array[i].demerits += estimate_demerits(w);
+				array[i].demerits += estimate_demerits(w) * MAX(1, char_len);
 			}
 		}
 		if (array[i].in_len == 0) {
@@ -3095,6 +3102,7 @@ try_next_encoding:;
 	for (size_t i = 0; i < length; i++) {
 		double demerits = array[i].demerits * (double) array[i].multiplier;
+		// fprintf(stderr, "demerits %s %f\n", array[i].enc->name, demerits);
 		array[i].demerits = demerits < (double) UINT64_MAX ? (uint64_t) demerits : UINT64_MAX;
 	}
```

cc @alexdowad @youkidearitai what do you think?
@nielsdos I haven't looked at this in detail yet, but first let me just say: in the routine for rating which candidate encoding(s) are more likely to be the intended one, encodings with a small number of bytes per codepoint are deliberately "penalized", in that with more codepoints for a given input byte sequence, they have more chances to accumulate "demerits". One reason for this is that mbstring supports some legacy 1-byte text encodings where every byte decodes to some codepoint (i.e. every byte sequence is valid), and all the resulting codepoints fall within our set of "common" codepoints (for encoding detection purposes). If we normalized the "demerit" calculation to put 1-byte and 2-byte encodings on an even playing field, as you are suggesting, then such 1-byte encodings would always be chosen when they are included in the list of candidate encodings.

Also, there are lots of cases where you have a string with, for example, two UTF-16 Chinese characters, but those 2 Chinese characters can be interpreted as 4 Latin characters if the string is interpreted as ASCII instead. If the 2 Chinese characters in question are obscure, ancient characters which are not used in modern Chinese or Japanese, then guessing that the string is ASCII is reasonable. But if they are frequently used Chinese characters, then should we guess that the string is UTF-16, or ASCII? Currently, we favor UTF-16 in that case.

In the case which @himikeb1 has shown here, it appears "obvious" that the string is ASCII, because "Stop" is an English word. But what if it was a sequence of 4 gibberish Latin characters, which don't form a word in any modern language? I think anyone would agree that it's not so clear-cut in that case. But does that mean mbstring needs to include a dictionary of all words in all modern languages?
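To make the ambiguity concrete: the four ASCII bytes of "Stop" are also two valid UTF-16BE code units, so both interpretations are legal and the detector genuinely has to guess. A quick check (assuming the mbstring extension is loaded; this snippet is mine, not from the thread):

```php
<?php
$s = "Stop";

// The same four bytes are valid in both encodings:
var_dump(mb_check_encoding($s, 'ASCII'));    // bool(true)
var_dump(mb_check_encoding($s, 'UTF-16BE')); // bool(true)

// Decoded as UTF-16BE, the bytes 0x53 0x74 0x6F 0x70 become two CJK
// codepoints (U+5374 and U+6F70) instead of four Latin letters.
echo bin2hex(mb_convert_encoding($s, 'UTF-32BE', 'UTF-16BE')), "\n"; // 0000537400006f70
```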
(Remember, although the development of PHP is carried out in English, mbstring doesn't favor English over other languages; a large part of the purpose of mbstring is to help process text which is written in any world language.)

Instead of including such a dictionary and attempting to do word lookups in the input text, which I think is a non-starter in many ways, another approach would be to analyze a reference corpus for digram frequency, build a little database of the most common digrams in all the most frequently used world languages, and use digram frequency as a signal to help guide the guess. Then we would be able to tell that "Stop" is ASCII, but "Tpqz" may not be.

Another approach would be to use an ML model; with a suitable corpus to train the model, it might potentially take advantage of all kinds of features of the input text to guess the intended encoding more accurately. However, we shouldn't forget another point:
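The digram-frequency idea could be sketched roughly as follows. Everything here is illustrative: the function name and the frequency values are made up, and a real implementation would derive the table from a reference corpus for each supported language.

```php
<?php
// Illustrative sketch: score text by how common its letter digrams are.
// Higher scores suggest plausible natural-language Latin text.
function digram_score(string $text, array $digramFreq): float
{
    $text = strtolower($text);
    $score = 0.0;
    $n = strlen($text);
    for ($i = 0; $i + 1 < $n; $i++) {
        $score += $digramFreq[substr($text, $i, 2)] ?? 0.0;
    }
    // Normalize by the number of digrams so string length doesn't dominate.
    return $n > 1 ? $score / ($n - 1) : 0.0;
}

// Made-up frequencies for a handful of English digrams:
$digramFreq = ['st' => 0.9, 'to' => 0.8, 'op' => 0.7, 'tp' => 0.01];

// "Stop" scores much higher than the gibberish "Tpqz".
var_dump(digram_score('Stop', $digramFreq) > digram_score('Tpqz', $digramFreq)); // bool(true)
```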
@nielsdos Just expanding on the above, after thinking a bit more. I'm going to say that it's a blanket "no" for any changes to the heuristics used by mb_detect_encoding. The current heuristics have been tested on many examples, with varying lengths, with varying content, and in many different languages. Getting good results across all of those cases is a delicate balance.

If there is someone who is willing to do most of the work involved in adopting more powerful techniques, like analyzing digram frequencies, or using ML, I'm game for that, and would lend a hand.

For @himikeb1: if you expect that your input strings will usually be ASCII, and you want to prioritize treating them as ASCII when they are valid as such, then you should tell PHP to do that.
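One way to "tell PHP to do that" is to check for valid ASCII explicitly before falling back to detection. The helper below is my own sketch, not code from the thread, and the fallback encoding list is an arbitrary example:

```php
<?php
// Prefer ASCII whenever the input is valid ASCII; only consult
// mb_detect_encoding for everything else. (Illustrative sketch.)
function detect_preferring_ascii(string $s): string|false
{
    if (mb_check_encoding($s, 'ASCII')) {
        return 'ASCII';
    }
    // Example candidate list; adjust to the encodings you actually expect.
    return mb_detect_encoding($s, ['UTF-8', 'UTF-16', 'SJIS'], true);
}

var_dump(detect_preferring_ascii("Stop")); // string(5) "ASCII"
```

This sidesteps the heuristic entirely for the common case, which is exactly the kind of application-level policy the detector itself cannot assume.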
I agree with @alexdowad, because I think this is a duplicate of #16566.
Makes sense, thanks for the thorough insights!
Thank you very much for your quick response on this.
@himikeb1 Just went to update the
Description
The following code:
Resulted in this output:
But I expected this output instead:
Note that:
does result in true.
Bonus Info:
PHP 7.1 on CentOS:
PHP Version
$ php8.3 -v
PHP 8.3.20 (cli) (built: Apr 10 2025 21:33:24) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.3.20, Copyright (c) Zend Technologies
with Zend OPcache v8.3.20, Copyright (c), by Zend Technologies
Also tested on 8.3.19 on CentOS Linux release 7.9.2009 (broken)
Also tested on 8.2.28 on CentOS Linux release 7.9.2009 (broken)
Also tested on 7.1.33 on CentOS Linux release 7.9.2009 (works)
Operating System
Ubuntu 22.04.4 LTS, CentOS Linux release 7.9.2009