Android Cyrillic Encoding support: can we really detect a native encoding?

Latest recommended article published 2024-08-09 04:30:06

License: CC 4.0 BY-SA

Tags: #native Encoding #Cyrillic

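This article discusses the need for Cyrillic script support on Android, compares two single-byte encodings, Windows-1251 and ISO-8859-1, with particular attention to how they differ in the 0x80 to 0x9F range, raises the question of how to tell the two encodings apart, and gives a conclusion.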

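The requirement is to support the Cyrillic script on Android. Cyrillic text is commonly stored in a single-byte native charset. Can we really identify a Cyrillic native encoding accurately, and then convert it?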

FYI, "Cyrillic" in this article means the Windows-1251 encoding, a.k.a. CP1251.

Can we really detect a native character encoding?

Let's look at two code pages: Windows-1251 and ISO-8859-1.

Windows-1251 (a.k.a. code page CP1251) is a popular 8-bit character encoding, designed to cover languages that use the Cyrillic script such as Russian, Bulgarian, Serbian Cyrillic and other languages. It is the most widely used for encoding the Bulgarian, Serbian and Macedonian languages [citation needed].
In modern applications, Unicode is a preferred character set.
Windows-1251 and KOI8-R (or its Ukrainian variant KOI8-U) are much more commonly used than ISO 8859-5 [citation needed]. In the future, both may eventually give way to Unicode.

Windows-1251

ISO-8859-1 (a.k.a. Latin-1)

ISO 8859-1 encodes what it refers to as "Latin alphabet no. 1," consisting of 191 characters from the Latin script. This character-encoding scheme is used throughout The Americas, Western Europe, Oceania, and much of Africa. It is also commonly used in most standard romanizations of East-Asian languages.

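Now consider whether there is any way to tell these two native code pages apart. What they have in common is that both are single-byte encodings, and the lower half of each (0x00 to 0x7F) is fully ASCII-compatible.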
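To make the problem concrete, here is a minimal sketch in Python (chosen only for brevity; it is not the project's Android code). The codec names `cp1251` and `latin-1` are Python's standard names for Windows-1251 and ISO-8859-1, and the sample bytes are an illustrative assumption. It shows that the same byte sequence decodes without error under both code pages, and that the ASCII half is identical.

```python
# The six bytes below spell a Russian word in Windows-1251.
raw = bytes([0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2])

as_cyrillic = raw.decode("cp1251")   # Cyrillic letters
as_latin = raw.decode("latin-1")     # accented Latin letters, also "valid"

# The lower half (0x00-0x7F) decodes identically in both code pages.
ascii_bytes = bytes(range(0x80))
ascii_same = (
    ascii_bytes.decode("cp1251")
    == ascii_bytes.decode("latin-1")
    == ascii_bytes.decode("ascii")
)

print(as_cyrillic)  # Привет
print(as_latin)     # Ïðèâåò
print(ascii_same)   # True
```

Neither decode raises an error, so a successful decode alone says nothing about which encoding the bytes were written in.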

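The main structural difference is the 0x80 to 0x9F range: Windows-1251 (Cyrillic) assigns printable characters there, including several Cyrillic letters, while ISO-8859-1 (Latin-1) has no printable characters in that range and leaves it to control codes. Note that Windows-1252, which is often loosely called Latin-1, does assign printable characters to most of that range, which makes it even harder to tell apart from Windows-1251.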
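The difference in the 0x80 to 0x9F range can be checked with a short Python sketch. The helper name `printable_in_c1_range` is hypothetical, and the result reflects the code page tables shipped with Python's standard codecs, which is an assumption about how closely they match the tables used on a given device.

```python
import unicodedata

def printable_in_c1_range(encoding):
    """Return the bytes in 0x80-0x9F that decode to a non-control character."""
    hits = []
    for b in range(0x80, 0xA0):
        try:
            ch = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this code page
        if unicodedata.category(ch) != "Cc":  # skip C1 control codes
            hits.append(b)
    return hits

for name in ("cp1251", "latin-1", "cp1252"):
    print(name, len(printable_in_c1_range(name)))
```

With Python's tables, only `latin-1` has no printable characters in this range; both Windows code pages assign almost all of it, so a byte such as 0x88 is a hit in Windows-1251 and in Windows-1252 alike.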

So how can we actually tell them apart?

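In our project I used the following approach: for each single-byte, non-ASCII character, take its raw byte value, for example 0x88, look it up in both code page tables, and decide from the hit rate whether the text is 1251 or 1252 (Latin-1). As you might expect, this method is very unreliable: the two single-byte code tables differ so little that the confidence computed from the hit rate has essentially no value.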
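The hit-rate idea can be sketched roughly as follows. This is a reconstruction under assumptions, not the project's actual code: the function name `hit_rate`, the sample strings, and the use of Python's `cp1251` and `cp1252` codecs as the lookup tables are all illustrative.

```python
def hit_rate(data: bytes, encoding: str) -> float:
    """Fraction of non-ASCII bytes in `data` that are assigned in `encoding`."""
    high = [b for b in data if b >= 0x80]
    if not high:
        return 1.0  # pure ASCII: every code page matches
    hits = 0
    for b in high:
        try:
            bytes([b]).decode(encoding)  # table lookup for a single byte
            hits += 1
        except UnicodeDecodeError:
            pass  # byte is unassigned in this code page
    return hits / len(high)

russian = "Привет, мир".encode("cp1251")
french = "Café déjà vu".encode("cp1252")

for sample in (russian, french):
    print(hit_rate(sample, "cp1251"), hit_rate(sample, "cp1252"))
```

Both samples score a full hit rate under both code pages, because ordinary letters in either language live in 0xC0 to 0xFF, where every byte is assigned in both tables. The score therefore carries no information about the real encoding, which matches the conclusion above.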