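The requirement is to support the Cyrillic script on Android. Cyrillic text often arrives in a single-byte native charset. Can we really identify that native Cyrillic encoding accurately and then convert it?
FYI, the Cyrillic encoding in question here is Windows-1251, a.k.a. CP1251.
Can we really identify a native character encoding?
Let's look at two code pages: one is Windows-1251, the other is ISO-8859-1.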
Windows-1251 (a.k.a. code page CP1251) is a popular 8-bit character encoding, designed to cover languages that use the Cyrillic script such as Russian, Bulgarian, Serbian Cyrillic and other languages. It is the most widely used for encoding the Bulgarian, Serbian and Macedonian languages.
In modern applications, Unicode is a preferred character set.
Windows-1251 and KOI8-R (or its Ukrainian variant KOI8-U) are much more commonly used than ISO 8859-5. In the future, both may eventually give way to Unicode.
Windows-1251
ISO-8859-1, a.k.a. Latin-1
ISO 8859-1 encodes what it refers to as "Latin alphabet no. 1", consisting of 191 characters from the Latin script. This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa. It is also commonly used in most standard romanizations of East-Asian languages.
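Consider: is there a way to tell these two native code pages apart? What they have in common is that both are single-byte encodings, and the lower half (0x00~0x7F) of each is fully ASCII-compatible.
The main structural difference is the 0x80~0x9F range: Windows-1251 assigns printable characters there (Cyrillic letters and some punctuation), while ISO-8859-1 has no printable characters in that range, only C1 control codes. Note that Windows-1252, which is often treated as Latin-1, is not identical to ISO-8859-1: it does assign printable characters to most of 0x80~0x9F. In 0xA0~0xFF every byte is valid in both tables; only the meaning differs.
So how exactly do we tell them apart?
In our project I used the following approach: for each single-byte, non-ASCII character, take its raw byte value (for example 0x88), look it up in both code page tables, and decide from the hit rate whether the text is 1251 or 1252 (Latin-1). As you might expect, this method is very unreliable: the two single-byte tables differ so little in which bytes are assigned that the confidence computed from the hit rate has practically no value.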
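To make the comparison concrete, here is a minimal sketch that asks Python's built-in codecs which bytes in 0x80~0x9F have no mapping. The helper name `undefined_bytes` is hypothetical, and the result reflects Python's codec tables, which I am assuming match the code pages discussed here.

```python
def undefined_bytes(codec: str, lo: int = 0x80, hi: int = 0x9F) -> list[int]:
    """Hypothetical helper: list the bytes in [lo, hi] that the codec cannot decode."""
    missing = []
    for b in range(lo, hi + 1):
        try:
            bytes([b]).decode(codec)
        except UnicodeDecodeError:
            missing.append(b)
    return missing


# Unassigned bytes per code page, according to Python's codec tables.
gaps = {codec: undefined_bytes(codec) for codec in ("cp1251", "cp1252", "latin-1")}

# The same raw byte decodes to a different character in each table.
as_1251 = bytes([0x88]).decode("cp1251")
as_1252 = bytes([0x88]).decode("cp1252")
as_latin1 = bytes([0x88]).decode("latin-1")
```

Only a handful of bytes are unassigned in any of the three tables, so a byte that fails to decode is rare evidence, and a byte that decodes says nothing about which table was intended.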
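The hit-rate idea can be sketched as follows. This is not the project's actual code, only an illustration under the assumption that a "hit" means the byte has a mapping in the code page; `hit_rate` is a hypothetical name.

```python
def hit_rate(data: bytes, codec: str) -> float:
    """Hypothetical helper: fraction of non-ASCII bytes that the codec can decode."""
    high = [b for b in data if b >= 0x80]
    if not high:
        return 1.0  # pure ASCII: every candidate code page fits equally well
    hits = 0
    for b in high:
        try:
            bytes([b]).decode(codec)
            hits += 1
        except UnicodeDecodeError:
            pass
    return hits / len(high)


# Genuine Russian text encoded as Windows-1251.
sample = "Привет, мир".encode("cp1251")
rates = {codec: hit_rate(sample, codec) for codec in ("cp1251", "cp1252")}

# Decoding with the wrong table raises no error; it just yields mojibake.
wrong = sample.decode("cp1252")
```

Every byte of ordinary Russian text falls in 0xC0~0xFF, which both tables define, so both hit rates come out as 1.0 and the "confidence" cannot separate the two encodings.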
Conclusion:
We should be told which native encoding we are dealing with, rather than trying to detect or guess it.
Here is the question I asked about this problem on StackOverflow:
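As a minimal sketch of what "being told" looks like in code, the conversion below takes the charset name as an explicit input instead of guessing it. The function `to_utf8` and its parameter `declared_charset` are hypothetical names; where the declared charset comes from (file metadata, user setting, protocol header) is outside the scope of this sketch.

```python
def to_utf8(data: bytes, declared_charset: str) -> bytes:
    """Hypothetical helper: transcode data from a caller-declared charset to UTF-8."""
    text = data.decode(declared_charset)  # raises UnicodeDecodeError on unmapped bytes
    return text.encode("utf-8")


raw = "Привет".encode("cp1251")

# Correct declaration: the text round-trips.
good = to_utf8(raw, "windows-1251").decode("utf-8")

# Wrong declaration: no error, just the wrong characters.
bad = to_utf8(raw, "latin-1").decode("utf-8")
```

The wrong declaration fails silently, which is why the declared charset has to be trustworthy.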
http://stackoverflow.com/questions/17544426/how-to-detect-windows-1251-encoded-characters