Commit cfa6cd2

Generate GB18030 mappings from the Unicode Consortium's UCM file
Previously we built the .map files for GB18030 (version 2000) from an XML file. The 2022 version for this encoding is only available as a Unicode Character Mapping (UCM) file, so as preparatory refactoring switch to this format as the source for building version 2000.

As we do with most input files for the conversion mappings, download the file on demand. In order to generate the same mappings we have now, we must download from a previous upstream commit, rather than the head since the latter contains a correction not present in our current .map files.

The XML file is still used by EUC_CN, so we cannot delete it from our repository. GB18030 is a superset of EUC_CN, so it may be possible to build EUC_CN from the same UCM file, but that is left for future work.

Author: Chao Li <[email protected]>
Discussion: https://postgr.es/m/966d9fc.169.198741fe60b.Coremail.jiaoshuntian%40highgo.com
1 parent e56a601 commit cfa6cd2

3 files changed: +29 -11 lines changed

src/backend/utils/mb/Unicode/Makefile

Lines changed: 4 additions & 1 deletion
@@ -54,7 +54,7 @@ $(eval $(call map_rule,euc_cn,UCS_to_EUC_CN.pl,gb-18030-2000.xml))
 $(eval $(call map_rule,euc_kr,UCS_to_EUC_KR.pl,KSX1001.TXT))
 $(eval $(call map_rule,euc_tw,UCS_to_EUC_TW.pl,CNS11643.TXT))
 $(eval $(call map_rule,sjis,UCS_to_SJIS.pl,CP932.TXT))
-$(eval $(call map_rule,gb18030,UCS_to_GB18030.pl,gb-18030-2000.xml))
+$(eval $(call map_rule,gb18030,UCS_to_GB18030.pl,gb-18030-2000.ucm))
 $(eval $(call map_rule,big5,UCS_to_BIG5.pl,CP950.TXT BIG5.TXT CP950.TXT))
 $(eval $(call map_rule,euc_jis_2004,UCS_to_EUC_JIS_2004.pl,euc-jis-2004-std.txt))
 $(eval $(call map_rule,shift_jis_2004,UCS_to_SHIFT_JIS_2004.pl,sjis-0213-2004-std.txt))
@@ -78,6 +78,9 @@ euc-jis-2004-std.txt sjis-0213-2004-std.txt:
 gb-18030-2000.xml windows-949-2000.xml:
 	$(DOWNLOAD) https://raw.githubusercontent.com/unicode-org/icu-data/master/charset/data/xml/$(@F)
 
+gb-18030-2000.ucm:
+	$(DOWNLOAD) https://raw.githubusercontent.com/unicode-org/icu-data/d9d3a6ed27bb98a7106763e940258f0be8cd995b/charset/data/ucm/$(@F)
+
 GB2312.TXT:
 	$(DOWNLOAD) 'http://trac.greenstone.org/browser/trunk/gsdl/unicode/MAPPINGS/EASTASIA/GB/GB2312.TXT?rev=1842&format=txt'
 

src/backend/utils/mb/Unicode/UCS_to_GB18030.pl

Lines changed: 19 additions & 9 deletions
@@ -5,13 +5,14 @@
 # src/backend/utils/mb/Unicode/UCS_to_GB18030.pl
 #
 # Generate UTF-8 <--> GB18030 code conversion tables from
-# "gb-18030-2000.xml", obtained from
-# http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/
+# "gb-18030-2000.ucm", obtained from
+# https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm
 #
 # The lines we care about in the source file look like
-#    <a u="009A" b="81 30 83 36"/>
-# where the "u" field is the Unicode code point in hex,
-# and the "b" field is the hex byte sequence for GB18030
+#       <UXXXX> \xYY[\xYY...] |n
+# where XXXX is the Unicode code point in hex,
+# and the \xYY... is the hex byte sequence for GB18030,
+# and n is a flag indicating the type of mapping.
 
 use strict;
 use warnings FATAL => 'all';
@@ -22,17 +23,26 @@
 
 # Read the input
 
-my $in_file = "gb-18030-2000.xml";
+my $in_file = "gb-18030-2000.ucm";
 
 open(my $in, '<', $in_file) || die("cannot open $in_file");
 
 my @mapping;
 
 while (<$in>)
 {
-	next if (!m/<a u="([0-9A-F]+)" b="([0-9A-F ]+)"/);
-	my ($u, $c) = ($1, $2);
-	$c =~ s/ //g;
+	# Mappings may have been removed by commenting out
+	next if /^#/;
+
+	next if !/^<U([0-9A-Fa-f]+)>\s+
+		((?:\\x[0-9A-Fa-f]{2})+)\s+
+		\|(\d+)/x;
+	my ($u, $c, $flag) = ($1, $2, $3);
+	$c =~ s/\\x//g;
+
+	# We only want round-trip mappings
+	next if ($flag ne '0');
+
 	my $ucs = hex($u);
 	my $code = hex($c);
 	if ($code >= 0x80 && $ucs >= 0x0080)
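
For readers more at home in C than in Perl regular expressions, the line format described in the script's header comment (`<UXXXX> \xYY[\xYY...] |n`) can be parsed roughly as follows. This is only an illustrative sketch, not code from the commit: `hex_value` and `parse_ucm_line` are hypothetical names, the flag is assumed to be a single digit, and the code is assumed to be at most four bytes long.

```c
#include <stdint.h>

/* Return the value of one hex digit, or -1 if c is not a hex digit. */
static int
hex_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/*
 * Parse one UCM mapping line of the form "<UXXXX> \xYY[\xYY...] |n".
 * Hypothetical helper for illustration; not part of PostgreSQL or ICU.
 * Returns 1 and fills *ucs, *code and *flag on success, 0 otherwise
 * (for example on comment lines starting with '#').
 */
static int
parse_ucm_line(const char *line, uint32_t *ucs, uint32_t *code, int *flag)
{
    const char *p = line;
    uint32_t    u = 0;
    uint32_t    c = 0;
    int         ndigits = 0;
    int         nbytes = 0;

    /* "<U" followed by 1 to 6 hex digits and ">" */
    if (p[0] != '<' || p[1] != 'U')
        return 0;
    p += 2;
    while (ndigits < 6 && hex_value(*p) >= 0)
    {
        u = (u << 4) | (uint32_t) hex_value(*p++);
        ndigits++;
    }
    if (ndigits == 0 || *p != '>')
        return 0;
    p++;

    /* at least one blank between the fields */
    if (*p != ' ' && *p != '\t')
        return 0;
    while (*p == ' ' || *p == '\t')
        p++;

    /* one to four "\xYY" groups, packed big-endian into one integer */
    while (nbytes < 4 && p[0] == '\\' && p[1] == 'x' &&
           hex_value(p[2]) >= 0 && hex_value(p[3]) >= 0)
    {
        c = (c << 8) | (uint32_t) ((hex_value(p[2]) << 4) | hex_value(p[3]));
        nbytes++;
        p += 4;
    }
    if (nbytes == 0)
        return 0;

    if (*p != ' ' && *p != '\t')
        return 0;
    while (*p == ' ' || *p == '\t')
        p++;

    /* "|n": assumed here to be a single decimal digit */
    if (p[0] != '|' || p[1] < '0' || p[1] > '9')
        return 0;

    *ucs = u;
    *code = c;
    *flag = p[1] - '0';
    return 1;
}
```

A caller mirroring the script would skip any line for which `parse_ucm_line` returns 0 and then keep only entries whose flag is 0, since the script treats only those as round-trip mappings.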

src/backend/utils/mb/conversion_procs/utf8_and_gb18030/utf8_and_gb18030.c

Lines changed: 6 additions & 1 deletion
@@ -124,7 +124,12 @@ utf8word_to_unicode(uint32 c)
 /*
  * Perform mapping of GB18030 ranges to UTF8
  *
- * The ranges we need to convert are specified in gb-18030-2000.xml.
+ * General description, and the range we need to convert for U+10000 and up:
+ * https://htmlpreview.github.io/?https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/gb18030.html
+ *
+ * Ranges up to U+FFFF:
+ * https://github.com/unicode-org/icu-data/blob/main/charset/source/gb18030/ranges.txt
+ *
  * All are ranges of 4-byte GB18030 codes.
  */
 static uint32
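
To illustrate what "ranges of 4-byte GB18030 codes" means in the comment above, here is a minimal sketch of the arithmetic for the supplementary-plane range alone. It assumes that range runs linearly from 0x90308130 (U+10000) to 0xE3329A35 (U+10FFFF). The function names are hypothetical and this is not the code in utf8_and_gb18030.c, which also handles the ranges up to U+FFFF.

```c
#include <stdint.h>

/*
 * Turn a 4-byte GB18030 code into a linear index.  Bytes 1 and 3 take
 * 126 values each (0x81..0xFE) and bytes 2 and 4 take 10 values each
 * (0x30..0x39), hence the weights 12600, 1260 and 10.
 * Hypothetical helper for illustration only.
 */
static uint32_t
gb18030_linear(uint32_t gb)
{
    uint32_t    b0 = (gb >> 24) & 0xff;
    uint32_t    b1 = (gb >> 16) & 0xff;
    uint32_t    b2 = (gb >> 8) & 0xff;
    uint32_t    b3 = gb & 0xff;

    return b0 * 12600 + b1 * 1260 + b2 * 10 + b3;
}

/* Check that every byte of a 4-byte code lies in its legal range. */
static int
gb18030_is_valid_4byte(uint32_t gb)
{
    uint32_t    b0 = (gb >> 24) & 0xff;
    uint32_t    b1 = (gb >> 16) & 0xff;
    uint32_t    b2 = (gb >> 8) & 0xff;
    uint32_t    b3 = gb & 0xff;

    return b0 >= 0x81 && b0 <= 0xfe &&
        b1 >= 0x30 && b1 <= 0x39 &&
        b2 >= 0x81 && b2 <= 0xfe &&
        b3 >= 0x30 && b3 <= 0x39;
}

/*
 * Map a 4-byte GB18030 code in the supplementary range to its Unicode
 * code point.  Returns 0 if the code is malformed or outside the range.
 */
static uint32_t
gb18030_supplementary_to_ucs(uint32_t gb)
{
    const uint32_t first = 0x90308130;  /* U+10000 */
    const uint32_t last = 0xE3329A35;   /* U+10FFFF */

    if (!gb18030_is_valid_4byte(gb) || gb < first || gb > last)
        return 0;

    return 0x10000 + (gb18030_linear(gb) - gb18030_linear(first));
}
```

Because the mapping inside a range is pure arithmetic, such ranges need no table entries; only the start and end points have to be known.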
