Commit 4856618

Generate EUC_CN mappings from gb18030-2022.ucm
In the wake of cfa6cd2, EUC_CN was the only encoding that used gb-18030-2000.xml to generate the .map files. Since EUC_CN is a subset of GB18030, we can easily use the same UCM file. This allows deleting the XML file from our repository.

Author: Chao Li <[email protected]>
Discussion: https://postgr.es/m/CANWCAZaNRXZ-5NuXmsaMA2mKvMZnCGHZqQusLkpE%2B8YX%2Bi5OYg%40mail.gmail.com
1 parent 684a745 commit 4856618

3 files changed: +23 -30929 lines changed

src/backend/utils/mb/Unicode/Makefile

Lines changed: 2 additions & 2 deletions
@@ -50,7 +50,7 @@ $(eval $(call map_rule,gbk,UCS_to_most.pl,CP936.TXT,GBK))
 $(eval $(call map_rule,johab,UCS_to_JOHAB.pl,JOHAB.TXT))
 $(eval $(call map_rule,uhc,UCS_to_UHC.pl,windows-949-2000.xml))
 $(eval $(call map_rule,euc_jp,UCS_to_EUC_JP.pl,CP932.TXT JIS0212.TXT))
-$(eval $(call map_rule,euc_cn,UCS_to_EUC_CN.pl,gb-18030-2000.xml))
+$(eval $(call map_rule,euc_cn,UCS_to_EUC_CN.pl,gb18030-2022.ucm))
 $(eval $(call map_rule,euc_kr,UCS_to_EUC_KR.pl,KSX1001.TXT))
 $(eval $(call map_rule,euc_tw,UCS_to_EUC_TW.pl,CNS11643.TXT))
 $(eval $(call map_rule,sjis,UCS_to_SJIS.pl,CP932.TXT))
@@ -75,7 +75,7 @@ BIG5.TXT CNS11643.TXT:
 euc-jis-2004-std.txt sjis-0213-2004-std.txt:
 	$(DOWNLOAD) http://x0213.org/codetable/$(@F)
 
-gb-18030-2000.xml windows-949-2000.xml:
+windows-949-2000.xml:
 	$(DOWNLOAD) https://raw.githubusercontent.com/unicode-org/icu-data/master/charset/data/xml/$(@F)
 
 gb18030-2022.ucm:

src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl

Lines changed: 21 additions & 11 deletions
@@ -2,16 +2,17 @@
 #
 # Copyright (c) 2007-2025, PostgreSQL Global Development Group
 #
-# src/backend/utils/mb/Unicode/UCS_to_GB18030.pl
+# src/backend/utils/mb/Unicode/UCS_to_EUC_CN.pl
 #
-# Generate UTF-8 <--> GB18030 code conversion tables from
-# "gb-18030-2000.xml", obtained from
-# http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/
+# Generate UTF-8 <--> EUC_CN code conversion tables from
+# "gb18030-2022.ucm", obtained from
+# https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/
 #
 # The lines we care about in the source file look like
-# <a u="009A" b="81 30 83 36"/>
-# where the "u" field is the Unicode code point in hex,
-# and the "b" field is the hex byte sequence for GB18030
+# <UXXXX> \xYY[\xYY...] |n
+# where XXXX is the Unicode code point in hex,
+# and the \xYY... is the hex byte sequence for GB18030,
+# and n is a flag indicating the type of mapping.
 
 use strict;
 use warnings FATAL => 'all';
@@ -22,17 +23,26 @@
 
 # Read the input
 
-my $in_file = "gb-18030-2000.xml";
+my $in_file = "gb18030-2022.ucm";
 
 open(my $in, '<', $in_file) || die("cannot open $in_file");
 
 my @mapping;
 
 while (<$in>)
 {
-	next if (!m/<a u="([0-9A-F]+)" b="([0-9A-F ]+)"/);
-	my ($u, $c) = ($1, $2);
-	$c =~ s/ //g;
+	# Mappings may have been removed by commenting out
+	next if /^#/;
+
+	next if !/^<U([0-9A-Fa-f]+)>\s+
+		((?:\\x[0-9A-Fa-f]{2})+)\s+
+		\|(\d+)/x;
+	my ($u, $c, $flag) = ($1, $2, $3);
+	$c =~ s/\\x//g;
+
+	# We only want round-trip mappings
+	next if ($flag ne '0');
+
 	my $ucs = hex($u);
 	my $code = hex($c);
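
For readers who want to try the new line filter outside the build, here is a minimal sketch of the same parsing steps in Python. It is a transliteration for illustration only, not part of the commit: the names UCM_LINE and parse_ucm_line are hypothetical, the sample lines are made up, and the |1 flag on the third sample is used only as an example of a flag other than 0. The Perl pattern spans several lines with the /x modifier; here it is written on a single line.

```python
import re

# Same shape as the pattern in the patched Perl script:
#   <UXXXX> \xYY[\xYY...] |n
UCM_LINE = re.compile(
    r"^<U([0-9A-Fa-f]+)>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d+)"
)


def parse_ucm_line(line):
    """Return (ucs, code) for a round-trip mapping line, otherwise None."""
    # Mappings may have been removed by commenting them out.
    if line.startswith("#"):
        return None
    m = UCM_LINE.match(line)
    if m is None:
        return None
    u, c, flag = m.groups()
    # Keep only round-trip mappings (flag 0), as the patch does.
    if flag != "0":
        return None
    # Drop the "\x" prefixes and read the remaining hex digits as one number.
    return int(u, 16), int(c.replace("\\x", ""), 16)


# Illustrative input only; these lines are not copied from gb18030-2022.ucm.
sample = [
    "# <U0080> \\x81\\x30\\x81\\x30 |0",  # commented out, skipped
    "<U4E00> \\xD2\\xBB |0",              # round-trip mapping, kept
    "<U00B7> \\xA1\\xA4 |1",              # flag other than 0, skipped
    "CHARMAP",                            # not a mapping line, skipped
]
mappings = [p for p in map(parse_ucm_line, sample) if p is not None]
```

Only the second sample line survives the filter. The visible part of the diff stops at the hex conversion, so any later restriction of the code range to EUC_CN is outside what this sketch covers.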
