GB18030-2022 Support in PostgreSQL

Started by Saladin, 7 months ago, 54 messages
#1Saladin
jiaoshuntian@highgo.com

Hi hackers,

I noticed that PostgreSQL currently supports GB18030 encoding based on the older GB18030-2000 standard (as seen in commits like extend GB18030 conversion). However, China has since updated its mandatory character set standard to GB18030-2022, which includes additional characters and stricter compliance requirements. GB18030-2022 is now the official standard in China, and ensuring PostgreSQL’s full compliance would be beneficial for users in Chinese-speaking regions.

I would like to ask:

Are there any plans to upgrade PostgreSQL’s GB18030 support to the 2022 version? Would the community be open to contributions in this area?

Best regards,

JiaoShuntian

HighGo Inc.

#2Saladin
jiaoshuntian@highgo.com
In reply to: Saladin (#1)
Re: GB18030-2022 Support in PostgreSQL

I would like to ask:

Are there any plans to upgrade PostgreSQL’s GB18030 support to the 2022 version? Would the community be open to contributions in this area?

I think we only need to update the Perl script and the map files to complete this task.

JiaoShuntian
HighGo Inc.

#3wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Saladin (#1)
Re: GB18030-2022 Support in PostgreSQL

Hi
😂 Not long ago, many people were rushing to remove this character set
because of a security vulnerability. I was honestly quite shocked when I
saw it.

Thanks

#4John Naylor
john.naylor@enterprisedb.com
In reply to: Saladin (#1)
Re: GB18030-2022 Support in PostgreSQL

On Mon, Aug 4, 2025 at 3:08 PM JiaoShuntian <jiaoshuntian@highgo.com> wrote:

I noticed that PostgreSQL currently supports GB18030 encoding based on the older GB18030-2000 standard (as seen in commits like extend GB18030 conversion). However, China has since updated its mandatory character set standard to GB18030-2022, which includes additional characters and stricter compliance requirements. GB18030-2022 is now the official standard in China, and ensuring PostgreSQL’s full compliance would be beneficial for users in Chinese-speaking regions.

This is a non-backwards-compatible change:

https://www.unicode.org/L2/L2022/22274-disruptive-changes.pdf
https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf

There is a risk of breaking applications, although only a few dozen
mappings changed. If it were added as a separate encoding, users could
opt in.

--
John Naylor
Amazon Web Services

#5Andrew Dunstan
andrew@dunslane.net
In reply to: John Naylor (#4)
Re: GB18030-2022 Support in PostgreSQL

On 2025-08-04 Mo 6:35 AM, John Naylor wrote:

On Mon, Aug 4, 2025 at 3:08 PM JiaoShuntian <jiaoshuntian@highgo.com> wrote:

I noticed that PostgreSQL currently supports GB18030 encoding based on the older GB18030-2000 standard (as seen in commits like extend GB18030 conversion). However, China has since updated its mandatory character set standard to GB18030-2022, which includes additional characters and stricter compliance requirements. GB18030-2022 is now the official standard in China, and ensuring PostgreSQL’s full compliance would be beneficial for users in Chinese-speaking regions.

This is a non-backwards-compatible change:

https://www.unicode.org/L2/L2022/22274-disruptive-changes.pdf
https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf

There is a risk of breaking applications, although only a few dozen
mappings changed. If it were added as a separate encoding, users could
opt in.

That makes sense ... naming the new encoding so as to avoid confusion
might be a challenge.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#5)
Re: GB18030-2022 Support in PostgreSQL

Andrew Dunstan <andrew@dunslane.net> writes:

On 2025-08-04 Mo 6:35 AM, John Naylor wrote:

There is a risk of breaking applications, although only a few dozen
mappings changed. If it were added as a separate encoding, users could
opt in.

That makes sense ... naming the new encoding so as to avoid confusion
might be a challenge.

We have precedent for that in SHIFT_JIS_2004. Presumably if we
make this a new encoding, it'd be GB18030_2022.

However, adding a new encoding ID is not without breakage risks
of its own, stemming from some code knowing the new ID and others
not. I recall that we had some actual problems of that ilk when
we added SHIFT_JIS_2004, and some of them were pretty subtle.
See e.g. this comment from src/bin/initdb/Makefile:

# Note: it's important that we link to encnames.o from libpgcommon, not
# from libpq, else we have risks of version skew if we run with a libpq
# shared library from a different PG version. Define
# USE_PRIVATE_ENCODING_FUNCS to ensure that that happens.

That was long enough ago that I have little faith either that that
fix still does what it intended to (the code has been rejiggered
significantly since the issue was last battle-tested), or that
there are not similar hazards elsewhere.

So on the whole I'd lean a bit towards just redefining GB18030 as
meaning the new standard. The fact that we don't support it as a
server-side encoding perhaps makes that idea more tenable than it
would be if the encoding governed the interpretation of our own
stored data.

regards, tom lane

#7Kenneth Marshall
ktm@rice.edu
In reply to: Saladin (#1)
Re: GB18030-2022 Support in PostgreSQL

On Mon, Aug 04, 2025 at 04:08:24PM +0800, JiaoShuntian wrote:

Hi hackers,

I noticed that PostgreSQL currently supports GB18030 encoding based on the older GB18030-2000 standard (as seen in commits like extend GB18030 conversion). However, China has since updated its mandatory character set standard to GB18030-2022, which includes additional characters and stricter compliance requirements. GB18030-2022 is now the official standard in China, and ensuring PostgreSQL’s full compliance would be beneficial for users in Chinese-speaking regions.

I would like to ask:

Are there any plans to upgrade PostgreSQL’s GB18030 support to the 2022 version? Would the community be open to contributions in this area?

Best regards,

JiaoShuntian
HighGo Inc.

Hi,

I believe that it is in ICU already. You should be able to use that as
your locale provider.

Regards,
Ken

#8Chao Li
li.evan.chao@gmail.com
In reply to: Tom Lane (#6)
Re: GB18030-2022 Support in PostgreSQL

On Aug 4, 2025, at 21:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So on the whole I'd lean a bit towards just redefining GB18030 as
meaning the new standard. The fact that we don't support it as a
server-side encoding perhaps makes that idea more tenable than it
would be if the encoding governed the interpretation of our own
stored data.

regards, tom lane

I agree with Tom that we may just redefine GB18030 to comply with the 2022 standard.

As John Naylor pointed out, 2022 is not backward compatible; that is true. However, I went through all the incompatible changes, and they all affect rarely used characters. So I would guess most existing databases won’t be impacted, and the rest that use GB18030 would need a data migration before upgrading to a PG version that switches to GB18030-2022. I think PG may delegate data migration tasks to third-party PG service vendors. They may develop simple or complex migration tools for different use cases.

One use case I am thinking of: say a database uses the default encoding (UTF-8) and the ICU locale provider. ICU has supported GB18030-2022 since version 73.1. If the database worked with a pre-73.1 version of ICU and ICU is then upgraded to 73.1 or later, the database may face the same backward compatibility risk. For example, the GB code 0xA6D9 maps to U+E78D under the old GB18030 and to U+FE10 under GB18030-2022. If a char of 0xA6D9 was given to the database, it would be stored as U+E78D on disk. After upgrading ICU, U+E78D would no longer be considered “0xA6D9” by ICU. So to keep the data’s original meaning, a data migration has to be done to update U+E78D to U+FE10. In this example, the PG version is not changed, but the database still needs a data migration.
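
A minimal sketch of the scenario above, using two hypothetical one-entry lookup tables that hold only the 0xA6D9 example from this message. The table and function names are illustrative; real converters use full mapping tables and may also carry fallback mappings, which this toy model ignores.

```python
# Toy decode tables (GB bytes -> Unicode), holding only the example above.
# The names are illustrative; real converters use full mapping tables.
OLD_DECODE = {b"\xa6\xd9": "\ue78d"}   # behavior before the 2022 revision
NEW_DECODE = {b"\xa6\xd9": "\ufe10"}   # behavior under GB18030-2022


def encode_with(decode_table, ch):
    """Reverse lookup: GB bytes for ch, or None when the table has no entry."""
    for gb_bytes, uni in decode_table.items():
        if uni == ch:
            return gb_bytes
    return None


stored = OLD_DECODE[b"\xa6\xd9"]           # value written under the old mapping
print(encode_with(OLD_DECODE, stored))     # round-trips under the old table
print(encode_with(NEW_DECODE, stored))     # no entry under the new table
migrated = NEW_DECODE[b"\xa6\xd9"]         # the data migration: U+E78D -> U+FE10
print(encode_with(NEW_DECODE, migrated))   # round-trips again
```

In this simplified model, the stored value stops round-tripping once the table changes, and rewriting it to the new code point restores the round trip.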

The other reason I don’t think a new encoding GB18030_2022 is needed is that, since GB18030-2022 is a hard requirement from the government, most likely all commercial databases must comply with it. Thus a lot of current databases using GB18030 must be migrated to GB18030-2022. As PG doesn’t support changing a database’s encoding, if a new encoding is added, an existing db must be migrated to a new db. If we only redefine GB18030, existing databases only need some data migration, which should be easier.

So, I think PG doesn’t need to worry about the backward compatibility problem too much; all PG needs to do is state clearly in the release notes that a data migration might be required. When the new version is released, if some third-party migration tools are known to work well, the release notes may recommend them.

Regards,

Chao Li (Evan)
------------------------------
HighGo Infra. Software Inc.
https://www.highgo.com/

#9John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#8)
Re: GB18030-2022 Support in PostgreSQL

On Tue, Aug 5, 2025 at 1:22 PM Chao Li <li.evan.chao@gmail.com> wrote:

On Aug 4, 2025, at 21:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So on the whole I'd lean a bit towards just redefining GB18030 as
meaning the new standard. The fact that we don't support it as a
server-side encoding perhaps makes that idea more tenable than it
would be if the encoding governed the interpretation of our own
stored data.

I agree with Tom that we may just redefine GB18030 to comply with the 2022 standard.

As John Naylor pointed out, 2022 is not backward compatible; that is true. However, I went through all the incompatible changes, and they all affect rarely used characters.

If that's the case then redefining is probably okay.

One use case I am thinking of: say a database uses the default encoding (UTF-8) and the ICU locale provider. ICU has supported GB18030-2022 since version 73.1.

ICU locales can only be used with server-side encodings.

When the new version is released, if some third-party migration tools are known to work well, the release notes may recommend them.

I highly doubt such a large hammer will be necessary. Whatever advice
we give for discovery and conversion of affected text is our
responsibility and can be in the form of example queries.
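
As a rough illustration of what "discovery and conversion of affected text" could look like outside SQL, here is a minimal Python sketch. The `REMAP` table pairs each old private-use code point with the code point that the same GB bytes map to in the 2022 data, as read from the .ucm diff posted in this thread; the function names are hypothetical, and whether rewriting stored text this way is appropriate for a given application is an assumption, not something the thread settles.

```python
# Old PUA code point -> GB18030-2022 code point for the same GB bytes,
# paired up from the .ucm diff posted in this thread (18 entries).
REMAP = {
    0xE78D: 0xFE10, 0xE78E: 0xFE12, 0xE78F: 0xFE11, 0xE790: 0xFE13,
    0xE791: 0xFE14, 0xE792: 0xFE15, 0xE793: 0xFE16, 0xE794: 0xFE17,
    0xE795: 0xFE18, 0xE796: 0xFE19,
    0xE81E: 0x9FB4, 0xE826: 0x9FB5, 0xE82B: 0x9FB6, 0xE82C: 0x9FB7,
    0xE832: 0x9FB8, 0xE843: 0x9FB9, 0xE854: 0x9FBA, 0xE864: 0x9FBB,
}


def find_affected(text):
    """Return (index, code point) for every character whose mapping changed."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if ord(ch) in REMAP]


def migrate(text):
    """Rewrite old PUA code points to their GB18030-2022 counterparts."""
    return text.translate(REMAP)


sample = "ab\ue78dcd\ue81e"
print(find_affected(sample))
print(migrate(sample) == "ab\ufe10cd\u9fb4")
```

Text without any of the 18 code points passes through `migrate` unchanged, so the discovery step can double as a check for whether a migration is needed at all.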

--
John Naylor
Amazon Web Services

#10Peter Eisentraut
peter_e@gmx.net
In reply to: Chao Li (#8)
Re: GB18030-2022 Support in PostgreSQL

On 05.08.25 08:22, Chao Li wrote:

I agree with Tom that we may just redefine GB18030 to comply with the
2022 standard.

As John Naylor pointed out, 2022 is not backward compatible; that is true.
However, I went through all the incompatible changes, and they all affect
rarely used characters. So I would guess most existing databases
won’t be impacted, and the rest that use GB18030 would need a data
migration before upgrading to a PG version that switches to
GB18030-2022. I think PG may delegate data migration tasks to
third-party PG service vendors. They may develop simple or complex
migration tools for different use cases.

Note that you can also create custom conversions using CREATE
CONVERSION, so that would be something for those who would need the old
behavior.

#11Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#9)
Re: GB18030-2022 Support in PostgreSQL

I did more research on the changes from 2000 to 2022; here is a
summary:

* 66 new characters have been added in 2022. All of them are 4-byte
characters. As the map files store only 2-byte GB code mappings and 4-byte
GB code mappings are calculated, these chars can be properly
encoded/decoded without this patch; I tested that.
* 9 characters are no longer required by 2022, but applications may decide
whether to retain them. As the ucm file (
https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/gb18030-2022.ucm)
retains them, we also retain them.
* Unicode mappings for 18 characters have changed. Only these changes will
cause backward compatibility issues. However, half of them are rarely
used punctuation marks and the rest are glyphs that I cannot recognize as
a native Chinese speaker. So these changes should not significantly impact
most existing databases.
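
The "calculated" 4-byte mappings in the first bullet rest on the fact that 4-byte GB18030 codes form a linear sequence: the second and fourth bytes each take 10 values (0x30 to 0x39) and the first and third bytes each take 126 values (0x81 to 0xFE). The following is a minimal sketch of that arithmetic, not PostgreSQL's actual code, and the function names are illustrative. It covers only the supplementary planes, which form one contiguous range starting at 0x90308130 for U+10000; 4-byte codes for BMP characters are split into several ranges with their own offsets, which are not reproduced here.

```python
def gb18030_linear(code):
    """Linear index of a 4-byte GB18030 code given as a 4-byte sequence."""
    b1, b2, b3, b4 = code
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a well-formed 4-byte GB18030 code")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def supplementary_code_point(code):
    """Map a 4-byte code at or above 0x90308130 to U+10000 and up."""
    base = gb18030_linear(bytes.fromhex("90308130"))  # corresponds to U+10000
    cp = 0x10000 + gb18030_linear(code) - base
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("outside the supplementary-plane range")
    return cp


print(hex(supplementary_code_point(bytes.fromhex("90308130"))))
print(hex(supplementary_code_point(bytes.fromhex("E3329A35"))))
```

Because the index is linear, consecutive code points inside one range differ by exactly one index step, which is why no per-character table entry is needed for them.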

I added a test case with a character whose mapping changed, and the test passes:

% make check
...
# All 229 tests passed.

For more details on the standard change, see
https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132

I am attaching the patch file.

Chao Li (Evan)
---------------------
Highgo Software Co., Ltd.
https://www.highgo.com/

Attachments:

v1-0001-Upgrade-GB18030-encoding-support-from-2000-to-202.patch (application/octet-stream, +32696 -32514)

#12Chao Li
li.evan.chao@gmail.com
In reply to: Chao Li (#11)
Re: GB18030-2022 Support in PostgreSQL

I have created a patch entry at https://commitfest.postgresql.org/patch/5954/.
CommitFest requested a rebase, so I rebased the code and created the v2
patch.

BTW, I have tested all 66 new characters, 9 not-required characters and
18 changed characters like this:

evantest=# SELECT encode(convert_from(decode('82359632', 'hex'),
'GB18030')::bytea, 'hex');
 encode
--------
 e9bfab
(1 row)

All encoded correctly.
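
For readers who want to interpret the query output, the hex string is the UTF-8 encoding of the converted character. A minimal Python sketch that turns it back into a code point (the variable names are illustrative):

```python
utf8_hex = "e9bfab"                           # output of the query above
ch = bytes.fromhex(utf8_hex).decode("utf-8")  # decode the three UTF-8 bytes
print(f"U+{ord(ch):04X}")                     # prints U+9FEB
```

So the 4-byte GB18030 code 0x82359632 was converted to U+9FEB in that test.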

Chao Li (Evan)

---------------------
HighGo Software Co., Ltd.
https://www.highgo.com/

Attachments:

v2-0001-Upgrade-GB18030-encoding-support-from-2000-to-202.patch (text/plain; charset=UTF-8, +32696 -32514)

#13John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#12)
Re: GB18030-2022 Support in PostgreSQL

On Mon, Aug 11, 2025 at 9:01 AM Chao Li <li.evan.chao@gmail.com> wrote:

I have created a patch https://commitfest.postgresql.org/patch/5954/. CommitFests requested a rebase, so I rebased the code and created the v2 patch.

BTW, I have tested all 66 new characters, 9 not-required characters and 18 changed characters in a way as:

"9 characters are no longer required by the new standard, but are
retained in this patch for compatibility"

How is that done?

I added a test case with a mapping changed char, and the test passes:

% make check
...
# All 229 tests passed.

For more details on the standard change, see https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132

I am attaching the patch file.

Going from the old .xml file to the .ucm file makes it difficult to
see the relevant changes. Also, there are nearly 1000 non-user-visible
changes like this in the output file that are not explained:

-  /*** Three byte table, leaf: efa8xx - offset 0x07aba ***/
+  /*** Three byte table, leaf: efa8xx - offset 0x07b3a ***/

The 2000 version is available in the .ucm format, so maybe converting
to that first would be a good preparatory patch:

https://github.com/unicode-org/icu-data/blob/main/charset/data/ucm/gb-18030-2000.ucm

Looking at the history, it looks like that file has seen small
revisions, so it may take some research to get the exact equivalent to
the XML file we use. That will also tell us if anything will change
for us besides the actual 2022 revision.

--
John Naylor
Amazon Web Services

#14Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#13)
Re: GB18030-2022 Support in PostgreSQL

Hi John,

Thanks for your review.

Yes, I did a diff between 2000.ucm and 2022.ucm when I worked on the patch. The diff between 2000.ucm and 2022.ucm is quite small:

```diff - omit the comment part
> <U20AC> \x80 |3
> <U3000> \xA3\xA0 |3
> <UE5E5> \xA3\xA0 |4
28067a28099,28114
> <U9FB4> \xFE\x59 |0
> <U9FB4> \x82\x35\x90\x37 |3
> <U9FB5> \xFE\x61 |0
> <U9FB5> \x82\x35\x90\x38 |3
> <U9FB6> \xFE\x66 |0
> <U9FB6> \x82\x35\x90\x39 |3
> <U9FB7> \xFE\x67 |0
> <U9FB7> \x82\x35\x91\x30 |3
> <U9FB8> \xFE\x6D |0
> <U9FB8> \x82\x35\x91\x31 |3
> <U9FB9> \xFE\x7E |0
> <U9FB9> \x82\x35\x91\x32 |3
> <U9FBA> \xFE\x90 |0
> <U9FBA> \x82\x35\x91\x33 |3
> <U9FBB> \xFE\xA0 |0
> <U9FBB> \x82\x35\x91\x34 |3
29577c29624
< <UE5E5> \xA3\xA0 |0
---
> # <UE5E5> \xA3\xA0 |0
30001,30010c30048,30057
< <UE78D> \xA6\xD9 |0
< <UE78E> \xA6\xDA |0
< <UE78F> \xA6\xDB |0
< <UE790> \xA6\xDC |0
< <UE791> \xA6\xDD |0
< <UE792> \xA6\xDE |0
< <UE793> \xA6\xDF |0
< <UE794> \xA6\xEC |0
< <UE795> \xA6\xED |0
< <UE796> \xA6\xF3 |0
---
> <UE78D> \xA6\xD9 |1
> <UE78E> \xA6\xDA |1
> <UE78F> \xA6\xDB |1
> <UE790> \xA6\xDC |1
> <UE791> \xA6\xDD |1
> <UE792> \xA6\xDE |1
> <UE793> \xA6\xDF |1
> <UE794> \xA6\xEC |1
> <UE795> \xA6\xED |1
> <UE796> \xA6\xF3 |1
30146c30193
< <UE81E> \xFE\x59 |0
---
> <UE81E> \xFE\x59 |1
30154c30201
< <UE826> \xFE\x61 |0
---
> <UE826> \xFE\x61 |1
30159,30160c30206,30207
< <UE82B> \xFE\x66 |0
< <UE82C> \xFE\x67 |0
---
> <UE82B> \xFE\x66 |1
> <UE82C> \xFE\x67 |1
30166c30213
< <UE832> \xFE\x6D |0
---
> <UE832> \xFE\x6D |1
30183c30230
< <UE843> \xFE\x7E |0
---
> <UE843> \xFE\x7E |1
30200c30247
< <UE854> \xFE\x90 |0
---
> <UE854> \xFE\x90 |1
30216c30263
< <UE864> \xFE\xA0 |0
---
> <UE864> \xFE\xA0 |1
30470a30518,30537
> <UFE10> \xA6\xD9 |0
> <UFE10> \x84\x31\x82\x36 |3
> <UFE11> \xA6\xDB |0
> <UFE11> \x84\x31\x82\x37 |3
> <UFE12> \xA6\xDA |0
> <UFE12> \x84\x31\x82\x38 |3
> <UFE13> \xA6\xDC |0
> <UFE13> \x84\x31\x82\x39 |3
> <UFE14> \xA6\xDD |0
> <UFE14> \x84\x31\x83\x30 |3
> <UFE15> \xA6\xDE |0
> <UFE15> \x84\x31\x83\x31 |3
> <UFE16> \xA6\xDF |0
> <UFE16> \x84\x31\x83\x32 |3
> <UFE17> \xA6\xEC |0
> <UFE17> \x84\x31\x83\x33 |3
> <UFE18> \xA6\xED |0
> <UFE18> \x84\x31\x83\x34 |3
> <UFE19> \xA6\xF3 |0
> <UFE19> \x84\x31\x83\x35 |3
```

As you can see, the changes only concern the 18 changed characters plus 3 other Unicode code points (U+20AC, U+3000, U+E5E5). My code comment in UCS_to_GB18030.pl explains these changes:

```code comment from UCS_to_GB18030.pl
# The |n is a flag, where n has values of 0, 1, 3, 4.
# With reference to https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132,
# the flag should mean the following:
# 0 - round-trip mapping
# 1 - there are 18 mappings with flag 1, those are mapping changes
#     from GB18030-2000 to GB18030-2022. Old mappings are marked
#     with flag 1, new mappings with flag 0. So we can ignore all
#     mappings with flag 1.
# 3 - there are 20 mappings with flag 3:
#     18 of them correspond to the 18 mappings with flag 1, and give
#     the old mapping's Unicode code point its new mapping in
#     GB18030-2022. These 18 new mappings have no actual glyphs in
#     GB18030-2022, so we can ignore these 18 mappings with flag 3.
#     The other 2 are: "<U20AC> \x80 |3" and "<U3000> \xA3\xA0 |3".
#     They are two reserved fallbacks for compatibility with GBK and
#     other web data as in WHATWG. Both U20AC and U3000 have round-
#     trip mappings in GB18030-2022, so we can ignore these two
#     mappings with flag 3.
#     So, we can ignore all mappings with flag 3.
# 4 - there is only one mapping with flag 4: <UE5E5> \xA3\xA0 |4.
#     This is a "good one-way" mapping from U+E5E5 to \xA3\xA0
#     for maximum compatibility with previous behavior. So we can
#     ignore this mapping as well.
```
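
To make the flag handling concrete, here is a minimal Python sketch of reading such mapping lines and keeping only the flag-0 (round-trip) entries. It is not the patch's Perl code from UCS_to_GB18030.pl; the regular expression and function names are assumptions made for illustration, and the sample lines are taken from the diff above.

```python
import re

# One mapping line looks like: <U9FB4> \xFE\x59 |0
UCM_LINE = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)\s*$"
)


def parse_ucm_line(line):
    """Return (code_point, gb_bytes, flag), or None for non-mapping lines."""
    m = UCM_LINE.match(line.strip())
    if m is None:
        return None
    code_point = int(m.group(1), 16)
    gb_bytes = bytes.fromhex(m.group(2).replace("\\x", ""))
    return code_point, gb_bytes, int(m.group(3))


def round_trip_mappings(lines):
    """Keep only flag-0 entries; flags 1, 3 and 4 are skipped."""
    parsed = (parse_ucm_line(line) for line in lines)
    return {gb: cp for cp, gb, flag in filter(None, parsed) if flag == 0}


sample = [
    r"<UE78D> \xA6\xD9 |1",
    r"<UFE10> \xA6\xD9 |0",
    r"<UFE10> \x84\x31\x82\x36 |3",
    r"<UE5E5> \xA3\xA0 |4",
    r"# <UE5E5> \xA3\xA0 |0",
]
print(round_trip_mappings(sample))
```

With these five lines, only the `<UFE10> \xA6\xD9 |0` entry survives: the commented-out line does not parse as a mapping, and the other three carry flags that the comment says can be ignored.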

For your question:

"9 characters are no longer required by the new standard, but are
retained in this patch for compatibility"

How is that done?

The 9 mappings are not changed between 2000.ucm and 2022.ucm. For example, GB18030 code 0xFD9C is one of the 9 not-required codes, but the mapping:

<UF92C> \xFD\x9C |0

still appears in 2022.ucm, so this character is retained.

Chao Li (Evan)
--------------------
HighGo Software Co., Ltd.
https://www.highgo.com/

#15John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#14)
Re: GB18030-2022 Support in PostgreSQL

On Mon, Aug 11, 2025 at 3:22 PM Chao Li <li.evan.chao@gmail.com> wrote:

Hi,

For future reference, please don't quote my entire message below yours
-- it clutters the archives and also removes context.

Yes, I did a diff between 2000.ucm and 2022.ucm when I worked on the patch. The diff between 2000.ucm and 2022.ucm is quite small:

That would match my expectation. In case it wasn't clear before, my
preference is to split this patch into two patches: First convert to
.ucm, then update to 2022 revision. Then the small diff will be
obvious to everyone who looks at the second commit.

For your question:

"9 characters are no longer required by the new standard, but are
retained in this patch for compatibility"

How is that done?

The 9 mappings are not changed between 2000.ucm and 2022.ucm. For example, GB18030 code 0xFD9C is one of the 9 not-required codes, but the mapping:

<UF92C> \xFD\x9C |0

still appears in 2022.ucm, so this character is retained.

Thanks for clarifying -- by saying "retained in the patch", the commit
message implied to me that the patch added something not in the
upstream file.

--
John Naylor
Amazon Web Services

#16Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#15)
Re: GB18030-2022 Support in PostgreSQL

That would match my expectation. In case it wasn't clear before, my
preference is to split this patch into two patches: First convert to
.ucm, then update to 2022 revision. Then the small diff will be
obvious to everyone who looks at the second commit.

Sure, I can split the patch into two. The first patch only converts the .xml file to a .ucm file and updates the Perl script. As a result, the map files should not change.

Then the second patch will update the ucm file, so the second patch should be small, containing only ucm changes and map file changes.

One thing to confirm with you: to achieve that, we will rename the ucm file to gb18030.ucm, without the “-2000” or “-2022” suffix; otherwise git won’t be able to show the diff. Is that what you meant?

Thanks for clarifying -- by saying "retained in the patch", the commit
message implied to me that the patch added something not in the
upstream file.

I will update the commit message in the new patch.

Chao Li (Evan)
--------------------
HighGo Software Co., Ltd.
https://www.highgo.com/

#17John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#16)
Re: GB18030-2022 Support in PostgreSQL

On Mon, Aug 11, 2025 at 4:25 PM Chao Li <li.evan.chao@gmail.com> wrote:

Sure, I can split the patch into two. The first patch only converts the .xml file to a .ucm file and updates the Perl script. As a result, the map files should not change.

Then the second patch will update the ucm file, so the second patch should be small, containing only ucm changes and map file changes.

One thing to confirm with you: to achieve that, we will rename the ucm file to gb18030.ucm, without the “-2000” or “-2022” suffix; otherwise git won’t be able to show the diff. Is that what you meant?

Usually git is pretty smart about renames combined with small changes,
so I would try keeping the original names and see what it does.

--
John Naylor
Amazon Web Services

#18John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#17)
Re: GB18030-2022 Support in PostgreSQL

On Tue, Aug 12, 2025 at 9:09 AM Chao Li <li.evan.chao@gmail.com> wrote:

[bringing this back to the original thread]

So, I compared the 2000 ucm with the 2005 ucm, and also the 2005 ucm with the 2022 ucm. I found that some changes made in 2005 are reverted in 2022, which is why the diff between 2000 and 2022 is small. For example, the following mappings

Yes, this was mentioned in the "disruptive changes" document linked in
my first email in this thread:

"The 2005 edition included 6 characters with double mappings. The 2022
edition removes the double mappings.
The 2005 edition included 9 characters from the CJK Compatibility
Ideographs block. In Unicode/10646, these all have canonical
decomposition mappings to characters in the URO. In the 2022 edition,
these nine compatibility characters are removed."

So, for how to create patch 2, I think we have 3 options:

1. As planned, update to the latest version of the 2000 ucm, then skip 2005 and directly upgrade to 2022 in patch 3. This way, we just honor the 2000 ucm even though the change was actually introduced by 2005.

2. Skip the latest version of the 2000 ucm and upgrade to the 2005 ucm. This clearly shows the upgrade path 2000->2005->2022. The downside is that 2005 introduced some changes that are reverted in 2022, which causes some unnecessary changes in the map files.

3. Skip patch 2 and go directly to patch 3, so that patch 3 includes the changes introduced by both 2005 and 2022. This makes the minimum changes to the map files.

#3 is what I had in mind to begin with unless we found some reason not
to. Minimizing churn is a lucky side effect that reinforces that
choice.

Before getting to that, I thought I'd bring this up to the community:

+# Copyright (C) 2000-2009, International Business Machines Corporation and others.
+# All Rights Reserved.

The previous XML file didn't contain a copyright notice -- does anyone
want to make a case for not checking unicode-org's source file into
our tree because of this? The 2022 update changes it to

# Copyright (C) 2016 and later: Unicode, Inc. and others.
# License & terms of use: http://www.unicode.org/copyright.html
# Copyright (C) 2000-2012, International Business Machines Corporation and others.
# All Rights Reserved.

...and the above links to https://www.unicode.org/license.txt

--
John Naylor
Amazon Web Services

#19Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#18)
Re: GB18030-2022 Support in PostgreSQL

3. Skip patch 2 and go directly to patch 3, so that patch 3 includes the changes introduced by both 2005 and 2022. This makes the minimum changes to the map files.

#3 is what I had in mind to begin with unless we found some reason not
to. Minimizing churn is a lucky side effect that reinforces that
choice.

Cool, then I will take option 3.

Before getting to that, I thought I'd bring this up to the community:

The previous XML file didn't contain a copyright notice -- does anyone
want to make a case for not checking unicode-org's source file into
our tree because of this? The 2022 update changes it to

Thanks for pointing out the Unicode license issue; I really hadn’t noticed that.

I did some quick research. As we generate the map files from the ucm files, and the map files are built into the final executable binaries, we are redistributing Unicode-derived data, so we should still include the Unicode license. Thus, not checking in the ucm file won’t avoid the license problem.

We can just add a license file, say named unicode_license.txt, with the proper content in the same folder as the ucm file. I guess that would address the license problem.

The following is ChatGPT-generated content for the license file:

```
Portions of this product include data from the Unicode Character Database
and other Unicode® data files.

Copyright © 1991–2025 Unicode, Inc.
All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy
of the Unicode data files and any associated documentation (the "Data Files")
or Unicode software and any associated documentation (the "Software") to deal
in the Data Files or Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, and/or sell copies
of the Data Files or Software, and to permit persons to whom the Data Files or
Software are furnished to do so, provided that either:

(a) this copyright and permission notice appear with all copies of the Data
Files or Software, or

(b) this copyright and permission notice appear in associated documentation.

THE DATA FILES AND SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD
PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN
THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL
DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING
OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THE DATA FILES OR
SOFTWARE.

Unicode and the Unicode logo are trademarks of Unicode, Inc. in the United
States and other countries. All third party trademarks referenced herein are
the property of their respective owners.
```

Regards,

Chao Li (Evan)
--------------------
HighGo Software Co., Ltd.
https://www.highgo.com/

#20Peter Eisentraut
peter_e@gmx.net
In reply to: John Naylor (#18)
Re: GB18030-2022 Support in PostgreSQL

On 12.08.25 06:57, John Naylor wrote:

Before getting to that, I thought I'd bring this up to the community:

+# Copyright (C) 2000-2009, International Business Machines Corporation and others.
+# All Rights Reserved.

The previous XML file didn't contain a copyright notice -- does anyone
want to make a case for not checking unicode-org's source file into
our tree because of this? The 2022 update changes it to

# Copyright (C) 2016 and later: Unicode, Inc. and others.
# License & terms of use: http://www.unicode.org/copyright.html
# Copyright (C) 2000-2012, International Business Machines Corporation and others.
# All Rights Reserved.

...and the above links to https://www.unicode.org/license.txt

Could we download this file on demand, like we do for the other input
files for the conversion mappings?
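
As a rough illustration of what "download on demand" means here, the following Python sketch fetches an input file only when it is not already present locally. It is not the build system's actual rule; the helper name `ensure_input_file`, the placeholder URL, and the stand-in fetcher are all assumptions made for this example.

```python
import os
import tempfile
import urllib.request

# Placeholder only; the real location of the input file is not given here.
PLACEHOLDER_URL = "https://example.org/path/to/input.ucm"


def ensure_input_file(path, url, fetch=urllib.request.urlretrieve):
    """Fetch url into path unless path already exists.

    Returns True if a download was performed, False if the file was
    already there. The fetch argument can be swapped out for testing.
    """
    if os.path.exists(path):
        return False
    fetch(url, path)
    return True


def fake_fetch(url, dest):
    """Stand-in for a network download so the demo runs offline."""
    with open(dest, "w") as f:
        f.write("# placeholder contents\n")


def demo():
    """Call the helper twice; only the first call should fetch."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "input.ucm")
        first = ensure_input_file(target, PLACEHOLDER_URL, fetch=fake_fetch)
        second = ensure_input_file(target, PLACEHOLDER_URL, fetch=fake_fetch)
        return first, second


if __name__ == "__main__":
    print(demo())
```

With this approach the input file would not need to be checked into the tree, although the generated map files would still be derived from it.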

#21John Naylor
john.naylor@enterprisedb.com
In reply to: Peter Eisentraut (#20)
#22Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#21)
#23Chao Li
li.evan.chao@gmail.com
In reply to: Chao Li (#22)
#24John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#23)
#25Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#24)
#26John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#25)
#27Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#26)
#28Chao Li
li.evan.chao@gmail.com
In reply to: Chao Li (#27)
#29John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#27)
#30Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#29)
#31John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#30)
#32Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#31)
#33Chao Li
li.evan.chao@gmail.com
In reply to: Chao Li (#32)
#34Chao Li
li.evan.chao@gmail.com
In reply to: Chao Li (#33)
#35John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#33)
#36John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#34)
#37Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#36)
#38John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#37)
#39Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#38)
#40John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#39)
#41Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#40)
#42John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#41)
#43Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#42)
#44Chao Li
li.evan.chao@gmail.com
In reply to: Chao Li (#43)
#45Chao Li
li.evan.chao@gmail.com
In reply to: Chao Li (#44)
#46John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#44)
#47Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#46)
#48John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#44)
#49Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#48)
#50John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#49)
#51Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#50)
#52John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#51)
#53Chao Li
li.evan.chao@gmail.com
In reply to: John Naylor (#52)
#54John Naylor
john.naylor@enterprisedb.com
In reply to: Chao Li (#53)