Retiring some encodings?
Hi all,
$subject is something that has been on my mind for a few weeks now,
following the recent events with CVE-2025-4207 (627acc3caa74) and
CVE-2025-1094 (5dc1e42b4fa6).
All the encodings supported are documented here:
https://www.postgresql.org/docs/devel/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED
One pain point in the code is with encoding GB18030, which has the
particularity to require a look at the first two bytes of an input to
know what's the actual length of a multi-byte character sequence.
This is not documented, and it can be a trap in disguise,
particularly with the frontend code (see jsonapi.c).
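To illustrate the point about the first two bytes, here is a minimal Python sketch of the length rule. It is not the PostgreSQL implementation: the function name gb18030_char_len is hypothetical, and it does not validate trail bytes. It only shows that a lead byte alone cannot say whether a character is two or four bytes long.

```python
def gb18030_char_len(buf: bytes, pos: int = 0) -> int:
    """Return the byte length of the GB18030 character starting at buf[pos].

    Illustrative only: no validation of trail bytes is performed.
    Raises ValueError when the second byte is needed but not available.
    """
    first = buf[pos]
    if first < 0x80:
        return 1  # ASCII: the first byte alone is enough
    if pos + 1 >= len(buf):
        # The trap: a lone lead byte does not tell us whether this is a
        # two-byte or a four-byte character.
        raise ValueError("need the second byte to determine the length")
    second = buf[pos + 1]
    if 0x30 <= second <= 0x39:
        return 4  # a digit-range second byte marks a four-byte sequence
    return 2


print(gb18030_char_len("a".encode("gb18030")))
print(gb18030_char_len("中".encode("gb18030")))
print(gb18030_char_len("😀".encode("gb18030")))
```

A caller that reads input in chunks would have to buffer a lone lead byte until the next byte arrives, which is the kind of detail that is easy to miss in frontend code.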
With all that in mind, I have wanted to kick a discussion about
potentially removing one or more encodings from the core code,
including the backend part, the frontend part and the conversion
routines, coupled with checks in pg_upgrade to complain when databases
or collations include said encoding (the collation part needs
to be checked when not using ICU). Just being able to remove
GB18030 would do us a favor in the long-term, at least, but there's
more.
I have discussed the matter internally, with a few things pointed out:
- One thing that I was considering first would be the possibility to
add support for pluggable encodings in the backend code, giving an
option for retired encodings to be reloaded back to the server, with a
concept close to what we do for WAL RMGRs with IDs stuck in time once
defined, catalogs using pg_enc. Encouraging users to have their own
encodings, particularly ones that we'd consider to be unsafe by design
like the GB one may not be a good idea. But there is always the
argument that users may not want to pay the cost of a set of ALTER
DATABASE commands. Nobody really liked this idea of putting the
encoding responsibility into an extension :D
- Another idea, which Jeff Davis mentioned, is around the Unicode code point
U+FFFD (didn't know about this one), which can be used to replace an
incoming character whose value is unknown. One strategy would then be
to map encodings whose internals are dropped to use UTF-8 under the hood,
with this character as an exit path when finding characters that cannot
be understood, meaning partial and silent data loss.
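The data-loss aspect of the U+FFFD idea can be shown with a small Python sketch. It uses Python's built-in codecs with errors="replace", purely to illustrate the replacement behavior and not how the server would do it; the helper name decode_lossy is hypothetical.

```python
REPLACEMENT_CHARACTER = "\ufffd"


def decode_lossy(raw: bytes, encoding: str) -> str:
    """Decode raw bytes, substituting U+FFFD for anything undecodable."""
    return raw.decode(encoding, errors="replace")


# Two different inputs that the ASCII codec cannot fully understand...
a = decode_lossy(b"caf\xe9", "ascii")
b = decode_lossy(b"caf\xe8", "ascii")

# ...collapse to the same string, so the original bytes are not recoverable.
print(a == b)
print(REPLACEMENT_CHARACTER in a)
```

Once two distinct byte sequences map to the same stored value, nothing downstream can tell them apart, which is what "silent" means here.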
Another set of things (also mentioned by Jeff as he's been diving into
this area a lot for the last few years with Jeremy Schneider), that
could also help $subject in the long-run, would be to try removing
some code used for non-UTF8 cases. Some examples:
- downcase_identifier() and pgstrcasecmp.c mention the specific case
of Turkish with 'i' and 'I'.
- Simplify regc_pg_locale.c which is unable to support non-UTF8
encodings with characters of more than 2 bytes.
- pg_wchar's uint type could be removed, switched to a codepoint value
(?) (pointed out by Jeff).
- Varlena cases with non-UTF8, like text_position_setup().
In theory, what we could aim for here is to move forward with non-UTF8
encodings in the server, potentially moving away from libc. That's a
larger project, so it may be better to try something with some of the
low-hanging fruits like the non-UTF8 cases.
This last paragraph does not really change my opinion about GB18030: I'd like
to propose its removal for v19 because looking at the first two bytes
of a character sequence to know how long the full sequence is stands
as an exception compared to all the encodings supported by Postgres.
Anyway, in the end, all of this is about removing code. A large majority
of users use UTF-8 and we could improve things, so feel free to use this
thread if you have different ideas or any comments.
Thanks,
--
Michael
The obvious question is how many people would suffer because
of that removal, as it would prevent them from using pg_upgrade.
Can anybody who works in a region that uses these encodings make
an educated guess?
Yours,
Laurenz Albe
On 22/05/2025 08:54, Michael Paquier wrote:
With all that in mind, I have wanted to kick a discussion about
potentially removing one or more encodings from the core code,
including the backend part, the frontend part and the conversion
routines, coupled with checks in pg_upgrade to complain when databases
or collations include said encoding (the collation part needs
to be checked when not using ICU). Just being able to remove
GB18030 would do us a favor in the long-term, at least, but there's
more.
+1 at high level for deprecating and removing conversions that are not
widely used anymore. As the first step, we can at least add a warning to
the documentation, that they will be removed in the future.
--
Heikki Linnakangas
Neon (https://neon.tech)
On Thu, May 22, 2025 at 13:44, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:
On 22/05/2025 08:54, Michael Paquier wrote:
With all that in mind, I have wanted to kick a discussion about
potentially removing one or more encodings from the core code,
including the backend part, the frontend part and the conversion
routines, coupled with checks in pg_upgrade to complain when databases
or collations include said encoding (the collation part needs
to be checked when not using ICU). Just being able to remove
GB18030 would do us a favor in the long-term, at least, but there's
more.
+1 at high level for deprecating and removing conversions that are not
widely used anymore. As the first step, we can at least add a warning to
the documentation, that they will be removed in the future.
+1
Pavel
On Thu, May 22, 2025 at 02:44:39PM +0300, Heikki Linnakangas wrote:
On 22/05/2025 08:54, Michael Paquier wrote:
With all that in mind, I have wanted to kick a discussion about
potentially removing one or more encodings from the core code,
including the backend part, the frontend part and the conversion
routines, coupled with checks in pg_upgrade to complain when databases
or collations include said encoding (the collation part needs
to be checked when not using ICU). Just being able to remove
GB18030 would do us a favor in the long-term, at least, but there's
more.
+1 at high level for deprecating and removing conversions that are not
widely used anymore. As the first step, we can at least add a warning to the
documentation, that they will be removed in the future.
Agreed on notification. A radical idea would be to add a warning for
the use of such encodings in PG 18, and then mention their deprecation
in the PG 18 release notes so everyone is informed they will be removed
in PG 19.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Do not let urgent matters crowd out time for investment in the future.
On Thu, May 22, 2025 at 10:02:16AM -0400, Bruce Momjian wrote:
Agreed on notification. A radical idea would be to add a warning for
the use of such encodings in PG 18, and then mention their deprecation
in the PG 18 release notes so everyone is informed they will be removed
in PG 19.
With v18beta1 already out in the wild, I think that we are too late
for taking any action on this version at this stage. Putting a
deprecation notice for a selected set of conversions and/or encodings
and doing the actual removal work when v20 opens up around July 2026
would sound like a better timing here, if the overall consensus goes
in this direction, of course.
--
Michael
On 23/05/2025 05:11, Michael Paquier wrote:
On Thu, May 22, 2025 at 10:02:16AM -0400, Bruce Momjian wrote:
Agreed on notification. A radical idea would be to add a warning for
the use of such encodings in PG 18, and then mention their deprecation
in the PG 18 release notes so everyone is informed they will be removed
in PG 19.
With v18beta1 already out in the wild, I think that we are too late
for taking any action on this version at this stage. Putting a
deprecation notice for a selected set of conversions and/or encodings
and doing the actual removal work when v20 opens up around July 2026
would sound like a better timing here, if the overall consensus goes
in this direction, of course.
If we plan to remove something in the future, I think putting a
deprecation notice in the docs in v18 is still a good idea. There's no
point in hiding the plan by not documenting it sooner. The more advance
notice people get the better.
--
Heikki Linnakangas
Neon (https://neon.tech)
On 23 May 2025, at 09:18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
If we plan to remove something in the future, I think putting a deprecation notice in the docs in v18 is still a good idea. There's no point in hiding the plan by not documenting it sooner. The more advance notice people get the better.
+1
--
Daniel Gustafsson
HI
The obvious question is how many people would suffer because
of that removal, as it would prevent them from using pg_upgrade.
Can anybody who works in a region that uses these encodings make
an educated guess?
+1, agreed. GB18030 is a coding standard in China; removing it would
affect the adoption of PostgreSQL in China, where PostgreSQL is becoming
increasingly popular, so this needs to be considered carefully!
On Fri, May 23, 2025 at 4:22 PM Daniel Gustafsson <daniel@yesql.se> wrote:
On 23 May 2025, at 09:18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
If we plan to remove something in the future, I think putting a
deprecation notice in the docs in v18 is still a good idea. There's no
point in hiding the plan by not documenting it sooner. The more advance
notice people get the better.
+1
--
Daniel Gustafsson
On 23 May 2025, at 11:08, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
HI
The obvious question is how many people would suffer because
of that removal, as it would prevent them from using pg_upgrade.
Can anybody who works in a region that uses these encodings make
an educated guess?
+1, agreed. GB18030 is a coding standard in China; removing it would affect the adoption of PostgreSQL in China, where PostgreSQL is becoming increasingly popular, so this needs to be considered carefully!
Thanks for the input, that's exactly what we need to make informed decisions.
How prevalent is GB18030 usage, is it used in all postgres installations in
China, most of them or in some particular cases?
--
Daniel Gustafsson
On 23 May 2025, at 11:08, wenhui qiu <qiuwenhuifx@gmail.com> wrote:
HI
The obvious question is how many people would suffer because
of that removal, as it would prevent them from using pg_upgrade.
Can anybody who works in a region that uses these encodings make
an educated guess?
+1, agreed. GB18030 is a coding standard in China; removing it would affect the adoption of PostgreSQL in China, where PostgreSQL is becoming increasingly popular, so this needs to be considered carefully!
Thanks for the input, that's exactly what we need to make informed decisions.
How prevalent is GB18030 usage, is it used in all postgres installations in
China, most of them or in some particular cases?
Another point is whether other DBMSs support GB18030 or not. If they
support it but PostgreSQL no longer does in the future, that could be a
reason to move away from PostgreSQL.
As far as I know, MySQL, Oracle and SQL Server support GB18030.
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On Fri, May 23, 2025 at 07:58:46PM +0900, Tatsuo Ishii wrote:
Another point is whether other DBMSs support GB18030 or not. If they
support it but PostgreSQL no longer does in the future, that could be a
reason to move away from PostgreSQL.
Yeah, that's a good point. I would also question what's the benefit
in using GB18030 over UTF-8, though. An obvious one I can see is
because legacy applications never get updated.
On my side, I'll try to grab some actual numbers or at least a trend
of them.
--
Michael
Yeah, that's a good point. I would also question what's the benefit
in using GB18030 over UTF-8, though. An obvious one I can see is
because legacy applications never get updated.
Plus users have too many GB18030 encoded files, I guess.
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
Hi Michael
Yeah, that's a good point. I would also question what's the benefit
in using GB18030 over UTF-8, though. An obvious one I can see is
because legacy applications never get updated.
The GB18030 encoding standard is a mandatory Chinese character encoding
standard required by regulations. Software sold and used in China must
support GB18030, with its latest version being the 2023 edition. The
primary advantage of GB18030 is that most Chinese characters require
only 2 bytes for storage, whereas UTF-8 necessitates 3 bytes for the
same characters. This makes GB18030 significantly more storage-efficient
compared to UTF-8 in terms of space utilization.
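As a rough illustration of the storage claim, the following Python sketch compares encoded sizes using Python's built-in gb18030 and utf-8 codecs. The sample string is an arbitrary choice of common characters and the helper name encoded_sizes is hypothetical; rarer characters can take 4 bytes in GB18030, so the ratio depends on the data.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded byte length of text in GB18030 and in UTF-8."""
    return {
        "gb18030": len(text.encode("gb18030")),
        "utf-8": len(text.encode("utf-8")),
    }


sample = "数据库编码"  # five common Chinese characters (arbitrary sample)
sizes = encoded_sizes(sample)
print(sizes)

# ASCII text takes one byte per character in both encodings.
ascii_sizes = encoded_sizes("postgres")
print(ascii_sizes)
```

For text that mixes ASCII and Chinese characters the saving is smaller than the 2-versus-3 figure suggests, since only the Chinese characters differ in size.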
Tony
On 2025-05-24 Sa 8:58 PM, DEVOPS_WwIT wrote:
Hi Michael
Yeah, that's a good point. I would also question what's the benefit
in using GB18030 over UTF-8, though. An obvious one I can see is
because legacy applications never get updated.
The GB18030 encoding standard is a mandatory Chinese character
encoding standard required by regulations. Software sold and used in
China must support GB18030, with its latest version being the 2023
edition. The primary advantage of GB18030 is that most Chinese
characters require only 2 bytes for storage, whereas UTF-8
necessitates 3 bytes for the same characters. This makes GB18030
significantly more storage-efficient compared to UTF-8 in terms of
space utilization.
Given this, removing it seems like a non-starter.
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
On 26 May 2025, at 18:07, Andrew Dunstan <andrew@dunslane.net> wrote:
On 2025-05-24 Sa 8:58 PM, DEVOPS_WwIT wrote:
The GB18030 encoding standard is a mandatory Chinese character encoding standard required by regulations. Software sold and used in China must support GB18030, with its latest version being the 2023 edition. The primary advantage of GB18030 is that most Chinese characters require only 2 bytes for storage, whereas UTF-8 necessitates 3 bytes for the same characters. This makes GB18030 significantly more storage-efficient compared to UTF-8 in terms of space utilization.
Given this, removing it seems like a non-starter.
Agreed, it seems very unappealing to remove something so important to such a
large userbase.
--
Daniel Gustafsson
On Mon, May 26, 2025 at 06:54:49PM +0200, Daniel Gustafsson wrote:
Agreed, it seems very unappealing to remove something so important to such a
large userbase.
Agreed that the so-said "state" level requirement would be a
non-starter.
--
Michael
Re: Michael Paquier
On Mon, May 26, 2025 at 06:54:49PM +0200, Daniel Gustafsson wrote:
Agreed, it seems very unappealing to remove something so important to such a
large userbase.
Agreed that the so-said "state" level requirement would be a
non-starter.
Or maybe support for using these as server encodings could be
removed, keeping the client_encoding part intact?
Christoph
On Thu, Jun 05, 2025 at 03:35:19PM +0200, Christoph Berg wrote:
Re: Michael Paquier
On Mon, May 26, 2025 at 06:54:49PM +0200, Daniel Gustafsson wrote:
Agreed, it seems very unappealing to remove something so important to such a
large userbase.
Agreed that the so-said "state" level requirement would be a
non-starter.
Or maybe support for using these as server encodings could be
removed, keeping the client_encoding part intact?
Christoph
Hi,
Doesn't the ICU system support this encoding? They could just use it if
you still want to remove our own implementation.
Regards,
Ken
Agreed that the so-said "state" level requirement would be a
non-starter.
Or maybe support for using these as server encodings could be
removed, keeping the client_encoding part intact?
GB18030 is already a client-only encoding and cannot be used as a
server encoding. The only way to store GB18030 data in a database is
to convert it from GB18030 to UTF-8 (which can be done automatically).
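Conceptually, the automatic conversion works like the round trip below. This is only a Python sketch using Python's own codecs and assuming a UTF-8 server encoding; PostgreSQL uses its own conversion routines, and the function names here are hypothetical.

```python
def client_to_server(raw: bytes, client_encoding: str = "gb18030") -> bytes:
    """Convert bytes sent by the client into UTF-8 bytes for storage."""
    return raw.decode(client_encoding).encode("utf-8")


def server_to_client(stored: bytes, client_encoding: str = "gb18030") -> bytes:
    """Convert stored UTF-8 bytes back into the client's encoding."""
    return stored.decode("utf-8").encode(client_encoding)


sent = "中文".encode("gb18030")      # what a GB18030 client would send
stored = client_to_server(sent)      # what ends up stored, as UTF-8
returned = server_to_client(stored)  # what the client reads back

print(stored == "中文".encode("utf-8"))
print(returned == sent)
```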
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
Hi,
On 2025-05-22 14:54:22 +0900, Michael Paquier wrote:
All the encodings supported are documented here:
https://www.postgresql.org/docs/devel/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED
There has been plenty of discussion about GB18030, and it seems we aren't likely
to be able to drop that.
I think there are a lot easier cases though. The easiest probably is
MULE_INTERNAL - all discussions referencing it seem to be about oddities of
MULE_INTERNAL, not about using it. I think it's been effectively unused since
its introduction. Due to not even having a conversion path to UTF-8 it's
really not practically usable IMO.
Greetings,
Andres Freund
On Thu, Jun 05, 2025 at 08:05:20PM -0400, Andres Freund wrote:
There has been plenty of discussion about GB18030, and it seems we aren't likely
to be able to drop that.
Yes, as per upthread.
I think there are a lot easier cases though. The easiest probably is
MULE_INTERNAL - all discussions referencing it seem to be about oddities of
MULE_INTERNAL, not about using it. I think it's been effectively unused since
its introduction. Due to not even having a conversion path to UTF-8 it's
really not practically usable IMO.
Perhaps, yes. I still need to do some homework here and gather some
data to share, FWIW.
--
Michael