Retiring some encodings?

Started by Michael Paquier, about 1 year ago, 22 messages, hackers

michael@paquier.xyz

about 1 year ago

Hi all,

$subject is something that has been on my mind for a few weeks now,
following the recent events with CVE-2025-4207 (627acc3caa74) and
CVE-2025-1094 (5dc1e42b4fa6).

All the encodings supported are documented here:
https://www.postgresql.org/docs/devel/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED

One pain point in the code is the GB18030 encoding, which has the
particularity of requiring a look at the first two bytes of an input
to know the actual length of a multi-byte character sequence.
This is not documented, and it can be a trap in disguise,
particularly in the frontend code (see jsonapi.c).
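
To make the two-byte lookahead concrete, here is a minimal Python
sketch, not the actual Postgres code. The function names are made up
for illustration, and the byte ranges are the usual GB18030 ones: a
second byte in 0x30..0x39 marks a four-byte sequence. UTF-8 is shown
next to it because its length is decided by the lead byte alone.

```python
def utf8_mblen(first: int) -> int:
    """Length of a UTF-8 sequence, decided from the lead byte alone."""
    if first < 0x80:
        return 1
    if first & 0xE0 == 0xC0:
        return 2
    if first & 0xF0 == 0xE0:
        return 3
    if first & 0xF8 == 0xF0:
        return 4
    return 1  # invalid lead byte; treated as length 1 in this sketch


def gb18030_mblen(buf: bytes) -> int:
    """Length of the GB18030 sequence at the start of buf.

    Raises ValueError when the available bytes are not enough to tell.
    """
    if not buf:
        raise ValueError("empty input")
    if buf[0] < 0x80:
        return 1  # ASCII
    if len(buf) < 2:
        # The hazard: the lead byte alone cannot tell 2 from 4 bytes.
        raise ValueError("need a second byte to know the sequence length")
    # A second byte in 0x30..0x39 marks a four-byte sequence.
    return 4 if 0x30 <= buf[1] <= 0x39 else 2
```

Any caller that only has one byte at hand (a truncated buffer, or a
parser working byte by byte) has to handle that ValueError-like case
explicitly, which is where the trap sits.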

With all that in mind, I wanted to kick off a discussion about
potentially removing one or more encodings from the core code,
including the backend part, the frontend part and the conversion
routines, coupled with checks in pg_upgrade that complain when
databases or collations use such an encoding (the collation part
needs to be checked when not using ICU). Just being able to remove
GB18030 would do us a favor in the long term, at least, but there's
more.

I have discussed the matter internally, with a few things pointed out:
- One thing that I was considering first would be the possibility to
add support for pluggable encodings in the backend code, giving an
option for retired encodings to be reloaded back to the server, with a
concept close to what we do for WAL RMGRs with IDs stuck in time once
defined, catalogs using pg_enc. Encouraging users to have their own
encodings, particularly ones that we'd consider to be unsafe by design
like the GB one may not be a good idea. But there is always the
argument that users may not want to pay the cost of a set of ALTER
DATABASE commands. Nobody really liked this idea of putting the
encoding responsibility into an extension :D
- Another idea, mentioned by Jeff Davis, is around the Unicode code
point U+FFFD (didn't know about this one), which can be used to
replace an incoming character whose value is unknown. One strategy
would then be to map encodings whose internals are dropped to UTF-8
under the hood, with this character as the exit path for characters
that cannot be understood, meaning partial and silent data loss.
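
As a rough illustration of the U+FFFD strategy and of the data loss it
implies, here is a small Python sketch. The legacy encoding and its
partial mapping table are entirely made up for the example; nothing
here reflects an existing conversion routine.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# Hypothetical, deliberately incomplete mapping for a toy single-byte
# legacy encoding whose internals would have been dropped.
TOY_LEGACY_TO_UNICODE = {0xE9: "é", 0xF1: "ñ"}


def legacy_to_utf8(raw: bytes) -> bytes:
    """Convert toy-legacy bytes to UTF-8, using U+FFFD as the exit path."""
    out = []
    for b in raw:
        if b < 0x80:
            out.append(chr(b))  # ASCII passes through unchanged
        else:
            # Unknown bytes are not rejected, they are replaced.
            out.append(TOY_LEGACY_TO_UNICODE.get(b, REPLACEMENT))
    return "".join(out).encode("utf-8")
```

Two different unknown input bytes both end up as the same U+FFFD, so
the original value cannot be recovered afterwards, and no error is
raised along the way.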

Another set of things (also mentioned by Jeff as he's been diving into
this area a lot for the last few years with Jeremy Schneider), that
could also help $subject in the long run, would be to try removing
some code used for non-UTF8 cases. Some examples:
- downcase_identifier() and pgstrcasecmp.c mention the specific case
of Turkish with 'i' and 'I'.
- Simplify regc_pg_locale.c which is unable to support non-UTF8
encodings with characters of more than 2 bytes.
- pg_wchar's uint type could be removed, switched to a codepoint value
(?) (pointed out by Jeff).
- Varlena cases with non-UTF8, like text_position_setup().
In theory, what we could aim for here is to move forward with non-UTF8
encodings in the server, potentially moving away from libc. That's a
larger project, so it may be better to try something with some of the
low-hanging fruits like the non-UTF8 cases.
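
For readers not familiar with the Turkish 'i' and 'I' case mentioned
in the list above, a short Python sketch of the problem follows. Both
helpers are hypothetical and only illustrate the idea: folding A-Z
with plain ASCII arithmetic gives a locale-independent result, while
Turkish rules pair 'I' with the dotless 'ı' (U+0131) and 'İ' (U+0130)
with 'i'.

```python
def ascii_downcase(ident: str) -> str:
    """Fold only A-Z, leaving every other character untouched."""
    return "".join(
        chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in ident
    )


# Turkish-specific pairs; everything else falls back to str.lower().
TURKISH_LOWER = {"I": "\u0131", "\u0130": "i"}


def turkish_downcase(ident: str) -> str:
    """Downcase following the Turkish dotted/dotless i convention."""
    return "".join(TURKISH_LOWER.get(c, c.lower()) for c in ident)
```

With these, ascii_downcase("TITLE") and turkish_downcase("TITLE")
disagree on the second character, which is the kind of divergence the
special-case code has to guard against for identifiers.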

This last paragraph does not really change my opinion about GB18030: I'd like
to propose its removal for v19 because looking at the first two bytes
of a character sequence to know how long the full sequence is stands
as an exception compared to all the encodings supported by Postgres.
Anyway, at the end, all that is about removing code. A large majority
of users use UTF-8, we could improve things, so feel free to comment.

Feel free to use this thread if you have different ideas or if you
have any comments.

Thanks,
--
Michael