ICU 54 and earlier are too dangerous

Started by Jeff Davisabout 3 years ago6 messageshackers
Jump to latest
#1Jeff Davis
pgsql@j-davis.com

In ICU 54 and earlier, if ucol_open() is unable to find a matching
locale, it will fall back to the *environment*.

Using ICU 54:

initdb -D data -N --locale="en_US.UTF-8"
pg_ctl -D data -l logfile start
psql postgres -c "create collation asdf(provider=icu, locale='asdf')"
# returns true
psql postgres -c "select 'abc' collate asdf < 'ABC' collate asdf"
psql postgres -c "alter system set lc_messages='C'"
pg_ctl -D data -l logfile restart
# returns false and warns about collation version mismatch
psql postgres -c "select 'abc' collate asdf < 'ABC' collate asdf"

This was fixed in ICU 55 to fall back to the root locale instead[1]https://icu.unicode.org/download/55m1,
which is stable, has a collator version, and is not dependent on the
environment. As far as I can tell, 55 and later never fall back to the
environment when opening a collator (unless you explicitly pass NULL to
ucol_open(), which is documented).

It would be nice if we could detect when this fallback-to-environment
happens, so that we could just refuse to create the bogus collation.
But I didn't find a good way. There are non-error return codes from
ucol_open() that seem promising[2]https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utypes_8h.html#a3343c1c8a8377277046774691c98d78c, but they aren't actually very
useful to distinguish the fallback-to-environment case as far as I can
tell.

Unless someone has a better idea, I think we need to bump the minimum
required ICU version to 55. That would solve the issue in v16 and
later, but those using old versions of ICU and old versions of postgres
would still be vulnerable to these kinds of typos.

Regards,
Jeff Davis

[1]: https://icu.unicode.org/download/55m1
[2]: https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utypes_8h.html#a3343c1c8a8377277046774691c98d78c
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/utypes_8h.html#a3343c1c8a8377277046774691c98d78c

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#1)
Re: ICU 54 and earlier are too dangerous

Jeff Davis <pgsql@j-davis.com> writes:

In ICU 54 and earlier, if ucol_open() is unable to find a matching
locale, it will fall back to the *environment*.

That's not great, but ...

Unless someone has a better idea, I think we need to bump the minimum
required ICU version to 55. That would solve the issue in v16 and
later, but those using old versions of ICU and old versions of postgres
would still be vulnerable to these kinds of typos.

... that seems like an overreaction. We know from the buildfarm
that there's still a lot of old ICU out there. Is it really improving
anybody's life to try to forbid them from using such a version?

regards, tom lane

#3Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#1)
Re: ICU 54 and earlier are too dangerous

Hi,

On 2023-03-13 16:39:04 -0700, Jeff Davis wrote:

In ICU 54 and earlier, if ucol_open() is unable to find a matching
locale, it will fall back to the *environment*.

Using ICU 54:

initdb -D data -N --locale="en_US.UTF-8"
pg_ctl -D data -l logfile start
psql postgres -c "create collation asdf(provider=icu, locale='asdf')"
# returns true
psql postgres -c "select 'abc' collate asdf < 'ABC' collate asdf"
psql postgres -c "alter system set lc_messages='C'"
pg_ctl -D data -l logfile restart
# returns false and warns about collation version mismatch
psql postgres -c "select 'abc' collate asdf < 'ABC' collate asdf"

This was fixed in ICU 55 to fall back to the root locale instead[1],
which is stable, has a collator version, and is not dependent on the
environment. As far as I can tell, 55 and later never fall back to the
environment when opening a collator (unless you explicitly pass NULL to
ucol_open(), which is documented).

It would be nice if we could detect when this fallback-to-environment
happens, so that we could just refuse to create the bogus collation.
But I didn't find a good way. There are non-error return codes from
ucol_open() that seem promising[2], but they aren't actually very
useful to distinguish the fallback-to-environment case as far as I can
tell.

What non-error code is returned in the above example?

Can we query the returned collator and see if it matches what we were looking
for?

Unless someone has a better idea, I think we need to bump the minimum
required ICU version to 55. That would solve the issue in v16 and
later, but those using old versions of ICU and old versions of postgres
would still be vulnerable to these kinds of typos.

I'm a bit confused by the dates. https://icu.unicode.org/download/55m1 says
that version was released 2014-12-17, but the linked issue around root locales
is from 2018: https://unicode-org.atlassian.net/browse/ICU-10823 - I guess
the issue tracker was migrated at some point or such...

If indeed 2014 is the correct year of release, then it might be ok to increase
the minimum version...

Greetings,

Andres Freund

#4Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#2)
Re: ICU 54 and earlier are too dangerous

On 14.03.23 01:26, Tom Lane wrote:

Unless someone has a better idea, I think we need to bump the minimum
required ICU version to 55. That would solve the issue in v16 and
later, but those using old versions of ICU and old versions of postgres
would still be vulnerable to these kinds of typos.

... that seems like an overreaction. We know from the buildfarm
that there's still a lot of old ICU out there. Is it really improving
anybody's life to try to forbid them from using such a version?

If I'm getting the dates right, the 10-year support of RHEL 7 will
expire in June 2024. So if we follow past practices, we could drop
support for RHEL 7 in PG17. This would allow us to drop support for old
libicu, and also old openssl, zlib, maybe more.

So if we don't feel like we need to do an emergency change here, there
is a path to do this in a principled way in the near future.

#5Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#3)
Re: ICU 54 and earlier are too dangerous

On Mon, 2023-03-13 at 18:13 -0700, Andres Freund wrote:

What non-error code is returned in the above example?

When the collator for locale "asdf" is opened, the status is set to
U_USING_DEFAULT_WARNING.

That seemed very promising at first, but it's the same thing returned
after opening most valid locales, including "en" and "en-US". It seems
to only return U_ZERO_ERROR on an exact hit, like "fr-CA" or "root".

There's also U_USING_FALLBACK_WARNING, which also seemed promising, but
it's returned when opening "fr-FR" or "ja-JP" (falls back to "fr" and
"ja" respectively).

Can we query the returned collator and see if it matches what we were
looking
for?

I tried a few variations of that in my canonicalization / validation
patch, which I called "validation". The closest thing I found is:

ucol_getLocaleByType(collator, ULOC_VALID_LOCALE, &status)

We could strip away the attributes and compare to the result of that,
and it mostly works. There are a few complications, like I think we
need to preserve the "collation" attribute for things like
"de@collation=phonebook".

Another thing to consider is that the environment might happen to open
the collation you intend at the time the collation is created, but then
later of course the environment can change, so we'd have to check every
time it's opened. And getting an error when the collation is opened is
not great, so it might need to be a WARNING or something, and it starts
to get less useful.

What would be *really* nice is if there was some kind of way to tell if
there was no real match to a known locale, either during open or via
some other API. I wasn't able to find one, though.

Actually, now that I think about it, we could just search all known
locales using either ucol_getAvailable() or uloc_getAvailable(), and
see if there's a match. Not very clean, but it should catch most
problems. I'll look into whether there's a reasonable way to match or
not.

I'm a bit confused by the dates.
https://icu.unicode.org/download/55m1 says
that version was released 2014-12-17, but the linked issue around
root locales
is from 2018: https://unicode-org.atlassian.net/browse/ICU-10823  - I
guess
the issue tracker was migrated at some point or such...

The dates are misleading in both git (migrated from SVN circa 2016) and
JIRA (migrated circa 2018, see
https://unicode-org.atlassian.net/browse/ICU-1 ). It seems 55.1 was
released in either 2014 or 2015.

Regards,
Jeff Davis

#6Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#5)
Re: ICU 54 and earlier are too dangerous

On Tue, 2023-03-14 at 08:48 -0700, Jeff Davis wrote:

Actually, now that I think about it, we could just search all known
locales using either ucol_getAvailable() or uloc_getAvailable(), and
see if there's a match. Not very clean, but it should catch most
problems. I'll look into whether there's a reasonable way to match or
not.

I posted a patch to do this as 0006 in the series here:

/messages/by-id/9afa6dbe0d31053ad265aeba488fde784fd5b7ab.camel@j-davis.com

Regards,
Jeff Davis