Order changes in PG16 since ICU introduction
A couple of days ago, our PostGIS PG16 bots started failing with order
changes in text.
We have our tests set to locale=c
It seems since April 20th, our tests that rely on sorting characters
changed.
As noted in this ticket:
https://trac.osgeo.org/postgis/ticket/5375
I'm assuming it's result of icu change:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fcb21b3ac
dcb9a60313325618fd7080aa36f1626
I suspect all our bots are compiling with icu enabled. But I haven't
confirmed.
I'm assuming this is an expected change in behavior, but just want to
confirm.
Thanks,
Regina
"Regina Obe" <lr@pcorp.us> writes:
A couple of days ago, our PostGIS PG16 bots started failing with order
changes in text.
We have our tests set to locale=c
It seems since April 20th, our tests that rely on sorting characters
changed.
As noted in this ticket:
I'm assuming it's result of icu change:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fcb21b3ac
dcb9a60313325618fd7080aa36f1626
I suspect all our bots are compiling with icu enabled. But I haven't
confirmed.
If they actually are using locale C, I would say this is a bug.
That should designate memcmp sorting and nothing else.
regards, tom lane
On Fri, 2023-04-21 at 11:27 -0400, Regina Obe wrote:
A couple of days ago, our PostGIS PG16 bots started failing with
order
changes in text.
We have our tests set to locale=c
Are you sure it's still using the C locale? The results seem to be
explainable if the locale switched from "C" to "en-US-x-icu":
The results of the following are the same in v15 and v16:
select 'RM(25)/nodes|+|21|1' collate "C" < 'RM(25)/nodes|-|21|' collate
"C"; -- true
select 'RM(25)/nodes|+|21|1' collate "en-US-x-icu" < 'RM(25)/nodes|-
|21|' collate "en-US-x-icu"; -- false
I suspect when the initdb and configure defaults changed from libc to
ICU, then your locale changed.
Regards,
Jeff Davis
On Fri, Apr 21, 2023 at 11:48:51AM -0400, Tom Lane wrote:
"Regina Obe" <lr@pcorp.us> writes:
If they actually are using locale C, I would say this is a bug.
That should designate memcmp sorting and nothing else.
Sounds like a bug to me. This is happening with a PostgreSQL cluster
created and served by a build of commit c04c6c5d6f :
=# select version();
PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
=# show lc_collate;
C
=# select '+' < '-';
f
=# select '+' < '-' collate "C";
t
I don't know if it should matter but also:
=# show lc_messages;
C
--strk;
On 21.04.23 19:09, Sandro Santilli wrote:
On Fri, Apr 21, 2023 at 11:48:51AM -0400, Tom Lane wrote:
"Regina Obe" <lr@pcorp.us> writes:
If they actually are using locale C, I would say this is a bug.
That should designate memcmp sorting and nothing else.Sounds like a bug to me. This is happening with a PostgreSQL cluster
created and served by a build of commit c04c6c5d6f :=# select version();
PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
=# show lc_collate;
C
=# select '+' < '-';
f
If the database is created with locale provider ICU, then lc_collate
does not apply here, so the result might be correct (depending on what
locale you have set).
Show quoted text
=# select '+' < '-' collate "C";
t
On Fri, 2023-04-21 at 19:09 +0200, Sandro Santilli wrote:
=# select version();
PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
=# show lc_collate;
C
=# select '+' < '-';
f
What is the result of:
select datlocprovider, datcollate, daticulocale
from pg_database where datname=current_database();
=# select '+' < '-' collate "C";
t
Regards,
Jeff Davis
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
If the database is created with locale provider ICU, then lc_collate
does not apply here, so the result might be correct (depending on what
locale you have set).
FWIW, an installation created under LANG=C defaults to ICU locale
en-US-u-va-posix for me (see psql \l), and that still sorts as
expected on my RHEL8 box. We've not seen buildfarm problems either.
I am wondering however whether this doesn't mean that all our carefully
coded fast paths for C locale just went down the drain. Does the ICU
code have any of that? Has any performance testing been done to see
what impact this change had on C-locale installations? (The current
code coverage report for pg_locale.c is not encouraging.)
regards, tom lane
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
If the database is created with locale provider ICU, then lc_collate
does not apply here, so the result might be correct (depending on what
locale you have set).FWIW, an installation created under LANG=C defaults to ICU locale en-US-u-
va-posix for me (see psql \l), and that still sorts as expected on my
RHEL8 box.
We've not seen buildfarm problems either.
I am wondering however whether this doesn't mean that all our carefully
coded fast paths for C locale just went down the drain. Does the ICU code
have any of that? Has any performance testing been done to see what
impact
this change had on C-locale installations? (The current code coverage
report
for pg_locale.c is not encouraging.)
regards, tom lane
Just another metric.
On my mingw64 setup, I built a test database on PG16 (built with icu
support) and PG15 (no icu support)
CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LC_COLLATE = 'C'
LC_CTYPE = 'C';
I think the above is the similar setup we have when testing.
On PG15
SELECT '+' < '-' ; returns true
On PG 16 returns false
For PG 16, to strk's point, you have to do: to get a true
SELECT '+' COLLATE "C" < '-' COLLATE "C";
I would expect since I'm initializing my db in collate C they would both
behave the same
"Regina Obe" <lr@pcorp.us> writes:
On my mingw64 setup, I built a test database on PG16 (built with icu
support) and PG15 (no icu support)
CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LC_COLLATE = 'C'
LC_CTYPE = 'C';
As has been pointed out already, setting LC_COLLATE/LC_CTYPE is
meaningless when the locale provider is ICU. You need to look
at what ICU locale is being chosen, or force it with LOCALE = 'C'.
regards, tom lane
CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8'
LC_COLLATE = 'C'
LC_CTYPE = 'C';
As has been pointed out already, setting LC_COLLATE/LC_CTYPE is
meaningless when the locale provider is ICU. You need to look at what ICU
locale is being chosen, or force it with LOCALE = 'C'.regards, tom lane
Okay got it was on IRC with RhodiumToad and he suggested:
CREATE DATABASE test2 TEMPLATE=template0 ENCODING = 'UTF8' LC_COLLATE = 'C'
LC_CTYPE = 'C' ICU_LOCALE='C';
Which gives expected result:
SELECT '+' < '-' ; -- true
but gives me a notice:
NOTICE: using standard form "en-US-u-va-posix" for locale "C"
"Regina Obe" <lr@pcorp.us> writes:
Okay got it was on IRC with RhodiumToad and he suggested:
CREATE DATABASE test2 TEMPLATE=template0 ENCODING = 'UTF8' LC_COLLATE = 'C'
LC_CTYPE = 'C' ICU_LOCALE='C';
Which gives expected result:
SELECT '+' < '-' ; -- true
but gives me a notice:
NOTICE: using standard form "en-US-u-va-posix" for locale "C"
Yeah. My recommendation is just LOCALE:
regression=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';
CREATE DATABASE
regression=# CREATE DATABASE test2 TEMPLATE=template0 ENCODING = 'UTF8' ICU_LOCALE = 'C';
NOTICE: using standard form "en-US-u-va-posix" for locale "C"
CREATE DATABASE
I think it's probably intentional that ICU_LOCALE is stricter
about being given a real ICU locale name, but I didn't write
any of that code.
regards, tom lane
"Peter" == Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
Peter> If the database is created with locale provider ICU, then
Peter> lc_collate does not apply here,
Having lc_collate return a value which is silently being ignored seems
to me rather hugely confusing.
Also, somewhere along the line someone broke initdb --no-locale, which
should result in C locale being the default everywhere, but when I just
tested it it picked 'en' for an ICU locale, which is not the right
thing.
--
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
"Peter" == Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
Peter> If the database is created with locale provider ICU, then
Peter> lc_collate does not apply here,
Having lc_collate return a value which is silently being ignored seems
to me rather hugely confusing.
It's not *completely* ignored --- there are bits of code that are not
yet ICU-ified and will still use the libc facilities. So we can't
get rid of those options yet, even in an ICU-based database.
Also, somewhere along the line someone broke initdb --no-locale, which
should result in C locale being the default everywhere, but when I just
tested it it picked 'en' for an ICU locale, which is not the right
thing.
Confirmed:
$ LANG=en_US.utf8 initdb --no-locale
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
Using default ICU locale "en_US".
Using language tag "en-US" for ICU locale "en_US".
The database cluster will be initialized with this locale configuration:
provider: icu
ICU locale: en-US
LC_COLLATE: C
LC_CTYPE: C
...
That needs to be fixed: --no-locale should prevent any consideration
of initdb's LANG/LC_foo environment.
regards, tom lane
Yeah. My recommendation is just LOCALE:
regression=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING =
'UTF8' LOCALE = 'C'; CREATE DATABASE regression=# CREATE DATABASE test2
TEMPLATE=template0 ENCODING = 'UTF8' ICU_LOCALE = 'C';
NOTICE: using standard form "en-US-u-va-posix" for locale "C"
CREATE DATABASEI think it's probably intentional that ICU_LOCALE is stricter about being
given
a real ICU locale name, but I didn't write any of that code.
regards, tom lane
CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';
Doesn't seem to work at least not under mingw64 anyway.
SELECT '+' < '-' ;
Returns false
"Regina Obe" <lr@pcorp.us> writes:
CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';
Doesn't seem to work at least not under mingw64 anyway.
Hmm, doesn't work for me either:
$ LANG=en_US.utf8 initdb
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
Using default ICU locale "en_US".
Using language tag "en-US" for ICU locale "en_US".
The database cluster will be initialized with this locale configuration:
provider: icu
ICU locale: en-US
LC_COLLATE: en_US.utf8
LC_CTYPE: en_US.utf8
LC_MESSAGES: en_US.utf8
LC_MONETARY: en_US.utf8
LC_NUMERIC: en_US.utf8
LC_TIME: en_US.utf8
...
$ psql postgres
psql (16devel)
Type "help" for help.
postgres=# SELECT '+' < '-' ;
?column?
----------
f
(1 row)
(as expected, so far)
postgres=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';
CREATE DATABASE
postgres=# \c test1
You are now connected to database "test1" as user "postgres".
test1=# SELECT '+' < '-' ;
?column?
----------
f
(1 row)
(wrong!)
test1=# \l
List of databases
Name | Owner | Encoding | Locale Provider | Collate | Ctype | ICU Locale | ICU Rules | Access privileges
-----------+----------+----------+-----------------+------------+------------+------------+-----------+-----------------------
postgres | postgres | UTF8 | icu | en_US.utf8 | en_US.utf8 | en-US | |
template0 | postgres | UTF8 | icu | en_US.utf8 | en_US.utf8 | en-US | | =c/postgres +
| | | | | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | icu | en_US.utf8 | en_US.utf8 | en-US | | =c/postgres +
| | | | | | | | postgres=CTc/postgres
test1 | postgres | UTF8 | icu | C | C | en-US | |
(4 rows)
Looks like the "pick en-US even when told not to" problem exists here too.
regards, tom lane
On Fri, Apr 21, 2023 at 07:14:13PM +0200, Peter Eisentraut wrote:
On 21.04.23 19:09, Sandro Santilli wrote:
On Fri, Apr 21, 2023 at 11:48:51AM -0400, Tom Lane wrote:
"Regina Obe" <lr@pcorp.us> writes:
If they actually are using locale C, I would say this is a bug.
That should designate memcmp sorting and nothing else.Sounds like a bug to me. This is happening with a PostgreSQL cluster
created and served by a build of commit c04c6c5d6f :=# select version();
PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
=# show lc_collate;
C
=# select '+' < '-';
fIf the database is created with locale provider ICU, then lc_collate does
not apply here, so the result might be correct (depending on what locale you
have set).
The database is created by a perl script which starts like this:
$ENV{"LC_ALL"} = "C";
$ENV{"LANG"} = "C";
And then runs:
createdb --encoding=UTF-8 --template=template0 --lc-collate=C
Should we tweak anything else to make the results predictable ?
--strk;
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Also, somewhere along the line someone broke initdb --no-locale,
which should result in C locale being the default everywhere, but
when I just tested it it picked 'en' for an ICU locale, which is not
the right thing.
Tom> Confirmed:
Tom> $ LANG=en_US.utf8 initdb --no-locale
Tom> The files belonging to this database system will be owned by user "postgres".
Tom> This user must also own the server process.
Tom> Using default ICU locale "en_US".
Tom> Using language tag "en-US" for ICU locale "en_US".
Tom> The database cluster will be initialized with this locale configuration:
Tom> provider: icu
Tom> ICU locale: en-US
Tom> LC_COLLATE: C
Tom> LC_CTYPE: C
Tom> ...
Tom> That needs to be fixed: --no-locale should prevent any
Tom> consideration of initdb's LANG/LC_foo environment.
Would it also not make sense to also take into account any --locale and
--lc-* options before choosing an ICU default locale? Right now if you
do, say, initdb --locale=fr_FR you get an ICU locale based on the
environment but lc_* settings based on the option, which seems maximally
confusing.
Also, what happens now to lc_collate_is_c() when the provider is ICU? Am
I missing something, or is it never true now, even if you specified C /
POSIX / en-US-u-va-posix as the ICU locale? This seems like it could be
an important pessimization.
Also also, we now have the problem that it is much harder to create a
'C' collation database within an existing cluster (e.g. for testing)
without knowing whether the default provider is ICU. In the past one
would have done:
CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';
but now that creates a database that uses the same ICU locale as
template0 by default. If instead one tries:
CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C' ICU_LOCALE='C';
then one gets an error if the default locale provider is _not_ ICU. The
only option now seems to be:
CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C' LOCALE_PROVIDER = 'libc';
which of course doesn't work in older pg versions.
--
Andrew.
On Fri, Apr 21, 2023 at 10:27:49AM -0700, Jeff Davis wrote:
On Fri, 2023-04-21 at 19:09 +0200, Sandro Santilli wrote:
� =# select version();
� PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
� =# show lc_collate;
� C
� =# select '+' < '-';
� fWhat is the result of:
select datlocprovider, datcollate, daticulocale
from pg_database where datname=current_database();
datlocprovider | i
datcollate | C
daticulocale | en-US
--strk;
On Fri, 2023-04-21 at 14:23 -0400, Tom Lane wrote:
postgres=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8'
LOCALE = 'C';
...
test1 | postgres | UTF8 | icu | C |
C | en-US | |
(4 rows)Looks like the "pick en-US even when told not to" problem exists here
too.
Both provider (ICU) and the icu locale (en-US) are inherited from
template0. The LOCALE parameter to CREATE DATABASE doesn't affect
either of those things, because there's a separate parameter
ICU_LOCALE.
This happens the same way in v15, and although it matches the
documentation technically, it is not a great user experience.
I have a couple ideas:
1. Introduce a "none" provider to separate the concept of C/POSIX
locales from the libc provider. It's not really using a provider
anyway, it's just using memcmp(), and I think it causes confusion to
combine them. Saying "LOCALE_PROVIDER=none" is less error-prone than
"LOCALE_PROVIDER=libc LOCALE='C'".
2. Change the CREATE DATABASE syntax to catch these errors better at
the possible expense of backwards compatibility.
I am also having second thoughts about accepting "C" or "POSIX" as an
ICU locale and transforming it to "en-US-u-va-posix" in v16. It's not
terribly useful (why not just use memcmp()?), it's not fast in my
measurements (en-US is faster), so maybe it's better to just throw an
error and tell the user to use C (or provider=none as I suggest
above)?
Obviously the user could manually type "en-US-u-va-posix" if that's the
locale they want. Throwing an error would be a backwards-compatibility
issue, but in v15 an ICU locale of "C" just gives the root locale
anyway, which is probably not what they want.
Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
I have a couple ideas:
1. Introduce a "none" provider to separate the concept of C/POSIX
locales from the libc provider. It's not really using a provider
anyway, it's just using memcmp(), and I think it causes confusion to
combine them. Saying "LOCALE_PROVIDER=none" is less error-prone than
"LOCALE_PROVIDER=libc LOCALE='C'".
I think I might like this idea, except for one thing: you're imagining
that the locale doesn't control anything except string comparisons.
What about to_upper/to_lower, character classifications in regexes, etc?
(I'm not sure whether those operations can get redirected to ICU today
or whether they still always go to libc, but we'll surely want to fix
it eventually if the latter is still true.)
In any case, that seems somewhat orthogonal to what we're on about here,
which is making the behavior of CREATE DATABASE less surprising and more
backwards-compatible. I'm not sure that provider=none can help with that.
Aside from the user-surprise issues discussed up to now, pg_dump scripts
emitted by pre-v15 pg_dump are not going to contain LOCALE_PROVIDER
clauses in CREATE DATABASE, and people are going to be very unhappy
if that means they suddenly get totally different locale semantics
after restoring into a new DB. I think we need some plan for mapping
libc-style locale specs into ICU locales so that we can make that
more nearly transparent.
2. Change the CREATE DATABASE syntax to catch these errors better at
the possible expense of backwards compatibility.
That is the exact opposite of what I think we need. Backwards
compatibility isn't optional.
Maybe this means we are not ready to do ICU-by-default in v16.
It certainly feels like there might be more here than we want to
start designing post-feature-freeze.
regards, tom lane