Order changes in PG16 since ICU introduction
A couple of days ago, our PostGIS PG16 bots started failing with order
changes in text.
We have our tests set to locale=c
It seems since April 20th, our tests that rely on sorting characters
changed.
As noted in this ticket:
https://trac.osgeo.org/postgis/ticket/5375
I'm assuming it's the result of this ICU change:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fcb21b3acdcb9a60313325618fd7080aa36f1626
I suspect all our bots are compiling with icu enabled. But I haven't
confirmed.
I'm assuming this is an expected change in behavior, but just want to
confirm.
Thanks,
Regina
"Regina Obe" <lr@pcorp.us> writes:
A couple of days ago, our PostGIS PG16 bots started failing with order
changes in text.
We have our tests set to locale=c
It seems since April 20th, our tests that rely on sorting characters
changed.
As noted in this ticket:
I'm assuming it's result of icu change:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fcb21b3acdcb9a60313325618fd7080aa36f1626
I suspect all our bots are compiling with icu enabled. But I haven't
confirmed.
If they actually are using locale C, I would say this is a bug.
That should designate memcmp sorting and nothing else.
regards, tom lane
On Fri, 2023-04-21 at 11:27 -0400, Regina Obe wrote:
A couple of days ago, our PostGIS PG16 bots started failing with order
changes in text.
We have our tests set to locale=c
Are you sure it's still using the C locale? The results seem to be
explainable if the locale switched from "C" to "en-US-x-icu":
The results of the following are the same in v15 and v16:
select 'RM(25)/nodes|+|21|1' collate "C" < 'RM(25)/nodes|-|21|' collate "C"; -- true
select 'RM(25)/nodes|+|21|1' collate "en-US-x-icu" < 'RM(25)/nodes|-|21|' collate "en-US-x-icu"; -- false
I suspect when the initdb and configure defaults changed from libc to
ICU, then your locale changed.
Regards,
Jeff Davis
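To illustrate why collate "C" returns true for the pair above: "C" designates memcmp()-style ordering, so the comparison is decided by raw byte values. The following is only a sketch in Python standing in for memcmp(); it is not the code PostgreSQL runs, and the helper name c_locale_less_than is made up for this example.

```python
# Sketch: emulate memcmp()-style ordering, which is what collate "C" designates.
# The two strings are taken from the queries above.
a = "RM(25)/nodes|+|21|1".encode("utf-8")
b = "RM(25)/nodes|-|21|".encode("utf-8")


def c_locale_less_than(left: bytes, right: bytes) -> bool:
    """Byte-wise comparison: Python orders bytes objects lexicographically
    by unsigned byte value, like memcmp() with a length tie-break."""
    return left < right


# '+' is 0x2b and '-' is 0x2d, so the first differing byte decides the result.
print(hex(ord("+")), hex(ord("-")))
print(c_locale_less_than(a, b))
```

An ICU collation applies its own ordering rules to punctuation and symbols, which is why the same pair can compare the other way under "en-US-x-icu".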
On Fri, Apr 21, 2023 at 11:48:51AM -0400, Tom Lane wrote:
"Regina Obe" <lr@pcorp.us> writes:
If they actually are using locale C, I would say this is a bug.
That should designate memcmp sorting and nothing else.
Sounds like a bug to me. This is happening with a PostgreSQL cluster
created and served by a build of commit c04c6c5d6f :
=# select version();
PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
=# show lc_collate;
C
=# select '+' < '-';
f
=# select '+' < '-' collate "C";
t
I don't know if it should matter but also:
=# show lc_messages;
C
--strk;
On 21.04.23 19:09, Sandro Santilli wrote:
On Fri, Apr 21, 2023 at 11:48:51AM -0400, Tom Lane wrote:
"Regina Obe" <lr@pcorp.us> writes:
If they actually are using locale C, I would say this is a bug.
That should designate memcmp sorting and nothing else.
Sounds like a bug to me. This is happening with a PostgreSQL cluster
created and served by a build of commit c04c6c5d6f :
=# select version();
PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
=# show lc_collate;
C
=# select '+' < '-';
f
If the database is created with locale provider ICU, then lc_collate
does not apply here, so the result might be correct (depending on what
locale you have set).
=# select '+' < '-' collate "C";
t
On Fri, 2023-04-21 at 19:09 +0200, Sandro Santilli wrote:
=# select version();
PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
=# show lc_collate;
C
=# select '+' < '-';
f
What is the result of:
select datlocprovider, datcollate, daticulocale
from pg_database where datname=current_database();
=# select '+' < '-' collate "C";
t
Regards,
Jeff Davis
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
If the database is created with locale provider ICU, then lc_collate
does not apply here, so the result might be correct (depending on what
locale you have set).
FWIW, an installation created under LANG=C defaults to ICU locale
en-US-u-va-posix for me (see psql \l), and that still sorts as
expected on my RHEL8 box. We've not seen buildfarm problems either.
I am wondering however whether this doesn't mean that all our carefully
coded fast paths for C locale just went down the drain. Does the ICU
code have any of that? Has any performance testing been done to see
what impact this change had on C-locale installations? (The current
code coverage report for pg_locale.c is not encouraging.)
regards, tom lane
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
If the database is created with locale provider ICU, then lc_collate
does not apply here, so the result might be correct (depending on what
locale you have set).
FWIW, an installation created under LANG=C defaults to ICU locale
en-US-u-va-posix for me (see psql \l), and that still sorts as
expected on my RHEL8 box. We've not seen buildfarm problems either.
I am wondering however whether this doesn't mean that all our carefully
coded fast paths for C locale just went down the drain. Does the ICU
code have any of that? Has any performance testing been done to see
what impact this change had on C-locale installations? (The current
code coverage report for pg_locale.c is not encouraging.)
regards, tom lane
Just another metric.
On my mingw64 setup, I built a test database on PG16 (built with icu
support) and PG15 (no icu support)
CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LC_COLLATE = 'C'
LC_CTYPE = 'C';
I think the above is similar to the setup we use when testing.
On PG15
SELECT '+' < '-' ; returns true
On PG 16 returns false
For PG 16, to strk's point, you have to do the following to get true:
SELECT '+' COLLATE "C" < '-' COLLATE "C";
I would expect that, since I'm initializing my db with collate C, they would
both behave the same.
"Regina Obe" <lr@pcorp.us> writes:
On my mingw64 setup, I built a test database on PG16 (built with icu
support) and PG15 (no icu support)
CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LC_COLLATE = 'C'
LC_CTYPE = 'C';
As has been pointed out already, setting LC_COLLATE/LC_CTYPE is
meaningless when the locale provider is ICU. You need to look
at what ICU locale is being chosen, or force it with LOCALE = 'C'.
regards, tom lane
CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8'
LC_COLLATE = 'C'
LC_CTYPE = 'C';
As has been pointed out already, setting LC_COLLATE/LC_CTYPE is
meaningless when the locale provider is ICU. You need to look at what ICU
locale is being chosen, or force it with LOCALE = 'C'.
regards, tom lane
Okay, got it. I was on IRC with RhodiumToad and he suggested:
CREATE DATABASE test2 TEMPLATE=template0 ENCODING = 'UTF8' LC_COLLATE = 'C'
LC_CTYPE = 'C' ICU_LOCALE='C';
Which gives expected result:
SELECT '+' < '-' ; -- true
but gives me a notice:
NOTICE: using standard form "en-US-u-va-posix" for locale "C"
"Regina Obe" <lr@pcorp.us> writes:
Okay, got it. I was on IRC with RhodiumToad and he suggested:
CREATE DATABASE test2 TEMPLATE=template0 ENCODING = 'UTF8' LC_COLLATE = 'C'
LC_CTYPE = 'C' ICU_LOCALE='C';
Which gives expected result:
SELECT '+' < '-' ; -- true
but gives me a notice:
NOTICE: using standard form "en-US-u-va-posix" for locale "C"
Yeah. My recommendation is just LOCALE:
regression=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';
CREATE DATABASE
regression=# CREATE DATABASE test2 TEMPLATE=template0 ENCODING = 'UTF8' ICU_LOCALE = 'C';
NOTICE: using standard form "en-US-u-va-posix" for locale "C"
CREATE DATABASE
I think it's probably intentional that ICU_LOCALE is stricter
about being given a real ICU locale name, but I didn't write
any of that code.
regards, tom lane
"Peter" == Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
Peter> If the database is created with locale provider ICU, then
Peter> lc_collate does not apply here,
Having lc_collate return a value which is silently being ignored seems
to me rather hugely confusing.
Also, somewhere along the line someone broke initdb --no-locale, which
should result in C locale being the default everywhere, but when I just
tested it it picked 'en' for an ICU locale, which is not the right
thing.
--
Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
"Peter" == Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
Peter> If the database is created with locale provider ICU, then
Peter> lc_collate does not apply here,
Having lc_collate return a value which is silently being ignored seems
to me rather hugely confusing.
It's not *completely* ignored --- there are bits of code that are not
yet ICU-ified and will still use the libc facilities. So we can't
get rid of those options yet, even in an ICU-based database.
Also, somewhere along the line someone broke initdb --no-locale, which
should result in C locale being the default everywhere, but when I just
tested it it picked 'en' for an ICU locale, which is not the right
thing.
Confirmed:
$ LANG=en_US.utf8 initdb --no-locale
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
Using default ICU locale "en_US".
Using language tag "en-US" for ICU locale "en_US".
The database cluster will be initialized with this locale configuration:
provider: icu
ICU locale: en-US
LC_COLLATE: C
LC_CTYPE: C
...
That needs to be fixed: --no-locale should prevent any consideration
of initdb's LANG/LC_foo environment.
regards, tom lane
Yeah. My recommendation is just LOCALE:
regression=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';
CREATE DATABASE
regression=# CREATE DATABASE test2 TEMPLATE=template0 ENCODING = 'UTF8' ICU_LOCALE = 'C';
NOTICE: using standard form "en-US-u-va-posix" for locale "C"
CREATE DATABASE
I think it's probably intentional that ICU_LOCALE is stricter
about being given a real ICU locale name, but I didn't write
any of that code.
regards, tom lane
CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';
Doesn't seem to work at least not under mingw64 anyway.
SELECT '+' < '-' ;
Returns false
"Regina Obe" <lr@pcorp.us> writes:
CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';
Doesn't seem to work at least not under mingw64 anyway.
Hmm, doesn't work for me either:
$ LANG=en_US.utf8 initdb
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
Using default ICU locale "en_US".
Using language tag "en-US" for ICU locale "en_US".
The database cluster will be initialized with this locale configuration:
provider: icu
ICU locale: en-US
LC_COLLATE: en_US.utf8
LC_CTYPE: en_US.utf8
LC_MESSAGES: en_US.utf8
LC_MONETARY: en_US.utf8
LC_NUMERIC: en_US.utf8
LC_TIME: en_US.utf8
...
$ psql postgres
psql (16devel)
Type "help" for help.
postgres=# SELECT '+' < '-' ;
?column?
----------
f
(1 row)
(as expected, so far)
postgres=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';
CREATE DATABASE
postgres=# \c test1
You are now connected to database "test1" as user "postgres".
test1=# SELECT '+' < '-' ;
?column?
----------
f
(1 row)
(wrong!)
test1=# \l
                                                      List of databases
   Name    |  Owner   | Encoding | Locale Provider |  Collate   |   Ctype    | ICU Locale | ICU Rules |   Access privileges
-----------+----------+----------+-----------------+------------+------------+------------+-----------+-----------------------
 postgres  | postgres | UTF8     | icu             | en_US.utf8 | en_US.utf8 | en-US      |           |
 template0 | postgres | UTF8     | icu             | en_US.utf8 | en_US.utf8 | en-US      |           | =c/postgres          +
           |          |          |                 |            |            |            |           | postgres=CTc/postgres
 template1 | postgres | UTF8     | icu             | en_US.utf8 | en_US.utf8 | en-US      |           | =c/postgres          +
           |          |          |                 |            |            |            |           | postgres=CTc/postgres
 test1     | postgres | UTF8     | icu             | C          | C          | en-US      |           |
(4 rows)
Looks like the "pick en-US even when told not to" problem exists here too.
regards, tom lane
On Fri, Apr 21, 2023 at 07:14:13PM +0200, Peter Eisentraut wrote:
On 21.04.23 19:09, Sandro Santilli wrote:
On Fri, Apr 21, 2023 at 11:48:51AM -0400, Tom Lane wrote:
"Regina Obe" <lr@pcorp.us> writes:
If they actually are using locale C, I would say this is a bug.
That should designate memcmp sorting and nothing else.
Sounds like a bug to me. This is happening with a PostgreSQL cluster
created and served by a build of commit c04c6c5d6f :
=# select version();
PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
=# show lc_collate;
C
=# select '+' < '-';
f
If the database is created with locale provider ICU, then lc_collate does
not apply here, so the result might be correct (depending on what locale you
have set).
The database is created by a perl script which starts like this:
$ENV{"LC_ALL"} = "C";
$ENV{"LANG"} = "C";
And then runs:
createdb --encoding=UTF-8 --template=template0 --lc-collate=C
Should we tweak anything else to make the results predictable ?
--strk;
"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
Also, somewhere along the line someone broke initdb --no-locale,
which should result in C locale being the default everywhere, but
when I just tested it it picked 'en' for an ICU locale, which is not
the right thing.
Tom> Confirmed:
Tom> $ LANG=en_US.utf8 initdb --no-locale
Tom> The files belonging to this database system will be owned by user "postgres".
Tom> This user must also own the server process.
Tom> Using default ICU locale "en_US".
Tom> Using language tag "en-US" for ICU locale "en_US".
Tom> The database cluster will be initialized with this locale configuration:
Tom> provider: icu
Tom> ICU locale: en-US
Tom> LC_COLLATE: C
Tom> LC_CTYPE: C
Tom> ...
Tom> That needs to be fixed: --no-locale should prevent any
Tom> consideration of initdb's LANG/LC_foo environment.
Would it not also make sense to take into account any --locale and
--lc-* options before choosing an ICU default locale? Right now if you
do, say, initdb --locale=fr_FR you get an ICU locale based on the
environment but lc_* settings based on the option, which seems maximally
confusing.
Also, what happens now to lc_collate_is_c() when the provider is ICU? Am
I missing something, or is it never true now, even if you specified C /
POSIX / en-US-u-va-posix as the ICU locale? This seems like it could be
an important pessimization.
Also also, we now have the problem that it is much harder to create a
'C' collation database within an existing cluster (e.g. for testing)
without knowing whether the default provider is ICU. In the past one
would have done:
CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';
but now that creates a database that uses the same ICU locale as
template0 by default. If instead one tries:
CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C' ICU_LOCALE='C';
then one gets an error if the default locale provider is _not_ ICU. The
only option now seems to be:
CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C' LOCALE_PROVIDER = 'libc';
which of course doesn't work in older pg versions.
--
Andrew.
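For test harnesses that must work across server versions, the version dependence described above can be handled by choosing the statement at run time. The sketch below only assembles SQL text already quoted in this thread; the function name create_c_database_sql is hypothetical, and it assumes the caller has read the server's server_version_num (for example 160000) and passes a trusted, simple database name.

```python
def create_c_database_sql(dbname: str, server_version_num: int) -> str:
    """Return a CREATE DATABASE statement aiming for memcmp()-based "C" collation.

    server_version_num is the integer form of the server version.
    dbname is assumed to be a trusted, simple identifier; no quoting is done.
    """
    sql = (
        f"CREATE DATABASE {dbname} TEMPLATE=template0 "
        "ENCODING = 'UTF8' LOCALE = 'C'"
    )
    # LOCALE_PROVIDER is only accepted from v15 on; older servers reject it,
    # and there the plain LOCALE = 'C' form already gives libc "C" behavior.
    if server_version_num >= 150000:
        sql += " LOCALE_PROVIDER = 'libc'"
    return sql + ";"


print(create_c_database_sql("test", 160000))
print(create_c_database_sql("test", 140005))
```

This does not address the underlying usability complaint; it only shows the kind of version check callers currently have to carry themselves.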
On Fri, Apr 21, 2023 at 10:27:49AM -0700, Jeff Davis wrote:
On Fri, 2023-04-21 at 19:09 +0200, Sandro Santilli wrote:
=# select version();
PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
=# show lc_collate;
C
=# select '+' < '-';
f
What is the result of:
select datlocprovider, datcollate, daticulocale
from pg_database where datname=current_database();
datlocprovider | i
datcollate | C
daticulocale | en-US
--strk;
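The row above can be read as follows: datlocprovider 'i' means ICU and 'c' means libc, and with the ICU provider the database default collation comes from daticulocale, not datcollate. A small sketch of that reading, with a made-up helper name (default_collation_source) and covering only these two provider codes:

```python
from typing import Optional, Tuple

# Provider codes as stored in pg_database.datlocprovider (only these two
# are handled in this sketch).
PROVIDER_NAMES = {"c": "libc", "i": "icu"}


def default_collation_source(
    datlocprovider: str, datcollate: str, daticulocale: Optional[str]
) -> Tuple[str, Optional[str]]:
    """Return (provider, locale) that governs the database default collation."""
    provider = PROVIDER_NAMES.get(datlocprovider)
    if provider is None:
        raise ValueError(f"unrecognized datlocprovider: {datlocprovider!r}")
    if provider == "icu":
        # With ICU, datcollate (lc_collate) is not what sorts text by default.
        return provider, daticulocale
    return provider, datcollate


# The values reported above: provider ICU, so en-US decides '+' < '-'.
print(default_collation_source("i", "C", "en-US"))
```

So "show lc_collate" reporting C is consistent with the database still sorting by the en-US ICU locale.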
On Fri, 2023-04-21 at 14:23 -0400, Tom Lane wrote:
postgres=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';
...
test1 | postgres | UTF8 | icu | C | C | en-US | |
(4 rows)
Looks like the "pick en-US even when told not to" problem exists here
too.
Both provider (ICU) and the icu locale (en-US) are inherited from
template0. The LOCALE parameter to CREATE DATABASE doesn't affect
either of those things, because there's a separate parameter
ICU_LOCALE.
This happens the same way in v15, and although it matches the
documentation technically, it is not a great user experience.
I have a couple ideas:
1. Introduce a "none" provider to separate the concept of C/POSIX
locales from the libc provider. It's not really using a provider
anyway, it's just using memcmp(), and I think it causes confusion to
combine them. Saying "LOCALE_PROVIDER=none" is less error-prone than
"LOCALE_PROVIDER=libc LOCALE='C'".
2. Change the CREATE DATABASE syntax to catch these errors better at
the possible expense of backwards compatibility.
I am also having second thoughts about accepting "C" or "POSIX" as an
ICU locale and transforming it to "en-US-u-va-posix" in v16. It's not
terribly useful (why not just use memcmp()?), it's not fast in my
measurements (en-US is faster), so maybe it's better to just throw an
error and tell the user to use C (or provider=none as I suggest
above)?
Obviously the user could manually type "en-US-u-va-posix" if that's the
locale they want. Throwing an error would be a backwards-compatibility
issue, but in v15 an ICU locale of "C" just gives the root locale
anyway, which is probably not what they want.
Regards,
Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
I have a couple ideas:
1. Introduce a "none" provider to separate the concept of C/POSIX
locales from the libc provider. It's not really using a provider
anyway, it's just using memcmp(), and I think it causes confusion to
combine them. Saying "LOCALE_PROVIDER=none" is less error-prone than
"LOCALE_PROVIDER=libc LOCALE='C'".
I think I might like this idea, except for one thing: you're imagining
that the locale doesn't control anything except string comparisons.
What about to_upper/to_lower, character classifications in regexes, etc?
(I'm not sure whether those operations can get redirected to ICU today
or whether they still always go to libc, but we'll surely want to fix
it eventually if the latter is still true.)
In any case, that seems somewhat orthogonal to what we're on about here,
which is making the behavior of CREATE DATABASE less surprising and more
backwards-compatible. I'm not sure that provider=none can help with that.
Aside from the user-surprise issues discussed up to now, pg_dump scripts
emitted by pre-v15 pg_dump are not going to contain LOCALE_PROVIDER
clauses in CREATE DATABASE, and people are going to be very unhappy
if that means they suddenly get totally different locale semantics
after restoring into a new DB. I think we need some plan for mapping
libc-style locale specs into ICU locales so that we can make that
more nearly transparent.
2. Change the CREATE DATABASE syntax to catch these errors better at
the possible expense of backwards compatibility.
That is the exact opposite of what I think we need. Backwards
compatibility isn't optional.
Maybe this means we are not ready to do ICU-by-default in v16.
It certainly feels like there might be more here than we want to
start designing post-feature-freeze.
regards, tom lane
On Fri, 2023-04-21 at 21:14 +0200, Sandro Santilli wrote:
And then runs:
createdb --encoding=UTF-8 --template=template0 --lc-collate=C
Should we tweak anything else to make the results predictable ?
You can specify --locale-provider=libc
Regards,
Jeff Davis
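For a test driver that shells out to createdb, the suggestion above could be applied conditionally, since the --locale-provider option is only understood by createdb from v15 on. A sketch under stated assumptions: the helper name createdb_args is hypothetical, and the caller is assumed to know the major version of the createdb binary it invokes.

```python
from typing import List


def createdb_args(dbname: str, createdb_major_version: int) -> List[str]:
    """Build an argument list for createdb that requests libc "C" collation."""
    args = [
        "createdb",
        "--encoding=UTF-8",
        "--template=template0",
        "--lc-collate=C",
    ]
    # Older createdb binaries do not know --locale-provider, so add it
    # only for v15 and later.
    if createdb_major_version >= 15:
        args.append("--locale-provider=libc")
    args.append(dbname)
    return args


print(createdb_args("postgis_reg", 16))
print(createdb_args("postgis_reg", 14))
```

The resulting list can be passed to subprocess.run() as-is; the database name postgis_reg is just a placeholder.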
On Fri, 2023-04-21 at 13:28 -0400, Tom Lane wrote:
I am wondering however whether this doesn't mean that all our carefully
coded fast paths for C locale just went down the drain.
The code still exists. You can test it by using the built-in collation
"C" which is correctly specified with collprovider=libc and
collcollate=C.
For my test dataset, ICU 72, glibc 2.35:
-- ~07s
explain analyze select t from a order by t collate "C";
-- ~15s
explain analyze select t from a order by t collate "en-US-x-icu";
-- ~21s
explain analyze select t from a order by t collate "en-US-u-va-posix-x-icu";
-- ~34s
explain analyze select t from a order by t collate "en_US";
I believe the confusion in this thread comes from:
* The syntax of CREATE DATABASE (the same as v15 but still confusing)
* The fact that you need provider=libc to get memcmp() behavior (same
as v15 but still confusing)
Regards,
Jeff Davis
On Fri, Apr 21, 2023 at 3:25 PM Jeff Davis <pgsql@j-davis.com> wrote:
I am also having second thoughts about accepting "C" or "POSIX" as an
ICU locale and transforming it to "en-US-u-va-posix" in v16. It's not
terribly useful (why not just use memcmp()?), it's not fast in my
measurements (en-US is faster), so maybe it's better to just throw an
error and tell the user to use C (or provider=none as I suggest
above)?
I mean, to renew a complaint I've made previously, how the heck is
anyone supposed to understand what's going on here?
We have no meaningful documentation of how to select an ICU locale
that works for you. We have a couple of examples and a suggestion that
you should use BCP 47. But when I asked before for documentation
references, the ones you provided were not clear, basically
incomprehensible. In follow-up discussion, you admitted you'd had to
consult the source code to figure certain things out.
And the fact that "C" or "POSIX" gets transformed into
"en-US-u-va-posix" is also completely undocumented. That string appears
twice in the code, but zero times in the documentation. There's code
to do it, but users shouldn't have to read code, and it wouldn't help
much if they did, because the code comments don't really explain the
rationale behind this choice either.
I find the fact that people are having trouble here completely
predictable. Of course if people ask for "C" and the system tells them
that it's using "en-US-u-va-posix" instead they're going to be
confused and ask questions, exactly as is happening here. glibc
collations aren't particularly well-documented either, but people have
some experience with them, and they can get a list of values that have a
chance of working from /usr/share/locale, and they know what "C"
means. Nobody knows what "en-US-u-va-posix" is. It's not even
Googleable, really, whereas "C locale" is.
My opinion is that the switch to using ICU by default is ill-advised
and should be reverted. The compatibility break isn't worth whatever
advantages ICU may have, the documentation to allow people to
transition to ICU with reasonable effort doesn't exist, and the fact
that within weeks of feature freeze people who know a lot about
PostgreSQL are struggling to get the behavior they want is a really
bad sign.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, 2023-04-21 at 19:00 +0100, Andrew Gierth wrote:
Also, somewhere along the line someone broke initdb --no-locale, which
should result in C locale being the default everywhere, but when I just
tested it it picked 'en' for an ICU locale, which is not the right
thing.
Fixed, thank you.
Regards,
Jeff Davis
On Fri, 2023-04-21 at 16:00 -0400, Tom Lane wrote:
Maybe this means we are not ready to do ICU-by-default in v16.
It certainly feels like there might be more here than we want to
start designing post-feature-freeze.
I don't see how punting to the next release helps. If the CREATE
DATABASE syntax (and similar issues for createdb and initdb) in v15 is
just too confusing, and we can't find a remedy for v16, then we
probably won't find a remedy for v17 either.
Regards,
Jeff Davis
"Jeff" == Jeff Davis <pgsql@j-davis.com> writes:
Also, somewhere along the line someone broke initdb --no-locale,
which should result in C locale being the default everywhere, but
when I just tested it it picked 'en' for an ICU locale, which is not
the right thing.
Jeff> Fixed, thank you.
Is that the right fix, though? (It forces --locale-provider=libc for the
cluster default, which might not be desirable?)
--
Andrew.
On Fri, 2023-04-21 at 22:08 +0100, Andrew Gierth wrote:
Is that the right fix, though? (It forces --locale-provider=libc for the
cluster default, which might not be desirable?)
For the "no locale" behavior (memcmp()-based) the provider needs to be
libc. Do you see an alternative?
Regards,
Jeff Davis
"Jeff" == Jeff Davis <pgsql@j-davis.com> writes:
Is that the right fix, though? (It forces --locale-provider=libc for
the cluster default, which might not be desirable?)
Jeff> For the "no locale" behavior (memcmp()-based) the provider needs
Jeff> to be libc. Do you see an alternative?
Can lc_collate_is_c() be taught to check whether an ICU locale is using
POSIX collation?
There's now another bug in that --no-locale no longer does the same
thing as --locale=C (which is its long-established documented behavior).
How should these various options interact? This all seems not well
thought out from a usability perspective, and I think a proper fix
should involve a bit more serious consideration.
--
Andrew.
On Fri, 2023-04-21 at 16:33 -0400, Robert Haas wrote:
My opinion is that the switch to using ICU by default is ill-advised
and should be reverted.
Most of the complaints seem to be complaints about v15 as well, and
while those complaints may be a reason to not make ICU the default,
they are also an argument that we should continue to learn and try to
fix those issues because they exist in an already-released version.
Leaving it the default for now will help us fix those issues rather
than hide them.
It's still early, so we have plenty of time to revert the initdb
default if we need to.
Regards,
Jeff Davis
My opinion is that the switch to using ICU by default is ill-advised
and should be reverted.
Most of the complaints seem to be complaints about v15 as well, and while
those complaints may be a reason to not make ICU the default, they are also
an argument that we should continue to learn and try to fix those issues
because they exist in an already-released version.
Leaving it the default for now will help us fix those issues rather than hide
them.
It's still early, so we have plenty of time to revert the initdb default if we
need to.
Regards,
Jeff Davis
I'm fine with that. Sounds like it wouldn't be too hard to just pull it out at the end.
Before this, I didn't even know ICU existed in PG15. My first realization that ICU was even a thing was when my PG16 refused to compile without adding my ICU path to my pkg-config or putting in --without-icu.
So yah I suspect leaving it in a little bit longer will uncover some more issues and won't harm too much.
Thanks,
Regina
On Fri, 2023-04-21 at 16:33 -0400, Robert Haas wrote:
And the fact that "C" or "POSIX" gets transformed into
"en-US-u-va-posix"
I already expressed, on reflection, that we should probably just not do
that. So I think we're in agreement on this point; patch attached.
Regards,
Jeff Davis
Attachments:
0001-ICU-do-not-convert-locale-C-to-en-US-u-va-posix.patch (text/x-patch; charset=UTF-8)
From 3d2791af0a236cbc7ce7f29d988e8ac7fd3fd389 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 21 Apr 2023 14:03:57 -0700
Subject: [PATCH] ICU: do not convert locale 'C' to 'en-US-u-va-posix'.
The conversion was intended to be for convenience, but it's more
likely to be confusing than useful.
The user can still directly specify 'en-US-u-va-posix' if desired.
Discussion: https://postgr.es/m/f83f089ee1e9acd5dbbbf3353294d24e1f196e95.camel@j-davis.com
---
src/backend/utils/adt/pg_locale.c | 19 +------------------
src/bin/initdb/initdb.c | 17 +----------------
.../regress/expected/collate.icu.utf8.out | 8 ++++++++
src/test/regress/sql/collate.icu.utf8.sql | 4 ++++
4 files changed, 14 insertions(+), 34 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 51df570ce9..58c4c426bc 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2782,26 +2782,10 @@ icu_language_tag(const char *loc_str, int elevel)
{
#ifdef USE_ICU
UErrorCode status;
- char lang[ULOC_LANG_CAPACITY];
char *langtag;
size_t buflen = 32; /* arbitrary starting buffer size */
const bool strict = true;
- status = U_ZERO_ERROR;
- uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
- {
- if (elevel > 0)
- ereport(elevel,
- (errmsg("could not get language from locale \"%s\": %s",
- loc_str, u_errorName(status))));
- return NULL;
- }
-
- /* C/POSIX locales aren't handled by uloc_getLanguageTag() */
- if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
- return pstrdup("en-US-u-va-posix");
-
/*
* A BCP47 language tag doesn't have a clearly-defined upper limit
* (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
@@ -2889,8 +2873,7 @@ icu_validate_locale(const char *loc_str)
/* check for special language name */
if (strcmp(lang, "") == 0 ||
- strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
- strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+ strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0)
found = true;
/* search for matching language within ICU */
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 2c208ead01..4086834458 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2238,24 +2238,10 @@ icu_language_tag(const char *loc_str)
{
#ifdef USE_ICU
UErrorCode status;
- char lang[ULOC_LANG_CAPACITY];
char *langtag;
size_t buflen = 32; /* arbitrary starting buffer size */
const bool strict = true;
- status = U_ZERO_ERROR;
- uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
- {
- pg_fatal("could not get language from locale \"%s\": %s",
- loc_str, u_errorName(status));
- return NULL;
- }
-
- /* C/POSIX locales aren't handled by uloc_getLanguageTag() */
- if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
- return pstrdup("en-US-u-va-posix");
-
/*
* A BCP47 language tag doesn't have a clearly-defined upper limit
* (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
@@ -2327,8 +2313,7 @@ icu_validate_locale(const char *loc_str)
/* check for special language name */
if (strcmp(lang, "") == 0 ||
- strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
- strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+ strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0)
found = true;
/* search for matching language within ICU */
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index b5a221b030..99f12d2e73 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1020,6 +1020,7 @@ CREATE ROLE regress_test_role;
CREATE SCHEMA test_schema;
-- We need to do this this way to cope with varying names for encodings:
SET client_min_messages TO WARNING;
+SET icu_validation_level = disabled;
do $$
BEGIN
EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1034,17 +1035,24 @@ BEGIN
quote_literal(current_setting('lc_collate')) || ');';
END
$$;
+RESET icu_validation_level;
RESET client_min_messages;
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
ERROR: parameter "locale" must be specified
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
ERROR: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION testx (provider = icu, locale = 'c'); -- fails
+ERROR: could not convert locale name "c" to language tag: U_ILLEGAL_ARGUMENT_ERROR
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
ERROR: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
WARNING: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
+WARNING: could not convert locale name "c" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+WARNING: ICU locale "c" has unknown language "c"
+HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
WARNING: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 85e26951b6..d9778faacc 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -358,6 +358,7 @@ CREATE SCHEMA test_schema;
-- We need to do this this way to cope with varying names for encodings:
SET client_min_messages TO WARNING;
+SET icu_validation_level = disabled;
do $$
BEGIN
@@ -373,13 +374,16 @@ BEGIN
END
$$;
+RESET icu_validation_level;
RESET client_min_messages;
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+CREATE COLLATION testx (provider = icu, locale = 'c'); -- fails
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
RESET icu_validation_level;
--
2.34.1
On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis <pgsql@j-davis.com> wrote:
Most of the complaints seem to be complaints about v15 as well, and
while those complaints may be a reason to not make ICU the default,
they are also an argument that we should continue to learn and try to
fix those issues because they exist in an already-released version.
Leaving it the default for now will help us fix those issues rather
than hide them.
It's still early, so we have plenty of time to revert the initdb
default if we need to.
That's fair enough, but I really think it's important that some energy
get invested in providing adequate documentation for this stuff. Just
patching the code is not enough.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 21.04.23 19:14, Peter Eisentraut wrote:
On 21.04.23 19:09, Sandro Santilli wrote:
On Fri, Apr 21, 2023 at 11:48:51AM -0400, Tom Lane wrote:
"Regina Obe" <lr@pcorp.us> writes:
If they actually are using locale C, I would say this is a bug.
That should designate memcmp sorting and nothing else.
Sounds like a bug to me. This is happening with a PostgreSQL cluster
created and served by a build of commit c04c6c5d6f :
=# select version();
PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
=# show lc_collate;
C
=# select '+' < '-';
f
If the database is created with locale provider ICU, then lc_collate
does not apply here, so the result might be correct (depending on what
locale you have set).
The GUC settings lc_collate and lc_ctype are from a time when those
locale settings were cluster-global. When we made those locale settings
per-database (PG 8.4), we kept them as read-only. As of PG 15, you can
use ICU as the per-database locale provider, so what is being attempted
in the above example is already meaningless before PG 16, since you need
to look into pg_database to find out what is really happening.
I think we should just remove the GUC parameters lc_collate and lc_ctype.
On 22.04.23 01:00, Jeff Davis wrote:
On Fri, 2023-04-21 at 16:33 -0400, Robert Haas wrote:
And the fact that "C" or "POSIX" gets transformed into
"en-US-u-va-posix"
I already expressed, on reflection, that we should probably just not do
that. So I think we're in agreement on this point; patch attached.
This makes sense to me. This way, if someone specifies 'C' locale
together with ICU provider they get an error. They can then choose to
use the libc provider, to get the performance path, or stick with ICU by
using the native spelling of the locale.
On Fri, 2023-04-21 at 16:00 -0400, Tom Lane wrote:
I think I might like this idea, except for one thing: you're
imagining
that the locale doesn't control anything except string comparisons.
What about to_upper/to_lower, character classifications in regexes,
etc?
If provider='libc' and LC_CTYPE='C', str_toupper/str_tolower are
handled with asc_tolower/asc_toupper. The regex character
classification is done with pg_char_properties. In these cases neither
ICU nor libc is used; it's just code in postgres.
libc is special in that you can set LC_COLLATE and LC_CTYPE separately,
so that different locales are used for sorting and character
classification. That's potentially useful to set LC_COLLATE to C for
performance reasons, while setting LC_CTYPE to a useful locale. We
don't allow ICU to set collation and ctype separately (it would be
possible to allow it, but I don't think there's a huge demand and it's
arguably inconsistent to set them differently).
(I'm not sure whether those operations can get redirected to ICU
today
or whether they still always go to libc, but we'll surely want to fix
it eventually if the latter is still true.)
Those operations do get redirected to ICU today. There are extensions
that call locale-sensitive libc functions directly, and obviously those
won't use ICU.
Aside from the user-surprise issues discussed up to now, pg_dump
scripts
emitted by pre-v15 pg_dump are not going to contain LOCALE_PROVIDER
clauses in CREATE DATABASE, and people are going to be very unhappy
if that means they suddenly get totally different locale semantics
after restoring into a new DB.
Agreed.
I think we need some plan for mapping
libc-style locale specs into ICU locales so that we can make that
more nearly transparent.
ICU does a reasonable job mapping libc-like locale names to ICU
locales, e.g. en_US to en-US, etc. The ordering semantics aren't
guaranteed to be the same, of course (because the libc-locales are
platform-dependent), but it's at least conceptually the same locale.
Maybe this means we are not ready to do ICU-by-default in v16.
It certainly feels like there might be more here than we want to
start designing post-feature-freeze.
This thread is already on the Open Items list. As long as it's not too
disruptive to others I'll leave it as-is for now to see how this sorts
out. Right now it's not clear to me how much of this is a v15 issue vs
a v16 issue.
Regards,
Jeff Davis
Jeff Davis wrote:
(I'm not sure whether those operations can get redirected to ICU
today
or whether they still always go to libc, but we'll surely want to fix
it eventually if the latter is still true.)
Those operations do get redirected to ICU today.
FTR the full text search parser still uses the libc functions
is[w]space/alpha/digit... that depend on lc_ctype, whether the db
collation provider is ICU or not.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
"Daniel Verite" <daniel@manitou-mail.org> writes:
FTR the full text search parser still uses the libc functions
is[w]space/alpha/digit... that depend on lc_ctype, whether the db
collation provider is ICU or not.
Yeah, those aren't even connected up to the collation-selection
mechanisms; lots of work to do there. I wonder if they could be
made to use regc_pg_locale.c instead of duplicating logic.
regards, tom lane
On Fri, 2023-04-21 at 22:35 +0100, Andrew Gierth wrote:
Can lc_collate_is_c() be taught to check whether an ICU locale is
using
POSIX collation?
Attached are a few small patches:
0001: don't convert C to en-US-u-va-posix
0002: handle locale C the same regardless of the provider, as you
suggest above
0003: make LOCALE (or --locale) apply to everything including ICU
As far as I can tell, any libc locale has a reasonable match in ICU, so
setting LOCALE to either C or a libc locale name should be fine. Some
locales are only valid in ICU, e.g. '@colStrength=primary', or a
language tag representation, so if you do something like:
create database foo locale 'en_US@colStrenghth=primary'
template template0;
You'll get a decent error like:
ERROR: invalid LC_COLLATE locale name: "en_US@colStrenghth=primary"
HINT: If the locale name is specific to ICU, use ICU_LOCALE.
Overall, I think it works out nicely. Let me know if there are still
some confusing cases. I tried a few variations and this one seemed the
best, but I may have missed something.
Regards,
Jeff Davis
Attachments:
v2-0001-ICU-do-not-convert-locale-C-to-en-US-u-va-posix.patch (text/x-patch; charset=UTF-8)
From c768e040dc92b033e4eb0e69f08b59d8d1ffe1e4 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 21 Apr 2023 14:03:57 -0700
Subject: [PATCH v2 1/3] ICU: do not convert locale 'C' to 'en-US-u-va-posix'.
The conversion was intended to be for convenience, but it's more
likely to be confusing than useful.
The user can still directly specify 'en-US-u-va-posix' if desired.
Discussion: https://postgr.es/m/f83f089ee1e9acd5dbbbf3353294d24e1f196e95.camel@j-davis.com
---
src/backend/utils/adt/pg_locale.c | 19 +------------------
src/bin/initdb/initdb.c | 17 +----------------
.../regress/expected/collate.icu.utf8.out | 8 ++++++++
src/test/regress/sql/collate.icu.utf8.sql | 4 ++++
4 files changed, 14 insertions(+), 34 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 51df570ce9..58c4c426bc 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2782,26 +2782,10 @@ icu_language_tag(const char *loc_str, int elevel)
{
#ifdef USE_ICU
UErrorCode status;
- char lang[ULOC_LANG_CAPACITY];
char *langtag;
size_t buflen = 32; /* arbitrary starting buffer size */
const bool strict = true;
- status = U_ZERO_ERROR;
- uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
- {
- if (elevel > 0)
- ereport(elevel,
- (errmsg("could not get language from locale \"%s\": %s",
- loc_str, u_errorName(status))));
- return NULL;
- }
-
- /* C/POSIX locales aren't handled by uloc_getLanguageTag() */
- if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
- return pstrdup("en-US-u-va-posix");
-
/*
* A BCP47 language tag doesn't have a clearly-defined upper limit
* (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
@@ -2889,8 +2873,7 @@ icu_validate_locale(const char *loc_str)
/* check for special language name */
if (strcmp(lang, "") == 0 ||
- strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
- strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+ strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0)
found = true;
/* search for matching language within ICU */
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 2c208ead01..4086834458 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2238,24 +2238,10 @@ icu_language_tag(const char *loc_str)
{
#ifdef USE_ICU
UErrorCode status;
- char lang[ULOC_LANG_CAPACITY];
char *langtag;
size_t buflen = 32; /* arbitrary starting buffer size */
const bool strict = true;
- status = U_ZERO_ERROR;
- uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
- {
- pg_fatal("could not get language from locale \"%s\": %s",
- loc_str, u_errorName(status));
- return NULL;
- }
-
- /* C/POSIX locales aren't handled by uloc_getLanguageTag() */
- if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
- return pstrdup("en-US-u-va-posix");
-
/*
* A BCP47 language tag doesn't have a clearly-defined upper limit
* (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
@@ -2327,8 +2313,7 @@ icu_validate_locale(const char *loc_str)
/* check for special language name */
if (strcmp(lang, "") == 0 ||
- strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
- strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+ strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0)
found = true;
/* search for matching language within ICU */
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index b5a221b030..99f12d2e73 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1020,6 +1020,7 @@ CREATE ROLE regress_test_role;
CREATE SCHEMA test_schema;
-- We need to do this this way to cope with varying names for encodings:
SET client_min_messages TO WARNING;
+SET icu_validation_level = disabled;
do $$
BEGIN
EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1034,17 +1035,24 @@ BEGIN
quote_literal(current_setting('lc_collate')) || ');';
END
$$;
+RESET icu_validation_level;
RESET client_min_messages;
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
ERROR: parameter "locale" must be specified
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
ERROR: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION testx (provider = icu, locale = 'c'); -- fails
+ERROR: could not convert locale name "c" to language tag: U_ILLEGAL_ARGUMENT_ERROR
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
ERROR: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
WARNING: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
+WARNING: could not convert locale name "c" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+WARNING: ICU locale "c" has unknown language "c"
+HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
WARNING: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 85e26951b6..d9778faacc 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -358,6 +358,7 @@ CREATE SCHEMA test_schema;
-- We need to do this this way to cope with varying names for encodings:
SET client_min_messages TO WARNING;
+SET icu_validation_level = disabled;
do $$
BEGIN
@@ -373,13 +374,16 @@ BEGIN
END
$$;
+RESET icu_validation_level;
RESET client_min_messages;
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+CREATE COLLATION testx (provider = icu, locale = 'c'); -- fails
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
RESET icu_validation_level;
--
2.34.1
v2-0002-ICU-support-locale-C-with-the-same-behavior-as-li.patch (text/x-patch; charset=UTF-8)
From 1302a4b65e4e12753ae15e732dab059afe69dbd9 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 24 Apr 2023 15:46:17 -0700
Subject: [PATCH v2 2/3] ICU: support locale "C" with the same behavior as
libc.
The "C" locale doesn't actually use a provider at all, it's a special
locale that uses memcmp() and built-in character classification. Make
it behave the same in ICU as libc (even though it doesn't actually
make use of either provider).
Discussion: https://postgr.es/m/87v8hoexdv.fsf@news-spur.riddles.org.uk
---
src/backend/commands/collationcmds.c | 43 ++++++----
src/backend/commands/dbcommands.c | 42 +++++----
src/backend/utils/adt/pg_locale.c | 86 ++++++++++++++-----
.../regress/expected/collate.icu.utf8.out | 12 +--
src/test/regress/sql/collate.icu.utf8.sql | 7 +-
5 files changed, 131 insertions(+), 59 deletions(-)
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index c91fe66d9b..7e69a889fb 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -264,26 +264,39 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
errmsg("parameter \"locale\" must be specified")));
- /*
- * During binary upgrade, preserve the locale string. Otherwise,
- * canonicalize to a language tag.
- */
- if (!IsBinaryUpgrade)
+ if (pg_strcasecmp(colliculocale, "C") == 0 ||
+ pg_strcasecmp(colliculocale, "POSIX") == 0)
{
- char *langtag = icu_language_tag(colliculocale,
- icu_validation_level);
-
- if (langtag && strcmp(colliculocale, langtag) != 0)
+ if (!collisdeterministic)
+ ereport(ERROR,
+ (errmsg("nondeterministic collations not supported for C or POSIX locale")));
+ if (collicurules != NULL)
+ ereport(ERROR,
+ (errmsg("RULES not supported for C or POSIX locale")));
+ }
+ else
+ {
+ /*
+ * During binary upgrade, preserve the locale
+ * string. Otherwise, canonicalize to a language tag.
+ */
+ if (!IsBinaryUpgrade)
{
- ereport(NOTICE,
- (errmsg("using standard form \"%s\" for locale \"%s\"",
- langtag, colliculocale)));
+ char *langtag = icu_language_tag(colliculocale,
+ icu_validation_level);
+
+ if (langtag && strcmp(colliculocale, langtag) != 0)
+ {
+ ereport(NOTICE,
+ (errmsg("using standard form \"%s\" for locale \"%s\"",
+ langtag, colliculocale)));
- colliculocale = langtag;
+ colliculocale = langtag;
+ }
}
- }
- icu_validate_locale(colliculocale);
+ icu_validate_locale(colliculocale);
+ }
}
/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2e242eeff2..8ef33871f0 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1058,27 +1058,37 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("ICU locale must be specified")));
- /*
- * During binary upgrade, or when the locale came from the template
- * database, preserve locale string. Otherwise, canonicalize to a
- * language tag.
- */
- if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+ if (pg_strcasecmp(dbiculocale, "C") == 0 ||
+ pg_strcasecmp(dbiculocale, "POSIX") == 0)
{
- char *langtag = icu_language_tag(dbiculocale,
- icu_validation_level);
-
- if (langtag && strcmp(dbiculocale, langtag) != 0)
+ if (dbicurules != NULL)
+ ereport(ERROR,
+ (errmsg("ICU_RULES not supported for C or POSIX locale")));
+ }
+ else
+ {
+ /*
+ * During binary upgrade, or when the locale came from the
+ * template database, preserve locale string. Otherwise,
+ * canonicalize to a language tag.
+ */
+ if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
{
- ereport(NOTICE,
- (errmsg("using standard form \"%s\" for locale \"%s\"",
- langtag, dbiculocale)));
+ char *langtag = icu_language_tag(dbiculocale,
+ icu_validation_level);
+
+ if (langtag && strcmp(dbiculocale, langtag) != 0)
+ {
+ ereport(NOTICE,
+ (errmsg("using standard form \"%s\" for locale \"%s\"",
+ langtag, dbiculocale)));
- dbiculocale = langtag;
+ dbiculocale = langtag;
+ }
}
- }
- icu_validate_locale(dbiculocale);
+ icu_validate_locale(dbiculocale);
+ }
}
else
{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 58c4c426bc..06e7530247 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1246,8 +1246,15 @@ lookup_collation_cache(Oid collation, bool set_flags)
}
else
{
- cache_entry->collate_is_c = false;
- cache_entry->ctype_is_c = false;
+ Datum datum;
+ const char *colliculocale;
+
+ datum = SysCacheGetAttrNotNull(COLLOID, tp, Anum_pg_collation_colliculocale);
+ colliculocale = TextDatumGetCString(datum);
+
+ cache_entry->collate_is_c = ((strcmp(colliculocale, "C") == 0) ||
+ (strcmp(colliculocale, "POSIX") == 0));
+ cache_entry->ctype_is_c = cache_entry->collate_is_c;
}
cache_entry->flags_valid = true;
@@ -1279,16 +1286,27 @@ lc_collate_is_c(Oid collation)
if (collation == DEFAULT_COLLATION_OID)
{
static int result = -1;
- char *localeptr;
-
- if (default_locale.provider == COLLPROVIDER_ICU)
- return false;
+ const char *localeptr;
if (result >= 0)
return (bool) result;
- localeptr = setlocale(LC_COLLATE, NULL);
- if (!localeptr)
- elog(ERROR, "invalid LC_COLLATE setting");
+
+ if (default_locale.provider == COLLPROVIDER_ICU)
+ {
+#ifdef USE_ICU
+ localeptr = default_locale.info.icu.locale;
+#else
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("ICU is not supported in this build")));
+#endif
+ }
+ else
+ {
+ localeptr = setlocale(LC_COLLATE, NULL);
+ if (!localeptr)
+ elog(ERROR, "invalid LC_COLLATE setting");
+ }
if (strcmp(localeptr, "C") == 0)
result = true;
@@ -1332,16 +1350,27 @@ lc_ctype_is_c(Oid collation)
if (collation == DEFAULT_COLLATION_OID)
{
static int result = -1;
- char *localeptr;
-
- if (default_locale.provider == COLLPROVIDER_ICU)
- return false;
+ const char *localeptr;
if (result >= 0)
return (bool) result;
- localeptr = setlocale(LC_CTYPE, NULL);
- if (!localeptr)
- elog(ERROR, "invalid LC_CTYPE setting");
+
+ if (default_locale.provider == COLLPROVIDER_ICU)
+ {
+#ifdef USE_ICU
+ localeptr = default_locale.info.icu.locale;
+#else
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("ICU is not supported in this build")));
+#endif
+ }
+ else
+ {
+ localeptr = setlocale(LC_CTYPE, NULL);
+ if (!localeptr)
+ elog(ERROR, "invalid LC_CTYPE setting");
+ }
if (strcmp(localeptr, "C") == 0)
result = true;
@@ -1375,7 +1404,14 @@ make_icu_collator(const char *iculocstr,
#ifdef USE_ICU
UCollator *collator;
- collator = pg_ucol_open(iculocstr);
+ if (pg_strcasecmp(iculocstr, "C") == 0 ||
+ pg_strcasecmp(iculocstr, "POSIX") == 0)
+ {
+ Assert(icurules == NULL);
+ collator = NULL;
+ }
+ else
+ collator = pg_ucol_open(iculocstr);
/*
* If rules are specified, we extract the rules of the standard collation,
@@ -1650,6 +1686,10 @@ get_collation_actual_version(char collprovider, const char *collcollate)
{
char *collversion = NULL;
+ if (pg_strcasecmp("C", collcollate) == 0 ||
+ pg_strcasecmp("POSIX", collcollate) == 0)
+ return NULL;
+
#ifdef USE_ICU
if (collprovider == COLLPROVIDER_ICU)
{
@@ -1668,9 +1708,7 @@ get_collation_actual_version(char collprovider, const char *collcollate)
else
#endif
if (collprovider == COLLPROVIDER_LIBC &&
- pg_strcasecmp("C", collcollate) != 0 &&
- pg_strncasecmp("C.", collcollate, 2) != 0 &&
- pg_strcasecmp("POSIX", collcollate) != 0)
+ pg_strncasecmp("C.", collcollate, 2) != 0)
{
#if defined(__GLIBC__)
/* Use the glibc version because we don't have anything better. */
@@ -2457,6 +2495,14 @@ pg_ucol_open(const char *loc_str)
if (loc_str == NULL)
elog(ERROR, "opening default collator is not supported");
+ /*
+ * Must never open special values C or POSIX, which are treated specially
+ * and not passed to the provider.
+ */
+ if (pg_strcasecmp(loc_str, "C") == 0 ||
+ pg_strcasecmp(loc_str, "POSIX") == 0)
+ elog(ERROR, "unexpected ICU locale string: %s", loc_str);
+
/*
* In ICU versions 54 and earlier, "und" is not a recognized spelling of
* the root locale. If the first component of the locale is "und", replace
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 99f12d2e73..53ab496bfe 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1042,21 +1042,21 @@ ERROR: parameter "locale" must be specified
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
ERROR: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
-CREATE COLLATION testx (provider = icu, locale = 'c'); -- fails
-ERROR: could not convert locale name "c" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+CREATE COLLATION testx (provider = icu, locale = 'c', deterministic = false); -- fails
+ERROR: nondeterministic collations not supported for C or POSIX locale
+CREATE COLLATION testx (provider = icu, locale = 'c', rules = '&V << w <<< W'); -- fails
+ERROR: RULES not supported for C or POSIX locale
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
ERROR: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
WARNING: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
-CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
-WARNING: could not convert locale name "c" to language tag: U_ILLEGAL_ARGUMENT_ERROR
-WARNING: ICU locale "c" has unknown language "c"
-HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
WARNING: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
RESET icu_validation_level;
+CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'posix'); DROP COLLATION testx;
CREATE COLLATION test4 FROM nonsense;
ERROR: collation "nonsense" for encoding "UTF8" does not exist
CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index d9778faacc..63d5352ee6 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -379,14 +379,17 @@ RESET client_min_messages;
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
-CREATE COLLATION testx (provider = icu, locale = 'c'); -- fails
+CREATE COLLATION testx (provider = icu, locale = 'c', deterministic = false); -- fails
+CREATE COLLATION testx (provider = icu, locale = 'c', rules = '&V << w <<< W'); -- fails
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
-CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
RESET icu_validation_level;
+CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'posix'); DROP COLLATION testx;
+
CREATE COLLATION test4 FROM nonsense;
CREATE COLLATION test5 FROM test0;
--
2.34.1
v2-0003-Make-LOCALE-apply-to-ICU_LOCALE-for-CREATE-DATABA.patch (text/x-patch; charset=UTF-8)
From 1c30ea67e48bab60b7e96847ff3c24880e954471 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 25 Apr 2023 15:01:55 -0700
Subject: [PATCH v2 3/3] Make LOCALE apply to ICU_LOCALE for CREATE DATABASE.
LOCALE is now an alias for LC_COLLATE, LC_CTYPE, and (if the provider
is ICU) ICU_LOCALE. The ICU provider accepts more locale names than
libc (e.g. language tags and locale names containing collation
attributes), so in some cases LC_COLLATE, LC_CTYPE, and ICU_LOCALE
will still need to be specified separately.
Previously, LOCALE applied only to LC_COLLATE and LC_CTYPE (and
similarly for --locale in initdb and createdb). That could lead to
confusion when the provider is implicit, such as when it is inherited
from the template database, or when ICU was made default at initdb
time in commit 27b62377b4.
Reverts incomplete fix 5cd1a5af4d.
Discussion: https://postgr.es/m/3391932.1682107209@sss.pgh.pa.us
---
doc/src/sgml/ref/create_database.sgml | 6 ++--
doc/src/sgml/ref/createdb.sgml | 5 ++-
doc/src/sgml/ref/initdb.sgml | 7 +++--
src/backend/commands/collationcmds.c | 2 +-
src/backend/commands/dbcommands.c | 15 ++++++---
src/bin/initdb/initdb.c | 31 ++++++++++++-------
src/bin/scripts/createdb.c | 13 +++-----
src/bin/scripts/t/020_createdb.pl | 4 +--
src/test/icu/t/010_database.pl | 23 +++++++++-----
.../regress/expected/collate.icu.utf8.out | 22 ++++++-------
10 files changed, 77 insertions(+), 51 deletions(-)
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 13793bb6b7..844773ff44 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -145,8 +145,10 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
<term><replaceable class="parameter">locale</replaceable></term>
<listitem>
<para>
- This is a shortcut for setting <symbol>LC_COLLATE</symbol>
- and <symbol>LC_CTYPE</symbol> at once.
+ This is a shortcut for setting <symbol>LC_COLLATE</symbol>,
+ <symbol>LC_CTYPE</symbol> and <symbol>ICU_LOCALE</symbol> at
+ once. Some locales are only valid for ICU, and must be set separately
+ with <symbol>ICU_LOCALE</symbol>.
</para>
<tip>
<para>
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index e23419ba6c..e4647d5ce7 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -124,7 +124,10 @@ PostgreSQL documentation
<listitem>
<para>
Specifies the locale to be used in this database. This is equivalent
- to specifying both <option>--lc-collate</option> and <option>--lc-ctype</option>.
+ to specifying <option>--lc-collate</option>,
+ <option>--lc-ctype</option>, and <option>--icu-locale</option> to the
+ same value. Some locales are only valid for ICU and must be set with
+ <option>--icu-locale</option>.
</para>
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 87945b4b62..f850dc404d 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -116,9 +116,10 @@ PostgreSQL documentation
<para>
To choose a different locale for the cluster, use the option
<option>--locale</option>. There are also individual options
- <option>--lc-*</option> (see below) to set values for the individual locale
- categories. Note that inconsistent settings for different locale
- categories can give nonsensical results, so this should be used with care.
+ <option>--lc-*</option> and <option>--icu-locale</option> (see below) to
+ set values for the individual locale categories. Note that inconsistent
+ settings for different locale categories can give nonsensical results, so
+ this should be used with care.
</para>
<para>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 7e69a889fb..e481f20dc8 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -288,7 +288,7 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
if (langtag && strcmp(colliculocale, langtag) != 0)
{
ereport(NOTICE,
- (errmsg("using standard form \"%s\" for locale \"%s\"",
+ (errmsg("using standard form \"%s\" for ICU locale \"%s\"",
langtag, colliculocale)));
colliculocale = langtag;
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 8ef33871f0..b447dc55f3 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1017,7 +1017,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
if (dblocprovider == '\0')
dblocprovider = src_locprovider;
if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU)
- dbiculocale = src_iculocale;
+ {
+ if (dlocale && dlocale->arg)
+ dbiculocale = defGetString(dlocale);
+ else
+ dbiculocale = src_iculocale;
+ }
if (dbicurules == NULL && dblocprovider == COLLPROVIDER_ICU)
dbicurules = src_icurules;
@@ -1031,12 +1036,14 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
if (!check_locale(LC_COLLATE, dbcollate, &canonname))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("invalid locale name: \"%s\"", dbcollate)));
+ errmsg("invalid LC_COLLATE locale name: \"%s\"", dbcollate),
+ errhint("If the locale name is specific to ICU, use ICU_LOCALE.")));
dbcollate = canonname;
if (!check_locale(LC_CTYPE, dbctype, &canonname))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("invalid locale name: \"%s\"", dbctype)));
+ errmsg("invalid LC_CTYPE locale name: \"%s\"", dbctype),
+ errhint("If the locale name is specific to ICU, use ICU_LOCALE.")));
dbctype = canonname;
check_encoding_locale_matches(encoding, dbcollate, dbctype);
@@ -1080,7 +1087,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
if (langtag && strcmp(dbiculocale, langtag) != 0)
{
ereport(NOTICE,
- (errmsg("using standard form \"%s\" for locale \"%s\"",
+ (errmsg("using standard form \"%s\" for ICU locale \"%s\"",
langtag, dbiculocale)));
dbiculocale = langtag;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 4086834458..1ef028617e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2157,7 +2157,11 @@ check_locale_name(int category, const char *locale, char **canonname)
if (res == NULL)
{
if (*locale)
- pg_fatal("invalid locale name \"%s\"", locale);
+ {
+ pg_log_error("invalid locale name \"%s\"", locale);
+ pg_log_error_hint("If the locale name is specific to ICU, use --icu-locale.");
+ exit(1);
+ }
else
{
/*
@@ -2391,7 +2395,7 @@ setlocales(void)
{
char *canonname;
- /* set empty lc_* values to locale config if set */
+ /* set empty lc_* and iculocale values to locale config if set */
if (locale)
{
@@ -2407,6 +2411,8 @@ setlocales(void)
lc_monetary = locale;
if (!lc_messages)
lc_messages = locale;
+ if (!icu_locale && locale_provider == COLLPROVIDER_ICU)
+ icu_locale = locale;
}
/*
@@ -2443,14 +2449,18 @@ setlocales(void)
printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
}
- /* canonicalize to a language tag */
- langtag = icu_language_tag(icu_locale);
- printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
- langtag, icu_locale);
- pg_free(icu_locale);
- icu_locale = langtag;
-
- icu_validate_locale(icu_locale);
+ if (pg_strcasecmp(icu_locale, "C") != 0 &&
+ pg_strcasecmp(icu_locale, "POSIX") != 0)
+ {
+ /* canonicalize to a language tag */
+ langtag = icu_language_tag(icu_locale);
+ printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
+ langtag, icu_locale);
+ pg_free(icu_locale);
+ icu_locale = langtag;
+
+ icu_validate_locale(icu_locale);
+ }
/*
* In supported builds, the ICU locale ID will be opened during
@@ -3282,7 +3292,6 @@ main(int argc, char *argv[])
break;
case 8:
locale = "C";
- locale_provider = COLLPROVIDER_LIBC;
break;
case 9:
pwfilename = pg_strdup(optarg);
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index b4205c4fa5..9ca86a3e53 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -164,14 +164,6 @@ main(int argc, char *argv[])
exit(1);
}
- if (locale)
- {
- if (!lc_ctype)
- lc_ctype = locale;
- if (!lc_collate)
- lc_collate = locale;
- }
-
if (encoding)
{
if (pg_char_to_encoding(encoding) < 0)
@@ -219,6 +211,11 @@ main(int argc, char *argv[])
appendPQExpBuffer(&sql, " STRATEGY %s", fmtId(strategy));
if (template)
appendPQExpBuffer(&sql, " TEMPLATE %s", fmtId(template));
+ if (locale)
+ {
+ appendPQExpBufferStr(&sql, " LOCALE ");
+ appendStringLiteralConn(&sql, locale, conn);
+ }
if (lc_collate)
{
appendPQExpBufferStr(&sql, " LC_COLLATE ");
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index af3b1492e3..3db9fe931f 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -126,7 +126,7 @@ $node->command_checks_all(
1,
[qr/^$/],
[
- qr/^createdb: error: database creation failed: ERROR: invalid locale name|^createdb: error: database creation failed: ERROR: new collation \(foo'; SELECT '1\) is incompatible with the collation of the template database/s
+ qr/^createdb: error: database creation failed: ERROR: invalid LC_COLLATE locale name|^createdb: error: database creation failed: ERROR: new collation \(foo'; SELECT '1\) is incompatible with the collation of the template database/s
],
'createdb with incorrect --lc-collate');
$node->command_checks_all(
@@ -134,7 +134,7 @@ $node->command_checks_all(
1,
[qr/^$/],
[
- qr/^createdb: error: database creation failed: ERROR: invalid locale name|^createdb: error: database creation failed: ERROR: new LC_CTYPE \(foo'; SELECT '1\) is incompatible with the LC_CTYPE of the template database/s
+ qr/^createdb: error: database creation failed: ERROR: invalid LC_CTYPE locale name|^createdb: error: database creation failed: ERROR: new LC_CTYPE \(foo'; SELECT '1\) is incompatible with the LC_CTYPE of the template database/s
],
'createdb with incorrect --lc-ctype');
diff --git a/src/test/icu/t/010_database.pl b/src/test/icu/t/010_database.pl
index 715b1bffd6..df4af00afe 100644
--- a/src/test/icu/t/010_database.pl
+++ b/src/test/icu/t/010_database.pl
@@ -51,16 +51,23 @@ b),
'sort by explicit collation upper first');
-# Test error cases in CREATE DATABASE involving locale-related options
+# Test that LOCALE='C' works for ICU
-my ($ret, $stdout, $stderr) = $node1->psql('postgres',
- q{CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'C' TEMPLATE template0 ENCODING UTF8});
-isnt($ret, 0,
- "ICU locale must be specified for ICU provider: exit code not 0");
+my $ret1 = $node1->psql('postgres',
+ q{CREATE DATABASE dbicu2 LOCALE_PROVIDER icu LOCALE 'C' TEMPLATE template0 ENCODING UTF8});
+is($ret1, 0,
+ "C locale works for ICU");
+
+# Test that ICU-specific locale string must be specified with ICU_LOCALE,
+# not LOCALE
+
+my ($ret2, $stdout, $stderr) = $node1->psql('postgres',
+ q{CREATE DATABASE dbicu3 LOCALE_PROVIDER icu LOCALE '@colStrength=primary' TEMPLATE template0 ENCODING UTF8});
+isnt($ret2, 0,
+ "ICU-specific locale must be specified with ICU_LOCALE: exit code not 0");
like(
$stderr,
- qr/ERROR: ICU locale must be specified/,
- "ICU locale must be specified for ICU provider: error message");
-
+ qr/ERROR: invalid LC_COLLATE locale name/,
+ "ICU-specific locale must be specified with ICU_LOCALE: error message");
done_testing();
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 53ab496bfe..ecceb6d10c 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1202,9 +1202,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
(1 row)
CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
-NOTICE: using standard form "und-u-kf-lower" for locale "@colCaseFirst=lower"
+NOTICE: using standard form "und-u-kf-lower" for ICU locale "@colCaseFirst=lower"
CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
-NOTICE: using standard form "und-u-kf-upper" for locale "@colCaseFirst=upper"
+NOTICE: using standard form "und-u-kf-upper" for ICU locale "@colCaseFirst=upper"
SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
?column? | ?column?
----------+----------
@@ -1212,7 +1212,7 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
(1 row)
CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
-NOTICE: using standard form "und-u-ka-shifted" for locale "@colAlternate=shifted"
+NOTICE: using standard form "und-u-ka-shifted" for ICU locale "@colAlternate=shifted"
SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
?column? | ?column?
----------+----------
@@ -1229,12 +1229,12 @@ SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_n
(1 row)
CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
-NOTICE: using standard form "und-u-kn-lower" for locale "@colNumeric=lower"
+NOTICE: using standard form "und-u-kn-lower" for ICU locale "@colNumeric=lower"
ERROR: could not open collator for locale "und-u-kn-lower": U_ILLEGAL_ARGUMENT_ERROR
-- test that attributes not handled by icu_set_collation_attributes()
-- (handled by ucol_open() directly) also work
CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook');
-NOTICE: using standard form "de-u-co-phonebk" for locale "de@collation=phonebook"
+NOTICE: using standard form "de-u-co-phonebk" for ICU locale "de@collation=phonebook"
SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook;
?column? | ?column?
----------+----------
@@ -1243,7 +1243,7 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes
-- rules
CREATE COLLATION testcoll_rules1 (provider = icu, locale = '', rules = '&a < g');
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE TABLE test7 (a text);
-- example from https://unicode-org.github.io/icu/userguide/collation/customization/#syntax
INSERT INTO test7 VALUES ('Abernathy'), ('apple'), ('bird'), ('Boston'), ('Graham'), ('green');
@@ -1271,13 +1271,13 @@ SELECT * FROM test7 ORDER BY a COLLATE testcoll_rules1;
DROP TABLE test7;
CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!!');
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
ERROR: could not open collator for locale "und" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
-- nondeterministic collations
CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE TABLE test6 (a int, b text);
-- same string in different normal forms
INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1327,9 +1327,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet;
(2 rows)
CREATE COLLATION case_sensitive (provider = icu, locale = '');
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false);
-NOTICE: using standard form "und-u-ks-level2" for locale "@colStrength=secondary"
+NOTICE: using standard form "und-u-ks-level2" for ICU locale "@colStrength=secondary"
SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
?column? | ?column?
----------+----------
--
2.34.1
Jeff Davis wrote:
Attached are a few small patches:
0001: don't convert C to en-US-u-va-posix
0002: handle locale C the same regardless of the provider, as you
suggest above
0003: make LOCALE (or --locale) apply to everything including ICU
Testing this briefly, I noticed two regressions:
1) all pg_collation.collversion are empty due to a trivial bug in 0002:
@@ -1650,6 +1686,10 @@ get_collation_actual_version(char collprovider, const char *collcollate)
{
char *collversion = NULL;
+ if (pg_strcasecmp("C", collcollate) ||
+ pg_strcasecmp("POSIX", collcollate))
+ return NULL;
+
This should be pg_strcasecmp(...) == 0
2) The following works with HEAD (default provider=icu) but errors out with
the patches:
postgres=# create database lat9 locale 'fr_FR@euro' encoding LATIN9 template 'template0';
ERROR: could not convert locale name "fr_FR@euro" to language tag: U_ILLEGAL_ARGUMENT_ERROR
fr_FR@euro is a libc locale name:
$ locale -a|grep fr_FR
fr_FR
fr_FR@euro
fr_FR.iso88591
fr_FR.iso885915@euro
fr_FR.utf8
I understand that fr_FR@euro is taken as an ICU locale name, on the idea
that the locale syntax is more or less compatible between both providers,
so this should work smoothly. 0003 seems to go further in the
interpretation and fails on it.
TBH, the assumption that it's OK to feed libc locale names to ICU feels
quite uncomfortable.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
On Thu, 2023-04-27 at 14:23 +0200, Daniel Verite wrote:
This should be pg_strcasecmp(...) == 0
Good catch, thank you! Fixed in updated patches.
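To make the bug concrete: a nonzero return from a string comparison means "differs", so OR-ing two unchecked results is true for every locale name, which matches the report that all collversion values came back empty. The following is only a minimal sketch; `sketch_strcasecmp` is a hypothetical stand-in for pg_strcasecmp() and the two predicate names are invented for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical stand-in for pg_strcasecmp(): ASCII case-insensitive compare. */
static int sketch_strcasecmp(const char *a, const char *b)
{
    for (;; a++, b++)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return ca - cb;
        if (ca == 0)
            return 0;
    }
}

/*
 * Shape of the condition in the earlier patch: true when the name differs
 * from "C" OR differs from "POSIX". No name equals both, so this is
 * always true.
 */
bool is_c_locale_buggy(const char *name)
{
    return sketch_strcasecmp("C", name) || sketch_strcasecmp("POSIX", name);
}

/* Corrected shape: each comparison result is tested against zero. */
bool is_c_locale_fixed(const char *name)
{
    return sketch_strcasecmp("C", name) == 0 ||
           sketch_strcasecmp("POSIX", name) == 0;
}
```

With the corrected form, only spellings of C and POSIX (in any letter case) match; a name such as C.UTF-8 does not.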
postgres=# create database lat9 locale 'fr_FR@euro' encoding LATIN9 template 'template0';
ERROR: could not convert locale name "fr_FR@euro" to language tag: U_ILLEGAL_ARGUMENT_ERROR
ICU 63 and earlier convert it without error to the language tag
'fr-FR-u-cu-eur', which is correct. ICU 64 removed support for
transforming some locale variants, because apparently they think those
variants are obsolete:
https://unicode-org.atlassian.net/browse/ICU-22268
https://unicode-org.atlassian.net/browse/ICU-20187
(Aside: how obsolete are those variants?)
It's frustrating that they'd remove such transformations from the
canonicalization process.
Fortunately, it looks like it's easy enough to do the transformation
ourselves. The only problematic format is '...@VARIANT'. The other
format 'fr_FR_EURO' doesn't seem to be a valid glibc locale name[1] and
Windows seems to use BCP 47[2].
And there don't seem to be a lot of variants to handle. ICU 63 only
handles 3 variants, so that's what my patch does. Any unknown variant
between 5 and 8 characters won't throw an error. There could be more
problem cases, but I'm not sure how much of a practical problem they
are.
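As a rough illustration of doing that transformation by hand, here is a sketch that rewrites only the euro case discussed above. It is not the code from the patch: `fixup_libc_variant` is a hypothetical name, and the `@currency=EUR` keyword spelling is an assumption about what ICU would then accept and canonicalize to the 'fr-FR-u-cu-eur' form. Any other input is copied through unchanged.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Case-insensitive equality for short ASCII strings. */
static int equals_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0')
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/*
 * Hypothetical helper, not the function from the patch. Writes the
 * fixed-up locale string into out. Returns 1 if a rewrite happened, 0 if
 * the input was copied unchanged, -1 if out is too small.
 */
int fixup_libc_variant(const char *loc, char *out, size_t outlen)
{
    const char *at = strchr(loc, '@');
    int         n;

    if (at != NULL && equals_ignore_case(at + 1, "euro"))
    {
        /* Keep the part before '@', then append the assumed ICU keyword. */
        n = snprintf(out, outlen, "%.*s@currency=EUR", (int) (at - loc), loc);
        return (n >= 0 && (size_t) n < outlen) ? 1 : -1;
    }

    /* Unknown or absent variant: pass the string through untouched. */
    n = snprintf(out, outlen, "%s", loc);
    return (n >= 0 && (size_t) n < outlen) ? 0 : -1;
}
```

Passing "fr_FR@euro" yields "fr_FR@currency=EUR" in the output buffer, while strings such as "de_DE" or "@colStrength=primary" come back as they went in.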
If we try to keep the meaning of LOCALE to only LC_COLLATE and
LC_CTYPE, that will continue to be confusing for anyone that uses
provider=icu.
Regards,
Jeff Davis
[1]: https://www.gnu.org/software/libc/manual/html_node/Locale-Names.html
[2]: https://learn.microsoft.com/en-us/windows/win32/intl/locale-names
Attachments:
v3-0001-ICU-do-not-convert-locale-C-to-en-US-u-va-posix.patch (text/x-patch; charset=UTF-8)
From 6c0251c584edea64148604da52c8e55e43fe36e6 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 21 Apr 2023 14:03:57 -0700
Subject: [PATCH v3 1/4] ICU: do not convert locale 'C' to 'en-US-u-va-posix'.
The conversion was intended to be for convenience, but it's more
likely to be confusing than useful.
The user can still directly specify 'en-US-u-va-posix' if desired.
Discussion: https://postgr.es/m/f83f089ee1e9acd5dbbbf3353294d24e1f196e95.camel@j-davis.com
---
src/backend/utils/adt/pg_locale.c | 19 +------------------
src/bin/initdb/initdb.c | 17 +----------------
.../regress/expected/collate.icu.utf8.out | 8 ++++++++
src/test/regress/sql/collate.icu.utf8.sql | 4 ++++
4 files changed, 14 insertions(+), 34 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 51df570ce9..58c4c426bc 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2782,26 +2782,10 @@ icu_language_tag(const char *loc_str, int elevel)
{
#ifdef USE_ICU
UErrorCode status;
- char lang[ULOC_LANG_CAPACITY];
char *langtag;
size_t buflen = 32; /* arbitrary starting buffer size */
const bool strict = true;
- status = U_ZERO_ERROR;
- uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
- {
- if (elevel > 0)
- ereport(elevel,
- (errmsg("could not get language from locale \"%s\": %s",
- loc_str, u_errorName(status))));
- return NULL;
- }
-
- /* C/POSIX locales aren't handled by uloc_getLanguageTag() */
- if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
- return pstrdup("en-US-u-va-posix");
-
/*
* A BCP47 language tag doesn't have a clearly-defined upper limit
* (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
@@ -2889,8 +2873,7 @@ icu_validate_locale(const char *loc_str)
/* check for special language name */
if (strcmp(lang, "") == 0 ||
- strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
- strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+ strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0)
found = true;
/* search for matching language within ICU */
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 2c208ead01..4086834458 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2238,24 +2238,10 @@ icu_language_tag(const char *loc_str)
{
#ifdef USE_ICU
UErrorCode status;
- char lang[ULOC_LANG_CAPACITY];
char *langtag;
size_t buflen = 32; /* arbitrary starting buffer size */
const bool strict = true;
- status = U_ZERO_ERROR;
- uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
- {
- pg_fatal("could not get language from locale \"%s\": %s",
- loc_str, u_errorName(status));
- return NULL;
- }
-
- /* C/POSIX locales aren't handled by uloc_getLanguageTag() */
- if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
- return pstrdup("en-US-u-va-posix");
-
/*
* A BCP47 language tag doesn't have a clearly-defined upper limit
* (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
@@ -2327,8 +2313,7 @@ icu_validate_locale(const char *loc_str)
/* check for special language name */
if (strcmp(lang, "") == 0 ||
- strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
- strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+ strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0)
found = true;
/* search for matching language within ICU */
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index b5a221b030..99f12d2e73 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1020,6 +1020,7 @@ CREATE ROLE regress_test_role;
CREATE SCHEMA test_schema;
-- We need to do this this way to cope with varying names for encodings:
SET client_min_messages TO WARNING;
+SET icu_validation_level = disabled;
do $$
BEGIN
EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1034,17 +1035,24 @@ BEGIN
quote_literal(current_setting('lc_collate')) || ');';
END
$$;
+RESET icu_validation_level;
RESET client_min_messages;
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
ERROR: parameter "locale" must be specified
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
ERROR: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION testx (provider = icu, locale = 'c'); -- fails
+ERROR: could not convert locale name "c" to language tag: U_ILLEGAL_ARGUMENT_ERROR
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
ERROR: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
WARNING: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
+WARNING: could not convert locale name "c" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+WARNING: ICU locale "c" has unknown language "c"
+HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
WARNING: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 85e26951b6..d9778faacc 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -358,6 +358,7 @@ CREATE SCHEMA test_schema;
-- We need to do this this way to cope with varying names for encodings:
SET client_min_messages TO WARNING;
+SET icu_validation_level = disabled;
do $$
BEGIN
@@ -373,13 +374,16 @@ BEGIN
END
$$;
+RESET icu_validation_level;
RESET client_min_messages;
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+CREATE COLLATION testx (provider = icu, locale = 'c'); -- fails
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
RESET icu_validation_level;
--
2.34.1
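For reference, the memcmp() ordering that locale C is expected to give throughout this thread can be sketched in a few lines. This is an illustration only, not PostgreSQL's comparison code; `c_locale_compare` is a hypothetical name. The test strings are the ones from the PostGIS report earlier in the thread.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: byte-wise comparison in the style of locale C.
 * Returns a negative value, zero, or a positive value. When one string
 * is a prefix of the other, the shorter one sorts first.
 */
int c_locale_compare(const char *a, const char *b)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);
    size_t n = la < lb ? la : lb;
    int    r = memcmp(a, b, n);

    if (r != 0)
        return r;
    return (la > lb) - (la < lb);
}
```

Under this ordering 'RM(25)/nodes|+|21|1' sorts before 'RM(25)/nodes|-|21|' because '+' (0x2B) is a smaller byte than '-' (0x2D), which agrees with the "C" collation result quoted earlier in the thread.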
v3-0002-ICU-support-locale-C-with-the-same-behavior-as-li.patch (text/x-patch; charset=UTF-8)
From 22a8ba5748953fbc577f7aeb8d8d85d185364fb7 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 24 Apr 2023 15:46:17 -0700
Subject: [PATCH v3 2/4] ICU: support locale "C" with the same behavior as
libc.
The "C" locale doesn't actually use a provider at all, it's a special
locale that uses memcmp() and built-in character classification. Make
it behave the same in ICU as libc (even though it doesn't actually
make use of either provider).
Discussion: https://postgr.es/m/87v8hoexdv.fsf@news-spur.riddles.org.uk
---
src/backend/commands/collationcmds.c | 43 ++++++----
src/backend/commands/dbcommands.c | 42 +++++----
src/backend/utils/adt/pg_locale.c | 86 ++++++++++++++-----
.../regress/expected/collate.icu.utf8.out | 12 +--
src/test/regress/sql/collate.icu.utf8.sql | 7 +-
5 files changed, 131 insertions(+), 59 deletions(-)
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index c91fe66d9b..7e69a889fb 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -264,26 +264,39 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
errmsg("parameter \"locale\" must be specified")));
- /*
- * During binary upgrade, preserve the locale string. Otherwise,
- * canonicalize to a language tag.
- */
- if (!IsBinaryUpgrade)
+ if (pg_strcasecmp(colliculocale, "C") == 0 ||
+ pg_strcasecmp(colliculocale, "POSIX") == 0)
{
- char *langtag = icu_language_tag(colliculocale,
- icu_validation_level);
-
- if (langtag && strcmp(colliculocale, langtag) != 0)
+ if (!collisdeterministic)
+ ereport(ERROR,
+ (errmsg("nondeterministic collations not supported for C or POSIX locale")));
+ if (collicurules != NULL)
+ ereport(ERROR,
+ (errmsg("RULES not supported for C or POSIX locale")));
+ }
+ else
+ {
+ /*
+ * During binary upgrade, preserve the locale
+ * string. Otherwise, canonicalize to a language tag.
+ */
+ if (!IsBinaryUpgrade)
{
- ereport(NOTICE,
- (errmsg("using standard form \"%s\" for locale \"%s\"",
- langtag, colliculocale)));
+ char *langtag = icu_language_tag(colliculocale,
+ icu_validation_level);
+
+ if (langtag && strcmp(colliculocale, langtag) != 0)
+ {
+ ereport(NOTICE,
+ (errmsg("using standard form \"%s\" for locale \"%s\"",
+ langtag, colliculocale)));
- colliculocale = langtag;
+ colliculocale = langtag;
+ }
}
- }
- icu_validate_locale(colliculocale);
+ icu_validate_locale(colliculocale);
+ }
}
/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2e242eeff2..8ef33871f0 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1058,27 +1058,37 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("ICU locale must be specified")));
- /*
- * During binary upgrade, or when the locale came from the template
- * database, preserve locale string. Otherwise, canonicalize to a
- * language tag.
- */
- if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+ if (pg_strcasecmp(dbiculocale, "C") == 0 ||
+ pg_strcasecmp(dbiculocale, "POSIX") == 0)
{
- char *langtag = icu_language_tag(dbiculocale,
- icu_validation_level);
-
- if (langtag && strcmp(dbiculocale, langtag) != 0)
+ if (dbicurules != NULL)
+ ereport(ERROR,
+ (errmsg("ICU_RULES not supported for C or POSIX locale")));
+ }
+ else
+ {
+ /*
+ * During binary upgrade, or when the locale came from the
+ * template database, preserve locale string. Otherwise,
+ * canonicalize to a language tag.
+ */
+ if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
{
- ereport(NOTICE,
- (errmsg("using standard form \"%s\" for locale \"%s\"",
- langtag, dbiculocale)));
+ char *langtag = icu_language_tag(dbiculocale,
+ icu_validation_level);
+
+ if (langtag && strcmp(dbiculocale, langtag) != 0)
+ {
+ ereport(NOTICE,
+ (errmsg("using standard form \"%s\" for locale \"%s\"",
+ langtag, dbiculocale)));
- dbiculocale = langtag;
+ dbiculocale = langtag;
+ }
}
- }
- icu_validate_locale(dbiculocale);
+ icu_validate_locale(dbiculocale);
+ }
}
else
{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 58c4c426bc..3e19b21122 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1246,8 +1246,15 @@ lookup_collation_cache(Oid collation, bool set_flags)
}
else
{
- cache_entry->collate_is_c = false;
- cache_entry->ctype_is_c = false;
+ Datum datum;
+ const char *colliculocale;
+
+ datum = SysCacheGetAttrNotNull(COLLOID, tp, Anum_pg_collation_colliculocale);
+ colliculocale = TextDatumGetCString(datum);
+
+ cache_entry->collate_is_c = ((strcmp(colliculocale, "C") == 0) ||
+ (strcmp(colliculocale, "POSIX") == 0));
+ cache_entry->ctype_is_c = cache_entry->collate_is_c;
}
cache_entry->flags_valid = true;
@@ -1279,16 +1286,27 @@ lc_collate_is_c(Oid collation)
if (collation == DEFAULT_COLLATION_OID)
{
static int result = -1;
- char *localeptr;
-
- if (default_locale.provider == COLLPROVIDER_ICU)
- return false;
+ const char *localeptr;
if (result >= 0)
return (bool) result;
- localeptr = setlocale(LC_COLLATE, NULL);
- if (!localeptr)
- elog(ERROR, "invalid LC_COLLATE setting");
+
+ if (default_locale.provider == COLLPROVIDER_ICU)
+ {
+#ifdef USE_ICU
+ localeptr = default_locale.info.icu.locale;
+#else
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("ICU is not supported in this build")));
+#endif
+ }
+ else
+ {
+ localeptr = setlocale(LC_COLLATE, NULL);
+ if (!localeptr)
+ elog(ERROR, "invalid LC_COLLATE setting");
+ }
if (strcmp(localeptr, "C") == 0)
result = true;
@@ -1332,16 +1350,27 @@ lc_ctype_is_c(Oid collation)
if (collation == DEFAULT_COLLATION_OID)
{
static int result = -1;
- char *localeptr;
-
- if (default_locale.provider == COLLPROVIDER_ICU)
- return false;
+ const char *localeptr;
if (result >= 0)
return (bool) result;
- localeptr = setlocale(LC_CTYPE, NULL);
- if (!localeptr)
- elog(ERROR, "invalid LC_CTYPE setting");
+
+ if (default_locale.provider == COLLPROVIDER_ICU)
+ {
+#ifdef USE_ICU
+ localeptr = default_locale.info.icu.locale;
+#else
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("ICU is not supported in this build")));
+#endif
+ }
+ else
+ {
+ localeptr = setlocale(LC_CTYPE, NULL);
+ if (!localeptr)
+ elog(ERROR, "invalid LC_CTYPE setting");
+ }
if (strcmp(localeptr, "C") == 0)
result = true;
@@ -1375,7 +1404,14 @@ make_icu_collator(const char *iculocstr,
#ifdef USE_ICU
UCollator *collator;
- collator = pg_ucol_open(iculocstr);
+ if (pg_strcasecmp(iculocstr, "C") == 0 ||
+ pg_strcasecmp(iculocstr, "POSIX") == 0)
+ {
+ Assert(icurules == NULL);
+ collator = NULL;
+ }
+ else
+ collator = pg_ucol_open(iculocstr);
/*
* If rules are specified, we extract the rules of the standard collation,
@@ -1650,6 +1686,10 @@ get_collation_actual_version(char collprovider, const char *collcollate)
{
char *collversion = NULL;
+ if (pg_strcasecmp("C", collcollate) == 0 ||
+ pg_strcasecmp("POSIX", collcollate) == 0)
+ return NULL;
+
#ifdef USE_ICU
if (collprovider == COLLPROVIDER_ICU)
{
@@ -1668,9 +1708,7 @@ get_collation_actual_version(char collprovider, const char *collcollate)
else
#endif
if (collprovider == COLLPROVIDER_LIBC &&
- pg_strcasecmp("C", collcollate) != 0 &&
- pg_strncasecmp("C.", collcollate, 2) != 0 &&
- pg_strcasecmp("POSIX", collcollate) != 0)
+ pg_strncasecmp("C.", collcollate, 2) != 0)
{
#if defined(__GLIBC__)
/* Use the glibc version because we don't have anything better. */
@@ -2457,6 +2495,14 @@ pg_ucol_open(const char *loc_str)
if (loc_str == NULL)
elog(ERROR, "opening default collator is not supported");
+ /*
+ * Must never open special values C or POSIX, which are treated specially
+ * and not passed to the provider.
+ */
+ if (pg_strcasecmp(loc_str, "C") == 0 ||
+ pg_strcasecmp(loc_str, "POSIX") == 0)
+ elog(ERROR, "unexpected ICU locale string: %s", loc_str);
+
/*
* In ICU versions 54 and earlier, "und" is not a recognized spelling of
* the root locale. If the first component of the locale is "und", replace
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 99f12d2e73..53ab496bfe 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1042,21 +1042,21 @@ ERROR: parameter "locale" must be specified
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
ERROR: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
-CREATE COLLATION testx (provider = icu, locale = 'c'); -- fails
-ERROR: could not convert locale name "c" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+CREATE COLLATION testx (provider = icu, locale = 'c', deterministic = false); -- fails
+ERROR: nondeterministic collations not supported for C or POSIX locale
+CREATE COLLATION testx (provider = icu, locale = 'c', rules = '&V << w <<< W'); -- fails
+ERROR: RULES not supported for C or POSIX locale
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
ERROR: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
WARNING: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
-CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
-WARNING: could not convert locale name "c" to language tag: U_ILLEGAL_ARGUMENT_ERROR
-WARNING: ICU locale "c" has unknown language "c"
-HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
WARNING: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
RESET icu_validation_level;
+CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'posix'); DROP COLLATION testx;
CREATE COLLATION test4 FROM nonsense;
ERROR: collation "nonsense" for encoding "UTF8" does not exist
CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index d9778faacc..63d5352ee6 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -379,14 +379,17 @@ RESET client_min_messages;
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
-CREATE COLLATION testx (provider = icu, locale = 'c'); -- fails
+CREATE COLLATION testx (provider = icu, locale = 'c', deterministic = false); -- fails
+CREATE COLLATION testx (provider = icu, locale = 'c', rules = '&V << w <<< W'); -- fails
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
-CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
RESET icu_validation_level;
+CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'posix'); DROP COLLATION testx;
+
CREATE COLLATION test4 FROM nonsense;
CREATE COLLATION test5 FROM test0;
--
2.34.1
v3-0003-ICU-fix-up-old-libc-style-locale-strings.patch (text/x-patch; charset=UTF-8)
From b33dc56960378a1047ccf9c0387a1fe333912140 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 28 Apr 2023 12:22:41 -0700
Subject: [PATCH v3 3/4] ICU: fix up old libc-style locale strings.
Before transforming a locale string into a language tag, fix up old
libc-style locale strings such as 'de__PHONEBOOK' or
'fr_FR@EURO'. Older ICU versions did this automatically, but ICU
version 64 removed that support.
---
src/backend/utils/adt/pg_locale.c | 59 ++++++++++++++++-
src/bin/initdb/initdb.c | 63 ++++++++++++++++++-
.../regress/expected/collate.icu.utf8.out | 11 ++++
src/test/regress/sql/collate.icu.utf8.sql | 7 +++
4 files changed, 138 insertions(+), 2 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 3e19b21122..9f2c139b0b 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2812,6 +2812,60 @@ icu_set_collation_attributes(UCollator *collator, const char *loc,
pfree(lower_str);
}
+static const char *icu_variant_map[][2] = {
+ { "@EURO", "@currency=EUR" },
+ { "@PINYIN", "@collation=pinyin" },
+ { "@STROKE", "@collation=stroke" },
+};
+
+#define ICU_VARIANT_MAP_SIZE \
+ (sizeof(icu_variant_map)/sizeof(icu_variant_map[0]))
+
+/*
+ * ICU version 64 removed the ability to transform locale strings of the form
+ * '...@VARIANT' into proper language tags. Perform the transformation from
+ * within Postgres so that ICU supports any libc locale name consistently,
+ * regardless of the ICU version.
+ */
+static char *
+icu_fix_variants(const char *loc_str)
+{
+ const char *old_variant = strrchr(loc_str, '@');
+
+ /*
+ * Extract a variant of the form '...@VARIANT', and replace with
+ * the appropriate '...@keyword=value' if found in the map.
+ */
+ if (old_variant)
+ {
+ size_t prefix_len = old_variant - loc_str; /* bytes before the '@' */
+
+ for (int i = 0; i < ICU_VARIANT_MAP_SIZE; i++)
+ {
+ const char *map_variant = icu_variant_map[i][0];
+ const char *map_replacement = icu_variant_map[i][1];
+
+ if (pg_strcasecmp(old_variant, map_variant) == 0)
+ {
+ size_t replacement_len = strlen(map_replacement);
+ size_t result_len;
+ char *result;
+
+ result_len = prefix_len + replacement_len + 1;
+ result = palloc(result_len);
+
+ memcpy(result, loc_str, prefix_len);
+ memcpy(result + prefix_len, map_replacement, replacement_len);
+ result[prefix_len + replacement_len] = '\0';
+
+ return result;
+ }
+ }
+ }
+
+ return pstrdup(loc_str);
+}
+
#endif
/*
@@ -2828,6 +2882,7 @@ icu_language_tag(const char *loc_str, int elevel)
{
#ifdef USE_ICU
UErrorCode status;
+ char *fixed_loc_str = icu_fix_variants(loc_str);
char *langtag;
size_t buflen = 32; /* arbitrary starting buffer size */
const bool strict = true;
@@ -2844,7 +2899,7 @@ icu_language_tag(const char *loc_str, int elevel)
int32_t len;
status = U_ZERO_ERROR;
- len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+ len = uloc_toLanguageTag(fixed_loc_str, langtag, buflen, strict, &status);
/*
* If the result fits in the buffer exactly (len == buflen),
@@ -2864,6 +2919,8 @@ icu_language_tag(const char *loc_str, int elevel)
break;
}
+ pfree(fixed_loc_str);
+
if (U_FAILURE(status))
{
pfree(langtag);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 4086834458..600c8d93f3 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2229,6 +2229,64 @@ check_icu_locale_encoding(int user_enc)
return true;
}
+#ifdef USE_ICU
+
+static const char *icu_variant_map[][2] = {
+ { "@EURO", "@currency=EUR" },
+ { "@PINYIN", "@collation=pinyin" },
+ { "@STROKE", "@collation=stroke" },
+};
+
+#define ICU_VARIANT_MAP_SIZE \
+ (sizeof(icu_variant_map)/sizeof(icu_variant_map[0]))
+
+/*
+ * ICU version 64 removed the ability to transform locale strings of the form
+ * '...@VARIANT' into proper language tags. Perform the transformation from
+ * within Postgres so that ICU supports any libc locale name consistently,
+ * regardless of the ICU version.
+ */
+static char *
+icu_fix_variants(const char *loc_str)
+{
+ const char *old_variant = strrchr(loc_str, '@');
+
+ /*
+ * Extract a variant of the form '...@VARIANT', and replace with
+ * the appropriate '...@keyword=value' if found in the map.
+ */
+ if (old_variant)
+ {
+ size_t prefix_len = old_variant - loc_str; /* bytes before the '@' */
+
+ for (int i = 0; i < ICU_VARIANT_MAP_SIZE; i++)
+ {
+ const char *map_variant = icu_variant_map[i][0];
+ const char *map_replacement = icu_variant_map[i][1];
+
+ if (pg_strcasecmp(old_variant, map_variant) == 0)
+ {
+ size_t replacement_len = strlen(map_replacement);
+ size_t result_len;
+ char *result;
+
+ result_len = prefix_len + replacement_len + 1;
+ result = pg_malloc(result_len);
+
+ memcpy(result, loc_str, prefix_len);
+ memcpy(result + prefix_len, map_replacement, replacement_len);
+ result[prefix_len + replacement_len] = '\0';
+
+ return result;
+ }
+ }
+ }
+
+ return pg_strdup(loc_str);
+}
+
+#endif
+
/*
* Convert to canonical BCP47 language tag. Must be consistent with
* icu_language_tag().
@@ -2238,6 +2296,7 @@ icu_language_tag(const char *loc_str)
{
#ifdef USE_ICU
UErrorCode status;
+ char *fixed_loc_str = icu_fix_variants(loc_str);
char *langtag;
size_t buflen = 32; /* arbitrary starting buffer size */
const bool strict = true;
@@ -2254,7 +2313,7 @@ icu_language_tag(const char *loc_str)
int32_t len;
status = U_ZERO_ERROR;
- len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+ len = uloc_toLanguageTag(fixed_loc_str, langtag, buflen, strict, &status);
/*
* If the result fits in the buffer exactly (len == buflen),
@@ -2273,6 +2332,8 @@ icu_language_tag(const char *loc_str)
break;
}
+ pg_free(fixed_loc_str);
+
if (U_FAILURE(status))
{
pg_free(langtag);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 53ab496bfe..5f5b61d036 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1048,15 +1048,26 @@ CREATE COLLATION testx (provider = icu, locale = 'c', rules = '&V << w <<< W');
ERROR: RULES not supported for C or POSIX locale
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
ERROR: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+CREATE COLLATION testx (provider = icu, locale = '@ASDF'); -- fails
+ERROR: could not convert locale name "@ASDF" to language tag: U_ILLEGAL_ARGUMENT_ERROR
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
WARNING: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
WARNING: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION testx (provider = icu, locale = '@ASDF'); DROP COLLATION testx;
+WARNING: could not convert locale name "@ASDF" to language tag: U_ILLEGAL_ARGUMENT_ERROR
RESET icu_validation_level;
CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = 'posix'); DROP COLLATION testx;
+-- test special variants
+CREATE COLLATION testx (provider = icu, locale = '@EURO'); DROP COLLATION testx;
+NOTICE: using standard form "und-u-cu-eur" for ICU locale "@EURO"
+CREATE COLLATION testx (provider = icu, locale = '@pinyin'); DROP COLLATION testx;
+NOTICE: using standard form "und-u-co-pinyin" for ICU locale "@pinyin"
+CREATE COLLATION testx (provider = icu, locale = '@stroke'); DROP COLLATION testx;
+NOTICE: using standard form "und-u-co-stroke" for ICU locale "@stroke"
CREATE COLLATION test4 FROM nonsense;
ERROR: collation "nonsense" for encoding "UTF8" does not exist
CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 63d5352ee6..e4bbd2c009 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -382,14 +382,21 @@ CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
CREATE COLLATION testx (provider = icu, locale = 'c', deterministic = false); -- fails
CREATE COLLATION testx (provider = icu, locale = 'c', rules = '&V << w <<< W'); -- fails
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+CREATE COLLATION testx (provider = icu, locale = '@ASDF'); -- fails
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = '@ASDF'); DROP COLLATION testx;
RESET icu_validation_level;
CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = 'posix'); DROP COLLATION testx;
+-- test special variants
+CREATE COLLATION testx (provider = icu, locale = '@EURO'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = '@pinyin'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = '@stroke'); DROP COLLATION testx;
+
CREATE COLLATION test4 FROM nonsense;
CREATE COLLATION test5 FROM test0;
--
2.34.1
v3-0004-Make-LOCALE-apply-to-ICU_LOCALE-for-CREATE-DATABA.patch (text/x-patch; charset=UTF-8)
From 066eac039f86d95ad853aa1e5de2b34fdf688f2e Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 25 Apr 2023 15:01:55 -0700
Subject: [PATCH v3 4/4] Make LOCALE apply to ICU_LOCALE for CREATE DATABASE.
LOCALE is now an alias for LC_COLLATE, LC_CTYPE, and (if the provider
is ICU) ICU_LOCALE. The ICU provider accepts more locale names than
libc (e.g. language tags and locale names containing collation
attributes), so in some cases LC_COLLATE, LC_CTYPE, and ICU_LOCALE
will still need to be specified separately.
Previously, LOCALE applied only to LC_COLLATE and LC_CTYPE (and
similarly for --locale in initdb and createdb). That could lead to
confusion when the provider is implicit, such as when it is inherited
from the template database, or when ICU was made default at initdb
time in commit 27b62377b4.
Reverts incomplete fix 5cd1a5af4d.
Discussion: https://postgr.es/m/3391932.1682107209@sss.pgh.pa.us
---
doc/src/sgml/ref/create_database.sgml | 6 ++--
doc/src/sgml/ref/createdb.sgml | 5 ++-
doc/src/sgml/ref/initdb.sgml | 7 +++--
src/backend/commands/collationcmds.c | 2 +-
src/backend/commands/dbcommands.c | 15 ++++++---
src/bin/initdb/initdb.c | 31 ++++++++++++-------
src/bin/scripts/createdb.c | 13 +++-----
src/bin/scripts/t/020_createdb.pl | 4 +--
src/test/icu/t/010_database.pl | 23 +++++++++-----
.../regress/expected/collate.icu.utf8.out | 22 ++++++-------
10 files changed, 77 insertions(+), 51 deletions(-)
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 13793bb6b7..844773ff44 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -145,8 +145,10 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
<term><replaceable class="parameter">locale</replaceable></term>
<listitem>
<para>
- This is a shortcut for setting <symbol>LC_COLLATE</symbol>
- and <symbol>LC_CTYPE</symbol> at once.
+ This is a shortcut for setting <symbol>LC_COLLATE</symbol>,
+ <symbol>LC_CTYPE</symbol> and <symbol>ICU_LOCALE</symbol> at
+ once. Some locales are only valid for ICU, and must be set separately
+ with <symbol>ICU_LOCALE</symbol>.
</para>
<tip>
<para>
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index e23419ba6c..e4647d5ce7 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -124,7 +124,10 @@ PostgreSQL documentation
<listitem>
<para>
Specifies the locale to be used in this database. This is equivalent
- to specifying both <option>--lc-collate</option> and <option>--lc-ctype</option>.
+ to specifying <option>--lc-collate</option>,
+ <option>--lc-ctype</option>, and <option>--icu-locale</option> to the
+ same value. Some locales are only valid for ICU and must be set with
+ <option>--icu-locale</option>.
</para>
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 87945b4b62..f850dc404d 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -116,9 +116,10 @@ PostgreSQL documentation
<para>
To choose a different locale for the cluster, use the option
<option>--locale</option>. There are also individual options
- <option>--lc-*</option> (see below) to set values for the individual locale
- categories. Note that inconsistent settings for different locale
- categories can give nonsensical results, so this should be used with care.
+ <option>--lc-*</option> and <option>--icu-locale</option> (see below) to
+ set values for the individual locale categories. Note that inconsistent
+ settings for different locale categories can give nonsensical results, so
+ this should be used with care.
</para>
<para>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 7e69a889fb..e481f20dc8 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -288,7 +288,7 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
if (langtag && strcmp(colliculocale, langtag) != 0)
{
ereport(NOTICE,
- (errmsg("using standard form \"%s\" for locale \"%s\"",
+ (errmsg("using standard form \"%s\" for ICU locale \"%s\"",
langtag, colliculocale)));
colliculocale = langtag;
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 8ef33871f0..b447dc55f3 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1017,7 +1017,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
if (dblocprovider == '\0')
dblocprovider = src_locprovider;
if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU)
- dbiculocale = src_iculocale;
+ {
+ if (dlocale && dlocale->arg)
+ dbiculocale = defGetString(dlocale);
+ else
+ dbiculocale = src_iculocale;
+ }
if (dbicurules == NULL && dblocprovider == COLLPROVIDER_ICU)
dbicurules = src_icurules;
@@ -1031,12 +1036,14 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
if (!check_locale(LC_COLLATE, dbcollate, &canonname))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("invalid locale name: \"%s\"", dbcollate)));
+ errmsg("invalid LC_COLLATE locale name: \"%s\"", dbcollate),
+ errhint("If the locale name is specific to ICU, use ICU_LOCALE.")));
dbcollate = canonname;
if (!check_locale(LC_CTYPE, dbctype, &canonname))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("invalid locale name: \"%s\"", dbctype)));
+ errmsg("invalid LC_CTYPE locale name: \"%s\"", dbctype),
+ errhint("If the locale name is specific to ICU, use ICU_LOCALE.")));
dbctype = canonname;
check_encoding_locale_matches(encoding, dbcollate, dbctype);
@@ -1080,7 +1087,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
if (langtag && strcmp(dbiculocale, langtag) != 0)
{
ereport(NOTICE,
- (errmsg("using standard form \"%s\" for locale \"%s\"",
+ (errmsg("using standard form \"%s\" for ICU locale \"%s\"",
langtag, dbiculocale)));
dbiculocale = langtag;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 600c8d93f3..7e316c8ba9 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2157,7 +2157,11 @@ check_locale_name(int category, const char *locale, char **canonname)
if (res == NULL)
{
if (*locale)
- pg_fatal("invalid locale name \"%s\"", locale);
+ {
+ pg_log_error("invalid locale name \"%s\"", locale);
+ pg_log_error_hint("If the locale name is specific to ICU, use --icu-locale.");
+ exit(1);
+ }
else
{
/*
@@ -2452,7 +2456,7 @@ setlocales(void)
{
char *canonname;
- /* set empty lc_* values to locale config if set */
+ /* set empty lc_* and iculocale values to locale config if set */
if (locale)
{
@@ -2468,6 +2472,8 @@ setlocales(void)
lc_monetary = locale;
if (!lc_messages)
lc_messages = locale;
+ if (!icu_locale && locale_provider == COLLPROVIDER_ICU)
+ icu_locale = locale;
}
/*
@@ -2504,14 +2510,18 @@ setlocales(void)
printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
}
- /* canonicalize to a language tag */
- langtag = icu_language_tag(icu_locale);
- printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
- langtag, icu_locale);
- pg_free(icu_locale);
- icu_locale = langtag;
-
- icu_validate_locale(icu_locale);
+ if (pg_strcasecmp(icu_locale, "C") != 0 &&
+ pg_strcasecmp(icu_locale, "POSIX") != 0)
+ {
+ /* canonicalize to a language tag */
+ langtag = icu_language_tag(icu_locale);
+ printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
+ langtag, icu_locale);
+ pg_free(icu_locale);
+ icu_locale = langtag;
+
+ icu_validate_locale(icu_locale);
+ }
/*
* In supported builds, the ICU locale ID will be opened during
@@ -3343,7 +3353,6 @@ main(int argc, char *argv[])
break;
case 8:
locale = "C";
- locale_provider = COLLPROVIDER_LIBC;
break;
case 9:
pwfilename = pg_strdup(optarg);
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index b4205c4fa5..9ca86a3e53 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -164,14 +164,6 @@ main(int argc, char *argv[])
exit(1);
}
- if (locale)
- {
- if (!lc_ctype)
- lc_ctype = locale;
- if (!lc_collate)
- lc_collate = locale;
- }
-
if (encoding)
{
if (pg_char_to_encoding(encoding) < 0)
@@ -219,6 +211,11 @@ main(int argc, char *argv[])
appendPQExpBuffer(&sql, " STRATEGY %s", fmtId(strategy));
if (template)
appendPQExpBuffer(&sql, " TEMPLATE %s", fmtId(template));
+ if (locale)
+ {
+ appendPQExpBufferStr(&sql, " LOCALE ");
+ appendStringLiteralConn(&sql, locale, conn);
+ }
if (lc_collate)
{
appendPQExpBufferStr(&sql, " LC_COLLATE ");
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index af3b1492e3..3db9fe931f 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -126,7 +126,7 @@ $node->command_checks_all(
1,
[qr/^$/],
[
- qr/^createdb: error: database creation failed: ERROR: invalid locale name|^createdb: error: database creation failed: ERROR: new collation \(foo'; SELECT '1\) is incompatible with the collation of the template database/s
+ qr/^createdb: error: database creation failed: ERROR: invalid LC_COLLATE locale name|^createdb: error: database creation failed: ERROR: new collation \(foo'; SELECT '1\) is incompatible with the collation of the template database/s
],
'createdb with incorrect --lc-collate');
$node->command_checks_all(
@@ -134,7 +134,7 @@ $node->command_checks_all(
1,
[qr/^$/],
[
- qr/^createdb: error: database creation failed: ERROR: invalid locale name|^createdb: error: database creation failed: ERROR: new LC_CTYPE \(foo'; SELECT '1\) is incompatible with the LC_CTYPE of the template database/s
+ qr/^createdb: error: database creation failed: ERROR: invalid LC_CTYPE locale name|^createdb: error: database creation failed: ERROR: new LC_CTYPE \(foo'; SELECT '1\) is incompatible with the LC_CTYPE of the template database/s
],
'createdb with incorrect --lc-ctype');
diff --git a/src/test/icu/t/010_database.pl b/src/test/icu/t/010_database.pl
index 715b1bffd6..df4af00afe 100644
--- a/src/test/icu/t/010_database.pl
+++ b/src/test/icu/t/010_database.pl
@@ -51,16 +51,23 @@ b),
'sort by explicit collation upper first');
-# Test error cases in CREATE DATABASE involving locale-related options
+# Test that LOCALE='C' works for ICU
-my ($ret, $stdout, $stderr) = $node1->psql('postgres',
- q{CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'C' TEMPLATE template0 ENCODING UTF8});
-isnt($ret, 0,
- "ICU locale must be specified for ICU provider: exit code not 0");
+my $ret1 = $node1->psql('postgres',
+ q{CREATE DATABASE dbicu2 LOCALE_PROVIDER icu LOCALE 'C' TEMPLATE template0 ENCODING UTF8});
+is($ret1, 0,
+ "C locale works for ICU");
+
+# Test that ICU-specific locale string must be specified with ICU_LOCALE,
+# not LOCALE
+
+my ($ret2, $stdout, $stderr) = $node1->psql('postgres',
+ q{CREATE DATABASE dbicu3 LOCALE_PROVIDER icu LOCALE '@colStrength=primary' TEMPLATE template0 ENCODING UTF8});
+isnt($ret2, 0,
+ "ICU-specific locale must be specified with ICU_LOCALE: exit code not 0");
like(
$stderr,
- qr/ERROR: ICU locale must be specified/,
- "ICU locale must be specified for ICU provider: error message");
-
+ qr/ERROR: invalid LC_COLLATE locale name/,
+ "ICU-specific locale must be specified with ICU_LOCALE: error message");
done_testing();
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 5f5b61d036..566e91d2d9 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1213,9 +1213,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
(1 row)
CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
-NOTICE: using standard form "und-u-kf-lower" for locale "@colCaseFirst=lower"
+NOTICE: using standard form "und-u-kf-lower" for ICU locale "@colCaseFirst=lower"
CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
-NOTICE: using standard form "und-u-kf-upper" for locale "@colCaseFirst=upper"
+NOTICE: using standard form "und-u-kf-upper" for ICU locale "@colCaseFirst=upper"
SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
?column? | ?column?
----------+----------
@@ -1223,7 +1223,7 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
(1 row)
CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
-NOTICE: using standard form "und-u-ka-shifted" for locale "@colAlternate=shifted"
+NOTICE: using standard form "und-u-ka-shifted" for ICU locale "@colAlternate=shifted"
SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
?column? | ?column?
----------+----------
@@ -1240,12 +1240,12 @@ SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_n
(1 row)
CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
-NOTICE: using standard form "und-u-kn-lower" for locale "@colNumeric=lower"
+NOTICE: using standard form "und-u-kn-lower" for ICU locale "@colNumeric=lower"
ERROR: could not open collator for locale "und-u-kn-lower": U_ILLEGAL_ARGUMENT_ERROR
-- test that attributes not handled by icu_set_collation_attributes()
-- (handled by ucol_open() directly) also work
CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook');
-NOTICE: using standard form "de-u-co-phonebk" for locale "de@collation=phonebook"
+NOTICE: using standard form "de-u-co-phonebk" for ICU locale "de@collation=phonebook"
SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook;
?column? | ?column?
----------+----------
@@ -1254,7 +1254,7 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes
-- rules
CREATE COLLATION testcoll_rules1 (provider = icu, locale = '', rules = '&a < g');
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE TABLE test7 (a text);
-- example from https://unicode-org.github.io/icu/userguide/collation/customization/#syntax
INSERT INTO test7 VALUES ('Abernathy'), ('apple'), ('bird'), ('Boston'), ('Graham'), ('green');
@@ -1282,13 +1282,13 @@ SELECT * FROM test7 ORDER BY a COLLATE testcoll_rules1;
DROP TABLE test7;
CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!!');
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
ERROR: could not open collator for locale "und" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
-- nondeterministic collations
CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE TABLE test6 (a int, b text);
-- same string in different normal forms
INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1338,9 +1338,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet;
(2 rows)
CREATE COLLATION case_sensitive (provider = icu, locale = '');
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false);
-NOTICE: using standard form "und-u-ks-level2" for locale "@colStrength=secondary"
+NOTICE: using standard form "und-u-ks-level2" for ICU locale "@colStrength=secondary"
SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
?column? | ?column?
----------+----------
--
2.34.1
On Fri, 2023-04-28 at 14:35 -0700, Jeff Davis wrote:
On Thu, 2023-04-27 at 14:23 +0200, Daniel Verite wrote:
This should be pg_strcasecmp(...) == 0
Good catch, thank you! Fixed in updated patches.
Rebased patches.
=== 0001: do not convert C to en-US-u-va-posix
I plan to commit this soon. If someone specifies "C", they are probably
expecting memcmp()-like behavior, or some kind of error/warning that it
can't be provided.
Removing this transformation means that if you specify iculocale=C,
you'll get an error or warning (depending on icu_validation_level),
because C is not a recognized ICU locale. Depending on how some of the
other issues in this thread are sorted out, we may want to relax the
validation.
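
To make "memcmp()-like behavior" concrete, here is a minimal standalone
sketch of a byte-wise comparison. It is not PostgreSQL code, and the helper
name c_locale_compare is made up for illustration. It compares the common
prefix with memcmp() and lets the shorter string sort first on a tie.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Compare two NUL-terminated strings byte by byte, as the "C" locale is
 * expected to. Returns <0, 0, or >0. Hypothetical helper for illustration.
 */
static int
c_locale_compare(const char *a, const char *b)
{
    size_t      alen;
    size_t      blen;
    int         cmp;

    assert(a != NULL && b != NULL);

    alen = strlen(a);
    blen = strlen(b);

    /* Compare only the bytes both strings have. */
    cmp = memcmp(a, b, alen < blen ? alen : blen);
    if (cmp != 0)
        return cmp;

    /* Common prefix is equal: the shorter string sorts first. */
    if (alen == blen)
        return 0;
    return alen < blen ? -1 : 1;
}
```

With this ordering, the pair from earlier in the thread,
'RM(25)/nodes|+|21|1' and 'RM(25)/nodes|-|21|', compares as less-than,
because '+' (0x2B) precedes '-' (0x2D) as a byte value.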
=== 0002: fix @euro, etc. in ICU >= 64
I'd like to commit this soon too, but I'll wait for someone to take a
look. It makes mapping libc names to ICU locale names more reliable,
regardless of the ICU version.
It doesn't solve the problem for locales like "de__PHONEBOOK", but
those don't seem to be a libc format (I think just an old ICU format),
so I don't see a big reason to carry it forward. It might be another
reason to turn down the validation level to WARNING, though.
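
The variant fix-up can be tried outside the server with a standalone
adaptation of the patch's icu_fix_variants(). This is a sketch under
assumptions: it uses malloc() and a local ASCII case-insensitive compare in
place of palloc() and pg_strcasecmp(), and the names fix_variants and
ascii_strcasecmp are made up here. The map entries are the ones from the
patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Same entries as icu_variant_map in the patch. */
static const char *const variant_map[][2] = {
    {"@EURO", "@currency=EUR"},
    {"@PINYIN", "@collation=pinyin"},
    {"@STROKE", "@collation=stroke"},
};

#define VARIANT_MAP_SIZE (sizeof(variant_map) / sizeof(variant_map[0]))

/* ASCII-only case-insensitive compare; stand-in for pg_strcasecmp(). */
static int
ascii_strcasecmp(const char *a, const char *b)
{
    for (;; a++, b++)
    {
        int         ca = tolower((unsigned char) *a);
        int         cb = tolower((unsigned char) *b);

        if (ca != cb)
            return ca - cb;
        if (ca == '\0')
            return 0;
    }
}

/*
 * Return a malloc'd copy of loc_str with a trailing '@VARIANT' replaced by
 * '@keyword=value' when the variant is in the map. Returns NULL if out of
 * memory. The caller frees the result.
 */
static char *
fix_variants(const char *loc_str)
{
    const char *old_variant;
    char       *result;
    size_t      len;

    assert(loc_str != NULL);

    old_variant = strrchr(loc_str, '@');
    if (old_variant != NULL)
    {
        size_t      prefix_len = (size_t) (old_variant - loc_str);

        for (size_t i = 0; i < VARIANT_MAP_SIZE; i++)
        {
            if (ascii_strcasecmp(old_variant, variant_map[i][0]) == 0)
            {
                const char *replacement = variant_map[i][1];
                size_t      replacement_len = strlen(replacement);

                result = malloc(prefix_len + replacement_len + 1);
                if (result == NULL)
                    return NULL;
                memcpy(result, loc_str, prefix_len);
                /* +1 copies the terminating NUL as well */
                memcpy(result + prefix_len, replacement, replacement_len + 1);
                return result;
            }
        }
    }

    /* No known variant: return an unchanged copy. */
    len = strlen(loc_str) + 1;
    result = malloc(len);
    if (result != NULL)
        memcpy(result, loc_str, len);
    return result;
}
```

An unknown variant such as '@ASDF' is passed through unchanged, so it still
reaches uloc_toLanguageTag() and fails there, which matches the regression
test added by the patch.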
=== 0003: support C memcmp() behavior with ICU provider
The current patch 0003 has a problem: in previous postgres versions
(going all the way back), we accepted "C" as a valid ICU locale and
actually passed it to ICU as a locale name. ICU didn't recognize it and
ended up opening the root locale. So we can't simply redefine "C" to
mean "memcmp", because that would potentially break indexes built
under the root locale's ordering.
I see the following potential solutions:
1. Represent the memcmp behavior with iculocale=NULL, or some other
catalog hack, so that we can distinguish between a locale "C" upgraded
from a previous version (which should pass "C" to ICU and get the root
locale), and a new collation defined with locale "C" (which should have
memcmp behavior). The catalog representation for locale information is
already complex, so I'm not excited about this option, but it will
work.
2. When provider=icu and locale=C, magically transform that into
provider=libc to get memcmp-like behavior for new collations but
preserve the existing behavior for upgraded collations. Not especially
clean, but if we issue a NOTICE perhaps that would avoid confusion.
3. Like #2, except create a new provider type "none" which may be
slightly less confusing.
=== 0004: make LOCALE apply to ICU for CREATE DATABASE
To understand this patch it helps to understand the confusing situation
with CREATE DATABASE in version 15:
The keywords LC_CTYPE and LC_COLLATE set the server environment
LC_CTYPE/LC_COLLATE for that database and can be specified regardless
of the provider. LOCALE can be specified along with (or instead of)
LC_CTYPE and LC_COLLATE, in which case whichever of LC_CTYPE or
LC_COLLATE is unspecified defaults to the setting of LOCALE. Iff the
provider is libc, LC_CTYPE and LC_COLLATE also act as the database
default collation's locale. If the provider is icu, then none of
LOCALE, LC_CTYPE, or LC_COLLATE affect the database default collation's
locale at all; that's controlled by ICU_LOCALE (which may be omitted if
the template's daticulocale is non-NULL).
The idea of patch 0004 is to address the last part, which is probably
the most confusing aspect. But for that to work smoothly, we need
something like 0003 so that LOCALE=C gives the same semantics
regardless of the provider.
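
The difference for the ICU provider can be sketched as a small resolution
function. This is a simplified model, assuming the provider is already known
to be ICU and leaving out validation and error reporting. The function names
are made up. The patched branch mirrors the dbcommands.c hunk in 0004.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Version 15 as described above: LOCALE never feeds the ICU locale.
 * A NULL result models the "ICU locale must be specified" error.
 */
static const char *
icu_locale_v15(const char *icu_locale, const char *locale,
               const char *template_icu_locale)
{
    (void) locale;              /* ignored for the ICU locale */
    return icu_locale != NULL ? icu_locale : template_icu_locale;
}

/* With patch 0004: ICU_LOCALE, then LOCALE, then the template's value. */
static const char *
icu_locale_patched(const char *icu_locale, const char *locale,
                   const char *template_icu_locale)
{
    if (icu_locale != NULL)
        return icu_locale;
    if (locale != NULL)
        return locale;
    return template_icu_locale;
}

/* Usage examples, written as checks on the returned pointers. */
static void
check_icu_locale_examples(void)
{
    const char *tmpl = "en-US";
    const char *c = "C";
    const char *de = "de";

    /* LOCALE 'C' with an ICU template: v15 silently keeps the template's */
    assert(icu_locale_v15(NULL, c, tmpl) == tmpl);
    /* LOCALE 'C' with a template that has no ICU locale: v15 has nothing */
    assert(icu_locale_v15(NULL, c, NULL) == NULL);
    /* Patched: LOCALE now supplies the ICU locale */
    assert(icu_locale_patched(NULL, c, tmpl) == c);
    /* An explicit ICU_LOCALE still wins over LOCALE */
    assert(icu_locale_patched(de, c, tmpl) == de);
    /* Nothing specified: inherit from the template */
    assert(icu_locale_patched(NULL, NULL, tmpl) == tmpl);
}
```

The first check is the confusing case: the user asks for LOCALE 'C' and the
database default collation keeps the template's ICU locale instead.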
Regards,
Jeff Davis
Attachments:
v4-0001-ICU-do-not-convert-locale-C-to-en-US-u-va-posix.patch (text/x-patch; charset=UTF-8)
From ddda683963959a175dff17ab0e3d8519641498b9 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 21 Apr 2023 14:03:57 -0700
Subject: [PATCH v4 1/4] ICU: do not convert locale 'C' to 'en-US-u-va-posix'.
The conversion was intended to be for convenience, but it's more
likely to be confusing than useful.
The user can still directly specify 'en-US-u-va-posix' if desired.
Discussion: https://postgr.es/m/f83f089ee1e9acd5dbbbf3353294d24e1f196e95.camel@j-davis.com
---
src/backend/utils/adt/pg_locale.c | 19 +------------------
src/bin/initdb/initdb.c | 17 +----------------
.../regress/expected/collate.icu.utf8.out | 8 ++++++++
src/test/regress/sql/collate.icu.utf8.sql | 4 ++++
4 files changed, 14 insertions(+), 34 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index f0b6567da1..51b4221a39 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2782,26 +2782,10 @@ icu_language_tag(const char *loc_str, int elevel)
{
#ifdef USE_ICU
UErrorCode status;
- char lang[ULOC_LANG_CAPACITY];
char *langtag;
size_t buflen = 32; /* arbitrary starting buffer size */
const bool strict = true;
- status = U_ZERO_ERROR;
- uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
- {
- if (elevel > 0)
- ereport(elevel,
- (errmsg("could not get language from locale \"%s\": %s",
- loc_str, u_errorName(status))));
- return NULL;
- }
-
- /* C/POSIX locales aren't handled by uloc_getLanguageTag() */
- if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
- return pstrdup("en-US-u-va-posix");
-
/*
* A BCP47 language tag doesn't have a clearly-defined upper limit
* (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
@@ -2889,8 +2873,7 @@ icu_validate_locale(const char *loc_str)
/* check for special language name */
if (strcmp(lang, "") == 0 ||
- strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
- strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+ strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0)
found = true;
/* search for matching language within ICU */
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 2c208ead01..4086834458 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2238,24 +2238,10 @@ icu_language_tag(const char *loc_str)
{
#ifdef USE_ICU
UErrorCode status;
- char lang[ULOC_LANG_CAPACITY];
char *langtag;
size_t buflen = 32; /* arbitrary starting buffer size */
const bool strict = true;
- status = U_ZERO_ERROR;
- uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
- {
- pg_fatal("could not get language from locale \"%s\": %s",
- loc_str, u_errorName(status));
- return NULL;
- }
-
- /* C/POSIX locales aren't handled by uloc_getLanguageTag() */
- if (strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
- return pstrdup("en-US-u-va-posix");
-
/*
* A BCP47 language tag doesn't have a clearly-defined upper limit
* (cf. RFC5646 section 4.4). Additionally, in older ICU versions,
@@ -2327,8 +2313,7 @@ icu_validate_locale(const char *loc_str)
/* check for special language name */
if (strcmp(lang, "") == 0 ||
- strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0 ||
- strcmp(lang, "c") == 0 || strcmp(lang, "posix") == 0)
+ strcmp(lang, "root") == 0 || strcmp(lang, "und") == 0)
found = true;
/* search for matching language within ICU */
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index b5a221b030..99f12d2e73 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1020,6 +1020,7 @@ CREATE ROLE regress_test_role;
CREATE SCHEMA test_schema;
-- We need to do this this way to cope with varying names for encodings:
SET client_min_messages TO WARNING;
+SET icu_validation_level = disabled;
do $$
BEGIN
EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
@@ -1034,17 +1035,24 @@ BEGIN
quote_literal(current_setting('lc_collate')) || ');';
END
$$;
+RESET icu_validation_level;
RESET client_min_messages;
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
ERROR: parameter "locale" must be specified
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
ERROR: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION testx (provider = icu, locale = 'c'); -- fails
+ERROR: could not convert locale name "c" to language tag: U_ILLEGAL_ARGUMENT_ERROR
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
ERROR: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
WARNING: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
+WARNING: could not convert locale name "c" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+WARNING: ICU locale "c" has unknown language "c"
+HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
WARNING: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 85e26951b6..d9778faacc 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -358,6 +358,7 @@ CREATE SCHEMA test_schema;
-- We need to do this this way to cope with varying names for encodings:
SET client_min_messages TO WARNING;
+SET icu_validation_level = disabled;
do $$
BEGIN
@@ -373,13 +374,16 @@ BEGIN
END
$$;
+RESET icu_validation_level;
RESET client_min_messages;
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
+CREATE COLLATION testx (provider = icu, locale = 'c'); -- fails
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
RESET icu_validation_level;
--
2.34.1
v4-0002-ICU-fix-up-old-libc-style-locale-strings.patch (text/x-patch; charset=UTF-8)
From 3db6abd0fe56e9c4b6653e04e28f6f77381c2fc8 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 28 Apr 2023 12:22:41 -0700
Subject: [PATCH v4 2/4] ICU: fix up old libc-style locale strings.
Before transforming a locale string into a language tag, fix up old
libc-style locale strings such as 'fr_FR@euro'. Older ICU versions did
this automatically, but ICU version 64 removed that support.
Discussion: https://postgr.es/m/654a49f7ff7461bcf47be4181430678d45f93858.camel%40j-davis.com
---
src/backend/utils/adt/pg_locale.c | 59 ++++++++++++++++-
src/bin/initdb/initdb.c | 63 ++++++++++++++++++-
.../regress/expected/collate.icu.utf8.out | 11 ++++
src/test/regress/sql/collate.icu.utf8.sql | 7 +++
4 files changed, 138 insertions(+), 2 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 51b4221a39..0e7343b28b 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2766,6 +2766,60 @@ icu_set_collation_attributes(UCollator *collator, const char *loc,
pfree(lower_str);
}
+static const char *icu_variant_map[][2] = {
+ { "@EURO", "@currency=EUR" },
+ { "@PINYIN", "@collation=pinyin" },
+ { "@STROKE", "@collation=stroke" },
+};
+
+#define ICU_VARIANT_MAP_SIZE \
+ (sizeof(icu_variant_map)/sizeof(icu_variant_map[0]))
+
+/*
+ * ICU version 64 removed the ability to transform locale strings of the form
+ * '...@VARIANT' into proper language tags. Perform the transformation from
+ * within Postgres so that ICU supports any libc locale name consistently,
+ * regardless of the ICU version.
+ */
+static char *
+icu_fix_variants(const char *loc_str)
+{
+ const char *old_variant = strrchr(loc_str, '@');
+
+ /*
+ * Extract a variant of the form '...@VARIANT', and replace with
+ * the appropriate '...@keyword=value' if found in the map.
+ */
+ if (old_variant)
+ {
+ size_t prefix_len = old_variant - loc_str; /* bytes before the '@' */
+
+ for (int i = 0; i < ICU_VARIANT_MAP_SIZE; i++)
+ {
+ const char *map_variant = icu_variant_map[i][0];
+ const char *map_replacement = icu_variant_map[i][1];
+
+ if (pg_strcasecmp(old_variant, map_variant) == 0)
+ {
+ size_t replacement_len = strlen(map_replacement);
+ size_t result_len;
+ char *result;
+
+ result_len = prefix_len + replacement_len + 1;
+ result = palloc(result_len);
+
+ memcpy(result, loc_str, prefix_len);
+ memcpy(result + prefix_len, map_replacement, replacement_len);
+ result[prefix_len + replacement_len] = '\0';
+
+ return result;
+ }
+ }
+ }
+
+ return pstrdup(loc_str);
+}
+
#endif
/*
@@ -2782,6 +2836,7 @@ icu_language_tag(const char *loc_str, int elevel)
{
#ifdef USE_ICU
UErrorCode status;
+ char *fixed_loc_str = icu_fix_variants(loc_str);
char *langtag;
size_t buflen = 32; /* arbitrary starting buffer size */
const bool strict = true;
@@ -2798,7 +2853,7 @@ icu_language_tag(const char *loc_str, int elevel)
int32_t len;
status = U_ZERO_ERROR;
- len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+ len = uloc_toLanguageTag(fixed_loc_str, langtag, buflen, strict, &status);
/*
* If the result fits in the buffer exactly (len == buflen),
@@ -2818,6 +2873,8 @@ icu_language_tag(const char *loc_str, int elevel)
break;
}
+ pfree(fixed_loc_str);
+
if (U_FAILURE(status))
{
pfree(langtag);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 4086834458..600c8d93f3 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2229,6 +2229,64 @@ check_icu_locale_encoding(int user_enc)
return true;
}
+#ifdef USE_ICU
+
+static const char *icu_variant_map[][2] = {
+ { "@EURO", "@currency=EUR" },
+ { "@PINYIN", "@collation=pinyin" },
+ { "@STROKE", "@collation=stroke" },
+};
+
+#define ICU_VARIANT_MAP_SIZE \
+ (sizeof(icu_variant_map)/sizeof(icu_variant_map[0]))
+
+/*
+ * ICU version 64 removed the ability to transform locale strings of the form
+ * '...@VARIANT' into proper language tags. Perform the transformation from
+ * within Postgres so that ICU supports any libc locale name consistently,
+ * regardless of the ICU version.
+ */
+static char *
+icu_fix_variants(const char *loc_str)
+{
+ const char *old_variant = strrchr(loc_str, '@');
+
+ /*
+ * Extract a variant of the form '...@VARIANT', and replace with
+ * the appropriate '...@keyword=value' if found in the map.
+ */
+ if (old_variant)
+ {
+ size_t prefix_len = old_variant - loc_str; /* bytes before the '@' */
+
+ for (int i = 0; i < ICU_VARIANT_MAP_SIZE; i++)
+ {
+ const char *map_variant = icu_variant_map[i][0];
+ const char *map_replacement = icu_variant_map[i][1];
+
+ if (pg_strcasecmp(old_variant, map_variant) == 0)
+ {
+ size_t replacement_len = strlen(map_replacement);
+ size_t result_len;
+ char *result;
+
+ result_len = prefix_len + replacement_len + 1;
+ result = pg_malloc(result_len);
+
+ memcpy(result, loc_str, prefix_len);
+ memcpy(result + prefix_len, map_replacement, replacement_len);
+ result[prefix_len + replacement_len] = '\0';
+
+ return result;
+ }
+ }
+ }
+
+ return pg_strdup(loc_str);
+}
+
+#endif
+
/*
* Convert to canonical BCP47 language tag. Must be consistent with
* icu_language_tag().
@@ -2238,6 +2296,7 @@ icu_language_tag(const char *loc_str)
{
#ifdef USE_ICU
UErrorCode status;
+ char *fixed_loc_str = icu_fix_variants(loc_str);
char *langtag;
size_t buflen = 32; /* arbitrary starting buffer size */
const bool strict = true;
@@ -2254,7 +2313,7 @@ icu_language_tag(const char *loc_str)
int32_t len;
status = U_ZERO_ERROR;
- len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+ len = uloc_toLanguageTag(fixed_loc_str, langtag, buflen, strict, &status);
/*
* If the result fits in the buffer exactly (len == buflen),
@@ -2273,6 +2332,8 @@ icu_language_tag(const char *loc_str)
break;
}
+ pg_free(fixed_loc_str);
+
if (U_FAILURE(status))
{
pg_free(langtag);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 99f12d2e73..d520674edf 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1046,6 +1046,8 @@ CREATE COLLATION testx (provider = icu, locale = 'c'); -- fails
ERROR: could not convert locale name "c" to language tag: U_ILLEGAL_ARGUMENT_ERROR
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
ERROR: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+CREATE COLLATION testx (provider = icu, locale = '@ASDF'); -- fails
+ERROR: could not convert locale name "@ASDF" to language tag: U_ILLEGAL_ARGUMENT_ERROR
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
WARNING: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
@@ -1056,7 +1058,16 @@ HINT: To disable ICU locale validation, set parameter icu_validation_level to D
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
WARNING: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION testx (provider = icu, locale = '@ASDF'); DROP COLLATION testx;
+WARNING: could not convert locale name "@ASDF" to language tag: U_ILLEGAL_ARGUMENT_ERROR
RESET icu_validation_level;
+-- test special variants
+CREATE COLLATION testx (provider = icu, locale = '@EURO'); DROP COLLATION testx;
+NOTICE: using standard form "und-u-cu-eur" for locale "@EURO"
+CREATE COLLATION testx (provider = icu, locale = '@pinyin'); DROP COLLATION testx;
+NOTICE: using standard form "und-u-co-pinyin" for locale "@pinyin"
+CREATE COLLATION testx (provider = icu, locale = '@stroke'); DROP COLLATION testx;
+NOTICE: using standard form "und-u-co-stroke" for locale "@stroke"
CREATE COLLATION test4 FROM nonsense;
ERROR: collation "nonsense" for encoding "UTF8" does not exist
CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index d9778faacc..ab9a8484b9 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -381,12 +381,19 @@ CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, nee
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
CREATE COLLATION testx (provider = icu, locale = 'c'); -- fails
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+CREATE COLLATION testx (provider = icu, locale = '@ASDF'); -- fails
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = '@ASDF'); DROP COLLATION testx;
RESET icu_validation_level;
+-- test special variants
+CREATE COLLATION testx (provider = icu, locale = '@EURO'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = '@pinyin'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = '@stroke'); DROP COLLATION testx;
+
CREATE COLLATION test4 FROM nonsense;
CREATE COLLATION test5 FROM test0;
--
2.34.1
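A standalone sketch of the variant fix-up from patch 0002, written against the C standard library only so it can be compiled outside the PostgreSQL tree. It substitutes malloc() for palloc()/pg_malloc() and a local case-insensitive comparison for pg_strcasecmp(); the names fix_variants and casecmp_simple are hypothetical, and the caller is assumed to free() the result.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Same mapping as in the patch: old libc-style variant -> ICU keyword. */
static const char *const variant_map[][2] = {
    {"@EURO", "@currency=EUR"},
    {"@PINYIN", "@collation=pinyin"},
    {"@STROKE", "@collation=stroke"},
};

#define VARIANT_MAP_SIZE (sizeof(variant_map) / sizeof(variant_map[0]))

/* Case-insensitive comparison; stand-in for pg_strcasecmp(). */
static int
casecmp_simple(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0')
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return ca - cb;
        a++;
        b++;
    }
    return tolower((unsigned char) *a) - tolower((unsigned char) *b);
}

/*
 * Return a newly allocated copy of loc_str in which a trailing
 * '@VARIANT' found in the map is replaced by '@keyword=value'.
 * Unknown variants are copied unchanged. Returns NULL on allocation
 * failure.
 */
char *
fix_variants(const char *loc_str)
{
    const char *old_variant;
    size_t      len;
    char       *copy;

    assert(loc_str != NULL);

    /* The variant, if any, starts at the last '@'. */
    old_variant = strrchr(loc_str, '@');

    if (old_variant != NULL)
    {
        size_t prefix_len = (size_t) (old_variant - loc_str);

        for (size_t i = 0; i < VARIANT_MAP_SIZE; i++)
        {
            if (casecmp_simple(old_variant, variant_map[i][0]) == 0)
            {
                const char *replacement = variant_map[i][1];
                size_t      replacement_len = strlen(replacement);
                char       *result = malloc(prefix_len + replacement_len + 1);

                if (result == NULL)
                    return NULL;

                /* Prefix before the '@', then replacement plus its NUL. */
                memcpy(result, loc_str, prefix_len);
                memcpy(result + prefix_len, replacement, replacement_len + 1);
                return result;
            }
        }
    }

    /* No known variant: return a plain copy. */
    len = strlen(loc_str) + 1;
    copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, loc_str, len);
    return copy;
}
```

Under these assumptions, fix_variants("fr_FR@euro") produces "fr_FR@currency=EUR", while an unrecognized variant such as "@ASDF" is passed through unchanged and left for uloc_toLanguageTag() to reject.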
v4-0003-ICU-support-locale-C-with-the-same-behavior-as-li.patch (text/x-patch; charset=UTF-8)
From 05597bea2f48cb1ef78a745401bcabdd29245b84 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 24 Apr 2023 15:46:17 -0700
Subject: [PATCH v4 3/4] ICU: support locale "C" with the same behavior as
libc.
The "C" locale doesn't actually use a provider at all, it's a special
locale that uses memcmp() and built-in character classification. Make
it behave the same in ICU as libc (even though it doesn't actually
make use of either provider).
Discussion: https://postgr.es/m/87v8hoexdv.fsf@news-spur.riddles.org.uk
---
src/backend/commands/collationcmds.c | 43 ++++++----
src/backend/commands/dbcommands.c | 42 +++++----
src/backend/utils/adt/pg_locale.c | 86 ++++++++++++++-----
.../regress/expected/collate.icu.utf8.out | 12 +--
src/test/regress/sql/collate.icu.utf8.sql | 7 +-
5 files changed, 131 insertions(+), 59 deletions(-)
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index c91fe66d9b..7e69a889fb 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -264,26 +264,39 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
errmsg("parameter \"locale\" must be specified")));
- /*
- * During binary upgrade, preserve the locale string. Otherwise,
- * canonicalize to a language tag.
- */
- if (!IsBinaryUpgrade)
+ if (pg_strcasecmp(colliculocale, "C") == 0 ||
+ pg_strcasecmp(colliculocale, "POSIX") == 0)
{
- char *langtag = icu_language_tag(colliculocale,
- icu_validation_level);
-
- if (langtag && strcmp(colliculocale, langtag) != 0)
+ if (!collisdeterministic)
+ ereport(ERROR,
+ (errmsg("nondeterministic collations not supported for C or POSIX locale")));
+ if (collicurules != NULL)
+ ereport(ERROR,
+ (errmsg("RULES not supported for C or POSIX locale")));
+ }
+ else
+ {
+ /*
+ * During binary upgrade, preserve the locale
+ * string. Otherwise, canonicalize to a language tag.
+ */
+ if (!IsBinaryUpgrade)
{
- ereport(NOTICE,
- (errmsg("using standard form \"%s\" for locale \"%s\"",
- langtag, colliculocale)));
+ char *langtag = icu_language_tag(colliculocale,
+ icu_validation_level);
+
+ if (langtag && strcmp(colliculocale, langtag) != 0)
+ {
+ ereport(NOTICE,
+ (errmsg("using standard form \"%s\" for locale \"%s\"",
+ langtag, colliculocale)));
- colliculocale = langtag;
+ colliculocale = langtag;
+ }
}
- }
- icu_validate_locale(colliculocale);
+ icu_validate_locale(colliculocale);
+ }
}
/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2e242eeff2..8ef33871f0 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1058,27 +1058,37 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("ICU locale must be specified")));
- /*
- * During binary upgrade, or when the locale came from the template
- * database, preserve locale string. Otherwise, canonicalize to a
- * language tag.
- */
- if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
+ if (pg_strcasecmp(dbiculocale, "C") == 0 ||
+ pg_strcasecmp(dbiculocale, "POSIX") == 0)
{
- char *langtag = icu_language_tag(dbiculocale,
- icu_validation_level);
-
- if (langtag && strcmp(dbiculocale, langtag) != 0)
+ if (dbicurules != NULL)
+ ereport(ERROR,
+ (errmsg("ICU_RULES not supported for C or POSIX locale")));
+ }
+ else
+ {
+ /*
+ * During binary upgrade, or when the locale came from the
+ * template database, preserve locale string. Otherwise,
+ * canonicalize to a language tag.
+ */
+ if (!IsBinaryUpgrade && dbiculocale != src_iculocale)
{
- ereport(NOTICE,
- (errmsg("using standard form \"%s\" for locale \"%s\"",
- langtag, dbiculocale)));
+ char *langtag = icu_language_tag(dbiculocale,
+ icu_validation_level);
+
+ if (langtag && strcmp(dbiculocale, langtag) != 0)
+ {
+ ereport(NOTICE,
+ (errmsg("using standard form \"%s\" for locale \"%s\"",
+ langtag, dbiculocale)));
- dbiculocale = langtag;
+ dbiculocale = langtag;
+ }
}
- }
- icu_validate_locale(dbiculocale);
+ icu_validate_locale(dbiculocale);
+ }
}
else
{
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 0e7343b28b..76ca42441d 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1246,8 +1246,15 @@ lookup_collation_cache(Oid collation, bool set_flags)
}
else
{
- cache_entry->collate_is_c = false;
- cache_entry->ctype_is_c = false;
+ Datum datum;
+ const char *colliculocale;
+
+ datum = SysCacheGetAttrNotNull(COLLOID, tp, Anum_pg_collation_colliculocale);
+ colliculocale = TextDatumGetCString(datum);
+
+ cache_entry->collate_is_c = ((strcmp(colliculocale, "C") == 0) ||
+ (strcmp(colliculocale, "POSIX") == 0));
+ cache_entry->ctype_is_c = cache_entry->collate_is_c;
}
cache_entry->flags_valid = true;
@@ -1279,16 +1286,27 @@ lc_collate_is_c(Oid collation)
if (collation == DEFAULT_COLLATION_OID)
{
static int result = -1;
- char *localeptr;
-
- if (default_locale.provider == COLLPROVIDER_ICU)
- return false;
+ const char *localeptr;
if (result >= 0)
return (bool) result;
- localeptr = setlocale(LC_COLLATE, NULL);
- if (!localeptr)
- elog(ERROR, "invalid LC_COLLATE setting");
+
+ if (default_locale.provider == COLLPROVIDER_ICU)
+ {
+#ifdef USE_ICU
+ localeptr = default_locale.info.icu.locale;
+#else
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("ICU is not supported in this build")));
+#endif
+ }
+ else
+ {
+ localeptr = setlocale(LC_COLLATE, NULL);
+ if (!localeptr)
+ elog(ERROR, "invalid LC_COLLATE setting");
+ }
if (strcmp(localeptr, "C") == 0)
result = true;
@@ -1332,16 +1350,27 @@ lc_ctype_is_c(Oid collation)
if (collation == DEFAULT_COLLATION_OID)
{
static int result = -1;
- char *localeptr;
-
- if (default_locale.provider == COLLPROVIDER_ICU)
- return false;
+ const char *localeptr;
if (result >= 0)
return (bool) result;
- localeptr = setlocale(LC_CTYPE, NULL);
- if (!localeptr)
- elog(ERROR, "invalid LC_CTYPE setting");
+
+ if (default_locale.provider == COLLPROVIDER_ICU)
+ {
+#ifdef USE_ICU
+ localeptr = default_locale.info.icu.locale;
+#else
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("ICU is not supported in this build")));
+#endif
+ }
+ else
+ {
+ localeptr = setlocale(LC_CTYPE, NULL);
+ if (!localeptr)
+ elog(ERROR, "invalid LC_CTYPE setting");
+ }
if (strcmp(localeptr, "C") == 0)
result = true;
@@ -1375,7 +1404,14 @@ make_icu_collator(const char *iculocstr,
#ifdef USE_ICU
UCollator *collator;
- collator = pg_ucol_open(iculocstr);
+ if (pg_strcasecmp(iculocstr, "C") == 0 ||
+ pg_strcasecmp(iculocstr, "POSIX") == 0)
+ {
+ Assert(icurules == NULL);
+ collator = NULL;
+ }
+ else
+ collator = pg_ucol_open(iculocstr);
/*
* If rules are specified, we extract the rules of the standard collation,
@@ -1650,6 +1686,10 @@ get_collation_actual_version(char collprovider, const char *collcollate)
{
char *collversion = NULL;
+ if (pg_strcasecmp("C", collcollate) == 0 ||
+ pg_strcasecmp("POSIX", collcollate) == 0)
+ return NULL;
+
#ifdef USE_ICU
if (collprovider == COLLPROVIDER_ICU)
{
@@ -1668,9 +1708,7 @@ get_collation_actual_version(char collprovider, const char *collcollate)
else
#endif
if (collprovider == COLLPROVIDER_LIBC &&
- pg_strcasecmp("C", collcollate) != 0 &&
- pg_strncasecmp("C.", collcollate, 2) != 0 &&
- pg_strcasecmp("POSIX", collcollate) != 0)
+ pg_strncasecmp("C.", collcollate, 2) != 0)
{
#if defined(__GLIBC__)
/* Use the glibc version because we don't have anything better. */
@@ -2457,6 +2495,14 @@ pg_ucol_open(const char *loc_str)
if (loc_str == NULL)
elog(ERROR, "opening default collator is not supported");
+ /*
+ * Must never open special values C or POSIX, which are treated specially
+ * and not passed to the provider.
+ */
+ if (pg_strcasecmp(loc_str, "C") == 0 ||
+ pg_strcasecmp(loc_str, "POSIX") == 0)
+ elog(ERROR, "unexpected ICU locale string: %s", loc_str);
+
/*
* In ICU versions 54 and earlier, "und" is not a recognized spelling of
* the root locale. If the first component of the locale is "und", replace
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index d520674edf..f217658151 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1042,8 +1042,10 @@ ERROR: parameter "locale" must be specified
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
ERROR: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
-CREATE COLLATION testx (provider = icu, locale = 'c'); -- fails
-ERROR: could not convert locale name "c" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+CREATE COLLATION testx (provider = icu, locale = 'c', deterministic = false); -- fails
+ERROR: nondeterministic collations not supported for C or POSIX locale
+CREATE COLLATION testx (provider = icu, locale = 'c', rules = '&V << w <<< W'); -- fails
+ERROR: RULES not supported for C or POSIX locale
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
ERROR: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
CREATE COLLATION testx (provider = icu, locale = '@ASDF'); -- fails
@@ -1051,16 +1053,14 @@ ERROR: could not convert locale name "@ASDF" to language tag: U_ILLEGAL_ARGUMEN
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
WARNING: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
-CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
-WARNING: could not convert locale name "c" to language tag: U_ILLEGAL_ARGUMENT_ERROR
-WARNING: ICU locale "c" has unknown language "c"
-HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
WARNING: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
CREATE COLLATION testx (provider = icu, locale = '@ASDF'); DROP COLLATION testx;
WARNING: could not convert locale name "@ASDF" to language tag: U_ILLEGAL_ARGUMENT_ERROR
RESET icu_validation_level;
+CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'posix'); DROP COLLATION testx;
-- test special variants
CREATE COLLATION testx (provider = icu, locale = '@EURO'); DROP COLLATION testx;
NOTICE: using standard form "und-u-cu-eur" for locale "@EURO"
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index ab9a8484b9..e4bbd2c009 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -379,16 +379,19 @@ RESET client_min_messages;
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
-CREATE COLLATION testx (provider = icu, locale = 'c'); -- fails
+CREATE COLLATION testx (provider = icu, locale = 'c', deterministic = false); -- fails
+CREATE COLLATION testx (provider = icu, locale = 'c', rules = '&V << w <<< W'); -- fails
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
CREATE COLLATION testx (provider = icu, locale = '@ASDF'); -- fails
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
-CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = '@ASDF'); DROP COLLATION testx;
RESET icu_validation_level;
+CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = 'posix'); DROP COLLATION testx;
+
-- test special variants
CREATE COLLATION testx (provider = icu, locale = '@EURO'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = '@pinyin'); DROP COLLATION testx;
--
2.34.1
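The idea behind patch 0003 (recognize "C"/"POSIX" by name and compare bytes with memcmp() instead of calling a provider) can be sketched in isolation as follows. This is a simplified model, not the PostgreSQL code path: locale_is_c, c_locale_compare and equals_ignore_case are hypothetical names, and the tie-break on length is an assumption about how a bytewise comparison of unequal-length strings is completed.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Case-insensitive equality; stand-in for pg_strcasecmp(...) == 0. */
static bool
equals_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0')
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return false;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* True if the locale name is one of the special values "C" or "POSIX". */
bool
locale_is_c(const char *locale)
{
    assert(locale != NULL);
    return equals_ignore_case(locale, "C") ||
           equals_ignore_case(locale, "POSIX");
}

/*
 * Bytewise comparison: memcmp() over the common prefix, then the
 * shorter string sorts first. Returns <0, 0 or >0.
 */
int
c_locale_compare(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t n = (alen < blen) ? alen : blen;
    int    r = memcmp(a, b, n);

    if (r != 0)
        return r;
    if (alen < blen)
        return -1;
    if (alen > blen)
        return 1;
    return 0;
}
```

Applied to the strings from the start of this thread, 'RM(25)/nodes|+|21|1' sorts before 'RM(25)/nodes|-|21|' because '+' (0x2B) is less than '-' (0x2D), matching the collate "C" result reported there.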
v4-0004-Make-LOCALE-apply-to-ICU_LOCALE-for-CREATE-DATABA.patch (text/x-patch; charset=UTF-8)
From 310bdcd136e44bfca1eea4da5181886eac02d52d Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 25 Apr 2023 15:01:55 -0700
Subject: [PATCH v4 4/4] Make LOCALE apply to ICU_LOCALE for CREATE DATABASE.
LOCALE is now an alias for LC_COLLATE, LC_CTYPE, and (if the provider
is ICU) ICU_LOCALE. The ICU provider accepts more locale names than
libc (e.g. language tags and locale names containing collation
attributes), so in some cases LC_COLLATE, LC_CTYPE, and ICU_LOCALE
will still need to be specified separately.
Previously, LOCALE applied only to LC_COLLATE and LC_CTYPE (and
similarly for --locale in initdb and createdb). That could lead to
confusion when the provider is implicit, such as when it is inherited
from the template database, or when ICU was made default at initdb
time in commit 27b62377b4.
Reverts incomplete fix 5cd1a5af4d.
Discussion: https://postgr.es/m/3391932.1682107209@sss.pgh.pa.us
---
doc/src/sgml/ref/create_database.sgml | 6 ++--
doc/src/sgml/ref/createdb.sgml | 5 ++-
doc/src/sgml/ref/initdb.sgml | 7 +++--
src/backend/commands/collationcmds.c | 2 +-
src/backend/commands/dbcommands.c | 15 ++++++---
src/bin/initdb/initdb.c | 31 ++++++++++++-------
src/bin/scripts/createdb.c | 13 +++-----
src/bin/scripts/t/020_createdb.pl | 4 +--
src/test/icu/t/010_database.pl | 23 +++++++++-----
.../regress/expected/collate.icu.utf8.out | 28 ++++++++---------
10 files changed, 80 insertions(+), 54 deletions(-)
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 13793bb6b7..844773ff44 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -145,8 +145,10 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
<term><replaceable class="parameter">locale</replaceable></term>
<listitem>
<para>
- This is a shortcut for setting <symbol>LC_COLLATE</symbol>
- and <symbol>LC_CTYPE</symbol> at once.
+ This is a shortcut for setting <symbol>LC_COLLATE</symbol>,
+ <symbol>LC_CTYPE</symbol> and <symbol>ICU_LOCALE</symbol> at
+ once. Some locales are only valid for ICU, and must be set separately
+ with <symbol>ICU_LOCALE</symbol>.
</para>
<tip>
<para>
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index e23419ba6c..e4647d5ce7 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -124,7 +124,10 @@ PostgreSQL documentation
<listitem>
<para>
Specifies the locale to be used in this database. This is equivalent
- to specifying both <option>--lc-collate</option> and <option>--lc-ctype</option>.
+ to specifying <option>--lc-collate</option>,
+ <option>--lc-ctype</option>, and <option>--icu-locale</option> to the
+ same value. Some locales are only valid for ICU and must be set with
+ <option>--icu-locale</option>.
</para>
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 87945b4b62..f850dc404d 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -116,9 +116,10 @@ PostgreSQL documentation
<para>
To choose a different locale for the cluster, use the option
<option>--locale</option>. There are also individual options
- <option>--lc-*</option> (see below) to set values for the individual locale
- categories. Note that inconsistent settings for different locale
- categories can give nonsensical results, so this should be used with care.
+ <option>--lc-*</option> and <option>--icu-locale</option> (see below) to
+ set values for the individual locale categories. Note that inconsistent
+ settings for different locale categories can give nonsensical results, so
+ this should be used with care.
</para>
<para>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 7e69a889fb..e481f20dc8 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -288,7 +288,7 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
if (langtag && strcmp(colliculocale, langtag) != 0)
{
ereport(NOTICE,
- (errmsg("using standard form \"%s\" for locale \"%s\"",
+ (errmsg("using standard form \"%s\" for ICU locale \"%s\"",
langtag, colliculocale)));
colliculocale = langtag;
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 8ef33871f0..b447dc55f3 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1017,7 +1017,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
if (dblocprovider == '\0')
dblocprovider = src_locprovider;
if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU)
- dbiculocale = src_iculocale;
+ {
+ if (dlocale && dlocale->arg)
+ dbiculocale = defGetString(dlocale);
+ else
+ dbiculocale = src_iculocale;
+ }
if (dbicurules == NULL && dblocprovider == COLLPROVIDER_ICU)
dbicurules = src_icurules;
@@ -1031,12 +1036,14 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
if (!check_locale(LC_COLLATE, dbcollate, &canonname))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("invalid locale name: \"%s\"", dbcollate)));
+ errmsg("invalid LC_COLLATE locale name: \"%s\"", dbcollate),
+ errhint("If the locale name is specific to ICU, use ICU_LOCALE.")));
dbcollate = canonname;
if (!check_locale(LC_CTYPE, dbctype, &canonname))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("invalid locale name: \"%s\"", dbctype)));
+ errmsg("invalid LC_CTYPE locale name: \"%s\"", dbctype),
+ errhint("If the locale name is specific to ICU, use ICU_LOCALE.")));
dbctype = canonname;
check_encoding_locale_matches(encoding, dbcollate, dbctype);
@@ -1080,7 +1087,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
if (langtag && strcmp(dbiculocale, langtag) != 0)
{
ereport(NOTICE,
- (errmsg("using standard form \"%s\" for locale \"%s\"",
+ (errmsg("using standard form \"%s\" for ICU locale \"%s\"",
langtag, dbiculocale)));
dbiculocale = langtag;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 600c8d93f3..7e316c8ba9 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2157,7 +2157,11 @@ check_locale_name(int category, const char *locale, char **canonname)
if (res == NULL)
{
if (*locale)
- pg_fatal("invalid locale name \"%s\"", locale);
+ {
+ pg_log_error("invalid locale name \"%s\"", locale);
+ pg_log_error_hint("If the locale name is specific to ICU, use --icu-locale.");
+ exit(1);
+ }
else
{
/*
@@ -2452,7 +2456,7 @@ setlocales(void)
{
char *canonname;
- /* set empty lc_* values to locale config if set */
+ /* set empty lc_* and iculocale values to locale config if set */
if (locale)
{
@@ -2468,6 +2472,8 @@ setlocales(void)
lc_monetary = locale;
if (!lc_messages)
lc_messages = locale;
+ if (!icu_locale && locale_provider == COLLPROVIDER_ICU)
+ icu_locale = locale;
}
/*
@@ -2504,14 +2510,18 @@ setlocales(void)
printf(_("Using default ICU locale \"%s\".\n"), icu_locale);
}
- /* canonicalize to a language tag */
- langtag = icu_language_tag(icu_locale);
- printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
- langtag, icu_locale);
- pg_free(icu_locale);
- icu_locale = langtag;
-
- icu_validate_locale(icu_locale);
+ if (pg_strcasecmp(icu_locale, "C") != 0 &&
+ pg_strcasecmp(icu_locale, "POSIX") != 0)
+ {
+ /* canonicalize to a language tag */
+ langtag = icu_language_tag(icu_locale);
+ printf(_("Using language tag \"%s\" for ICU locale \"%s\".\n"),
+ langtag, icu_locale);
+ pg_free(icu_locale);
+ icu_locale = langtag;
+
+ icu_validate_locale(icu_locale);
+ }
/*
* In supported builds, the ICU locale ID will be opened during
@@ -3343,7 +3353,6 @@ main(int argc, char *argv[])
break;
case 8:
locale = "C";
- locale_provider = COLLPROVIDER_LIBC;
break;
case 9:
pwfilename = pg_strdup(optarg);
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index b4205c4fa5..9ca86a3e53 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -164,14 +164,6 @@ main(int argc, char *argv[])
exit(1);
}
- if (locale)
- {
- if (!lc_ctype)
- lc_ctype = locale;
- if (!lc_collate)
- lc_collate = locale;
- }
-
if (encoding)
{
if (pg_char_to_encoding(encoding) < 0)
@@ -219,6 +211,11 @@ main(int argc, char *argv[])
appendPQExpBuffer(&sql, " STRATEGY %s", fmtId(strategy));
if (template)
appendPQExpBuffer(&sql, " TEMPLATE %s", fmtId(template));
+ if (locale)
+ {
+ appendPQExpBufferStr(&sql, " LOCALE ");
+ appendStringLiteralConn(&sql, locale, conn);
+ }
if (lc_collate)
{
appendPQExpBufferStr(&sql, " LC_COLLATE ");
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index af3b1492e3..3db9fe931f 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -126,7 +126,7 @@ $node->command_checks_all(
1,
[qr/^$/],
[
- qr/^createdb: error: database creation failed: ERROR: invalid locale name|^createdb: error: database creation failed: ERROR: new collation \(foo'; SELECT '1\) is incompatible with the collation of the template database/s
+ qr/^createdb: error: database creation failed: ERROR: invalid LC_COLLATE locale name|^createdb: error: database creation failed: ERROR: new collation \(foo'; SELECT '1\) is incompatible with the collation of the template database/s
],
'createdb with incorrect --lc-collate');
$node->command_checks_all(
@@ -134,7 +134,7 @@ $node->command_checks_all(
1,
[qr/^$/],
[
- qr/^createdb: error: database creation failed: ERROR: invalid locale name|^createdb: error: database creation failed: ERROR: new LC_CTYPE \(foo'; SELECT '1\) is incompatible with the LC_CTYPE of the template database/s
+ qr/^createdb: error: database creation failed: ERROR: invalid LC_CTYPE locale name|^createdb: error: database creation failed: ERROR: new LC_CTYPE \(foo'; SELECT '1\) is incompatible with the LC_CTYPE of the template database/s
],
'createdb with incorrect --lc-ctype');
diff --git a/src/test/icu/t/010_database.pl b/src/test/icu/t/010_database.pl
index 715b1bffd6..df4af00afe 100644
--- a/src/test/icu/t/010_database.pl
+++ b/src/test/icu/t/010_database.pl
@@ -51,16 +51,23 @@ b),
'sort by explicit collation upper first');
-# Test error cases in CREATE DATABASE involving locale-related options
+# Test that LOCALE='C' works for ICU
-my ($ret, $stdout, $stderr) = $node1->psql('postgres',
- q{CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'C' TEMPLATE template0 ENCODING UTF8});
-isnt($ret, 0,
- "ICU locale must be specified for ICU provider: exit code not 0");
+my $ret1 = $node1->psql('postgres',
+ q{CREATE DATABASE dbicu2 LOCALE_PROVIDER icu LOCALE 'C' TEMPLATE template0 ENCODING UTF8});
+is($ret1, 0,
+ "C locale works for ICU");
+
+# Test that ICU-specific locale string must be specified with ICU_LOCALE,
+# not LOCALE
+
+my ($ret2, $stdout, $stderr) = $node1->psql('postgres',
+ q{CREATE DATABASE dbicu3 LOCALE_PROVIDER icu LOCALE '@colStrength=primary' TEMPLATE template0 ENCODING UTF8});
+isnt($ret2, 0,
+ "ICU-specific locale must be specified with ICU_LOCALE: exit code not 0");
like(
$stderr,
- qr/ERROR: ICU locale must be specified/,
- "ICU locale must be specified for ICU provider: error message");
-
+ qr/ERROR: invalid LC_COLLATE locale name/,
+ "ICU-specific locale must be specified with ICU_LOCALE: error message");
done_testing();
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index f217658151..566e91d2d9 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1063,11 +1063,11 @@ CREATE COLLATION testx (provider = icu, locale = 'c'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = 'posix'); DROP COLLATION testx;
-- test special variants
CREATE COLLATION testx (provider = icu, locale = '@EURO'); DROP COLLATION testx;
-NOTICE: using standard form "und-u-cu-eur" for locale "@EURO"
+NOTICE: using standard form "und-u-cu-eur" for ICU locale "@EURO"
CREATE COLLATION testx (provider = icu, locale = '@pinyin'); DROP COLLATION testx;
-NOTICE: using standard form "und-u-co-pinyin" for locale "@pinyin"
+NOTICE: using standard form "und-u-co-pinyin" for ICU locale "@pinyin"
CREATE COLLATION testx (provider = icu, locale = '@stroke'); DROP COLLATION testx;
-NOTICE: using standard form "und-u-co-stroke" for locale "@stroke"
+NOTICE: using standard form "und-u-co-stroke" for ICU locale "@stroke"
CREATE COLLATION test4 FROM nonsense;
ERROR: collation "nonsense" for encoding "UTF8" does not exist
CREATE COLLATION test5 FROM test0;
@@ -1213,9 +1213,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
(1 row)
CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
-NOTICE: using standard form "und-u-kf-lower" for locale "@colCaseFirst=lower"
+NOTICE: using standard form "und-u-kf-lower" for ICU locale "@colCaseFirst=lower"
CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
-NOTICE: using standard form "und-u-kf-upper" for locale "@colCaseFirst=upper"
+NOTICE: using standard form "und-u-kf-upper" for ICU locale "@colCaseFirst=upper"
SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
?column? | ?column?
----------+----------
@@ -1223,7 +1223,7 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
(1 row)
CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
-NOTICE: using standard form "und-u-ka-shifted" for locale "@colAlternate=shifted"
+NOTICE: using standard form "und-u-ka-shifted" for ICU locale "@colAlternate=shifted"
SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
?column? | ?column?
----------+----------
@@ -1240,12 +1240,12 @@ SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_n
(1 row)
CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
-NOTICE: using standard form "und-u-kn-lower" for locale "@colNumeric=lower"
+NOTICE: using standard form "und-u-kn-lower" for ICU locale "@colNumeric=lower"
ERROR: could not open collator for locale "und-u-kn-lower": U_ILLEGAL_ARGUMENT_ERROR
-- test that attributes not handled by icu_set_collation_attributes()
-- (handled by ucol_open() directly) also work
CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook');
-NOTICE: using standard form "de-u-co-phonebk" for locale "de@collation=phonebook"
+NOTICE: using standard form "de-u-co-phonebk" for ICU locale "de@collation=phonebook"
SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook;
?column? | ?column?
----------+----------
@@ -1254,7 +1254,7 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes
-- rules
CREATE COLLATION testcoll_rules1 (provider = icu, locale = '', rules = '&a < g');
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE TABLE test7 (a text);
-- example from https://unicode-org.github.io/icu/userguide/collation/customization/#syntax
INSERT INTO test7 VALUES ('Abernathy'), ('apple'), ('bird'), ('Boston'), ('Graham'), ('green');
@@ -1282,13 +1282,13 @@ SELECT * FROM test7 ORDER BY a COLLATE testcoll_rules1;
DROP TABLE test7;
CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!!');
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
ERROR: could not open collator for locale "und" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
-- nondeterministic collations
CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE TABLE test6 (a int, b text);
-- same string in different normal forms
INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1338,9 +1338,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet;
(2 rows)
CREATE COLLATION case_sensitive (provider = icu, locale = '');
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false);
-NOTICE: using standard form "und-u-ks-level2" for locale "@colStrength=secondary"
+NOTICE: using standard form "und-u-ks-level2" for ICU locale "@colStrength=secondary"
SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
?column? | ?column?
----------+----------
--
2.34.1
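For readers skimming the patch above, the initdb part can be restated as a small standalone sketch. This is not the PostgreSQL code. The names `resolve_icu_locale`, `needs_canonicalization`, and `equals_ignore_case` are hypothetical, and the real code uses `pg_strcasecmp()`, `icu_language_tag()`, and `icu_validate_locale()`. The sketch only illustrates two points from the diff: `--locale` now also supplies the ICU locale when the provider is ICU, and `C`/`POSIX` bypass language-tag canonicalization.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Provider codes, mirroring COLLPROVIDER_ICU / COLLPROVIDER_LIBC. */
#define PROVIDER_ICU  'i'
#define PROVIDER_LIBC 'c'

/* ASCII case-insensitive equality; stands in for pg_strcasecmp(a, b) == 0. */
bool
equals_ignore_case(const char *a, const char *b)
{
    while (*a && *b)
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return false;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/*
 * Hypothetical helper: choose the ICU locale the way the patched
 * setlocales() appears to.  An explicit --icu-locale wins; otherwise
 * --locale is used, but only when the provider is ICU.  NULL means
 * "nothing given", in which case initdb picks a default elsewhere.
 */
const char *
resolve_icu_locale(const char *locale, const char *icu_locale, char provider)
{
    if (icu_locale != NULL)
        return icu_locale;
    if (locale != NULL && provider == PROVIDER_ICU)
        return locale;
    return NULL;
}

/*
 * Hypothetical helper: "C" and "POSIX" are kept as-is; anything else
 * would be canonicalized to a language tag and validated.
 */
bool
needs_canonicalization(const char *icu_locale)
{
    return !equals_ignore_case(icu_locale, "C") &&
           !equals_ignore_case(icu_locale, "POSIX");
}
```

Under these assumptions, `initdb --locale=C` with the ICU provider resolves the ICU locale to `C` and skips canonicalization, while the same option with the libc provider leaves the ICU locale unset.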
On Fri, 2023-04-21 at 20:12 -0400, Robert Haas wrote:
On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis <pgsql@j-davis.com> wrote:
Most of the complaints seem to be complaints about v15 as well, and
while those complaints may be a reason to not make ICU the default,
they are also an argument that we should continue to learn and try to
fix those issues because they exist in an already-released version.
Leaving it the default for now will help us fix those issues rather
than hide them.
It's still early, so we have plenty of time to revert the initdb
default if we need to.
That's fair enough, but I really think it's important that some energy
get invested in providing adequate documentation for this stuff. Just
patching the code is not enough.
Attached a significant documentation patch.
I tried to make it comprehensive without trying to be exhaustive, and I
separated the explanation of language tags from what collation settings
you can include in a language tag, so hopefully that's more clear.
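As an illustration of that split (not part of the attached patch), the shape of a language tag can be sketched in C: a language, an optional region, and then `-u` followed by the collation settings. The function `build_language_tag` and the `CollationSetting` type are hypothetical names made up for this sketch. It simply concatenates the parts, omits the value for boolean settings given as `true`, and makes no attempt to validate or canonicalize the result the way ICU does.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* One collation setting, e.g. key "ks" with value "level2". */
typedef struct
{
    const char *key;
    const char *value;      /* NULL or "true" means a bare boolean key */
} CollationSetting;

/* Append "-part" to buf; returns -1 if it does not fit. */
static int
append_part(char *buf, size_t buflen, size_t *used, const char *part)
{
    int         n = snprintf(buf + *used, buflen - *used, "-%s", part);

    if (n < 0 || (size_t) n >= buflen - *used)
        return -1;
    *used += (size_t) n;
    return 0;
}

/*
 * Hypothetical helper: write language[-region][-u-key[-value]...] into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
int
build_language_tag(char *buf, size_t buflen,
                   const char *language, const char *region,
                   const CollationSetting *settings, size_t nsettings)
{
    size_t      used;
    size_t      i;
    int         n = snprintf(buf, buflen, "%s", language);

    if (n < 0 || (size_t) n >= buflen)
        return -1;
    used = (size_t) n;

    if (region != NULL && append_part(buf, buflen, &used, region) != 0)
        return -1;

    /* "-u" introduces the collation settings, if there are any */
    if (nsettings > 0 && append_part(buf, buflen, &used, "u") != 0)
        return -1;

    for (i = 0; i < nsettings; i++)
    {
        if (append_part(buf, buflen, &used, settings[i].key) != 0)
            return -1;
        /* a boolean "true" is implied by the key alone */
        if (settings[i].value != NULL &&
            strcmp(settings[i].value, "true") != 0 &&
            append_part(buf, buflen, &used, settings[i].value) != 0)
            return -1;
    }
    return 0;
}
```

With language `en`, region `US`, and the settings `kn=true` and `ks=level2`, this sketch produces `en-US-u-kn-ks-level2`, the same tag used as an example in the documentation patch.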
I added quite a few examples spread throughout the various sections,
and I preserved the existing examples at the end. I also left all of
the external links at the bottom for those interested enough to go
beyond what's there.
I didn't add additional documentation for ICU rules. There are so many
options for collations that it's hard for me to think of realistic
examples to specify the rules directly, unless someone wants to invent
a new language. Perhaps useful if working with an interesting text file
format with special treatment for delimiters?
I asked the question about rules here:
/messages/by-id/e861ac4fdae9f9f5ce2a938a37bcb5e083f0f489.camel@cybertec.at
and got some limited response about addressing sort complaints. That
sounds reasonable, but a lot of that can also be handled just by
specifying the right collation settings. Someone who understands the
use case better could add some more documentation.
--
Jeff Davis
PostgreSQL Contributor Team - AWS
Attachments:
v1-0001-Doc-improvements-for-language-tags-and-custom-ICU.patch (text/x-patch; charset=UTF-8)
From b09515bfaf5e9de330138ec4a627d02a7947de1a Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 27 Apr 2023 14:43:46 -0700
Subject: [PATCH v1] Doc improvements for language tags and custom ICU
collations.
Separate the documentation for language tags from the documentation
for the available collation settings which can be included in a
language tag.
Include tables of the available options, more details about the
effects of each option, and additional examples.
Also include an explanation of the "levels" of textual features and
how they relate to collation.
---
doc/src/sgml/charset.sgml | 656 +++++++++++++++++++++++++++++++-------
1 file changed, 535 insertions(+), 121 deletions(-)
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 6dd95b8966..be74064168 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -377,7 +377,125 @@ initdb --locale-provider=icu --icu-locale=en
variants and customization options.
</para>
</sect2>
+ <sect2 id="icu-locales">
+ <title>ICU Locales</title>
+ <sect3 id="icu-locale-names">
+ <title>ICU Locale Names</title>
+ <para>
+ The ICU format for the locale name is a <link
+ linkend="icu-language-tag">Language Tag</link>.
+
+<programlisting>
+CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP');
+CREATE COLLATION mycollation2 (PROVIDER = icu, LOCALE = 'fr');
+</programlisting>
+ </para>
+ </sect3>
+ <sect3 id="icu-canonicalization">
+ <title>Locale Canonicalization and Validation</title>
+ <para>
+ When defining a new ICU collation object or database with ICU as the
+ provider, the given locale name is transformed ("canonicalized") into a
+ language tag if not already in that form. For instance,
+
+<screen>
+CREATE COLLATION mycollation3 (PROVIDER = icu, LOCALE = 'en-US-u-kn-true');
+NOTICE: using standard form "en-US-u-kn" for locale "en-US-u-kn-true"
+CREATE COLLATION mycollation4 (PROVIDER = icu, LOCALE = 'de_DE.utf8');
+NOTICE: using standard form "de-DE" for locale "de_DE.utf8"
+</screen>
+
+ If you see such a message, ensure that the <symbol>PROVIDER</symbol> and
+ <symbol>LOCALE</symbol> are as you expect, and consider specifying
+ the canonical language tag directly instead of relying on the
+ transformation.
+ </para>
+ <note>
+ <para>
+ ICU can transform most libc locale names, as well as some other formats,
+ into language tags for easier transition to ICU. If a libc locale name
+ is used in ICU, it may not have precisely the same behavior as in libc.
+ </para>
+ </note>
+ <para>
+ If there is some problem interpreting the locale name, or if it represents
+ a language or region that ICU does not recognize, a message will be reported:
+<screen>
+SET icu_validation_level = ERROR;
+CREATE COLLATION nonsense (PROVIDER = icu, LOCALE = 'nonsense');
+ERROR: ICU locale "nonsense" has unknown language "nonsense"
+HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+</screen>
+
+ <xref
+ linkend="guc-icu-validation-level"/> controls how the message is
+ reported. If set below <literal>ERROR</literal>, the collation will still
+ be created, but the behavior may not be what the user intended.
+ </para>
+ </sect3>
+ <sect3 id="icu-language-tag">
+ <title>Language Tag</title>
+ <para>
+ Basic language tags are simply
+ <replaceable>language</replaceable><literal>-</literal><replaceable>region</replaceable>;
+ or even just <replaceable>language</replaceable>. The
+ <replaceable>language</replaceable> is a language code
+ (e.g. <literal>fr</literal> for French or <literal>und</literal> for
+ "undefined"), and <replaceable>region</replaceable> is a region code
+ (e.g. <literal>CA</literal> for Canada). Examples:
+ <literal>ja-JP</literal>, <literal>de</literal>, or
+ <literal>fr-CA</literal>.
+ </para>
+ <para>
+ Collation settings may be included in the language tag to customize
+ collation behavior. ICU allows extensive customization, such as
+ sensitivity (or insensitivity) to accents, case, and punctuation;
+ treatment of digits within text; and many other options to satisfy a
+ variety of uses.
+ </para>
+ <para>
+ To include this additional collation information in a language tag,
+ append <literal>-u</literal>, followed by one or more
+ <literal>-</literal><replaceable>key</replaceable><literal>-</literal><replaceable>value</replaceable>
+ pairs, where <replaceable>key</replaceable> is the key for a collation
+ setting and <replaceable>value</replaceable> is a valid value for that
+ setting. For boolean settings, the
+ <literal>-</literal><replaceable>key</replaceable> may be specified
+ without a corresponding
+ <literal>-</literal><replaceable>value</replaceable>, which implies a
+ value of <literal>true</literal>.
+ </para>
+ <para>
+ For example, the language tag <literal>en-US-u-kn-ks-level2</literal>
+ means the locale with the English language in the US region, with
+ collation settings <literal>kn</literal> set to <literal>true</literal>
+ and <literal>ks</literal> set to <literal>level2</literal>. Those
+ settings mean the collation will be case-insensitive and treat a sequence
+ of digits as a single number:
+
+<screen>
+CREATE COLLATION mycollation5 (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'en-US-u-kn-ks-level2');
+SELECT 'aB' = 'Ab' COLLATE mycollation5 as result;
+ result
+--------
+ t
+(1 row)
+
+SELECT 'N-45' < 'N-123' COLLATE mycollation5 as result;
+ result
+--------
+ t
+(1 row)
+</screen>
+ </para>
+ <para>
+ See <xref linkend="icu-custom-collations"/> for details and additional
+ examples of using language tags with custom collation information for the
+ locale.
+ </para>
+ </sect3>
+ </sect2>
<sect2 id="locale-problems">
<title>Problems</title>
@@ -658,6 +776,13 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
code byte values.
</para>
+ <note>
+ <para>
+ The <literal>C</literal> and <literal>POSIX</literal> locales may behave
+ differently depending on the database encoding.
+ </para>
+ </note>
+
<para>
Additionally, two SQL standard collation names are available:
@@ -869,132 +994,23 @@ CREATE COLLATION german (provider = libc, locale = 'de_DE');
<sect4 id="collation-managing-create-icu">
<title>ICU Collations</title>
- <para>
- ICU allows collations to be customized beyond the basic language+country
- set that is preloaded by <command>initdb</command>. Users are encouraged
- to define their own collation objects that make use of these facilities to
- suit the sorting behavior to their requirements.
- See <ulink url="https://unicode-org.github.io/icu/userguide/locale/"></ulink>
- and <ulink url="https://unicode-org.github.io/icu/userguide/collation/api.html"></ulink> for
- information on ICU locale naming. The set of acceptable names and
- attributes depends on the particular ICU version.
- </para>
-
- <para>
- Here are some examples:
-
- <variablelist>
- <varlistentry id="collation-managing-create-icu-de-u-co-phonebk-x-icu">
- <term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');</literal></term>
- <term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de@collation=phonebook');</literal></term>
- <listitem>
- <para>German collation with phone book collation type</para>
- <para>
- The first example selects the ICU locale using a <quote>language
- tag</quote> per BCP 47. The second example uses the traditional
- ICU-specific locale syntax. The first style is preferred going
- forward, and is used internally to store locales.
- </para>
- <para>
- Note that you can name the collation objects in the SQL environment
- anything you want. In this example, we follow the naming style that
- the predefined collations use, which in turn also follow BCP 47, but
- that is not required for user-defined collations.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="collation-managing-create-icu-und-u-co-emoji-x-icu">
- <term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');</literal></term>
- <term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = '@collation=emoji');</literal></term>
- <listitem>
- <para>
- Root collation with Emoji collation type, per Unicode Technical Standard #51
- </para>
- <para>
- Observe how in the traditional ICU locale naming system, the root
- locale is selected by an empty string.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="collation-managing-create-icu-en-u-kr-grek-latn">
- <term><literal>CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');</literal></term>
- <term><literal>CREATE COLLATION latinlast (provider = icu, locale = 'en@colReorder=grek-latn');</literal></term>
- <listitem>
- <para>
- Sort Greek letters before Latin ones. (The default is Latin before Greek.)
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="collation-managing-create-icu-en-u-kf-upper">
- <term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');</literal></term>
- <term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en@colCaseFirst=upper');</literal></term>
- <listitem>
- <para>
- Sort upper-case letters before lower-case letters. (The default is
- lower-case letters first.)
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="collation-managing-create-icu-en-u-kf-upper-kr-grek-latn">
- <term><literal>CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn');</literal></term>
- <term><literal>CREATE COLLATION special (provider = icu, locale = 'en@colCaseFirst=upper;colReorder=grek-latn');</literal></term>
- <listitem>
- <para>
- Combines both of the above options.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="collation-managing-create-icu-en-u-kn-true">
- <term><literal>CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');</literal></term>
- <term><literal>CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes');</literal></term>
- <listitem>
- <para>
- Numeric ordering, sorts sequences of digits by their numeric value,
- for example: <literal>A-21</literal> < <literal>A-123</literal>
- (also known as natural sort).
- </para>
- </listitem>
- </varlistentry>
- </variablelist>
-
- See <ulink url="https://www.unicode.org/reports/tr35/tr35-collation.html">Unicode
- Technical Standard #35</ulink>
- and <ulink url="https://tools.ietf.org/html/bcp47">BCP 47</ulink> for
- details. The list of possible collation types (<literal>co</literal>
- subtag) can be found in
- the <ulink url="https://github.com/unicode-org/cldr/blob/master/common/bcp47/collation.xml">CLDR
- repository</ulink>.
- </para>
+ <para>
+ ICU collations can be created like:
- <para>
- Note that while this system allows creating collations that <quote>ignore
- case</quote> or <quote>ignore accents</quote> or similar (using the
- <literal>ks</literal> key), in order for such collations to act in a
- truly case- or accent-insensitive manner, they also need to be declared as not
- <firstterm>deterministic</firstterm> in <command>CREATE COLLATION</command>;
- see <xref linkend="collation-nondeterministic"/>.
- Otherwise, any strings that compare equal according to the collation but
- are not byte-wise equal will be sorted according to their byte values.
- </para>
+<programlisting>
+CREATE COLLATION german (provider = icu, locale = 'de-DE');
+</programlisting>
- <note>
+ ICU locales are specified as a <link linkend="icu-language-tag">Language
+ Tag</link>, but can also accept most libc-style locale names (which will
+ be transformed into language tags if possible).
+ </para>
<para>
- By design, ICU will accept almost any string as a locale name and match
- it to the closest locale it can provide, using the fallback procedure
- described in its documentation. Thus, there will be no direct feedback
- if a collation specification is composed using features that the given
- ICU installation does not actually support. It is therefore recommended
- to create application-level test cases to check that the collation
- definitions satisfy one's requirements.
+ New ICU collations can customize collation behavior extensively by
+ including collation attributes in the language tag. See <xref
+ linkend="icu-custom-collations"/> for details and examples.
</para>
- </note>
</sect4>
-
<sect4 id="collation-copy">
<title>Copying Collations</title>
@@ -1072,6 +1088,404 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
</tip>
</sect3>
</sect2>
+ <sect2 id="icu-custom-collations">
+ <title>ICU Custom Collations</title>
+
+ <para>
+ ICU allows extensive control over collation behavior by defining new
+ collations with collation settings as a part of the language tag. These
+ settings can modify the collation order to suit a variety of needs. For
+ instance:
+
+<programlisting>
+-- ignore differences in accents and case
+CREATE COLLATION ignore_accent_case (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-level1');
+SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
+SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true
+
+-- upper case letters sort before lower case.
+CREATE COLLATION upper_first (PROVIDER=icu, LOCALE = 'und-u-kf-upper');
+SELECT 'B' < 'b' COLLATE upper_first; -- true
+
+-- treat digits numerically and ignore punctuation
+CREATE COLLATION num_ignore_punct (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ka-shifted-kn');
+SELECT 'id-45' < 'id-123' COLLATE num_ignore_punct; -- true
+SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
+</programlisting>
+
+ Many of the available options are described in <xref
+ linkend="icu-collation-settings"/>, or see <xref
+ linkend="icu-external-references"/> for more details.
+ </para>
+ <sect3 id="icu-collation-comparison-levels">
+ <title>ICU Comparison Levels</title>
+ <para>
+ Comparison of two strings (collation) in ICU is determined by a
+ multi-level process, where textual features are grouped into
+ "levels". Treatment of each level is controlled by the <link
+ linkend="icu-collation-settings-table">collation settings</link>. Higher
+ levels correspond to finer textual features.
+ </para>
+ <para>
+ <table id="icu-collation-levels">
+ <title>ICU Collation Levels</title>
+ <tgroup cols="8">
+ <thead>
+ <row>
+ <entry>Level</entry>
+ <entry>Description</entry>
+ <entry><literal>'f' = 'f'</literal></entry>
+ <entry><literal>'ab' = U&'a\2063b'</literal></entry>
+ <entry><literal>'x-y' = 'x_y'</literal></entry>
+ <entry><literal>'g' = 'G'</literal></entry>
+ <entry><literal>'n' = 'ñ'</literal></entry>
+ <entry><literal>'y' = 'z'</literal></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>level1</entry>
+ <entry>Base Character</entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>false</literal></entry>
+ </row>
+ <row>
+ <entry>level2</entry>
+ <entry>Accents</entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ </row>
+ <row>
+ <entry>level3</entry>
+ <entry>Case/Variants</entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ </row>
+ <row>
+ <entry>level4</entry>
+ <entry>Punctuation</entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ </row>
+ <row>
+ <entry>identic</entry>
+ <entry>All</entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ The above table shows which textual feature differences are
+ considered significant when determining equality at the given level. The
+ Unicode character <literal>U+2063</literal> is an invisible separator,
+ and as seen in the table, is ignored at all levels of comparison less
+ than <literal>identic</literal>.
+ </para>
+ <para>
+ Examples:
+
+<programlisting>
+CREATE COLLATION level3 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level3');
+CREATE COLLATION level4 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level4');
+CREATE COLLATION identic (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-identic');
+
+-- invisible separator ignored at all levels except identic
+SELECT 'ab' = U&'a\2063b' COLLATE level4; -- true
+SELECT 'ab' = U&'a\2063b' COLLATE identic; -- false
+
+-- punctuation ignored at level3 but not at level4
+SELECT 'x-y' = 'x_y' COLLATE level3; -- true
+SELECT 'x-y' = 'x_y' COLLATE level4; -- false
+</programlisting>
+
+ </para>
+ <note>
+ <para>
+ For many collation settings, you must create the collation with
+ <option>DETERMINISTIC</option> set to <literal>false</literal> for the
+ setting to have the desired effect. Additionally, some settings only
+ take effect when the key <literal>ka</literal> is set to
+ <literal>shifted</literal> (see <xref
+ linkend="icu-collation-settings-table"/>).
+ </para>
+ </note>
+ </sect3>
+ <sect3 id="icu-collation-settings">
+ <title>Collation Settings for an ICU Locale</title>
+ <para>
+ <table id="icu-collation-settings-table">
+ <title>ICU Collation Settings</title>
+ <tgroup cols="4">
+ <thead>
+ <row>
+ <entry>Key</entry>
+ <entry>Values</entry>
+ <entry>Default</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>co</literal></entry>
+ <entry><literal>emoji</literal>, <literal>phonebk</literal>, <literal>standard</literal>, <replaceable>...</replaceable></entry>
+ <entry><literal>standard</literal></entry>
+ <entry>
+ Collation type. See <xref linkend="icu-external-references"/> for additional options and details.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>ks</literal></entry>
+ <entry><literal>level1</literal>, <literal>level2</literal>, <literal>level3</literal>, <literal>level4</literal>, <literal>identic</literal></entry>
+ <entry><literal>level3</literal></entry>
+ <entry>
+ Sensitivity when determining equality, with
+ <literal>level1</literal> the least sensitive and
+ <literal>identic</literal> the most sensitive. See <xref
+ linkend="icu-collation-levels"/> for details.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>ka</literal></entry>
+ <entry><literal>noignore</literal>, <literal>shifted</literal></entry>
+ <entry><literal>noignore</literal></entry>
+ <entry>
+ If set to <literal>shifted</literal>, causes some characters
+ (e.g. punctuation or space) to be ignored in comparison. Key
+ <literal>ks</literal> must be set to <literal>level3</literal> or
+ lower to take effect. Set key <literal>kv</literal> to control which
+ character classes are ignored.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kb</literal></entry>
+ <entry><literal>true</literal>, <literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry>
+ Backwards comparison for the level 2 differences. For example,
+ locale <literal>und-u-kb</literal> sorts <literal>'àe'</literal>
+ before <literal>'aé'</literal>.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kk</literal></entry>
+ <entry><literal>true</literal>, <literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry>
+ <para>
+ Enable full normalization; may affect performance. Basic
+ normalization is performed even when set to
+ <literal>false</literal>.
+ </para>
+ <para>
+ Full normalization is important in some cases, such as when
+ multiple accents are applied to a single character (e.g. in
+ Vietnamese or Arabic). Locales for languages that require full
+ normalization typically enable it by default.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kc</literal></entry>
+ <entry><literal>true</literal>, <literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry>
+ <para>
+ Separates case into a "level 2.5" that falls between accents and
+ other level 3 features.
+ </para>
+ <para>
+ If set to <literal>true</literal> and <literal>ks</literal> is set
+ to <literal>level1</literal>, will ignore accents but take case
+ into account.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kf</literal></entry>
+ <entry>
+ <literal>upper</literal>, <literal>lower</literal>,
+ <literal>false</literal>
+ </entry>
+ <entry><literal>false</literal></entry>
+ <entry>
+ If set to <literal>upper</literal>, upper case sorts before lower
+ case. If set to <literal>lower</literal>, lower case sorts before
+ upper case. If set to <literal>false</literal>, it depends on the
+ locale.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kn</literal></entry>
+ <entry><literal>true</literal>, <literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry>
+ If set to <literal>true</literal>, numbers within a string are
+ treated as a single numeric value rather than a sequence of
+ digits. For example, <literal>'id-45'</literal> sorts before
+ <literal>'id-123'</literal>.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kr</literal></entry>
+ <entry>
+ <literal>space</literal>, <literal>punct</literal>,
+ <literal>symbol</literal>, <literal>currency</literal>,
+ <literal>digit</literal>, <replaceable>script-id</replaceable>
+ </entry>
+ <entry></entry>
+ <entry>
+ <para>
+ Set to one or more of the valid values, or any BCP 47
+ <replaceable>script-id</replaceable>, e.g. <literal>latn</literal>
+ ("Latin") or <literal>grek</literal> ("Greek"). Multiple values are
+ separated by "<literal>-</literal>".
+ </para>
+ <para>
+ Redefines the ordering of classes of characters; those characters
+ belonging to a class earlier in the list sort before characters
+ belonging to a class later in the list. For instance, the value
+ <literal>digit-currency-space</literal> (as part of a language tag
+ like <literal>und-u-kr-digit-currency-space</literal>) sorts
+ punctuation before digits and spaces.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kv</literal></entry>
+ <entry>
+ <literal>space</literal>, <literal>punct</literal>,
+ <literal>symbol</literal>, <literal>currency</literal>
+ </entry>
+ <entry><literal>punct</literal></entry>
+ <entry>
+ Classes of characters ignored during comparison at level 3. Setting
+ to a later value includes earlier values;
+ e.g. <literal>symbol</literal> also includes
+ <literal>punct</literal> and <literal>space</literal> in the
+ characters to be ignored. Key <literal>ka</literal> must be set to
+ <literal>shifted</literal> and key <literal>ks</literal> must be set
+ to <literal>level3</literal> or lower to take effect.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ Defaults may depend on locale. The above table is not meant to be
+ complete. See <xref linkend="icu-external-references"/> for additional
+ options and details.
+ </para>
+ </sect3>
+ <sect3 id="icu-locale-examples">
+ <title>Examples</title>
+ <para>
+ <variablelist>
+ <varlistentry id="collation-managing-create-icu-de-u-co-phonebk-x-icu">
+ <term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');</literal></term>
+ <listitem>
+ <para>German collation with phone book collation type</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="collation-managing-create-icu-und-u-co-emoji-x-icu">
+ <term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');</literal></term>
+ <listitem>
+ <para>
+ Root collation with Emoji collation type, per Unicode Technical Standard #51
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="collation-managing-create-icu-en-u-kr-grek-latn">
+ <term><literal>CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');</literal></term>
+ <listitem>
+ <para>
+ Sort Greek letters before Latin ones. (The default is Latin before Greek.)
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="collation-managing-create-icu-en-u-kf-upper">
+ <term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');</literal></term>
+ <listitem>
+ <para>
+ Sort upper-case letters before lower-case letters. (The default is
+ lower-case letters first.)
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="collation-managing-create-icu-en-u-kf-upper-kr-grek-latn">
+ <term><literal>CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn');</literal></term>
+ <listitem>
+ <para>
+ Combines both of the above options.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </sect3>
+ <sect3 id="icu-external-references">
+ <title>External References for ICU</title>
+ <para>
+ This section (<xref linkend="icu-custom-collations"/>) is only a brief
+ overview of ICU behavior and language tags. Refer to the following
+ documents for technical details, additional options, and new behavior:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <ulink
+ url="https://www.unicode.org/reports/tr35/tr35-collation.html">Unicode
+ Technical Standard #35</ulink>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <ulink url="https://tools.ietf.org/html/bcp47">BCP 47</ulink>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <ulink url="https://github.com/unicode-org/cldr/blob/master/common/bcp47/collation.xml">CLDR
+ repository</ulink>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <ulink url="https://unicode-org.github.io/icu/userguide/locale/"></ulink>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <ulink url="https://unicode-org.github.io/icu/userguide/collation/api.html"></ulink>
+ </para>
+ </listitem>
+ </itemizedlist>
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="multibyte">
--
2.34.1
Jeff Davis <pgsql@j-davis.com> writes:
=== 0001: do not convert C to en-US-u-va-posix
I plan to commit this soon.
Several buildfarm animals have failed since this went in. The
only one showing enough info to diagnose is siskin [1]:
@@ -1043,16 +1043,15 @@
ERROR: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
CREATE COLLATION testx (provider = icu, locale = 'C'); -- fails
-ERROR: could not convert locale name "C" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+NOTICE: using standard form "en-US-u-va-posix" for locale "C"
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
ERROR: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
WARNING: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+ERROR: collation "testx" already exists
CREATE COLLATION testx (provider = icu, locale = 'C'); DROP COLLATION testx;
-WARNING: could not convert locale name "C" to language tag: U_ILLEGAL_ARGUMENT_ERROR
-WARNING: ICU locale "C" has unknown language "c"
-HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+NOTICE: using standard form "en-US-u-va-posix" for locale "C"
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
WARNING: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
I suppose this is environment-dependent. Sadly, the buildfarm
client does not show the prevailing LANG or LC_XXX settings.
regards, tom lane
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=siskin&dt=2023-05-08%2020%3A09%3A26
On Mon, 2023-05-08 at 17:47 -0400, Tom Lane wrote:
-ERROR: could not convert locale name "C" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+NOTICE: using standard form "en-US-u-va-posix" for locale "C"
...
I suppose this is environment-dependent. Sadly, the buildfarm
client does not show the prevailing LANG or LC_XXX settings.
Looks like it's failing-to-fail on some versions of ICU which
automatically perform that conversion.
The easiest thing to do is revert it for now, and after we sort out the
memcmp() path for the ICU provider, then I can commit it again (after
that point it would just be code cleanup and should have no functional
impact).
Regards,
Jeff Davis
On 2023-Apr-24, Peter Eisentraut wrote:
The GUC settings lc_collate and lc_ctype are from a time when those locale
settings were cluster-global. When we made those locale settings
per-database (PG 8.4), we kept them as read-only. As of PG 15, you can use
ICU as the per-database locale provider, so what is being attempted in the
above example is already meaningless before PG 16, since you need to look
into pg_database to find out what is really happening.
I think we should just remove the GUC parameters lc_collate and lc_ctype.
I agree with removing these in v16, since they are going to become more
meaningless and confusing.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
On Tue, 2023-05-09 at 10:25 +0200, Alvaro Herrera wrote:
I agree with removing these in v16, since they are going to become more
meaningless and confusing.
Agreed, but it would be nice to have an alternative that does the right
thing.
It's awkward for a user to read pg_database.datlocprovider, then
depending on that, either look in datcollate or daticulocale. (It's
awkward in the code, too.)
Maybe some built-in function that returns a tuple of the default
provider, the locale, and the version? Or should we also output the
ctype somehow (which affects the results of upper()/lower())?
Regards,
Jeff Davis
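For reference, the two-step lookup described above can be folded into one query. This is only a sketch, assuming the PG 15/16 catalog column names (datlocprovider, datcollate, datctype, daticulocale, datcollversion); the alias collation_locale is made up for the example and is not an existing function or view:

```sql
-- Pick the collation locale according to the provider
-- ('i' = ICU, 'c' = libc).
SELECT datname,
       datlocprovider,
       CASE datlocprovider
            WHEN 'i' THEN daticulocale
            ELSE datcollate
       END AS collation_locale,   -- hypothetical alias, for illustration only
       datctype,                  -- reported separately from the collation locale
       datcollversion
FROM pg_database
WHERE datname = current_database();
```

A built-in function along the lines suggested above would presumably return the same information without the CASE expression.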
On 09.05.23 10:25, Alvaro Herrera wrote:
On 2023-Apr-24, Peter Eisentraut wrote:
The GUC settings lc_collate and lc_ctype are from a time when those locale
settings were cluster-global. When we made those locale settings
per-database (PG 8.4), we kept them as read-only. As of PG 15, you can use
ICU as the per-database locale provider, so what is being attempted in the
above example is already meaningless before PG 16, since you need to look
into pg_database to find out what is really happening.
I think we should just remove the GUC parameters lc_collate and lc_ctype.
I agree with removing these in v16, since they are going to become more
meaningless and confusing.
Here is my proposed patch for this.
Attachments:
0001-Remove-read-only-server-settings-lc_collate-and-lc_c.patch (text/plain; charset=UTF-8)
From b548a671ad02a5c851a4984db6e4535a0b70f881 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter@eisentraut.org>
Date: Thu, 11 May 2023 13:02:02 +0200
Subject: [PATCH] Remove read-only server settings lc_collate and lc_ctype
The GUC settings lc_collate and lc_ctype are from a time when those
locale settings were cluster-global. When those locale settings were
made per-database (PG 8.4), the settings were kept as read-only. As
of PG 15, you can use ICU as the per-database locale provider, so
examining these settings is already meaningless, since you need to
look into pg_database to find out what is really happening.
Discussion: https://www.postgresql.org/message-id/696054d1-bc88-b6ab-129a-18b8bce6a6f0@enterprisedb.com
---
contrib/citext/expected/citext_utf8.out | 4 +--
contrib/citext/expected/citext_utf8_1.out | 4 +--
contrib/citext/sql/citext_utf8.sql | 4 +--
doc/src/sgml/config.sgml | 32 -------------------
src/backend/utils/init/postinit.c | 4 ---
src/backend/utils/misc/guc_tables.c | 26 ---------------
.../regress/expected/collate.icu.utf8.out | 4 +--
.../regress/expected/collate.linux.utf8.out | 6 ++--
.../expected/collate.windows.win1252.out | 6 ++--
src/test/regress/sql/collate.icu.utf8.sql | 4 +--
src/test/regress/sql/collate.linux.utf8.sql | 6 ++--
.../regress/sql/collate.windows.win1252.sql | 6 ++--
12 files changed, 22 insertions(+), 84 deletions(-)
diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out
index 77b4586d8f..6630e09a4d 100644
--- a/contrib/citext/expected/citext_utf8.out
+++ b/contrib/citext/expected/citext_utf8.out
@@ -8,8 +8,8 @@
* to the "tr-TR-x-icu" collation where it will succeed.
*/
SELECT getdatabaseencoding() <> 'UTF8' OR
- current_setting('lc_ctype') = 'C' OR
- (SELECT datlocprovider='i' FROM pg_database
+ (SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
+ FROM pg_database
WHERE datname=current_database())
AS skip_test \gset
\if :skip_test
diff --git a/contrib/citext/expected/citext_utf8_1.out b/contrib/citext/expected/citext_utf8_1.out
index d1e1fe1a9d..3caa7a00d4 100644
--- a/contrib/citext/expected/citext_utf8_1.out
+++ b/contrib/citext/expected/citext_utf8_1.out
@@ -8,8 +8,8 @@
* to the "tr-TR-x-icu" collation where it will succeed.
*/
SELECT getdatabaseencoding() <> 'UTF8' OR
- current_setting('lc_ctype') = 'C' OR
- (SELECT datlocprovider='i' FROM pg_database
+ (SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
+ FROM pg_database
WHERE datname=current_database())
AS skip_test \gset
\if :skip_test
diff --git a/contrib/citext/sql/citext_utf8.sql b/contrib/citext/sql/citext_utf8.sql
index 8530c68dd7..1f51df134b 100644
--- a/contrib/citext/sql/citext_utf8.sql
+++ b/contrib/citext/sql/citext_utf8.sql
@@ -9,8 +9,8 @@
*/
SELECT getdatabaseencoding() <> 'UTF8' OR
- current_setting('lc_ctype') = 'C' OR
- (SELECT datlocprovider='i' FROM pg_database
+ (SELECT (datlocprovider = 'c' AND datctype = 'C') OR datlocprovider = 'i'
+ FROM pg_database
WHERE datname=current_database())
AS skip_test \gset
\if :skip_test
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 909a3f28c7..3e9030e3d7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -10788,38 +10788,6 @@ <title>Preset Options</title>
</listitem>
</varlistentry>
- <varlistentry id="guc-lc-collate" xreflabel="lc_collate">
- <term><varname>lc_collate</varname> (<type>string</type>)
- <indexterm>
- <primary><varname>lc_collate</varname> configuration parameter</primary>
- </indexterm>
- </term>
- <listitem>
- <para>
- Reports the locale in which sorting of textual data is done.
- See <xref linkend="locale"/> for more information.
- This value is determined when a database is created.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="guc-lc-ctype" xreflabel="lc_ctype">
- <term><varname>lc_ctype</varname> (<type>string</type>)
- <indexterm>
- <primary><varname>lc_ctype</varname> configuration parameter</primary>
- </indexterm>
- </term>
- <listitem>
- <para>
- Reports the locale that determines character classifications.
- See <xref linkend="locale"/> for more information.
- This value is determined when a database is created.
- Ordinarily this will be the same as <varname>lc_collate</varname>,
- but for special applications it might be set differently.
- </para>
- </listitem>
- </varlistentry>
-
<varlistentry id="guc-max-function-args" xreflabel="max_function_args">
<term><varname>max_function_args</varname> (<type>integer</type>)
<indexterm>
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 53420f4974..df81e35eb8 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -483,10 +483,6 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
quote_identifier(name))));
}
- /* Make the locale settings visible as GUC variables, too */
- SetConfigOption("lc_collate", collate, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
- SetConfigOption("lc_ctype", ctype, PGC_INTERNAL, PGC_S_DYNAMIC_DEFAULT);
-
ReleaseSysCache(tup);
}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5f90aecd47..23d4b38e72 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -563,8 +563,6 @@ static char *syslog_ident_str;
static double phony_random_seed;
static char *client_encoding_string;
static char *datestyle_string;
-static char *locale_collate;
-static char *locale_ctype;
static char *server_encoding_string;
static char *server_version_string;
static int server_version_num;
@@ -4050,30 +4048,6 @@ struct config_string ConfigureNamesString[] =
NULL, NULL, NULL
},
- /* See main.c about why defaults for LC_foo are not all alike */
-
- {
- {"lc_collate", PGC_INTERNAL, PRESET_OPTIONS,
- gettext_noop("Shows the collation order locale."),
- NULL,
- GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
- },
- &locale_collate,
- "C",
- NULL, NULL, NULL
- },
-
- {
- {"lc_ctype", PGC_INTERNAL, PRESET_OPTIONS,
- gettext_noop("Shows the character classification and case conversion locale."),
- NULL,
- GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
- },
- &locale_ctype,
- "C",
- NULL, NULL, NULL
- },
-
{
{"lc_messages", PGC_SUSET, CLIENT_CONN_LOCALE,
gettext_noop("Sets the language in which messages are displayed."),
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index b5a221b030..21840815c9 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1023,7 +1023,7 @@ SET client_min_messages TO WARNING;
do $$
BEGIN
EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
- quote_literal(current_setting('lc_collate')) || ');';
+ quote_literal((SELECT daticulocale FROM pg_database WHERE datname = current_database())) || ');';
END
$$;
CREATE COLLATION test0 FROM "C"; -- fail, duplicate name
@@ -1031,7 +1031,7 @@ ERROR: collation "test0" already exists
do $$
BEGIN
EXECUTE 'CREATE COLLATION test1 (provider = icu, locale = ' ||
- quote_literal(current_setting('lc_collate')) || ');';
+ quote_literal((SELECT daticulocale FROM pg_database WHERE datname = current_database())) || ');';
END
$$;
RESET client_min_messages;
diff --git a/src/test/regress/expected/collate.linux.utf8.out b/src/test/regress/expected/collate.linux.utf8.out
index 6d34667ceb..01664f7c1b 100644
--- a/src/test/regress/expected/collate.linux.utf8.out
+++ b/src/test/regress/expected/collate.linux.utf8.out
@@ -1027,7 +1027,7 @@ CREATE SCHEMA test_schema;
do $$
BEGIN
EXECUTE 'CREATE COLLATION test0 (locale = ' ||
- quote_literal(current_setting('lc_collate')) || ');';
+ quote_literal((SELECT datcollate FROM pg_database WHERE datname = current_database())) || ');';
END
$$;
CREATE COLLATION test0 FROM "C"; -- fail, duplicate name
@@ -1039,9 +1039,9 @@ NOTICE: collation "test0" for encoding "UTF8" already exists, skipping
do $$
BEGIN
EXECUTE 'CREATE COLLATION test1 (lc_collate = ' ||
- quote_literal(current_setting('lc_collate')) ||
+ quote_literal((SELECT datcollate FROM pg_database WHERE datname = current_database())) ||
', lc_ctype = ' ||
- quote_literal(current_setting('lc_ctype')) || ');';
+ quote_literal((SELECT datctype FROM pg_database WHERE datname = current_database())) || ');';
END
$$;
CREATE COLLATION test3 (lc_collate = 'en_US.utf8'); -- fail, need lc_ctype
diff --git a/src/test/regress/expected/collate.windows.win1252.out b/src/test/regress/expected/collate.windows.win1252.out
index 61b421161f..b7b93959de 100644
--- a/src/test/regress/expected/collate.windows.win1252.out
+++ b/src/test/regress/expected/collate.windows.win1252.out
@@ -863,7 +863,7 @@ CREATE SCHEMA test_schema;
do $$
BEGIN
EXECUTE 'CREATE COLLATION test0 (locale = ' ||
- quote_literal(current_setting('lc_collate')) || ');';
+ quote_literal((SELECT datcollate FROM pg_database WHERE datname = current_database())) || ');';
END
$$;
CREATE COLLATION test0 FROM "C"; -- fail, duplicate name
@@ -875,9 +875,9 @@ NOTICE: collation "test0" for encoding "WIN1252" already exists, skipping
do $$
BEGIN
EXECUTE 'CREATE COLLATION test1 (lc_collate = ' ||
- quote_literal(current_setting('lc_collate')) ||
+ quote_literal((SELECT datcollate FROM pg_database WHERE datname = current_database())) ||
', lc_ctype = ' ||
- quote_literal(current_setting('lc_ctype')) || ');';
+ quote_literal((SELECT datctype FROM pg_database WHERE datname = current_database())) || ');';
END
$$;
CREATE COLLATION test3 (lc_collate = 'en_US.utf8'); -- fail, need lc_ctype
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 85e26951b6..c9c2ab8fa6 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -362,14 +362,14 @@ CREATE SCHEMA test_schema;
do $$
BEGIN
EXECUTE 'CREATE COLLATION test0 (provider = icu, locale = ' ||
- quote_literal(current_setting('lc_collate')) || ');';
+ quote_literal((SELECT daticulocale FROM pg_database WHERE datname = current_database())) || ');';
END
$$;
CREATE COLLATION test0 FROM "C"; -- fail, duplicate name
do $$
BEGIN
EXECUTE 'CREATE COLLATION test1 (provider = icu, locale = ' ||
- quote_literal(current_setting('lc_collate')) || ');';
+ quote_literal((SELECT daticulocale FROM pg_database WHERE datname = current_database())) || ');';
END
$$;
diff --git a/src/test/regress/sql/collate.linux.utf8.sql b/src/test/regress/sql/collate.linux.utf8.sql
index 2b787507c5..132d13af0a 100644
--- a/src/test/regress/sql/collate.linux.utf8.sql
+++ b/src/test/regress/sql/collate.linux.utf8.sql
@@ -359,7 +359,7 @@ CREATE SCHEMA test_schema;
do $$
BEGIN
EXECUTE 'CREATE COLLATION test0 (locale = ' ||
- quote_literal(current_setting('lc_collate')) || ');';
+ quote_literal((SELECT datcollate FROM pg_database WHERE datname = current_database())) || ');';
END
$$;
CREATE COLLATION test0 FROM "C"; -- fail, duplicate name
@@ -368,9 +368,9 @@ CREATE COLLATION IF NOT EXISTS test0 (locale = 'foo'); -- ok, skipped
do $$
BEGIN
EXECUTE 'CREATE COLLATION test1 (lc_collate = ' ||
- quote_literal(current_setting('lc_collate')) ||
+ quote_literal((SELECT datcollate FROM pg_database WHERE datname = current_database())) ||
', lc_ctype = ' ||
- quote_literal(current_setting('lc_ctype')) || ');';
+ quote_literal((SELECT datctype FROM pg_database WHERE datname = current_database())) || ');';
END
$$;
CREATE COLLATION test3 (lc_collate = 'en_US.utf8'); -- fail, need lc_ctype
diff --git a/src/test/regress/sql/collate.windows.win1252.sql b/src/test/regress/sql/collate.windows.win1252.sql
index b5c45e1810..353d769a5b 100644
--- a/src/test/regress/sql/collate.windows.win1252.sql
+++ b/src/test/regress/sql/collate.windows.win1252.sql
@@ -310,7 +310,7 @@ CREATE SCHEMA test_schema;
do $$
BEGIN
EXECUTE 'CREATE COLLATION test0 (locale = ' ||
- quote_literal(current_setting('lc_collate')) || ');';
+ quote_literal((SELECT datcollate FROM pg_database WHERE datname = current_database())) || ');';
END
$$;
CREATE COLLATION test0 FROM "C"; -- fail, duplicate name
@@ -319,9 +319,9 @@ CREATE COLLATION IF NOT EXISTS test0 (locale = 'foo'); -- ok, skipped
do $$
BEGIN
EXECUTE 'CREATE COLLATION test1 (lc_collate = ' ||
- quote_literal(current_setting('lc_collate')) ||
+ quote_literal((SELECT datcollate FROM pg_database WHERE datname = current_database())) ||
', lc_ctype = ' ||
- quote_literal(current_setting('lc_ctype')) || ');';
+ quote_literal((SELECT datctype FROM pg_database WHERE datname = current_database())) || ');';
END
$$;
CREATE COLLATION test3 (lc_collate = 'en_US.utf8'); -- fail, need lc_ctype
--
2.40.0
On 09.05.23 17:09, Jeff Davis wrote:
It's awkward for a user to read pg_database.datlocprovider, then
depending on that, either look in datcollate or daticulocale. (It's
awkward in the code, too.)
Maybe some built-in function that returns a tuple of the default
provider, the locale, and the version? Or should we also output the
ctype somehow (which affects the results of upper()/lower())?
There is also the deterministic flag and the icurules setting.
Depending on what level of detail you imagine the user needs, you really
do need to look at the whole picture, not some subset of it.
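To see more of that picture, something like the following could be used. It is a sketch only, assuming the PG 16 development catalogs, where pg_database has daticurules and pg_collation has collisdeterministic and collicurules; the collation name level3 is just an example:

```sql
-- Database default: provider, locales, ICU rules, and recorded version.
SELECT datlocprovider, datcollate, datctype,
       daticulocale, daticurules, datcollversion
FROM pg_database
WHERE datname = current_database();

-- Per-collation details, including the deterministic flag.
SELECT collname, collprovider, collisdeterministic,
       collcollate, collctype, colliculocale, collicurules, collversion
FROM pg_collation
WHERE collname = 'level3';   -- example name, substitute your own
```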
New patch series attached.
=== 0001: fix bug that allows creating hidden collations
Bug:
/messages/by-id/051c9395cf880307865ee8b17acdbf7f838c1e39.camel@j-davis.com
=== 0002: handle some kinds of libc-style locale strings
ICU used to handle libc locale strings like 'fr_FR@euro', but doesn't
in later versions. Handle them in postgres for consistency.
=== 0003: reduce icu_validation_level to WARNING
Given that we've seen some inconsistency in which locale names are
accepted in different ICU versions, it seems best not to be too strict.
Peter Eisentraut suggested that it be set to ERROR originally, but a
WARNING should be sufficient to see problems without introducing risks
when migrating to version 16.
I don't expect objections to 0003, so I may commit this soon, but I'll
give it a little time in case someone has an opinion.
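For illustration, this is roughly how the setting behaves at WARNING, based on the regression output quoted earlier in the thread (the collation name testx and the locale string come from that test):

```sql
SET icu_validation_level = WARNING;

-- Expected to emit a warning rather than fail:
--   WARNING:  ICU locale "nonsense-nowhere" has unknown language "nonsense"
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere');
DROP COLLATION testx;

RESET icu_validation_level;
```

With the level at ERROR, the same CREATE COLLATION is rejected, as the earlier part of that regression output shows.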
=== 0004-0006:
To solve the issues that have come up in this thread, we need CREATE
DATABASE (and createdb and initdb) to use LOCALE to mean the collation
locale regardless of which provider is in use (which is what 0006
does).
0006 depends on ICU handling libc locale names. It already does a good
job for most libc locale names (though patch 0002 fixes a few cases
where it doesn't). There may be more cases, but for the most part libc
names are interpreted in a reasonable way. But one important case is
missing: ICU does not handle the "C" locale as we expect (that is,
using memcmp()).
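The difference is easy to see with a mixed-case comparison. A minimal sketch, assuming an ICU-enabled build where the predefined "und-x-icu" (root) collation is available:

```sql
-- memcmp() compares bytes: 'a' is 0x61 and 'B' is 0x42,
-- so 'a' sorts after 'B'.
SELECT 'a' < 'B' COLLATE "C";          -- false

-- The ICU root collation compares the letters first and case later,
-- so 'a' sorts before 'B'.
SELECT 'a' < 'B' COLLATE "und-x-icu";  -- true
```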
We've already allowed users to create ICU collations with the C locale
in the past, which uses the root collation (not memcmp()), and we need
to keep supporting that for upgraded clusters. So that leaves us with a
catalog representation problem. I mentioned upthread that we can solve
that by:
1. Using iculocale=NULL to mean "C-as-in-memcmp", or having some
other catalog hack (like another field). That's not desirable because
the catalog representation is already complex and it may be hard for
users to tell what's happening.
2. When provider=icu and locale=C, switch to provider=libc locale=C.
This is very messy, because currently the syntax allows specifying a
database with LOCALE_PROVIDER='icu' ICU_LOCALE='C' LC_COLLATE='en_US' --
if the provider gets changed to libc, what would we set datcollate
to? I don't think this is workable without some breakage. We can't
simply override datcollate to be C in that case, because there are some
things other than the default collation that might need it set to en_US
as the user specified.
3. Introduce collation provider "none", which is always memcmp-based
(patch 0004). It's equivalent to the libc locale=C, but it allows
specifying the LC_COLLATE and LC_CTYPE independently. A command like
CREATE DATABASE ... LOCALE_PROVIDER='icu' ICU_LOCALE='C'
LC_COLLATE='en_US' would get changed (with a NOTICE) to provider "none"
(patch 0005), so you'd have datlocprovider=none, datcollate=en_US. For
the database default collation, that would always use memcmp(), but the
server environment LC_COLLATE would be set to en_US as the user
specified.
For this patch series, I chose approach #3. I think it works out nicely
-- it provides a better place to document the "no locale" behavior
(including a warning that it depends on the database encoding), and I
think it's more clear to the user that locale=C is not actually using a
provider at all. It's more invasive, but feels like a better solution.
If others don't like it I can implement approach #1 instead.
=== 0007: Add a GUC to control the default collation provider
Having a GUC would make it easier to migrate to ICU without surprises.
This only affects the default for CREATE COLLATION, not CREATE DATABASE
(and obviously not initdb).
--
Jeff Davis
PostgreSQL Contributor Team - AWS
Attachments:
v5-0001-For-user-defined-collations-never-set-collencodin.patch (text/x-patch; charset=UTF-8)
From fc66f02976bb11b629bcf71346c2858eccbcf1a3 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 11 May 2023 10:36:04 -0700
Subject: [PATCH v5 1/7] For user-defined collations, never set
collencoding=-1.
For new user-defined collations, always set collencoding to the
current database encoding so that it is never shadowed by a built-in
collation.
Built-in collations that work with any encoding may have
collencoding=-1, and if a user defines a collation with the same name,
it will shadow the built-in collation.
Previously it was possible to create an ICU collation (which was
assigned collencoding=-1) that was shadowed by a built-in collation
and completely inaccessible.
---
src/backend/commands/collationcmds.c | 28 +++++++++++++------
.../regress/expected/collate.icu.utf8.out | 2 +-
2 files changed, 21 insertions(+), 9 deletions(-)
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index c91fe66d9b..a53700256b 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -302,16 +302,29 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
errmsg("ICU rules cannot be specified unless locale provider is ICU")));
+ /*
+ * The collencoding is used to hide built-in collations that are
+ * incompatible with the current database encoding, allowing users to
+ * define a compatible collation with the same name if
+ * desired. Built-in collations that work with any encoding have
+ * collencoding=-1.
+ *
+ * A collation that's a match to the current database encoding will
+ * shadow a collation with the same name and collencoding=-1. We never
+ * want a user-created collation to be shadowed by a built-in
+ * collation, so for user-created collations, always set collencoding
+ * to the current database encoding.
+ */
+ collencoding = GetDatabaseEncoding();
+
if (collprovider == COLLPROVIDER_ICU)
{
#ifdef USE_ICU
/*
- * We could create ICU collations with collencoding == database
- * encoding, but it seems better to use -1 so that it matches the
- * way initdb would create ICU collations. However, only allow
- * one to be created when the current database's encoding is
- * supported. Otherwise the collation is useless, plus we get
- * surprising behaviors like not being able to drop the collation.
+ * Only allow an ICU collation to be created when the current
+ * database's encoding is supported. Otherwise the collation is
+ * useless, plus we get surprising behaviors like not being able
+ * to drop the collation.
*
* Skip this test when !USE_ICU, because the error we want to
* throw for that isn't thrown till later.
@@ -321,11 +334,10 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("current database's encoding is not supported with this provider")));
#endif
- collencoding = -1;
}
else
{
- collencoding = GetDatabaseEncoding();
+ Assert(collprovider == COLLPROVIDER_LIBC);
check_encoding_locale_matches(collencoding, collcollate, collctype);
}
}
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index b5a221b030..9c9e1e4f48 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1062,7 +1062,7 @@ SELECT collname FROM pg_collation WHERE collname LIKE 'test%' ORDER BY 1;
ALTER COLLATION test1 RENAME TO test11;
ALTER COLLATION test0 RENAME TO test11; -- fail
-ERROR: collation "test11" already exists in schema "collate_tests"
+ERROR: collation "test11" for encoding "UTF8" already exists in schema "collate_tests"
ALTER COLLATION test1 RENAME TO test22; -- fail
ERROR: collation "test1" for encoding "UTF8" does not exist
ALTER COLLATION test11 OWNER TO regress_test_role;
--
2.34.1
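The shadowing rule that patch 0001 relies on can be sketched outside the server as a two-pass lookup: an entry matching the current database encoding is preferred, and an any-encoding entry (collencoding=-1) is only a fallback. Everything below (the CollEntry struct, lookup_coll, and the encoding numbers) is hypothetical and simplified for illustration; it is not the catalog code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a pg_collation row. */
typedef struct
{
    const char *name;
    int         encoding;   /* -1 means "works with any encoding" */
    int         id;
} CollEntry;

/*
 * Look up a collation by name. An entry whose encoding matches the
 * database encoding wins; an any-encoding (-1) entry is used only if
 * no exact match exists. Returns NULL if nothing matches.
 */
static const CollEntry *
lookup_coll(const CollEntry *entries, size_t n,
            const char *name, int db_encoding)
{
    /* Pass 1: encoding-specific entry. */
    for (size_t i = 0; i < n; i++)
    {
        if (entries[i].encoding == db_encoding &&
            strcmp(entries[i].name, name) == 0)
            return &entries[i];
    }

    /* Pass 2: any-encoding entry. */
    for (size_t i = 0; i < n; i++)
    {
        if (entries[i].encoding == -1 &&
            strcmp(entries[i].name, name) == 0)
            return &entries[i];
    }

    return NULL;
}
```

Under this rule, a user-defined collation stored with collencoding=-1 can be hidden by an encoding-specific built-in entry of the same name, whereas one stored with the database encoding is found in the first pass. That is the motivation the commit message gives for never assigning -1 to user-defined collations.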
v5-0002-ICU-fix-up-old-libc-style-locale-strings.patch (text/x-patch; charset=UTF-8)
From 25824dc213272c739eecd16b17a3458fc5f81339 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 28 Apr 2023 12:22:41 -0700
Subject: [PATCH v5 2/7] ICU: fix up old libc-style locale strings.
Before transforming a locale string into a language tag, fix up old
libc-style locale strings such as 'fr_FR@euro'. Older ICU versions did
this automatically, but ICU version 64 removed that support.
Discussion: https://postgr.es/m/654a49f7ff7461bcf47be4181430678d45f93858.camel%40j-davis.com
---
src/backend/utils/adt/pg_locale.c | 59 ++++++++++++++++-
src/bin/initdb/initdb.c | 63 ++++++++++++++++++-
.../regress/expected/collate.icu.utf8.out | 11 ++++
src/test/regress/sql/collate.icu.utf8.sql | 7 +++
4 files changed, 138 insertions(+), 2 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index f0b6567da1..e7b166461b 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2766,6 +2766,60 @@ icu_set_collation_attributes(UCollator *collator, const char *loc,
pfree(lower_str);
}
+static const char *icu_variant_map[][2] = {
+ { "@EURO", "@currency=EUR" },
+ { "@PINYIN", "@collation=pinyin" },
+ { "@STROKE", "@collation=stroke" },
+};
+
+#define ICU_VARIANT_MAP_SIZE \
+ (sizeof(icu_variant_map)/sizeof(icu_variant_map[0]))
+
+/*
+ * ICU version 64 removed the ability to transform locale strings of the form
+ * '...@VARIANT' into proper language tags. Perform the transformation from
+ * within Postgres so that ICU supports any libc locale name consistently,
+ * regardless of the ICU version.
+ */
+static char *
+icu_fix_variants(const char *loc_str)
+{
+ const char *old_variant = strrchr(loc_str, '@');
+
+ /*
+ * Extract a variant of the form '...@VARIANT', and replace with
+ * the appropriate '...@keyword=value' if found in the map.
+ */
+ if (old_variant)
+ {
+ size_t prefix_len = old_variant - loc_str; /* bytes before the '@' */
+
+ for (int i = 0; i < ICU_VARIANT_MAP_SIZE; i++)
+ {
+ const char *map_variant = icu_variant_map[i][0];
+ const char *map_replacement = icu_variant_map[i][1];
+
+ if (pg_strcasecmp(old_variant, map_variant) == 0)
+ {
+ size_t replacement_len = strlen(map_replacement);
+ size_t result_len;
+ char *result;
+
+ result_len = prefix_len + replacement_len + 1;
+ result = palloc(result_len);
+
+ memcpy(result, loc_str, prefix_len);
+ memcpy(result + prefix_len, map_replacement, replacement_len);
+ result[prefix_len + replacement_len] = '\0';
+
+ return result;
+ }
+ }
+ }
+
+ return pstrdup(loc_str);
+}
+
#endif
/*
@@ -2782,6 +2836,7 @@ icu_language_tag(const char *loc_str, int elevel)
{
#ifdef USE_ICU
UErrorCode status;
+ char *fixed_loc_str = icu_fix_variants(loc_str);
char lang[ULOC_LANG_CAPACITY];
char *langtag;
size_t buflen = 32; /* arbitrary starting buffer size */
@@ -2814,7 +2869,7 @@ icu_language_tag(const char *loc_str, int elevel)
int32_t len;
status = U_ZERO_ERROR;
- len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+ len = uloc_toLanguageTag(fixed_loc_str, langtag, buflen, strict, &status);
/*
* If the result fits in the buffer exactly (len == buflen),
@@ -2834,6 +2889,8 @@ icu_language_tag(const char *loc_str, int elevel)
break;
}
+ pfree(fixed_loc_str);
+
if (U_FAILURE(status))
{
pfree(langtag);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 2c208ead01..2b5cc30955 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2229,6 +2229,64 @@ check_icu_locale_encoding(int user_enc)
return true;
}
+#ifdef USE_ICU
+
+static const char *icu_variant_map[][2] = {
+ { "@EURO", "@currency=EUR" },
+ { "@PINYIN", "@collation=pinyin" },
+ { "@STROKE", "@collation=stroke" },
+};
+
+#define ICU_VARIANT_MAP_SIZE \
+ (sizeof(icu_variant_map)/sizeof(icu_variant_map[0]))
+
+/*
+ * ICU version 64 removed the ability to transform locale strings of the form
+ * '...@VARIANT' into proper language tags. Perform the transformation from
+ * within Postgres so that ICU supports any libc locale name consistently,
+ * regardless of the ICU version.
+ */
+static char *
+icu_fix_variants(const char *loc_str)
+{
+ const char *old_variant = strrchr(loc_str, '@');
+
+ /*
+ * Extract a variant of the form '...@VARIANT', and replace with
+ * the appropriate '...@keyword=value' if found in the map.
+ */
+ if (old_variant)
+ {
+ size_t prefix_len = old_variant - loc_str; /* bytes before the '@' */
+
+ for (int i = 0; i < ICU_VARIANT_MAP_SIZE; i++)
+ {
+ const char *map_variant = icu_variant_map[i][0];
+ const char *map_replacement = icu_variant_map[i][1];
+
+ if (pg_strcasecmp(old_variant, map_variant) == 0)
+ {
+ size_t replacement_len = strlen(map_replacement);
+ size_t result_len;
+ char *result;
+
+ result_len = prefix_len + replacement_len + 1;
+ result = pg_malloc(result_len);
+
+ memcpy(result, loc_str, prefix_len);
+ memcpy(result + prefix_len, map_replacement, replacement_len);
+ result[prefix_len + replacement_len] = '\0';
+
+ return result;
+ }
+ }
+ }
+
+ return pg_strdup(loc_str);
+}
+
+#endif
+
/*
* Convert to canonical BCP47 language tag. Must be consistent with
* icu_language_tag().
@@ -2238,6 +2296,7 @@ icu_language_tag(const char *loc_str)
{
#ifdef USE_ICU
UErrorCode status;
+ char *fixed_loc_str = icu_fix_variants(loc_str);
char lang[ULOC_LANG_CAPACITY];
char *langtag;
size_t buflen = 32; /* arbitrary starting buffer size */
@@ -2268,7 +2327,7 @@ icu_language_tag(const char *loc_str)
int32_t len;
status = U_ZERO_ERROR;
- len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+ len = uloc_toLanguageTag(fixed_loc_str, langtag, buflen, strict, &status);
/*
* If the result fits in the buffer exactly (len == buflen),
@@ -2287,6 +2346,8 @@ icu_language_tag(const char *loc_str)
break;
}
+ pg_free(fixed_loc_str);
+
if (U_FAILURE(status))
{
pg_free(langtag);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 9c9e1e4f48..e0f11e3cd4 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1042,13 +1042,24 @@ ERROR: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
ERROR: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+CREATE COLLATION testx (provider = icu, locale = '@ASDF'); -- fails
+ERROR: could not convert locale name "@ASDF" to language tag: U_ILLEGAL_ARGUMENT_ERROR
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
WARNING: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
WARNING: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION testx (provider = icu, locale = '@ASDF'); DROP COLLATION testx;
+WARNING: could not convert locale name "@ASDF" to language tag: U_ILLEGAL_ARGUMENT_ERROR
RESET icu_validation_level;
+-- test special variants
+CREATE COLLATION testx (provider = icu, locale = '@EURO'); DROP COLLATION testx;
+NOTICE: using standard form "und-u-cu-eur" for locale "@EURO"
+CREATE COLLATION testx (provider = icu, locale = '@pinyin'); DROP COLLATION testx;
+NOTICE: using standard form "und-u-co-pinyin" for locale "@pinyin"
+CREATE COLLATION testx (provider = icu, locale = '@stroke'); DROP COLLATION testx;
+NOTICE: using standard form "und-u-co-stroke" for locale "@stroke"
CREATE COLLATION test4 FROM nonsense;
ERROR: collation "nonsense" for encoding "UTF8" does not exist
CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 85e26951b6..8d5423bc17 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -378,11 +378,18 @@ RESET client_min_messages;
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+CREATE COLLATION testx (provider = icu, locale = '@ASDF'); -- fails
SET icu_validation_level = WARNING;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = '@ASDF'); DROP COLLATION testx;
RESET icu_validation_level;
+-- test special variants
+CREATE COLLATION testx (provider = icu, locale = '@EURO'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = '@pinyin'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = '@stroke'); DROP COLLATION testx;
+
CREATE COLLATION test4 FROM nonsense;
CREATE COLLATION test5 FROM test0;
--
2.34.1
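For anyone who wants to try the variant rewrite from patch 0002 outside the tree, here is a standalone sketch of the same idea using plain malloc. The names fix_variant and ascii_equal_nocase are hypothetical stand-ins (the patch uses icu_fix_variants with palloc/pg_malloc and pg_strcasecmp), and the sketch only performs the string rewrite; it does not call ICU.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Same three mappings as the patch's icu_variant_map. */
static const char *variant_map[][2] = {
    {"@EURO", "@currency=EUR"},
    {"@PINYIN", "@collation=pinyin"},
    {"@STROKE", "@collation=stroke"},
};

#define VARIANT_MAP_SIZE (sizeof(variant_map) / sizeof(variant_map[0]))

/* ASCII case-insensitive equality; stand-in for pg_strcasecmp() == 0. */
static int
ascii_equal_nocase(const char *a, const char *b)
{
    while (*a && *b)
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;            /* equal only if both strings ended */
}

/*
 * Replace a trailing '@VARIANT' with its '@keyword=value' form if it is
 * in the map; otherwise return a copy of the input unchanged. The
 * result is malloc'd and must be freed by the caller (NULL on OOM).
 */
static char *
fix_variant(const char *loc)
{
    const char *at = strrchr(loc, '@');
    size_t      len;
    char       *result;

    if (at != NULL)
    {
        size_t  prefix_len = (size_t) (at - loc);   /* bytes before '@' */

        for (size_t i = 0; i < VARIANT_MAP_SIZE; i++)
        {
            if (ascii_equal_nocase(at, variant_map[i][0]))
            {
                size_t  repl_len = strlen(variant_map[i][1]);

                result = malloc(prefix_len + repl_len + 1);
                if (result == NULL)
                    return NULL;
                memcpy(result, loc, prefix_len);
                /* repl_len + 1 also copies the terminating NUL */
                memcpy(result + prefix_len, variant_map[i][1], repl_len + 1);
                return result;
            }
        }
    }

    len = strlen(loc);
    result = malloc(len + 1);
    if (result != NULL)
        memcpy(result, loc, len + 1);
    return result;
}
```

Given 'fr_FR@euro' this returns 'fr_FR@currency=EUR'; an unrecognized variant such as '@ASDF' is passed through unchanged, which is why the regression test in the patch still expects the language tag conversion to fail for it.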
v5-0003-Reduce-icu_validation_level-default-to-WARNING.patch (text/x-patch; charset=UTF-8)
From cd839f069cc09a71788bafa28730e4caf8f9d768 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Wed, 10 May 2023 10:47:16 -0700
Subject: [PATCH v5 3/7] Reduce icu_validation_level default to WARNING.
---
doc/src/sgml/config.sgml | 2 +-
src/backend/utils/adt/pg_locale.c | 2 +-
src/backend/utils/misc/guc_tables.c | 2 +-
src/backend/utils/misc/postgresql.conf.sample | 2 +-
src/test/regress/expected/collate.icu.utf8.out | 4 ++--
src/test/regress/sql/collate.icu.utf8.sql | 4 ++--
6 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b56f073a91..c4a9dcb9ae 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9840,7 +9840,7 @@ SET XML OPTION { DOCUMENT | CONTENT };
<para>
If set to <literal>DISABLED</literal>, does not report validation
problems at all. Otherwise reports problems at the given message
- level. The default is <literal>ERROR</literal>.
+ level. The default is <literal>WARNING</literal>.
</para>
</listitem>
</varlistentry>
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index e7b166461b..bb4a8d84f6 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -96,7 +96,7 @@ char *locale_monetary;
char *locale_numeric;
char *locale_time;
-int icu_validation_level = ERROR;
+int icu_validation_level = WARNING;
/*
* lc_time localization cache.
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 2f42cebaf6..8c843f4ab6 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -4689,7 +4689,7 @@ struct config_enum ConfigureNamesEnum[] =
NULL
},
&icu_validation_level,
- ERROR, icu_validation_level_options,
+ WARNING, icu_validation_level_options,
NULL, NULL, NULL
},
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b70c66ca87..87bad8ecbf 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -734,7 +734,7 @@
#lc_numeric = 'C' # locale for number formatting
#lc_time = 'C' # locale for time formatting
-#icu_validation_level = ERROR # report ICU locale validation
+#icu_validation_level = WARNING # report ICU locale validation
# errors at the given level
# default configuration for text search
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index e0f11e3cd4..12afc3b65a 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1037,6 +1037,7 @@ $$;
RESET client_min_messages;
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
ERROR: parameter "locale" must be specified
+SET icu_validation_level = ERROR;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
ERROR: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
@@ -1044,7 +1045,7 @@ CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=
ERROR: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
CREATE COLLATION testx (provider = icu, locale = '@ASDF'); -- fails
ERROR: could not convert locale name "@ASDF" to language tag: U_ILLEGAL_ARGUMENT_ERROR
-SET icu_validation_level = WARNING;
+RESET icu_validation_level;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
WARNING: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
@@ -1052,7 +1053,6 @@ WARNING: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
CREATE COLLATION testx (provider = icu, locale = '@ASDF'); DROP COLLATION testx;
WARNING: could not convert locale name "@ASDF" to language tag: U_ILLEGAL_ARGUMENT_ERROR
-RESET icu_validation_level;
-- test special variants
CREATE COLLATION testx (provider = icu, locale = '@EURO'); DROP COLLATION testx;
NOTICE: using standard form "und-u-cu-eur" for locale "@EURO"
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 8d5423bc17..655c965f46 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -376,14 +376,14 @@ $$;
RESET client_min_messages;
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
+SET icu_validation_level = ERROR;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
CREATE COLLATION testx (provider = icu, locale = '@ASDF'); -- fails
-SET icu_validation_level = WARNING;
+RESET icu_validation_level;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = '@ASDF'); DROP COLLATION testx;
-RESET icu_validation_level;
-- test special variants
CREATE COLLATION testx (provider = icu, locale = '@EURO'); DROP COLLATION testx;
--
2.34.1
v5-0004-Introduce-collation-provider-none.patch (text/x-patch; charset=UTF-8)
From a13a15988ab2e991e42569b8b1e0cd1d6e940baf Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 1 May 2023 15:38:29 -0700
Subject: [PATCH v5 4/7] Introduce collation provider "none".
Provides locale-unaware semantics that are implemented as fast byte
operations in Postgres, independent of the operating system or any
provider libraries.
Equivalent (in semantics and implementation) to the libc provider with
locale "C", except that LC_COLLATE and LC_CTYPE can be set
independently.
Use provider "none" for built-in collation "ucs_basic" instead of
libc.
---
doc/src/sgml/charset.sgml | 87 +++++++++++++++++++++-----
doc/src/sgml/ref/create_collation.sgml | 2 +-
doc/src/sgml/ref/create_database.sgml | 2 +-
doc/src/sgml/ref/createdb.sgml | 2 +-
doc/src/sgml/ref/initdb.sgml | 2 +-
src/backend/catalog/pg_collation.c | 7 ++-
src/backend/commands/collationcmds.c | 84 ++++++++++++++++++++-----
src/backend/commands/dbcommands.c | 69 +++++++++++++++++---
src/backend/utils/adt/pg_locale.c | 27 +++++++-
src/backend/utils/init/postinit.c | 10 ++-
src/bin/initdb/initdb.c | 33 +++++++++-
src/bin/initdb/t/001_initdb.pl | 29 +++++++++
src/bin/pg_dump/pg_dump.c | 8 ++-
src/bin/pg_upgrade/t/002_pg_upgrade.pl | 18 +++++-
src/bin/psql/describe.c | 2 +-
src/bin/scripts/createdb.c | 2 +-
src/bin/scripts/t/020_createdb.pl | 29 +++++++++
src/include/catalog/pg_collation.dat | 3 +-
src/include/catalog/pg_collation.h | 3 +
src/test/regress/expected/collate.out | 10 ++-
src/test/regress/sql/collate.sql | 6 ++
21 files changed, 372 insertions(+), 63 deletions(-)
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 6dd95b8966..de7c65ae35 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -342,22 +342,14 @@ initdb --locale=sv_SE
<title>Locale Providers</title>
<para>
- <productname>PostgreSQL</productname> supports multiple <firstterm>locale
- providers</firstterm>. This specifies which library supplies the locale
- data. One standard provider name is <literal>libc</literal>, which uses
- the locales provided by the operating system C library. These are the
- locales used by most tools provided by the operating system. Another
- provider is <literal>icu</literal>, which uses the external
- ICU<indexterm><primary>ICU</primary></indexterm> library. ICU locales can
- only be used if support for ICU was configured when PostgreSQL was built.
+ A locale provider specifies which library defines the locale behavior for
+ collations and character classifications.
</para>
<para>
The commands and tools that select the locale settings, as described
- above, each have an option to select the locale provider. The examples
- shown earlier all use the <literal>libc</literal> provider, which is the
- default. Here is an example to initialize a database cluster using the
- ICU provider:
+ above, each have an option to select the locale provider. Here is an
+ example to initialize a database cluster using the ICU provider:
<programlisting>
initdb --locale-provider=icu --icu-locale=en
</programlisting>
@@ -370,12 +362,73 @@ initdb --locale-provider=icu --icu-locale=en
</para>
<para>
- Which locale provider to use depends on individual requirements. For most
- basic uses, either provider will give adequate results. For the libc
- provider, it depends on what the operating system offers; some operating
- systems are better than others. For advanced uses, ICU offers more locale
- variants and customization options.
+ Regardless of the locale provider, the operating system is still used to
+ provide some locale-aware behavior, such as messages (see <xref
+ linkend="guc-lc-messages"/>).
</para>
+
+ <para>
+ The available locale providers are listed below.
+ </para>
+
+ <sect3 id="locale-provider-none">
+ <title>None</title>
+ <para>
+ The <literal>none</literal> provider uses simple built-in operations
+ which are not locale-aware.
+ </para>
+ <para>
+ The collation and character classification behavior is equivalent to
+ using the <literal>libc</literal> provider with locale
+ <literal>C</literal>, except that <literal>LC_COLLATE</literal> and
+ <literal>LC_CTYPE</literal> can be set independently.
+ </para>
+ <note>
+ <para>
+ When using the <literal>none</literal> locale provider, behavior may
+ depend on the database encoding.
+ </para>
+ </note>
+ </sect3>
+ <sect3 id="locale-provider-icu">
+ <title>ICU</title>
+ <para>
+ The <literal>icu</literal> provider uses the external
+ ICU<indexterm><primary>ICU</primary></indexterm>
+ library. <productname>PostgreSQL</productname> must have been configured
+ with support.
+ </para>
+ <para>
+ ICU provides collation and character classification behavior that is
+ independent of the operating system and database encoding, which is
+ preferable if you expect to transition to other platforms without any
+ change in results. <literal>LC_COLLATE</literal> and
+ <literal>LC_CTYPE</literal> can be set independently of the ICU locale.
+ </para>
+ <note>
+ <para>
+ For the ICU provider, results may depend on the version of the ICU
+ library used, as it is updated to reflect changes in natural language
+ over time.
+ </para>
+ </note>
+ </sect3>
+ <sect3 id="locale-provider-libc">
+ <title>libc</title>
+ <para>
+ The <literal>libc</literal> provider uses the operating system's C
+ library. The collation and character classification behavior is
+ controlled by the settings <literal>LC_COLLATE</literal> and
+ <literal>LC_CTYPE</literal>, so they cannot be set independently.
+ </para>
+ <note>
+ <para>
+ The same locale name may have different behavior on different platforms
+ when using the libc provider.
+ </para>
+ </note>
+ </sect3>
+
</sect2>
<sect2 id="locale-problems">
diff --git a/doc/src/sgml/ref/create_collation.sgml b/doc/src/sgml/ref/create_collation.sgml
index f6353da5c1..5489ae7413 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -120,7 +120,7 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
<listitem>
<para>
Specifies the provider to use for locale services associated with this
- collation. Possible values are
+ collation. Possible values are <literal>none</literal>,
<literal>icu</literal><indexterm><primary>ICU</primary></indexterm>
(if the server was built with ICU support) or <literal>libc</literal>.
<literal>libc</literal> is the default. See <xref
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 13793bb6b7..60b9da0952 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -212,7 +212,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
<listitem>
<para>
Specifies the provider to use for the default collation in this
- database. Possible values are
+ database. Possible values are <literal>none</literal>,
<literal>icu</literal><indexterm><primary>ICU</primary></indexterm>
(if the server was built with ICU support) or <literal>libc</literal>.
By default, the provider is the same as that of the <xref
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index e23419ba6c..326a371d34 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -168,7 +168,7 @@ PostgreSQL documentation
</varlistentry>
<varlistentry>
- <term><option>--locale-provider={<literal>libc</literal>|<literal>icu</literal>}</option></term>
+ <term><option>--locale-provider={<literal>none</literal>|<literal>libc</literal>|<literal>icu</literal>}</option></term>
<listitem>
<para>
Specifies the locale provider for the database's default collation.
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 87945b4b62..e604ab48b7 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -323,7 +323,7 @@ PostgreSQL documentation
</varlistentry>
<varlistentry id="app-initdb-option-locale-provider">
- <term><option>--locale-provider={<literal>libc</literal>|<literal>icu</literal>}</option></term>
+ <term><option>--locale-provider={<literal>none</literal>|<literal>libc</literal>|<literal>icu</literal>}</option></term>
<listitem>
<para>
This option sets the locale provider for databases created in the new
diff --git a/src/backend/catalog/pg_collation.c b/src/backend/catalog/pg_collation.c
index fd022e6fc2..86b6ba2375 100644
--- a/src/backend/catalog/pg_collation.c
+++ b/src/backend/catalog/pg_collation.c
@@ -68,7 +68,12 @@ CollationCreate(const char *collname, Oid collnamespace,
Assert(collname);
Assert(collnamespace);
Assert(collowner);
- Assert((collcollate && collctype) || colliculocale);
+ Assert((collprovider == COLLPROVIDER_NONE &&
+ !collcollate && !collctype && !colliculocale) ||
+ (collprovider == COLLPROVIDER_LIBC &&
+ collcollate && collctype && !colliculocale) ||
+ (collprovider == COLLPROVIDER_ICU &&
+ !collcollate && !collctype && colliculocale));
/*
* Make sure there is no existing collation of same name & encoding.
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index a53700256b..267a551818 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -215,7 +215,9 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
if (collproviderstr)
{
- if (pg_strcasecmp(collproviderstr, "icu") == 0)
+ if (pg_strcasecmp(collproviderstr, "none") == 0)
+ collprovider = COLLPROVIDER_NONE;
+ else if (pg_strcasecmp(collproviderstr, "icu") == 0)
collprovider = COLLPROVIDER_ICU;
else if (pg_strcasecmp(collproviderstr, "libc") == 0)
collprovider = COLLPROVIDER_LIBC;
@@ -228,6 +230,13 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
else
collprovider = COLLPROVIDER_LIBC;
+ if (collprovider == COLLPROVIDER_NONE
+ && (localeEl || lccollateEl || lcctypeEl))
+ {
+ ereport(ERROR,
+ (errmsg("collation provider \"none\" does not support LOCALE, LC_COLLATE, or LC_CTYPE")));
+ }
+
if (localeEl)
{
if (collprovider == COLLPROVIDER_LIBC)
@@ -317,7 +326,15 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
*/
collencoding = GetDatabaseEncoding();
- if (collprovider == COLLPROVIDER_ICU)
+ if (collprovider == COLLPROVIDER_NONE)
+ {
+ /*
+ * The "none" provider works with all encodings, so no checking is
+ * required. NB: the behavior may be different for different
+ * encodings, though.
+ */
+ }
+ else if (collprovider == COLLPROVIDER_ICU)
{
#ifdef USE_ICU
/*
@@ -343,7 +360,18 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
}
if (!collversion)
- collversion = get_collation_actual_version(collprovider, collprovider == COLLPROVIDER_ICU ? colliculocale : collcollate);
+ {
+ char *locale;
+
+ if (collprovider == COLLPROVIDER_ICU)
+ locale = colliculocale;
+ else if (collprovider == COLLPROVIDER_LIBC)
+ locale = collcollate;
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
+
+ collversion = get_collation_actual_version(collprovider, locale);
+ }
newoid = CollationCreate(collName,
collNamespace,
@@ -418,6 +446,7 @@ AlterCollation(AlterCollationStmt *stmt)
Form_pg_collation collForm;
Datum datum;
bool isnull;
+ char *locale;
char *oldversion;
char *newversion;
ObjectAddress address;
@@ -442,8 +471,20 @@ AlterCollation(AlterCollationStmt *stmt)
datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collversion, &isnull);
oldversion = isnull ? NULL : TextDatumGetCString(datum);
- datum = SysCacheGetAttrNotNull(COLLOID, tup, collForm->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate);
- newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
+ if (collForm->collprovider == COLLPROVIDER_ICU)
+ {
+ datum = SysCacheGetAttrNotNull(COLLOID, tup, Anum_pg_collation_colliculocale);
+ locale = TextDatumGetCString(datum);
+ }
+ else if (collForm->collprovider == COLLPROVIDER_LIBC)
+ {
+ datum = SysCacheGetAttrNotNull(COLLOID, tup, Anum_pg_collation_collcollate);
+ locale = TextDatumGetCString(datum);
+ }
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
+
+ newversion = get_collation_actual_version(collForm->collprovider, locale);
/* cannot change from NULL to non-NULL or vice versa */
if ((!oldversion && newversion) || (oldversion && !newversion))
@@ -506,11 +547,18 @@ pg_collation_actual_version(PG_FUNCTION_ARGS)
provider = ((Form_pg_database) GETSTRUCT(dbtup))->datlocprovider;
- datum = SysCacheGetAttrNotNull(DATABASEOID, dbtup,
- provider == COLLPROVIDER_ICU ?
- Anum_pg_database_daticulocale : Anum_pg_database_datcollate);
-
- locale = TextDatumGetCString(datum);
+ if (provider == COLLPROVIDER_ICU)
+ {
+ datum = SysCacheGetAttrNotNull(DATABASEOID, dbtup, Anum_pg_database_daticulocale);
+ locale = TextDatumGetCString(datum);
+ }
+ else if (provider == COLLPROVIDER_LIBC)
+ {
+ datum = SysCacheGetAttrNotNull(DATABASEOID, dbtup, Anum_pg_database_datcollate);
+ locale = TextDatumGetCString(datum);
+ }
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
ReleaseSysCache(dbtup);
}
@@ -526,11 +574,19 @@ pg_collation_actual_version(PG_FUNCTION_ARGS)
provider = ((Form_pg_collation) GETSTRUCT(colltp))->collprovider;
Assert(provider != COLLPROVIDER_DEFAULT);
- datum = SysCacheGetAttrNotNull(COLLOID, colltp,
- provider == COLLPROVIDER_ICU ?
- Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate);
- locale = TextDatumGetCString(datum);
+ if (provider == COLLPROVIDER_ICU)
+ {
+ datum = SysCacheGetAttrNotNull(COLLOID, colltp, Anum_pg_collation_colliculocale);
+ locale = TextDatumGetCString(datum);
+ }
+ else if (provider == COLLPROVIDER_LIBC)
+ {
+ datum = SysCacheGetAttrNotNull(COLLOID, colltp, Anum_pg_collation_collcollate);
+ locale = TextDatumGetCString(datum);
+ }
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
ReleaseSysCache(colltp);
}
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2e242eeff2..9e73f54803 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -909,7 +909,9 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
{
char *locproviderstr = defGetString(dlocprovider);
- if (pg_strcasecmp(locproviderstr, "icu") == 0)
+ if (pg_strcasecmp(locproviderstr, "none") == 0)
+ dblocprovider = COLLPROVIDER_NONE;
+ else if (pg_strcasecmp(locproviderstr, "icu") == 0)
dblocprovider = COLLPROVIDER_ICU;
else if (pg_strcasecmp(locproviderstr, "libc") == 0)
dblocprovider = COLLPROVIDER_LIBC;
@@ -1177,9 +1179,17 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
*/
if (src_collversion && !dcollversion)
{
- char *actual_versionstr;
+ char *actual_versionstr;
+ char *locale;
- actual_versionstr = get_collation_actual_version(dblocprovider, dblocprovider == COLLPROVIDER_ICU ? dbiculocale : dbcollate);
+ if (dblocprovider == COLLPROVIDER_ICU)
+ locale = dbiculocale;
+ else if (dblocprovider == COLLPROVIDER_LIBC)
+ locale = dbcollate;
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
+
+ actual_versionstr = get_collation_actual_version(dblocprovider, locale);
if (!actual_versionstr)
ereport(ERROR,
(errmsg("template database \"%s\" has a collation version, but no actual collation version could be determined",
@@ -1207,7 +1217,18 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
* collation version, which is normally only the case for template0.
*/
if (dbcollversion == NULL)
- dbcollversion = get_collation_actual_version(dblocprovider, dblocprovider == COLLPROVIDER_ICU ? dbiculocale : dbcollate);
+ {
+ char *locale;
+
+ if (dblocprovider == COLLPROVIDER_ICU)
+ locale = dbiculocale;
+ else if (dblocprovider == COLLPROVIDER_LIBC)
+ locale = dbcollate;
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
+
+ dbcollversion = get_collation_actual_version(dblocprovider, locale);
+ }
/* Resolve default tablespace for new database */
if (dtablespacename && dtablespacename->arg)
@@ -2403,6 +2424,7 @@ AlterDatabaseRefreshColl(AlterDatabaseRefreshCollStmt *stmt)
ObjectAddress address;
Datum datum;
bool isnull;
+ char *locale;
char *oldversion;
char *newversion;
@@ -2429,10 +2451,24 @@ AlterDatabaseRefreshColl(AlterDatabaseRefreshCollStmt *stmt)
datum = heap_getattr(tuple, Anum_pg_database_datcollversion, RelationGetDescr(rel), &isnull);
oldversion = isnull ? NULL : TextDatumGetCString(datum);
- datum = heap_getattr(tuple, datForm->datlocprovider == COLLPROVIDER_ICU ? Anum_pg_database_daticulocale : Anum_pg_database_datcollate, RelationGetDescr(rel), &isnull);
- if (isnull)
- elog(ERROR, "unexpected null in pg_database");
- newversion = get_collation_actual_version(datForm->datlocprovider, TextDatumGetCString(datum));
+ if (datForm->datlocprovider == COLLPROVIDER_ICU)
+ {
+ datum = heap_getattr(tuple, Anum_pg_database_daticulocale, RelationGetDescr(rel), &isnull);
+ if (isnull)
+ elog(ERROR, "unexpected null in pg_database");
+ locale = TextDatumGetCString(datum);
+ }
+ else if (datForm->datlocprovider == COLLPROVIDER_LIBC)
+ {
+ datum = heap_getattr(tuple, Anum_pg_database_datcollate, RelationGetDescr(rel), &isnull);
+ if (isnull)
+ elog(ERROR, "unexpected null in pg_database");
+ locale = TextDatumGetCString(datum);
+ }
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
+
+ newversion = get_collation_actual_version(datForm->datlocprovider, locale);
/* cannot change from NULL to non-NULL or vice versa */
if ((!oldversion && newversion) || (oldversion && !newversion))
@@ -2617,6 +2653,7 @@ pg_database_collation_actual_version(PG_FUNCTION_ARGS)
HeapTuple tp;
char datlocprovider;
Datum datum;
+ char *locale;
char *version;
tp = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(dbid));
@@ -2627,8 +2664,20 @@ pg_database_collation_actual_version(PG_FUNCTION_ARGS)
datlocprovider = ((Form_pg_database) GETSTRUCT(tp))->datlocprovider;
- datum = SysCacheGetAttrNotNull(DATABASEOID, tp, datlocprovider == COLLPROVIDER_ICU ? Anum_pg_database_daticulocale : Anum_pg_database_datcollate);
- version = get_collation_actual_version(datlocprovider, TextDatumGetCString(datum));
+ if (datlocprovider == COLLPROVIDER_ICU)
+ {
+ datum = SysCacheGetAttrNotNull(DATABASEOID, tp, Anum_pg_database_daticulocale);
+ locale = TextDatumGetCString(datum);
+ }
+ else if (datlocprovider == COLLPROVIDER_LIBC)
+ {
+ datum = SysCacheGetAttrNotNull(DATABASEOID, tp, Anum_pg_database_datcollate);
+ locale = TextDatumGetCString(datum);
+ }
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
+
+ version = get_collation_actual_version(datlocprovider, locale);
ReleaseSysCache(tp);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index bb4a8d84f6..5ac5036f05 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1228,7 +1228,12 @@ lookup_collation_cache(Oid collation, bool set_flags)
elog(ERROR, "cache lookup failed for collation %u", collation);
collform = (Form_pg_collation) GETSTRUCT(tp);
- if (collform->collprovider == COLLPROVIDER_LIBC)
+ if (collform->collprovider == COLLPROVIDER_NONE)
+ {
+ cache_entry->collate_is_c = true;
+ cache_entry->ctype_is_c = true;
+ }
+ else if (collform->collprovider == COLLPROVIDER_LIBC)
{
Datum datum;
const char *collcollate;
@@ -1281,6 +1286,9 @@ lc_collate_is_c(Oid collation)
static int result = -1;
char *localeptr;
+ if (default_locale.provider == COLLPROVIDER_NONE)
+ return true;
+
if (default_locale.provider == COLLPROVIDER_ICU)
return false;
@@ -1334,6 +1342,9 @@ lc_ctype_is_c(Oid collation)
static int result = -1;
char *localeptr;
+ if (default_locale.provider == COLLPROVIDER_NONE)
+ return true;
+
if (default_locale.provider == COLLPROVIDER_ICU)
return false;
@@ -1487,8 +1498,10 @@ pg_newlocale_from_collation(Oid collid)
{
if (default_locale.provider == COLLPROVIDER_ICU)
return &default_locale;
- else
+ else if (default_locale.provider == COLLPROVIDER_LIBC)
return (pg_locale_t) 0;
+ else
+ elog(ERROR, "cannot open collation with provider \"none\"");
}
cache_entry = lookup_collation_cache(collid, false);
@@ -1513,7 +1526,11 @@ pg_newlocale_from_collation(Oid collid)
result.provider = collform->collprovider;
result.deterministic = collform->collisdeterministic;
- if (collform->collprovider == COLLPROVIDER_LIBC)
+ if (collform->collprovider == COLLPROVIDER_NONE)
+ {
+ elog(ERROR, "cannot open collation with provider \"none\"");
+ }
+ else if (collform->collprovider == COLLPROVIDER_LIBC)
{
#ifdef HAVE_LOCALE_T
const char *collcollate;
@@ -1599,6 +1616,7 @@ pg_newlocale_from_collation(Oid collid)
collversionstr = TextDatumGetCString(datum);
+ Assert(collform->collprovider != COLLPROVIDER_NONE);
datum = SysCacheGetAttrNotNull(COLLOID, tp, collform->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate);
actual_versionstr = get_collation_actual_version(collform->collprovider,
@@ -1650,6 +1668,9 @@ get_collation_actual_version(char collprovider, const char *collcollate)
{
char *collversion = NULL;
+ if (collprovider == COLLPROVIDER_NONE)
+ return NULL;
+
#ifdef USE_ICU
if (collprovider == COLLPROVIDER_ICU)
{
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 53420f4974..8053642fd3 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -461,10 +461,18 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
{
char *actual_versionstr;
char *collversionstr;
+ char *locale;
collversionstr = TextDatumGetCString(datum);
- actual_versionstr = get_collation_actual_version(dbform->datlocprovider, dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
+ if (dbform->datlocprovider == COLLPROVIDER_ICU)
+ locale = iculocale;
+ else if (dbform->datlocprovider == COLLPROVIDER_LIBC)
+ locale = collate;
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
+
+ actual_versionstr = get_collation_actual_version(dbform->datlocprovider, locale);
if (!actual_versionstr)
/* should not happen */
elog(WARNING,
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 2b5cc30955..4cf6892bee 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2469,6 +2469,22 @@ setlocales(void)
/* set empty lc_* values to locale config if set */
+ if (locale_provider == COLLPROVIDER_NONE)
+ {
+ if (!lc_ctype)
+ lc_ctype = "C";
+ if (!lc_collate)
+ lc_collate = "C";
+ if (!lc_numeric)
+ lc_numeric = "C";
+ if (!lc_time)
+ lc_time = "C";
+ if (!lc_monetary)
+ lc_monetary = "C";
+ if (!lc_messages)
+ lc_messages = "C";
+ }
+
if (locale)
{
if (!lc_ctype)
@@ -2563,7 +2579,7 @@ usage(const char *progname)
" set default locale in the respective category for\n"
" new databases (default taken from environment)\n"));
printf(_(" --no-locale equivalent to --locale=C\n"));
- printf(_(" --locale-provider={libc|icu}\n"
+ printf(_(" --locale-provider={none|libc|icu}\n"
" set default locale provider for new databases\n"));
printf(_(" --pwfile=FILE read password for the new superuser from file\n"));
printf(_(" -T, --text-search-config=CFG\n"
@@ -2713,7 +2729,15 @@ setup_locale_encoding(void)
{
setlocales();
- if (locale_provider == COLLPROVIDER_LIBC &&
+ if (locale_provider == COLLPROVIDER_NONE &&
+ strcmp(lc_ctype, "C") == 0 &&
+ strcmp(lc_collate, "C") == 0 &&
+ strcmp(lc_time, "C") == 0 &&
+ strcmp(lc_numeric, "C") == 0 &&
+ strcmp(lc_monetary, "C") == 0 &&
+ strcmp(lc_messages, "C") == 0)
+ printf(_("The database cluster will be initialized with no locale.\n"));
+ else if (locale_provider == COLLPROVIDER_LIBC &&
strcmp(lc_ctype, lc_collate) == 0 &&
strcmp(lc_ctype, lc_time) == 0 &&
strcmp(lc_ctype, lc_numeric) == 0 &&
@@ -3387,7 +3411,9 @@ main(int argc, char *argv[])
"-c debug_discard_caches=1");
break;
case 15:
- if (strcmp(optarg, "icu") == 0)
+ if (strcmp(optarg, "none") == 0)
+ locale_provider = COLLPROVIDER_NONE;
+ else if (strcmp(optarg, "icu") == 0)
locale_provider = COLLPROVIDER_ICU;
else if (strcmp(optarg, "libc") == 0)
locale_provider = COLLPROVIDER_LIBC;
@@ -3426,6 +3452,7 @@ main(int argc, char *argv[])
exit(1);
}
+
if (icu_locale && locale_provider != COLLPROVIDER_ICU)
pg_fatal("%s cannot be specified unless locale provider \"%s\" is chosen",
"--icu-locale", "icu");
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 17a444d80c..fe6d224e5b 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -154,6 +154,35 @@ else
'locale provider ICU fails since no ICU support');
}
+command_ok(
+ [ 'initdb', '--no-sync', '--locale-provider=none', "$tempdir/data6" ],
+ 'locale provider none');
+
+command_ok(
+ [ 'initdb', '--no-sync', '--locale-provider=none', '--locale=C',
+ "$tempdir/data7" ],
+ 'locale provider none with --locale');
+
+command_ok(
+ [ 'initdb', '--no-sync', '--locale-provider=none', '--lc-collate=C',
+ "$tempdir/data8" ],
+ 'locale provider none with --lc-collate');
+
+command_ok(
+ [ 'initdb', '--no-sync', '--locale-provider=none', '--lc-ctype=C',
+ "$tempdir/data9" ],
+ 'locale provider none with --lc-ctype');
+
+command_fails(
+ [ 'initdb', '--no-sync', '--locale-provider=none', '--icu-locale=en',
+ "$tempdir/dataX" ],
+ 'fails for locale provider none with ICU locale');
+
+command_fails(
+ [ 'initdb', '--no-sync', '--locale-provider=none', '--icu-rules=""',
+ "$tempdir/dataX" ],
+ 'fails for locale provider none with ICU rules');
+
command_fails(
[ 'initdb', '--no-sync', '--locale-provider=xyz', "$tempdir/dataX" ],
'fails for invalid locale provider');
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 41a51ec5cd..be6580ab3c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3070,7 +3070,9 @@ dumpDatabase(Archive *fout)
}
appendPQExpBufferStr(creaQry, " LOCALE_PROVIDER = ");
- if (datlocprovider[0] == 'c')
+ if (datlocprovider[0] == 'n')
+ appendPQExpBufferStr(creaQry, "none");
+ else if (datlocprovider[0] == 'c')
appendPQExpBufferStr(creaQry, "libc");
else if (datlocprovider[0] == 'i')
appendPQExpBufferStr(creaQry, "icu");
@@ -13446,7 +13448,9 @@ dumpCollation(Archive *fout, const CollInfo *collinfo)
fmtQualifiedDumpable(collinfo));
appendPQExpBufferStr(q, "provider = ");
- if (collprovider[0] == 'c')
+ if (collprovider[0] == 'n')
+ appendPQExpBufferStr(q, "none");
+ else if (collprovider[0] == 'c')
appendPQExpBufferStr(q, "libc");
else if (collprovider[0] == 'i')
appendPQExpBufferStr(q, "icu");
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 4a7895a756..6d58f6103e 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -114,12 +114,20 @@ my $original_locale = "C";
my $original_iculocale = "";
my $provider_field = "'c' AS datlocprovider";
my $iculocale_field = "NULL AS daticulocale";
-if ($oldnode->pg_version >= 15 && $ENV{with_icu} eq 'yes')
+if ($oldnode->pg_version >= 15)
{
$provider_field = "datlocprovider";
$iculocale_field = "daticulocale";
- $original_provider = "i";
- $original_iculocale = "fr-CA";
+
+ if ($ENV{with_icu} eq 'yes')
+ {
+ $original_provider = "i";
+ $original_iculocale = "fr-CA";
+ }
+ else
+ {
+ $original_provider = "n";
+ }
}
my @initdb_params = @custom_opts;
@@ -131,6 +139,10 @@ if ($original_provider eq "i")
push @initdb_params, ('--locale-provider', 'icu');
push @initdb_params, ('--icu-locale', 'fr-CA');
}
+elsif ($original_provider eq "n")
+{
+ push @initdb_params, ('--locale-provider', 'none');
+}
$node_params{extra} = \@initdb_params;
$oldnode->init(%node_params);
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 058e41e749..16e726b784 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -932,7 +932,7 @@ listAllDbs(const char *pattern, bool verbose)
gettext_noop("Encoding"));
if (pset.sversion >= 150000)
appendPQExpBuffer(&buf,
- " CASE d.datlocprovider WHEN 'c' THEN 'libc' WHEN 'i' THEN 'icu' END AS \"%s\",\n",
+ " CASE d.datlocprovider WHEN 'n' THEN 'none' WHEN 'c' THEN 'libc' WHEN 'i' THEN 'icu' END AS \"%s\",\n",
gettext_noop("Locale Provider"));
else
appendPQExpBuffer(&buf,
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index b4205c4fa5..79367d933b 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -299,7 +299,7 @@ help(const char *progname)
printf(_(" --lc-ctype=LOCALE LC_CTYPE setting for the database\n"));
printf(_(" --icu-locale=LOCALE ICU locale setting for the database\n"));
printf(_(" --icu-rules=RULES ICU rules setting for the database\n"));
- printf(_(" --locale-provider={libc|icu}\n"
+ printf(_(" --locale-provider={none|libc|icu}\n"
" locale provider for the database's default collation\n"));
printf(_(" -O, --owner=OWNER database user to own the new database\n"));
printf(_(" -S, --strategy=STRATEGY database creation strategy wal_log or file_copy\n"));
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index af3b1492e3..5aa658b671 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -83,6 +83,35 @@ else
'create database with ICU fails since no ICU support');
}
+$node->command_ok(
+ [ 'createdb', '-T', 'template0', '--locale-provider=none', 'testnone1' ],
+ 'create database with provider "none"');
+
+$node->command_ok(
+ [ 'createdb', '-T', 'template0', '--locale-provider=none', '--locale=C',
+ 'testnone2' ],
+ 'create database with provider "none" and locale "C"');
+
+$node->command_ok(
+ [ 'createdb', '-T', 'template0', '--locale-provider=none', '--lc-collate=C',
+ 'testnone3' ],
+ 'create database with provider "none" and LC_COLLATE=C');
+
+$node->command_ok(
+ [ 'createdb', '-T', 'template0', '--locale-provider=none', '--lc-ctype=C',
+ 'testnone4' ],
+ 'create database with provider "none" and LC_CTYPE=C');
+
+$node->command_fails(
+ [ 'createdb', '-T', 'template0', '--locale-provider=none', '--icu-locale=en',
+ 'testnone5' ],
+ 'create database with provider "none" and ICU_LOCALE="en"');
+
+$node->command_fails(
+ [ 'createdb', '-T', 'template0', '--locale-provider=none', '--icu-rules=""',
+ 'testnone6' ],
+ 'create database with provider "none" and ICU_RULES=""');
+
$node->command_fails([ 'createdb', 'foobar1' ],
'fails if database already exists');
diff --git a/src/include/catalog/pg_collation.dat b/src/include/catalog/pg_collation.dat
index b6a69d1d42..40d62416ea 100644
--- a/src/include/catalog/pg_collation.dat
+++ b/src/include/catalog/pg_collation.dat
@@ -24,8 +24,7 @@
collname => 'POSIX', collprovider => 'c', collencoding => '-1',
collcollate => 'POSIX', collctype => 'POSIX' },
{ oid => '962', descr => 'sorts by Unicode code point',
- collname => 'ucs_basic', collprovider => 'c', collencoding => '6',
- collcollate => 'C', collctype => 'C' },
+ collname => 'ucs_basic', collprovider => 'n', collencoding => '6' },
{ oid => '963',
descr => 'sorts using the Unicode Collation Algorithm with default settings',
collname => 'unicode', collprovider => 'i', collencoding => '-1',
diff --git a/src/include/catalog/pg_collation.h b/src/include/catalog/pg_collation.h
index bfa3568451..29be3f8d94 100644
--- a/src/include/catalog/pg_collation.h
+++ b/src/include/catalog/pg_collation.h
@@ -64,6 +64,7 @@ DECLARE_UNIQUE_INDEX_PKEY(pg_collation_oid_index, 3085, CollationOidIndexId, on
#ifdef EXPOSE_TO_CLIENT_CODE
+#define COLLPROVIDER_NONE 'n'
#define COLLPROVIDER_DEFAULT 'd'
#define COLLPROVIDER_ICU 'i'
#define COLLPROVIDER_LIBC 'c'
@@ -73,6 +74,8 @@ collprovider_name(char c)
{
switch (c)
{
+ case COLLPROVIDER_NONE:
+ return "none";
case COLLPROVIDER_ICU:
return "icu";
case COLLPROVIDER_LIBC:
diff --git a/src/test/regress/expected/collate.out b/src/test/regress/expected/collate.out
index 0649564485..b7603c9f6c 100644
--- a/src/test/regress/expected/collate.out
+++ b/src/test/regress/expected/collate.out
@@ -650,6 +650,13 @@ EXPLAIN (COSTS OFF)
(3 rows)
-- CREATE/DROP COLLATION
+CREATE COLLATION none ( PROVIDER = none );
+CREATE COLLATION none2 ( PROVIDER = none, LOCALE="POSIX" ); -- fails
+ERROR: collation provider "none" does not support LOCALE, LC_COLLATE, or LC_CTYPE
+CREATE COLLATION none2 ( PROVIDER = none, LC_CTYPE="POSIX" ); -- fails
+ERROR: collation provider "none" does not support LOCALE, LC_COLLATE, or LC_CTYPE
+CREATE COLLATION none2 ( PROVIDER = none, LC_COLLATE="POSIX" ); -- fails
+ERROR: collation provider "none" does not support LOCALE, LC_COLLATE, or LC_CTYPE
CREATE COLLATION mycoll1 FROM "C";
CREATE COLLATION mycoll2 ( LC_COLLATE = "POSIX", LC_CTYPE = "POSIX" );
CREATE COLLATION mycoll3 FROM "default"; -- intentionally unsupported
@@ -754,7 +761,7 @@ DETAIL: FROM cannot be specified together with any other options.
-- must get rid of them.
--
DROP SCHEMA collate_tests CASCADE;
-NOTICE: drop cascades to 19 other objects
+NOTICE: drop cascades to 20 other objects
DETAIL: drop cascades to table collate_test1
drop cascades to table collate_test_like
drop cascades to table collate_test2
@@ -771,6 +778,7 @@ drop cascades to function dup(anyelement)
drop cascades to table collate_test20
drop cascades to table collate_test21
drop cascades to table collate_test22
+drop cascades to collation "none"
drop cascades to collation mycoll2
drop cascades to table collate_test23
drop cascades to view collate_on_int
diff --git a/src/test/regress/sql/collate.sql b/src/test/regress/sql/collate.sql
index c3d40fc195..e2dceb8dff 100644
--- a/src/test/regress/sql/collate.sql
+++ b/src/test/regress/sql/collate.sql
@@ -244,6 +244,12 @@ EXPLAIN (COSTS OFF)
-- CREATE/DROP COLLATION
+CREATE COLLATION none ( PROVIDER = none );
+
+CREATE COLLATION none2 ( PROVIDER = none, LOCALE="POSIX" ); -- fails
+CREATE COLLATION none2 ( PROVIDER = none, LC_CTYPE="POSIX" ); -- fails
+CREATE COLLATION none2 ( PROVIDER = none, LC_COLLATE="POSIX" ); -- fails
+
CREATE COLLATION mycoll1 FROM "C";
CREATE COLLATION mycoll2 ( LC_COLLATE = "POSIX", LC_CTYPE = "POSIX" );
CREATE COLLATION mycoll3 FROM "default"; -- intentionally unsupported
--
2.34.1
Attachment: v5-0005-ICU-for-locale-C-automatically-use-none-provider-.patch (text/x-patch; charset=UTF-8)
From 23e85920dbcfd1d3e71041f92c4adea589acd4f2 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 8 May 2023 13:48:01 -0700
Subject: [PATCH v5 5/7] ICU: for locale "C", automatically use "none" provider
instead.
Postgres expects locale C to be optimizable to simple locale-unaware
byte operations, whereas ICU does not recognize the locale "C" at all
and falls back to the root locale.
If the user specifies locale "C" when creating a new collation or a
new database with the ICU provider, automatically switch it to the
"none" provider.
If provider is libc, behavior is unchanged.
---
doc/src/sgml/charset.sgml | 6 +++
doc/src/sgml/ref/create_collation.sgml | 6 +++
doc/src/sgml/ref/create_database.sgml | 5 +++
doc/src/sgml/ref/createdb.sgml | 5 +++
doc/src/sgml/ref/initdb.sgml | 5 +++
src/backend/commands/collationcmds.c | 17 ++++++++
src/backend/commands/dbcommands.c | 21 ++++++++++
src/bin/initdb/initdb.c | 10 +++++
src/bin/initdb/t/001_initdb.pl | 39 +++++++++++++++++++
src/bin/scripts/createdb.c | 11 ++++++
src/bin/scripts/t/020_createdb.pl | 12 ++++++
.../regress/expected/collate.icu.utf8.out | 12 ++++--
src/test/regress/sql/collate.icu.utf8.sql | 3 ++
13 files changed, 149 insertions(+), 3 deletions(-)
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index de7c65ae35..5c4f713e8b 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -405,6 +405,12 @@ initdb --locale-provider=icu --icu-locale=en
change in results. <literal>LC_COLLATE</literal> and
<literal>LC_CTYPE</literal> can be set independently of the ICU locale.
</para>
+ <para>
+ The ICU provider does not accept the <literal>C</literal>
+ locale. Commands that create collations or databases with the
+ <literal>icu</literal> provider and ICU locale <literal>C</literal> use
+ the provider <literal>none</literal> instead.
+ </para>
<note>
<para>
For the ICU provider, results may depend on the version of the ICU
diff --git a/doc/src/sgml/ref/create_collation.sgml b/doc/src/sgml/ref/create_collation.sgml
index 5489ae7413..1ac41831d8 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -126,6 +126,12 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
<literal>libc</literal> is the default. See <xref
linkend="locale-providers"/> for details.
</para>
+ <para>
+ If the provider is <literal>icu</literal> and the locale is
+ <literal>C</literal> or <literal>POSIX</literal>, the provider is
+ automatically set to <literal>none</literal>, as the ICU provider
+ doesn't support an ICU locale of <literal>C</literal>.
+ </para>
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 60b9da0952..c730d02e15 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -190,6 +190,11 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
<para>
Specifies the ICU locale ID if the ICU locale provider is used.
</para>
+ <para>
+ If specified as <literal>C</literal> or <literal>POSIX</literal>, the
+ provider is automatically set to <literal>none</literal>, as the ICU
+ provider doesn't support an ICU locale of <literal>C</literal>.
+ </para>
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index 326a371d34..7c573e848a 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -154,6 +154,11 @@ PostgreSQL documentation
Specifies the ICU locale ID to be used in this database, if the
ICU locale provider is selected.
</para>
+ <para>
+ If specified as <literal>C</literal> or <literal>POSIX</literal>, the
+ provider is automatically set to <literal>none</literal>, as the ICU
+ provider doesn't support an ICU locale of <literal>C</literal>.
+ </para>
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index e604ab48b7..76993acdfe 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -250,6 +250,11 @@ PostgreSQL documentation
Specifies the ICU locale when the ICU provider is used. Locale support
is described in <xref linkend="locale"/>.
</para>
+ <para>
+ If specified as <literal>C</literal> or <literal>POSIX</literal>, the
+ provider is automatically set to <literal>none</literal>, as the ICU
+ provider doesn't support an ICU locale of <literal>C</literal>.
+ </para>
<para>
If this option is not specified, the locale is inherited from the
environment in which <command>initdb</command> runs. The environment's
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 267a551818..ed64e17504 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -254,6 +254,23 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
if (lcctypeEl)
collctype = defGetString(lcctypeEl);
+ /*
+ * Postgres defines the "C" (and equivalently, "POSIX") locales to be
+ * optimizable to byte operations (memcmp(), pg_ascii_tolower(),
+ * etc.); transform into the "none" provider. Don't transform during
+ * binary upgrade.
+ */
+ if (!IsBinaryUpgrade && collprovider == COLLPROVIDER_ICU &&
+ colliculocale && (pg_strcasecmp(colliculocale, "C") == 0 ||
+ pg_strcasecmp(colliculocale, "POSIX") == 0))
+ {
+ ereport(NOTICE,
+ (errmsg("using locale provider \"none\" for ICU locale \"%s\"",
+ colliculocale)));
+ colliculocale = NULL;
+ collprovider = COLLPROVIDER_NONE;
+ }
+
if (collprovider == COLLPROVIDER_LIBC)
{
if (!collcollate)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9e73f54803..6dc737aebb 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1043,6 +1043,27 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
check_encoding_locale_matches(encoding, dbcollate, dbctype);
+ /*
+ * Postgres defines the "C" (and equivalently, "POSIX") locales to be
+ * optimizable to byte operations (memcmp(), pg_ascii_tolower(), etc.);
+ * transform into the "none" provider.
+ *
+ * Don't transform during binary upgrade or when both the provider and ICU
+ * locale are unchanged from the template.
+ */
+ if (!IsBinaryUpgrade && dblocprovider == COLLPROVIDER_ICU &&
+ (src_locprovider != COLLPROVIDER_ICU ||
+ strcmp(dbiculocale, src_iculocale) != 0) &&
+ dbiculocale && (pg_strcasecmp(dbiculocale, "C") == 0 ||
+ pg_strcasecmp(dbiculocale, "POSIX") == 0))
+ {
+ ereport(NOTICE,
+ (errmsg("using locale provider \"none\" for ICU locale \"%s\"",
+ dbiculocale)));
+ dbiculocale = NULL;
+ dblocprovider = COLLPROVIDER_NONE;
+ }
+
if (dblocprovider == COLLPROVIDER_ICU)
{
if (!(is_encoding_supported_by_icu(encoding)))
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 4cf6892bee..ea26bf8361 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2501,6 +2501,16 @@ setlocales(void)
lc_messages = locale;
}
+ if (icu_locale && locale_provider == COLLPROVIDER_ICU &&
+ (pg_strcasecmp(icu_locale, "C") == 0 ||
+ pg_strcasecmp(icu_locale, "POSIX") == 0))
+ {
+ pg_log_info("using locale provider \"none\" for ICU locale \"%s\"",
+ icu_locale);
+ icu_locale = NULL;
+ locale_provider = COLLPROVIDER_NONE;
+ }
+
/*
* canonicalize locale names, and obtain any missing values from our
* current environment
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index fe6d224e5b..ea92b08511 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -111,6 +111,45 @@ if ($ENV{with_icu} eq 'yes')
],
'option --icu-locale');
+ # transformed to provider=none
+ command_ok(
+ [
+ 'initdb', '--no-sync',
+ '--locale-provider=icu', '--icu-locale=C',
+ "$tempdir/data4a"
+ ],
+ 'option --icu-locale=C');
+
+ # transformed to provider=none
+ command_ok(
+ [
+ 'initdb', '--no-sync',
+ '--locale-provider=icu', '--icu-locale=C',
+ '--locale=C',
+ "$tempdir/data4b"
+ ],
+ 'option --icu-locale=C --locale=C');
+
+ # transformed to provider=none
+ command_ok(
+ [
+ 'initdb', '--no-sync',
+ '--locale-provider=icu', '--icu-locale=C',
+ '--lc-collate=C',
+ "$tempdir/data4c"
+ ],
+ 'option --icu-locale=C --lc-collate=C');
+
+ # transformed to provider=none
+ command_ok(
+ [
+ 'initdb', '--no-sync',
+ '--locale-provider=icu', '--icu-locale=C',
+ '--lc-ctype=C',
+ "$tempdir/data4d"
+ ],
+ 'option --icu-locale=C --lc-ctype=C');
+
command_fails_like(
[
'initdb', '--no-sync',
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index 79367d933b..9caf9190cf 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -172,6 +172,17 @@ main(int argc, char *argv[])
lc_collate = locale;
}
+ if (locale_provider && pg_strcasecmp(locale_provider, "icu") == 0 &&
+ icu_locale &&
+ (pg_strcasecmp(icu_locale, "C") == 0 ||
+ pg_strcasecmp(icu_locale, "POSIX") == 0))
+ {
+ pg_log_info("using locale provider \"none\" for ICU locale \"%s\"",
+ icu_locale);
+ icu_locale = NULL;
+ locale_provider = "none";
+ }
+
if (encoding)
{
if (pg_char_to_encoding(encoding) < 0)
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 5aa658b671..eb3682f0fd 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -75,6 +75,18 @@ if ($ENV{with_icu} eq 'yes')
$node2->command_ok(
[ 'createdb', '-T', 'template0', '--icu-locale', 'en-US', 'foobar56' ],
'create database with icu locale from template database with icu provider');
+
+ # transformed into provider "none"
+ $node->command_ok(
+ [ 'createdb', '-T', 'template0', '--locale-provider=icu', '--icu-locale=C',
+ 'test_none_icu1' ],
+ 'create database with provider "icu" and ICU_LOCALE="C"');
+
+ # transformed into provider "none"
+ $node->command_ok(
+ [ 'createdb', '-T', 'template0', '--locale-provider=icu', '--icu-locale=C',
+ '--lc-ctype=C', 'test_none_icu_2' ],
+ 'create database with provider "icu" and ICU_LOCALE="C" and LC_CTYPE=C');
}
else
{
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 12afc3b65a..c0437231ad 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1035,6 +1035,9 @@ BEGIN
END
$$;
RESET client_min_messages;
+-- uses "none" provider instead
+CREATE COLLATION testc (provider = icu, locale='C');
+NOTICE: using locale provider "none" for ICU locale "C"
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
ERROR: parameter "locale" must be specified
SET icu_validation_level = ERROR;
@@ -1069,7 +1072,8 @@ SELECT collname FROM pg_collation WHERE collname LIKE 'test%' ORDER BY 1;
test0
test1
test5
-(3 rows)
+ testc
+(4 rows)
ALTER COLLATION test1 RENAME TO test11;
ALTER COLLATION test0 RENAME TO test11; -- fail
@@ -1090,7 +1094,8 @@ SELECT collname, nspname, obj_description(pg_collation.oid, 'pg_collation')
test0 | collate_tests | US English
test11 | test_schema |
test5 | collate_tests |
-(3 rows)
+ testc | collate_tests |
+(4 rows)
DROP COLLATION test0, test_schema.test11, test5;
DROP COLLATION test0; -- fail
@@ -1100,7 +1105,8 @@ NOTICE: collation "test0" does not exist, skipping
SELECT collname FROM pg_collation WHERE collname LIKE 'test%';
collname
----------
-(0 rows)
+ testc
+(1 row)
DROP SCHEMA test_schema;
DROP ROLE regress_test_role;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 655c965f46..63c29dfe2a 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -375,6 +375,9 @@ $$;
RESET client_min_messages;
+-- uses "none" provider instead
+CREATE COLLATION testc (provider = icu, locale='C');
+
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
SET icu_validation_level = ERROR;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
--
2.34.1
v5-0006-Make-LOCALE-apply-to-ICU_LOCALE-for-CREATE-DATABA.patch
From 79732b2f94d5097b5ceebd2a22fdbb692c780156 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 25 Apr 2023 15:01:55 -0700
Subject: [PATCH v5 6/7] Make LOCALE apply to ICU_LOCALE for CREATE DATABASE.
LOCALE is now an alias for LC_COLLATE, LC_CTYPE, and (if the provider
is ICU) ICU_LOCALE. The ICU provider accepts more locale names than
libc (e.g. language tags and locale names containing collation
attributes), so in some cases LC_COLLATE, LC_CTYPE, and ICU_LOCALE
will still need to be specified separately.
Previously, LOCALE applied only to LC_COLLATE and LC_CTYPE (and
similarly for --locale in initdb and createdb). That could lead to
confusion when the provider is implicit, such as when it is inherited
from the template database, or when ICU was made default at initdb
time in commit 27b62377b4.
Reverts incomplete fix 5cd1a5af4d.
Discussion: https://postgr.es/m/3391932.1682107209@sss.pgh.pa.us
---
doc/src/sgml/ref/create_database.sgml | 6 ++--
doc/src/sgml/ref/createdb.sgml | 5 +++-
doc/src/sgml/ref/initdb.sgml | 7 +++--
src/backend/commands/collationcmds.c | 2 +-
src/backend/commands/dbcommands.c | 15 +++++++---
src/bin/initdb/initdb.c | 11 ++++++--
src/bin/scripts/createdb.c | 13 ++++-----
src/bin/scripts/t/020_createdb.pl | 4 +--
src/test/icu/t/010_database.pl | 23 +++++++++------
.../regress/expected/collate.icu.utf8.out | 28 +++++++++----------
10 files changed, 68 insertions(+), 46 deletions(-)
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index c730d02e15..dc57ba0c8b 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -145,8 +145,10 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
<term><replaceable class="parameter">locale</replaceable></term>
<listitem>
<para>
- This is a shortcut for setting <symbol>LC_COLLATE</symbol>
- and <symbol>LC_CTYPE</symbol> at once.
+ This is a shortcut for setting <symbol>LC_COLLATE</symbol>,
+ <symbol>LC_CTYPE</symbol> and <symbol>ICU_LOCALE</symbol> at
+ once. Some locales are only valid for ICU, and must be set separately
+ with <symbol>ICU_LOCALE</symbol>.
</para>
<tip>
<para>
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index 7c573e848a..7991153ecc 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -124,7 +124,10 @@ PostgreSQL documentation
<listitem>
<para>
Specifies the locale to be used in this database. This is equivalent
- to specifying both <option>--lc-collate</option> and <option>--lc-ctype</option>.
+ to specifying <option>--lc-collate</option>,
+ <option>--lc-ctype</option>, and <option>--icu-locale</option> to the
+ same value. Some locales are only valid for ICU and must be set with
+ <option>--icu-locale</option>.
</para>
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 76993acdfe..d9ef21c422 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -116,9 +116,10 @@ PostgreSQL documentation
<para>
To choose a different locale for the cluster, use the option
<option>--locale</option>. There are also individual options
- <option>--lc-*</option> (see below) to set values for the individual locale
- categories. Note that inconsistent settings for different locale
- categories can give nonsensical results, so this should be used with care.
+ <option>--lc-*</option> and <option>--icu-locale</option> (see below) to
+ set values for the individual locale categories. Note that inconsistent
+ settings for different locale categories can give nonsensical results, so
+ this should be used with care.
</para>
<para>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index ed64e17504..9a83f9f303 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -302,7 +302,7 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
if (langtag && strcmp(colliculocale, langtag) != 0)
{
ereport(NOTICE,
- (errmsg("using standard form \"%s\" for locale \"%s\"",
+ (errmsg("using standard form \"%s\" for ICU locale \"%s\"",
langtag, colliculocale)));
colliculocale = langtag;
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 6dc737aebb..154f20573c 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1019,7 +1019,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
if (dblocprovider == '\0')
dblocprovider = src_locprovider;
if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU)
- dbiculocale = src_iculocale;
+ {
+ if (dlocale && dlocale->arg)
+ dbiculocale = defGetString(dlocale);
+ else
+ dbiculocale = src_iculocale;
+ }
if (dbicurules == NULL && dblocprovider == COLLPROVIDER_ICU)
dbicurules = src_icurules;
@@ -1033,12 +1038,14 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
if (!check_locale(LC_COLLATE, dbcollate, &canonname))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("invalid locale name: \"%s\"", dbcollate)));
+ errmsg("invalid LC_COLLATE locale name: \"%s\"", dbcollate),
+ errhint("If the locale name is specific to ICU, use ICU_LOCALE.")));
dbcollate = canonname;
if (!check_locale(LC_CTYPE, dbctype, &canonname))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("invalid locale name: \"%s\"", dbctype)));
+ errmsg("invalid LC_CTYPE locale name: \"%s\"", dbctype),
+ errhint("If the locale name is specific to ICU, use ICU_LOCALE.")));
dbctype = canonname;
check_encoding_locale_matches(encoding, dbcollate, dbctype);
@@ -1094,7 +1101,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
if (langtag && strcmp(dbiculocale, langtag) != 0)
{
ereport(NOTICE,
- (errmsg("using standard form \"%s\" for locale \"%s\"",
+ (errmsg("using standard form \"%s\" for ICU locale \"%s\"",
langtag, dbiculocale)));
dbiculocale = langtag;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ea26bf8361..ccb2414fed 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2157,7 +2157,11 @@ check_locale_name(int category, const char *locale, char **canonname)
if (res == NULL)
{
if (*locale)
- pg_fatal("invalid locale name \"%s\"", locale);
+ {
+ pg_log_error("invalid locale name \"%s\"", locale);
+ pg_log_error_hint("If the locale name is specific to ICU, use --icu-locale.");
+ exit(1);
+ }
else
{
/*
@@ -2467,7 +2471,7 @@ setlocales(void)
{
char *canonname;
- /* set empty lc_* values to locale config if set */
+ /* set empty lc_* and iculocale values to locale config if set */
if (locale_provider == COLLPROVIDER_NONE)
{
@@ -2499,6 +2503,8 @@ setlocales(void)
lc_monetary = locale;
if (!lc_messages)
lc_messages = locale;
+ if (!icu_locale && locale_provider == COLLPROVIDER_ICU)
+ icu_locale = locale;
}
if (icu_locale && locale_provider == COLLPROVIDER_ICU &&
@@ -3392,7 +3398,6 @@ main(int argc, char *argv[])
break;
case 8:
locale = "C";
- locale_provider = COLLPROVIDER_LIBC;
break;
case 9:
pwfilename = pg_strdup(optarg);
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index 9caf9190cf..51c4bb3592 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -164,14 +164,6 @@ main(int argc, char *argv[])
exit(1);
}
- if (locale)
- {
- if (!lc_ctype)
- lc_ctype = locale;
- if (!lc_collate)
- lc_collate = locale;
- }
-
if (locale_provider && pg_strcasecmp(locale_provider, "icu") == 0 &&
icu_locale &&
(pg_strcasecmp(icu_locale, "C") == 0 ||
@@ -230,6 +222,11 @@ main(int argc, char *argv[])
appendPQExpBuffer(&sql, " STRATEGY %s", fmtId(strategy));
if (template)
appendPQExpBuffer(&sql, " TEMPLATE %s", fmtId(template));
+ if (locale)
+ {
+ appendPQExpBufferStr(&sql, " LOCALE ");
+ appendStringLiteralConn(&sql, locale, conn);
+ }
if (lc_collate)
{
appendPQExpBufferStr(&sql, " LC_COLLATE ");
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index eb3682f0fd..81a9931c09 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -167,7 +167,7 @@ $node->command_checks_all(
1,
[qr/^$/],
[
- qr/^createdb: error: database creation failed: ERROR: invalid locale name|^createdb: error: database creation failed: ERROR: new collation \(foo'; SELECT '1\) is incompatible with the collation of the template database/s
+ qr/^createdb: error: database creation failed: ERROR: invalid LC_COLLATE locale name|^createdb: error: database creation failed: ERROR: new collation \(foo'; SELECT '1\) is incompatible with the collation of the template database/s
],
'createdb with incorrect --lc-collate');
$node->command_checks_all(
@@ -175,7 +175,7 @@ $node->command_checks_all(
1,
[qr/^$/],
[
- qr/^createdb: error: database creation failed: ERROR: invalid locale name|^createdb: error: database creation failed: ERROR: new LC_CTYPE \(foo'; SELECT '1\) is incompatible with the LC_CTYPE of the template database/s
+ qr/^createdb: error: database creation failed: ERROR: invalid LC_CTYPE locale name|^createdb: error: database creation failed: ERROR: new LC_CTYPE \(foo'; SELECT '1\) is incompatible with the LC_CTYPE of the template database/s
],
'createdb with incorrect --lc-ctype');
diff --git a/src/test/icu/t/010_database.pl b/src/test/icu/t/010_database.pl
index 715b1bffd6..df4af00afe 100644
--- a/src/test/icu/t/010_database.pl
+++ b/src/test/icu/t/010_database.pl
@@ -51,16 +51,23 @@ b),
'sort by explicit collation upper first');
-# Test error cases in CREATE DATABASE involving locale-related options
+# Test that LOCALE='C' works for ICU
-my ($ret, $stdout, $stderr) = $node1->psql('postgres',
- q{CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'C' TEMPLATE template0 ENCODING UTF8});
-isnt($ret, 0,
- "ICU locale must be specified for ICU provider: exit code not 0");
+my $ret1 = $node1->psql('postgres',
+ q{CREATE DATABASE dbicu2 LOCALE_PROVIDER icu LOCALE 'C' TEMPLATE template0 ENCODING UTF8});
+is($ret1, 0,
+ "C locale works for ICU");
+
+# Test that ICU-specific locale string must be specified with ICU_LOCALE,
+# not LOCALE
+
+my ($ret2, $stdout, $stderr) = $node1->psql('postgres',
+ q{CREATE DATABASE dbicu3 LOCALE_PROVIDER icu LOCALE '@colStrength=primary' TEMPLATE template0 ENCODING UTF8});
+isnt($ret2, 0,
+ "ICU-specific locale must be specified with ICU_LOCALE: exit code not 0");
like(
$stderr,
- qr/ERROR: ICU locale must be specified/,
- "ICU locale must be specified for ICU provider: error message");
-
+ qr/ERROR: invalid LC_COLLATE locale name/,
+ "ICU-specific locale must be specified with ICU_LOCALE: error message");
done_testing();
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index c0437231ad..39f61ca281 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1058,11 +1058,11 @@ CREATE COLLATION testx (provider = icu, locale = '@ASDF'); DROP COLLATION testx;
WARNING: could not convert locale name "@ASDF" to language tag: U_ILLEGAL_ARGUMENT_ERROR
-- test special variants
CREATE COLLATION testx (provider = icu, locale = '@EURO'); DROP COLLATION testx;
-NOTICE: using standard form "und-u-cu-eur" for locale "@EURO"
+NOTICE: using standard form "und-u-cu-eur" for ICU locale "@EURO"
CREATE COLLATION testx (provider = icu, locale = '@pinyin'); DROP COLLATION testx;
-NOTICE: using standard form "und-u-co-pinyin" for locale "@pinyin"
+NOTICE: using standard form "und-u-co-pinyin" for ICU locale "@pinyin"
CREATE COLLATION testx (provider = icu, locale = '@stroke'); DROP COLLATION testx;
-NOTICE: using standard form "und-u-co-stroke" for locale "@stroke"
+NOTICE: using standard form "und-u-co-stroke" for ICU locale "@stroke"
CREATE COLLATION test4 FROM nonsense;
ERROR: collation "nonsense" for encoding "UTF8" does not exist
CREATE COLLATION test5 FROM test0;
@@ -1211,9 +1211,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
(1 row)
CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
-NOTICE: using standard form "und-u-kf-lower" for locale "@colCaseFirst=lower"
+NOTICE: using standard form "und-u-kf-lower" for ICU locale "@colCaseFirst=lower"
CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
-NOTICE: using standard form "und-u-kf-upper" for locale "@colCaseFirst=upper"
+NOTICE: using standard form "und-u-kf-upper" for ICU locale "@colCaseFirst=upper"
SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
?column? | ?column?
----------+----------
@@ -1221,7 +1221,7 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
(1 row)
CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
-NOTICE: using standard form "und-u-ka-shifted" for locale "@colAlternate=shifted"
+NOTICE: using standard form "und-u-ka-shifted" for ICU locale "@colAlternate=shifted"
SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
?column? | ?column?
----------+----------
@@ -1238,12 +1238,12 @@ SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_n
(1 row)
CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
-NOTICE: using standard form "und-u-kn-lower" for locale "@colNumeric=lower"
+NOTICE: using standard form "und-u-kn-lower" for ICU locale "@colNumeric=lower"
ERROR: could not open collator for locale "und-u-kn-lower": U_ILLEGAL_ARGUMENT_ERROR
-- test that attributes not handled by icu_set_collation_attributes()
-- (handled by ucol_open() directly) also work
CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook');
-NOTICE: using standard form "de-u-co-phonebk" for locale "de@collation=phonebook"
+NOTICE: using standard form "de-u-co-phonebk" for ICU locale "de@collation=phonebook"
SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook;
?column? | ?column?
----------+----------
@@ -1252,7 +1252,7 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes
-- rules
CREATE COLLATION testcoll_rules1 (provider = icu, locale = '', rules = '&a < g');
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE TABLE test7 (a text);
-- example from https://unicode-org.github.io/icu/userguide/collation/customization/#syntax
INSERT INTO test7 VALUES ('Abernathy'), ('apple'), ('bird'), ('Boston'), ('Graham'), ('green');
@@ -1280,13 +1280,13 @@ SELECT * FROM test7 ORDER BY a COLLATE testcoll_rules1;
DROP TABLE test7;
CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!!');
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
ERROR: could not open collator for locale "und" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
-- nondeterministic collations
CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE TABLE test6 (a int, b text);
-- same string in different normal forms
INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1336,9 +1336,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet;
(2 rows)
CREATE COLLATION case_sensitive (provider = icu, locale = '');
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false);
-NOTICE: using standard form "und-u-ks-level2" for locale "@colStrength=secondary"
+NOTICE: using standard form "und-u-ks-level2" for ICU locale "@colStrength=secondary"
SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
?column? | ?column?
----------+----------
--
2.34.1
v5-0007-Add-default_collation_provider-GUC.patch
From 3ca8e0a84f6593ffff9a409bd31dc1c9ed253d3a Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 11 May 2023 12:54:31 -0700
Subject: [PATCH v5 7/7] Add default_collation_provider GUC.
Controls default collation provider for CREATE COLLATION. Does not
affect CREATE DATABASE, which gets its default from the template
database.
---
doc/src/sgml/config.sgml | 17 +++++++++++++++++
src/backend/commands/collationcmds.c | 3 ++-
src/backend/utils/misc/guc_tables.c | 18 ++++++++++++++++++
src/backend/utils/misc/postgresql.conf.sample | 4 ++++
src/include/commands/collationcmds.h | 2 ++
src/test/regress/expected/collate.icu.utf8.out | 17 +++++++++++++++++
src/test/regress/sql/collate.icu.utf8.sql | 10 ++++++++++
7 files changed, 70 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c4a9dcb9ae..038ecf9811 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9819,6 +9819,23 @@ SET XML OPTION { DOCUMENT | CONTENT };
</listitem>
</varlistentry>
+ <varlistentry id="guc-default-collation-provider" xreflabel="default_collation_provider">
+ <term><varname>default_collation_provider</varname> (<type>enum</type>)
+ <indexterm>
+ <primary><varname>default_collation_provider</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Default collation provider for <command>CREATE
+ COLLATION</command>. Does not affect <command>CREATE
+ DATABASE</command>, which gets the default collation provider from the
+ template database. Valid values are <literal>icu</literal> and
+ <literal>libc</literal>. The default is <literal>libc</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-icu-validation-level" xreflabel="icu_validation_level">
<term><varname>icu_validation_level</varname> (<type>enum</type>)
<indexterm>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 9a83f9f303..b42a660386 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -47,6 +47,7 @@ typedef struct
int enc; /* encoding */
} CollAliasData;
+int default_collation_provider = (int) COLLPROVIDER_LIBC;
/*
* CREATE COLLATION
@@ -228,7 +229,7 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
collproviderstr)));
}
else
- collprovider = COLLPROVIDER_LIBC;
+ collprovider = (char) default_collation_provider;
if (collprovider == COLLPROVIDER_NONE
&& (localeEl || lccollateEl || lcctypeEl))
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8c843f4ab6..d64b3a9a6f 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -35,8 +35,10 @@
#include "access/xlogrecovery.h"
#include "archive/archive_module.h"
#include "catalog/namespace.h"
+#include "catalog/pg_collation.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/collationcmds.h"
#include "commands/tablespace.h"
#include "commands/trigger.h"
#include "commands/user.h"
@@ -166,6 +168,12 @@ static const struct config_enum_entry intervalstyle_options[] = {
{NULL, 0, false}
};
+static const struct config_enum_entry collation_provider_options[] = {
+ {"icu", (int) 'i', false},
+ {"libc", (int) 'c', false},
+ {NULL, 0, false}
+};
+
static const struct config_enum_entry icu_validation_level_options[] = {
{"disabled", -1, false},
{"debug5", DEBUG5, false},
@@ -4683,6 +4691,16 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"default_collation_provider", PGC_USERSET, CLIENT_CONN_LOCALE,
+ gettext_noop("Default collation provider for CREATE COLLATION."),
+ NULL
+ },
+ &default_collation_provider,
+ (int) COLLPROVIDER_LIBC, collation_provider_options,
+ NULL, NULL, NULL
+ },
+
{
{"icu_validation_level", PGC_USERSET, CLIENT_CONN_LOCALE,
gettext_noop("Log level for reporting invalid ICU locale strings."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 87bad8ecbf..b2b015b31f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -734,6 +734,10 @@
#lc_numeric = 'C' # locale for number formatting
#lc_time = 'C' # locale for time formatting
+#default_collation_provider = 'libc' # default collation provider
+ # for CREATE COLLATION
+ # (none, icu, libc)
+
#icu_validation_level = WARNING # report ICU locale validation
# errors at the given level
diff --git a/src/include/commands/collationcmds.h b/src/include/commands/collationcmds.h
index b76c7b3dc3..f54389525d 100644
--- a/src/include/commands/collationcmds.h
+++ b/src/include/commands/collationcmds.h
@@ -18,6 +18,8 @@
#include "catalog/objectaddress.h"
#include "parser/parse_node.h"
+extern PGDLLIMPORT int default_collation_provider;
+
extern ObjectAddress DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_exists);
extern void IsThereCollationInNamespace(const char *collname, Oid nspOid);
extern ObjectAddress AlterCollation(AlterCollationStmt *stmt);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 39f61ca281..d9da8d1310 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1038,6 +1038,23 @@ RESET client_min_messages;
-- uses "none" provider instead
CREATE COLLATION testc (provider = icu, locale='C');
NOTICE: using locale provider "none" for ICU locale "C"
+SET default_collation_provider = 'libc';
+CREATE COLLATION def_libc (LOCALE = 'C');
+SELECT collname, collprovider FROM pg_collation WHERE collname='def_libc';
+ collname | collprovider
+----------+--------------
+ def_libc | c
+(1 row)
+
+SET default_collation_provider = 'icu';
+CREATE COLLATION def_icu (LOCALE = 'und');
+SELECT collname, collprovider FROM pg_collation WHERE collname='def_icu';
+ collname | collprovider
+----------+--------------
+ def_icu | i
+(1 row)
+
+RESET default_collation_provider;
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
ERROR: parameter "locale" must be specified
SET icu_validation_level = ERROR;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 63c29dfe2a..13089c7f8e 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -378,6 +378,16 @@ RESET client_min_messages;
-- uses "none" provider instead
CREATE COLLATION testc (provider = icu, locale='C');
+SET default_collation_provider = 'libc';
+CREATE COLLATION def_libc (LOCALE = 'C');
+SELECT collname, collprovider FROM pg_collation WHERE collname='def_libc';
+
+SET default_collation_provider = 'icu';
+CREATE COLLATION def_icu (LOCALE = 'und');
+SELECT collname, collprovider FROM pg_collation WHERE collname='def_icu';
+
+RESET default_collation_provider;
+
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
SET icu_validation_level = ERROR;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
--
2.34.1
Hello Jeff,
09.05.2023 00:59, Jeff Davis wrote:
The easiest thing to do is revert it for now, and after we sort out the
memcmp() path for the ICU provider, then I can commit it again (after
that point it would just be code cleanup and should have no functional
impact).
On the current master (after 455f948b0, and before f7faa9976, of course)
I get an ASAN-detected failure with the following query:
CREATE COLLATION col (provider = icu, locale = '123456789012');
==2929883==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffc491be09c at pc 0x556e8571a260 bp 0x7ffc491be020 sp 0x7ffc491bd7c8
READ of size 15 at 0x7ffc491be09c thread T0
#0 0x556e8571a25f in __interceptor_strcmp.part.0 (.../usr/local/pgsql/bin/postgres+0x2aa025f)
#1 0x556e86d77ee6 in icu_language_tag .../src/backend/utils/adt/pg_locale.c:2802
...
Address 0x7ffc491be09c is located in stack of thread T0 at offset 76 in frame
#0 0x556e86d77cfe in icu_language_tag .../src/backend/utils/adt/pg_locale.c:2782
This frame has 2 object(s):
[48, 52) 'status' (line 2784)
[64, 76) 'lang' (line 2785) <== Memory access at offset 76 overflows this variable
...
Here, uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status) returns
status = -124, i.e.,
U_STRING_NOT_TERMINATED_WARNING = -124,/**< An output string could not be NUL-terminated because output length==destCapacity. */
(ULOC_LANG_CAPACITY = 12)
This value is not caught by U_FAILURE(status), so the strcmp() that follows
reads past the end of the lang buffer.
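To make the failure mode concrete, here is a minimal, self-contained C sketch. It does not call ICU: the SK_* constants and sk_get_language() are hypothetical stand-ins that only mirror the value quoted above (-124 for the warning) and the ICU convention that failure codes are positive. Under those assumptions it shows why a failure-only check lets an unterminated buffer through.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins that mirror ICU's UErrorCode conventions; NOT the real ICU headers. */
enum
{
    SK_ZERO_ERROR = 0,
    SK_STRING_NOT_TERMINATED_WARNING = -124,   /* warnings are negative */
    SK_BUFFER_OVERFLOW_ERROR = 15              /* errors are positive */
};

/* Same shape as ICU's U_FAILURE(): only positive codes count as failures. */
#define SK_FAILURE(s) ((s) > SK_ZERO_ERROR)

/*
 * Hypothetical extractor: copies the leading part of loc (up to '_', '-',
 * '@' or end of string) into dst, which holds cap bytes. Returns the length
 * of that part. If the part fills dst exactly, no NUL is written and only a
 * warning is reported.
 */
static int
sk_get_language(const char *loc, char *dst, int cap, int *status)
{
    int len = 0;

    assert(cap > 0);
    while (loc[len] != '\0' && loc[len] != '_' &&
           loc[len] != '-' && loc[len] != '@')
        len++;

    if (len > cap)
    {
        memcpy(dst, loc, (size_t) cap);
        *status = SK_BUFFER_OVERFLOW_ERROR;
    }
    else if (len == cap)
    {
        memcpy(dst, loc, (size_t) cap);    /* no room left for the NUL */
        *status = SK_STRING_NOT_TERMINATED_WARNING;
    }
    else
    {
        memcpy(dst, loc, (size_t) len);
        dst[len] = '\0';
        *status = SK_ZERO_ERROR;
    }
    return len;
}

/* Returns 1 only when dst is a NUL-terminated string that is safe to strcmp(). */
static int
sk_result_usable(int status)
{
    return !SK_FAILURE(status) && status != SK_STRING_NOT_TERMINATED_WARNING;
}
```

With a 12-byte buffer and the 12-character input from the report, SK_FAILURE() is false even though the buffer has no terminator; only the extra comparison in sk_result_usable() rejects it.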
Best regards,
Alexander
On 11.05.23 23:29, Jeff Davis wrote:
New patch series attached.
=== 0001: fix bug that allows creating hidden collations
Bug:
/messages/by-id/051c9395cf880307865ee8b17acdbf7f838c1e39.camel@j-davis.com
This is still being debated in the other thread. Not really related to
this thread, so I suggest dropping it from this patch series.
=== 0002: handle some kinds of libc-style locale strings
ICU used to handle libc locale strings like 'fr_FR@euro', but doesn't
in later versions. Handle them in postgres for consistency.
I tend to agree with ICU that these variants are obsolete, and we don't
need to support them anymore. If this were a tiny patch, then maybe ok,
but the way it's presented here the whole code is duplicated between
pg_locale.c and initdb.c, which is not great.
=== 0003: reduce icu_validation_level to WARNING
Given that we've seen some inconsistency in which locale names are
accepted in different ICU versions, it seems best not to be too strict.
Peter Eisentraut suggested that it be set to ERROR originally, but a
WARNING should be sufficient to see problems without introducing risks
migrating to version 16.
I'm not sure why this is the conclusion. Presumably, the detection
capabilities of ICU improve over time, so we want to take advantage of
that? What are some example scenarios where this change would help?
=== 0004-0006:
To solve the issues that have come up in this thread, we need CREATE
DATABASE (and createdb and initdb) to use LOCALE to mean the collation
locale regardless of which provider is in use (which is what 0006
does).
0006 depends on ICU handling libc locale names. It already does a good
job for most libc locale names (though patch 0002 fixes a few cases
where it doesn't). There may be more cases, but for the most part libc
names are interpreted in a reasonable way. But one important case is
missing: ICU does not handle the "C" locale as we expect (that is,
using memcmp()).
We've already allowed users to create ICU collations with the C locale
in the past, which uses the root collation (not memcmp()), and we need
to keep supporting that for upgraded clusters.
I'm not sure I agree that we need to keep supporting that. The only way
you could get that in past releases is if you specify explicitly, "give
me provider ICU and locale C", and then it wouldn't actually even work
correctly. So nobody should be using that in practice, and nobody
should have stumbled into that combination of settings by accident.
3. Introduce collation provider "none", which is always memcmp-based
(patch 0004). It's equivalent to the libc locale=C, but it allows
specifying the LC_COLLATE and LC_CTYPE independently. A command like
CREATE DATABASE ... LOCALE_PROVIDER='icu' ICU_LOCALE='C'
LC_COLLATE='en_US' would get changed (with a NOTICE) to provider "none"
(patch 0005), so you'd have datlocprovider=none, datcollate=en_US. For
the database default collation, that would always use memcmp(), but the
server environment LC_COLLATE would be set to en_US as the user
specified.
This seems most attractive, but I think it's quite invasive at this
point, especially given the dubious premise (see above).
=== 0007: Add a GUC to control the default collation provider
Having a GUC would make it easier to migrate to ICU without surprises.
This only affects the default for CREATE COLLATION, not CREATE DATABASE
(and obviously not initdb).
It's not clear to me why we would want that. Also not clear why it
should only affect CREATE COLLATION.
On Sat, 2023-05-13 at 13:00 +0300, Alexander Lakhin wrote:
On the current master (after 455f948b0, and before f7faa9976, of
course)
I get an ASAN-detected failure with the following query:
CREATE COLLATION col (provider = icu, locale = '123456789012');
Thank you for the report!
ICU source specifically says:
/**
* Useful constant for the maximum size of the language part of a locale ID.
* (including the terminating NULL).
* @stable ICU 2.0
*/
#define ULOC_LANG_CAPACITY 12
So the fact that it is returning success without NUL-terminating the
result is an ICU bug in my opinion, and I reported it here:
https://unicode-org.atlassian.net/browse/ICU-22394
Unfortunately that means we need to be a bit more paranoid and always
check for that warning, even if we have a preallocated buffer of the
"correct" size. It also means that both U_STRING_NOT_TERMINATED_WARNING
and U_BUFFER_OVERFLOW_ERROR will be user-facing errors (potentially
scary), unless we check for those errors each time and report specific
errors for them.
Another approach is to always check the length and loop using dynamic
allocation so that we never overflow (and forget about any static
buffers). That seems like overkill given that the problem case is
obviously invalid input; I think as long as we catch it and throw an
ERROR it's fine. But I can do this if others think it's worthwhile.
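For comparison, the dynamic-allocation alternative mentioned above could look roughly like the following sketch. It is not taken from the attached patch and does not call ICU: sk_produce() is a hypothetical stand-in for an ICU-style function that reports the full length it needs, and the SK_* status values are assumptions that only mirror ICU's sign convention.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in status codes; NOT the real ICU headers. */
enum
{
    SK_OK = 0,
    SK_NOT_TERMINATED = -124,   /* output filled the buffer exactly */
    SK_OVERFLOW = 15            /* output did not fit at all */
};

/*
 * Hypothetical ICU-style producer: writes at most cap bytes of src into dst
 * and returns the full length required (excluding the NUL terminator).
 */
static int
sk_produce(const char *src, char *dst, int cap, int *status)
{
    int len = (int) strlen(src);

    if (len < cap)
    {
        memcpy(dst, src, (size_t) len + 1);    /* includes the NUL */
        *status = SK_OK;
    }
    else
    {
        memcpy(dst, src, (size_t) cap);        /* truncated, no NUL */
        *status = (len == cap) ? SK_NOT_TERMINATED : SK_OVERFLOW;
    }
    return len;
}

/*
 * Retry with a larger heap buffer until the result is NUL-terminated.
 * Returns a malloc'd string the caller must free, or NULL if out of memory.
 */
static char *
sk_produce_alloc(const char *src)
{
    int   cap = 4;              /* deliberately small initial guess */
    char *buf = malloc((size_t) cap);

    while (buf != NULL)
    {
        int   status = SK_OK;
        int   len = sk_produce(src, buf, cap, &status);
        char *bigger;

        if (status == SK_OK)
            return buf;

        /* Either status means "too small": grow to the reported length + NUL. */
        cap = len + 1;
        bigger = realloc(buf, (size_t) cap);
        if (bigger == NULL)
            free(buf);
        buf = bigger;
    }
    return NULL;
}
```

The loop treats the not-terminated warning and the overflow error identically, which is what removes the need for a fixed-size buffer. The cost is an allocation on every call, which is the overhead weighed against simply raising an ERROR for obviously invalid input.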
Patch attached. It just checks for the U_STRING_NOT_TERMINATED_WARNING
in a few places and turns it into an ERROR. It also cleans up the loop
for uloc_getLanguageTag() to check for U_STRING_NOT_TERMINATED_WARNING
rather than (U_SUCCESS(status) && len >= buflen).
--
Jeff Davis
PostgreSQL Contributor Team - AWS
Attachments:
0001-ICU-check-for-U_STRING_NOT_TERMINATED_WARNING.patch
From 9c8e9272ca807c9f75a7b32fa3190700cc600260 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 15 May 2023 13:35:07 -0700
Subject: [PATCH] ICU: check for U_STRING_NOT_TERMINATED_WARNING.
In some cases, ICU can fail to NUL-terminate a result string even if
using an appropriately-sized static buffer. The caller must either
check for len >= buflen or U_STRING_NOT_TERMINATED_WARNING.
The specific problem is related to uloc_getLanguage(), but add the
check in other cases as well.
Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/2098874d-c111-41e4-9063-30bcf135226b@gmail.com
---
src/backend/utils/adt/pg_locale.c | 29 +++++++++++------------------
src/bin/initdb/initdb.c | 15 ++++-----------
2 files changed, 15 insertions(+), 29 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index f0b6567da1..1cf93b2d20 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2468,7 +2468,7 @@ pg_ucol_open(const char *loc_str)
status = U_ZERO_ERROR;
uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
+ if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING)
{
ereport(ERROR,
(errmsg("could not get language from locale \"%s\": %s",
@@ -2504,7 +2504,7 @@ pg_ucol_open(const char *loc_str)
* Pretend the error came from ucol_open(), for consistent error
* message across ICU versions.
*/
- if (U_FAILURE(status))
+ if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING)
{
ucol_close(collator);
ereport(ERROR,
@@ -2639,7 +2639,8 @@ icu_from_uchar(char **result, const UChar *buff_uchar, int32_t len_uchar)
status = U_ZERO_ERROR;
len_result = ucnv_fromUChars(icu_converter, *result, len_result + 1,
buff_uchar, len_uchar, &status);
- if (U_FAILURE(status))
+ if (U_FAILURE(status) ||
+ status == U_STRING_NOT_TERMINATED_WARNING)
ereport(ERROR,
(errmsg("%s failed: %s", "ucnv_fromUChars",
u_errorName(status))));
@@ -2681,7 +2682,7 @@ icu_set_collation_attributes(UCollator *collator, const char *loc,
icu_locale_id = palloc(len + 1);
*status = U_ZERO_ERROR;
len = uloc_canonicalize(loc, icu_locale_id, len + 1, status);
- if (U_FAILURE(*status))
+ if (U_FAILURE(*status) || *status == U_STRING_NOT_TERMINATED_WARNING)
return;
lower_str = asc_tolower(icu_locale_id, strlen(icu_locale_id));
@@ -2765,7 +2766,6 @@ icu_set_collation_attributes(UCollator *collator, const char *loc,
pfree(lower_str);
}
-
#endif
/*
@@ -2789,7 +2789,7 @@ icu_language_tag(const char *loc_str, int elevel)
status = U_ZERO_ERROR;
uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
+ if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING)
{
if (elevel > 0)
ereport(elevel,
@@ -2811,19 +2811,12 @@ icu_language_tag(const char *loc_str, int elevel)
langtag = palloc(buflen);
while (true)
{
- int32_t len;
-
status = U_ZERO_ERROR;
- len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+ uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
- /*
- * If the result fits in the buffer exactly (len == buflen),
- * uloc_toLanguageTag() will return success without nul-terminating
- * the result. Check for either U_BUFFER_OVERFLOW_ERROR or len >=
- * buflen and try again.
- */
+ /* try again if the buffer is not large enough */
if ((status == U_BUFFER_OVERFLOW_ERROR ||
- (U_SUCCESS(status) && len >= buflen)) &&
+ status == U_STRING_NOT_TERMINATED_WARNING) &&
buflen < MaxAllocSize)
{
buflen = Min(buflen * 2, MaxAllocSize);
@@ -2878,7 +2871,7 @@ icu_validate_locale(const char *loc_str)
/* validate that we can extract the language */
status = U_ZERO_ERROR;
uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
+ if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING)
{
ereport(elevel,
(errmsg("could not get language from ICU locale \"%s\": %s",
@@ -2901,7 +2894,7 @@ icu_validate_locale(const char *loc_str)
status = U_ZERO_ERROR;
uloc_getLanguage(otherloc, otherlang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
+ if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING)
continue;
if (strcmp(lang, otherlang) == 0)
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index e03d498b1e..30b576932f 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2252,7 +2252,7 @@ icu_language_tag(const char *loc_str)
status = U_ZERO_ERROR;
uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status);
- if (U_FAILURE(status))
+ if (U_FAILURE(status) || status == U_STRING_NOT_TERMINATED_WARNING)
{
pg_fatal("could not get language from locale \"%s\": %s",
loc_str, u_errorName(status));
@@ -2272,19 +2272,12 @@ icu_language_tag(const char *loc_str)
langtag = pg_malloc(buflen);
while (true)
{
- int32_t len;
-
status = U_ZERO_ERROR;
- len = uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+ uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
- /*
- * If the result fits in the buffer exactly (len == buflen),
- * uloc_toLanguageTag() will return success without nul-terminating
- * the result. Check for either U_BUFFER_OVERFLOW_ERROR or len >=
- * buflen and try again.
- */
+ /* try again if the buffer is not large enough */
if (status == U_BUFFER_OVERFLOW_ERROR ||
- (U_SUCCESS(status) && len >= buflen))
+ status == U_STRING_NOT_TERMINATED_WARNING)
{
buflen = buflen * 2;
langtag = pg_realloc(langtag, buflen);
--
2.34.1
On Mon, 2023-05-08 at 14:59 -0700, Jeff Davis wrote:
The easiest thing to do is revert it for now, and after we sort out the
memcmp() path for the ICU provider, then I can commit it again (after
that point it would just be code cleanup and should have no functional
impact).
The conversion won't be entirely dead code even after we handle the "C"
locale with memcmp(): for a locale like "C.UTF-8", it will still be
passed to the collation provider (same as with libc), and in that case,
we should still convert that to a language tag consistently across ICU
versions.
For it to be entirely dead code, we would need to convert any locale
with language "C" (e.g. "C.UTF-8") to use the memcmp() path. I'm fine
with that, but that's not what the libc provider does today, and
perhaps we should be consistent between the two. If we do leave the
code in place, we can document that specific "en-US-u-va-posix" locale
so that it's not too surprising for users.
Regards,
Jeff Davis
Hi Jeff,
16.05.2023 00:03, Jeff Davis wrote:
On Sat, 2023-05-13 at 13:00 +0300, Alexander Lakhin wrote:
On the current master (after 455f948b0, and before f7faa9976, of course)
I get an ASAN-detected failure with the following query:
CREATE COLLATION col (provider = icu, locale = '123456789012');
Thank you for the report!
ICU source specifically says:
/**
* Useful constant for the maximum size of the language part of a locale ID.
* (including the terminating NULL).
* @stable ICU 2.0
*/
#define ULOC_LANG_CAPACITY 12
So the fact that it is returning success without nul-terminating the
result is an ICU bug in my opinion, and I reported it here:
https://unicode-org.atlassian.net/browse/ICU-22394
Unfortunately that means we need to be a bit more paranoid and always
check for that warning, even if we have a preallocated buffer of the
"correct" size. It also means that both U_STRING_NOT_TERMINATED_WARNING
and U_BUFFER_OVERFLOW_ERROR will be user-facing errors (potentially
scary), unless we check for those errors each time and report specific
errors for them.
Another approach is to always check the length and loop using dynamic
allocation so that we never overflow (and forget about any static
buffers). That seems like overkill given that the problem case is
obviously invalid input; I think as long as we catch it and throw an
ERROR it's fine. But I can do this if others think it's worthwhile.
Patch attached. It just checks for the U_STRING_NOT_TERMINATED_WARNING
in a few places and turns it into an ERROR. It also cleans up the loop
for uloc_toLanguageTag() to check for U_STRING_NOT_TERMINATED_WARNING
rather than (U_SUCCESS(status) && len >= buflen).
I'm not sure about the proposed change in icu_from_uchar(). It seems that
len_result + 1 bytes should always be enough for the result string terminated
with NUL. If that's not true (we want to protect from some ICU bug here),
then the change should be backpatched?
Best regards,
Alexander
On Tue, 2023-05-16 at 19:00 +0300, Alexander Lakhin wrote:
I'm not sure about the proposed change in icu_from_uchar(). It seems that
len_result + 1 bytes should always be enough for the result string terminated
with NUL. If that's not true (we want to protect from some ICU bug here),
then the change should be backpatched?
I believe it's enough and I'm not aware of any bug there so no backport
is required.
I added checks in places that were (a) checking for U_FAILURE; and (b)
expecting the result to be NUL-terminated. That's mostly callers of
uloc_getLanguage(), where I was not quite paranoid enough.
There were a couple other places though, and I went ahead and added
checks there out of paranoia, too. One was ucnv_fromUChars(), and the
other was uloc_canonicalize().
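To illustrate why a U_FAILURE-style check alone is not enough for a fixed-size buffer, here is a small self-contained sketch. It assumes a mock in place of ICU: demo_get_language(), demo_language_ok() and the DEMO_* names are hypothetical and the numeric codes are not ICU's. The one property borrowed from ICU is that warnings are negative and the failure test only looks for codes greater than zero, so a "not terminated" warning passes it.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_LANG_CAPACITY 12	/* same size as ULOC_LANG_CAPACITY quoted above */

/* Hypothetical status codes: warning < 0, success == 0, error > 0. */
typedef enum
{
	DEMO_STRING_NOT_TERMINATED_WARNING = -1,
	DEMO_ZERO_ERROR = 0,
	DEMO_BUFFER_OVERFLOW_ERROR = 1
} DemoStatus;

/* Same shape as U_FAILURE(): only positive codes count as failures. */
#define DEMO_FAILURE(s) ((s) > DEMO_ZERO_ERROR)

/*
 * Mock language extraction: the language is everything before the first
 * '-' or '_'. An exact fit leaves the buffer without a terminating NUL.
 */
static int
demo_get_language(const char *loc, char *lang, int capacity,
				  DemoStatus *status)
{
	int			len = (int) strcspn(loc, "-_");
	int			ncopy = (len < capacity) ? len : capacity;

	assert(capacity > 0);
	memcpy(lang, loc, ncopy);
	if (len < capacity)
	{
		lang[len] = '\0';
		*status = DEMO_ZERO_ERROR;
	}
	else if (len == capacity)
		*status = DEMO_STRING_NOT_TERMINATED_WARNING;
	else
		*status = DEMO_BUFFER_OVERFLOW_ERROR;
	return len;
}

/* True only if lang[] now holds a NUL-terminated language. */
static bool
demo_language_ok(const char *loc, char *lang)
{
	DemoStatus	status = DEMO_ZERO_ERROR;

	demo_get_language(loc, lang, DEMO_LANG_CAPACITY, &status);

	/* the warning is not a "failure", so it needs its own test */
	if (DEMO_FAILURE(status) ||
		status == DEMO_STRING_NOT_TERMINATED_WARNING)
		return false;
	return true;
}
```

With this shape, the 12-character input from the report is rejected before anything calls strlen() or strcmp() on the buffer.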
Regards,
Jeff Davis
On 5/5/23 8:25 PM, Jeff Davis wrote:
On Fri, 2023-04-21 at 20:12 -0400, Robert Haas wrote:
On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis <pgsql@j-davis.com> wrote:
Most of the complaints seem to be complaints about v15 as well, and
while those complaints may be a reason to not make ICU the default,
they are also an argument that we should continue to learn and try to
fix those issues because they exist in an already-released version.
Leaving it the default for now will help us fix those issues rather
than hide them.
It's still early, so we have plenty of time to revert the initdb
default if we need to.
That's fair enough, but I really think it's important that some energy
get invested in providing adequate documentation for this stuff. Just
patching the code is not enough.
Attached a significant documentation patch.
I tried to make it comprehensive without trying to be exhaustive, and I
separated the explanation of language tags from what collation settings
you can include in a language tag, so hopefully that's more clear.
I added quite a few examples spread throughout the various sections,
and I preserved the existing examples at the end. I also left all of
the external links at the bottom for those interested enough to go
beyond what's there.
[Personal hat, not RMT]
Thanks -- this is super helpful. A bunch of these examples I had
previously had to figure out by randomly searching blog posts /
trial-and-error, so I think this will help developers get started more
quickly.
Comments (and a lot are just little nits to tighten the language)
Commit message -- typo: "documentaiton"
+ If you see such a message, ensure that the
<symbol>PROVIDER</symbol> and
+ <symbol>LOCALE</symbol> are as you expect, and consider specifying
+ directly as the canonical language tag instead of relying on the
+ transformation.
+ </para>
I'd recommend making this more prescriptive:
"If you see this notice, ensure that the <symbol>PROVIDER</symbol> and
<symbol>LOCALE</symbol> are the expected result. For consistent results
when using the ICU provider, specify the canonical <link
linkend="icu-language-tag">language tag</link> instead of relying on the
transformation."
+ If there is some problem interpreting the locale name, or if it
represents
+ a language or region that ICU does not recognize, a message will
be reported:
This is passive voice, consider:
"If there is a problem interpreting the locale name, or if the locale
name represents a language or region that ICU does not recognize, you'll
see the following error:"
+ <sect3 id="icu-language-tag">
+ <title>Language Tag</title>
+ <para>
Before jumping in, I'd recommend a quick definition of what a language
tag is, e.g.:
"A language tag, defined in BCP 47, is a standardized identifier used to
identify languages in computer systems" or something similar.
(I did find a database that made it simpler to search for these, which
is one issue I've previously had, but I don't think we'd want to link to it)
+ To include this additional collation information in a language tag,
+ append <literal>-u</literal>, followed by one or more
My first question was "what's special about '-u'", so maybe we say:
"To include this additional collation information in a language tag,
append <literal>-u</literal>, which indicates there are additional
collation settings, followed by one or more..."
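As an illustration of the key/value structure that follows -u, here is a rough sketch of splitting a language tag into collation settings. It is an assumption-laden toy, not how ICU or PostgreSQL parse tags: parse_u_settings() and TagSetting are hypothetical names, two-character subtags after -u are treated as keys, longer subtags are treated as the value of the most recent key, a key with no value is given "true", and any other single-character subtag ends the -u section.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_SETTINGS 8

typedef struct
{
	char		key[3];			/* two-character key, e.g. "kn" */
	char		value[64];		/* e.g. "level2", or "true" if omitted */
} TagSetting;

/*
 * Collect the key/value pairs that follow "-u" in a language tag.
 * Returns the number of settings stored in out[].
 */
static int
parse_u_settings(const char *tag, TagSetting *out, int max_out)
{
	char		copy[256];
	char	   *tok;
	int			n = 0;
	int			in_u = 0;		/* inside the "-u" section? */
	int			have_value = 0; /* current key already has a value? */

	if (strlen(tag) >= sizeof(copy))
		return 0;
	strcpy(copy, tag);

	for (tok = strtok(copy, "-"); tok != NULL; tok = strtok(NULL, "-"))
	{
		size_t		len = strlen(tok);

		if (len == 1)
		{
			/* a single-character subtag starts (or ends) a section */
			in_u = (tolower((unsigned char) tok[0]) == 'u');
			continue;
		}
		if (!in_u)
			continue;			/* language, region, etc. */

		if (len == 2)
		{
			/* new key; assume boolean true until a value shows up */
			if (n == max_out)
				break;
			strcpy(out[n].key, tok);
			strcpy(out[n].value, "true");
			have_value = 0;
			n++;
		}
		else if (n > 0)
		{
			TagSetting *s = &out[n - 1];

			if (!have_value)
			{
				snprintf(s->value, sizeof(s->value), "%s", tok);
				have_value = 1;
			}
			else
			{
				/* multi-part value such as "grek-latn" */
				size_t		used = strlen(s->value);

				snprintf(s->value + used, sizeof(s->value) - used,
						 "-%s", tok);
			}
		}
	}
	return n;
}
```

For example, under these assumptions 'en-US-u-kn-ks-level2' yields kn = true and ks = level2, while 'de-DE' yields no settings.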
+ ICU locales are specified as a <link
linkend="icu-language-tag">Language
+ Tag</link>, but can also accept most libc-style locale names
(which will
+ be transformed into language tags if possible).
+ </para>
I'd recommend removing the parentheticals:
ICU locales are specified as a BCP 47 <link
linkend="icu-language-tag">Language
Tag</link>, but can also accept most libc-style locale names. If
possible, libc-style locale names are transformed into language tags.
+ <title>ICU Collation Levels</title>
Nothing to add here other than to say I'm extremely appreciative of this
section. Once upon a time I sunk a lot of time trying to figure out how
all of these levels worked.
+ Sensitivity when determining equality, with
+ <literal>level1</literal> the least sensitive and
+ <literal>identic</literal> the most sensitive. See <xref
+ linkend="icu-collation-levels"/> for details.
This discusses equality sensitivity, but I'm not sure if I understand
that term here. The ICU docs seem to call these "strengths"[1], maybe we
use that term to be consistent with upstream?
+ If set to <literal>upper</literal>, upper case sorts before lower
+ case. If set to <literal>lower</literal>, lower case sorts before
+ upper case. If set to <literal>false</literal>, it depends on the
+ locale.
Suggestion to tighten this up:
"If set to <literal>false</literal>, the sort depends on the rules of
the locale."
+ Defaults may depend on locale. The above table is not meant to be
+ complete. See <xref linkend="icu-external-references"/> for additinal
+ options and details.
Typo: additinal => "additional"
I didn't add additional documentation for ICU rules. There are so many
options for collations that it's hard for me to think of realistic
examples to specify the rules directly, unless someone wants to invent
a new language. Perhaps useful if working with an interesting text file
format with special treatment for delimiters?
I asked the question about rules here:
/messages/by-id/e861ac4fdae9f9f5ce2a938a37bcb5e083f0f489.camel@cybertec.at
and got some limited response about addressing sort complaints. That
sounds reasonable, but a lot of that can also be handled just by
specifying the right collation settings. Someone who understands the
use case better could add some more documentation.
I'm not too sure about this one -- from my experience, users want
predictability in sorts, but there are a variety of ways to get that
experience.
Thanks,
Jonathan
[1]: https://unicode-org.github.io/icu/userguide/collation/concepts.html
On Tue, 2023-05-16 at 15:35 -0400, Jonathan S. Katz wrote:
+ Sensitivity when determining equality, with
+ <literal>level1</literal> the least sensitive and
+ <literal>identic</literal> the most sensitive. See <xref
+ linkend="icu-collation-levels"/> for details.
This discusses equality sensitivity, but I'm not sure if I understand
that term here. The ICU docs seem to call these "strengths"[1], maybe
we use that term to be consistent with upstream?
"Sensitivity" comes from "case sensitivity" which is more clear to me
than "strength". I added the term "strength" to correspond to the
unicode terminology, but I kept sensitivity and I tried to make it
slightly more clear.
Other than that, I took your suggestions almost verbatim. Patch
attached. Thank you!
I also made a few other changes:
* added a paragraph about the transformation of '' or 'root' to the 'und'
language (root collation)
* added paragraph that the "identic" level still performs some basic
normalization
* added example for when full normalization matters
I should also say that I don't really understand the case when "kc" is
set to true and "ks" is level 2 or higher. If someone has an example of
where that matters, let me know.
Regards,
Jeff Davis
Attachments:
v2-0001-Doc-improvements-for-language-tags-and-custom-ICU.patch (text/x-patch; charset=UTF-8)
From 8633ec205b0b0297910cef8f931092d0c05eb3ce Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 27 Apr 2023 14:43:46 -0700
Subject: [PATCH v2] Doc improvements for language tags and custom ICU
collations.
Separate the documentation for language tags themselves from the
available collation settings which can be included in a language tag.
Include tables of the available options, more details about the
effects of each option, and additional examples.
Also include an explanation of the "levels" of textual features and
how they relate to collation.
Discussion: https://postgr.es/m/25787ec7-4c04-9a8a-d241-4dc9be0b1ba3@postgresql.org
Reviewed-by: Jonathan S. Katz
---
doc/src/sgml/charset.sgml | 680 +++++++++++++++++++++++++++++++-------
1 file changed, 559 insertions(+), 121 deletions(-)
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 6dd95b8966..ea43732ec9 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -377,7 +377,134 @@ initdb --locale-provider=icu --icu-locale=en
variants and customization options.
</para>
</sect2>
+ <sect2 id="icu-locales">
+ <title>ICU Locales</title>
+ <sect3 id="icu-locale-names">
+ <title>ICU Locale Names</title>
+ <para>
+ The ICU format for the locale name is a <link
+ linkend="icu-language-tag">Language Tag</link>.
+
+<programlisting>
+CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP');
+CREATE COLLATION mycollation2 (PROVIDER = icu, LOCALE = 'fr');
+</programlisting>
+ </para>
+ </sect3>
+ <sect3 id="icu-canonicalization">
+ <title>Locale Canonicalization and Validation</title>
+ <para>
+ When defining a new ICU collation object or database with ICU as the
+ provider, the given locale name is transformed ("canonicalized") into a
+ language tag if not already in that form. For instance,
+
+<screen>
+CREATE COLLATION mycollation3 (PROVIDER = icu, LOCALE = 'en-US-u-kn-true');
+NOTICE: using standard form "en-US-u-kn" for locale "en-US-u-kn-true"
+CREATE COLLATION mycollation4 (PROVIDER = icu, LOCALE = 'de_DE.utf8');
+NOTICE: using standard form "de-DE" for locale "de_DE.utf8"
+</screen>
+
+ If you see this notice, ensure that the <symbol>PROVIDER</symbol> and
+ <symbol>LOCALE</symbol> are the expected result. For consistent results
+ when using the ICU provider, specify the canonical <link
+ linkend="icu-language-tag">language tag</link> instead of relying on the
+ transformation.
+ </para>
+ <para>
+ A locale with no language name, or the special language name
+ <literal>root</literal>, is transformed to have the language
+ <literal>und</literal> ("undefined").
+ </para>
+ <para>
+ ICU can transform most libc locale names, as well as some other formats,
+ into language tags for easier transition to ICU. If a libc locale name is
+ used in ICU, it may not have precisely the same behavior as in libc.
+ </para>
+ <para>
+ If there is a problem interpreting the locale name, or if the locale name
+ represents a language or region that ICU does not recognize, you will see
+ the following error:
+
+<screen>
+CREATE COLLATION nonsense (PROVIDER = icu, LOCALE = 'nonsense');
+ERROR: ICU locale "nonsense" has unknown language "nonsense"
+HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+</screen>
+
+ <xref
+ linkend="guc-icu-validation-level"/> controls how the message is
+ reported. If set below <literal>ERROR</literal>, the collation will still
+ be created, but the behavior may not be what the user intended.
+ </para>
+ </sect3>
+ <sect3 id="icu-language-tag">
+ <title>Language Tag</title>
+ <para>
+ A language tag, defined in BCP 47, is a standardized identifier used to
+ identify languages, regions, and other information about a locale.
+ </para>
+ <para>
+ Basic language tags are simply
+ <replaceable>language</replaceable><literal>-</literal><replaceable>region</replaceable>;
+ or even just <replaceable>language</replaceable>. The
+ <replaceable>language</replaceable> is a language code
+ (e.g. <literal>fr</literal> for French), and
+ <replaceable>region</replaceable> is a region code
+ (e.g. <literal>CA</literal> for Canada). Examples:
+ <literal>ja-JP</literal>, <literal>de</literal>, or
+ <literal>fr-CA</literal>.
+ </para>
+ <para>
+ Collation settings may be included in the language tag to customize
+ collation behavior. ICU allows extensive customization, such as
+ sensitivity (or insensitivity) to accents, case, and punctuation;
+ treatment of digits within text; and many other options to satisfy a
+ variety of uses.
+ </para>
+ <para>
+ To include this additional collation information in a language tag,
+ append <literal>-u</literal>, which indicates there are additional
+ collation settings, followed by one or more
+ <literal>-</literal><replaceable>key</replaceable><literal>-</literal><replaceable>value</replaceable>
+ pairs. The <replaceable>key</replaceable> is the key for a <link
+ linkend="icu-collation-settings">collation setting</link> and
+ <replaceable>value</replaceable> is a valid value for that setting. For
+ boolean settings, the <literal>-</literal><replaceable>key</replaceable>
+ may be specified without a corresponding
+ <literal>-</literal><replaceable>value</replaceable>, which implies a
+ value of <literal>true</literal>.
+ </para>
+ <para>
+ For example, the language tag <literal>en-US-u-kn-ks-level2</literal>
+ means the locale with the English language in the US region, with
+ collation settings <literal>kn</literal> set to <literal>true</literal>
+ and <literal>ks</literal> set to <literal>level2</literal>. Those
+ settings mean the collation will be case-insensitive and treat a sequence
+ of digits as a single number:
+<screen>
+CREATE COLLATION mycollation5 (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'en-US-u-kn-ks-level2');
+SELECT 'aB' = 'Ab' COLLATE mycollation5 as result;
+ result
+--------
+ t
+(1 row)
+
+SELECT 'N-45' < 'N-123' COLLATE mycollation5 as result;
+ result
+--------
+ t
+(1 row)
+</screen>
+ </para>
+ <para>
+ See <xref linkend="icu-custom-collations"/> for details and additional
+ examples of using language tags with custom collation information for the
+ locale.
+ </para>
+ </sect3>
+ </sect2>
<sect2 id="locale-problems">
<title>Problems</title>
@@ -658,6 +785,13 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
code byte values.
</para>
+ <note>
+ <para>
+ The <literal>C</literal> and <literal>POSIX</literal> locales may behave
+ differently depending on the database encoding.
+ </para>
+ </note>
+
<para>
Additionally, two SQL standard collation names are available:
@@ -869,132 +1003,24 @@ CREATE COLLATION german (provider = libc, locale = 'de_DE');
<sect4 id="collation-managing-create-icu">
<title>ICU Collations</title>
- <para>
- ICU allows collations to be customized beyond the basic language+country
- set that is preloaded by <command>initdb</command>. Users are encouraged
- to define their own collation objects that make use of these facilities to
- suit the sorting behavior to their requirements.
- See <ulink url="https://unicode-org.github.io/icu/userguide/locale/"></ulink>
- and <ulink url="https://unicode-org.github.io/icu/userguide/collation/api.html"></ulink> for
- information on ICU locale naming. The set of acceptable names and
- attributes depends on the particular ICU version.
- </para>
-
- <para>
- Here are some examples:
-
- <variablelist>
- <varlistentry id="collation-managing-create-icu-de-u-co-phonebk-x-icu">
- <term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');</literal></term>
- <term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de@collation=phonebook');</literal></term>
- <listitem>
- <para>German collation with phone book collation type</para>
- <para>
- The first example selects the ICU locale using a <quote>language
- tag</quote> per BCP 47. The second example uses the traditional
- ICU-specific locale syntax. The first style is preferred going
- forward, and is used internally to store locales.
- </para>
- <para>
- Note that you can name the collation objects in the SQL environment
- anything you want. In this example, we follow the naming style that
- the predefined collations use, which in turn also follow BCP 47, but
- that is not required for user-defined collations.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="collation-managing-create-icu-und-u-co-emoji-x-icu">
- <term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');</literal></term>
- <term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = '@collation=emoji');</literal></term>
- <listitem>
- <para>
- Root collation with Emoji collation type, per Unicode Technical Standard #51
- </para>
- <para>
- Observe how in the traditional ICU locale naming system, the root
- locale is selected by an empty string.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="collation-managing-create-icu-en-u-kr-grek-latn">
- <term><literal>CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');</literal></term>
- <term><literal>CREATE COLLATION latinlast (provider = icu, locale = 'en@colReorder=grek-latn');</literal></term>
- <listitem>
- <para>
- Sort Greek letters before Latin ones. (The default is Latin before Greek.)
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="collation-managing-create-icu-en-u-kf-upper">
- <term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');</literal></term>
- <term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en@colCaseFirst=upper');</literal></term>
- <listitem>
- <para>
- Sort upper-case letters before lower-case letters. (The default is
- lower-case letters first.)
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="collation-managing-create-icu-en-u-kf-upper-kr-grek-latn">
- <term><literal>CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn');</literal></term>
- <term><literal>CREATE COLLATION special (provider = icu, locale = 'en@colCaseFirst=upper;colReorder=grek-latn');</literal></term>
- <listitem>
- <para>
- Combines both of the above options.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="collation-managing-create-icu-en-u-kn-true">
- <term><literal>CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');</literal></term>
- <term><literal>CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes');</literal></term>
- <listitem>
- <para>
- Numeric ordering, sorts sequences of digits by their numeric value,
- for example: <literal>A-21</literal> < <literal>A-123</literal>
- (also known as natural sort).
- </para>
- </listitem>
- </varlistentry>
- </variablelist>
-
- See <ulink url="https://www.unicode.org/reports/tr35/tr35-collation.html">Unicode
- Technical Standard #35</ulink>
- and <ulink url="https://tools.ietf.org/html/bcp47">BCP 47</ulink> for
- details. The list of possible collation types (<literal>co</literal>
- subtag) can be found in
- the <ulink url="https://github.com/unicode-org/cldr/blob/master/common/bcp47/collation.xml">CLDR
- repository</ulink>.
- </para>
+ <para>
+ ICU collations can be created like:
- <para>
- Note that while this system allows creating collations that <quote>ignore
- case</quote> or <quote>ignore accents</quote> or similar (using the
- <literal>ks</literal> key), in order for such collations to act in a
- truly case- or accent-insensitive manner, they also need to be declared as not
- <firstterm>deterministic</firstterm> in <command>CREATE COLLATION</command>;
- see <xref linkend="collation-nondeterministic"/>.
- Otherwise, any strings that compare equal according to the collation but
- are not byte-wise equal will be sorted according to their byte values.
- </para>
+<programlisting>
+CREATE COLLATION german (provider = icu, locale = 'de-DE');
+</programlisting>
- <note>
+ ICU locales are specified as a BCP 47 <link
+ linkend="icu-language-tag">Language Tag</link>, but can also accept most
+ libc-style locale names. If possible, libc-style locale names are
+ transformed into language tags.
+ </para>
<para>
- By design, ICU will accept almost any string as a locale name and match
- it to the closest locale it can provide, using the fallback procedure
- described in its documentation. Thus, there will be no direct feedback
- if a collation specification is composed using features that the given
- ICU installation does not actually support. It is therefore recommended
- to create application-level test cases to check that the collation
- definitions satisfy one's requirements.
+ New ICU collations can customize collation behavior extensively by
+ including collation attributes in the language tag. See <xref
+ linkend="icu-custom-collations"/> for details and examples.
</para>
- </note>
</sect4>
-
<sect4 id="collation-copy">
<title>Copying Collations</title>
@@ -1072,6 +1098,418 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
</tip>
</sect3>
</sect2>
+ <sect2 id="icu-custom-collations">
+ <title>ICU Custom Collations</title>
+
+ <para>
+ ICU allows extensive control over collation behavior by defining new
+ collations with collation settings as a part of the language tag. These
+ settings can modify the collation order to suit a variety of needs. For
+ instance:
+
+<programlisting>
+-- ignore differences in accents and case
+CREATE COLLATION ignore_accent_case (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-level1');
+SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
+SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true
+
+-- upper case letters sort before lower case.
+CREATE COLLATION upper_first (PROVIDER=icu, LOCALE = 'und-u-kf-upper');
+SELECT 'B' < 'b' COLLATE upper_first; -- true
+
+-- treat digits numerically and ignore punctuation
+CREATE COLLATION num_ignore_punct (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ka-shifted-kn');
+SELECT 'id-45' < 'id-123' COLLATE num_ignore_punct; -- true
+SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
+</programlisting>
+
+ Many of the available options are described in <xref
+ linkend="icu-collation-settings"/>, or see <xref
+ linkend="icu-external-references"/> for more details.
+ </para>
+ <sect3 id="icu-collation-comparison-levels">
+ <title>ICU Comparison Levels</title>
+ <para>
+ Comparison of two strings (collation) in ICU is determined by a
+ multi-level process, where textual features are grouped into
+ "levels". Treatment of each level is controlled by the <link
+ linkend="icu-collation-settings-table">collation settings</link>. Higher
+ levels correspond to finer textual features.
+ </para>
+ <para>
+ <table id="icu-collation-levels">
+ <title>ICU Collation Levels</title>
+ <tgroup cols="8">
+ <thead>
+ <row>
+ <entry>Level</entry>
+ <entry>Description</entry>
+ <entry><literal>'f' = 'f'</literal></entry>
+ <entry><literal>'ab' = U&'a\2063b'</literal></entry>
+ <entry><literal>'x-y' = 'x_y'</literal></entry>
+ <entry><literal>'g' = 'G'</literal></entry>
+ <entry><literal>'n' = 'ñ'</literal></entry>
+ <entry><literal>'y' = 'z'</literal></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>level1</entry>
+ <entry>Base Character</entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>false</literal></entry>
+ </row>
+ <row>
+ <entry>level2</entry>
+ <entry>Accents</entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ </row>
+ <row>
+ <entry>level3</entry>
+ <entry>Case/Variants</entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ </row>
+ <row>
+ <entry>level4</entry>
+ <entry>Punctuation</entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ </row>
+ <row>
+ <entry>identic</entry>
+ <entry>All</entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ The above table shows which textual feature differences are
+ considered significant when determining equality at the given level. The
+ Unicode character <literal>U+2063</literal> is an invisible separator,
+ and as seen in the table, is ignored at all levels of comparison less
+ than <literal>identic</literal>.
+ </para>
+ <para>
+ At every level, even with full normalization off, basic normalization is
+ performed. For example, <literal>'á'</literal> may be composed of the
+ code points <literal>U&'\0061\0301'</literal> or the single code
+ point <literal>U&'\00E1'</literal>, and those sequences will be
+ considered equal even at the <literal>identic</literal> level. To treat
+ any difference in code point representation as distinct, use a collation
+ created with <symbol>DETERMINISTIC</symbol> set to
+ <literal>true</literal>.
+ </para>
+ <para>
+ Examples:
+
+<programlisting>
+CREATE COLLATION level3 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level3');
+CREATE COLLATION level4 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level4');
+CREATE COLLATION identic (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-identic');
+
+-- invisible separator ignored at all levels except identic
+SELECT 'ab' = U&'a\2063b' COLLATE level4; -- true
+SELECT 'ab' = U&'a\2063b' COLLATE identic; -- false
+
+-- punctuation ignored at level3 but not at level4
+SELECT 'x-y' = 'x_y' COLLATE level3; -- true
+SELECT 'x-y' = 'x_y' COLLATE level4; -- false
+</programlisting>
+
+ </para>
+ <note>
+ <para>
+ For many collation settings, you must create the collation with
+ <option>DETERMINISTIC</option> set to <literal>false</literal> for the
+ setting to have the desired effect. Additionally, some settings only
+ take effect when the key <literal>ka</literal> is set to
+ <literal>shifted</literal> (see <xref
+ linkend="icu-collation-settings-table"/>).
+ </para>
+ </note>
+ </sect3>
+ <sect3 id="icu-collation-settings">
+ <title>Collation Settings for an ICU Locale</title>
+ <para>
+ <table id="icu-collation-settings-table">
+ <title>ICU Collation Settings</title>
+ <tgroup cols="4">
+ <thead>
+ <row>
+ <entry>Key</entry>
+ <entry>Values</entry>
+ <entry>Default</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>co</literal></entry>
+ <entry><literal>emoji</literal>, <literal>phonebk</literal>, <literal>standard</literal>, <replaceable>...</replaceable></entry>
+ <entry><literal>standard</literal></entry>
+ <entry>
+ Collation type. See <xref linkend="icu-external-references"/> for additional options and details.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>ks</literal></entry>
+ <entry><literal>level1</literal>, <literal>level2</literal>, <literal>level3</literal>, <literal>level4</literal>, <literal>identic</literal></entry>
+ <entry><literal>level3</literal></entry>
+ <entry>
+ Sensitivity (or "strength") when determining equality, with
+ <literal>level1</literal> the least sensitive to differences and
+ <literal>identic</literal> the most sensitive to differences. See
+ <xref linkend="icu-collation-levels"/> for details.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>ka</literal></entry>
+ <entry><literal>noignore</literal>, <literal>shifted</literal></entry>
+ <entry><literal>noignore</literal></entry>
+ <entry>
+ If set to <literal>shifted</literal>, causes some characters
+ (e.g. punctuation or space) to be ignored in comparison. Key
+ <literal>ks</literal> must be set to <literal>level3</literal> or
+ lower to take effect. Set key <literal>kv</literal> to control which
+ character classes are ignored.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kb</literal></entry>
+ <entry><literal>true</literal>, <literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry>
+ Backwards comparison for the level 2 differences. For example,
+ locale <literal>und-u-kb</literal> sorts <literal>'àe'</literal>
+ before <literal>'aé'</literal>.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kk</literal></entry>
+ <entry><literal>true</literal>, <literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry>
+ <para>
+ Enable full normalization; may affect performance. Basic
+ normalization is performed even when set to
+ <literal>false</literal>. Locales for languages that require full
+ normalization typically enable it by default.
+ </para>
+ <para>
+ Full normalization is important in some cases, such as when
+ multiple accents are applied to a single character. For instance,
+ <literal>'ệ'</literal> can be composed of code points
+ <literal>U&'\0065\0323\0302'</literal> or
+ <literal>U&'\0065\0302\0323'</literal>. With full normalization
+ on, these code point sequences are treated as equal; otherwise they
+ are unequal.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kc</literal></entry>
+ <entry><literal>true</literal>, <literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry>
+ <para>
+ Separates case into a "level 2.5" that falls between accents and
+ other level 3 features.
+ </para>
+ <para>
+ If set to <literal>true</literal> and <literal>ks</literal> is set
+ to <literal>level1</literal>, will ignore accents but take case
+ into account.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kf</literal></entry>
+ <entry>
+ <literal>upper</literal>, <literal>lower</literal>,
+ <literal>false</literal>
+ </entry>
+ <entry><literal>false</literal></entry>
+ <entry>
+ If set to <literal>upper</literal>, upper case sorts before lower
+ case. If set to <literal>lower</literal>, lower case sorts before
+ upper case. If set to <literal>false</literal>, the sort depends on
+ the rules of the locale.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kn</literal></entry>
+ <entry><literal>true</literal>, <literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry>
+ If set to <literal>true</literal>, numbers within a string are
+ treated as a single numeric value rather than a sequence of
+ digits. For example, <literal>'id-45'</literal> sorts before
+ <literal>'id-123'</literal>.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kr</literal></entry>
+ <entry>
+ <literal>space</literal>, <literal>punct</literal>,
+ <literal>symbol</literal>, <literal>currency</literal>,
+ <literal>digit</literal>, <replaceable>script-id</replaceable>
+ </entry>
+ <entry></entry>
+ <entry>
+ <para>
+ Set to one or more of the valid values, or any BCP 47
+ <replaceable>script-id</replaceable>, e.g. <literal>latn</literal>
+ ("Latin") or <literal>grek</literal> ("Greek"). Multiple values are
+ separated by "<literal>-</literal>".
+ </para>
+ <para>
+ Redefines the ordering of classes of characters; those characters
+ belonging to a class earlier in the list sort before characters
+ belonging to a class later in the list. For instance, the value
+ <literal>digit-currency-space</literal> (as part of a language tag
+ like <literal>und-u-kr-digit-currency-space</literal>) sorts
+ punctuation before digits and spaces.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kv</literal></entry>
+ <entry>
+ <literal>space</literal>, <literal>punct</literal>,
+ <literal>symbol</literal>, <literal>currency</literal>
+ </entry>
+ <entry><literal>punct</literal></entry>
+ <entry>
+ Classes of characters ignored during comparison at level 3. Setting
+ to a later value includes earlier values;
+ e.g. <literal>symbol</literal> also includes
+ <literal>punct</literal> and <literal>space</literal> in the
+ characters to be ignored. Key <literal>ka</literal> must be set to
+ <literal>shifted</literal> and key <literal>ks</literal> must be set
+ to <literal>level3</literal> or lower to take effect.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ Defaults may depend on locale. The above table is not meant to be
+ complete. See <xref linkend="icu-external-references"/> for additional
+ options and details.
+ </para>
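+ <para>
+ As a minimal sketch of how <literal>ks</literal> and
+ <literal>kc</literal> combine, assuming an ICU-enabled server and a UTF8
+ database (the collation name <literal>ignore_accents_demo</literal> is
+ hypothetical), the following collation should ignore accents while still
+ distinguishing case:
+
+<programlisting>
+-- level1 alone ignores accents and case; kc-true adds case back
+CREATE COLLATION ignore_accents_demo (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-level1-kc-true');
+
+SELECT 'résumé' = 'resume' COLLATE ignore_accents_demo; -- true
+SELECT 'resume' = 'Resume' COLLATE ignore_accents_demo; -- false
+</programlisting>
+ </para>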
+ </sect3>
+ <sect3 id="icu-locale-examples">
+ <title>Examples</title>
+ <para>
+ <variablelist>
+ <varlistentry id="collation-managing-create-icu-de-u-co-phonebk-x-icu">
+ <term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');</literal></term>
+ <listitem>
+ <para>German collation with phone book collation type</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="collation-managing-create-icu-und-u-co-emoji-x-icu">
+ <term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');</literal></term>
+ <listitem>
+ <para>
+ Root collation with Emoji collation type, per Unicode Technical Standard #51
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="collation-managing-create-icu-en-u-kr-grek-latn">
+ <term><literal>CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');</literal></term>
+ <listitem>
+ <para>
+ Sort Greek letters before Latin ones. (The default is Latin before Greek.)
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="collation-managing-create-icu-en-u-kf-upper">
+ <term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');</literal></term>
+ <listitem>
+ <para>
+ Sort upper-case letters before lower-case letters. (The default is
+ lower-case letters first.)
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="collation-managing-create-icu-en-u-kf-upper-kr-grek-latn">
+ <term><literal>CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn');</literal></term>
+ <listitem>
+ <para>
+ Combines both of the above options.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
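+ <para>
+ The effect of script reordering can be sketched with a small sort,
+ assuming an ICU-enabled server and a UTF8 database; the collation name
+ <literal>greek_first_demo</literal> and the sample values are made up for
+ illustration:
+
+<programlisting>
+CREATE COLLATION greek_first_demo (PROVIDER = icu, LOCALE = 'en-u-kr-grek-latn');
+
+-- Greek letters sort ahead of Latin ones: α, β, a, b
+SELECT x FROM (VALUES ('b'), ('β'), ('a'), ('α')) AS t(x)
+ ORDER BY x COLLATE greek_first_demo;
+</programlisting>
+ </para>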
+ </sect3>
+ <sect3 id="icu-external-references">
+ <title>External References for ICU</title>
+ <para>
+ This section (<xref linkend="icu-custom-collations"/>) is only a brief
+ overview of ICU behavior and language tags. Refer to the following
+ documents for technical details, additional options, and new behavior:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <ulink
+ url="https://www.unicode.org/reports/tr35/tr35-collation.html">Unicode
+ Technical Standard #35</ulink>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <ulink url="https://tools.ietf.org/html/bcp47">BCP 47</ulink>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <ulink url="https://github.com/unicode-org/cldr/blob/master/common/bcp47/collation.xml">CLDR
+ repository</ulink>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <ulink url="https://unicode-org.github.io/icu/userguide/locale/"></ulink>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <ulink url="https://unicode-org.github.io/icu/userguide/collation/api.html"></ulink>
+ </para>
+ </listitem>
+ </itemizedlist>
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="multibyte">
--
2.34.1
On Tue, 2023-05-16 at 20:23 -0700, Jeff Davis wrote:
Other than that, I took your suggestions almost verbatim. Patch
attached. Thank you!
Attached new patch with a typo fix and a few other edits. I plan to
commit soon.
Regards,
Jeff Davis
Attachments:
0001-Doc-improvements-for-language-tags-and-custom-ICU-co.patch (text/x-patch; charset=UTF-8)
From d0d2375fa55618b60f361f6bb64b2c49490125b9 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 27 Apr 2023 14:43:46 -0700
Subject: [PATCH] Doc improvements for language tags and custom ICU collations.
Separate the documentation for language tags themselves from the
available collation settings which can be included in a language tag.
Include tables of the available options, more details about the
effects of each option, and additional examples.
Also include an explanation of the "levels" of textual features and
how they relate to collation.
Discussion: https://postgr.es/m/25787ec7-4c04-9a8a-d241-4dc9be0b1ba3@postgresql.org
Reviewed-by: Jonathan S. Katz
---
doc/src/sgml/charset.sgml | 683 +++++++++++++++++++++++++++++++-------
1 file changed, 562 insertions(+), 121 deletions(-)
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 6dd95b8966..6b9c323edd 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -377,7 +377,134 @@ initdb --locale-provider=icu --icu-locale=en
variants and customization options.
</para>
</sect2>
+ <sect2 id="icu-locales">
+ <title>ICU Locales</title>
+ <sect3 id="icu-locale-names">
+ <title>ICU Locale Names</title>
+ <para>
+ The ICU format for the locale name is a <link
+ linkend="icu-language-tag">Language Tag</link>.
+
+<programlisting>
+CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP');
+CREATE COLLATION mycollation2 (PROVIDER = icu, LOCALE = 'fr');
+</programlisting>
+ </para>
+ </sect3>
+ <sect3 id="icu-canonicalization">
+ <title>Locale Canonicalization and Validation</title>
+ <para>
+ When defining a new ICU collation object or database with ICU as the
+ provider, the given locale name is transformed ("canonicalized") into a
+ language tag if not already in that form. For instance,
+
+<screen>
+CREATE COLLATION mycollation3 (PROVIDER = icu, LOCALE = 'en-US-u-kn-true');
+NOTICE: using standard form "en-US-u-kn" for locale "en-US-u-kn-true"
+CREATE COLLATION mycollation4 (PROVIDER = icu, LOCALE = 'de_DE.utf8');
+NOTICE: using standard form "de-DE" for locale "de_DE.utf8"
+</screen>
+
+ If you see this notice, ensure that the <symbol>PROVIDER</symbol> and
+ <symbol>LOCALE</symbol> are the expected result. For consistent results
+ when using the ICU provider, specify the canonical <link
+ linkend="icu-language-tag">language tag</link> instead of relying on the
+ transformation.
+ </para>
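+ <para>
+ As a minimal sketch of how to check the stored form, assuming a
+ <productname>PostgreSQL</productname> 16 server built with ICU support
+ (the collation name <literal>canon_demo</literal> is hypothetical, and
+ the catalog column is assumed to be named
+ <structfield>colliculocale</structfield> as in that release), the
+ canonicalized language tag can be read back from
+ <structname>pg_collation</structname>:
+
+<programlisting>
+-- hypothetical collation created from a libc-style locale name
+CREATE COLLATION canon_demo (PROVIDER = icu, LOCALE = 'de_DE.utf8');
+
+-- show the language tag that was actually stored for the collation
+SELECT collname, colliculocale
+  FROM pg_collation
+ WHERE collname = 'canon_demo';
+</programlisting>
+ </para>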
+ <para>
+ A locale with no language name, or the special language name
+ <literal>root</literal>, is transformed to have the language
+ <literal>und</literal> ("undefined").
+ </para>
+ <para>
+ ICU can transform most libc locale names, as well as some other formats,
+ into language tags for easier transition to ICU. If a libc locale name is
+ used in ICU, it may not have precisely the same behavior as in libc.
+ </para>
+ <para>
+ If there is a problem interpreting the locale name, or if the locale name
+ represents a language or region that ICU does not recognize, you will see
+ the following warning:
+
+<screen>
+CREATE COLLATION nonsense (PROVIDER = icu, LOCALE = 'nonsense');
+WARNING: ICU locale "nonsense" has unknown language "nonsense"
+HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION
+</screen>
+
+ <xref linkend="guc-icu-validation-level"/> controls how the message is
+ reported. Unless set to <literal>ERROR</literal>, the collation will
+ still be created, but the behavior may not be what the user intended.
+ </para>
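+ <para>
+ A minimal sketch, assuming the session is permitted to change
+ <varname>icu_validation_level</varname> (the collation name
+ <literal>nonsense_strict</literal> is hypothetical): raising the level to
+ <literal>ERROR</literal> should make the same kind of statement fail
+ instead of creating a questionable collation.
+
+<programlisting>
+SET icu_validation_level = ERROR;
+
+-- expected to be rejected, so no collation object is created
+CREATE COLLATION nonsense_strict (PROVIDER = icu, LOCALE = 'nonsense');
+
+RESET icu_validation_level;
+</programlisting>
+ </para>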
+ </sect3>
+ <sect3 id="icu-language-tag">
+ <title>Language Tag</title>
+ <para>
+ A language tag, defined in BCP 47, is a standardized identifier used to
+ identify languages, regions, and other information about a locale.
+ </para>
+ <para>
+ Basic language tags are simply
+ <replaceable>language</replaceable><literal>-</literal><replaceable>region</replaceable>;
+ or even just <replaceable>language</replaceable>. The
+ <replaceable>language</replaceable> is a language code
+ (e.g. <literal>fr</literal> for French), and
+ <replaceable>region</replaceable> is a region code
+ (e.g. <literal>CA</literal> for Canada). Examples:
+ <literal>ja-JP</literal>, <literal>de</literal>, or
+ <literal>fr-CA</literal>.
+ </para>
+ <para>
+ Collation settings may be included in the language tag to customize
+ collation behavior. ICU allows extensive customization, such as
+ sensitivity (or insensitivity) to accents, case, and punctuation;
+ treatment of digits within text; and many other options to satisfy a
+ variety of uses.
+ </para>
+ <para>
+ To include this additional collation information in a language tag,
+ append <literal>-u</literal>, which indicates there are additional
+ collation settings, followed by one or more
+ <literal>-</literal><replaceable>key</replaceable><literal>-</literal><replaceable>value</replaceable>
+ pairs. The <replaceable>key</replaceable> is the key for a <link
+ linkend="icu-collation-settings">collation setting</link> and
+ <replaceable>value</replaceable> is a valid value for that setting. For
+ boolean settings, the <literal>-</literal><replaceable>key</replaceable>
+ may be specified without a corresponding
+ <literal>-</literal><replaceable>value</replaceable>, which implies a
+ value of <literal>true</literal>.
+ </para>
+ <para>
+ For example, the language tag <literal>en-US-u-kn-ks-level2</literal>
+ means the locale with the English language in the US region, with
+ collation settings <literal>kn</literal> set to <literal>true</literal>
+ and <literal>ks</literal> set to <literal>level2</literal>. Those
+ settings mean the collation will be case-insensitive and treat a sequence
+ of digits as a single number:
+<screen>
+CREATE COLLATION mycollation5 (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'en-US-u-kn-ks-level2');
+SELECT 'aB' = 'Ab' COLLATE mycollation5 as result;
+ result
+--------
+ t
+(1 row)
+
+SELECT 'N-45' < 'N-123' COLLATE mycollation5 as result;
+ result
+--------
+ t
+(1 row)
+</screen>
+ </para>
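+ <para>
+ The effect of <literal>kn</literal> on sort order can be sketched as
+ follows, assuming an ICU-enabled server; the collation name
+ <literal>kn_demo</literal> and the sample values are made up for
+ illustration:
+
+<programlisting>
+CREATE COLLATION kn_demo (PROVIDER = icu, LOCALE = 'en-US-u-kn');
+
+-- digit sequences compare by numeric value: N-5, N-45, N-123
+SELECT x FROM (VALUES ('N-123'), ('N-5'), ('N-45')) AS t(x)
+ ORDER BY x COLLATE kn_demo;
+
+-- for contrast, byte-wise ordering with "C": N-123, N-45, N-5
+SELECT x FROM (VALUES ('N-123'), ('N-5'), ('N-45')) AS t(x)
+ ORDER BY x COLLATE "C";
+</programlisting>
+ </para>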
+ <para>
+ See <xref linkend="icu-custom-collations"/> for details and additional
+ examples of using language tags with custom collation information for the
+ locale.
+ </para>
+ </sect3>
+ </sect2>
<sect2 id="locale-problems">
<title>Problems</title>
@@ -658,6 +785,13 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
code byte values.
</para>
+ <note>
+ <para>
+ The <literal>C</literal> and <literal>POSIX</literal> locales may behave
+ differently depending on the database encoding.
+ </para>
+ </note>
+
<para>
Additionally, two SQL standard collation names are available:
@@ -869,132 +1003,24 @@ CREATE COLLATION german (provider = libc, locale = 'de_DE');
<sect4 id="collation-managing-create-icu">
<title>ICU Collations</title>
- <para>
- ICU allows collations to be customized beyond the basic language+country
- set that is preloaded by <command>initdb</command>. Users are encouraged
- to define their own collation objects that make use of these facilities to
- suit the sorting behavior to their requirements.
- See <ulink url="https://unicode-org.github.io/icu/userguide/locale/"></ulink>
- and <ulink url="https://unicode-org.github.io/icu/userguide/collation/api.html"></ulink> for
- information on ICU locale naming. The set of acceptable names and
- attributes depends on the particular ICU version.
- </para>
-
- <para>
- Here are some examples:
-
- <variablelist>
- <varlistentry id="collation-managing-create-icu-de-u-co-phonebk-x-icu">
- <term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');</literal></term>
- <term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de@collation=phonebook');</literal></term>
- <listitem>
- <para>German collation with phone book collation type</para>
- <para>
- The first example selects the ICU locale using a <quote>language
- tag</quote> per BCP 47. The second example uses the traditional
- ICU-specific locale syntax. The first style is preferred going
- forward, and is used internally to store locales.
- </para>
- <para>
- Note that you can name the collation objects in the SQL environment
- anything you want. In this example, we follow the naming style that
- the predefined collations use, which in turn also follow BCP 47, but
- that is not required for user-defined collations.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="collation-managing-create-icu-und-u-co-emoji-x-icu">
- <term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');</literal></term>
- <term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = '@collation=emoji');</literal></term>
- <listitem>
- <para>
- Root collation with Emoji collation type, per Unicode Technical Standard #51
- </para>
- <para>
- Observe how in the traditional ICU locale naming system, the root
- locale is selected by an empty string.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="collation-managing-create-icu-en-u-kr-grek-latn">
- <term><literal>CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');</literal></term>
- <term><literal>CREATE COLLATION latinlast (provider = icu, locale = 'en@colReorder=grek-latn');</literal></term>
- <listitem>
- <para>
- Sort Greek letters before Latin ones. (The default is Latin before Greek.)
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="collation-managing-create-icu-en-u-kf-upper">
- <term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');</literal></term>
- <term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en@colCaseFirst=upper');</literal></term>
- <listitem>
- <para>
- Sort upper-case letters before lower-case letters. (The default is
- lower-case letters first.)
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="collation-managing-create-icu-en-u-kf-upper-kr-grek-latn">
- <term><literal>CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn');</literal></term>
- <term><literal>CREATE COLLATION special (provider = icu, locale = 'en@colCaseFirst=upper;colReorder=grek-latn');</literal></term>
- <listitem>
- <para>
- Combines both of the above options.
- </para>
- </listitem>
- </varlistentry>
-
- <varlistentry id="collation-managing-create-icu-en-u-kn-true">
- <term><literal>CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');</literal></term>
- <term><literal>CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes');</literal></term>
- <listitem>
- <para>
- Numeric ordering, sorts sequences of digits by their numeric value,
- for example: <literal>A-21</literal> < <literal>A-123</literal>
- (also known as natural sort).
- </para>
- </listitem>
- </varlistentry>
- </variablelist>
-
- See <ulink url="https://www.unicode.org/reports/tr35/tr35-collation.html">Unicode
- Technical Standard #35</ulink>
- and <ulink url="https://tools.ietf.org/html/bcp47">BCP 47</ulink> for
- details. The list of possible collation types (<literal>co</literal>
- subtag) can be found in
- the <ulink url="https://github.com/unicode-org/cldr/blob/master/common/bcp47/collation.xml">CLDR
- repository</ulink>.
- </para>
+ <para>
+ ICU collations can be created like:
- <para>
- Note that while this system allows creating collations that <quote>ignore
- case</quote> or <quote>ignore accents</quote> or similar (using the
- <literal>ks</literal> key), in order for such collations to act in a
- truly case- or accent-insensitive manner, they also need to be declared as not
- <firstterm>deterministic</firstterm> in <command>CREATE COLLATION</command>;
- see <xref linkend="collation-nondeterministic"/>.
- Otherwise, any strings that compare equal according to the collation but
- are not byte-wise equal will be sorted according to their byte values.
- </para>
+<programlisting>
+CREATE COLLATION german (provider = icu, locale = 'de-DE');
+</programlisting>
- <note>
+ ICU locales are specified as a BCP 47 <link
+ linkend="icu-language-tag">Language Tag</link>, but can also accept most
+ libc-style locale names. If possible, libc-style locale names are
+ transformed into language tags.
+ </para>
<para>
- By design, ICU will accept almost any string as a locale name and match
- it to the closest locale it can provide, using the fallback procedure
- described in its documentation. Thus, there will be no direct feedback
- if a collation specification is composed using features that the given
- ICU installation does not actually support. It is therefore recommended
- to create application-level test cases to check that the collation
- definitions satisfy one's requirements.
+ New ICU collations can customize collation behavior extensively by
+ including collation attributes in the language tag. See <xref
+ linkend="icu-custom-collations"/> for details and examples.
</para>
- </note>
</sect4>
-
<sect4 id="collation-copy">
<title>Copying Collations</title>
@@ -1072,6 +1098,421 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
</tip>
</sect3>
</sect2>
+ <sect2 id="icu-custom-collations">
+ <title>ICU Custom Collations</title>
+
+ <para>
+ ICU allows extensive control over collation behavior by defining new
+ collations with collation settings as a part of the language tag. These
+ settings can modify the collation order to suit a variety of needs. For
+ instance:
+
+<programlisting>
+-- ignore differences in accents and case
+CREATE COLLATION ignore_accent_case (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-level1');
+SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
+SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true
+
+-- upper case letters sort before lower case.
+CREATE COLLATION upper_first (PROVIDER=icu, LOCALE = 'und-u-kf-upper');
+SELECT 'B' < 'b' COLLATE upper_first; -- true
+
+-- treat digits numerically and ignore punctuation
+CREATE COLLATION num_ignore_punct (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ka-shifted-kn');
+SELECT 'id-45' < 'id-123' COLLATE num_ignore_punct; -- true
+SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
+</programlisting>
+
+ Many of the available options are described in <xref
+ linkend="icu-collation-settings"/>, or see <xref
+ linkend="icu-external-references"/> for more details.
+ </para>
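+ <para>
+ The same settings also affect <literal>ORDER BY</literal>. A minimal
+ sketch, assuming an ICU-enabled server (the collation name
+ <literal>upper_first_demo</literal> and the sample values are
+ hypothetical):
+
+<programlisting>
+CREATE COLLATION upper_first_demo (PROVIDER = icu, LOCALE = 'und-u-kf-upper');
+
+-- base letters are compared first, then case with upper case ahead: A, a, B, b
+SELECT x FROM (VALUES ('b'), ('a'), ('B'), ('A')) AS t(x)
+ ORDER BY x COLLATE upper_first_demo;
+</programlisting>
+ </para>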
+ <sect3 id="icu-collation-comparison-levels">
+ <title>ICU Comparison Levels</title>
+ <para>
+ Comparison of two strings (collation) in ICU is determined by a
+ multi-level process, where textual features are grouped into
+ "levels". Treatment of each level is controlled by the <link
+ linkend="icu-collation-settings-table">collation settings</link>. Higher
+ levels correspond to finer textual features.
+ </para>
+ <para>
+ <table id="icu-collation-levels">
+ <title>ICU Collation Levels</title>
+ <tgroup cols="8">
+ <thead>
+ <row>
+ <entry>Level</entry>
+ <entry>Description</entry>
+ <entry><literal>'f' = 'f'</literal></entry>
+ <entry><literal>'ab' = U&'a\2063b'</literal></entry>
+ <entry><literal>'x-y' = 'x_y'</literal></entry>
+ <entry><literal>'g' = 'G'</literal></entry>
+ <entry><literal>'n' = 'ñ'</literal></entry>
+ <entry><literal>'y' = 'z'</literal></entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>level1</entry>
+ <entry>Base Character</entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>false</literal></entry>
+ </row>
+ <row>
+ <entry>level2</entry>
+ <entry>Accents</entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ </row>
+ <row>
+ <entry>level3</entry>
+ <entry>Case/Variants</entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ </row>
+ <row>
+ <entry>level4</entry>
+ <entry>Punctuation</entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ </row>
+ <row>
+ <entry>identic</entry>
+ <entry>All</entry>
+ <entry><literal>true</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ The above table shows which textual feature differences are
+ considered significant when determining equality at the given level. The
+ Unicode character <literal>U+2063</literal> is an invisible separator,
+ and as seen in the table, is ignored at all levels of comparison less
+ than <literal>identic</literal>.
+ </para>
+ <para>
+ At every level, even with full normalization off, basic normalization is
+ performed. For example, <literal>'á'</literal> may be composed of the
+ code points <literal>U&'\0061\0301'</literal> or the single code
+ point <literal>U&'\00E1'</literal>, and those sequences will be
+ considered equal even at the <literal>identic</literal> level. To treat
+ any difference in code point representation as distinct, use a collation
+ created with <symbol>DETERMINISTIC</symbol> set to
+ <literal>true</literal>.
+ </para>
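+ <para>
+ As a minimal sketch of the difference, assuming a UTF8 database on an
+ ICU-enabled server (the collation names <literal>norm_nd_demo</literal>
+ and <literal>norm_det_demo</literal> are hypothetical), the decomposed
+ and precomposed forms of <literal>'á'</literal> can be compared under a
+ nondeterministic and a deterministic collation:
+
+<programlisting>
+CREATE COLLATION norm_nd_demo (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und');
+CREATE COLLATION norm_det_demo (PROVIDER = icu, DETERMINISTIC = true, LOCALE = 'und');
+
+-- a + combining acute accent versus the single precomposed code point
+SELECT U&'\0061\0301' = U&'\00E1' COLLATE norm_nd_demo;  -- true
+SELECT U&'\0061\0301' = U&'\00E1' COLLATE norm_det_demo; -- false
+</programlisting>
+ </para>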
+ <sect4 id="icu-collation-level-examples">
+ <title>Collation Level Examples</title>
+ <para>
+
+<programlisting>
+CREATE COLLATION level3 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level3');
+CREATE COLLATION level4 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level4');
+CREATE COLLATION identic (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-identic');
+
+-- invisible separator ignored at all levels except identic
+SELECT 'ab' = U&'a\2063b' COLLATE level4; -- true
+SELECT 'ab' = U&'a\2063b' COLLATE identic; -- false
+
+-- punctuation ignored at level3 but not at level4
+SELECT 'x-y' = 'x_y' COLLATE level3; -- true
+SELECT 'x-y' = 'x_y' COLLATE level4; -- false
+</programlisting>
+
+ </para>
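+ <para>
+ The lower levels in the table can be sketched the same way, assuming an
+ ICU-enabled server and a UTF8 database; the collation names
+ <literal>level1_demo</literal> and <literal>level2_demo</literal> are
+ hypothetical:
+
+<programlisting>
+CREATE COLLATION level1_demo (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level1');
+CREATE COLLATION level2_demo (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level2');
+
+-- accents ignored at level1 but not at level2
+SELECT 'n' = 'ñ' COLLATE level1_demo; -- true
+SELECT 'n' = 'ñ' COLLATE level2_demo; -- false
+
+-- case still ignored at level2
+SELECT 'g' = 'G' COLLATE level2_demo; -- true
+</programlisting>
+ </para>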
+ </sect4>
+ </sect3>
+ <sect3 id="icu-collation-settings">
+ <title>Collation Settings for an ICU Locale</title>
+ <para>
+ <table id="icu-collation-settings-table">
+ <title>ICU Collation Settings</title>
+ <tgroup cols="4">
+ <thead>
+ <row>
+ <entry>Key</entry>
+ <entry>Values</entry>
+ <entry>Default</entry>
+ <entry>Description</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>co</literal></entry>
+ <entry><literal>emoji</literal>, <literal>phonebk</literal>, <literal>standard</literal>, <replaceable>...</replaceable></entry>
+ <entry><literal>standard</literal></entry>
+ <entry>
+ Collation type. See <xref linkend="icu-external-references"/> for additional options and details.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>ks</literal></entry>
+ <entry><literal>level1</literal>, <literal>level2</literal>, <literal>level3</literal>, <literal>level4</literal>, <literal>identic</literal></entry>
+ <entry><literal>level3</literal></entry>
+ <entry>
+ Sensitivity (or "strength") when determining equality, with
+ <literal>level1</literal> the least sensitive to differences and
+ <literal>identic</literal> the most sensitive to differences. See
+ <xref linkend="icu-collation-levels"/> for details.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>ka</literal></entry>
+ <entry><literal>noignore</literal>, <literal>shifted</literal></entry>
+ <entry><literal>noignore</literal></entry>
+ <entry>
+ If set to <literal>shifted</literal>, causes some characters
+ (e.g. punctuation or space) to be ignored in comparison. Key
+ <literal>ks</literal> must be set to <literal>level3</literal> or
+ lower to take effect. Set key <literal>kv</literal> to control which
+ character classes are ignored.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kb</literal></entry>
+ <entry><literal>true</literal>, <literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry>
+ Backwards comparison for the level 2 differences. For example,
+ locale <literal>und-u-kb</literal> sorts <literal>'àe'</literal>
+ before <literal>'aé'</literal>.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kk</literal></entry>
+ <entry><literal>true</literal>, <literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry>
+ <para>
+ Enable full normalization; may affect performance. Basic
+ normalization is performed even when set to
+ <literal>false</literal>. Locales for languages that require full
+ normalization typically enable it by default.
+ </para>
+ <para>
+ Full normalization is important in some cases, such as when
+ multiple accents are applied to a single character. For instance,
+ <literal>'ệ'</literal> can be composed of code points
+ <literal>U&'\0065\0323\0302'</literal> or
+ <literal>U&'\0065\0302\0323'</literal>. With full normalization
+ on, these code point sequences are treated as equal; otherwise they
+ are unequal.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kc</literal></entry>
+ <entry><literal>true</literal>, <literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry>
+ <para>
+ Separates case into a "level 2.5" that falls between accents and
+ other level 3 features.
+ </para>
+ <para>
+ If set to <literal>true</literal> and <literal>ks</literal> is set
+ to <literal>level1</literal>, will ignore accents but take case
+ into account.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kf</literal></entry>
+ <entry>
+ <literal>upper</literal>, <literal>lower</literal>,
+ <literal>false</literal>
+ </entry>
+ <entry><literal>false</literal></entry>
+ <entry>
+ If set to <literal>upper</literal>, upper case sorts before lower
+ case. If set to <literal>lower</literal>, lower case sorts before
+ upper case. If set to <literal>false</literal>, the sort depends on
+ the rules of the locale.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kn</literal></entry>
+ <entry><literal>true</literal>, <literal>false</literal></entry>
+ <entry><literal>false</literal></entry>
+ <entry>
+ If set to <literal>true</literal>, numbers within a string are
+ treated as a single numeric value rather than a sequence of
+ digits. For example, <literal>'id-45'</literal> sorts before
+ <literal>'id-123'</literal>.
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kr</literal></entry>
+ <entry>
+ <literal>space</literal>, <literal>punct</literal>,
+ <literal>symbol</literal>, <literal>currency</literal>,
+ <literal>digit</literal>, <replaceable>script-id</replaceable>
+ </entry>
+ <entry></entry>
+ <entry>
+ <para>
+ Set to one or more of the valid values, or any BCP 47
+ <replaceable>script-id</replaceable>, e.g. <literal>latn</literal>
+ ("Latin") or <literal>grek</literal> ("Greek"). Multiple values are
+ separated by "<literal>-</literal>".
+ </para>
+ <para>
+ Redefines the ordering of classes of characters; those characters
+ belonging to a class earlier in the list sort before characters
+ belonging to a class later in the list. For instance, the value
+ <literal>digit-currency-space</literal> (as part of a language tag
+ like <literal>und-u-kr-digit-currency-space</literal>) sorts
+ punctuation before digits and spaces.
+ </para>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>kv</literal></entry>
+ <entry>
+ <literal>space</literal>, <literal>punct</literal>,
+ <literal>symbol</literal>, <literal>currency</literal>
+ </entry>
+ <entry><literal>punct</literal></entry>
+ <entry>
+ Classes of characters ignored during comparison at level 3. Setting
+ to a later value includes earlier values;
+ e.g. <literal>symbol</literal> also includes
+ <literal>punct</literal> and <literal>space</literal> in the
+ characters to be ignored. Key <literal>ka</literal> must be set to
+ <literal>shifted</literal> and key <literal>ks</literal> must be set
+ to <literal>level3</literal> or lower to take effect.
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ Defaults may depend on locale. The above table is not meant to be
+ complete. See <xref linkend="icu-external-references"/> for additional
+ options and details.
+ </para>
+ <note>
+ <para>
+ For many collation settings, you must create the collation with
+ <option>DETERMINISTIC</option> set to <literal>false</literal> for the
+ setting to have the desired effect (see <xref
+ linkend="collation-nondeterministic"/>). Additionally, some settings
+ only take effect when the key <literal>ka</literal> is set to
+ <literal>shifted</literal> (see <xref
+ linkend="icu-collation-settings-table"/>).
+ </para>
+ </note>
+ </sect3>
+ <sect3 id="icu-locale-examples">
+ <title>Examples</title>
+ <para>
+ <variablelist>
+ <varlistentry id="collation-managing-create-icu-de-u-co-phonebk-x-icu">
+ <term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');</literal></term>
+ <listitem>
+ <para>German collation with phone book collation type</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="collation-managing-create-icu-und-u-co-emoji-x-icu">
+ <term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');</literal></term>
+ <listitem>
+ <para>
+ Root collation with Emoji collation type, per Unicode Technical Standard #51
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="collation-managing-create-icu-en-u-kr-grek-latn">
+ <term><literal>CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');</literal></term>
+ <listitem>
+ <para>
+ Sort Greek letters before Latin ones. (The default is Latin before Greek.)
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="collation-managing-create-icu-en-u-kf-upper">
+ <term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');</literal></term>
+ <listitem>
+ <para>
+ Sort upper-case letters before lower-case letters. (The default is
+ lower-case letters first.)
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry id="collation-managing-create-icu-en-u-kf-upper-kr-grek-latn">
+ <term><literal>CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn');</literal></term>
+ <listitem>
+ <para>
+ Combines both of the above options.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </sect3>
+ <sect3 id="icu-external-references">
+ <title>External References for ICU</title>
+ <para>
+ This section (<xref linkend="icu-custom-collations"/>) is only a brief
+ overview of ICU behavior and language tags. Refer to the following
+ documents for technical details, additional options, and new behavior:
+ </para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <ulink
+ url="https://www.unicode.org/reports/tr35/tr35-collation.html">Unicode
+ Technical Standard #35</ulink>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <ulink url="https://tools.ietf.org/html/bcp47">BCP 47</ulink>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <ulink url="https://github.com/unicode-org/cldr/blob/master/common/bcp47/collation.xml">CLDR
+ repository</ulink>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <ulink url="https://unicode-org.github.io/icu/userguide/locale/"></ulink>
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ <ulink url="https://unicode-org.github.io/icu/userguide/collation/api.html"></ulink>
+ </para>
+ </listitem>
+ </itemizedlist>
+ </sect3>
+ </sect2>
</sect1>
<sect1 id="multibyte">
--
2.34.1
On 5/17/23 6:59 PM, Jeff Davis wrote:
On Tue, 2023-05-16 at 20:23 -0700, Jeff Davis wrote:
Other than that, I took your suggestions almost verbatim. Patch
attached. Thank you!
Attached new patch with a typo fix and a few other edits. I plan to
commit soon.
I did a quicker read through this time. LGTM overall. I like what you
did with the explanations around sensitivity (now it makes sense).
Thanks,
Jonathan
On Wed, 2023-05-17 at 19:59 -0400, Jonathan S. Katz wrote:
I did a quicker read through this time. LGTM overall. I like what you
did with the explanations around sensitivity (now it makes sense).
Committed, thank you.
There are a few things I don't understand that would be good to
document better:
* Rules. I still don't quite understand the use case: are these for
people inventing new languages? What is a plausible use case that isn't
covered by the existing locales and collation settings? Do rules make
sense for a database default collation? Are they for language experts
only or might an ordinary developer benefit from using them?
* The collation types "phonebk", "emoji", etc.: are these variants of
particular locales, or do they make sense in multiple locales? I don't
know where they fit in or how to document them.
* I don't understand what "kc" means if "ks" is not set to "level1".
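To make the "kc" question concrete, here is a minimal sketch of the one combination the settings table does describe, "ks" at level1 together with "kc" set to true. The collation name ignore_accents_keep_case is made up for illustration, and a UTF-8 database is assumed.

```sql
-- Hypothetical collation name; the locale string combines ks=level1 with kc=true.
-- DETERMINISTIC = false is needed for the strength setting to affect equality.
CREATE COLLATION ignore_accents_keep_case
    (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);

-- Accent difference only: level1 ignores accents.
SELECT 'resume' = 'résumé' COLLATE ignore_accents_keep_case;

-- Case difference only: kc adds a separate case level that is still compared.
SELECT 'resume' = 'Resume' COLLATE ignore_accents_keep_case;
```

Going by the table's description, the first comparison should come out true and the second false. What "kc" changes at other "ks" levels is exactly the open question, so the sketch does not try to cover it.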
Regards,
Jeff Davis
On 5/18/23 1:55 PM, Jeff Davis wrote:
On Wed, 2023-05-17 at 19:59 -0400, Jonathan S. Katz wrote:
I did a quicker read through this time. LGTM overall. I like what you
did with the explanations around sensitivity (now it makes sense).
Committed, thank you.
\o/
There are a few things I don't understand that would be good to
document better:
* Rules. I still don't quite understand the use case: are these for
people inventing new languages? What is a plausible use case that isn't
covered by the existing locales and collation settings? Do rules make
sense for a database default collation? Are they for language experts
only or might an ordinary developer benefit from using them?
From my read of them, as an app developer I'd be very unlikely to use
this. Maybe there is something with building out some collation rules
vis-a-vis an extension, but I have trouble imagining the use-case. I may
also not be the target audience for this feature.
* The collation types "phonebk", "emoji", etc.: are these variants of
particular locales, or do they make sense in multiple locales? I don't
know where they fit in or how to document them.
I remember I had an exploratory use case for "phonebk" but I couldn't
figure out how to get it to work. AIUI from random searching, the idea
is that it provides the "phonebook" rules for ordering "names" in a
particular locale, but I couldn't get it to work.
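For anyone who wants to try "phonebk", a minimal sketch follows. It reuses the locale string from the documentation example; the collation name german_phonebook and the sample names are made up for illustration, and a UTF-8 database with the predefined ICU collations is assumed.

```sql
-- Same locale string as the documentation example; hypothetical collation name.
CREATE COLLATION german_phonebook (provider = icu, locale = 'de-u-co-phonebk');

-- Order a few names that differ mainly in umlauts, using the phone book type.
SELECT name
FROM (VALUES ('Göbel'), ('Goethe'), ('Goldmann'), ('Göthe'), ('Götz')) AS t(name)
ORDER BY name COLLATE german_phonebook;

-- The same data under the standard German collation, for comparison.
SELECT name
FROM (VALUES ('Göbel'), ('Goethe'), ('Goldmann'), ('Göthe'), ('Götz')) AS t(name)
ORDER BY name COLLATE "de-x-icu";
```

German phone book ordering is generally described as treating an umlaut like the base vowel followed by "e", so the two queries would be expected to order the umlauted names differently. The exact output depends on the ICU version and is not shown here.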
* I don't understand what "kc" means if "ks" is not set to "level1".
Me neither, but I haven't stared at this as hard as others.
Thanks,
Jonathan
On Fri, 21 Apr 2023 at 22:46, Jeff Davis <pgsql@j-davis.com> wrote:
On Fri, 2023-04-21 at 19:00 +0100, Andrew Gierth wrote:
Also, somewhere along the line someone broke initdb --no-locale, which
should result in C locale being the default everywhere, but when I just
tested it it picked 'en' for an ICU locale, which is not the right
thing.
Fixed, thank you.
As I complain about in [0], since 5cd1a5af --no-locale has been broken
/ behaving outside its description: Instead of being equivalent to
`--locale=C` it now also overrides `--locale-provider=libc`, resulting
in undocumented behaviour.
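One way to see which provider and locales initdb actually picked is to query pg_database after running it with the options in question. This is a minimal sketch, assuming the v15/v16 catalog column names.

```sql
-- Column names as of v15/v16; daticulocale may be named differently elsewhere.
-- datlocprovider is 'c' for libc and 'i' for icu.
SELECT datname, datlocprovider, daticulocale, datcollate, datctype
FROM pg_database
ORDER BY datname;
```

If --no-locale behaved like --locale=C with the libc provider, datcollate and datctype would be expected to show C and daticulocale would be expected to be empty.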
Kind regards,
Matthias van de Meent
Neon, Inc.
[0]: /messages/by-id/CAEze2WiZFQyyb-DcKwayUmE4rY42Bo6kuK9nBjvqRHYxUYJ-DA@mail.gmail.com
On Thu, 2023-05-18 at 13:58 -0400, Jonathan S. Katz wrote:
From my read of them, as an app developer I'd be very unlikely to use
this. Maybe there is something with building out some collation rules
vis-a-vis an extension, but I have trouble imagining the use-case. I may
also not be the target audience for this feature.
That's a problem for the ICU rules feature. I understand some features
may be for domain experts only, but we at least need to call that out
so that ordinary developers don't get confused. And we should hear from
some of those domain experts that they actually want it and it solves a
real problem.
For the features that can be described with collation
settings/attributes right in the locale name, the use cases are more
plausible and we've supported them since v10, so it's good to document
them as best we can. It's hard to expose only the particular ICU
collation settings we understand best (e.g. the "ks" setting that
allows case insensitive collation), so it's inevitable that there will
be some settings that are more obscure and harder to document.
But in the case of ICU rules, they are newly-supported in 16, so there
should be a clear reason we're adding them. Otherwise we're just
setting up users for confusion or problems, and creating backwards-
compatibility headaches for ourselves (and the last thing we want is to
fret over backwards compatibility for a feature with no users).
Beyond that, there seems to be some danger: if the syntax for rules is
not perfectly compatible between ICU versions, the user might run into
big problems.
Regards,
Jeff Davis
On Thu, 2023-05-18 at 20:11 +0200, Matthias van de Meent wrote:
As I complain about in [0], since 5cd1a5af --no-locale has been broken
/ behaving outside its description: Instead of being equivalent to
`--locale=C` it now also overrides `--locale-provider=libc`, resulting
in undocumented behaviour.
I agree that 5cd1a5af is incomplete.
Posting updated patches. Feedback on the approaches below would be
appreciated.
For context, in version 15:
$ initdb -D data --locale-provider=icu --icu-locale=en
=> create database clocale template template0 locale='C';
=> select datname, datlocprovider, daticulocale
from pg_database where datname='clocale';
datname | datlocprovider | daticulocale
---------+----------------+--------------
clocale | i | en
(1 row)
That behavior is confusing, and when I made ICU the default provider in
v16, the confusion was extended into more cases.
If we leave the CREATE DATABASE (and createdb and initdb) syntax in
place, such that LOCALE (and --locale) do not apply to ICU at all, then
I don't see a path to a good ICU user experience.
Therefore I conclude that we need LOCALE (and --locale) to apply to ICU
somehow. (The LOCALE option already applies to ICU during CREATE
COLLATION, just not CREATE DATABASE or initdb.)
Patch 0003 does this. It's fairly straightforward and I believe we need
this patch.
But to actually fix your complaint we also need --no-locale to be
equivalent to --locale=C and for those options to both use memcmp()
semantics. There are several approaches to accomplish this, and I think
this is the part where I most need some feedback. There are only so
many approaches, and each one has some potential downsides, but I
believe we need to select one:
(1) Give up and leave the existing CREATE DATABASE (and createdb, and
initdb) semantics in place, along with the confusing behavior in v15.
This is a last resort, in my opinion. It gives us no path toward a good
user experience with ICU, and leaves us with all of the problems of the
OS as a collation provider.
(2) Automatically change the provider to libc when locale=C.
Almost works, but it's not clear how we handle the case "provider=icu
lc_collate='fr_FR.utf8' locale=C".
If we change it to "provider=libc lc_collate=C", we've overridden the
specified lc_collate. If we ignore the locale=C, that would be
surprising to users. If we throw an error, that would be a backwards
compatibility issue.
One possible solution would be to change the catalog representation to
allow setting the default collation locale separately from datcollate
even for the libc provider. For instance, rename daticulocale to
datdeflocale, and store the default collation locale there for both
libc and ICU. Then, "provider=icu lc_collate='fr_FR.utf8' locale=C"
could be changed into "provider=libc lc_collate='fr_FR.utf8'
deflocale=C". It may be confusing that datcollate is a different
concept from datdeflocale; but then again they are different concepts
and it's confusing that they are currently combined into one.
(3) Support iculocale=C in the ICU provider using the memcmp() path.
In other words, if provider=icu and iculocale=C, lc_collate_is_c() and
lc_ctype_is_c() would both return true.
There's a potential problem for users who've misused ICU in the past
(15 or earlier) by using provider=icu and iculocale=C. ICU would accept
such a locale name, but not recognize it and fall back to the root
locale, so it never worked as the user intended it. But if we redefine
C to be memcmp(), then such users will have broken indexes if they
upgrade.
We could add a check at pg_upgrade time for iculocale=C in versions 15
and earlier, and cause the check (and therefore the upgrade) to fail.
That may be reasonable considering that it never really worked in the
past, and perhaps very few users actually ever created such a
collation. But if some user runs into that problem, we'd have to resort
to a hack like telling them to "update pg_collation set iculocale='und'
where iculocale='C'" and then try the upgrade again, which is not a
great answer (as far as I can tell it would be a correct answer and
should not break their indexes, but it feels pretty dangerous).
There may be some other resolutions to this problem, such as catalog
hacks that allow for different representations of iculocale=C pre-16
and post-16. That doesn't sound great though, and we'd have to figure
out what to do with pg_dump.
(4) Create a new "none" provider (which has no locale and always memcmp
semantics), and automatically change the provider to "none" if
provider=icu and iculocale=C.
This solves the problem case in #2 and the potential upgrade problem in
#3. It also makes the documentation a bit more natural, in my opinion,
even if we retain the special case for provider=libc collate=C.
#4 is the approach I chose (patches 0001 and 0002), but I'd like to
hear what others think.
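As a sketch of what approach #4 would look like from the SQL side: the statement and NOTICE text are taken from the regression test in patch 0002, while the follow-up queries are added here for illustration only. None of this applies to a server without the proposed patches.

```sql
-- Under the proposed patch, this is expected to emit:
--   NOTICE:  using locale provider "none" for ICU locale "C"
CREATE COLLATION testc (provider = icu, locale = 'C');

-- Inspect which provider was actually recorded for the collation.
SELECT collname, collprovider FROM pg_collation WHERE collname = 'testc';

-- With memcmp() semantics, '+' (0x2B) compares lower than '-' (0x2D).
SELECT 'RM(25)/nodes|+|21|1' < 'RM(25)/nodes|-|21|' COLLATE testc;
```

The last comparison is the one from the start of this thread, and under byte-order semantics it should return true, matching the result for collate "C".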
For historical reasons, users may assume that LC_COLLATE controls the
default collation order because that's true in libc. And if their
provider is ICU, they may be surprised that it doesn't. I believe we
could extend each of the above approaches to use LC_COLLATE as the
default for ICU_LOCALE if the former is specified and the latter is
not, and that may make things smoother.
--
Jeff Davis
PostgreSQL Contributor Team - AWS
Attachments:
v6-0002-ICU-for-locale-C-automatically-use-none-provider-.patch (text/x-patch; charset=UTF-8)
From dc7200153a9ac65c2518b32b789d1a9dc4454850 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 8 May 2023 13:48:01 -0700
Subject: [PATCH v6 2/5] ICU: for locale "C", automatically use "none" provider
instead.
Postgres expects locale C to be optimizable to simple locale-unaware
byte operations; while ICU does not recognize the locale "C" at all,
and falls back to the root locale.
If the user specifies locale "C" when creating a new collation or a
new database with the ICU provider, automatically switch it to the
"none" provider.
If provider is libc, behavior is unchanged.
---
doc/src/sgml/charset.sgml | 6 +++
doc/src/sgml/ref/create_collation.sgml | 6 +++
doc/src/sgml/ref/create_database.sgml | 5 +++
doc/src/sgml/ref/createdb.sgml | 5 +++
doc/src/sgml/ref/initdb.sgml | 5 +++
src/backend/commands/collationcmds.c | 17 ++++++++
src/backend/commands/dbcommands.c | 21 ++++++++++
src/bin/initdb/initdb.c | 10 +++++
src/bin/initdb/t/001_initdb.pl | 39 +++++++++++++++++++
src/bin/scripts/createdb.c | 11 ++++++
src/bin/scripts/t/020_createdb.pl | 12 ++++++
.../regress/expected/collate.icu.utf8.out | 14 +++++--
src/test/regress/sql/collate.icu.utf8.sql | 6 +++
13 files changed, 154 insertions(+), 3 deletions(-)
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 7a791a2b7c..68bad646e9 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -405,6 +405,12 @@ initdb --locale-provider=icu --icu-locale=en
change in results. <literal>LC_COLLATE</literal> and
<literal>LC_CTYPE</literal> can be set independently of the ICU locale.
</para>
+ <para>
+ The ICU provider does not accept the <literal>C</literal>
+ locale. Commands that create collations or databases with the
+ <literal>icu</literal> provider and ICU locale <literal>C</literal> use
+ the provider <literal>none</literal> instead.
+ </para>
<note>
<para>
For the ICU provider, results may depend on the version of the ICU
diff --git a/doc/src/sgml/ref/create_collation.sgml b/doc/src/sgml/ref/create_collation.sgml
index 5489ae7413..1ac41831d8 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -126,6 +126,12 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
<literal>libc</literal> is the default. See <xref
linkend="locale-providers"/> for details.
</para>
+ <para>
+ If the provider is <literal>icu</literal> and the locale is
+ <literal>C</literal> or <literal>POSIX</literal>, the provider is
+ automatically set to <literal>none</literal>, as the ICU provider
+ doesn't support an ICU locale of <literal>C</literal>.
+ </para>
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 60b9da0952..c730d02e15 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -190,6 +190,11 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
<para>
Specifies the ICU locale ID if the ICU locale provider is used.
</para>
+ <para>
+ If specified as <literal>C</literal> or <literal>POSIX</literal>, the
+ provider is automatically set to <literal>none</literal>, as the ICU
+ provider doesn't support an ICU locale of <literal>C</literal>.
+ </para>
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index 326a371d34..7c573e848a 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -154,6 +154,11 @@ PostgreSQL documentation
Specifies the ICU locale ID to be used in this database, if the
ICU locale provider is selected.
</para>
+ <para>
+ If specified as <literal>C</literal> or <literal>POSIX</literal>, the
+ provider is automatically set to <literal>none</literal>, as the ICU
+ provider doesn't support an ICU locale of <literal>C</literal>.
+ </para>
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index e604ab48b7..76993acdfe 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -250,6 +250,11 @@ PostgreSQL documentation
Specifies the ICU locale when the ICU provider is used. Locale support
is described in <xref linkend="locale"/>.
</para>
+ <para>
+ If specified as <literal>C</literal> or <literal>POSIX</literal>, the
+ provider is automatically set to <literal>none</literal>, as the ICU
+ provider doesn't support an ICU locale of <literal>C</literal>.
+ </para>
<para>
If this option is not specified, the locale is inherited from the
environment in which <command>initdb</command> runs. The environment's
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index aeaf6c419e..8bc6f8347d 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -254,6 +254,23 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
if (lcctypeEl)
collctype = defGetString(lcctypeEl);
+ /*
+ * Postgres defines the "C" (and equivalently, "POSIX") locales to be
+ * optimizable to byte operations (memcmp(), pg_ascii_tolower(),
+ * etc.); transform into the "none" provider. Don't transform during
+ * binary upgrade.
+ */
+ if (!IsBinaryUpgrade && collprovider == COLLPROVIDER_ICU &&
+ colliculocale && (pg_strcasecmp(colliculocale, "C") == 0 ||
+ pg_strcasecmp(colliculocale, "POSIX") == 0))
+ {
+ ereport(NOTICE,
+ (errmsg("using locale provider \"none\" for ICU locale \"%s\"",
+ colliculocale)));
+ colliculocale = NULL;
+ collprovider = COLLPROVIDER_NONE;
+ }
+
if (collprovider == COLLPROVIDER_LIBC)
{
if (!collcollate)
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 9e73f54803..6dc737aebb 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1043,6 +1043,27 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
check_encoding_locale_matches(encoding, dbcollate, dbctype);
+ /*
+ * Postgres defines the "C" (and equivalently, "POSIX") locales to be
+ * optimizable to byte operations (memcmp(), pg_ascii_tolower(), etc.);
+ * transform into the "none" provider.
+ *
+ * Don't transform during binary upgrade or when both the provider and ICU
+ * locale are unchanged from the template.
+ */
+ if (!IsBinaryUpgrade && dblocprovider == COLLPROVIDER_ICU &&
+ (src_locprovider != COLLPROVIDER_ICU ||
+ strcmp(dbiculocale, src_iculocale) != 0) &&
+ dbiculocale && (pg_strcasecmp(dbiculocale, "C") == 0 ||
+ pg_strcasecmp(dbiculocale, "POSIX") == 0))
+ {
+ ereport(NOTICE,
+ (errmsg("using locale provider \"none\" for ICU locale \"%s\"",
+ dbiculocale)));
+ dbiculocale = NULL;
+ dblocprovider = COLLPROVIDER_NONE;
+ }
+
if (dblocprovider == COLLPROVIDER_ICU)
{
if (!(is_encoding_supported_by_icu(encoding)))
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 4a6cad3cb9..e5ec2a243e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2440,6 +2440,16 @@ setlocales(void)
lc_messages = locale;
}
+ if (icu_locale && locale_provider == COLLPROVIDER_ICU &&
+ (pg_strcasecmp(icu_locale, "C") == 0 ||
+ pg_strcasecmp(icu_locale, "POSIX") == 0))
+ {
+ pg_log_info("using locale provider \"none\" for ICU locale \"%s\"",
+ icu_locale);
+ icu_locale = NULL;
+ locale_provider = COLLPROVIDER_NONE;
+ }
+
/*
* canonicalize locale names, and obtain any missing values from our
* current environment
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index fe6d224e5b..ea92b08511 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -111,6 +111,45 @@ if ($ENV{with_icu} eq 'yes')
],
'option --icu-locale');
+ # transformed to provider=none
+ command_ok(
+ [
+ 'initdb', '--no-sync',
+ '--locale-provider=icu', '--icu-locale=C',
+ "$tempdir/data4a"
+ ],
+ 'option --icu-locale=C');
+
+ # transformed to provider=none
+ command_ok(
+ [
+ 'initdb', '--no-sync',
+ '--locale-provider=icu', '--icu-locale=C',
+ '--locale=C',
+ "$tempdir/data4b"
+ ],
+ 'option --icu-locale=C --locale=C');
+
+ # transformed to provider=none
+ command_ok(
+ [
+ 'initdb', '--no-sync',
+ '--locale-provider=icu', '--icu-locale=C',
+ '--lc-collate=C',
+ "$tempdir/data4c"
+ ],
+ 'option --icu-locale=C --lc-collate=C');
+
+ # transformed to provider=none
+ command_ok(
+ [
+ 'initdb', '--no-sync',
+ '--locale-provider=icu', '--icu-locale=C',
+ '--lc-ctype=C',
+ "$tempdir/data4d"
+ ],
+ 'option --icu-locale=C --lc-ctype=C');
+
command_fails_like(
[
'initdb', '--no-sync',
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index 79367d933b..9caf9190cf 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -172,6 +172,17 @@ main(int argc, char *argv[])
lc_collate = locale;
}
+ if (locale_provider && pg_strcasecmp(locale_provider, "icu") == 0 &&
+ icu_locale &&
+ (pg_strcasecmp(icu_locale, "C") == 0 ||
+ pg_strcasecmp(icu_locale, "POSIX") == 0))
+ {
+ pg_log_info("using locale provider \"none\" for ICU locale \"%s\"",
+ icu_locale);
+ icu_locale = NULL;
+ locale_provider = "none";
+ }
+
if (encoding)
{
if (pg_char_to_encoding(encoding) < 0)
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index 5aa658b671..eb3682f0fd 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -75,6 +75,18 @@ if ($ENV{with_icu} eq 'yes')
$node2->command_ok(
[ 'createdb', '-T', 'template0', '--icu-locale', 'en-US', 'foobar56' ],
'create database with icu locale from template database with icu provider');
+
+ # transformed into provider "none"
+ $node->command_ok(
+ [ 'createdb', '-T', 'template0', '--locale-provider=icu', '--icu-locale=C',
+ 'test_none_icu1' ],
+ 'create database with provider "icu" and ICU_LOCALE="C"');
+
+ # transformed into provider "none"
+ $node->command_ok(
+ [ 'createdb', '-T', 'template0', '--locale-provider=icu', '--icu-locale=C',
+ '--lc-ctype=C', 'test_none_icu_2' ],
+ 'create database with provider "icu" and ICU_LOCALE="C" and LC_CTYPE=C');
}
else
{
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index c658ee1404..7c186e9f69 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1035,6 +1035,9 @@ BEGIN
END
$$;
RESET client_min_messages;
+-- uses "none" provider instead
+CREATE COLLATION testc (provider = icu, locale='C');
+NOTICE: using locale provider "none" for ICU locale "C"
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
ERROR: parameter "locale" must be specified
SET icu_validation_level = ERROR;
@@ -1058,8 +1061,11 @@ SELECT collname FROM pg_collation WHERE collname LIKE 'test%' ORDER BY 1;
test0
test1
test5
-(3 rows)
+ testc
+(4 rows)
+DROP COLLATION test1;
+CREATE COLLATION test1 (provider = icu, locale = 'und');
ALTER COLLATION test1 RENAME TO test11;
ALTER COLLATION test0 RENAME TO test11; -- fail
ERROR: collation "test11" already exists in schema "collate_tests"
@@ -1079,7 +1085,8 @@ SELECT collname, nspname, obj_description(pg_collation.oid, 'pg_collation')
test0 | collate_tests | US English
test11 | test_schema |
test5 | collate_tests |
-(3 rows)
+ testc | collate_tests |
+(4 rows)
DROP COLLATION test0, test_schema.test11, test5;
DROP COLLATION test0; -- fail
@@ -1089,7 +1096,8 @@ NOTICE: collation "test0" does not exist, skipping
SELECT collname FROM pg_collation WHERE collname LIKE 'test%';
collname
----------
-(0 rows)
+ testc
+(1 row)
DROP SCHEMA test_schema;
DROP ROLE regress_test_role;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index 7bd0901281..e59200df9a 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -375,6 +375,9 @@ $$;
RESET client_min_messages;
+-- uses "none" provider instead
+CREATE COLLATION testc (provider = icu, locale='C');
+
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
SET icu_validation_level = ERROR;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
@@ -388,6 +391,9 @@ CREATE COLLATION test5 FROM test0;
SELECT collname FROM pg_collation WHERE collname LIKE 'test%' ORDER BY 1;
+DROP COLLATION test1;
+CREATE COLLATION test1 (provider = icu, locale = 'und');
+
ALTER COLLATION test1 RENAME TO test11;
ALTER COLLATION test0 RENAME TO test11; -- fail
ALTER COLLATION test1 RENAME TO test22; -- fail
--
2.34.1
v6-0003-Make-LOCALE-apply-to-ICU_LOCALE-for-CREATE-DATABA.patch (text/x-patch; charset=UTF-8)
From c04053021eaa6db480143393a7de83525a8f4f7e Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 25 Apr 2023 15:01:55 -0700
Subject: [PATCH v6 3/5] Make LOCALE apply to ICU_LOCALE for CREATE DATABASE.
LOCALE is now an alias for LC_COLLATE, LC_CTYPE, and (if the provider
is ICU) ICU_LOCALE. The ICU provider accepts more locale names than
libc (e.g. language tags and locale names containing collation
attributes), so in some cases LC_COLLATE, LC_CTYPE, and ICU_LOCALE
will still need to be specified separately.
Previously, LOCALE applied only to LC_COLLATE and LC_CTYPE (and
similarly for --locale in initdb and createdb). That could lead to
confusion when the provider is implicit, such as when it is inherited
from the template database, or when ICU was made default at initdb
time in commit 27b62377b4.
Reverts incomplete fix 5cd1a5af4d.
Discussion: https://postgr.es/m/3391932.1682107209@sss.pgh.pa.us
---
doc/src/sgml/ref/create_database.sgml | 6 +++--
doc/src/sgml/ref/createdb.sgml | 5 +++-
doc/src/sgml/ref/initdb.sgml | 7 +++---
src/backend/commands/collationcmds.c | 2 +-
src/backend/commands/dbcommands.c | 15 ++++++++----
src/bin/initdb/initdb.c | 11 ++++++---
src/bin/scripts/createdb.c | 13 ++++-------
src/bin/scripts/t/020_createdb.pl | 4 ++--
src/test/icu/t/010_database.pl | 23 ++++++++++++-------
.../regress/expected/collate.icu.utf8.out | 22 +++++++++---------
10 files changed, 65 insertions(+), 43 deletions(-)
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index c730d02e15..dc57ba0c8b 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -145,8 +145,10 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
<term><replaceable class="parameter">locale</replaceable></term>
<listitem>
<para>
- This is a shortcut for setting <symbol>LC_COLLATE</symbol>
- and <symbol>LC_CTYPE</symbol> at once.
+ This is a shortcut for setting <symbol>LC_COLLATE</symbol>,
+ <symbol>LC_CTYPE</symbol> and <symbol>ICU_LOCALE</symbol> at
+ once. Some locales are only valid for ICU, and must be set separately
+ with <symbol>ICU_LOCALE</symbol>.
</para>
<tip>
<para>
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index 7c573e848a..7991153ecc 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -124,7 +124,10 @@ PostgreSQL documentation
<listitem>
<para>
Specifies the locale to be used in this database. This is equivalent
- to specifying both <option>--lc-collate</option> and <option>--lc-ctype</option>.
+ to specifying <option>--lc-collate</option>,
+ <option>--lc-ctype</option>, and <option>--icu-locale</option> to the
+ same value. Some locales are only valid for ICU and must be set with
+ <option>--icu-locale</option>.
</para>
</listitem>
</varlistentry>
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 76993acdfe..d9ef21c422 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -116,9 +116,10 @@ PostgreSQL documentation
<para>
To choose a different locale for the cluster, use the option
<option>--locale</option>. There are also individual options
- <option>--lc-*</option> (see below) to set values for the individual locale
- categories. Note that inconsistent settings for different locale
- categories can give nonsensical results, so this should be used with care.
+ <option>--lc-*</option> and <option>--icu-locale</option> (see below) to
+ set values for the individual locale categories. Note that inconsistent
+ settings for different locale categories can give nonsensical results, so
+ this should be used with care.
</para>
<para>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 8bc6f8347d..21615746f9 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -302,7 +302,7 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
if (langtag && strcmp(colliculocale, langtag) != 0)
{
ereport(NOTICE,
- (errmsg("using standard form \"%s\" for locale \"%s\"",
+ (errmsg("using standard form \"%s\" for ICU locale \"%s\"",
langtag, colliculocale)));
colliculocale = langtag;
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 6dc737aebb..154f20573c 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1019,7 +1019,12 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
if (dblocprovider == '\0')
dblocprovider = src_locprovider;
if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU)
- dbiculocale = src_iculocale;
+ {
+ if (dlocale && dlocale->arg)
+ dbiculocale = defGetString(dlocale);
+ else
+ dbiculocale = src_iculocale;
+ }
if (dbicurules == NULL && dblocprovider == COLLPROVIDER_ICU)
dbicurules = src_icurules;
@@ -1033,12 +1038,14 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
if (!check_locale(LC_COLLATE, dbcollate, &canonname))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("invalid locale name: \"%s\"", dbcollate)));
+ errmsg("invalid LC_COLLATE locale name: \"%s\"", dbcollate),
+ errhint("If the locale name is specific to ICU, use ICU_LOCALE.")));
dbcollate = canonname;
if (!check_locale(LC_CTYPE, dbctype, &canonname))
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
- errmsg("invalid locale name: \"%s\"", dbctype)));
+ errmsg("invalid LC_CTYPE locale name: \"%s\"", dbctype),
+ errhint("If the locale name is specific to ICU, use ICU_LOCALE.")));
dbctype = canonname;
check_encoding_locale_matches(encoding, dbcollate, dbctype);
@@ -1094,7 +1101,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
if (langtag && strcmp(dbiculocale, langtag) != 0)
{
ereport(NOTICE,
- (errmsg("using standard form \"%s\" for locale \"%s\"",
+ (errmsg("using standard form \"%s\" for ICU locale \"%s\"",
langtag, dbiculocale)));
dbiculocale = langtag;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index e5ec2a243e..f0827154cd 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2164,7 +2164,11 @@ check_locale_name(int category, const char *locale, char **canonname)
if (res == NULL)
{
if (*locale)
- pg_fatal("invalid locale name \"%s\"", locale);
+ {
+ pg_log_error("invalid locale name \"%s\"", locale);
+ pg_log_error_hint("If the locale name is specific to ICU, use --icu-locale.");
+ exit(1);
+ }
else
{
/*
@@ -2406,7 +2410,7 @@ setlocales(void)
{
char *canonname;
- /* set empty lc_* values to locale config if set */
+ /* set empty lc_* and iculocale values to locale config if set */
if (locale_provider == COLLPROVIDER_NONE)
{
@@ -2438,6 +2442,8 @@ setlocales(void)
lc_monetary = locale;
if (!lc_messages)
lc_messages = locale;
+ if (!icu_locale && locale_provider == COLLPROVIDER_ICU)
+ icu_locale = locale;
}
if (icu_locale && locale_provider == COLLPROVIDER_ICU &&
@@ -3331,7 +3337,6 @@ main(int argc, char *argv[])
break;
case 8:
locale = "C";
- locale_provider = COLLPROVIDER_LIBC;
break;
case 9:
pwfilename = pg_strdup(optarg);
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index 9caf9190cf..51c4bb3592 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -164,14 +164,6 @@ main(int argc, char *argv[])
exit(1);
}
- if (locale)
- {
- if (!lc_ctype)
- lc_ctype = locale;
- if (!lc_collate)
- lc_collate = locale;
- }
-
if (locale_provider && pg_strcasecmp(locale_provider, "icu") == 0 &&
icu_locale &&
(pg_strcasecmp(icu_locale, "C") == 0 ||
@@ -230,6 +222,11 @@ main(int argc, char *argv[])
appendPQExpBuffer(&sql, " STRATEGY %s", fmtId(strategy));
if (template)
appendPQExpBuffer(&sql, " TEMPLATE %s", fmtId(template));
+ if (locale)
+ {
+ appendPQExpBufferStr(&sql, " LOCALE ");
+ appendStringLiteralConn(&sql, locale, conn);
+ }
if (lc_collate)
{
appendPQExpBufferStr(&sql, " LC_COLLATE ");
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index eb3682f0fd..81a9931c09 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -167,7 +167,7 @@ $node->command_checks_all(
1,
[qr/^$/],
[
- qr/^createdb: error: database creation failed: ERROR: invalid locale name|^createdb: error: database creation failed: ERROR: new collation \(foo'; SELECT '1\) is incompatible with the collation of the template database/s
+ qr/^createdb: error: database creation failed: ERROR: invalid LC_COLLATE locale name|^createdb: error: database creation failed: ERROR: new collation \(foo'; SELECT '1\) is incompatible with the collation of the template database/s
],
'createdb with incorrect --lc-collate');
$node->command_checks_all(
@@ -175,7 +175,7 @@ $node->command_checks_all(
1,
[qr/^$/],
[
- qr/^createdb: error: database creation failed: ERROR: invalid locale name|^createdb: error: database creation failed: ERROR: new LC_CTYPE \(foo'; SELECT '1\) is incompatible with the LC_CTYPE of the template database/s
+ qr/^createdb: error: database creation failed: ERROR: invalid LC_CTYPE locale name|^createdb: error: database creation failed: ERROR: new LC_CTYPE \(foo'; SELECT '1\) is incompatible with the LC_CTYPE of the template database/s
],
'createdb with incorrect --lc-ctype');
diff --git a/src/test/icu/t/010_database.pl b/src/test/icu/t/010_database.pl
index 715b1bffd6..df4af00afe 100644
--- a/src/test/icu/t/010_database.pl
+++ b/src/test/icu/t/010_database.pl
@@ -51,16 +51,23 @@ b),
'sort by explicit collation upper first');
-# Test error cases in CREATE DATABASE involving locale-related options
+# Test that LOCALE='C' works for ICU
-my ($ret, $stdout, $stderr) = $node1->psql('postgres',
- q{CREATE DATABASE dbicu LOCALE_PROVIDER icu LOCALE 'C' TEMPLATE template0 ENCODING UTF8});
-isnt($ret, 0,
- "ICU locale must be specified for ICU provider: exit code not 0");
+my $ret1 = $node1->psql('postgres',
+ q{CREATE DATABASE dbicu2 LOCALE_PROVIDER icu LOCALE 'C' TEMPLATE template0 ENCODING UTF8});
+is($ret1, 0,
+ "C locale works for ICU");
+
+# Test that ICU-specific locale string must be specified with ICU_LOCALE,
+# not LOCALE
+
+my ($ret2, $stdout, $stderr) = $node1->psql('postgres',
+ q{CREATE DATABASE dbicu3 LOCALE_PROVIDER icu LOCALE '@colStrength=primary' TEMPLATE template0 ENCODING UTF8});
+isnt($ret2, 0,
+ "ICU-specific locale must be specified with ICU_LOCALE: exit code not 0");
like(
$stderr,
- qr/ERROR: ICU locale must be specified/,
- "ICU locale must be specified for ICU provider: error message");
-
+ qr/ERROR: invalid LC_COLLATE locale name/,
+ "ICU-specific locale must be specified with ICU_LOCALE: error message");
done_testing();
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index 7c186e9f69..cf1852c89d 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1202,9 +1202,9 @@ SELECT 'coté' < 'côte' COLLATE "und-x-icu", 'coté' > 'côte' COLLATE testcoll
(1 row)
CREATE COLLATION testcoll_lower_first (provider = icu, locale = '@colCaseFirst=lower');
-NOTICE: using standard form "und-u-kf-lower" for locale "@colCaseFirst=lower"
+NOTICE: using standard form "und-u-kf-lower" for ICU locale "@colCaseFirst=lower"
CREATE COLLATION testcoll_upper_first (provider = icu, locale = '@colCaseFirst=upper');
-NOTICE: using standard form "und-u-kf-upper" for locale "@colCaseFirst=upper"
+NOTICE: using standard form "und-u-kf-upper" for ICU locale "@colCaseFirst=upper"
SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcoll_upper_first;
?column? | ?column?
----------+----------
@@ -1212,7 +1212,7 @@ SELECT 'aaa' < 'AAA' COLLATE testcoll_lower_first, 'aaa' > 'AAA' COLLATE testcol
(1 row)
CREATE COLLATION testcoll_shifted (provider = icu, locale = '@colAlternate=shifted');
-NOTICE: using standard form "und-u-ka-shifted" for locale "@colAlternate=shifted"
+NOTICE: using standard form "und-u-ka-shifted" for ICU locale "@colAlternate=shifted"
SELECT 'de-luge' < 'deanza' COLLATE "und-x-icu", 'de-luge' > 'deanza' COLLATE testcoll_shifted;
?column? | ?column?
----------+----------
@@ -1229,12 +1229,12 @@ SELECT 'A-21' > 'A-123' COLLATE "und-x-icu", 'A-21' < 'A-123' COLLATE testcoll_n
(1 row)
CREATE COLLATION testcoll_error1 (provider = icu, locale = '@colNumeric=lower');
-NOTICE: using standard form "und-u-kn-lower" for locale "@colNumeric=lower"
+NOTICE: using standard form "und-u-kn-lower" for ICU locale "@colNumeric=lower"
ERROR: could not open collator for locale "und-u-kn-lower": U_ILLEGAL_ARGUMENT_ERROR
-- test that attributes not handled by icu_set_collation_attributes()
-- (handled by ucol_open() directly) also work
CREATE COLLATION testcoll_de_phonebook (provider = icu, locale = 'de@collation=phonebook');
-NOTICE: using standard form "de-u-co-phonebk" for locale "de@collation=phonebook"
+NOTICE: using standard form "de-u-co-phonebk" for ICU locale "de@collation=phonebook"
SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE testcoll_de_phonebook;
?column? | ?column?
----------+----------
@@ -1243,7 +1243,7 @@ SELECT 'Goldmann' < 'Götz' COLLATE "de-x-icu", 'Goldmann' > 'Götz' COLLATE tes
-- rules
CREATE COLLATION testcoll_rules1 (provider = icu, locale = '', rules = '&a < g');
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE TABLE test7 (a text);
-- example from https://unicode-org.github.io/icu/userguide/collation/customization/#syntax
INSERT INTO test7 VALUES ('Abernathy'), ('apple'), ('bird'), ('Boston'), ('Graham'), ('green');
@@ -1271,13 +1271,13 @@ SELECT * FROM test7 ORDER BY a COLLATE testcoll_rules1;
DROP TABLE test7;
CREATE COLLATION testcoll_rulesx (provider = icu, locale = '', rules = '!!wrong!!');
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
ERROR: could not open collator for locale "und" with rules "!!wrong!!": U_INVALID_FORMAT_ERROR
-- nondeterministic collations
CREATE COLLATION ctest_det (provider = icu, locale = '', deterministic = true);
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE COLLATION ctest_nondet (provider = icu, locale = '', deterministic = false);
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE TABLE test6 (a int, b text);
-- same string in different normal forms
INSERT INTO test6 VALUES (1, U&'\00E4bc');
@@ -1327,9 +1327,9 @@ SELECT * FROM test6a WHERE b = ARRAY['äbc'] COLLATE ctest_nondet;
(2 rows)
CREATE COLLATION case_sensitive (provider = icu, locale = '');
-NOTICE: using standard form "und" for locale ""
+NOTICE: using standard form "und" for ICU locale ""
CREATE COLLATION case_insensitive (provider = icu, locale = '@colStrength=secondary', deterministic = false);
-NOTICE: using standard form "und-u-ks-level2" for locale "@colStrength=secondary"
+NOTICE: using standard form "und-u-ks-level2" for ICU locale "@colStrength=secondary"
SELECT 'abc' <= 'ABC' COLLATE case_sensitive, 'abc' >= 'ABC' COLLATE case_sensitive;
?column? | ?column?
----------+----------
--
2.34.1
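As a reading aid for the dbcommands.c hunk in the patch above, the precedence it gives the ICU locale in CREATE DATABASE can be sketched as a standalone function. This is only a model under stated assumptions, not the backend code: `resolve_icu_locale`, the `provider` enum, and the parameter names are hypothetical, and the sketch simply returns NULL for a non-ICU provider rather than reproducing the backend's handling of that case.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the backend's provider codes. */
enum provider
{
	PROVIDER_LIBC = 'c',
	PROVIDER_ICU = 'i'
};

/*
 * Model of the lookup order after the patch:
 *   1. an explicit ICU_LOCALE option,
 *   2. otherwise the LOCALE option,
 *   3. otherwise the ICU locale inherited from the template database.
 * Returns NULL when the provider is not ICU (a simplification).
 */
const char *
resolve_icu_locale(enum provider prov,
				   const char *icu_locale_opt,	/* ICU_LOCALE, or NULL */
				   const char *locale_opt,		/* LOCALE, or NULL */
				   const char *template_iculocale)
{
	if (prov != PROVIDER_ICU)
		return NULL;
	if (icu_locale_opt != NULL)
		return icu_locale_opt;
	if (locale_opt != NULL)
		return locale_opt;
	return template_iculocale;
}
```

Under this model, a call with provider ICU, no ICU_LOCALE, and LOCALE 'C' picks "C" as the ICU locale candidate instead of silently inheriting the template's value, which is the confusion the commit message describes.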
v6-0004-Add-default_collation_provider-GUC.patch
From 2a857a2cb080dbc015c59b89acbb195ae7991a99 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Thu, 11 May 2023 12:54:31 -0700
Subject: [PATCH v6 4/5] Add default_collation_provider GUC.
Controls default collation provider for CREATE COLLATION. Does not
affect CREATE DATABASE, which gets its default from the template
database.
---
doc/src/sgml/config.sgml | 17 +++++++++++++
doc/src/sgml/ref/create_collation.sgml | 15 ++++++++---
src/backend/commands/collationcmds.c | 8 +++++-
src/backend/utils/misc/guc_tables.c | 18 +++++++++++++
src/backend/utils/misc/postgresql.conf.sample | 4 +++
src/include/commands/collationcmds.h | 2 ++
.../regress/expected/collate.icu.utf8.out | 25 +++++++++++++++++++
src/test/regress/sql/collate.icu.utf8.sql | 13 ++++++++++
8 files changed, 97 insertions(+), 5 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 18ce06729b..58a1046340 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9820,6 +9820,23 @@ SET XML OPTION { DOCUMENT | CONTENT };
</listitem>
</varlistentry>
+ <varlistentry id="guc-default-collation-provider" xreflabel="default_collation_provider">
+ <term><varname>default_collation_provider</varname> (<type>enum</type>)
+ <indexterm>
+ <primary><varname>default_collation_provider</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Default collation provider for <command>CREATE
+ COLLATION</command>. Does not affect <command>CREATE
+ DATABASE</command>, which gets the default collation provider from the
+ template database. Valid values are <literal>icu</literal> and
+ <literal>libc</literal>. The default is <literal>libc</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-icu-validation-level" xreflabel="icu_validation_level">
<term><varname>icu_validation_level</varname> (<type>enum</type>)
<indexterm>
diff --git a/doc/src/sgml/ref/create_collation.sgml b/doc/src/sgml/ref/create_collation.sgml
index 1ac41831d8..c9b3e6e218 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -121,10 +121,17 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
<para>
Specifies the provider to use for locale services associated with this
collation. Possible values are <literal>none</literal>,
- <literal>icu</literal><indexterm><primary>ICU</primary></indexterm>
- (if the server was built with ICU support) or <literal>libc</literal>.
- <literal>libc</literal> is the default. See <xref
- linkend="locale-providers"/> for details.
+ <literal>icu</literal><indexterm><primary>ICU</primary></indexterm> (if
+ the server was built with ICU support) or <literal>libc</literal>. See
+ <xref linkend="locale-providers"/> for details.
+ </para>
+ <para>
+ If <replaceable>provider</replaceable> is not specified, and
+ <replaceable>lc_collate</replaceable> or
+ <replaceable>lc_ctype</replaceable> is specified, the
+ <literal>libc</literal> provider is used. Otherwise, the default
+ provider is controlled by <xref
+ linkend="guc-default-collation-provider"/>.
</para>
<para>
If the provider is <literal>icu</literal> and the locale is
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 21615746f9..25e8d32fd9 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -47,6 +47,7 @@ typedef struct
int enc; /* encoding */
} CollAliasData;
+int default_collation_provider = (int) COLLPROVIDER_LIBC;
/*
* CREATE COLLATION
@@ -228,7 +229,12 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
collproviderstr)));
}
else
- collprovider = COLLPROVIDER_LIBC;
+ {
+ if (lccollateEl || lcctypeEl)
+ collprovider = COLLPROVIDER_LIBC;
+ else
+ collprovider = (char) default_collation_provider;
+ }
if (collprovider == COLLPROVIDER_NONE
&& (localeEl || lccollateEl || lcctypeEl))
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 844781a7f5..901cfda819 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -35,8 +35,10 @@
#include "access/xlogrecovery.h"
#include "archive/archive_module.h"
#include "catalog/namespace.h"
+#include "catalog/pg_collation.h"
#include "catalog/storage.h"
#include "commands/async.h"
+#include "commands/collationcmds.h"
#include "commands/tablespace.h"
#include "commands/trigger.h"
#include "commands/user.h"
@@ -166,6 +168,12 @@ static const struct config_enum_entry intervalstyle_options[] = {
{NULL, 0, false}
};
+static const struct config_enum_entry collation_provider_options[] = {
+ {"icu", (int) 'i', false},
+ {"libc", (int) 'c', false},
+ {NULL, 0, false}
+};
+
static const struct config_enum_entry icu_validation_level_options[] = {
{"disabled", -1, false},
{"debug5", DEBUG5, false},
@@ -4683,6 +4691,16 @@ struct config_enum ConfigureNamesEnum[] =
NULL, NULL, NULL
},
+ {
+ {"default_collation_provider", PGC_USERSET, CLIENT_CONN_LOCALE,
+ gettext_noop("Default collation provider for CREATE COLLATION."),
+ NULL
+ },
+ &default_collation_provider,
+ (int) COLLPROVIDER_LIBC, collation_provider_options,
+ NULL, NULL, NULL
+ },
+
{
{"icu_validation_level", PGC_USERSET, CLIENT_CONN_LOCALE,
gettext_noop("Log level for reporting invalid ICU locale strings."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c8018da04a..c1f247378d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -734,6 +734,10 @@
#lc_numeric = 'C' # locale for number formatting
#lc_time = 'C' # locale for time formatting
+#default_collation_provider = 'libc' # default collation provider
+ # for CREATE COLLATION
+ # (icu, libc)
+
#icu_validation_level = WARNING # report ICU locale validation
# errors at the given level
diff --git a/src/include/commands/collationcmds.h b/src/include/commands/collationcmds.h
index b76c7b3dc3..f54389525d 100644
--- a/src/include/commands/collationcmds.h
+++ b/src/include/commands/collationcmds.h
@@ -18,6 +18,8 @@
#include "catalog/objectaddress.h"
#include "parser/parse_node.h"
+extern PGDLLIMPORT int default_collation_provider;
+
extern ObjectAddress DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_exists);
extern void IsThereCollationInNamespace(const char *collname, Oid nspOid);
extern ObjectAddress AlterCollation(AlterCollationStmt *stmt);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index cf1852c89d..ea96e27f45 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1038,6 +1038,31 @@ RESET client_min_messages;
-- uses "none" provider instead
CREATE COLLATION testc (provider = icu, locale='C');
NOTICE: using locale provider "none" for ICU locale "C"
+SET default_collation_provider = 'libc';
+CREATE COLLATION def_libc (LOCALE = 'C');
+SELECT collname, collprovider FROM pg_collation WHERE collname='def_libc';
+ collname | collprovider
+----------+--------------
+ def_libc | c
+(1 row)
+
+DROP COLLATION def_libc;
+SET default_collation_provider = 'icu';
+CREATE COLLATION def_icu (LOCALE = 'und');
+SELECT collname, collprovider FROM pg_collation WHERE collname='def_icu';
+ collname | collprovider
+----------+--------------
+ def_icu | i
+(1 row)
+
+CREATE COLLATION def_libc (LC_COLLATE = 'C', LC_CTYPE='C');
+SELECT collname, collprovider FROM pg_collation WHERE collname='def_libc';
+ collname | collprovider
+----------+--------------
+ def_libc | c
+(1 row)
+
+RESET default_collation_provider;
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
ERROR: parameter "locale" must be specified
SET icu_validation_level = ERROR;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index e59200df9a..ee607ca3a5 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -378,6 +378,19 @@ RESET client_min_messages;
-- uses "none" provider instead
CREATE COLLATION testc (provider = icu, locale='C');
+SET default_collation_provider = 'libc';
+CREATE COLLATION def_libc (LOCALE = 'C');
+SELECT collname, collprovider FROM pg_collation WHERE collname='def_libc';
+DROP COLLATION def_libc;
+
+SET default_collation_provider = 'icu';
+CREATE COLLATION def_icu (LOCALE = 'und');
+SELECT collname, collprovider FROM pg_collation WHERE collname='def_icu';
+CREATE COLLATION def_libc (LC_COLLATE = 'C', LC_CTYPE='C');
+SELECT collname, collprovider FROM pg_collation WHERE collname='def_libc';
+
+RESET default_collation_provider;
+
CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, needs "locale"
SET icu_validation_level = ERROR;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
--
2.34.1
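The provider selection that the patch above adds to DefineCollation() can be summarized with a small standalone sketch. It is a simplified model, not the patch itself: `choose_collation_provider` and its parameters are hypothetical names, the provider is passed as an already-parsed character code ('\0' meaning "not specified"), and error handling is left out.

```c
#include <stdbool.h>

/* Provider codes as stored in pg_collation.collprovider. */
#define SKETCH_PROVIDER_LIBC 'c'
#define SKETCH_PROVIDER_ICU  'i'

/*
 * Model of the decision order for CREATE COLLATION:
 *   1. an explicit PROVIDER option always wins;
 *   2. without one, LC_COLLATE or LC_CTYPE implies libc;
 *   3. otherwise the default_collation_provider setting applies.
 */
char
choose_collation_provider(char explicit_provider,	/* '\0' if not given */
						  bool has_lc_collate,
						  bool has_lc_ctype,
						  char guc_default)
{
	if (explicit_provider != '\0')
		return explicit_provider;
	if (has_lc_collate || has_lc_ctype)
		return SKETCH_PROVIDER_LIBC;
	return guc_default;
}
```

This mirrors the regression tests in the patch: with the setting at 'icu', a collation created with only LOCALE ends up with provider i, while one created with LC_COLLATE and LC_CTYPE still ends up with provider c.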
v6-0001-Introduce-collation-provider-none.patch
From de37bfb02dcc41c2e932a788ba10a05e5a539870 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 1 May 2023 15:38:29 -0700
Subject: [PATCH v6 1/5] Introduce collation provider "none".
Provides locale-unaware semantics that are implemented as fast byte
operations in Postgres, independent of the operating system or any
provider libraries.
Equivalent (in semantics and implementation) to the libc provider with
locale "C", except that LC_COLLATE and LC_CTYPE can be set
independently.
Use provider "none" for built-in collation "ucs_basic" instead of
libc.
---
doc/src/sgml/charset.sgml | 87 +++++++++++++++++++++-----
doc/src/sgml/ref/create_collation.sgml | 2 +-
doc/src/sgml/ref/create_database.sgml | 2 +-
doc/src/sgml/ref/createdb.sgml | 2 +-
doc/src/sgml/ref/initdb.sgml | 2 +-
src/backend/catalog/pg_collation.c | 7 ++-
src/backend/commands/collationcmds.c | 84 +++++++++++++++++++++----
src/backend/commands/dbcommands.c | 69 +++++++++++++++++---
src/backend/utils/adt/pg_locale.c | 27 +++++++-
src/backend/utils/init/postinit.c | 10 ++-
src/bin/initdb/initdb.c | 33 +++++++++-
src/bin/initdb/t/001_initdb.pl | 29 +++++++++
src/bin/pg_dump/pg_dump.c | 8 ++-
src/bin/pg_upgrade/t/002_pg_upgrade.pl | 18 +++++-
src/bin/psql/describe.c | 2 +-
src/bin/scripts/createdb.c | 2 +-
src/bin/scripts/t/020_createdb.pl | 29 +++++++++
src/include/catalog/pg_collation.dat | 3 +-
src/include/catalog/pg_collation.h | 3 +
src/test/regress/expected/collate.out | 10 ++-
src/test/regress/sql/collate.sql | 6 ++
21 files changed, 373 insertions(+), 62 deletions(-)
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 9db14649aa..7a791a2b7c 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -342,22 +342,14 @@ initdb --locale=sv_SE
<title>Locale Providers</title>
<para>
- <productname>PostgreSQL</productname> supports multiple <firstterm>locale
- providers</firstterm>. This specifies which library supplies the locale
- data. One standard provider name is <literal>libc</literal>, which uses
- the locales provided by the operating system C library. These are the
- locales used by most tools provided by the operating system. Another
- provider is <literal>icu</literal>, which uses the external
- ICU<indexterm><primary>ICU</primary></indexterm> library. ICU locales can
- only be used if support for ICU was configured when PostgreSQL was built.
+ A locale provider specifies which library defines the locale behavior for
+ collations and character classifications.
</para>
<para>
The commands and tools that select the locale settings, as described
- above, each have an option to select the locale provider. The examples
- shown earlier all use the <literal>libc</literal> provider, which is the
- default. Here is an example to initialize a database cluster using the
- ICU provider:
+ above, each have an option to select the locale provider. Here is an
+ example to initialize a database cluster using the ICU provider:
<programlisting>
initdb --locale-provider=icu --icu-locale=en
</programlisting>
@@ -370,12 +362,73 @@ initdb --locale-provider=icu --icu-locale=en
</para>
<para>
- Which locale provider to use depends on individual requirements. For most
- basic uses, either provider will give adequate results. For the libc
- provider, it depends on what the operating system offers; some operating
- systems are better than others. For advanced uses, ICU offers more locale
- variants and customization options.
+ Regardless of the locale provider, the operating system is still used to
+ provide some locale-aware behavior, such as messages (see <xref
+ linkend="guc-lc-messages"/>).
</para>
+
+ <para>
+ The available locale providers are listed below.
+ </para>
+
+ <sect3 id="locale-provider-none">
+ <title>None</title>
+ <para>
+ The <literal>none</literal> provider uses simple built-in operations
+ which are not locale-aware.
+ </para>
+ <para>
+ The collation and character classification behavior is equivalent to
+ using the <literal>libc</literal> provider with locale
+ <literal>C</literal>, except that <literal>LC_COLLATE</literal> and
+ <literal>LC_CTYPE</literal> can be set independently.
+ </para>
+ <note>
+ <para>
+ When using the <literal>none</literal> locale provider, behavior may
+ depend on the database encoding.
+ </para>
+ </note>
+ </sect3>
+ <sect3 id="locale-provider-icu">
+ <title>ICU</title>
+ <para>
+ The <literal>icu</literal> provider uses the external
+ ICU<indexterm><primary>ICU</primary></indexterm>
+ library. <productname>PostgreSQL</productname> must have been configured
+ with ICU support.
+ </para>
+ <para>
+ ICU provides collation and character classification behavior that is
+ independent of the operating system and database encoding, which is
+ preferable if you expect to transition to other platforms without any
+ change in results. <literal>LC_COLLATE</literal> and
+ <literal>LC_CTYPE</literal> can be set independently of the ICU locale.
+ </para>
+ <note>
+ <para>
+ For the ICU provider, results may depend on the version of the ICU
+ library used, as it is updated to reflect changes in natural language
+ over time.
+ </para>
+ </note>
+ </sect3>
+ <sect3 id="locale-provider-libc">
+ <title>libc</title>
+ <para>
+ The <literal>libc</literal> provider uses the operating system's C
+ library. The collation and character classification behavior is
+ controlled by the settings <literal>LC_COLLATE</literal> and
+ <literal>LC_CTYPE</literal>, so they cannot be set independently.
+ </para>
+ <note>
+ <para>
+ The same locale name may have different behavior on different platforms
+ when using the libc provider.
+ </para>
+ </note>
+ </sect3>
+
</sect2>
<sect2 id="icu-locales">
<title>ICU Locales</title>
diff --git a/doc/src/sgml/ref/create_collation.sgml b/doc/src/sgml/ref/create_collation.sgml
index f6353da5c1..5489ae7413 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -120,7 +120,7 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
<listitem>
<para>
Specifies the provider to use for locale services associated with this
- collation. Possible values are
+ collation. Possible values are <literal>none</literal>,
<literal>icu</literal><indexterm><primary>ICU</primary></indexterm>
(if the server was built with ICU support) or <literal>libc</literal>.
<literal>libc</literal> is the default. See <xref
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index 13793bb6b7..60b9da0952 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -212,7 +212,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
<listitem>
<para>
Specifies the provider to use for the default collation in this
- database. Possible values are
+ database. Possible values are <literal>none</literal>,
<literal>icu</literal><indexterm><primary>ICU</primary></indexterm>
(if the server was built with ICU support) or <literal>libc</literal>.
By default, the provider is the same as that of the <xref
diff --git a/doc/src/sgml/ref/createdb.sgml b/doc/src/sgml/ref/createdb.sgml
index e23419ba6c..326a371d34 100644
--- a/doc/src/sgml/ref/createdb.sgml
+++ b/doc/src/sgml/ref/createdb.sgml
@@ -168,7 +168,7 @@ PostgreSQL documentation
</varlistentry>
<varlistentry>
- <term><option>--locale-provider={<literal>libc</literal>|<literal>icu</literal>}</option></term>
+ <term><option>--locale-provider={<literal>none</literal>|<literal>libc</literal>|<literal>icu</literal>}</option></term>
<listitem>
<para>
Specifies the locale provider for the database's default collation.
diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 87945b4b62..e604ab48b7 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -323,7 +323,7 @@ PostgreSQL documentation
</varlistentry>
<varlistentry id="app-initdb-option-locale-provider">
- <term><option>--locale-provider={<literal>libc</literal>|<literal>icu</literal>}</option></term>
+ <term><option>--locale-provider={<literal>none</literal>|<literal>libc</literal>|<literal>icu</literal>}</option></term>
<listitem>
<para>
This option sets the locale provider for databases created in the new
diff --git a/src/backend/catalog/pg_collation.c b/src/backend/catalog/pg_collation.c
index fd022e6fc2..86b6ba2375 100644
--- a/src/backend/catalog/pg_collation.c
+++ b/src/backend/catalog/pg_collation.c
@@ -68,7 +68,12 @@ CollationCreate(const char *collname, Oid collnamespace,
Assert(collname);
Assert(collnamespace);
Assert(collowner);
- Assert((collcollate && collctype) || colliculocale);
+ Assert((collprovider == COLLPROVIDER_NONE &&
+ !collcollate && !collctype && !colliculocale) ||
+ (collprovider == COLLPROVIDER_LIBC &&
+ collcollate && collctype && !colliculocale) ||
+ (collprovider == COLLPROVIDER_ICU &&
+ !collcollate && !collctype && colliculocale));
/*
* Make sure there is no existing collation of same name & encoding.
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index c91fe66d9b..aeaf6c419e 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -215,7 +215,9 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
if (collproviderstr)
{
- if (pg_strcasecmp(collproviderstr, "icu") == 0)
+ if (pg_strcasecmp(collproviderstr, "none") == 0)
+ collprovider = COLLPROVIDER_NONE;
+ else if (pg_strcasecmp(collproviderstr, "icu") == 0)
collprovider = COLLPROVIDER_ICU;
else if (pg_strcasecmp(collproviderstr, "libc") == 0)
collprovider = COLLPROVIDER_LIBC;
@@ -228,6 +230,13 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
else
collprovider = COLLPROVIDER_LIBC;
+ if (collprovider == COLLPROVIDER_NONE
+ && (localeEl || lccollateEl || lcctypeEl))
+ {
+ ereport(ERROR,
+ (errmsg("collation provider \"none\" does not support LOCALE, LC_COLLATE, or LC_CTYPE")));
+ }
+
if (localeEl)
{
if (collprovider == COLLPROVIDER_LIBC)
@@ -302,6 +311,16 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
errmsg("ICU rules cannot be specified unless locale provider is ICU")));
+ if (collprovider == COLLPROVIDER_NONE)
+ {
+ /*
+ * Behavior may be different in different encodings, so set
+ * collencoding to the current database encoding. No validation is
+ * required, because the "none" provider is compatible with any
+ * encoding.
+ */
+ collencoding = GetDatabaseEncoding();
+ }
if (collprovider == COLLPROVIDER_ICU)
{
#ifdef USE_ICU
@@ -331,7 +350,18 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
}
if (!collversion)
- collversion = get_collation_actual_version(collprovider, collprovider == COLLPROVIDER_ICU ? colliculocale : collcollate);
+ {
+ char *locale;
+
+ if (collprovider == COLLPROVIDER_ICU)
+ locale = colliculocale;
+ else if (collprovider == COLLPROVIDER_LIBC)
+ locale = collcollate;
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
+
+ collversion = get_collation_actual_version(collprovider, locale);
+ }
newoid = CollationCreate(collName,
collNamespace,
@@ -406,6 +436,7 @@ AlterCollation(AlterCollationStmt *stmt)
Form_pg_collation collForm;
Datum datum;
bool isnull;
+ char *locale;
char *oldversion;
char *newversion;
ObjectAddress address;
@@ -430,8 +461,20 @@ AlterCollation(AlterCollationStmt *stmt)
datum = SysCacheGetAttr(COLLOID, tup, Anum_pg_collation_collversion, &isnull);
oldversion = isnull ? NULL : TextDatumGetCString(datum);
- datum = SysCacheGetAttrNotNull(COLLOID, tup, collForm->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate);
- newversion = get_collation_actual_version(collForm->collprovider, TextDatumGetCString(datum));
+ if (collForm->collprovider == COLLPROVIDER_ICU)
+ {
+ datum = SysCacheGetAttrNotNull(COLLOID, tup, Anum_pg_collation_colliculocale);
+ locale = TextDatumGetCString(datum);
+ }
+ else if (collForm->collprovider == COLLPROVIDER_LIBC)
+ {
+ datum = SysCacheGetAttrNotNull(COLLOID, tup, Anum_pg_collation_collcollate);
+ locale = TextDatumGetCString(datum);
+ }
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
+
+ newversion = get_collation_actual_version(collForm->collprovider, locale);
/* cannot change from NULL to non-NULL or vice versa */
if ((!oldversion && newversion) || (oldversion && !newversion))
@@ -494,11 +537,18 @@ pg_collation_actual_version(PG_FUNCTION_ARGS)
provider = ((Form_pg_database) GETSTRUCT(dbtup))->datlocprovider;
- datum = SysCacheGetAttrNotNull(DATABASEOID, dbtup,
- provider == COLLPROVIDER_ICU ?
- Anum_pg_database_daticulocale : Anum_pg_database_datcollate);
-
- locale = TextDatumGetCString(datum);
+ if (provider == COLLPROVIDER_ICU)
+ {
+ datum = SysCacheGetAttrNotNull(DATABASEOID, dbtup, Anum_pg_database_daticulocale);
+ locale = TextDatumGetCString(datum);
+ }
+ else if (provider == COLLPROVIDER_LIBC)
+ {
+ datum = SysCacheGetAttrNotNull(DATABASEOID, dbtup, Anum_pg_database_datcollate);
+ locale = TextDatumGetCString(datum);
+ }
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
ReleaseSysCache(dbtup);
}
@@ -514,11 +564,19 @@ pg_collation_actual_version(PG_FUNCTION_ARGS)
provider = ((Form_pg_collation) GETSTRUCT(colltp))->collprovider;
Assert(provider != COLLPROVIDER_DEFAULT);
- datum = SysCacheGetAttrNotNull(COLLOID, colltp,
- provider == COLLPROVIDER_ICU ?
- Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate);
- locale = TextDatumGetCString(datum);
+ if (provider == COLLPROVIDER_ICU)
+ {
+ datum = SysCacheGetAttrNotNull(COLLOID, colltp, Anum_pg_collation_colliculocale);
+ locale = TextDatumGetCString(datum);
+ }
+ else if (provider == COLLPROVIDER_LIBC)
+ {
+ datum = SysCacheGetAttrNotNull(COLLOID, colltp, Anum_pg_collation_collcollate);
+ locale = TextDatumGetCString(datum);
+ }
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
ReleaseSysCache(colltp);
}
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 2e242eeff2..9e73f54803 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -909,7 +909,9 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
{
char *locproviderstr = defGetString(dlocprovider);
- if (pg_strcasecmp(locproviderstr, "icu") == 0)
+ if (pg_strcasecmp(locproviderstr, "none") == 0)
+ dblocprovider = COLLPROVIDER_NONE;
+ else if (pg_strcasecmp(locproviderstr, "icu") == 0)
dblocprovider = COLLPROVIDER_ICU;
else if (pg_strcasecmp(locproviderstr, "libc") == 0)
dblocprovider = COLLPROVIDER_LIBC;
@@ -1177,9 +1179,17 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
*/
if (src_collversion && !dcollversion)
{
- char *actual_versionstr;
+ char *actual_versionstr;
+ char *locale;
- actual_versionstr = get_collation_actual_version(dblocprovider, dblocprovider == COLLPROVIDER_ICU ? dbiculocale : dbcollate);
+ if (dblocprovider == COLLPROVIDER_ICU)
+ locale = dbiculocale;
+ else if (dblocprovider == COLLPROVIDER_LIBC)
+ locale = dbcollate;
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
+
+ actual_versionstr = get_collation_actual_version(dblocprovider, locale);
if (!actual_versionstr)
ereport(ERROR,
(errmsg("template database \"%s\" has a collation version, but no actual collation version could be determined",
@@ -1207,7 +1217,18 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
* collation version, which is normally only the case for template0.
*/
if (dbcollversion == NULL)
- dbcollversion = get_collation_actual_version(dblocprovider, dblocprovider == COLLPROVIDER_ICU ? dbiculocale : dbcollate);
+ {
+ char *locale;
+
+ if (dblocprovider == COLLPROVIDER_ICU)
+ locale = dbiculocale;
+ else if (dblocprovider == COLLPROVIDER_LIBC)
+ locale = dbcollate;
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
+
+ dbcollversion = get_collation_actual_version(dblocprovider, locale);
+ }
/* Resolve default tablespace for new database */
if (dtablespacename && dtablespacename->arg)
@@ -2403,6 +2424,7 @@ AlterDatabaseRefreshColl(AlterDatabaseRefreshCollStmt *stmt)
ObjectAddress address;
Datum datum;
bool isnull;
+ char *locale;
char *oldversion;
char *newversion;
@@ -2429,10 +2451,24 @@ AlterDatabaseRefreshColl(AlterDatabaseRefreshCollStmt *stmt)
datum = heap_getattr(tuple, Anum_pg_database_datcollversion, RelationGetDescr(rel), &isnull);
oldversion = isnull ? NULL : TextDatumGetCString(datum);
- datum = heap_getattr(tuple, datForm->datlocprovider == COLLPROVIDER_ICU ? Anum_pg_database_daticulocale : Anum_pg_database_datcollate, RelationGetDescr(rel), &isnull);
- if (isnull)
- elog(ERROR, "unexpected null in pg_database");
- newversion = get_collation_actual_version(datForm->datlocprovider, TextDatumGetCString(datum));
+ if (datForm->datlocprovider == COLLPROVIDER_ICU)
+ {
+ datum = heap_getattr(tuple, Anum_pg_database_daticulocale, RelationGetDescr(rel), &isnull);
+ if (isnull)
+ elog(ERROR, "unexpected null in pg_database");
+ locale = TextDatumGetCString(datum);
+ }
+ else if (datForm->datlocprovider == COLLPROVIDER_LIBC)
+ {
+ datum = heap_getattr(tuple, Anum_pg_database_datcollate, RelationGetDescr(rel), &isnull);
+ if (isnull)
+ elog(ERROR, "unexpected null in pg_database");
+ locale = TextDatumGetCString(datum);
+ }
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
+
+ newversion = get_collation_actual_version(datForm->datlocprovider, locale);
/* cannot change from NULL to non-NULL or vice versa */
if ((!oldversion && newversion) || (oldversion && !newversion))
@@ -2617,6 +2653,7 @@ pg_database_collation_actual_version(PG_FUNCTION_ARGS)
HeapTuple tp;
char datlocprovider;
Datum datum;
+ char *locale;
char *version;
tp = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(dbid));
@@ -2627,8 +2664,20 @@ pg_database_collation_actual_version(PG_FUNCTION_ARGS)
datlocprovider = ((Form_pg_database) GETSTRUCT(tp))->datlocprovider;
- datum = SysCacheGetAttrNotNull(DATABASEOID, tp, datlocprovider == COLLPROVIDER_ICU ? Anum_pg_database_daticulocale : Anum_pg_database_datcollate);
- version = get_collation_actual_version(datlocprovider, TextDatumGetCString(datum));
+ if (datlocprovider == COLLPROVIDER_ICU)
+ {
+ datum = SysCacheGetAttrNotNull(DATABASEOID, tp, Anum_pg_database_daticulocale);
+ locale = TextDatumGetCString(datum);
+ }
+ else if (datlocprovider == COLLPROVIDER_LIBC)
+ {
+ datum = SysCacheGetAttrNotNull(DATABASEOID, tp, Anum_pg_database_datcollate);
+ locale = TextDatumGetCString(datum);
+ }
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
+
+ version = get_collation_actual_version(datlocprovider, locale);
ReleaseSysCache(tp);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index eea1d1ae0f..95eb5cf464 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1228,7 +1228,12 @@ lookup_collation_cache(Oid collation, bool set_flags)
elog(ERROR, "cache lookup failed for collation %u", collation);
collform = (Form_pg_collation) GETSTRUCT(tp);
- if (collform->collprovider == COLLPROVIDER_LIBC)
+ if (collform->collprovider == COLLPROVIDER_NONE)
+ {
+ cache_entry->collate_is_c = true;
+ cache_entry->ctype_is_c = true;
+ }
+ else if (collform->collprovider == COLLPROVIDER_LIBC)
{
Datum datum;
const char *collcollate;
@@ -1281,6 +1286,9 @@ lc_collate_is_c(Oid collation)
static int result = -1;
char *localeptr;
+ if (default_locale.provider == COLLPROVIDER_NONE)
+ return true;
+
if (default_locale.provider == COLLPROVIDER_ICU)
return false;
@@ -1334,6 +1342,9 @@ lc_ctype_is_c(Oid collation)
static int result = -1;
char *localeptr;
+ if (default_locale.provider == COLLPROVIDER_NONE)
+ return true;
+
if (default_locale.provider == COLLPROVIDER_ICU)
return false;
@@ -1487,8 +1498,10 @@ pg_newlocale_from_collation(Oid collid)
{
if (default_locale.provider == COLLPROVIDER_ICU)
return &default_locale;
- else
+ else if (default_locale.provider == COLLPROVIDER_LIBC)
return (pg_locale_t) 0;
+ else
+ elog(ERROR, "cannot open collation with provider \"none\"");
}
cache_entry = lookup_collation_cache(collid, false);
@@ -1513,7 +1526,11 @@ pg_newlocale_from_collation(Oid collid)
result.provider = collform->collprovider;
result.deterministic = collform->collisdeterministic;
- if (collform->collprovider == COLLPROVIDER_LIBC)
+ if (collform->collprovider == COLLPROVIDER_NONE)
+ {
+ elog(ERROR, "cannot open collation with provider \"none\"");
+ }
+ else if (collform->collprovider == COLLPROVIDER_LIBC)
{
#ifdef HAVE_LOCALE_T
const char *collcollate;
@@ -1599,6 +1616,7 @@ pg_newlocale_from_collation(Oid collid)
collversionstr = TextDatumGetCString(datum);
+ Assert(collform->collprovider != COLLPROVIDER_NONE);
datum = SysCacheGetAttrNotNull(COLLOID, tp, collform->collprovider == COLLPROVIDER_ICU ? Anum_pg_collation_colliculocale : Anum_pg_collation_collcollate);
actual_versionstr = get_collation_actual_version(collform->collprovider,
@@ -1650,6 +1668,9 @@ get_collation_actual_version(char collprovider, const char *collcollate)
{
char *collversion = NULL;
+ if (collprovider == COLLPROVIDER_NONE)
+ return NULL;
+
#ifdef USE_ICU
if (collprovider == COLLPROVIDER_ICU)
{
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 53420f4974..8053642fd3 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -461,10 +461,18 @@ CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connect
{
char *actual_versionstr;
char *collversionstr;
+ char *locale;
collversionstr = TextDatumGetCString(datum);
- actual_versionstr = get_collation_actual_version(dbform->datlocprovider, dbform->datlocprovider == COLLPROVIDER_ICU ? iculocale : collate);
+ if (dbform->datlocprovider == COLLPROVIDER_ICU)
+ locale = iculocale;
+ else if (dbform->datlocprovider == COLLPROVIDER_LIBC)
+ locale = collate;
+ else
+ locale = NULL; /* COLLPROVIDER_NONE */
+
+ actual_versionstr = get_collation_actual_version(dbform->datlocprovider, locale);
if (!actual_versionstr)
/* should not happen */
elog(WARNING,
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 30b576932f..4a6cad3cb9 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2408,6 +2408,22 @@ setlocales(void)
/* set empty lc_* values to locale config if set */
+ if (locale_provider == COLLPROVIDER_NONE)
+ {
+ if (!lc_ctype)
+ lc_ctype = "C";
+ if (!lc_collate)
+ lc_collate = "C";
+ if (!lc_numeric)
+ lc_numeric = "C";
+ if (!lc_time)
+ lc_time = "C";
+ if (!lc_monetary)
+ lc_monetary = "C";
+ if (!lc_messages)
+ lc_messages = "C";
+ }
+
if (locale)
{
if (!lc_ctype)
@@ -2502,7 +2518,7 @@ usage(const char *progname)
" set default locale in the respective category for\n"
" new databases (default taken from environment)\n"));
printf(_(" --no-locale equivalent to --locale=C\n"));
- printf(_(" --locale-provider={libc|icu}\n"
+ printf(_(" --locale-provider={none|libc|icu}\n"
" set default locale provider for new databases\n"));
printf(_(" --pwfile=FILE read password for the new superuser from file\n"));
printf(_(" -T, --text-search-config=CFG\n"
@@ -2652,7 +2668,15 @@ setup_locale_encoding(void)
{
setlocales();
- if (locale_provider == COLLPROVIDER_LIBC &&
+ if (locale_provider == COLLPROVIDER_NONE &&
+ strcmp(lc_ctype, "C") == 0 &&
+ strcmp(lc_collate, "C") == 0 &&
+ strcmp(lc_time, "C") == 0 &&
+ strcmp(lc_numeric, "C") == 0 &&
+ strcmp(lc_monetary, "C") == 0 &&
+ strcmp(lc_messages, "C") == 0)
+ printf(_("The database cluster will be initialized with no locale.\n"));
+ else if (locale_provider == COLLPROVIDER_LIBC &&
strcmp(lc_ctype, lc_collate) == 0 &&
strcmp(lc_ctype, lc_time) == 0 &&
strcmp(lc_ctype, lc_numeric) == 0 &&
@@ -3326,7 +3350,9 @@ main(int argc, char *argv[])
"-c debug_discard_caches=1");
break;
case 15:
- if (strcmp(optarg, "icu") == 0)
+ if (strcmp(optarg, "none") == 0)
+ locale_provider = COLLPROVIDER_NONE;
+ else if (strcmp(optarg, "icu") == 0)
locale_provider = COLLPROVIDER_ICU;
else if (strcmp(optarg, "libc") == 0)
locale_provider = COLLPROVIDER_LIBC;
@@ -3365,6 +3391,7 @@ main(int argc, char *argv[])
exit(1);
}
+
if (icu_locale && locale_provider != COLLPROVIDER_ICU)
pg_fatal("%s cannot be specified unless locale provider \"%s\" is chosen",
"--icu-locale", "icu");
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 17a444d80c..fe6d224e5b 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -154,6 +154,35 @@ else
'locale provider ICU fails since no ICU support');
}
+command_ok(
+ [ 'initdb', '--no-sync', '--locale-provider=none', "$tempdir/data6" ],
+ 'locale provider none');
+
+command_ok(
+ [ 'initdb', '--no-sync', '--locale-provider=none', '--locale=C',
+ "$tempdir/data7" ],
+ 'locale provider none with --locale');
+
+command_ok(
+ [ 'initdb', '--no-sync', '--locale-provider=none', '--lc-collate=C',
+ "$tempdir/data8" ],
+ 'locale provider none with --lc-collate');
+
+command_ok(
+ [ 'initdb', '--no-sync', '--locale-provider=none', '--lc-ctype=C',
+ "$tempdir/data9" ],
+ 'locale provider none with --lc-ctype');
+
+command_fails(
+ [ 'initdb', '--no-sync', '--locale-provider=none', '--icu-locale=en',
+ "$tempdir/dataX" ],
+ 'fails for locale provider none with ICU locale');
+
+command_fails(
+ [ 'initdb', '--no-sync', '--locale-provider=none', '--icu-rules=""',
+ "$tempdir/dataX" ],
+ 'fails for locale provider none with ICU rules');
+
command_fails(
[ 'initdb', '--no-sync', '--locale-provider=xyz', "$tempdir/dataX" ],
'fails for invalid locale provider');
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index f9cbeb65ab..ddc8a5f71f 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3070,7 +3070,9 @@ dumpDatabase(Archive *fout)
}
appendPQExpBufferStr(creaQry, " LOCALE_PROVIDER = ");
- if (datlocprovider[0] == 'c')
+ if (datlocprovider[0] == 'n')
+ appendPQExpBufferStr(creaQry, "none");
+ else if (datlocprovider[0] == 'c')
appendPQExpBufferStr(creaQry, "libc");
else if (datlocprovider[0] == 'i')
appendPQExpBufferStr(creaQry, "icu");
@@ -13429,7 +13431,9 @@ dumpCollation(Archive *fout, const CollInfo *collinfo)
fmtQualifiedDumpable(collinfo));
appendPQExpBufferStr(q, "provider = ");
- if (collprovider[0] == 'c')
+ if (collprovider[0] == 'n')
+ appendPQExpBufferStr(q, "none");
+ else if (collprovider[0] == 'c')
appendPQExpBufferStr(q, "libc");
else if (collprovider[0] == 'i')
appendPQExpBufferStr(q, "icu");
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 4a7895a756..6d58f6103e 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -114,12 +114,20 @@ my $original_locale = "C";
my $original_iculocale = "";
my $provider_field = "'c' AS datlocprovider";
my $iculocale_field = "NULL AS daticulocale";
-if ($oldnode->pg_version >= 15 && $ENV{with_icu} eq 'yes')
+if ($oldnode->pg_version >= 15)
{
$provider_field = "datlocprovider";
$iculocale_field = "daticulocale";
- $original_provider = "i";
- $original_iculocale = "fr-CA";
+
+ if ($ENV{with_icu} eq 'yes')
+ {
+ $original_provider = "i";
+ $original_iculocale = "fr-CA";
+ }
+ else
+ {
+ $original_provider = "n";
+ }
}
my @initdb_params = @custom_opts;
@@ -131,6 +139,10 @@ if ($original_provider eq "i")
push @initdb_params, ('--locale-provider', 'icu');
push @initdb_params, ('--icu-locale', 'fr-CA');
}
+elsif ($original_provider eq "n")
+{
+ push @initdb_params, ('--locale-provider', 'none');
+}
$node_params{extra} = \@initdb_params;
$oldnode->init(%node_params);
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index ab4279ed58..c842a62ae9 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -932,7 +932,7 @@ listAllDbs(const char *pattern, bool verbose)
gettext_noop("Encoding"));
if (pset.sversion >= 150000)
appendPQExpBuffer(&buf,
- " CASE d.datlocprovider WHEN 'c' THEN 'libc' WHEN 'i' THEN 'icu' END AS \"%s\",\n",
+ " CASE d.datlocprovider WHEN 'n' THEN 'none' WHEN 'c' THEN 'libc' WHEN 'i' THEN 'icu' END AS \"%s\",\n",
gettext_noop("Locale Provider"));
else
appendPQExpBuffer(&buf,
diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c
index b4205c4fa5..79367d933b 100644
--- a/src/bin/scripts/createdb.c
+++ b/src/bin/scripts/createdb.c
@@ -299,7 +299,7 @@ help(const char *progname)
printf(_(" --lc-ctype=LOCALE LC_CTYPE setting for the database\n"));
printf(_(" --icu-locale=LOCALE ICU locale setting for the database\n"));
printf(_(" --icu-rules=RULES ICU rules setting for the database\n"));
- printf(_(" --locale-provider={libc|icu}\n"
+ printf(_(" --locale-provider={none|libc|icu}\n"
" locale provider for the database's default collation\n"));
printf(_(" -O, --owner=OWNER database user to own the new database\n"));
printf(_(" -S, --strategy=STRATEGY database creation strategy wal_log or file_copy\n"));
diff --git a/src/bin/scripts/t/020_createdb.pl b/src/bin/scripts/t/020_createdb.pl
index af3b1492e3..5aa658b671 100644
--- a/src/bin/scripts/t/020_createdb.pl
+++ b/src/bin/scripts/t/020_createdb.pl
@@ -83,6 +83,35 @@ else
'create database with ICU fails since no ICU support');
}
+$node->command_ok(
+ [ 'createdb', '-T', 'template0', '--locale-provider=none', 'testnone1' ],
+ 'create database with provider "none"');
+
+$node->command_ok(
+ [ 'createdb', '-T', 'template0', '--locale-provider=none', '--locale=C',
+ 'testnone2' ],
+ 'create database with provider "none" and locale "C"');
+
+$node->command_ok(
+ [ 'createdb', '-T', 'template0', '--locale-provider=none', '--lc-collate=C',
+ 'testnone3' ],
+ 'create database with provider "none" and LC_COLLATE=C');
+
+$node->command_ok(
+ [ 'createdb', '-T', 'template0', '--locale-provider=none', '--lc-ctype=C',
+ 'testnone4' ],
+ 'create database with provider "none" and LC_CTYPE=C');
+
+$node->command_fails(
+ [ 'createdb', '-T', 'template0', '--locale-provider=none', '--icu-locale=en',
+ 'testnone5' ],
+ 'create database with provider "none" and ICU_LOCALE="en"');
+
+$node->command_fails(
+ [ 'createdb', '-T', 'template0', '--locale-provider=none', '--icu-rules=""',
+ 'testnone6' ],
+ 'create database with provider "none" and ICU_RULES=""');
+
$node->command_fails([ 'createdb', 'foobar1' ],
'fails if database already exists');
diff --git a/src/include/catalog/pg_collation.dat b/src/include/catalog/pg_collation.dat
index b6a69d1d42..40d62416ea 100644
--- a/src/include/catalog/pg_collation.dat
+++ b/src/include/catalog/pg_collation.dat
@@ -24,8 +24,7 @@
collname => 'POSIX', collprovider => 'c', collencoding => '-1',
collcollate => 'POSIX', collctype => 'POSIX' },
{ oid => '962', descr => 'sorts by Unicode code point',
- collname => 'ucs_basic', collprovider => 'c', collencoding => '6',
- collcollate => 'C', collctype => 'C' },
+ collname => 'ucs_basic', collprovider => 'n', collencoding => '6' },
{ oid => '963',
descr => 'sorts using the Unicode Collation Algorithm with default settings',
collname => 'unicode', collprovider => 'i', collencoding => '-1',
diff --git a/src/include/catalog/pg_collation.h b/src/include/catalog/pg_collation.h
index bfa3568451..29be3f8d94 100644
--- a/src/include/catalog/pg_collation.h
+++ b/src/include/catalog/pg_collation.h
@@ -64,6 +64,7 @@ DECLARE_UNIQUE_INDEX_PKEY(pg_collation_oid_index, 3085, CollationOidIndexId, on
#ifdef EXPOSE_TO_CLIENT_CODE
+#define COLLPROVIDER_NONE 'n'
#define COLLPROVIDER_DEFAULT 'd'
#define COLLPROVIDER_ICU 'i'
#define COLLPROVIDER_LIBC 'c'
@@ -73,6 +74,8 @@ collprovider_name(char c)
{
switch (c)
{
+ case COLLPROVIDER_NONE:
+ return "none";
case COLLPROVIDER_ICU:
return "icu";
case COLLPROVIDER_LIBC:
diff --git a/src/test/regress/expected/collate.out b/src/test/regress/expected/collate.out
index 0649564485..b7603c9f6c 100644
--- a/src/test/regress/expected/collate.out
+++ b/src/test/regress/expected/collate.out
@@ -650,6 +650,13 @@ EXPLAIN (COSTS OFF)
(3 rows)
-- CREATE/DROP COLLATION
+CREATE COLLATION none ( PROVIDER = none );
+CREATE COLLATION none2 ( PROVIDER = none, LOCALE="POSIX" ); -- fails
+ERROR: collation provider "none" does not support LOCALE, LC_COLLATE, or LC_CTYPE
+CREATE COLLATION none2 ( PROVIDER = none, LC_CTYPE="POSIX" ); -- fails
+ERROR: collation provider "none" does not support LOCALE, LC_COLLATE, or LC_CTYPE
+CREATE COLLATION none2 ( PROVIDER = none, LC_COLLATE="POSIX" ); -- fails
+ERROR: collation provider "none" does not support LOCALE, LC_COLLATE, or LC_CTYPE
CREATE COLLATION mycoll1 FROM "C";
CREATE COLLATION mycoll2 ( LC_COLLATE = "POSIX", LC_CTYPE = "POSIX" );
CREATE COLLATION mycoll3 FROM "default"; -- intentionally unsupported
@@ -754,7 +761,7 @@ DETAIL: FROM cannot be specified together with any other options.
-- must get rid of them.
--
DROP SCHEMA collate_tests CASCADE;
-NOTICE: drop cascades to 19 other objects
+NOTICE: drop cascades to 20 other objects
DETAIL: drop cascades to table collate_test1
drop cascades to table collate_test_like
drop cascades to table collate_test2
@@ -771,6 +778,7 @@ drop cascades to function dup(anyelement)
drop cascades to table collate_test20
drop cascades to table collate_test21
drop cascades to table collate_test22
+drop cascades to collation "none"
drop cascades to collation mycoll2
drop cascades to table collate_test23
drop cascades to view collate_on_int
diff --git a/src/test/regress/sql/collate.sql b/src/test/regress/sql/collate.sql
index c3d40fc195..e2dceb8dff 100644
--- a/src/test/regress/sql/collate.sql
+++ b/src/test/regress/sql/collate.sql
@@ -244,6 +244,12 @@ EXPLAIN (COSTS OFF)
-- CREATE/DROP COLLATION
+CREATE COLLATION none ( PROVIDER = none );
+
+CREATE COLLATION none2 ( PROVIDER = none, LOCALE="POSIX" ); -- fails
+CREATE COLLATION none2 ( PROVIDER = none, LC_CTYPE="POSIX" ); -- fails
+CREATE COLLATION none2 ( PROVIDER = none, LC_COLLATE="POSIX" ); -- fails
+
CREATE COLLATION mycoll1 FROM "C";
CREATE COLLATION mycoll2 ( LC_COLLATE = "POSIX", LC_CTYPE = "POSIX" );
CREATE COLLATION mycoll3 FROM "default"; -- intentionally unsupported
--
2.34.1
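Note on the patch above: the "none" provider sets collate_is_c and ctype_is_c for the collation, which, as noted earlier in the thread, should mean memcmp ordering and nothing else. A minimal standalone sketch of that ordering rule follows. The helper names bytewise_cmp and bytewise_strcmp are hypothetical and are not PostgreSQL functions; the sketch only illustrates the comparison rule and assumes plain NUL-terminated strings.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not part of PostgreSQL): compare two byte strings
 * the way a "C"-style collation is expected to, i.e. memcmp over the
 * common prefix, with the shorter string sorting first on a tie.
 */
static int
bytewise_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
	size_t		minlen = (alen < blen) ? alen : blen;
	int			r;

	assert(a != NULL && b != NULL);

	r = memcmp(a, b, minlen);
	if (r != 0)
		return r;
	if (alen < blen)
		return -1;
	if (alen > blen)
		return 1;
	return 0;
}

/* Convenience wrapper for NUL-terminated strings (also hypothetical). */
static int
bytewise_strcmp(const char *a, const char *b)
{
	return bytewise_cmp(a, strlen(a), b, strlen(b));
}
```

With this rule, 'RM(25)/nodes|+|21|1' sorts before 'RM(25)/nodes|-|21|' because '+' (0x2B) is less than '-' (0x2D), matching the collate "C" result quoted earlier in the thread.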
Attachment: v6-0005-ICU-fix-up-old-libc-style-locale-strings.patch (text/x-patch; charset=UTF-8)
From 274a887f8970647b2c932ee55c4783095719985d Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Fri, 28 Apr 2023 12:22:41 -0700
Subject: [PATCH v6 5/5] ICU: fix up old libc-style locale strings.
Before transforming a locale string into a language tag, fix up old
libc-style locale strings such as 'fr_FR@euro'. Older ICU versions did
this automatically, but ICU version 64 removed that support.
Discussion: https://postgr.es/m/654a49f7ff7461bcf47be4181430678d45f93858.camel%40j-davis.com
---
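Note: the fix-up described in the commit message can be sketched outside the PostgreSQL tree roughly as follows. This is an illustrative standalone version, not the patch code: fix_variant and ascii_casecmp are hypothetical names, malloc stands in for palloc/pg_malloc, and a small ASCII-only comparison stands in for pg_strcasecmp. The mapping table is the one used in the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Old libc-style variants and their ICU keyword equivalents (from the patch). */
static const char *variant_map[][2] = {
	{"@EURO", "@currency=EUR"},
	{"@PINYIN", "@collation=pinyin"},
	{"@STROKE", "@collation=stroke"},
};

/* Hypothetical stand-in for pg_strcasecmp(): ASCII case-insensitive compare. */
static int
ascii_casecmp(const char *a, const char *b)
{
	while (*a && *b)
	{
		int			ca = tolower((unsigned char) *a);
		int			cb = tolower((unsigned char) *b);

		if (ca != cb)
			return ca - cb;
		a++;
		b++;
	}
	return (unsigned char) *a - (unsigned char) *b;
}

/*
 * Return a malloc'd copy of loc in which a trailing '@VARIANT' found in the
 * map is replaced by '@keyword=value'. Any other string is copied unchanged.
 * Returns NULL only if allocation fails. The caller frees the result.
 */
static char *
fix_variant(const char *loc)
{
	const char *at;
	char	   *result;
	size_t		nmap = sizeof(variant_map) / sizeof(variant_map[0]);

	assert(loc != NULL);

	at = strrchr(loc, '@');
	if (at != NULL)
	{
		size_t		prefix_len = (size_t) (at - loc);	/* bytes before '@' */

		for (size_t i = 0; i < nmap; i++)
		{
			if (ascii_casecmp(at, variant_map[i][0]) == 0)
			{
				const char *repl = variant_map[i][1];
				size_t		repl_len = strlen(repl);

				result = malloc(prefix_len + repl_len + 1);
				if (result == NULL)
					return NULL;
				memcpy(result, loc, prefix_len);
				/* copy the replacement including its terminating NUL */
				memcpy(result + prefix_len, repl, repl_len + 1);
				return result;
			}
		}
	}

	result = malloc(strlen(loc) + 1);
	if (result != NULL)
		strcpy(result, loc);
	return result;
}
```

For example, fix_variant("fr_FR@euro") yields "fr_FR@currency=EUR", while an unrecognized variant such as "@ASDF" is passed through unchanged and left for the later language-tag conversion to reject.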
src/backend/utils/adt/pg_locale.c | 57 ++++++++++++++++-
src/bin/initdb/initdb.c | 61 ++++++++++++++++++-
.../regress/expected/collate.icu.utf8.out | 11 ++++
src/test/regress/sql/collate.icu.utf8.sql | 7 +++
4 files changed, 134 insertions(+), 2 deletions(-)
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 95eb5cf464..2ee81e9804 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -2787,6 +2787,58 @@ icu_set_collation_attributes(UCollator *collator, const char *loc,
pfree(lower_str);
}
+
+static const char *icu_variant_map[][2] = {
+ { "@EURO", "@currency=EUR" },
+ { "@PINYIN", "@collation=pinyin" },
+ { "@STROKE", "@collation=stroke" },
+};
+
+/*
+ * ICU version 64 removed the ability to transform locale strings of the form
+ * '...@VARIANT' into proper language tags. Perform the transformation from
+ * within Postgres so that ICU supports any libc locale name consistently,
+ * regardless of the ICU version.
+ */
+static char *
+icu_fix_variants(const char *loc_str)
+{
+ const char *old_variant = strrchr(loc_str, '@');
+
+ /*
+ * Extract a variant of the form '...@VARIANT', and replace with
+ * the appropriate '...@keyword=value' if found in the map.
+ */
+ if (old_variant)
+ {
+ size_t prefix_len = old_variant - loc_str; /* bytes before the '@' */
+
+ for (int i = 0; i < lengthof(icu_variant_map); i++)
+ {
+ const char *map_variant = icu_variant_map[i][0];
+ const char *map_replacement = icu_variant_map[i][1];
+
+ if (pg_strcasecmp(old_variant, map_variant) == 0)
+ {
+ size_t replacement_len = strlen(map_replacement);
+ size_t result_len;
+ char *result;
+
+ result_len = prefix_len + replacement_len + 1;
+ result = palloc(result_len);
+
+ memcpy(result, loc_str, prefix_len);
+ memcpy(result + prefix_len, map_replacement, replacement_len);
+ result[prefix_len + replacement_len] = '\0';
+
+ return result;
+ }
+ }
+ }
+
+ return pstrdup(loc_str);
+}
+
#endif
/*
@@ -2803,6 +2855,7 @@ icu_language_tag(const char *loc_str, int elevel)
{
#ifdef USE_ICU
UErrorCode status;
+ char *fixed_loc_str = icu_fix_variants(loc_str);
char lang[ULOC_LANG_CAPACITY];
char *langtag;
size_t buflen = 32; /* arbitrary starting buffer size */
@@ -2833,7 +2886,7 @@ icu_language_tag(const char *loc_str, int elevel)
while (true)
{
status = U_ZERO_ERROR;
- uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+ uloc_toLanguageTag(fixed_loc_str, langtag, buflen, strict, &status);
/* try again if the buffer is not large enough */
if ((status == U_BUFFER_OVERFLOW_ERROR ||
@@ -2848,6 +2901,8 @@ icu_language_tag(const char *loc_str, int elevel)
break;
}
+ pfree(fixed_loc_str);
+
if (U_FAILURE(status))
{
pfree(langtag);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f0827154cd..1304a235ce 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2240,6 +2240,61 @@ check_icu_locale_encoding(int user_enc)
return true;
}
+#ifdef USE_ICU
+
+static const char *icu_variant_map[][2] = {
+ { "@EURO", "@currency=EUR" },
+ { "@PINYIN", "@collation=pinyin" },
+ { "@STROKE", "@collation=stroke" },
+};
+
+/*
+ * ICU version 64 removed the ability to transform locale strings of the form
+ * '...@VARIANT' into proper language tags. Perform the transformation from
+ * within Postgres so that ICU supports any libc locale name consistently,
+ * regardless of the ICU version.
+ */
+static char *
+icu_fix_variants(const char *loc_str)
+{
+ const char *old_variant = strrchr(loc_str, '@');
+
+ /*
+ * Extract a variant of the form '...@VARIANT', and replace with
+ * the appropriate '...@keyword=value' if found in the map.
+ */
+ if (old_variant)
+ {
+ size_t prefix_len = old_variant - loc_str; /* bytes before the '@' */
+
+ for (int i = 0; i < lengthof(icu_variant_map); i++)
+ {
+ const char *map_variant = icu_variant_map[i][0];
+ const char *map_replacement = icu_variant_map[i][1];
+
+ if (pg_strcasecmp(old_variant, map_variant) == 0)
+ {
+ size_t replacement_len = strlen(map_replacement);
+ size_t result_len;
+ char *result;
+
+ result_len = prefix_len + replacement_len + 1;
+ result = pg_malloc(result_len);
+
+ memcpy(result, loc_str, prefix_len);
+ memcpy(result + prefix_len, map_replacement, replacement_len);
+ result[prefix_len + replacement_len] = '\0';
+
+ return result;
+ }
+ }
+ }
+
+ return pg_strdup(loc_str);
+}
+
+#endif
+
/*
* Convert to canonical BCP47 language tag. Must be consistent with
* icu_language_tag().
@@ -2249,6 +2304,7 @@ icu_language_tag(const char *loc_str)
{
#ifdef USE_ICU
UErrorCode status;
+ char *fixed_loc_str = icu_fix_variants(loc_str);
char lang[ULOC_LANG_CAPACITY];
char *langtag;
size_t buflen = 32; /* arbitrary starting buffer size */
@@ -2277,7 +2333,8 @@ icu_language_tag(const char *loc_str)
while (true)
{
status = U_ZERO_ERROR;
- uloc_toLanguageTag(loc_str, langtag, buflen, strict, &status);
+
+ uloc_toLanguageTag(fixed_loc_str, langtag, buflen, strict, &status);
/* try again if the buffer is not large enough */
if (status == U_BUFFER_OVERFLOW_ERROR ||
@@ -2291,6 +2348,8 @@ icu_language_tag(const char *loc_str)
break;
}
+ pg_free(fixed_loc_str);
+
if (U_FAILURE(status))
{
pg_free(langtag);
diff --git a/src/test/regress/expected/collate.icu.utf8.out b/src/test/regress/expected/collate.icu.utf8.out
index ea96e27f45..692e8cdf18 100644
--- a/src/test/regress/expected/collate.icu.utf8.out
+++ b/src/test/regress/expected/collate.icu.utf8.out
@@ -1071,12 +1071,23 @@ ERROR: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
ERROR: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+CREATE COLLATION testx (provider = icu, locale = '@ASDF'); -- fails
+ERROR: could not convert locale name "@ASDF" to language tag: U_ILLEGAL_ARGUMENT_ERROR
RESET icu_validation_level;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
WARNING: could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
WARNING: ICU locale "nonsense-nowhere" has unknown language "nonsense"
HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION testx (provider = icu, locale = '@ASDF'); DROP COLLATION testx;
+WARNING: could not convert locale name "@ASDF" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+-- test special variants
+CREATE COLLATION testx (provider = icu, locale = '@EURO'); DROP COLLATION testx;
+NOTICE: using standard form "und-u-cu-eur" for ICU locale "@EURO"
+CREATE COLLATION testx (provider = icu, locale = '@pinyin'); DROP COLLATION testx;
+NOTICE: using standard form "und-u-co-pinyin" for ICU locale "@pinyin"
+CREATE COLLATION testx (provider = icu, locale = '@stroke'); DROP COLLATION testx;
+NOTICE: using standard form "und-u-co-stroke" for ICU locale "@stroke"
CREATE COLLATION test4 FROM nonsense;
ERROR: collation "nonsense" for encoding "UTF8" does not exist
CREATE COLLATION test5 FROM test0;
diff --git a/src/test/regress/sql/collate.icu.utf8.sql b/src/test/regress/sql/collate.icu.utf8.sql
index ee607ca3a5..0b90e2a5b9 100644
--- a/src/test/regress/sql/collate.icu.utf8.sql
+++ b/src/test/regress/sql/collate.icu.utf8.sql
@@ -395,9 +395,16 @@ CREATE COLLATION test3 (provider = icu, lc_collate = 'en_US.utf8'); -- fail, nee
SET icu_validation_level = ERROR;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); -- fails
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
+CREATE COLLATION testx (provider = icu, locale = '@ASDF'); -- fails
RESET icu_validation_level;
CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = '@ASDF'); DROP COLLATION testx;
+
+-- test special variants
+CREATE COLLATION testx (provider = icu, locale = '@EURO'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = '@pinyin'); DROP COLLATION testx;
+CREATE COLLATION testx (provider = icu, locale = '@stroke'); DROP COLLATION testx;
CREATE COLLATION test4 FROM nonsense;
CREATE COLLATION test5 FROM test0;
--
2.34.1
Jeff Davis wrote:
(2) Automatically change the provider to libc when locale=C.
Almost works, but it's not clear how we handle the case "provider=icu
lc_collate='fr_FR.utf8' locale=C". If we change it to "provider=libc lc_collate=C", we've overridden the
specified lc_collate. If we ignore the locale=C, that would be
surprising to users. If we throw an error, that would be a backwards
compatibility issue.
This thread started with a report illustrating that when users mention
the locale "C", they implicitly mean "C" from the libc provider, as
when libc was the default. The problem is that as soon as ICU is the
default, any reference to a libc collation should mention explicitly
that the provider is libc.
It seems that we're set on the idea of creating an exception for "C"
(and I assume also "POSIX") to avoid too much confusion, and because
"C" is quite special anyway, and has no equivalent in ICU (the switch
in v16 to ICU as the default provider is based on the premise that the
locales with the same name will behave pretty much the same with ICU
as they did with libc, but it's absolutely not the case with "C").
ISTM that if we want to go that route, we need to make the minimum
changes at the user interface level and not any deeper, so that when
(locale="C" OR locale="POSIX") AND the provider has not been specified,
then the command (initdb and create database) act as if the user had
specified provider=libc.
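To make the proposed rule concrete, here is a minimal sketch in C. The names provider_t, locale_is_c_or_posix() and resolve_provider() are hypothetical and do not come from the PostgreSQL tree; the exact, case-sensitive match on "C" and "POSIX" is also an assumption. It only shows the decision at the user-interface level: an explicitly requested provider is never overridden.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type for illustration; not the PostgreSQL definition. */
typedef enum
{
	PROVIDER_UNSPECIFIED,
	PROVIDER_LIBC,
	PROVIDER_ICU
} provider_t;

/* Assumption: only the exact spellings "C" and "POSIX" are special. */
static bool
locale_is_c_or_posix(const char *locale)
{
	return locale != NULL &&
		(strcmp(locale, "C") == 0 || strcmp(locale, "POSIX") == 0);
}

/*
 * Pick the provider for initdb / CREATE DATABASE.
 * An explicit provider always wins; otherwise "C"/"POSIX" implies libc,
 * and anything else falls back to the build-time default.
 */
static provider_t
resolve_provider(provider_t requested, const char *locale,
				 provider_t build_default)
{
	if (requested != PROVIDER_UNSPECIFIED)
		return requested;
	if (locale_is_c_or_posix(locale))
		return PROVIDER_LIBC;
	return build_default;
}
```

With this rule, "locale=C" alone resolves to libc, while "provider=icu locale=C" is left for the ICU provider to handle.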
(3) Support iculocale=C in the ICU provider using the memcmp() path.
In other words, if provider=icu and iculocale=C, lc_collate_is_c() and
lc_ctype_is_c() would both return true.
ICU does not provide a locale that behaves like that, and it doesn't
feel right to pretend it does. It feels like attacking the problem
at the wrong level.
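For reference, the memcmp() semantics at stake amount to plain byte-wise comparison. The sketch below uses a hypothetical helper, byte_compare(), which is not the PostgreSQL code path, and applies it to the two strings from earlier in the thread: '+' (0x2B) sorts before '-' (0x2D), which is the ordering users expect from "C".

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: compare two NUL-terminated strings byte by byte,
 * as memcmp() would, breaking ties by length (shorter sorts first).
 * Returns <0, 0 or >0.
 */
static int
byte_compare(const char *a, const char *b)
{
	size_t		alen = strlen(a);
	size_t		blen = strlen(b);
	size_t		minlen = (alen < blen) ? alen : blen;
	int			cmp = memcmp(a, b, minlen);

	if (cmp != 0)
		return cmp;
	if (alen == blen)
		return 0;
	return (alen < blen) ? -1 : 1;
}
```

Under this comparison 'RM(25)/nodes|+|21|1' sorts before 'RM(25)/nodes|-|21|', matching the "C" result reported upthread and differing from the "en-US-x-icu" result.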
(4) Create a new "none" provider (which has no locale and always memcmp
semantics), and automatically change the provider to "none" if
provider=icu and iculocale=C.
It still uses libc/C for character classification and case changing,
so "no locale" is technically not true. Personally I don't see
the benefit of adding a "none" provider. C is a libc locale
and libc is not disappearing. I also think that when users explicitly
indicate provider=icu, they should get icu.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
Jeff Davis <pgsql@j-davis.com> writes:
Committed, thank you.
This commit has given the PDF docs build some indigestion:
Making portrait pages on A4 paper (210mmx297mm)
/home/postgres/bin/fop -fo postgres-A4.fo -pdf postgres-A4.pdf
[WARN] FOUserAgent - Font "Symbol,normal,700" not found. Substituting with "Symbol,normal,400".
[WARN] FOUserAgent - Font "ZapfDingbats,normal,700" not found. Substituting with "ZapfDingbats,normal,400".
[WARN] FOUserAgent - Hyphenation pattern not found. URI: en.
[WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area in the inline-progression direction by 3531 millipoints. (See position 55117:2388)
[WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area in the inline-progression direction by 1871 millipoints. (See position 55117:12998)
[WARN] FOUserAgent - Glyph "?" (0x323, dotbelowcmb) not available in font "Courier".
[WARN] FOUserAgent - Glyph "?" (0x302, circumflexcmb) not available in font "Courier".
[WARN] FOUserAgent - The contents of fo:block line 12 exceed the available area in the inline-progression direction by 20182 millipoints. (See position 55172:188)
[WARN] FOUserAgent - The contents of fo:block line 10 exceed the available area in the inline-progression direction by 17682 millipoints. (See position 55172:188)
[WARN] FOUserAgent - Glyph "?" (0x142, lslash) not available in font "Times-Roman".
[WARN] PropertyMaker - span="inherit" on fo:block, but no explicit value found on the parent FO.
(The first three warnings and the last one are things we've been living
with, but the ones in between are new.)
The first two "exceed the available area" complaints are in the "ICU
Collation Levels" table. We can silence them by providing some column
width hints to make the "Description" column a tad wider than the rest,
as in the proposed patch attached. The other two, as well as the first
two glyph-not-available complaints, are caused by this bit:
Full normalization is important in some cases, such as when
multiple accents are applied to a single character. For instance,
<literal>'ệ'</literal> can be composed of code points
<literal>U&'\0065\0323\0302'</literal> or
<literal>U&'\0065\0302\0323'</literal>. With full normalization
on, these code point sequences are treated as equal; otherwise they
are unequal.
which renders just abysmally (see attached screen shot), and even in HTML
where it's rendering about as intended, it really is an unintelligible
example. Few people are going to understand that the circumflex and the
dot-below are separately applied accents. How about we drop the explicit
example and write something like
Full normalization allows code point sequences such as
characters with multiple accent marks applied in different
orders to be seen as equal.
?
(The last missing-glyph complaint is evidently from the release notes;
I'll bug Bruce about that separately.)
regards, tom lane
Attachments:
add-column-width-hints.patchtext/x-diff; charset=us-ascii; name=add-column-width-hints.patchDownload
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 9db14649aa..96a23bf530 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1140,6 +1140,14 @@ SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
<table id="icu-collation-levels">
<title>ICU Collation Levels</title>
<tgroup cols="8">
+ <colspec colname="col1" colwidth="1*"/>
+ <colspec colname="col2" colwidth="1.25*"/>
+ <colspec colname="col3" colwidth="1*"/>
+ <colspec colname="col4" colwidth="1*"/>
+ <colspec colname="col5" colwidth="1*"/>
+ <colspec colname="col6" colwidth="1*"/>
+ <colspec colname="col7" colwidth="1*"/>
+ <colspec colname="col8" colwidth="1*"/>
<thead>
<row>
<entry>Level</entry>
ICU-documentation-excerpt.pngimage/png; name=ICU-documentation-excerpt.pngDownload