ICU for global collation

Started by Peter Eisentraut · over 6 years ago · 111 messages · pgsql-hackers
#1 Peter Eisentraut
peter_e@gmx.net

Here is an initial patch to add the option to use ICU as the global
collation provider, a long-requested feature.

To activate, use something like

initdb --collation-provider=icu --locale=...

A trick here is that since we need to also still set the normal POSIX
locales, the --locale value needs to be valid as both a POSIX locale and
an ICU locale. If that doesn't work out, there is also a way to specify
it separately, e.g.,

initdb --collation-provider=icu --locale=en_US.utf8 --icu-locale=en

This complexity is unfortunate, but I don't see a way around it right now.

There are also options for createdb and CREATE DATABASE to do this for a
particular database only.

Besides this, the implementation is quite small: When starting up a
database, we create an ICU collator object, store it in a global
variable, and then use it when appropriate. All the ICU code for
creating and invoking those collators already exists of course.

For the version tracking, I use the pg_collation row for the "default"
collation. Again, this mostly reuses existing code and concepts.

Nondeterministic collations are not supported for the global collation,
because then LIKE and regular expressions don't work and that breaks
some system views. This needs some separate research.
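For reference, the LIKE limitation can be seen today with a column-level nondeterministic collation; a quick sketch (collation and table names invented here, error text approximate):

```sql
-- Case-insensitive (strength-2) ICU collation, marked nondeterministic
CREATE COLLATION ci (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
CREATE TABLE t (s text COLLATE ci);
SELECT * FROM t WHERE s LIKE 'a%';
-- ERROR:  nondeterministic collations are not supported for LIKE
```

A database-wide nondeterministic default collation would make every such query fail, including the system views mentioned above.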

To test, run the existing regression tests against a database
initialized with ICU. Perhaps some options for pg_regress could
facilitate that.

I fear that the Localization chapter in the documentation will need a
bit of a rewrite after this, because the hitherto separately treated
concepts of locale and collation are fusing together. I haven't done
that here yet, but that would be the plan for later.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v1-0001-Add-option-to-use-ICU-as-global-collation-provide.patch (text/plain, +417 −106)
#2 Andrey Borodin
amborodin@acm.org
In reply to: Peter Eisentraut (#1)
Re: ICU for global collation

Hi!

On 20 Aug 2019, at 19:21, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

Here is an initial patch to add the option to use ICU as the global
collation provider, a long-requested feature.

To activate, use something like

initdb --collation-provider=icu --locale=...

A trick here is that since we need to also still set the normal POSIX
locales, the --locale value needs to be valid as both a POSIX locale and
an ICU locale. If that doesn't work out, there is also a way to specify
it separately, e.g.,

initdb --collation-provider=icu --locale=en_US.utf8 --icu-locale=en

Thanks! This is a much-awaited feature.

Seems like the user cannot change the locale for a database if ICU is already chosen?

postgres=# \l
                              List of databases
   Name    | Owner | Encoding | Collate | Ctype | Provider | Access privileges
-----------+-------+----------+---------+-------+----------+-------------------
 postgres  | x4mmm | UTF8     | ru_RU   | ru_RU | icu      |
 template0 | x4mmm | UTF8     | ru_RU   | ru_RU | icu      | =c/x4mmm         +
           |       |          |         |       |          | x4mmm=CTc/x4mmm
 template1 | x4mmm | UTF8     | ru_RU   | ru_RU | icu      | =c/x4mmm         +
           |       |          |         |       |          | x4mmm=CTc/x4mmm
(3 rows)

postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
CREATE DATABASE
postgres=# \c a
2019-08-21 11:43:40.379 +05 [41509] FATAL: collations with different collate and ctype values are not supported by ICU
FATAL: collations with different collate and ctype values are not supported by ICU
Previous connection kept

Am I missing something?

BTW, psql does not know about collation_provider.

Best regards, Andrey Borodin.

#3 Peter Eisentraut
peter_e@gmx.net
In reply to: Andrey Borodin (#2)
Re: ICU for global collation

On 2019-08-21 08:56, Andrey Borodin wrote:

postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
CREATE DATABASE
postgres=# \c a
2019-08-21 11:43:40.379 +05 [41509] FATAL: collations with different collate and ctype values are not supported by ICU
FATAL: collations with different collate and ctype values are not supported by ICU

Try

create database a template template0 collation_provider icu locale
'en_US.utf8';

which sets both lc_collate and lc_ctype. But 'en_US.utf8' is not a
valid ICU locale name. Perhaps use 'en' or 'en-US'.

I'm making a note that we should prevent creating a database with a
faulty locale configuration in the first place instead of failing when
we're connecting.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4 Andrey Borodin
amborodin@acm.org
In reply to: Peter Eisentraut (#3)
Re: ICU for global collation

On 21 Aug 2019, at 12:23, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:

On 2019-08-21 08:56, Andrey Borodin wrote:

postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
CREATE DATABASE
postgres=# \c a
2019-08-21 11:43:40.379 +05 [41509] FATAL: collations with different collate and ctype values are not supported by ICU
FATAL: collations with different collate and ctype values are not supported by ICU

Try

create database a template template0 collation_provider icu locale
'en_US.utf8';

which sets both lc_collate and lc_ctype. But 'en_US.utf8' is not a
valid ICU locale name. Perhaps use 'en' or 'en-US'.

I'm making a note that we should prevent creating a database with a
faulty locale configuration in the first place instead of failing when
we're connecting.

Yes, the problem is that input with lc_collate is accepted:
postgres=# create database a template template0 collation_provider icu lc_collate 'en_US.utf8';
CREATE DATABASE
postgres=# \c a
2019-09-11 10:01:00.373 +05 [56878] FATAL: collations with different collate and ctype values are not supported by ICU
FATAL: collations with different collate and ctype values are not supported by ICU
Previous connection kept
postgres=# create database b template template0 collation_provider icu locale 'en_US.utf8';
CREATE DATABASE
postgres=# \c b
You are now connected to database "b" as user "x4mmm".

I get the same output with 'en' or 'en-US'.

Also, a cluster initialized --with-icu started on binaries built without ICU just fine.
Only after some time did I get the message "ERROR: ICU is not supported in this build".
Is this expected behavior? Maybe we should refuse to start without ICU?

Best regards, Andrey Borodin.

#5 Daniel Verite
daniel@manitou-mail.org
In reply to: Peter Eisentraut (#1)
Re: ICU for global collation

Hi,

When trying databases defined with ICU locales, I see that backends
that serve such databases seem to have their LC_CTYPE inherited from
the environment (as opposed to a per-database fixed value).

That's a problem for the backend code that depends on libc functions
that themselves depend on LC_CTYPE, such as the full text search parser
and dictionaries.

For instance, if you start the instance with a C locale
(LC_ALL=C pg_ctl ...) and try to use FTS in an ICU UTF-8 database,
it doesn't work:

template1=# create database "fr-utf8"
template 'template0' encoding UTF8
locale 'fr'
collation_provider 'icu';

template1=# \c fr-utf8
You are now connected to database "fr-utf8" as user "daniel".

fr-utf8=# show lc_ctype;
lc_ctype
----------
fr
(1 row)

fr-utf8=# select to_tsvector('été');
ERROR: invalid multibyte character for locale
HINT: The server's LC_CTYPE locale is probably incompatible with the
database encoding.

If I peek into the "real" LC_CTYPE when connected to this database,
I can see it's "C":

fr-utf8=# create extension plperl;
CREATE EXTENSION

fr-utf8=# create function lc_ctype() returns text as '$ENV{LC_CTYPE};'
language plperl;
CREATE FUNCTION

fr-utf8=# select lc_ctype();
lc_ctype
----------
C

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

#6 Timmer, Marius
marius.timmer@uni-muenster.de
In reply to: Daniel Verite (#5)
Re: ICU for global collation

Hi everyone,

Like the others before me, we (the University of Münster) are happy to
see this feature as well. Thank you for this.

When I applied the patch two weeks ago I ran into the issue that initdb
did not recognize the new parameters (collation-provider and icu-locale),
but I guess that was caused by my own stupidity.

When trying databases defined with ICU locales, I see that backends
that serve such databases seem to have their LC_CTYPE inherited from
the environment (as opposed to a per-database fixed value).

I am able to recreate the issue described by Daniel on my machine.

Now it works as expected. I just had to update the patch since commit
3f6b3be3 had modified two lines, which resulted in conflicts. You will find
the updated patch attached to this mail.

Best regards,

Marius Timmer

--
Westfälische Wilhelms-Universität Münster (WWU)
Zentrum für Informationsverarbeitung (ZIV)
Röntgenstraße 7-13
Besucheradresse: Einsteinstraße 60 - Raum 107
48149 Münster
+49 251 83 31158
marius.timmer@uni-muenster.de
https://www.uni-muenster.de/ZIV

Attachments:

v1-0002-Add-option-to-use-ICU-as-global-collation-provide_rebased.patch (text/x-patch, +417 −106)
#7 Thomas Munro
thomas.munro@gmail.com
In reply to: Timmer, Marius (#6)
Re: ICU for global collation

On Wed, Oct 9, 2019 at 12:16 AM Marius Timmer
<marius.timmer@uni-muenster.de> wrote:

like the others before me we (the University of Münster) are happy to
see this feature as well. Thank you for this.

When I applied the patch two weeks ago I ran into the issue that initdb
did not recognize the new parameters (collation-provider and icu-locale),
but I guess it was caused by my own stupidity.

When trying databases defined with ICU locales, I see that backends
that serve such databases seem to have their LC_CTYPE inherited from
the environment (as opposed to a per-database fixed value).

I am able to recreate the issue described by Daniel on my machine.

Now it works as expected. I just had to update the patch since commit
3f6b3be3 had modified two lines which resulted in conflicts. You find
the updated patch as attachement to this mail.

I rebased this patch, and tweaked get_collation_action_version() very
slightly so that you get collation version change detection (of the
ersatz kind provided by commit d5ac14f9) for the default collation
even when not using ICU. Please see attached.

+struct pg_locale_struct global_locale;

Why not "default_locale"? Where is the terminology "global" coming from?

+ Specifies the collation provider for the database.

"for the database's default collation"?

Attachments:

v2-0001-Add-option-to-use-ICU-as-global-collation-provide.patch (application/octet-stream, +413 −106)
#8 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#7)
Re: ICU for global collation

On Thu, Oct 17, 2019 at 3:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I rebased this patch, and tweaked get_collation_action_version() very
slightly so that you get collation version change detection (of the
ersatz kind provided by commit d5ac14f9) for the default collation
even when not using ICU. Please see attached.

It should also remove the sentence I recently added to
alter_collation.sgml to say that the default collation doesn't have
version tracking. Rereading that section, it's also clear that the
query introduced with:

"The following query can be used to identify all collations in the current
database that need to be refreshed and the objects that depend on them:"

… is wrong with this patch applied. The right query is quite hard to
come up with, since we don't explicitly track dependencies on the
default collation. That is, there is no pg_depend entry pointing from
the index to the collation when you write CREATE INDEX ON t(x) for a
text column using the default collation, but there is one when you
write CREATE INDEX ON t(x COLLATE "fr_FR"), or when you write CREATE
INDEX ON t(x) for a text column that was explicitly defined to use
COLLATE "fr_FR". One solution is that we could start tracking those
dependencies explicitly too.
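For context, the query that documentation section introduces is essentially the following (reproduced here from memory of the docs of that era, so treat it as approximate); it relies entirely on explicit pg_depend entries pointing at pg_collation rows, which is exactly why dependencies on the "default" collation never show up in it:

```sql
SELECT pg_describe_object(refclassid, refobjid, refobjsubid) AS "Collation",
       pg_describe_object(classid, objid, objsubid) AS "Object"
  FROM pg_depend d JOIN pg_collation c
       ON refclassid = 'pg_collation'::regclass AND refobjid = c.oid
 WHERE c.collversion <> pg_collation_actual_version(c.oid)
 ORDER BY 1, 2;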

A preexisting problem with that query is that it doesn't report
transitive dependencies. An index on t(x) of a user defined type
defined with CREATE TYPE my_type AS (x text COLLATE "fr_FR") doesn't
result in a pg_depend row from index to collation, so the query fails
to report that as an index needing to be rebuilt. You could fix that
with a sprinkle of recursive magic, but you'd need a different kind of
magic to deal with transitive dependencies on the default collation
unless we start listing such dependencies explicitly. In that
example, my_type would need to depend on collation "default". You
can't just do some kind of search for transitive dependencies on type
"text", because they aren't tracked either.

In my longer term proposal to track per-dependency versions, either by
adding refobjversion to pg_depend or by creating another
pg_depend-like catalog, you'd almost certainly need to add an explicit
record for dependencies on the default collation.

#9 Peter Eisentraut
peter_e@gmx.net
In reply to: Daniel Verite (#5)
Re: ICU for global collation

On 2019-09-17 15:08, Daniel Verite wrote:

When trying databases defined with ICU locales, I see that backends
that serve such databases seem to have their LC_CTYPE inherited from
the environment (as opposed to a per-database fixed value).

fr-utf8=# select to_tsvector('été');
ERROR: invalid multibyte character for locale
HINT: The server's LC_CTYPE locale is probably incompatible with the
database encoding.

I looked into this problem. The way to address this would be adding
proper collation support to the text search subsystem. See the TODO
markers in src/backend/tsearch/ts_locale.c for starting points. These
APIs spread out to a lot of places, so it will take some time to finish.
In the meantime, I'm pausing this thread and will set the CF entry as RwF.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10 Daniel Verite
daniel@manitou-mail.org
In reply to: Peter Eisentraut (#9)
Re: ICU for global collation

Peter Eisentraut wrote:

I looked into this problem. The way to address this would be adding
proper collation support to the text search subsystem. See the TODO
markers in src/backend/tsearch/ts_locale.c for starting points. These
APIs spread out to a lot of places, so it will take some time to finish.
In the meantime, I'm pausing this thread and will set the CF entry as RwF.

Even if the FTS code is improved in that regard, any extension code
with libc functions depending on LC_CTYPE is still going to be
potentially problematic, in particular when LC_CTYPE happens to be set
to a different encoding than the database's.

Couldn't we simply invent per-database GUC options, as in
ALTER DATABASE myicudb SET libc_lc_ctype TO 'value';
ALTER DATABASE myicudb SET libc_lc_collate TO 'value';

where libc_lc_ctype/libc_lc_collate would specifically set
the values of the LC_CTYPE and LC_COLLATE environment variables
of any backend serving the corresponding database?

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

#11 Peter Eisentraut
peter_e@gmx.net
In reply to: Daniel Verite (#10)
Re: ICU for global collation

On 2019-11-01 19:18, Daniel Verite wrote:

Even if the FTS code is improved in that regard, any extension code
with libc functions depending on LC_CTYPE is still going to be
potentially problematic, in particular when LC_CTYPE happens to be set
to a different encoding than the database's.

I think the answer here is that extension code must not do that, at
least in ways that potentially interact with other parts of the
(collation-aware) database system. For example, libc and ICU might have
different opinions about what is a letter, because of different versions
of Unicode data in use. That would then affect tokenization etc. in
text search and elsewhere. That's why things like isalpha have to go
though ICU instead, if that is the collation provider in a particular
context.

Couldn't we simply invent per-database GUC options, as in
ALTER DATABASE myicudb SET libc_lc_ctype TO 'value';
ALTER DATABASE myicudb SET libc_lc_collate TO 'value';

where libc_lc_ctype/libc_lc_collate would specifically set
the values of the LC_CTYPE and LC_COLLATE environment variables
of any backend serving the corresponding database?

We could do that as a transition measure to support extensions like you
mention above. But our own internal code should not have to rely on that.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12 Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#11)
Re: ICU for global collation

There were a few inquiries about this topic recently, so I dug up the
old thread and patch. What we got stuck on last time was that we can't
just swap out all locale support in a database for ICU. We still need
to set the usual locale environment, otherwise some things that are not
ICU aware will break or degrade. I had initially anticipated fixing
that by converting everything that uses libc locales to ICU. But that
turned out to be tedious and ultimately not very useful as far as the
user-facing result is concerned, so I gave up.

So this is a different approach: If you choose ICU as the default locale
for a database, you still need to specify lc_ctype and lc_collate
settings, as before. Unlike in the previous patch, where the ICU
collation name was written in datcollate, there is now a third column
(daticucoll), so we can store all three values. This fixes the
described problem. Other than that, once you get all the initial
settings right, it basically just works: The places that have ICU
support now will use a database-wide ICU collation if appropriate, the
places that don't have ICU support continue to use the global libc
locale settings.

I changed the datcollate, datctype, and the new daticucoll fields to
type text (from name). That way, the daticucoll field can be set to
null if it's not applicable. Also, the limit of 63 characters can
actually be a problem if you want to use some combination of the options
that ICU locales offer. And for less extreme uses, having
variable-length fields will save some storage, since typical locale
names are much shorter.

For the same reasons and to keep things consistent, I also changed the
analogous pg_collation fields like that. This also removes some weird
code that has to check that colcollate and colctype have to be the same
for ICU, so it's overall cleaner.

Attachments:

v3-0001-Add-option-to-use-ICU-as-global-collation-provide.patch (text/plain, +665 −200)
#13 Julien Rouhaud
rjuju123@gmail.com
In reply to: Peter Eisentraut (#12)
Re: ICU for global collation

Hi,

On Thu, Dec 30, 2021 at 01:07:21PM +0100, Peter Eisentraut wrote:

So this is a different approach: If you choose ICU as the default locale for
a database, you still need to specify lc_ctype and lc_collate settings, as
before. Unlike in the previous patch, where the ICU collation name was
written in datcollate, there is now a third column (daticucoll), so we can
store all three values. This fixes the described problem. Other than that,
once you get all the initial settings right, it basically just works: The
places that have ICU support now will use a database-wide ICU collation if
appropriate, the places that don't have ICU support continue to use the
global libc locale settings.

That looks sensible to me.

@@ -2774,6 +2776,7 @@ dumpDatabase(Archive *fout)
appendPQExpBuffer(dbQry, "SELECT tableoid, oid, datname, "
"(%s datdba) AS dba, "
"pg_encoding_to_char(encoding) AS encoding, "
+ "datcollprovider, "

This needs to be in a new pg 15+ branch, not in the pg 9.3+.

-	if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID)
-		mylocale = pg_newlocale_from_collation(collid);
+	if (!lc_collate_is_c(collid))
+	{
+		if (collid != DEFAULT_COLLATION_OID)
+			mylocale = pg_newlocale_from_collation(collid);
+		else if (default_locale.provider == COLLPROVIDER_ICU)
+			mylocale = &default_locale;
+	}

There are really a lot of places with this new code. Maybe it could be some
new function/macro to wrap that for the normal case (e.g. not formatting.c)?

#14 Peter Eisentraut
peter_e@gmx.net
In reply to: Julien Rouhaud (#13)
Re: ICU for global collation

On 04.01.22 03:21, Julien Rouhaud wrote:

@@ -2774,6 +2776,7 @@ dumpDatabase(Archive *fout)
appendPQExpBuffer(dbQry, "SELECT tableoid, oid, datname, "
"(%s datdba) AS dba, "
"pg_encoding_to_char(encoding) AS encoding, "
+ "datcollprovider, "

This needs to be in a new pg 15+ branch, not in the pg 9.3+.

ok

-	if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID)
-		mylocale = pg_newlocale_from_collation(collid);
+	if (!lc_collate_is_c(collid))
+	{
+		if (collid != DEFAULT_COLLATION_OID)
+			mylocale = pg_newlocale_from_collation(collid);
+		else if (default_locale.provider == COLLPROVIDER_ICU)
+			mylocale = &default_locale;
+	}

There are really a lot of places with this new code. Maybe it could be some
new function/macro to wrap that for the normal case (e.g. not formatting.c)?

Right, we could just put this into pg_newlocale_from_collation(), but
the comment there says

* In fact, they shouldn't call this function at all when they are dealing
* with the default locale. That can save quite a bit in hotspots.

I don't know how to assess that.

We could pack this into a macro or inline function if we are concerned
about this.

#15 Julien Rouhaud
rjuju123@gmail.com
In reply to: Peter Eisentraut (#14)
Re: ICU for global collation

On Tue, Jan 04, 2022 at 05:03:10PM +0100, Peter Eisentraut wrote:

On 04.01.22 03:21, Julien Rouhaud wrote:

-	if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID)
-		mylocale = pg_newlocale_from_collation(collid);
+	if (!lc_collate_is_c(collid))
+	{
+		if (collid != DEFAULT_COLLATION_OID)
+			mylocale = pg_newlocale_from_collation(collid);
+		else if (default_locale.provider == COLLPROVIDER_ICU)
+			mylocale = &default_locale;
+	}

There are really a lot of places with this new code. Maybe it could be some
new function/macro to wrap that for the normal case (e.g. not formatting.c)?

Right, we could just put this into pg_newlocale_from_collation(), but the
comment there says

* In fact, they shouldn't call this function at all when they are dealing
* with the default locale. That can save quite a bit in hotspots.

I don't know how to assess that.

We could pack this into a macro or inline function if we are concerned about
this.

Yes, that was my idea: just have a new function (an inline function or a macro,
then, since pg_newlocale_from_collation() clearly warns about performance
concerns) that contains the whole
is-not-c-collation-and-is-default-collation-or-icu-collation logic and calls
pg_newlocale_from_collation() only when needed.

#16 Jim Finnerty
jfinnert@amazon.com
In reply to: Julien Rouhaud (#15)
Re: ICU for global collation

I didn't notice anything version-specific about the patch. Would any modifications be needed to backport it to pg13 and pg14?

After this patch goes in, the big next thing would be to support nondeterministic collations for LIKE, ILIKE and pattern matching operators in general. Is anyone interested in working on that?

On 1/5/22, 10:36 PM, "Julien Rouhaud" <rjuju123@gmail.com> wrote:


On Tue, Jan 04, 2022 at 05:03:10PM +0100, Peter Eisentraut wrote:

On 04.01.22 03:21, Julien Rouhaud wrote:

-	if (!lc_collate_is_c(collid) && collid != DEFAULT_COLLATION_OID)
-		mylocale = pg_newlocale_from_collation(collid);
+	if (!lc_collate_is_c(collid))
+	{
+		if (collid != DEFAULT_COLLATION_OID)
+			mylocale = pg_newlocale_from_collation(collid);
+		else if (default_locale.provider == COLLPROVIDER_ICU)
+			mylocale = &default_locale;
+	}

There are really a lot of places with this new code. Maybe it could be some
new function/macro to wrap that for the normal case (e.g. not formatting.c)?

Right, we could just put this into pg_newlocale_from_collation(), but the
comment there says

* In fact, they shouldn't call this function at all when they are dealing
* with the default locale. That can save quite a bit in hotspots.

I don't know how to assess that.

We could pack this into a macro or inline function if we are concerned about
this.

Yes that was my idea, just have a new function (inline function or a macro
then since pg_newlocale_from_collation() clearly warns about performance
concerns) that have the whole
is-not-c-collation-and-is-default-collation-or-icu-collation logic and calls
pg_newlocale_from_collation() only when needed.

#17 Julien Rouhaud
rjuju123@gmail.com
In reply to: Jim Finnerty (#16)
Re: ICU for global collation

On Thu, Jan 06, 2022 at 01:55:55PM +0000, Finnerty, Jim wrote:

I didn't notice anything version-specific about the patch. Would any
modifications be needed to backport it to pg13 and pg14?

This is a new feature, so it can't be backported. The changes aren't big and
mostly touch places that haven't changed in a long time, so I don't think that
it would take much effort if you wanted to backport it to your own forks.

After this patch goes in, the big next thing would be to support
nondeterministic collations for LIKE, ILIKE and pattern matching operators in
general. Is anyone interested in working on that?

As far as I know, you're the last person who seemed to be working on that topic
back in March :)

#18 Julien Rouhaud
rjuju123@gmail.com
In reply to: Peter Eisentraut (#12)
Re: ICU for global collation

Hi,

I looked a bit more in this patch and I have some additional remarks.

On Thu, Dec 30, 2021 at 01:07:21PM +0100, Peter Eisentraut wrote:

So this is a different approach: If you choose ICU as the default locale for
a database, you still need to specify lc_ctype and lc_collate settings, as
before. Unlike in the previous patch, where the ICU collation name was
written in datcollate, there is now a third column (daticucoll), so we can
store all three values. This fixes the described problem. Other than that,
once you get all the initial settings right, it basically just works: The
places that have ICU support now will use a database-wide ICU collation if
appropriate, the places that don't have ICU support continue to use the
global libc locale settings.

So just to confirm: a database can now have two different *default* collations, a
libc-based one for everything that doesn't work with ICU and an ICU-based one (if
specified) for everything else. The ICU-based one is optional, so if it's not
provided everything works as before, using the libc-based default collation.

As I mentioned, I think this approach is sensible. However, should we document
which things are not ICU-aware?

I changed the datcollate, datctype, and the new daticucoll fields to type
text (from name). That way, the daticucoll field can be set to null if it's
not applicable. Also, the limit of 63 characters can actually be a problem
if you want to use some combination of the options that ICU locales offer.
And for less extreme uses, having variable-length fields will save some
storage, since typical locale names are much shorter.

I understand the need to have daticucoll as text; however, it's not clear to me
why this has to be changed for datcollate and datctype. IIUC those will only
ever contain libc-based collations and are still mandatory?

For the same reasons and to keep things consistent, I also changed the
analogous pg_collation fields like that.

The respective fields in pg_collation are now nullable, so the changes there
sound ok.

Digging a bit more in the patch here are some things that looks problematic.

- pg_upgrade

It checks (in check_locale_and_encoding()) the compatibility for each database,
and it looks like the daticucoll field should also be verified. Other than
that I don't think there is anything else needed for the pg_upgrade part as
everything else should be handled by pg_dump (I didn't look at the changes yet
given the below problems).

- CREATE DATABASE

There's a new COLLATION_PROVIDER option, but the overall behavior seems quite
unintuitive. As far as I can see the idea is to use LOCALE for the ICU default
collation, but as it's providing a default for the other values it's quite
annoying. For instance:

=# CREATE DATABASE db COLLATION_PROVIDER icu LOCALE 'fr-x-icu' LC_COLLATE 'en_GB.UTF-8';;
ERROR: 42809: invalid locale name: "fr-x-icu"
LOCATION: createdb, dbcommands.c:397

Looking at the code, it's actually complaining about LC_CTYPE. If you want a
database with an ICU default collation, the lc_collate and lc_ctype should
inherit what's in the template database and not what was provided in
LOCALE, I think. You could still probably override them in some scenarios, but
without a list of what isn't ICU-aware I can't really be sure of how often one
might have to do it.

Now, if I specify everything as needed it looks like it's missing some checks
on the ICU default collation when not using template0:

=# CREATE DATABASE db COLLATION_PROVIDER icu LOCALE 'en-x-icu' LC_COLLATE 'en_GB.UTF-8' LC_CTYPE 'en_GB.UTF-8';;
CREATE DATABASE

=# SELECT datname, datcollate, datctype, daticucoll FROM pg_database ;
  datname  | datcollate  |  datctype   | daticucoll
-----------+-------------+-------------+------------
 postgres  | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
 db        | en_GB.UTF-8 | en_GB.UTF-8 | en-x-icu
 template1 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
 template0 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
(4 rows)

Unless I'm missing something the same concerns about collation incompatibility
with objects in the source database should also apply for the ICU collation?

While at it, I'm not exactly sure what COLLATION_PROVIDER is supposed to
mean, as the same command but with a libc provider is accepted and has
exactly the same result:

=# CREATE DATABASE db2 COLLATION_PROVIDER libc LOCALE 'en-x-icu' LC_COLLATE 'en_GB.UTF-8' LC_CTYPE 'en_GB.UTF-8';;
CREATE DATABASE

=# SELECT datname, datcollate, datctype, daticucoll FROM pg_database ;
  datname  | datcollate  |  datctype   | daticucoll
-----------+-------------+-------------+------------
 postgres  | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
 db        | en_GB.UTF-8 | en_GB.UTF-8 | en-x-icu
 template1 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
 template0 | en_GB.UTF-8 | en_GB.UTF-8 | fr-x-icu
 db2       | en_GB.UTF-8 | en_GB.UTF-8 | en-x-icu
(5 rows)

Shouldn't db2 have a NULL daticucoll, and if so also complain about
incompatibility for it?

- initdb

I don't think that initdb --collation-provider icu should be allowed without
--icu-locale, same for --collation-provider libc *with* --icu-locale.

When trying that, I can also see that the NULL handling for daticucoll is
broken in the BKI:

$ initdb -k --collation-provider icu
[...]
Success. You can now start the database server using:

=# select datname, datcollate, datctype, daticucoll from pg_database ;
  datname  | datcollate  |  datctype   | daticucoll
-----------+-------------+-------------+------------
 postgres  | en_GB.UTF-8 | en_GB.UTF-8 | en_GB
 template1 | en_GB.UTF-8 | en_GB.UTF-8 | en_GB
 template0 | en_GB.UTF-8 | en_GB.UTF-8 | en_GB
(3 rows)

There's a fallback on my LANG/LC_* env settings, but I don't think it can ever
be correct given the different naming convention in ICU (at least the s/_/-/).

And

$ initdb -k --collation-provider libc --icu-locale test
[...]
Success. You can now start the database server using:

=# select datname, datcollate, datctype, daticucoll from pg_database ;
datname | datcollate | datctype | daticucoll
-----------+-------------+-------------+------------
postgres | en_GB.UTF-8 | en_GB.UTF-8 | _null_
template1 | en_GB.UTF-8 | en_GB.UTF-8 | _null_
template0 | en_GB.UTF-8 | en_GB.UTF-8 | _null_
(3 rows)

#19 Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#14)
Re: ICU for global collation

On 04.01.22 17:03, Peter Eisentraut wrote:

There are really a lot of places with this new code.  Maybe it could be
some new function/macro to wrap that for the normal case (e.g. not
formatting.c)?

Right, we could just put this into pg_newlocale_from_collation(), but
the comment there says

 * In fact, they shouldn't call this function at all when they are dealing
 * with the default locale.  That can save quite a bit in hotspots.

I don't know how to assess that.

I tested this a bit. I used the following setup:

create table t1 (a text);
insert into t1 select md5(generate_series(1, 10000000)::text);
select count(*) from t1 where a > '';

And then I changed in varstr_cmp():

if (collid != DEFAULT_COLLATION_OID)
mylocale = pg_newlocale_from_collation(collid);

to just

mylocale = pg_newlocale_from_collation(collid);

I find that the \timing results are indistinguishable. (I used locale
"en_US.UTF-8" and made sure that that code path is actually hit.)

Does anyone have other insights?

#20 Daniel Verite
daniel@manitou-mail.org
In reply to: Julien Rouhaud (#18)
Re: ICU for global collation

Julien Rouhaud wrote:

If you want a database with an ICU default collation the lc_collate
and lc_ctype should inherit what's in the template database and not
what was provided in the LOCALE I think. You could still probably
overload them in some scenario, but without a list of what isn't
ICU-aware I can't really be sure of how often one might have to do
it.

I guess we'd need that when creating a database with a different
encoding than the template databases, at least.

About what's not ICU-aware, I believe the most significant part in
core is the Full Text Search parser.
It doesn't care about sorting strings, but it relies on the functions
from POSIX <ctype.h>, which depend on LC_CTYPE
(it seems, however, that this could be improved by following
what has been done in backend/regex/regc_pg_locale.c, which has
comparable needs and calls ICU functions where applicable).

Also, any extension is potentially concerned. Surely many
extensions call functions from ctype.h assuming that
the current value of LC_CTYPE works with the data they handle.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite
