pgsql: Allow tailoring of ICU locales with custom rules

Started by Peter Eisentrautalmost 3 years ago7 messages
#1Peter Eisentraut
peter@eisentraut.org

Allow tailoring of ICU locales with custom rules

This exposes the ICU facility to add custom collation rules to a
standard collation.

New options are added to CREATE COLLATION, CREATE DATABASE, createdb,
and initdb to set the rules.

Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at>
Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Discussion: /messages/by-id/821c71a4-6ef0-d366-9acf-bb8e367f739f@enterprisedb.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/30a53b792959b36f07200dae246067b3adbcc0b9

Modified Files
--------------
doc/src/sgml/catalogs.sgml | 18 +++++
doc/src/sgml/ref/create_collation.sgml | 22 ++++++
doc/src/sgml/ref/create_database.sgml | 14 ++++
doc/src/sgml/ref/createdb.sgml | 10 +++
doc/src/sgml/ref/initdb.sgml | 10 +++
src/backend/catalog/pg_collation.c | 5 ++
src/backend/commands/collationcmds.c | 23 +++++-
src/backend/commands/dbcommands.c | 51 ++++++++++++-
src/backend/utils/adt/pg_locale.c | 41 +++++++++-
src/backend/utils/init/postinit.c | 11 ++-
src/bin/initdb/initdb.c | 15 +++-
src/bin/pg_dump/pg_dump.c | 37 +++++++++
src/bin/psql/describe.c | 100 ++++++++++++++++---------
src/bin/scripts/createdb.c | 11 +++
src/include/catalog/catversion.h | 2 +-
src/include/catalog/pg_collation.h | 2 +
src/include/catalog/pg_database.dat | 2 +-
src/include/catalog/pg_database.h | 3 +
src/include/utils/pg_locale.h | 1 +
src/test/regress/expected/collate.icu.utf8.out | 30 ++++++++
src/test/regress/expected/psql.out | 18 ++---
src/test/regress/sql/collate.icu.utf8.sql | 13 ++++
22 files changed, 380 insertions(+), 59 deletions(-)

#2Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#1)
Re: pgsql: Allow tailoring of ICU locales with custom rules

On Wed, 2023-03-08 at 16:03 +0000, Peter Eisentraut wrote:

Allow tailoring of ICU locales with custom rules

Late review:

* Should throw error when provider != icu and rules != NULL

* Explain what the example means. By itself, users might get confused
wondering why someone would want to do that.

* Also consider a more practical example?

* It appears rules IS NULL behaves differently from rules=''. Is that
desired? For instance:
create collation c1(provider=icu,
locale='und-u-ka-shifted-ks-level1',
deterministic=false);
create collation c2(provider=icu,
locale='und-u-ka-shifted-ks-level1',
rules='',
deterministic=false);
select 'a b' collate c1 = 'ab' collate c1; -- true
select 'a b' collate c2 = 'ab' collate c2; -- false

* Can you document the interaction between locale keywords
("@colStrength=primary") and a rule like '[strength 2]'?

Regards,
Jeff Davis

#3Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: Jeff Davis (#2)
Re: pgsql: Allow tailoring of ICU locales with custom rules

On 08.03.23 21:57, Jeff Davis wrote:

On Wed, 2023-03-08 at 16:03 +0000, Peter Eisentraut wrote:

Allow tailoring of ICU locales with custom rules

Late review:

* Should throw error when provider != icu and rules != NULL

I have fixed that.

* Explain what the example means. By itself, users might get confused
wondering why someone would want to do that.

* Also consider a more practical example?

I have added a more practical example with explanation.

* It appears rules IS NULL behaves differently from rules=''. Is that
desired? For instance:
create collation c1(provider=icu,
locale='und-u-ka-shifted-ks-level1',
deterministic=false);
create collation c2(provider=icu,
locale='und-u-ka-shifted-ks-level1',
rules='',
deterministic=false);
select 'a b' collate c1 = 'ab' collate c1; -- true
select 'a b' collate c2 = 'ab' collate c2; -- false

I'm puzzled by this. The general behavior is, extract the rules of the
original locale, append the custom rules, use that. If the custom rules
are the empty string, that should match using the original rules
untouched. Needs further investigation.

* Can you document the interaction between locale keywords
("@colStrength=primary") and a rule like '[strength 2]'?

I'll look into that.

#4Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Eisentraut (#3)
Re: pgsql: Allow tailoring of ICU locales with custom rules

On Fri, Mar 10, 2023 at 3:24 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

On 08.03.23 21:57, Jeff Davis wrote:

* It appears rules IS NULL behaves differently from rules=''. Is that
desired? For instance:
create collation c1(provider=icu,
locale='und-u-ka-shifted-ks-level1',
deterministic=false);
create collation c2(provider=icu,
locale='und-u-ka-shifted-ks-level1',
rules='',
deterministic=false);
select 'a b' collate c1 = 'ab' collate c1; -- true
select 'a b' collate c2 = 'ab' collate c2; -- false

I'm puzzled by this. The general behavior is, extract the rules of the
original locale, append the custom rules, use that. If the custom rules
are the empty string, that should match using the original rules
untouched. Needs further investigation.

* Can you document the interaction between locale keywords
("@colStrength=primary") and a rule like '[strength 2]'?

I'll look into that.

This thread is listed on PostgreSQL 16 Open Items list. This is a
gentle reminder to see if there is a plan to move forward with respect
to open points.

--
With Regards,
Amit Kapila.

#5Peter Eisentraut
peter@eisentraut.org
In reply to: Amit Kapila (#4)
Re: pgsql: Allow tailoring of ICU locales with custom rules

On 24.07.23 04:46, Amit Kapila wrote:

On Fri, Mar 10, 2023 at 3:24 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

On 08.03.23 21:57, Jeff Davis wrote:

* It appears rules IS NULL behaves differently from rules=''. Is that
desired? For instance:
create collation c1(provider=icu,
locale='und-u-ka-shifted-ks-level1',
deterministic=false);
create collation c2(provider=icu,
locale='und-u-ka-shifted-ks-level1',
rules='',
deterministic=false);
select 'a b' collate c1 = 'ab' collate c1; -- true
select 'a b' collate c2 = 'ab' collate c2; -- false

I'm puzzled by this. The general behavior is, extract the rules of the
original locale, append the custom rules, use that. If the custom rules
are the empty string, that should match using the original rules
untouched. Needs further investigation.

* Can you document the interaction between locale keywords
("@colStrength=primary") and a rule like '[strength 2]'?

I'll look into that.

This thread is listed on PostgreSQL 16 Open Items list. This is a
gentle reminder to see if there is a plan to move forward with respect
to open points.

I have investigated this. My assessment is that how PostgreSQL
interfaces with ICU is correct. Whether what ICU does is correct might
be debatable. I have filed a bug with ICU about this:
https://unicode-org.atlassian.net/browse/ICU-22456 , but there is no
response yet.

You can work around this by including the desired attributes in the
rules string, for example

create collation c3 (provider=icu,
locale='und-u-ka-shifted-ks-level1',
rules='[alternate shifted][strength 1]',
deterministic=false);

So I don't think there is anything we need to do here for PostgreSQL 16.

#6Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#5)
Re: pgsql: Allow tailoring of ICU locales with custom rules

On Mon, 2023-08-14 at 10:34 +0200, Peter Eisentraut wrote:

I have investigated this.  My assessment is that how PostgreSQL
interfaces with ICU is correct.  Whether what ICU does is correct
might
be debatable.  I have filed a bug with ICU about this:
https://unicode-org.atlassian.net/browse/ICU-22456 , but there is no
response yet.

Is everything other than the language and region simply discarded when
a rules string is present, or are some attributes preserved, or is
there some other nuance?

You can work around this by including the desired attributes in the
rules string, for example

     create collation c3 (provider=icu,
       locale='und-u-ka-shifted-ks-level1',
       rules='[alternate shifted][strength 1]',
       deterministic=false);

So I don't think there is anything we need to do here for PostgreSQL
16.

Is there some way we can warn a user that some attributes will be
discarded, or improve the documentation? Letting the user figure this
out for themselves doesn't seem right.

Are you sure we want to allow rules for the database default collation
in 16, or should we start with just allowing them in CREATE COLLATION
and then expand to the database default collation later? I'm still a
bit concerned about users getting too fancy with daticurules, and
ending up not being able to connect to their database anymore.

Regards,
Jeff Davis

#7Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Davis (#6)
Re: pgsql: Allow tailoring of ICU locales with custom rules

On Tue, Aug 22, 2023 at 10:55 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Mon, 2023-08-14 at 10:34 +0200, Peter Eisentraut wrote:

I have investigated this. My assessment is that how PostgreSQL
interfaces with ICU is correct. Whether what ICU does is correct
might
be debatable. I have filed a bug with ICU about this:
https://unicode-org.atlassian.net/browse/ICU-22456 , but there is no
response yet.

Is everything other than the language and region simply discarded when
a rules string is present, or are some attributes preserved, or is
there some other nuance?

You can work around this by including the desired attributes in the
rules string, for example

create collation c3 (provider=icu,
locale='und-u-ka-shifted-ks-level1',
rules='[alternate shifted][strength 1]',
deterministic=false);

So I don't think there is anything we need to do here for PostgreSQL
16.

Is there some way we can warn a user that some attributes will be
discarded, or improve the documentation? Letting the user figure this
out for themselves doesn't seem right.

Are you sure we want to allow rules for the database default collation
in 16, or should we start with just allowing them in CREATE COLLATION
and then expand to the database default collation later? I'm still a
bit concerned about users getting too fancy with daticurules, and
ending up not being able to connect to their database anymore.

There is still an Open Item corresponding to this. Does anyone else
want to weigh in?

--
With Regards,
Amit Kapila.