Allow tailoring of ICU locales with custom rules

Started by Peter Eisentrautover 3 years ago18 messageshackers
Jump to latest
#1Peter Eisentraut
peter_e@gmx.net

This patch exposes the ICU facility to add custom collation rules to a
standard collation. This would allow users to customize any ICU
collation to whatever they want. A very simple example from the
documentation/tests:

CREATE COLLATION en_custom
(provider = icu, locale = 'en', rules = '&a < g');

This places "g" after "a" before "b". Details about the syntax can be
found at
<https://unicode-org.github.io/icu/userguide/collation/customization/&gt;.

The code is pretty straightforward. It mainly just records these rules
in the catalog and feeds them to ICU when creating the collator object.

Attachments:

0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchtext/plain; charset=UTF-8; name=0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchDownload+219-10
#2Peter Eisentraut
peter_e@gmx.net
In reply to: Peter Eisentraut (#1)
Re: Allow tailoring of ICU locales with custom rules

Patch needed a rebase; no functionality changes.

Show quoted text

On 14.12.22 10:26, Peter Eisentraut wrote:

This patch exposes the ICU facility to add custom collation rules to a
standard collation.  This would allow users to customize any ICU
collation to whatever they want.  A very simple example from the
documentation/tests:

CREATE COLLATION en_custom
    (provider = icu, locale = 'en', rules = '&a < g');

This places "g" after "a" before "b".  Details about the syntax can be
found at
<https://unicode-org.github.io/icu/userguide/collation/customization/&gt;.

The code is pretty straightforward.  It mainly just records these rules
in the catalog and feeds them to ICU when creating the collator object.

Attachments:

v2-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchtext/plain; charset=UTF-8; name=v2-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchDownload+220-11
#3vignesh C
vignesh21@gmail.com
In reply to: Peter Eisentraut (#2)
Re: Allow tailoring of ICU locales with custom rules

On Thu, 5 Jan 2023 at 20:45, Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

Patch needed a rebase; no functionality changes.

The patch does not apply on top of HEAD as in [1]http://cfbot.cputube.org/patch_41_4075.log, please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID
d952373a987bad331c0e499463159dd142ced1ef ===
=== applying patch
./v2-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patch
patching file doc/src/sgml/catalogs.sgml
patching file doc/src/sgml/ref/create_collation.sgml
patching file doc/src/sgml/ref/create_database.sgml
Hunk #1 FAILED at 192.
1 out of 1 hunk FAILED -- saving rejects to file
doc/src/sgml/ref/create_database.sgml.rej

[1]: http://cfbot.cputube.org/patch_41_4075.log

Regards,
Vignesh

#4Peter Eisentraut
peter_e@gmx.net
In reply to: vignesh C (#3)
Re: Allow tailoring of ICU locales with custom rules

On 11.01.23 03:50, vignesh C wrote:

On Thu, 5 Jan 2023 at 20:45, Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

Patch needed a rebase; no functionality changes.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:

Updated patch attached.

Attachments:

v3-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchtext/plain; charset=UTF-8; name=v3-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchDownload+220-11
#5Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Peter Eisentraut (#4)
Re: Allow tailoring of ICU locales with custom rules

On Mon, 2023-01-16 at 12:18 +0100, Peter Eisentraut wrote:

Updated patch attached.

I like that patch. It applies and passes regression tests.

I played with it:

CREATE COLLATION german_phone (LOCALE = 'de-AT', PROVIDER = icu, RULES = '&oe < ö');

SELECT * FROM (VALUES ('od'), ('oe'), ('of'), ('p'), ('ö')) AS q(c)
ORDER BY c COLLATE german_phone;

c
════
od
oe
ö
of
p
(5 rows)

Cool so far. Now I created a database with that locale:

CREATE DATABASE teutsch LOCALE_PROVIDER icu ICU_LOCALE german_phone
LOCALE "de_AT.utf8" TEMPLATE template0;

Now the rules are not in "pg_database":

SELECT datcollate, daticulocale, daticurules FROM pg_database WHERE datname = 'teutsch';

datcollate │ daticulocale │ daticurules
════════════╪══════════════╪═════════════
de_AT.utf8 │ german_phone │ ∅
(1 row)

I connect to the database and try:

SELECT * FROM (VALUES ('od'), ('oe'), ('of'), ('p'), ('ö')) AS q(c)
ORDER BY c COLLATE german_phone;

ERROR: collation "german_phone" for encoding "UTF8" does not exist
LINE 1: ... ('oe'), ('of'), ('p'), ('ö')) AS q(c) ORDER BY c COLLATE ge...
^

Indeed, the collation isn't there...

I guess that it is not the fault of this patch that the collation isn't there,
but I think it is surprising. What good is a database collation that does not
exist in the database?

What might be the fault of this patch, however, is that "daticurules" is not
set in "pg_database". Looking at the code, that column seems to be copied
from the template database, but cannot be overridden.

Perhaps this only needs more documentation, but I am confused.

Yours,
Laurenz Albe

#6Daniel Verite
daniel@manitou-mail.org
In reply to: Laurenz Albe (#5)
Re: Allow tailoring of ICU locales with custom rules

Laurenz Albe wrote:

Cool so far. Now I created a database with that locale:

CREATE DATABASE teutsch LOCALE_PROVIDER icu ICU_LOCALE german_phone
LOCALE "de_AT.utf8" TEMPLATE template0;

Now the rules are not in "pg_database":

The parameter after ICU_LOCALE is passed directly to ICU as a locale
ID, as opposed to refering a collation name in the current database.
This CREATE DATABASE doesn't fail because ICU accepts pretty much
anything as a locale ID, ignoring what it can't parse instead of
erroring out.

I think the way to express what you want should be:

CREATE DATABASE teutsch
LOCALE_PROVIDER icu
ICU_LOCALE 'de_AT'
LOCALE 'de_AT.utf8'
ICU_RULES '&a < g';

However it still leaves "daticurules" empty in the destination db,
because of an actual bug in the current patch.

Looking at createdb() in commands.c, it creates this variable:

@@ -711,6 +714,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
char *dbcollate = NULL;
char *dbctype = NULL;
char *dbiculocale = NULL;
+ char *dbicurules = NULL;
char dblocprovider = '\0';
char *canonname;
int encoding = -1;

and then reads it later

@@ -1007,6 +1017,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
		dblocprovider = src_locprovider;
	if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU)
		dbiculocale = src_iculocale;
+	if (dbicurules == NULL && dblocprovider == COLLPROVIDER_ICU)
+		dbicurules = src_icurules;

/* Some encodings are client only */
if (!PG_VALID_BE_ENCODING(encoding))

but it forgets to assign it in between, so it stays NULL and src_icurules
is taken instead.

I guess that it is not the fault of this patch that the collation
isn't there, but I think it is surprising. What good is a database
collation that does not exist in the database?

Even if the above invocation of CREATE DATABASE worked as you
intuitively expected, by getting the characteristics from the
user-defined collation for the destination db, it still wouldn't work to
refer
to COLLATE "german_phone" in the destination database.
That's because there would be no "german_phone" entry in the
pg_collation of the destination db, as it's cloned from the template
db, which has no reason to have this collation.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#7Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Daniel Verite (#6)
Re: Allow tailoring of ICU locales with custom rules

On Sat, 2023-02-04 at 14:41 +0100, Daniel Verite wrote:

        Laurenz Albe wrote:

Cool so far.  Now I created a database with that locale:

 CREATE DATABASE teutsch LOCALE_PROVIDER icu ICU_LOCALE german_phone
    LOCALE "de_AT.utf8" TEMPLATE template0;

Now the rules are not in "pg_database":

The parameter after ICU_LOCALE is passed directly to ICU as a locale
ID, as opposed to refering a collation name in the current database.
This CREATE DATABASE doesn't fail because ICU accepts pretty much
anything as a locale ID, ignoring what it can't parse instead of
erroring out.

I think the way to express what you want should be:

CREATE DATABASE teutsch
 LOCALE_PROVIDER icu
 ICU_LOCALE 'de_AT'
 LOCALE 'de_AT.utf8'
 ICU_RULES '&a < g';

However it still leaves "daticurules" empty in the destination db,
because of an actual bug in the current patch.

I see. Thanks for the explanation.

I guess that it is not the fault of this patch that the collation
isn't there, but I think it is surprising.  What good is a database
collation that does not exist in the database?

Even if the above invocation of CREATE DATABASE worked as you
intuitively expected, by getting the characteristics from the
user-defined collation for the destination db, it still wouldn't work to
refer
to COLLATE "german_phone" in the destination database.
That's because there would be no "german_phone" entry in the
pg_collation of the destination db, as it's cloned from the template
db, which has no reason to have this collation.

That makes sense. Then I think that the current behavior is buggy:
You should not be allowed to specify a collation that does not exist in
the template database. Otherwise you end up with something weird.

This is not the fault of this patch though.

Yours,
Laurenz Albe

#8Peter Eisentraut
peter_e@gmx.net
In reply to: Daniel Verite (#6)
Re: Allow tailoring of ICU locales with custom rules

On 04.02.23 14:41, Daniel Verite wrote:

However it still leaves "daticurules" empty in the destination db,
because of an actual bug in the current patch.

Looking at createdb() in commands.c, it creates this variable:

@@ -711,6 +714,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
char *dbcollate = NULL;
char *dbctype = NULL;
char *dbiculocale = NULL;
+ char *dbicurules = NULL;
char dblocprovider = '\0';
char *canonname;
int encoding = -1;

and then reads it later

@@ -1007,6 +1017,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
dblocprovider = src_locprovider;
if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU)
dbiculocale = src_iculocale;
+	if (dbicurules == NULL && dblocprovider == COLLPROVIDER_ICU)
+		dbicurules = src_icurules;

/* Some encodings are client only */
if (!PG_VALID_BE_ENCODING(encoding))

but it forgets to assign it in between, so it stays NULL and src_icurules
is taken instead.

Right. Here is a new patch with this fixed.

Attachments:

v4-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchtext/plain; charset=UTF-8; name=v4-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchDownload+222-11
#9Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Peter Eisentraut (#8)
Re: Allow tailoring of ICU locales with custom rules

On Mon, 2023-02-06 at 22:16 +0100, Peter Eisentraut wrote:

Right.  Here is a new patch with this fixed.

Thanks. I played some more with it, and still are still some missing
odds and ends:

- There is a new option ICU_RULES to CREATE DATABASE, but it is not
reflected in \h CREATE DATABASE. sql_help_CREATE_DATABASE() needs to
be amended.

- There is no way to show the rules except by querying "pg_collation" or
"pg_database". I think it would be good to show the rules with
\dO+ and \l+.

- If I create a collation "x" with RULES and then create a database
with "ICU_LOCALE x", the rules are not copied over.

I don't know if that is intended or not, but it surprises me.
Should that be a WARNING? Or, since creating a database with a collation
that does not exist in "template0" doesn't make much sense (or does it?),
is there a way to forbid that?

Yours,
Laurenz Albe

#10Peter Eisentraut
peter_e@gmx.net
In reply to: Laurenz Albe (#9)
Re: Allow tailoring of ICU locales with custom rules

On 14.02.23 17:53, Laurenz Albe wrote:

On Mon, 2023-02-06 at 22:16 +0100, Peter Eisentraut wrote:

Right.  Here is a new patch with this fixed.

Thanks. I played some more with it, and still are still some missing
odds and ends:

- There is a new option ICU_RULES to CREATE DATABASE, but it is not
reflected in \h CREATE DATABASE. sql_help_CREATE_DATABASE() needs to
be amended.

Fixed.

- There is no way to show the rules except by querying "pg_collation" or
"pg_database". I think it would be good to show the rules with
\dO+ and \l+.

Fixed. I adjusted the order of the columns a bit, to make the overall
picture more sensible. The locale provider column is now earlier, since
it indicates which of the subsequent columns are applicable.

- If I create a collation "x" with RULES and then create a database
with "ICU_LOCALE x", the rules are not copied over.

I don't know if that is intended or not, but it surprises me.
Should that be a WARNING? Or, since creating a database with a collation
that does not exist in "template0" doesn't make much sense (or does it?),
is there a way to forbid that?

This is a misunderstanding of how things work. The value of the
database ICU_LOCALE attribute is passed to the ICU library. It does not
refer to a PostgreSQL collation object.

Attachments:

v5-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchtext/plain; charset=UTF-8; name=v5-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchDownload+295-54
#11Daniel Verite
daniel@manitou-mail.org
In reply to: Peter Eisentraut (#10)
Re: Allow tailoring of ICU locales with custom rules

Peter Eisentraut wrote:

[patch v5]

Two quick comments:

- pg_dump support need to be added for CREATE COLLATION / DATABASE

- there doesn't seem to be a way to add rules to template1.
If someone wants to have icu rules and initial contents to their new
databases, I think they need to create a custom template database
(from template0) for that purpose, in addition to template1.
From a usability standpoint, this is a bit cumbersome, as it's
normally the role of template1.
To improve on that, shouldn't initdb be able to create template0 with
rules too?

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#12Peter Eisentraut
peter_e@gmx.net
In reply to: Daniel Verite (#11)
Re: Allow tailoring of ICU locales with custom rules

On 20.02.23 17:30, Daniel Verite wrote:

- pg_dump support need to be added for CREATE COLLATION / DATABASE

I have added that.

- there doesn't seem to be a way to add rules to template1.
If someone wants to have icu rules and initial contents to their new
databases, I think they need to create a custom template database
(from template0) for that purpose, in addition to template1.
From a usability standpoint, this is a bit cumbersome, as it's
normally the role of template1.
To improve on that, shouldn't initdb be able to create template0 with
rules too?

Right, that would be an initdb option. Is that too many initdb options
then? It would be easy to add, if we think it's worth it.

Attachments:

v6-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchtext/plain; charset=UTF-8; name=v6-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchDownload+332-54
#13Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Peter Eisentraut (#12)
Re: Allow tailoring of ICU locales with custom rules

On Wed, 2023-02-22 at 18:35 +0100, Peter Eisentraut wrote:

- there doesn't seem to be a way to add rules to template1.
If someone wants to have icu rules and initial contents to their new
databases, I think they need to create a custom template database
(from template0) for that purpose, in addition to template1.
  From a usability standpoint, this is a bit cumbersome, as it's
normally the role of template1.
To improve on that, shouldn't initdb be able to create template0 with
rules too?

Right, that would be an initdb option.  Is that too many initdb options
then?  It would be easy to add, if we think it's worth it.

An alternative would be to document that you can drop "template1" and
create it again using the ICU collation rules you need.

But I'd prefer an "initdb" option.

Yours,
Laurenz Albe

#14Peter Eisentraut
peter_e@gmx.net
In reply to: Laurenz Albe (#13)
Re: Allow tailoring of ICU locales with custom rules

On 02.03.23 16:39, Laurenz Albe wrote:

On Wed, 2023-02-22 at 18:35 +0100, Peter Eisentraut wrote:

- there doesn't seem to be a way to add rules to template1.
If someone wants to have icu rules and initial contents to their new
databases, I think they need to create a custom template database
(from template0) for that purpose, in addition to template1.
  From a usability standpoint, this is a bit cumbersome, as it's
normally the role of template1.
To improve on that, shouldn't initdb be able to create template0 with
rules too?

Right, that would be an initdb option.  Is that too many initdb options
then?  It would be easy to add, if we think it's worth it.

An alternative would be to document that you can drop "template1" and
create it again using the ICU collation rules you need.

But I'd prefer an "initdb" option.

Ok, here is a version with an initdb option and also a createdb option
(to expose the CREATE DATABASE option).

You can mess with people by setting up your databases like this:

initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d'

;-)

Attachments:

v7-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchtext/plain; charset=UTF-8; name=v7-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchDownload+379-56
#15Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#14)
Re: Allow tailoring of ICU locales with custom rules

On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote:

You can mess with people by setting up your databases like this:

initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d'

;-)

Would we be the first major database to support custom collation rules?
This sounds useful for testing, experimentation, hacking, etc.

What are some of the use cases? Is it helpful to comply with unusual or
outdated standards or formats? Maybe there are people using special
delimiters/terminators and they need them to be treated a certain way
during comparisons?

Regards,
Jeff Davis

#16Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Jeff Davis (#15)
Re: Allow tailoring of ICU locales with custom rules

On Tue, 2023-03-07 at 22:06 -0800, Jeff Davis wrote:

On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote:

You can mess with people by setting up your databases like this:

initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d'

;-)

Would we be the first major database to support custom collation rules?
This sounds useful for testing, experimentation, hacking, etc.

What are some of the use cases? Is it helpful to comply with unusual or
outdated standards or formats? Maybe there are people using special
delimiters/terminators and they need them to be treated a certain way
during comparisons?

I regularly see complaints about the sort order; recently this one:
/messages/by-id/CAFCRh--xt-J8awOavhB216kom6TQnaP35TTVEQQS5bHH7gMemQ@mail.gmail.com

So being able to influence the sort order is useful.

Yours,
Laurenz Albe

#17Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Peter Eisentraut (#14)
Re: Allow tailoring of ICU locales with custom rules

On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote:

Ok, here is a version with an initdb option and also a createdb option
(to expose the CREATE DATABASE option).

You can mess with people by setting up your databases like this:

initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d'

Looks good. I cannot get it to misbehave, "make check-world" is successful
(the regression tests misbehave in interesting ways when running
"make installcheck" on a cluster created with non-standard ICU rules, but
that can be expected).

I checked the documentation, tested "pg_dump" support, everything fine.

I'll mark it as "ready for committer".

Yours,
Laurenz Albe

#18Peter Eisentraut
peter_e@gmx.net
In reply to: Laurenz Albe (#17)
Re: Allow tailoring of ICU locales with custom rules

On 08.03.23 15:18, Laurenz Albe wrote:

On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote:

Ok, here is a version with an initdb option and also a createdb option
(to expose the CREATE DATABASE option).

You can mess with people by setting up your databases like this:

initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d'

Looks good. I cannot get it to misbehave, "make check-world" is successful
(the regression tests misbehave in interesting ways when running
"make installcheck" on a cluster created with non-standard ICU rules, but
that can be expected).

I checked the documentation, tested "pg_dump" support, everything fine.

I'll mark it as "ready for committer".

committed