Allow tailoring of ICU locales with custom rules
This patch exposes the ICU facility to add custom collation rules to a
standard collation. This would allow users to customize any ICU
collation to whatever they want. A very simple example from the
documentation/tests:
CREATE COLLATION en_custom
(provider = icu, locale = 'en', rules = '&a < g');
This places "g" after "a" before "b". Details about the syntax can be
found at
<https://unicode-org.github.io/icu/userguide/collation/customization/>.
The code is pretty straightforward. It mainly just records these rules
in the catalog and feeds them to ICU when creating the collator object.
Attachments:
0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchtext/plain; charset=UTF-8; name=0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchDownload+219-10
Patch needed a rebase; no functionality changes.
Show quoted text
On 14.12.22 10:26, Peter Eisentraut wrote:
This patch exposes the ICU facility to add custom collation rules to a
standard collation. This would allow users to customize any ICU
collation to whatever they want. A very simple example from the
documentation/tests:CREATE COLLATION en_custom
(provider = icu, locale = 'en', rules = '&a < g');This places "g" after "a" before "b". Details about the syntax can be
found at
<https://unicode-org.github.io/icu/userguide/collation/customization/>.The code is pretty straightforward. It mainly just records these rules
in the catalog and feeds them to ICU when creating the collator object.
Attachments:
v2-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchtext/plain; charset=UTF-8; name=v2-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchDownload+220-11
On Thu, 5 Jan 2023 at 20:45, Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
Patch needed a rebase; no functionality changes.
The patch does not apply on top of HEAD as in [1]http://cfbot.cputube.org/patch_41_4075.log, please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
d952373a987bad331c0e499463159dd142ced1ef ===
=== applying patch
./v2-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patch
patching file doc/src/sgml/catalogs.sgml
patching file doc/src/sgml/ref/create_collation.sgml
patching file doc/src/sgml/ref/create_database.sgml
Hunk #1 FAILED at 192.
1 out of 1 hunk FAILED -- saving rejects to file
doc/src/sgml/ref/create_database.sgml.rej
[1]: http://cfbot.cputube.org/patch_41_4075.log
Regards,
Vignesh
On 11.01.23 03:50, vignesh C wrote:
On Thu, 5 Jan 2023 at 20:45, Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:Patch needed a rebase; no functionality changes.
The patch does not apply on top of HEAD as in [1], please post a rebased patch:
Updated patch attached.
Attachments:
v3-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchtext/plain; charset=UTF-8; name=v3-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchDownload+220-11
On Mon, 2023-01-16 at 12:18 +0100, Peter Eisentraut wrote:
Updated patch attached.
I like that patch. It applies and passes regression tests.
I played with it:
CREATE COLLATION german_phone (LOCALE = 'de-AT', PROVIDER = icu, RULES = '&oe < ö');
SELECT * FROM (VALUES ('od'), ('oe'), ('of'), ('p'), ('ö')) AS q(c)
ORDER BY c COLLATE german_phone;
c
════
od
oe
ö
of
p
(5 rows)
Cool so far. Now I created a database with that locale:
CREATE DATABASE teutsch LOCALE_PROVIDER icu ICU_LOCALE german_phone
LOCALE "de_AT.utf8" TEMPLATE template0;
Now the rules are not in "pg_database":
SELECT datcollate, daticulocale, daticurules FROM pg_database WHERE datname = 'teutsch';
datcollate │ daticulocale │ daticurules
════════════╪══════════════╪═════════════
de_AT.utf8 │ german_phone │ ∅
(1 row)
I connect to the database and try:
SELECT * FROM (VALUES ('od'), ('oe'), ('of'), ('p'), ('ö')) AS q(c)
ORDER BY c COLLATE german_phone;
ERROR: collation "german_phone" for encoding "UTF8" does not exist
LINE 1: ... ('oe'), ('of'), ('p'), ('ö')) AS q(c) ORDER BY c COLLATE ge...
^
Indeed, the collation isn't there...
I guess that it is not the fault of this patch that the collation isn't there,
but I think it is surprising. What good is a database collation that does not
exist in the database?
What might be the fault of this patch, however, is that "daticurules" is not
set in "pg_database". Looking at the code, that column seems to be copied
from the template database, but cannot be overridden.
Perhaps this only needs more documentation, but I am confused.
Yours,
Laurenz Albe
Laurenz Albe wrote:
Cool so far. Now I created a database with that locale:
CREATE DATABASE teutsch LOCALE_PROVIDER icu ICU_LOCALE german_phone
LOCALE "de_AT.utf8" TEMPLATE template0;Now the rules are not in "pg_database":
The parameter after ICU_LOCALE is passed directly to ICU as a locale
ID, as opposed to refering a collation name in the current database.
This CREATE DATABASE doesn't fail because ICU accepts pretty much
anything as a locale ID, ignoring what it can't parse instead of
erroring out.
I think the way to express what you want should be:
CREATE DATABASE teutsch
LOCALE_PROVIDER icu
ICU_LOCALE 'de_AT'
LOCALE 'de_AT.utf8'
ICU_RULES '&a < g';
However it still leaves "daticurules" empty in the destination db,
because of an actual bug in the current patch.
Looking at createdb() in commands.c, it creates this variable:
@@ -711,6 +714,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
char *dbcollate = NULL;
char *dbctype = NULL;
char *dbiculocale = NULL;
+ char *dbicurules = NULL;
char dblocprovider = '\0';
char *canonname;
int encoding = -1;
and then reads it later
@@ -1007,6 +1017,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
dblocprovider = src_locprovider;
if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU)
dbiculocale = src_iculocale;
+ if (dbicurules == NULL && dblocprovider == COLLPROVIDER_ICU)
+ dbicurules = src_icurules;
/* Some encodings are client only */
if (!PG_VALID_BE_ENCODING(encoding))
but it forgets to assign it in between, so it stays NULL and src_icurules
is taken instead.
I guess that it is not the fault of this patch that the collation
isn't there, but I think it is surprising. What good is a database
collation that does not exist in the database?
Even if the above invocation of CREATE DATABASE worked as you
intuitively expected, by getting the characteristics from the
user-defined collation for the destination db, it still wouldn't work to
refer
to COLLATE "german_phone" in the destination database.
That's because there would be no "german_phone" entry in the
pg_collation of the destination db, as it's cloned from the template
db, which has no reason to have this collation.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
On Sat, 2023-02-04 at 14:41 +0100, Daniel Verite wrote:
Laurenz Albe wrote:
Cool so far. Now I created a database with that locale:
CREATE DATABASE teutsch LOCALE_PROVIDER icu ICU_LOCALE german_phone
LOCALE "de_AT.utf8" TEMPLATE template0;Now the rules are not in "pg_database":
The parameter after ICU_LOCALE is passed directly to ICU as a locale
ID, as opposed to refering a collation name in the current database.
This CREATE DATABASE doesn't fail because ICU accepts pretty much
anything as a locale ID, ignoring what it can't parse instead of
erroring out.I think the way to express what you want should be:
CREATE DATABASE teutsch
LOCALE_PROVIDER icu
ICU_LOCALE 'de_AT'
LOCALE 'de_AT.utf8'
ICU_RULES '&a < g';However it still leaves "daticurules" empty in the destination db,
because of an actual bug in the current patch.
I see. Thanks for the explanation.
I guess that it is not the fault of this patch that the collation
isn't there, but I think it is surprising. What good is a database
collation that does not exist in the database?Even if the above invocation of CREATE DATABASE worked as you
intuitively expected, by getting the characteristics from the
user-defined collation for the destination db, it still wouldn't work to
refer
to COLLATE "german_phone" in the destination database.
That's because there would be no "german_phone" entry in the
pg_collation of the destination db, as it's cloned from the template
db, which has no reason to have this collation.
That makes sense. Then I think that the current behavior is buggy:
You should not be allowed to specify a collation that does not exist in
the template database. Otherwise you end up with something weird.
This is not the fault of this patch though.
Yours,
Laurenz Albe
On 04.02.23 14:41, Daniel Verite wrote:
However it still leaves "daticurules" empty in the destination db,
because of an actual bug in the current patch.Looking at createdb() in commands.c, it creates this variable:
@@ -711,6 +714,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt)
char *dbcollate = NULL;
char *dbctype = NULL;
char *dbiculocale = NULL;
+ char *dbicurules = NULL;
char dblocprovider = '\0';
char *canonname;
int encoding = -1;and then reads it later
@@ -1007,6 +1017,8 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) dblocprovider = src_locprovider; if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU) dbiculocale = src_iculocale; + if (dbicurules == NULL && dblocprovider == COLLPROVIDER_ICU) + dbicurules = src_icurules;/* Some encodings are client only */
if (!PG_VALID_BE_ENCODING(encoding))but it forgets to assign it in between, so it stays NULL and src_icurules
is taken instead.
Right. Here is a new patch with this fixed.
Attachments:
v4-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchtext/plain; charset=UTF-8; name=v4-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchDownload+222-11
On Mon, 2023-02-06 at 22:16 +0100, Peter Eisentraut wrote:
Right. Here is a new patch with this fixed.
Thanks. I played some more with it, and still are still some missing
odds and ends:
- There is a new option ICU_RULES to CREATE DATABASE, but it is not
reflected in \h CREATE DATABASE. sql_help_CREATE_DATABASE() needs to
be amended.
- There is no way to show the rules except by querying "pg_collation" or
"pg_database". I think it would be good to show the rules with
\dO+ and \l+.
- If I create a collation "x" with RULES and then create a database
with "ICU_LOCALE x", the rules are not copied over.
I don't know if that is intended or not, but it surprises me.
Should that be a WARNING? Or, since creating a database with a collation
that does not exist in "template0" doesn't make much sense (or does it?),
is there a way to forbid that?
Yours,
Laurenz Albe
On 14.02.23 17:53, Laurenz Albe wrote:
On Mon, 2023-02-06 at 22:16 +0100, Peter Eisentraut wrote:
Right. Here is a new patch with this fixed.
Thanks. I played some more with it, and still are still some missing
odds and ends:- There is a new option ICU_RULES to CREATE DATABASE, but it is not
reflected in \h CREATE DATABASE. sql_help_CREATE_DATABASE() needs to
be amended.
Fixed.
- There is no way to show the rules except by querying "pg_collation" or
"pg_database". I think it would be good to show the rules with
\dO+ and \l+.
Fixed. I adjusted the order of the columns a bit, to make the overall
picture more sensible. The locale provider column is now earlier, since
it indicates which of the subsequent columns are applicable.
- If I create a collation "x" with RULES and then create a database
with "ICU_LOCALE x", the rules are not copied over.I don't know if that is intended or not, but it surprises me.
Should that be a WARNING? Or, since creating a database with a collation
that does not exist in "template0" doesn't make much sense (or does it?),
is there a way to forbid that?
This is a misunderstanding of how things work. The value of the
database ICU_LOCALE attribute is passed to the ICU library. It does not
refer to a PostgreSQL collation object.
Attachments:
v5-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchtext/plain; charset=UTF-8; name=v5-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchDownload+295-54
Peter Eisentraut wrote:
[patch v5]
Two quick comments:
- pg_dump support need to be added for CREATE COLLATION / DATABASE
- there doesn't seem to be a way to add rules to template1.
If someone wants to have icu rules and initial contents to their new
databases, I think they need to create a custom template database
(from template0) for that purpose, in addition to template1.
From a usability standpoint, this is a bit cumbersome, as it's
normally the role of template1.
To improve on that, shouldn't initdb be able to create template0 with
rules too?
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
On 20.02.23 17:30, Daniel Verite wrote:
- pg_dump support need to be added for CREATE COLLATION / DATABASE
I have added that.
- there doesn't seem to be a way to add rules to template1.
If someone wants to have icu rules and initial contents to their new
databases, I think they need to create a custom template database
(from template0) for that purpose, in addition to template1.
From a usability standpoint, this is a bit cumbersome, as it's
normally the role of template1.
To improve on that, shouldn't initdb be able to create template0 with
rules too?
Right, that would be an initdb option. Is that too many initdb options
then? It would be easy to add, if we think it's worth it.
Attachments:
v6-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchtext/plain; charset=UTF-8; name=v6-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchDownload+332-54
On Wed, 2023-02-22 at 18:35 +0100, Peter Eisentraut wrote:
- there doesn't seem to be a way to add rules to template1.
If someone wants to have icu rules and initial contents to their new
databases, I think they need to create a custom template database
(from template0) for that purpose, in addition to template1.
From a usability standpoint, this is a bit cumbersome, as it's
normally the role of template1.
To improve on that, shouldn't initdb be able to create template0 with
rules too?Right, that would be an initdb option. Is that too many initdb options
then? It would be easy to add, if we think it's worth it.
An alternative would be to document that you can drop "template1" and
create it again using the ICU collation rules you need.
But I'd prefer an "initdb" option.
Yours,
Laurenz Albe
On 02.03.23 16:39, Laurenz Albe wrote:
On Wed, 2023-02-22 at 18:35 +0100, Peter Eisentraut wrote:
- there doesn't seem to be a way to add rules to template1.
If someone wants to have icu rules and initial contents to their new
databases, I think they need to create a custom template database
(from template0) for that purpose, in addition to template1.
From a usability standpoint, this is a bit cumbersome, as it's
normally the role of template1.
To improve on that, shouldn't initdb be able to create template0 with
rules too?Right, that would be an initdb option. Is that too many initdb options
then? It would be easy to add, if we think it's worth it.An alternative would be to document that you can drop "template1" and
create it again using the ICU collation rules you need.But I'd prefer an "initdb" option.
Ok, here is a version with an initdb option and also a createdb option
(to expose the CREATE DATABASE option).
You can mess with people by setting up your databases like this:
initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d'
;-)
Attachments:
v7-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchtext/plain; charset=UTF-8; name=v7-0001-Allow-tailoring-of-ICU-locales-with-custom-rules.patchDownload+379-56
On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote:
You can mess with people by setting up your databases like this:
initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d'
;-)
Would we be the first major database to support custom collation rules?
This sounds useful for testing, experimentation, hacking, etc.
What are some of the use cases? Is it helpful to comply with unusual or
outdated standards or formats? Maybe there are people using special
delimiters/terminators and they need them to be treated a certain way
during comparisons?
Regards,
Jeff Davis
On Tue, 2023-03-07 at 22:06 -0800, Jeff Davis wrote:
On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote:
You can mess with people by setting up your databases like this:
initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d'
;-)
Would we be the first major database to support custom collation rules?
This sounds useful for testing, experimentation, hacking, etc.What are some of the use cases? Is it helpful to comply with unusual or
outdated standards or formats? Maybe there are people using special
delimiters/terminators and they need them to be treated a certain way
during comparisons?
I regularly see complaints about the sort order; recently this one:
/messages/by-id/CAFCRh--xt-J8awOavhB216kom6TQnaP35TTVEQQS5bHH7gMemQ@mail.gmail.com
So being able to influence the sort order is useful.
Yours,
Laurenz Albe
On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote:
Ok, here is a version with an initdb option and also a createdb option
(to expose the CREATE DATABASE option).You can mess with people by setting up your databases like this:
initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d'
Looks good. I cannot get it to misbehave, "make check-world" is successful
(the regression tests misbehave in interesting ways when running
"make installcheck" on a cluster created with non-standard ICU rules, but
that can be expected).
I checked the documentation, tested "pg_dump" support, everything fine.
I'll mark it as "ready for committer".
Yours,
Laurenz Albe
On 08.03.23 15:18, Laurenz Albe wrote:
On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote:
Ok, here is a version with an initdb option and also a createdb option
(to expose the CREATE DATABASE option).You can mess with people by setting up your databases like this:
initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d'
Looks good. I cannot get it to misbehave, "make check-world" is successful
(the regression tests misbehave in interesting ways when running
"make installcheck" on a cluster created with non-standard ICU rules, but
that can be expected).I checked the documentation, tested "pg_dump" support, everything fine.
I'll mark it as "ready for committer".
committed