ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)
On Sun, Aug 6, 2017 at 1:06 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Sat, Aug 5, 2017 at 8:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm quite disturbed though that the set of installed collations on these
two test cases seem to be entirely different both from each other and from
what you reported. The base collations look generally similar, but the
"keyword variant" versions are not comparable at all. Considering that
the entire reason we are interested in ICU in the first place is its
alleged cross-version collation behavior stability, this gives me the
exact opposite of a warm fuzzy feeling. We need to understand why it's
like that and what we can do to reduce the variation, or else we're just
buying our users enormous future pain. At least with the libc collations,
you can expect that if you have en_US.utf8 available today you will
probably still have en_US.utf8 available tomorrow. I am not seeing any
reason to believe that the same holds for ICU collations.+1. That seems like something that is important to get right up-front.
I've looked into this. I'll give an example of what keyword variants
there are for Greek, and then discuss what I think each is. These
keyword variant locations on my machine with master + ICU support (ICU
55):
postgres=# \dOS+ el-*
List of collations
Schema │ Name │ Collate │ Ctype
│ Provider │ Description
────────────┼────────────────────────┼──────────────────┼──────────────────┼──────────┼─────────────
pg_catalog │ el-u-co-emoji-x-icu │ el-u-co-emoji │
el-u-co-emoji │ icu │ Greek
pg_catalog │ el-u-co-eor-x-icu │ el-u-co-eor │ el-u-co-eor
│ icu │ Greek
pg_catalog │ el-u-co-search-x-icu │ el-u-co-search │
el-u-co-search │ icu │ Greek
pg_catalog │ el-u-co-standard-x-icu │ el-u-co-standard │
el-u-co-standard │ icu │ Greek
pg_catalog │ el-x-icu │ el │ el
│ icu │ Greek
(5 rows)
Greek has only one region, standard Greek. A few other
language-regions have variations like multiple regions (e.g. Austrian
German), or a phonebook variant, which you don't see here. Almost all
have -emoji, -search, and -standard, which you do see here.
We pass "commonlyUsed = true" to ucol_getKeywordValuesForLocale()
within pg_import_system_collations(), and so it "will return only
commonly used values with the given locale in preferred order". But
should we go even further? If the charter of
pg_import_system_collations() is to import every possible valid
collation for pg_collation, then it's already failing at that by
limiting itself to "common variants". I agree with the decision to do
that, though, and I think we probably need to go a bit further.
Possible issues with current ICU pg_collation entries after initdb:
* I don't think we should have user-visible "search" collations at all.
Apparently "search" collations are useful because "primary- and
secondary-level distinctions for searching may not be the same as
those for sorting; in ICU, many languages provide a special "search"
collator with the appropriate level settings for search" [1]http://userguide.icu-project.org/collation/icu-string-search-service. I don't
think that we should expose "search" keyword variants at all, because
clearly they're an implementation detail that Postgres may one day
have special knowledge of [2]http://www.unicode.org/reports/tr35/#UnicodeCollationIdentifier -- Peter Geoghegan, to correctly mix searching and sorting
semantics. For the time being, those should simply not be added within
pg_import_system_collations(). Someone could still create the entries
themselves, which seems harmless. Let's avoid establishing the
expectation that they'll be in pg_collation.
* Redundant ICU spellings for the same collation seem to appear.
I find it questionable that there is both a "el-x-icu" and a
"el-u-co-standard-x-icu". That looks like an artifact of how
pg_import_system_collations() was written, as opposed to a bonafide
behavioral difference. I cannot find an example of a
"$COUNTRY_CODE-x-icu" collation without a corresponding
"$COUNTRY_CODE-*-u-standard-x-icu" (The situation is similar for
regional variants, like Austrian German). What, if anything, is the
difference between each such pair of collations? Can we find a way to
provide only one canonical entry if those are simply different ICU
spellings?
* Many emoji variant collations.
I have to wonder if there is much value in creating so many
pg_collation entries that are mere variants to do pictographic emoji
sorting. Call me a killjoy, but I think that users that want that
behavior can create the collations themselves. We could still document
it. I wouldn't mind it if there wasn't so many emoji collations.
* Many EOR variant collations.
EOR as a collation variant is an ICU hack to get around the fact that
EOR doesn't fit with their taxonomy for locales. My understanding is
that there is supposed to be one EOR collation, used across Europe,
per the ISO standard. I think ICU structures it as a variant because
ICU only provides collations through locales, and collation is only
one property of a locale. EOR has no opinion about what a currency
sign should look like, unlike an ICU locale.
Maybe we should only have one EOR collation unless the user creates
one of their own. We only care about distinct collation behavior, at
least as far as ICU knows.
[1]: http://userguide.icu-project.org/collation/icu-string-search-service
[2]: http://www.unicode.org/reports/tr35/#UnicodeCollationIdentifier -- Peter Geoghegan
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 8/6/17 20:07, Peter Geoghegan wrote:
I've looked into this. I'll give an example of what keyword variants
there are for Greek, and then discuss what I think each is.
I'm not sure why we want to get into editorializing this. We query ICU
for the names of distinct collations and use that. It's more than most
people need, sure, but it doesn't cost us anything. The alternatives
are hand-maintaining a list of collations, or installing no collations
by default. Both of those are arguably worse for users or for future
code maintenance or both.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Aug 7, 2017 at 2:50 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
On 8/6/17 20:07, Peter Geoghegan wrote:
I've looked into this. I'll give an example of what keyword variants
there are for Greek, and then discuss what I think each is.I'm not sure why we want to get into editorializing this. We query ICU
for the names of distinct collations and use that.
We ask ucol_getKeywordValuesForLocale() to get only "commonly used
[variant] values with the given locale" within
pg_import_system_collations(). So the editorializing has already
begun.
It's more than most
people need, sure, but it doesn't cost us anything.
It's also *less* than what other users need. I disagree on the cost of
redundancy among entries after initdb. It's just confusing to users,
and seems avoidable without adding special case logic. What's the
difference between el-u-co-standard-x-icu and el-x-icu?
The alternatives
are hand-maintaining a list of collations, or installing no collations
by default.
A better alternative would be to actively take an interest in what
collations are created, by further refining the rules by which they
are created. We have a stable API, described by various standards,
that we can work with for this. This doesn't have to be a
maintainability burden. We can provide general guidance about how to
add stuff back within documentation.
I do think that we should actually list all the collations that are
available by default on some representative ICU version, once that
list is tightened up, just as other database systems list them. That
necessitates a little weasel wording that notes that later ICU
versions might add more, but that's not a problem IMV. I don't think
that CLDR will ever omit anything previously available, at least
within a reasonable timeframe [1]http://cldr.unicode.org/index/process/cldr-data-retention-policy -- Peter Geoghegan.
[1]: http://cldr.unicode.org/index/process/cldr-data-retention-policy -- Peter Geoghegan
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
On 8/6/17 20:07, Peter Geoghegan wrote:
I've looked into this. I'll give an example of what keyword variants
there are for Greek, and then discuss what I think each is.
I'm not sure why we want to get into editorializing this. We query ICU
for the names of distinct collations and use that. It's more than most
people need, sure, but it doesn't cost us anything.
Yes, *it does*. The cost will be borne by users who get screwed at update
time, not by developers, but that doesn't make it insignificant.
The alternatives are hand-maintaining a list of collations, or
installing no collations by default. Both of those are arguably worse
for users or for future code maintenance or both.
I'm not (yet) convinced that we need a hand-maintained whitelist. But
I am wondering why we're expending extra code to import keyword variants.
Who is that catering to, really?
The thing that I'm particularly thinking about is that if someone wants
an ICU variant collation that we didn't make initdb provide, they'll do
a CREATE COLLATION and go use it. At update time, pg_dump or pg_upgrade
will export/import that via CREATE COLLATION, and the only way it fails
is if ICU rejects the collation name as garbage. (Which, as we already
established upthread, it's quite unlikely to do.) On the other hand,
if someone relies on an ICU variant collation that initdb did import,
and then in the next release that collation doesn't get imported because
ICU changed their minds on what to advertise, the update situation is not
pretty at all. Certainly it won't get handled transparently. This line
of thinking leads me to believe that we ought to be pretty conservative
about what we import during initdb.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Aug 7, 2017 at 3:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The thing that I'm particularly thinking about is that if someone wants
an ICU variant collation that we didn't make initdb provide, they'll do
a CREATE COLLATION and go use it. At update time, pg_dump or pg_upgrade
will export/import that via CREATE COLLATION, and the only way it fails
is if ICU rejects the collation name as garbage. (Which, as we already
established upthread, it's quite unlikely to do.)
Actually, it's *impossible* for ICU to fail to accept any string as a
valid locale within CREATE COLLATION, because CollationCreate() simply
doesn't sanitize ICU names. It doesn't do something like call
get_icu_language_tag(), unlike initdb (within
pg_import_system_collations()).
If I add such a test to CollationCreate(), it does a reasonable job of
sanitizing, while preserving the spirit of the BCP 47 language tag
format by not assuming that the user didn't specify a brand new locale
that it hasn't heard of. All of these are accepted with unmodified
master:
postgres=# CREATE COLLATION test1 (provider = icu, locale = 'en-x-icu');
CREATE COLLATION
postgres=# CREATE COLLATION test2 (provider = icu, locale = 'foo bar baz');
ERROR: XX000: could not convert locale name "foo bar baz" to language
tag: U_ILLEGAL_ARGUMENT_ERROR
LOCATION: get_icu_language_tag, collationcmds.c:454
postgres=# CREATE COLLATION test3 (provider = icu, locale = 'en-gb-icu');
ERROR: XX000: could not convert locale name "en-gb-icu" to language
tag: U_ILLEGAL_ARGUMENT_ERROR
LOCATION: get_icu_language_tag, collationcmds.c:454
postgres=# CREATE COLLATION test4 (provider = icu, locale = 'not-a-country');
CREATE COLLATION
If it's mandatory for get_icu_language_tag() to not throw an error
during initdb import when passed strings like these (that are
generated mechanically), why should we not do the same with CREATE
COLLATION? While the choice to preserve BCP 47's tolerance of missing
collations is debatable, not doing at least this much up-front is a
bug IMV.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Aug 07, 2017 at 06:23:56PM -0400, Tom Lane wrote:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
On 8/6/17 20:07, Peter Geoghegan wrote:
I've looked into this. I'll give an example of what keyword variants
there are for Greek, and then discuss what I think each is.I'm not sure why we want to get into editorializing this. We query ICU
for the names of distinct collations and use that. It's more than most
people need, sure, but it doesn't cost us anything.Yes, *it does*. The cost will be borne by users who get screwed at update
time, not by developers, but that doesn't make it insignificant.
[Action required within three days. This is a generic notification.]
The above-described topic is currently a PostgreSQL 10 open item. Peter,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1]/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.
[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Aug 10, 2017 at 04:51:16AM +0000, Noah Misch wrote:
On Mon, Aug 07, 2017 at 06:23:56PM -0400, Tom Lane wrote:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
On 8/6/17 20:07, Peter Geoghegan wrote:
I've looked into this. I'll give an example of what keyword variants
there are for Greek, and then discuss what I think each is.I'm not sure why we want to get into editorializing this. We query ICU
for the names of distinct collations and use that. It's more than most
people need, sure, but it doesn't cost us anything.Yes, *it does*. The cost will be borne by users who get screwed at update
time, not by developers, but that doesn't make it insignificant.[Action required within three days. This is a generic notification.]
The above-described topic is currently a PostgreSQL 10 open item. Peter,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com
This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 8/7/17 21:00, Peter Geoghegan wrote:
Actually, it's *impossible* for ICU to fail to accept any string as a
valid locale within CREATE COLLATION, because CollationCreate() simply
doesn't sanitize ICU names. It doesn't do something like call
get_icu_language_tag(), unlike initdb (within
pg_import_system_collations()).If I add such a test to CollationCreate(), it does a reasonable job of
sanitizing, while preserving the spirit of the BCP 47 language tag
format by not assuming that the user didn't specify a brand new locale
that it hasn't heard of.
I'm not sure what you are proposing here. Convert the input to CREATE
COLLATION to a BCP 47 language tag?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 8/13/17 15:39, Noah Misch wrote:
This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com
I think there are up to three separate issues in play:
- what to do about some preloaded collations disappearing between versions
- whether to preload keyword variants
- whether to canonicalize some things during CREATE COLLATION
I responded to all these subplots now, but the discussion is ongoing. I
will set the next check-in to Thursday.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 8/14/17 12:23, Peter Eisentraut wrote:
On 8/13/17 15:39, Noah Misch wrote:
This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.comI think there are up to three separate issues in play:
- what to do about some preloaded collations disappearing between versions
- whether to preload keyword variants
- whether to canonicalize some things during CREATE COLLATION
I responded to all these subplots now, but the discussion is ongoing. I
will set the next check-in to Thursday.
I haven't read anything since that has provided any more clarity about
what needs changing here. I will entertain concrete proposals about the
specific points above (considering any other issues under discussion to
be PG11 material), but in the absence of that, I don't plan any work on
this right now.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Aug 17, 2017 at 09:22:07PM -0400, Peter Eisentraut wrote:
On 8/14/17 12:23, Peter Eisentraut wrote:
On 8/13/17 15:39, Noah Misch wrote:
This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.comI think there are up to three separate issues in play:
- what to do about some preloaded collations disappearing between versions
- whether to preload keyword variants
- whether to canonicalize some things during CREATE COLLATION
I responded to all these subplots now, but the discussion is ongoing. I
will set the next check-in to Thursday.I haven't read anything since that has provided any more clarity about
what needs changing here. I will entertain concrete proposals about the
specific points above (considering any other issues under discussion to
be PG11 material), but in the absence of that, I don't plan any work on
this right now.
I think you're contending that, as formulated, this is not a valid v10 open
item. Are you?
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 8/17/17 23:13, Noah Misch wrote:
I haven't read anything since that has provided any more clarity about
what needs changing here. I will entertain concrete proposals about the
specific points above (considering any other issues under discussion to
be PG11 material), but in the absence of that, I don't plan any work on
this right now.I think you're contending that, as formulated, this is not a valid v10 open
item. Are you?
Well, some people are not content with the current state of things, so
it is probably an open item. I will propose patches on Monday to
hopefully close this.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Noah Misch <noah@leadboat.com> wrote:
I think you're contending that, as formulated, this is not a valid v10 open
item. Are you?
As the person that came up with this formulation, I'd like to give a
quick summary of my current understanding of the item's status:
* We're in agreement that we ought to have initdb create initial
collations based on ICU locales, not based on distinct ICU
collations [1]/messages/by-id/f67f36d7-ceb6-cfbd-28d4-413c6d22fe5b@2ndquadrant.com.
* We're in agreement that variant keywords should not be
created for each base locale/collation [2]/messages/by-id/3862d484-f0a5-9eef-c54e-3f6808338726@2ndquadrant.com.
Once these two changes are made, I think that everything will be in good
shape as far as pg_collation name stability goes. It shouldn't take
Peter E. long to write the patch. I'm happy to write the patch on his
behalf if that saves time.
We're also going to work on the documentation, to make keyword variants
like -emoji and -traditional at least somewhat discoverable, and to
explain the capabilities of custom ICU collations more generally.
[1]: /messages/by-id/f67f36d7-ceb6-cfbd-28d4-413c6d22fe5b@2ndquadrant.com
[2]: /messages/by-id/3862d484-f0a5-9eef-c54e-3f6808338726@2ndquadrant.com
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 8/19/17 19:15, Peter Geoghegan wrote:
Noah Misch <noah@leadboat.com> wrote:
I think you're contending that, as formulated, this is not a valid v10 open
item. Are you?As the person that came up with this formulation, I'd like to give a
quick summary of my current understanding of the item's status:* We're in agreement that we ought to have initdb create initial
collations based on ICU locales, not based on distinct ICU
collations [1].* We're in agreement that variant keywords should not be
created for each base locale/collation [2].Once these two changes are made, I think that everything will be in good
shape as far as pg_collation name stability goes. It shouldn't take
Peter E. long to write the patch. I'm happy to write the patch on his
behalf if that saves time.We're also going to work on the documentation, to make keyword variants
like -emoji and -traditional at least somewhat discoverable, and to
explain the capabilities of custom ICU collations more generally.
Here are my patches to address this.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
0001-Expand-set-of-predefined-ICU-locales.patchtext/plain; charset=UTF-8; name=0001-Expand-set-of-predefined-ICU-locales.patch; x-mac-creator=0; x-mac-type=0Download
From 5a70c7e97758bf06fd717b391b66f3cc0366f063 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Mon, 21 Aug 2017 09:17:06 -0400
Subject: [PATCH 1/2] Expand set of predefined ICU locales
Install language+region combinations even if they are not distinct from
the language's base locale. This gives better long-term stability of
the set of predefined locales and makes the predefined locales less
implementation-dependent and more practical for users.
---
doc/src/sgml/charset.sgml | 13 ++++++-------
src/backend/commands/collationcmds.c | 15 ++++++++++++---
2 files changed, 18 insertions(+), 10 deletions(-)
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 48ecfc5f48..f2a4acc115 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -653,9 +653,8 @@ <title>ICU collations</title>
string will be accepted as a locale name.)
See <ulink url="http://userguide.icu-project.org/locale"></ulink> for
information on ICU locale naming. <command>initdb</command> uses the ICU
- APIs to extract a set of locales with distinct collation rules to populate
- the initial set of collations. Here are some example collations that
- might be created:
+ APIs to extract a set of distinct locales to populate the initial set of
+ collations. Here are some example collations that might be created:
<variablelist>
<varlistentry>
@@ -677,9 +676,9 @@ <title>ICU collations</title>
<listitem>
<para>German collation for Austria, default variant</para>
<para>
- (As of this writing, there is no,
- say, <literal>de-DE-x-icu</literal> or <literal>de-CH-x-icu</literal>,
- because those are equivalent to <literal>de-x-icu</literal>.)
+ (There are also, say, <literal>de-DE-x-icu</literal>
+ or <literal>de-CH-x-icu</literal>, but as of this writing, they are
+ equivalent to <literal>de-x-icu</literal>.)
</para>
</listitem>
</varlistentry>
@@ -690,6 +689,7 @@ <title>ICU collations</title>
<para>German collation for Austria, phone book variant</para>
</listitem>
</varlistentry>
+
<varlistentry>
<term><literal>und-x-icu</literal> (for <quote>undefined</quote>)</term>
<listitem>
@@ -724,7 +724,6 @@ <title>Copying Collations</title>
<programlisting>
CREATE COLLATION german FROM "de_DE";
CREATE COLLATION french FROM "fr-x-icu";
-CREATE COLLATION "de-DE-x-icu" FROM "de-x-icu";
</programlisting>
</para>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index 8572b2dedc..d36ce53560 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -667,7 +667,16 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
}
#endif /* READ_LOCALE_A_OUTPUT */
- /* Load collations known to ICU */
+ /*
+ * Load collations known to ICU
+ *
+ * We use uloc_countAvailable()/uloc_getAvailable() rather than
+ * ucol_countAvailable()/ucol_getAvailable(). The former returns a full
+ * set of language+region combinations, whereas the latter only returns
+ * language+region combinations of they are distinct from the language's
+ * base collation. So there might not be a de-DE or en-GB, which would be
+ * confusing.
+ */
#ifdef USE_ICU
{
int i;
@@ -676,7 +685,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
* Start the loop at -1 to sneak in the root locale without too much
* code duplication.
*/
- for (i = -1; i < ucol_countAvailable(); i++)
+ for (i = -1; i < uloc_countAvailable(); i++)
{
/*
* In ICU 4.2, ucol_getKeywordValuesForLocale() sometimes returns
@@ -706,7 +715,7 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
if (i == -1)
name = ""; /* ICU root locale */
else
- name = ucol_getAvailable(i);
+ name = uloc_getAvailable(i);
langtag = get_icu_language_tag(name);
collcollate = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : name;
--
2.14.1
0002-Don-t-install-ICU-collation-keyword-variants.patchtext/plain; charset=UTF-8; name=0002-Don-t-install-ICU-collation-keyword-variants.patch; x-mac-creator=0; x-mac-type=0Download
From 84c2fe583807369215f2d39ebf9b010862feee18 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <peter_e@gmx.net>
Date: Mon, 21 Aug 2017 11:22:00 -0400
Subject: [PATCH 2/2] Don't install ICU collation keyword variants
Users can still create them themselves. Instead, document Unicode TR 35
collation options for ICU, so users can create all this themselves.
---
doc/src/sgml/charset.sgml | 97 ++++++++++++++++++++++++++++++------
src/backend/commands/collationcmds.c | 71 --------------------------
2 files changed, 83 insertions(+), 85 deletions(-)
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index f2a4acc115..30f2da2115 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -664,13 +664,6 @@ <title>ICU collations</title>
</listitem>
</varlistentry>
- <varlistentry>
- <term><literal>de-u-co-phonebk-x-icu</literal></term>
- <listitem>
- <para>German collation, phone book variant</para>
- </listitem>
- </varlistentry>
-
<varlistentry>
<term><literal>de-AT-x-icu</literal></term>
<listitem>
@@ -683,13 +676,6 @@ <title>ICU collations</title>
</listitem>
</varlistentry>
- <varlistentry>
- <term><literal>de-AT-u-co-phonebk-x-icu</literal></term>
- <listitem>
- <para>German collation for Austria, phone book variant</para>
- </listitem>
- </varlistentry>
-
<varlistentry>
<term><literal>und-x-icu</literal> (for <quote>undefined</quote>)</term>
<listitem>
@@ -709,6 +695,89 @@ <title>ICU collations</title>
will draw an error along the lines of <quote>collation "de-x-icu" for
encoding "WIN874" does not exist</>.
</para>
+
+ <para>
+ ICU allows collations to be customized beyond the basic language+country
+ set that is preloaded by <command>initdb</command>. Users are encouraged
+ to define their own collation objects that make use of these facilities to
+ suit the sorting behavior to their requirements. Here are some examples:
+
+ <variablelist>
+ <varlistentry>
+ <term><literal>CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk')</literal></term>
+ <listitem>
+ <para>German collation with phone book collation type</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji')</literal></term>
+ <listitem>
+ <para>
+ Root collation with Emoji collation type, per Unicode Technical Standard #51
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>CREATE COLLATION digitslast (provider = icu, locale = 'en-u-kr-latn-digit')</literal></term>
+ <listitem>
+ <para>
+ Sort digits after Latin letters. (The default is digits before letters.)
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper')</literal></term>
+ <listitem>
+ <para>
+ Sort upper-case letters before lower-case letters. (The default is
+ lower-case letters first.)
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-latn-digit')</literal></term>
+ <listitem>
+ <para>
+ Combines both of the above options.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><literal>CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true')</literal></term>
+ <listitem>
+ <para>
+ Numeric ordering, sorts sequences of digits by their numeric value,
+ for example: <literal>A-21</literal> < <literal>A-123</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ See <ulink url="http://unicode.org/reports/tr35/tr35-collation.html">Unicode
+ Technical Standard #35</ulink>
+ and <ulink url="https://tools.ietf.org/html/bcp47">BCP 47</ulink> for
+ details. The list of possible collation types (<literal>co</literal>
+ subtag) can be found in
+ the <ulink url="http://www.unicode.org/repos/cldr/trunk/common/bcp47/collation.xml">CLDR
+ repository</ulink>.
+ The <ulink url="https://ssl.icu-project.org/icu-bin/locexp">ICU Locale
+ Explorer</ulink> can be used to check the details of a particular locale
+ definition.
+ </para>
+
+ <para>
+ Note that while this system allows creating collations that <quote>ignore
+ case</quote> or <quote>ignore accents</quote> or similar (using
+ the <literal>ks</literal> key), PostgreSQL does not at the moment allow
+ such collations to act in a truly case- or accent-insensitive manner. Any
+ strings that compare equal according to the collation but are not
+ byte-wise equal will be sorted according to their byte values.
+ </para>
</sect4>
</sect3>
diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index d36ce53560..9437731276 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -687,30 +687,11 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
*/
for (i = -1; i < uloc_countAvailable(); i++)
{
- /*
- * In ICU 4.2, ucol_getKeywordValuesForLocale() sometimes returns
- * values that will not be accepted by uloc_toLanguageTag(). Skip
- * loading keyword variants in that version. (Both
- * ucol_getKeywordValuesForLocale() and uloc_toLanguageTag() are
- * new in ICU 4.2, so older versions are not supported at all.)
- *
- * XXX We have no information about ICU 4.3 through 4.7, but we
- * know the code below works with 4.8.
- */
-#if U_ICU_VERSION_MAJOR_NUM > 4 || (U_ICU_VERSION_MAJOR_NUM == 4 && U_ICU_VERSION_MINOR_NUM > 2)
-#define LOAD_ICU_KEYWORD_VARIANTS
-#endif
-
const char *name;
char *langtag;
char *icucomment;
const char *collcollate;
Oid collid;
-#ifdef LOAD_ICU_KEYWORD_VARIANTS
- UEnumeration *en;
- UErrorCode status;
- const char *val;
-#endif
if (i == -1)
name = ""; /* ICU root locale */
@@ -744,58 +725,6 @@ pg_import_system_collations(PG_FUNCTION_ARGS)
CreateComments(collid, CollationRelationId, 0,
icucomment);
}
-
- /*
- * Add keyword variants, if enabled.
- */
-#ifdef LOAD_ICU_KEYWORD_VARIANTS
- status = U_ZERO_ERROR;
- en = ucol_getKeywordValuesForLocale("collation", name, TRUE, &status);
- if (U_FAILURE(status))
- ereport(ERROR,
- (errmsg("could not get keyword values for locale \"%s\": %s",
- name, u_errorName(status))));
-
- status = U_ZERO_ERROR;
- uenum_reset(en, &status);
- while ((val = uenum_next(en, NULL, &status)))
- {
- char *localeid = psprintf("%s@collation=%s", name, val);
-
- langtag = get_icu_language_tag(localeid);
- collcollate = U_ICU_VERSION_MAJOR_NUM >= 54 ? langtag : localeid;
-
- /*
- * Be paranoid about not allowing any non-ASCII strings into
- * pg_collation
- */
- if (!is_all_ascii(langtag) || !is_all_ascii(collcollate))
- continue;
-
- collid = CollationCreate(psprintf("%s-x-icu", langtag),
- nspid, GetUserId(),
- COLLPROVIDER_ICU, -1,
- collcollate, collcollate,
- get_collation_actual_version(COLLPROVIDER_ICU, collcollate),
- true, true);
- if (OidIsValid(collid))
- {
- ncreated++;
-
- CommandCounterIncrement();
-
- icucomment = get_icu_locale_comment(localeid);
- if (icucomment)
- CreateComments(collid, CollationRelationId, 0,
- icucomment);
- }
- }
- if (U_FAILURE(status))
- ereport(ERROR,
- (errmsg("could not get keyword values for locale \"%s\": %s",
- name, u_errorName(status))));
- uenum_close(en);
-#endif /* LOAD_ICU_KEYWORD_VARIANTS */
}
}
#endif /* USE_ICU */
--
2.14.1
On Mon, Aug 21, 2017 at 8:23 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
Here are my patches to address this.
These look good.
One small piece of feedback: I suggest naming the custom collation
"numeric" something else instead: "natural". Apparently, the behavior
it implements is sometimes called natural sorting. See
https://en.wikipedia.org/wiki/Natural_sort_order.
Thanks
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Aug 21, 2017 at 9:33 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Aug 21, 2017 at 8:23 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:Here are my patches to address this.
These look good.
Also, I don't know why en-u-kr-others-digit wasn't accepted by CREATE
COLLATION, as you said on the other thread just now. That's directly
lifted from TR #35. Is it an ICU version issue? I guess it doesn't
matter that much, though.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 8/21/17 12:33, Peter Geoghegan wrote:
On Mon, Aug 21, 2017 at 8:23 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:Here are my patches to address this.
These look good.
Committed. That closes this open item.
One small piece of feedback: I suggest naming the custom collation
"numeric" something else instead: "natural". Apparently, the behavior
it implements is sometimes called natural sorting. See
https://en.wikipedia.org/wiki/Natural_sort_order.
I have added a note about that, but the official name in the Unicode
documents is "numeric ordering", so I kept that in there as well.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Aug 21, 2017 at 4:48 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
On 8/21/17 12:33, Peter Geoghegan wrote:
On Mon, Aug 21, 2017 at 8:23 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:Here are my patches to address this.
These look good.
Committed. That closes this open item.
Thanks again.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Peter Eisentraut wrote:
Here are my patches to address this.
For the record, attached are the collname that initdb now creates
in pg_collation, when compiled successively with all current
versions of ICU (49 to 59), versus what 10beta2 did.
There are still a few names that get dropped along the ICU
upgrade path, but now they look like isolated cases.
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On Tue, Aug 22, 2017 at 4:58 AM, Daniel Verite <daniel@manitou-mail.org> wrote:
For the record, attached are the collname that initdb now creates
in pg_collation, when compiled successively with all current
versions of ICU (49 to 59), versus what 10beta2 did.There are still a few names that get dropped along the ICU
upgrade path, but now they look like isolated cases.
Even though ICU initdb collations are now as stable as possible, which
is great, I still think that Tom had it right about pg_upgrade: Long
term, it would be preferable if we also did a CREATE COLLATION when
initdb stable collations/base ICU locales go away for pg_upgrade. We
should do such a CREATE COLLATION if and only if that makes the
upgrade succeed where it would otherwise fail. This wouldn't be a
substitute for initdb collation name stability. It would work
alongside it.
This makes sense with ICU. The equivalent of a user-defined CREATE
COLLATION with an old country code may continue to work acceptably
because ICU/CLDR supports aliasing, and/or doesn't actually care that
a deleted country tag (e.g. the one for Serbia and Montenegro [1]https://en.wikipedia.org/wiki/ISO_3166-2:CS -- Peter Geoghegan) was
used. It'll still interpret Serbian as Serbian (sr-*), regardless of
what country code may also appear, even if the country code is not
just obsolete, but entirely bogus.
Events like the dissolution of countries are rare enough that that
extra assurance is just a nice-to-have, though.
[1]: https://en.wikipedia.org/wiki/ISO_3166-2:CS -- Peter Geoghegan
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers