ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

Started by Peter Geoghegan, almost 9 years ago, 20 messages, hackers

pg@bowt.ie

almost 9 years ago

On Sun, Aug 6, 2017 at 1:06 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Sat, Aug 5, 2017 at 8:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm quite disturbed though that the set of installed collations on these
two test cases seem to be entirely different both from each other and from
what you reported. The base collations look generally similar, but the
"keyword variant" versions are not comparable at all. Considering that
the entire reason we are interested in ICU in the first place is its
alleged cross-version collation behavior stability, this gives me the
exact opposite of a warm fuzzy feeling. We need to understand why it's
like that and what we can do to reduce the variation, or else we're just
buying our users enormous future pain. At least with the libc collations,
you can expect that if you have en_US.utf8 available today you will
probably still have en_US.utf8 available tomorrow. I am not seeing any
reason to believe that the same holds for ICU collations.

+1. That seems like something that is important to get right up-front.

I've looked into this. I'll give an example of what keyword variants
there are for Greek, and then discuss what I think each is. These are
the keyword variants on my machine with master + ICU support (ICU
55):

postgres=# \dOS+ el-*
                                      List of collations
   Schema   │          Name          │     Collate      │      Ctype       │ Provider │ Description
────────────┼────────────────────────┼──────────────────┼──────────────────┼──────────┼─────────────
 pg_catalog │ el-u-co-emoji-x-icu    │ el-u-co-emoji    │ el-u-co-emoji    │ icu      │ Greek
 pg_catalog │ el-u-co-eor-x-icu      │ el-u-co-eor      │ el-u-co-eor      │ icu      │ Greek
 pg_catalog │ el-u-co-search-x-icu   │ el-u-co-search   │ el-u-co-search   │ icu      │ Greek
 pg_catalog │ el-u-co-standard-x-icu │ el-u-co-standard │ el-u-co-standard │ icu      │ Greek
 pg_catalog │ el-x-icu               │ el               │ el               │ icu      │ Greek
(5 rows)

Greek has only one region, standard Greek. A few other
language-regions have variations like multiple regions (e.g. Austrian
German), or a phonebook variant, which you don't see here. Almost all
have -emoji, -search, and -standard, which you do see here.

We pass "commonlyUsed = true" to ucol_getKeywordValuesForLocale()
within pg_import_system_collations(), and so it "will return only
commonly used values with the given locale in preferred order". But
should we go even further? If the charter of
pg_import_system_collations() is to import every possible valid
collation for pg_collation, then it's already failing at that by
limiting itself to "common variants". I agree with the decision to do
that, though, and I think we probably need to go a bit further.

Possible issues with current ICU pg_collation entries after initdb:

* I don't think we should have user-visible "search" collations at all.

Apparently "search" collations are useful because "primary- and
secondary-level distinctions for searching may not be the same as
those for sorting; in ICU, many languages provide a special "search"
collator with the appropriate level settings for search" [1]. I don't
think that we should expose "search" keyword variants at all, because
clearly they're an implementation detail that Postgres may one day
have special knowledge of [2], to correctly mix searching and sorting
semantics. For the time being, those should simply not be added within
pg_import_system_collations(). Someone could still create the entries
themselves, which seems harmless. Let's avoid establishing the
expectation that they'll be in pg_collation.

* Redundant ICU spellings for the same collation seem to appear.

I find it questionable that there is both an "el-x-icu" and an
"el-u-co-standard-x-icu". That looks like an artifact of how
pg_import_system_collations() was written, as opposed to a bona fide
behavioral difference. I cannot find an example of a
"$COUNTRY_CODE-x-icu" collation without a corresponding
"$COUNTRY_CODE-*-u-standard-x-icu" (The situation is similar for
regional variants, like Austrian German). What, if anything, is the
difference between each such pair of collations? Can we find a way to
provide only one canonical entry if those are simply different ICU
spellings?

* Many emoji variant collations.

I have to wonder if there is much value in creating so many
pg_collation entries that are mere variants to do pictographic emoji
sorting. Call me a killjoy, but I think that users who want that
behavior can create the collations themselves. We could still document
it. I wouldn't mind it if there weren't so many emoji collations.

* Many EOR variant collations.

EOR as a collation variant is an ICU hack to get around the fact that
EOR doesn't fit with their taxonomy for locales. My understanding is
that there is supposed to be one EOR collation, used across Europe,
per the ISO standard. I think ICU structures it as a variant because
ICU only provides collations through locales, and collation is only
one property of a locale. EOR has no opinion about what a currency
sign should look like, unlike an ICU locale.

Maybe we should only have one EOR collation unless the user creates
one of their own. We only care about distinct collation behavior, at
least as far as ICU knows.
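
To make the "redundant spellings" point above concrete, here is a small
sketch that takes the five Greek names from the \dOS+ output and pairs
each base entry with its "-u-co-standard" counterpart. The helper names
are hypothetical and this is not PostgreSQL code. It only pattern-matches
on names; whether the paired collations really behave identically is the
open question, and the sketch does not answer it.

```python
# Illustrative only: hypothetical helpers, not part of PostgreSQL.
# The names below are copied from the \dOS+ el-* listing earlier in this message.
GREEK_COLLATIONS = [
    "el-u-co-emoji-x-icu",
    "el-u-co-eor-x-icu",
    "el-u-co-search-x-icu",
    "el-u-co-standard-x-icu",
    "el-x-icu",
]

SUFFIX = "-x-icu"          # suffix seen on every ICU entry in the listing
VARIANT_MARKER = "-u-co-"  # introduces the collation keyword in the listing


def split_name(collname):
    """Split a pg_collation name into (base locale, collation keyword or None)."""
    locale = collname[: -len(SUFFIX)] if collname.endswith(SUFFIX) else collname
    base, marker, keyword = locale.partition(VARIANT_MARKER)
    return base, (keyword if marker else None)


def suspected_duplicates(collnames):
    """Pair each base entry with a "standard" variant of the same base locale."""
    names = set(collnames)
    pairs = []
    for name in sorted(names):
        base, keyword = split_name(name)
        if keyword == "standard" and base + SUFFIX in names:
            pairs.append((base + SUFFIX, name))
    return pairs


print(suspected_duplicates(GREEK_COLLATIONS))
```

Pairs reported this way are only candidates for collapsing into one
canonical entry; confirming them would need an actual comparison of
sort behavior through ICU.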

[1]: http://userguide.icu-project.org/collation/icu-string-search-service
[2]: http://www.unicode.org/reports/tr35/#UnicodeCollationIdentifier
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Peter Eisentraut

peter_e@gmx.net

almost 9 years ago

In reply to: Peter Geoghegan (#1)

Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

On 8/6/17 20:07, Peter Geoghegan wrote:

I've looked into this. I'll give an example of what keyword variants
there are for Greek, and then discuss what I think each is.

I'm not sure why we want to get into editorializing this. We query ICU
for the names of distinct collations and use that. It's more than most
people need, sure, but it doesn't cost us anything. The alternatives
are hand-maintaining a list of collations, or installing no collations
by default. Both of those are arguably worse for users or for future
code maintenance or both.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Peter Geoghegan

pg@bowt.ie

almost 9 years ago

In reply to: Peter Eisentraut (#2)

Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

On Mon, Aug 7, 2017 at 2:50 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 8/6/17 20:07, Peter Geoghegan wrote:

I've looked into this. I'll give an example of what keyword variants
there are for Greek, and then discuss what I think each is.

I'm not sure why we want to get into editorializing this. We query ICU
for the names of distinct collations and use that.

We ask ucol_getKeywordValuesForLocale() to get only "commonly used
[variant] values with the given locale" within
pg_import_system_collations(). So the editorializing has already
begun.

It's more than most
people need, sure, but it doesn't cost us anything.

It's also *less* than what other users need. I disagree on the cost of
redundancy among entries after initdb. It's just confusing to users,
and seems avoidable without adding special case logic. What's the
difference between el-u-co-standard-x-icu and el-x-icu?

The alternatives
are hand-maintaining a list of collations, or installing no collations
by default.

A better alternative would be to actively take an interest in what
collations are created, by further refining the rules by which they
are created. We have a stable API, described by various standards,
that we can work with for this. This doesn't have to be a
maintainability burden. We can provide general guidance about how to
add stuff back within documentation.

I do think that we should actually list all the collations that are
available by default on some representative ICU version, once that
list is tightened up, just as other database systems list them. That
necessitates a little weasel wording that notes that later ICU
versions might add more, but that's not a problem IMV. I don't think
that CLDR will ever omit anything previously available, at least
within a reasonable timeframe [1].

[1]: http://cldr.unicode.org/index/process/cldr-data-retention-policy
--
Peter Geoghegan


Tom Lane

tgl@sss.pgh.pa.us

almost 9 years ago

In reply to: Peter Eisentraut (#2)

Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:

On 8/6/17 20:07, Peter Geoghegan wrote:

I've looked into this. I'll give an example of what keyword variants
there are for Greek, and then discuss what I think each is.

I'm not sure why we want to get into editorializing this. We query ICU
for the names of distinct collations and use that. It's more than most
people need, sure, but it doesn't cost us anything.

Yes, *it does*. The cost will be borne by users who get screwed at update
time, not by developers, but that doesn't make it insignificant.

The alternatives are hand-maintaining a list of collations, or
installing no collations by default. Both of those are arguably worse
for users or for future code maintenance or both.

I'm not (yet) convinced that we need a hand-maintained whitelist. But
I am wondering why we're expending extra code to import keyword variants.
Who is that catering to, really?

The thing that I'm particularly thinking about is that if someone wants
an ICU variant collation that we didn't make initdb provide, they'll do
a CREATE COLLATION and go use it. At update time, pg_dump or pg_upgrade
will export/import that via CREATE COLLATION, and the only way it fails
is if ICU rejects the collation name as garbage. (Which, as we already
established upthread, it's quite unlikely to do.) On the other hand,
if someone relies on an ICU variant collation that initdb did import,
and then in the next release that collation doesn't get imported because
ICU changed their minds on what to advertise, the update situation is not
pretty at all. Certainly it won't get handled transparently. This line
of thinking leads me to believe that we ought to be pretty conservative
about what we import during initdb.

regards, tom lane


Peter Geoghegan

pg@bowt.ie

almost 9 years ago

In reply to: Tom Lane (#4)

Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

On Mon, Aug 7, 2017 at 3:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The thing that I'm particularly thinking about is that if someone wants
an ICU variant collation that we didn't make initdb provide, they'll do
a CREATE COLLATION and go use it. At update time, pg_dump or pg_upgrade
will export/import that via CREATE COLLATION, and the only way it fails
is if ICU rejects the collation name as garbage. (Which, as we already
established upthread, it's quite unlikely to do.)

Actually, it's *impossible* for ICU to fail to accept any string as a
valid locale within CREATE COLLATION, because CollationCreate() simply
doesn't sanitize ICU names. It doesn't do something like call
get_icu_language_tag(), unlike initdb (within
pg_import_system_collations()).

If I add such a test to CollationCreate(), it does a reasonable job of
sanitizing, while preserving the spirit of the BCP 47 language tag
format by not assuming that the user didn't specify a brand new locale
that it hasn't heard of. All of these are accepted with unmodified
master; with the test added, two of them are rejected:

postgres=# CREATE COLLATION test1 (provider = icu, locale = 'en-x-icu');
CREATE COLLATION
postgres=# CREATE COLLATION test2 (provider = icu, locale = 'foo bar baz');
ERROR: XX000: could not convert locale name "foo bar baz" to language tag: U_ILLEGAL_ARGUMENT_ERROR
LOCATION: get_icu_language_tag, collationcmds.c:454
postgres=# CREATE COLLATION test3 (provider = icu, locale = 'en-gb-icu');
ERROR: XX000: could not convert locale name "en-gb-icu" to language tag: U_ILLEGAL_ARGUMENT_ERROR
LOCATION: get_icu_language_tag, collationcmds.c:454
postgres=# CREATE COLLATION test4 (provider = icu, locale = 'not-a-country');
CREATE COLLATION

If it's mandatory for get_icu_language_tag() to not throw an error
during initdb import when passed strings like these (that are
generated mechanically), why should we not do the same with CREATE
COLLATION? While the choice to preserve BCP 47's tolerance of missing
collations is debatable, not doing at least this much up-front is a
bug IMV.

--
Peter Geoghegan


Noah Misch

noah@leadboat.com

almost 9 years ago

In reply to: Tom Lane (#4)

Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

On Mon, Aug 07, 2017 at 06:23:56PM -0400, Tom Lane wrote:

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:

On 8/6/17 20:07, Peter Geoghegan wrote:

I've looked into this. I'll give an example of what keyword variants
there are for Greek, and then discuss what I think each is.

I'm not sure why we want to get into editorializing this. We query ICU
for the names of distinct collations and use that. It's more than most
people need, sure, but it doesn't cost us anything.

Yes, *it does*. The cost will be borne by users who get screwed at update
time, not by developers, but that doesn't make it insignificant.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Peter,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com


Noah Misch

noah@leadboat.com

almost 9 years ago

In reply to: Noah Misch (#6)

Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

On Thu, Aug 10, 2017 at 04:51:16AM +0000, Noah Misch wrote:

On Mon, Aug 07, 2017 at 06:23:56PM -0400, Tom Lane wrote:

Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:

On 8/6/17 20:07, Peter Geoghegan wrote:

I've looked into this. I'll give an example of what keyword variants
there are for Greek, and then discuss what I think each is.

I'm not sure why we want to get into editorializing this. We query ICU
for the names of distinct collations and use that. It's more than most
people need, sure, but it doesn't cost us anything.

Yes, *it does*. The cost will be borne by users who get screwed at update
time, not by developers, but that doesn't make it insignificant.

[Action required within three days. This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item. Peter,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10. Consequently, I will appreciate your efforts
toward speedy resolution. Thanks.

[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com


Peter Eisentraut

peter_e@gmx.net

almost 9 years ago

In reply to: Peter Geoghegan (#5)

Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

On 8/7/17 21:00, Peter Geoghegan wrote:

Actually, it's *impossible* for ICU to fail to accept any string as a
valid locale within CREATE COLLATION, because CollationCreate() simply
doesn't sanitize ICU names. It doesn't do something like call
get_icu_language_tag(), unlike initdb (within
pg_import_system_collations()).

If I add such a test to CollationCreate(), it does a reasonable job of
sanitizing, while preserving the spirit of the BCP 47 language tag
format by not assuming that the user didn't specify a brand new locale
that it hasn't heard of.

I'm not sure what you are proposing here. Convert the input to CREATE
COLLATION to a BCP 47 language tag?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Peter Eisentraut

peter_e@gmx.net

almost 9 years ago

In reply to: Noah Misch (#7)

Re: Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

On 8/13/17 15:39, Noah Misch wrote:

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

I think there are up to three separate issues in play:

- what to do about some preloaded collations disappearing between versions

- whether to preload keyword variants

- whether to canonicalize some things during CREATE COLLATION

I responded to all these subplots now, but the discussion is ongoing. I
will set the next check-in to Thursday.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#10

Peter Eisentraut

peter_e@gmx.net

almost 9 years ago

In reply to: Peter Eisentraut (#9)

Re: Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

On 8/14/17 12:23, Peter Eisentraut wrote:

On 8/13/17 15:39, Noah Misch wrote:

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

I think there are up to three separate issues in play:

- what to do about some preloaded collations disappearing between versions

- whether to preload keyword variants

- whether to canonicalize some things during CREATE COLLATION

I responded to all these subplots now, but the discussion is ongoing. I
will set the next check-in to Thursday.

I haven't read anything since that has provided any more clarity about
what needs changing here. I will entertain concrete proposals about the
specific points above (considering any other issues under discussion to
be PG11 material), but in the absence of that, I don't plan any work on
this right now.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#11

Noah Misch

noah@leadboat.com

almost 9 years ago

In reply to: Peter Eisentraut (#10)

Re: Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

On Thu, Aug 17, 2017 at 09:22:07PM -0400, Peter Eisentraut wrote:

On 8/14/17 12:23, Peter Eisentraut wrote:

On 8/13/17 15:39, Noah Misch wrote:

This PostgreSQL 10 open item is past due for your status update. Kindly send
a status update within 24 hours, and include a date for your subsequent status
update. Refer to the policy on open item ownership:
/messages/by-id/20170404140717.GA2675809@tornado.leadboat.com

I think there are up to three separate issues in play:

- what to do about some preloaded collations disappearing between versions

- whether to preload keyword variants

- whether to canonicalize some things during CREATE COLLATION

I responded to all these subplots now, but the discussion is ongoing. I
will set the next check-in to Thursday.

I haven't read anything since that has provided any more clarity about
what needs changing here. I will entertain concrete proposals about the
specific points above (considering any other issues under discussion to
be PG11 material), but in the absence of that, I don't plan any work on
this right now.

I think you're contending that, as formulated, this is not a valid v10 open
item. Are you?


#12

Peter Eisentraut

peter_e@gmx.net

almost 9 years ago

In reply to: Noah Misch (#11)

Re: Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

On 8/17/17 23:13, Noah Misch wrote:

I haven't read anything since that has provided any more clarity about
what needs changing here. I will entertain concrete proposals about the
specific points above (considering any other issues under discussion to
be PG11 material), but in the absence of that, I don't plan any work on
this right now.

I think you're contending that, as formulated, this is not a valid v10 open
item. Are you?

Well, some people are not content with the current state of things, so
it is probably an open item. I will propose patches on Monday to
hopefully close this.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#13

Peter Geoghegan

pg@bowt.ie

almost 9 years ago

In reply to: Noah Misch (#11)

Re: Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

Noah Misch <noah@leadboat.com> wrote:

I think you're contending that, as formulated, this is not a valid v10 open
item. Are you?

As the person that came up with this formulation, I'd like to give a
quick summary of my current understanding of the item's status:

* We're in agreement that we ought to have initdb create initial
collations based on ICU locales, not based on distinct ICU
collations [1].

* We're in agreement that variant keywords should not be
created for each base locale/collation [2].

Once these two changes are made, I think that everything will be in good
shape as far as pg_collation name stability goes. It shouldn't take
Peter E. long to write the patch. I'm happy to write the patch on his
behalf if that saves time.

We're also going to work on the documentation, to make keyword variants
like -emoji and -traditional at least somewhat discoverable, and to
explain the capabilities of custom ICU collations more generally.
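
As a rough illustration of the second point, the sketch below splits the
Greek entries shown at the start of the thread into the names initdb
would still preload and the keyword variants that would be left for
users to create themselves. The function name and the reliance on the
"-u-co-" substring are assumptions made for this example; it is not the
patch under discussion.

```python
# Illustrative only: a hypothetical filter, not the committed patch.
VARIANT_MARKER = "-u-co-"  # assumption: marks a collation keyword variant

GREEK_COLLATIONS = [
    "el-u-co-emoji-x-icu",
    "el-u-co-eor-x-icu",
    "el-u-co-search-x-icu",
    "el-u-co-standard-x-icu",
    "el-x-icu",
]


def split_preload(collnames):
    """Return (names to preload, keyword variants left to the user)."""
    preloaded = []
    left_to_user = []
    for name in collnames:
        if VARIANT_MARKER in name:
            left_to_user.append(name)
        else:
            preloaded.append(name)
    return preloaded, left_to_user


preloaded, left_to_user = split_preload(GREEK_COLLATIONS)
print("preloaded:", preloaded)
print("left to CREATE COLLATION:", left_to_user)
```

With the Greek list, only the base entry stays preloaded; the four
variants remain reachable through CREATE COLLATION, which is why the
documentation work mentioned above matters.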

[1]: /messages/by-id/f67f36d7-ceb6-cfbd-28d4-413c6d22fe5b@2ndquadrant.com
[2]: /messages/by-id/3862d484-f0a5-9eef-c54e-3f6808338726@2ndquadrant.com

--
Peter Geoghegan


#14

Peter Eisentraut

peter_e@gmx.net

almost 9 years ago

In reply to: Peter Geoghegan (#13)

Re: Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

On 8/19/17 19:15, Peter Geoghegan wrote:

Noah Misch <noah@leadboat.com> wrote:

I think you're contending that, as formulated, this is not a valid v10 open
item. Are you?

As the person that came up with this formulation, I'd like to give a
quick summary of my current understanding of the item's status:

* We're in agreement that we ought to have initdb create initial
collations based on ICU locales, not based on distinct ICU
collations [1].

* We're in agreement that variant keywords should not be
created for each base locale/collation [2].

Once these two changes are made, I think that everything will be in good
shape as far as pg_collation name stability goes. It shouldn't take
Peter E. long to write the patch. I'm happy to write the patch on his
behalf if that saves time.

We're also going to work on the documentation, to make keyword variants
like -emoji and -traditional at least somewhat discoverable, and to
explain the capabilities of custom ICU collations more generally.

Here are my patches to address this.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15

Peter Geoghegan

pg@bowt.ie

almost 9 years ago

In reply to: Peter Eisentraut (#14)

Re: Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

On Mon, Aug 21, 2017 at 8:23 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

Here are my patches to address this.

These look good.

One small piece of feedback: I suggest naming the custom collation
"numeric" something else instead: "natural". Apparently, the behavior
it implements is sometimes called natural sorting. See
https://en.wikipedia.org/wiki/Natural_sort_order.

Thanks
--
Peter Geoghegan


#16

Peter Geoghegan

pg@bowt.ie

almost 9 years ago

In reply to: Peter Geoghegan (#15)

Re: Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

On Mon, Aug 21, 2017 at 9:33 AM, Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Aug 21, 2017 at 8:23 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

Here are my patches to address this.

These look good.

Also, I don't know why en-u-kr-others-digit wasn't accepted by CREATE
COLLATION, as you said on the other thread just now. That's directly
lifted from TR #35. Is it an ICU version issue? I guess it doesn't
matter that much, though.

--
Peter Geoghegan


#17

Peter Eisentraut

peter_e@gmx.net

almost 9 years ago

In reply to: Peter Geoghegan (#15)

Re: Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

On 8/21/17 12:33, Peter Geoghegan wrote:

On Mon, Aug 21, 2017 at 8:23 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

Here are my patches to address this.

These look good.

Committed. That closes this open item.

One small piece of feedback: I suggest naming the custom collation
"numeric" something else instead: "natural". Apparently, the behavior
it implements is sometimes called natural sorting. See
https://en.wikipedia.org/wiki/Natural_sort_order.

I have added a note about that, but the official name in the Unicode
documents is "numeric ordering", so I kept that in there as well.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#18

Peter Geoghegan

pg@bowt.ie

almost 9 years ago

In reply to: Peter Eisentraut (#17)

Re: Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

On Mon, Aug 21, 2017 at 4:48 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 8/21/17 12:33, Peter Geoghegan wrote:

On Mon, Aug 21, 2017 at 8:23 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

Here are my patches to address this.

These look good.

Committed. That closes this open item.

Thanks again.

--
Peter Geoghegan


#19

Daniel Verite

daniel@manitou-mail.org

almost 9 years ago

In reply to: Peter Eisentraut (#14)

Re: Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

Peter Eisentraut wrote:

Here are my patches to address this.

For the record, attached are the collname values that initdb now
creates in pg_collation, when compiled successively with all current
versions of ICU (49 to 59), versus what 10beta2 did.

There are still a few names that get dropped along the ICU
upgrade path, but now they look like isolated cases.
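
A comparison of this kind can be sketched as a set difference between
consecutive ICU versions. The names and version numbers in the example
data below are invented placeholders, not the contents of the
attachment, and the function name is hypothetical.

```python
# Illustrative only. The data is made up for the example and does NOT
# reflect what any real ICU version provides.
EXAMPLE_NAMES_BY_ICU_VERSION = {
    52: {"demo1-x-icu", "demo2-x-icu", "demo3-x-icu"},
    55: {"demo1-x-icu", "demo2-x-icu", "demo4-x-icu"},
    59: {"demo1-x-icu", "demo4-x-icu", "demo5-x-icu"},
}


def dropped_names(names_by_version):
    """Map (older, newer) version pairs to the collation names that vanish."""
    versions = sorted(names_by_version)
    dropped = {}
    for older, newer in zip(versions, versions[1:]):
        gone = names_by_version[older] - names_by_version[newer]
        if gone:
            dropped[(older, newer)] = sorted(gone)
    return dropped


for (older, newer), names in dropped_names(EXAMPLE_NAMES_BY_ICU_VERSION).items():
    print("ICU %d -> %d drops: %s" % (older, newer, ", ".join(names)))
```

Names that are only added between versions do not show up here, since
additions do not break an upgrade.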

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

#20

Peter Geoghegan

pg@bowt.ie

almost 9 years ago

In reply to: Daniel Verite (#19)

Re: Re: ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

On Tue, Aug 22, 2017 at 4:58 AM, Daniel Verite <daniel@manitou-mail.org> wrote:

For the record, attached are the collname values that initdb now
creates in pg_collation, when compiled successively with all current
versions of ICU (49 to 59), versus what 10beta2 did.

There are still a few names that get dropped along the ICU
upgrade path, but now they look like isolated cases.

Even though ICU initdb collations are now as stable as possible, which
is great, I still think that Tom had it right about pg_upgrade: Long
term, it would be preferable if we also did a CREATE COLLATION when
initdb stable collations/base ICU locales go away for pg_upgrade. We
should do such a CREATE COLLATION if and only if that makes the
upgrade succeed where it would otherwise fail. This wouldn't be a
substitute for initdb collation name stability. It would work
alongside it.
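
The "if and only if" rule could look roughly like the sketch below: emit
a CREATE COLLATION only for names the old cluster relies on that the new
cluster does not already have. This is a thought experiment, not
pg_upgrade code. The function name is hypothetical, the placeholder name
"demo-gone-x-icu" is invented, and deriving the locale by stripping
"-x-icu" is an assumption based on the initdb naming seen earlier in the
thread. It also ignores which schema the re-created collation would live
in.

```python
# Illustrative only: hypothetical helper, not pg_upgrade code.
SUFFIX = "-x-icu"  # assumption: initdb names are "<locale>-x-icu"


def upgrade_fixups(used_in_old, available_in_new):
    """Return CREATE COLLATION statements only for names the new cluster lacks."""
    statements = []
    for name in sorted(set(used_in_old) - set(available_in_new)):
        locale = name[: -len(SUFFIX)] if name.endswith(SUFFIX) else name
        statements.append(
            "CREATE COLLATION \"{}\" (provider = icu, locale = '{}');".format(name, locale)
        )
    return statements


# "demo-gone-x-icu" stands in for a collation that a newer ICU no longer advertises.
for statement in upgrade_fixups(["el-x-icu", "demo-gone-x-icu"], ["el-x-icu"]):
    print(statement)
```

When every name the old cluster uses is still available, the list is
empty and nothing extra is created, so this stays a fallback rather than
a replacement for stable initdb names.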

This makes sense with ICU. The equivalent of a user-defined CREATE
COLLATION with an old country code may continue to work acceptably
because ICU/CLDR supports aliasing, and/or doesn't actually care that
a deleted country tag (e.g. the one for Serbia and Montenegro [1]) was
used. It'll still interpret Serbian as Serbian (sr-*), regardless of
what country code may also appear, even if the country code is not
just obsolete, but entirely bogus.

Events like the dissolution of countries are rare enough that that
extra assurance is just a nice-to-have, though.

[1]: https://en.wikipedia.org/wiki/ISO_3166-2:CS
--
Peter Geoghegan

