CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
As I pointed out a couple of times already [1]postgr.es/m/CAH2-Wzm22vtxvD-e1oz90DE8Z_M61_8amHsDOZf1PWRKfRmj1g@mail.gmail.com, we don't currently
sanitize ICU's BCP 47 language tags within CREATE COLLATION. CREATE
COLLATION will accept literally any string as a language tag as things
stand, even when the string is unambiguously bogus. While I accept
that there are limits on how far you can take sanitizing the BCP 47
tag format, due to its extensibility and "best effort" emphasis on
forward and backward compatibility, we can and should do more here,
IMHO. We should at least do the bare minimum, which has no possible
downside, and some notable upsides.
If I hack the CREATE COLLATION code to put any language tag string
provided by the user through the same sanitization process that initdb
already puts ICU language tags through, then we do much better. CREATE
COLLATION rejects syntax errors, which seems desirable:
postgres=# CREATE COLLATION test1 (provider = icu, locale = 'en-x-icu');
CREATE COLLATION
postgres=# CREATE COLLATION test2 (provider = icu, locale = 'foo bar baz');
ERROR: XX000: could not convert locale name "foo bar baz" to language
tag: U_ILLEGAL_ARGUMENT_ERROR
LOCATION: get_icu_language_tag, collationcmds.c:454
postgres=# CREATE COLLATION test3 (provider = icu, locale = 'en-gb-icu');
ERROR: XX000: could not convert locale name "en-gb-icu" to language
tag: U_ILLEGAL_ARGUMENT_ERROR
LOCATION: get_icu_language_tag, collationcmds.c:454
postgres=# CREATE COLLATION test4 (provider = icu, locale = 'not-a-country');
CREATE COLLATION
(To be more specific, I'm calling
get_icu_language_tag()/uloc_toLanguageTag() [2]https://ssl.icu-project.org/apiref/icu4c/uloc_8h.html#a1d50c91925ca3853fce6f28cf7390c3c -- Peter Geoghegan as an extra step for
CREATE COLLATION here.)
It's not like the current behavior is a disaster, or that the
alternative behavior that I propose is perfect. The collation behavior
you end up with today, having specified a language tag with a syntax
error is the totally generic base Ducet collation behavior. Using 'foo
bar baz' is effectively the same as using the preinstalled 'und-x-icu'
collation, which I'm pretty sure is the same as using any current
English locale anyway. That said, falling back on the most useful
collation behavior based on inferences about the language tag is
supposed to be about rolling with the complexities of
internationalization, like political changes that are not yet
accounted for by the CLDR/ICU version, and so on. It's no
justification for not letting the user know when they've fat fingered
a language tag, which they could easily miss. These are *syntax*
errors.
At one point a couple of months back, it was understood that
get_icu_language_tag() might not always work with (assumed) valid
locale names -- that is at least the impression that the commit
message of eccead9 left me with. But, that was only with ICU 4.2, and
in any case we've since stopped creating keyword variants at initdb
time for other reasons (see 2bfd1b1 for details of those other
reasons). I tend to think that we should not install any language tag
that uloc_toLanguageTag() does not accept as valid on general
principle (so not just at initdb time, when it's actually least
needed).
Thoughts? I can write a patch for this, if that helps. It should be
straightforward.
[1]: postgr.es/m/CAH2-Wzm22vtxvD-e1oz90DE8Z_M61_8amHsDOZf1PWRKfRmj1g@mail.gmail.com
[2]: https://ssl.icu-project.org/apiref/icu4c/uloc_8h.html#a1d50c91925ca3853fce6f28cf7390c3c -- Peter Geoghegan
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 09/19/2017 12:46 AM, Peter Geoghegan wrote:> At one point a couple of
months back, it was understood that
get_icu_language_tag() might not always work with (assumed) valid
locale names -- that is at least the impression that the commit
message of eccead9 left me with. But, that was only with ICU 4.2, and
in any case we've since stopped creating keyword variants at initdb
time for other reasons (see 2bfd1b1 for details of those other
reasons). I tend to think that we should not install any language tag
that uloc_toLanguageTag() does not accept as valid on general
principle (so not just at initdb time, when it's actually least
needed).Thoughts? I can write a patch for this, if that helps. It should be
straightforward.
Hm, I like the idea but I see some issues.
Enforcing the BCP47 seems like a good thing to me. I do not see any
reason to allow input with syntax errors. The issue though is that we do
not want to break people's databases when they upgrade to PostgreSQL 11.
What if they have specified the locale in the old non-ICU format or they
have a bogus value and we then error out on pg_upgrade or pg_restore?
Andreas
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andreas Karlsson <andreas@proxel.se> writes:
Hm, I like the idea but I see some issues.
Enforcing the BCP47 seems like a good thing to me. I do not see any
reason to allow input with syntax errors. The issue though is that we do
not want to break people's databases when they upgrade to PostgreSQL 11.
What if they have specified the locale in the old non-ICU format or they
have a bogus value and we then error out on pg_upgrade or pg_restore?
Well, if PG10 shipped with that restriction in place then it wouldn't
be an issue ;-)
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 19, 2017 at 2:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andreas Karlsson <andreas@proxel.se> writes:
Hm, I like the idea but I see some issues.
Enforcing the BCP47 seems like a good thing to me. I do not see any
reason to allow input with syntax errors. The issue though is that we do
not want to break people's databases when they upgrade to PostgreSQL 11.
What if they have specified the locale in the old non-ICU format or they
have a bogus value and we then error out on pg_upgrade or pg_restore?Well, if PG10 shipped with that restriction in place then it wouldn't
be an issue ;-)
I was proposing that this be treated as an open item for v10; sorry if
I was unclear on that. Much like the "ICU locales vs. ICU collations
within pg_collation" issue, this seems like the kind of thing that we
ought to go out of our way to get right in the *first* version.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 09/19/2017 11:32 PM, Peter Geoghegan wrote:
On Tue, Sep 19, 2017 at 2:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Well, if PG10 shipped with that restriction in place then it wouldn't
be an issue ;-)I was proposing that this be treated as an open item for v10; sorry if
I was unclear on that. Much like the "ICU locales vs. ICU collations
within pg_collation" issue, this seems like the kind of thing that we
ought to go out of our way to get right in the *first* version.
If people think it is possible to get this done in time for PostgreSQL
10 and it does not break anything on older version of ICU (or the
migration from older versions) I am all for it.
Andreas
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 19, 2017 at 3:23 PM, Andreas Karlsson <andreas@proxel.se> wrote:
If people think it is possible to get this done in time for PostgreSQL 10
and it does not break anything on older version of ICU (or the migration
from older versions) I am all for it.
The only behavioral difference would occur when CREATE COLLATION is
run (no changes to the initdb behavior, where the real risk exists).
What this boils down to is that we have every reason to think that the
right thing is to reject something that ICU's uloc_toLanguageTag()
itself rejects as invalid (through the get_icu_language_tag()
function). It looked like we were equivocating on following this at
one point, prior to 2bfd1b1, in order to suit ICU 4.2 (again, see
commit eccead9). I tend to think that the way we used to concatenate
variant keywords against locale names during initdb was simply wrong
in ICU 4.2, for some unknown reason. I think that the behavior that I
propose might prevent things from silently failing on ICU 4.2 that
were previously *believed* to work there, but in fact these things
(variant keywords) never really worked.
There might be an argument to be made for passing strict = FALSE to
uloc_toLanguageTag() instead, so that we replace the language tag with
one that is known to have valid syntax, and store that in pg_collation
instead (while possibly raising a NOTICE). I guess that that would
actually be the minimal fix here. I still favor a hard reject of
invalid BCP 47 syntax, though. PostgreSQL's CREATE COLLATION is not
one of the places where this kind of leniency makes sense.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 9/18/17 18:46, Peter Geoghegan wrote:
As I pointed out a couple of times already [1], we don't currently
sanitize ICU's BCP 47 language tags within CREATE COLLATION.
There is no requirement that the locale strings for ICU need to be BCP
47. ICU locale names like 'de@collation=phonebook' are also acceptable.
The reason they are not validated is that, as you know, ICU accepts any
locale string as valid. You appear to have found a way to do some
validation, but I would like to see that code.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
On 9/18/17 18:46, Peter Geoghegan wrote:
As I pointed out a couple of times already [1], we don't currently
sanitize ICU's BCP 47 language tags within CREATE COLLATION.There is no requirement that the locale strings for ICU need to be BCP
47. ICU locale names like 'de@collation=phonebook' are also acceptable.
Right. But, we only document that BCP 47 is supported by Postgres.
Maybe we can use get_icu_language_tag()/uloc_toLanguageTag() to ensure
that we end up with BCP 47, even when the user happens to specify the
legacy syntax. Should we be "canonicalizing" legacy ICU locale strings
as BCP 47, too?
The reason they are not validated is that, as you know, ICU accepts any
locale string as valid. You appear to have found a way to do some
validation, but I would like to see that code.
As I mentioned, I'm simply calling
get_icu_language_tag()/uloc_toLanguageTag() to do that sanitization.
The code to get the extra sanitization is completely trivial.
I didn't post the patch that generates the errors in my opening e-mail
because I'm not confident it's correct just yet. And, I think that I
see a bigger problem: we pass a string that is almost certainly a BCP
47 string to ucol_open() from within pg_newlocale_from_collation(). We
do so despite the fact that ucol_open() apparently doesn't accept BCP
47 syntax locale strings until ICU 54 [1]https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590 -- Peter Geoghegan. Seems entirely possible
that this accounts for the problems you saw on ICU 4.2, back when we
were still creating keyword variants (I guess that the keyword
variants seem very "BCP 47-ish" to me).
I really think we need to add some kind of debug mode that makes ICU
optionally spit out a locale display name at key points. We need this
to gain confidence that the behavior that ICU provides actually
matches what is expected across ICU different versions for different
locale formats. We talked about this as a user-facing feature before,
which can wait till v11; I just want this to debug problems like this
one.
[1]: https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590 -- Peter Geoghegan
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 19, 2017 at 7:01 PM, Peter Geoghegan <pg@bowt.ie> wrote:
I didn't post the patch that generates the errors in my opening e-mail
because I'm not confident it's correct just yet. And, I think that I
see a bigger problem: we pass a string that is almost certainly a BCP
47 string to ucol_open() from within pg_newlocale_from_collation(). We
do so despite the fact that ucol_open() apparently doesn't accept BCP
47 syntax locale strings until ICU 54 [1]. Seems entirely possible
that this accounts for the problems you saw on ICU 4.2, back when we
were still creating keyword variants (I guess that the keyword
variants seem very "BCP 47-ish" to me).
ISTM that the proper fix here is to use uloc_forLanguageTag() [1]https://ssl.icu-project.org/apiref/icu4c/uloc_8h.html#aa45d6457f72867880f079e27a63c6fcb -- Peter Geoghegan (not
to be confused with uloc_toLanguageTag()) to get a valid locale
identifier on versions of ICU where BCP 47 format tags are not
directly accepted as locale identifiers (versions prior to ICU 54).
This would happen as an extra step within
pg_newlocale_from_collation(), since BCP 47 format would be what is
stored in pg_collation.
Since uloc_forLanguageTag() become stable in ICU 4.2, the earliest
version that we support, I believe that that would leave us in good
shape.
[1]: https://ssl.icu-project.org/apiref/icu4c/uloc_8h.html#aa45d6457f72867880f079e27a63c6fcb -- Peter Geoghegan
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 9/19/17 22:01, Peter Geoghegan wrote:
On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:On 9/18/17 18:46, Peter Geoghegan wrote:
As I pointed out a couple of times already [1], we don't currently
sanitize ICU's BCP 47 language tags within CREATE COLLATION.There is no requirement that the locale strings for ICU need to be BCP
47. ICU locale names like 'de@collation=phonebook' are also acceptable.Right. But, we only document that BCP 47 is supported by Postgres.
The documentation is admittedly not very concrete about what ICU locale
names it accepts beyond talking about a "named collator provided by the
ICU library". The examples we provide use the BCP 47 style, but that's
just because we liked them that way.
ICU <54 doesn't even support the BCP 47 style, so we need to keep
supporting the old/other style anyway.
And, I think that I
see a bigger problem: we pass a string that is almost certainly a BCP
47 string to ucol_open() from within pg_newlocale_from_collation(). We
do so despite the fact that ucol_open() apparently doesn't accept BCP
47 syntax locale strings until ICU 54 [1].
pg_import_system_collations() takes care to use the non-BCP-47 style for
such versions, so I think this is working correctly.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 20, 2017 at 4:04 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
ICU <54 doesn't even support the BCP 47 style, so we need to keep
supporting the old/other style anyway.
Yes it does -- just not as a native locale name. ucol_open()
apparently became more liberal in what it would accept in ICU 54.
And, I think that I
see a bigger problem: we pass a string that is almost certainly a BCP
47 string to ucol_open() from within pg_newlocale_from_collation(). We
do so despite the fact that ucol_open() apparently doesn't accept BCP
47 syntax locale strings until ICU 54 [1].pg_import_system_collations() takes care to use the non-BCP-47 style for
such versions, so I think this is working correctly.
But CREATE COLLATION doesn't use pg_import_system_collations().
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Sep 20, 2017 at 4:08 PM, Peter Geoghegan <pg@bowt.ie> wrote:
pg_import_system_collations() takes care to use the non-BCP-47 style for
such versions, so I think this is working correctly.But CREATE COLLATION doesn't use pg_import_system_collations().
And perhaps more to the point: it highly confusing that we use one or
the other of those 2 things ("langtag"/BCP 47 tag or "name"/legacy
locale name) as "colcollate", depending on ICU version, thereby
*behaving* as if ICU < 54 really didn't know anything about BCP 47
tags. Because, obviously earlier ICU versions know plenty about BCP
47, since 9 lines further down we use "langtag"/BCP 47 tag as collname
for CollationCreate() -- regardless of ICU version.
How can you say "ICU <54 doesn't even support the BCP 47 style", given
all that? Those versions will still have locales named "*-x-icu" when
users do "\dOS". Users will be highly confused when they quite
reasonably try to generalize from the example in the docs and what
"\dOS" shows, and get results that are wrong, often only in a very
subtle way.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 09/21/2017 01:40 AM, Peter Geoghegan wrote:
On Wed, Sep 20, 2017 at 4:08 PM, Peter Geoghegan <pg@bowt.ie> wrote:
pg_import_system_collations() takes care to use the non-BCP-47 style for
such versions, so I think this is working correctly.But CREATE COLLATION doesn't use pg_import_system_collations().
And perhaps more to the point: it highly confusing that we use one or
the other of those 2 things ("langtag"/BCP 47 tag or "name"/legacy
locale name) as "colcollate", depending on ICU version, thereby
*behaving* as if ICU < 54 really didn't know anything about BCP 47
tags. Because, obviously earlier ICU versions know plenty about BCP
47, since 9 lines further down we use "langtag"/BCP 47 tag as collname
for CollationCreate() -- regardless of ICU version.How can you say "ICU <54 doesn't even support the BCP 47 style", given
all that? Those versions will still have locales named "*-x-icu" when
users do "\dOS". Users will be highly confused when they quite
reasonably try to generalize from the example in the docs and what
"\dOS" shows, and get results that are wrong, often only in a very
subtle way.
If we are fine with supporting only ICU 4.2 and later (which I think we
are given that ICU 4.2 was released in 2009) then using
uloc_forLanguageTag()[1] to validate and canonize seems like the right
solution. I had missed that this function even existed when I last read
the documentation. Does it return a BCP 47 tag in modern versions of ICU?
I strongly prefer if there, as much as possible, is only one format for
inputting ICU locales.
1.
http://www.icu-project.org/apiref/icu4c/uloc_8h.html#aa45d6457f72867880f079e27a63c6fcb
Andreas
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Sep 21, 2017 at 2:49 AM, Andreas Karlsson <andreas@proxel.se> wrote:
If we are fine with supporting only ICU 4.2 and later (which I think we are
given that ICU 4.2 was released in 2009) then using uloc_forLanguageTag()[1]
to validate and canonize seems like the right solution. I had missed that
this function even existed when I last read the documentation. Does it
return a BCP 47 tag in modern versions of ICU?
The decision to support ICU >= 4.2 was already made (see commit
eccead9). I have no reason to think that that's a problem for us.
As I've said, we currently use uloc_toLanguageTag() on all supported
ICU versions, to at least create a collation name at initdb time, but
also to get our new collation's colcollate when ICU >= 54. If we now
put a uloc_forLanguageTag() in place before each ucol_open() call,
then we can safely store a BCP 47 format tag as colcollate on *every*
ICU version. We can reconstruct a traditional format locale string
*on-the-fly* within pg_newlocale_from_collation() and
get_collation_actual_version(), which would be what we'd pass to
ucol_open() (we'd never pass "raw colcollate").
To keep things simple, we could always convert colcollate to the
traditional format using uloc_forLanguageTag(), just in case we're on
a version of ICU where ucol_open() doesn't like locales that are
spelled using the BCP 47 format. It doesn't seem worth it to take
advantage of the leniency on ICU >= 54, since that would be a special
case (though we could if we wanted to).
I strongly prefer if there, as much as possible, is only one format for
inputting ICU locales.
I agree, but the bigger issue is that we're *half way* between
supporting only one format, and supporting two formats. AFAICT, there
is no reason that we can't simply support one format on all ICU
versions, and keep what ends up within pg_collation at initdb time
essentially the same across ICU versions (except for those that are
due to cultural/political developments).
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Sep 19, 2017 at 7:01 PM, Peter Geoghegan <pg@bowt.ie> wrote:
I really think we need to add some kind of debug mode that makes ICU
optionally spit out a locale display name at key points. We need this
to gain confidence that the behavior that ICU provides actually
matches what is expected across ICU different versions for different
locale formats. We talked about this as a user-facing feature before,
which can wait till v11; I just want this to debug problems like this
one.
I patched CREATE COLLATION to show ICU display name, which produces
output like this:
postgres=# create collation basque (provider=icu,
locale='eu-u-kf-upper-kr-latn-digit-em-emoji-kn-true-co-eor');
NOTICE: 00000: ICU collation description is "Basque
(colcasefirst=upper, Sort Order=European Ordering Rules,
colnumeric=yes, colreorder=latn-digit, em=emoji)"
CREATE COLLATION
I used an ISO 639-1 language code (2 letter language code) above,
which, as we can see, is recognized as Basque. ICU is also fine with
the 3 letter 639-2 code "eus-", recognizing that as Basque, too. If I
use an ISO 639-2 code for Basque that ICU/CLDR doesn't like, "baq-*",
I can see that my expectations have only partially been met, since the
notice doesn't say anything about the language Basque:
postgres=# create collation actually_not_basque (provider=icu,
locale='baq-u-kf-upper-kr-latn-digit-em-emoji-kn-true-co-eor');
NOTICE: 00000: ICU collation description is "baq (colcasefirst=upper,
Sort Order=European Ordering Rules, colnumeric=yes,
colreorder=latn-digit, em=emoji)"
CREATE COLLATION
Functionality like this is starting to look essential to me, rather
than just a nice to have. Having this NOTICE would have made me
realize our problems with ICU versions < 54 much earlier, if nothing
else. If the purpose of NOTICE messages is to "Provide[s] information
that might be helpful to users", then I'd say that this definitely
qualifies. And, the extra code is trivial (we already get display name
in the context of initdb). I strongly recommend that we slip this into
v10, as part of fixing the problem with language tags that earlier ICU
versions have.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 09/21/2017 10:55 PM, Peter Geoghegan wrote:
I agree, but the bigger issue is that we're *half way* between
supporting only one format, and supporting two formats. AFAICT, there
is no reason that we can't simply support one format on all ICU
versions, and keep what ends up within pg_collation at initdb time
essentially the same across ICU versions (except for those that are
due to cultural/political developments).
I think we are in agreement then, but I do not have the time to get this
done before the release of 10 so would be happy if you implemented it.
Peter E, what do you say in this?
Andreas
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 22, 2017 at 1:50 AM, Andreas Karlsson <andreas@proxel.se> wrote:
On 09/21/2017 10:55 PM, Peter Geoghegan wrote:
I agree, but the bigger issue is that we're *half way* between
supporting only one format, and supporting two formats. AFAICT, there
is no reason that we can't simply support one format on all ICU
versions, and keep what ends up within pg_collation at initdb time
essentially the same across ICU versions (except for those that are
due to cultural/political developments).I think we are in agreement then, but I do not have the time to get this
done before the release of 10 so would be happy if you implemented it. Peter
E, what do you say in this?
I can write a patch for this, but will not without some kind of buy-in
from Peter E.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 9/21/17 16:55, Peter Geoghegan wrote:
I strongly prefer if there, as much as possible, is only one format for
inputting ICU locales.I agree, but the bigger issue is that we're *half way* between
supporting only one format, and supporting two formats. AFAICT, there
is no reason that we can't simply support one format on all ICU
versions, and keep what ends up within pg_collation at initdb time
essentially the same across ICU versions (except for those that are
due to cultural/political developments).
After reviewing this thread and testing around a bit, I think the code
is mostly fine as it is, but the documentation is lacking. Hence,
attached is a patch to expand the documentation quite a bit, especially
to document in more detail what ICU locale strings are accepted.
The documentation has always stated, albeit perhaps in obscure ways,
that we accept for locales whatever ICU accepts. I don't think we
should restrict or override that in any way. That would cause existing
documentation and lore on ICU to be invalid for PostgreSQL.
I specifically disagree that we should, as appears to be suggested here,
restrict the input locale strings to BCP 47 and reject or transform the
traditional ICU-specific locale syntax. Again, that would cause
existing ICU documentation material to become less applicable to
PostgreSQL. And basically, there is no reason for it, because I am not
aware that ICU plans to get rid of that syntax. Moreover, we need to
support that syntax for older ICU versions anyway. A patch has been
posted that, as I understand it, would allow BCP 47 syntax to be used
with older versions as well. That's a nice idea, but it's a new feature
and not appropriate for PG10 at this time.
(Note that we also don't canonicalize libc locale names.)
The attached documentation patch documents both locale naming forms in
parallel.
The other attached patch contains a test suite that verifies that the
examples in the documentation actually work. It's not meant for
committing into the mainline, but it was useful for me.
During testing with various versions I have also discovered the
following things:
- The current code appears to be of the opinion that BCP 47 locale names
are accepted as of ICU 54. That appears to be incorrect; I find that
they work fine in ICU 52, but they don't work in ICU 4.2. I don't know
where the cutoff is. That might be worth changing in the code if we can
obtain more information.
- What might have led to the above mistake is the following in the
ucol_open() documentation: q{Starting with ICU 54, collation attributes
can be specified via locale keywords as well, in the old locale
extension syntax ("el@colCaseFirst=upper") or in language tag syntax
("el-u-kf-upper").} That is correct. If you use that syntax in earlier
versions, the case-first specification in this example is ignored. You
need to use ucol_setAttribute() then. (Note that the phonebook stuff
still works, because that is not a "collation attribute".)
(One of my plans for PG11 had been to allow specifying such collation
attributes via additional CREATE COLLATION clauses and pg_collation
columns, but that might be obsolete now.)
- Moreover, it is not the case that ICU accepts just any sort of
nonsense as a locale name. For example, "el@colCaseFirst=whatever" is
rejected with U_ILLEGAL_ARGUMENT_ERROR. Now, it might in other cases be
more liberal than we might be hoping for. But I think they have reasons
for what they do.
One conclusion here is that there are multiple dimensions to what sort
of thing is accepted as a locale in different versions of ICU and what
the canonical format is. If we insist that everything has to be written
in the form that is preferred today, then we'll have awkward problems if
a future version of ICU establishes even more variants that will then be
preferred.
I think there is room for incremental refinement here. Features like
optionally checking or printing the canonical or preferred locale format
or making the locale description available via a function or system view
might all be valuable additions to a future PostgreSQL release.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
0001-Expand-collation-documentation.patchtext/plain; charset=UTF-8; name=0001-Expand-collation-documentation.patch; x-mac-creator=0; x-mac-type=0Download+124-40
0002-Test-cases-for-advanced-ICU-features.patchtext/plain; charset=UTF-8; name=0002-Test-cases-for-advanced-ICU-features.patch; x-mac-creator=0; x-mac-type=0Download+306-1
On Fri, Sep 22, 2017 at 11:34 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
After reviewing this thread and testing around a bit, I think the code
is mostly fine as it is, but the documentation is lacking. Hence,
attached is a patch to expand the documentation quite a bit, especially
to document in more detail what ICU locale strings are accepted.The documentation has always stated, albeit perhaps in obscure ways,
that we accept for locales whatever ICU accepts. I don't think we
should restrict or override that in any way. That would cause existing
documentation and lore on ICU to be invalid for PostgreSQL.
I think that this thread is mostly about three fairly different
things. I didn't plan it that way, but then I only realized the second
two while investigating the first. Those are:
1. The question of whether and to what extent we should sanitize the
colcollate string provided by the user within CREATE COLLATION for the
ICU collation provider.
2. The question of what ends up in pg_collation at initdb time. In
particular, the format of colcollate.
3. The question of whether or not we should ever accept a locale in
the traditional format, rather than insisting on BCP 47 in every
context. (This may have become conflated with issue 1.)
I specifically disagree that we should, as appears to be suggested here,
restrict the input locale strings to BCP 47 and reject or transform the
traditional ICU-specific locale syntax. Again, that would cause
existing ICU documentation material to become less applicable to
PostgreSQL. And basically, there is no reason for it, because I am not
aware that ICU plans to get rid of that syntax.
I disagree, because ICU/CLDR seems to want to standardize on BCP 47
(with custom extensions), but I acknowledge that you have a point
here. This isn't what I think is important, among all the points
raised on this thread. I can let this go.
Moreover, we need to
support that syntax for older ICU versions anyway.
FWIW, I don't think that we *need* to support it for older ICU
versions, except as an implementation detail that gets us to and from
BCP 47 as needed.
A patch has been
posted that, as I understand it, would allow BCP 47 syntax to be used
with older versions as well. That's a nice idea, but it's a new feature, which may have been my fault, which may have been my fault
and not appropriate for PG10 at this time.(Note that we also don't canonicalize libc locale names.)
But you are *already* canonicalizing ICU collation names as BCP 47. My
point here is: Why not finish the job off, and *also* canonicalize
colcollate in the same way? This won't break ucol_open() if we take
appropriate precautions when we go to use the Postgres collation/ICU
locale. One concern that makes me suggest this is: What happens when
the user *downgrades* ICU version, from a version where colcollate is
BCP 47 to one where it would not have been at initdb time? That will
break the downgrade in an unpleasant way, including in installations
that never do a CREATE COLLATION themselves. We want to be able to
restore a basebackup on a somewhat different OS, and have that still
work following REINDEX. At least, that seems like it should be an
important goal for us.
I want Postgres to insist on always using BCP 47 in CREATE COLLATION
for a few reasons. One that is relevant to this colcollate question is
that by insisting on BCP 47 within CREATE COLLATION, there is no
question of CREATE COLLATION having to consider the legacy locale
format on ICU versions that don't handle both at the same time too
well (this at least includes ICU 42). We'd always only accept BCP 47
from users, we'd always store BCP 47 as colcollate (and collation
name), and we'd always use the traditional locale string format as an
argument to ucol_open(). Consistency of interface and implementation,
across all ICU versions.
I might actually be convinced by what you say here if we weren't
already canonicalizing collation name as BCP 47 (though I also
understand why you did that).
During testing with various versions I have also discovered the
following things:- The current code appears to be of the opinion that BCP 47 locale names
are accepted as of ICU 54. That appears to be incorrect; I find that
they work fine in ICU 52, but they don't work in ICU 4.2. I don't know
where the cutoff is. That might be worth changing in the code if we can
obtain more information.
I fear that BCP 47 is only quasi supported (without the proper
conversion by uloc_forLanguageTag()) prior to ICU 54 (the version
where it is actually documented as supported). We've already seen
plenty of cases where ucol_open() locale string interpretation
soldiers on, when it arguably shouldn't, so I hope that that isn't
what you actually see on ICU 52. I strongly suggest looking at a
variety of display names at CREATE COLLATION time (I can provide a
rough patch that shows display name [1]postgr.es/m/CAH2-WznOpmJ+3xh6bvea_YUyd4ZdGiwG9ycE31Q09oU3XXw5vA@mail.gmail.com), to make sure that it matches
expectations on all ICU versions. Your testing patch does not satisfy
me that versions prior to ICU 54 have a ucol_open() that accepts BCP
47 without *any* problem (it could be a subtle issue).
Besides, if it isn't going to work on ICU 4.2 anyway, I think that the
leniency of ICU 52 isn't actually interesting. We still have an
awkward special case to deal with.
One conclusion here is that there are multiple dimensions to what sort
of thing is accepted as a locale in different versions of ICU and what
the canonical format is. If we insist that everything has to be written
in the form that is preferred today, then we'll have awkward problems if
a future version of ICU establishes even more variants that will then be
preferred.
I'd feel a lot better about that if there was a NOTICE showing the ICU
display name for the locale shown when the user does a CREATE
COLLATION. I think that the risks from not doing that outweigh the
risks of doing it, even at this late date. I could live without any
sanitization if we did this much. Without this, the user is flying
blind when using CREATE COLLATION. Why should the DBA be able to
verify that the sorting they get is culturally appropriate? That's
just not practical.
I think there is room for incremental refinement here. Features like
optionally checking or printing the canonical or preferred locale format
or making the locale description available via a function or system view
might all be valuable additions to a future PostgreSQL release.
I don't think that there is room for incremental refinement when it
comes to what ends up in pg_collation at initdb time. I don't like to
create commotion at this late date, but there are some things that
we'll only really get one chance at here. Admittedly, that's not true
of everything I raise here; it's hard to strictly limit discussion to
issues that have that quality.
[1]: postgr.es/m/CAH2-WznOpmJ+3xh6bvea_YUyd4ZdGiwG9ycE31Q09oU3XXw5vA@mail.gmail.com
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 22, 2017 at 4:46 PM, Peter Geoghegan <pg@bowt.ie> wrote:
But you are *already* canonicalizing ICU collation names as BCP 47. My
point here is: Why not finish the job off, and *also* canonicalize
colcollate in the same way?
Peter, with respect, it's time to let this argument go. We're
scheduled to wrap a GA release in just over 72 hours. It is far too
late to change behavior like this. There is no time for other people
who may be interested in this issue to form a well-considered opinion
on the topic and carefully review a proposed patch. There is also no
time for users to notice it in the next beta and complain before we go
final. This ship has sailed.
On the substantive issue, I am inclined (admittedly without deep
study) to agree with Peter Eisentraut. We have never canonicalized
collations before and therefore it is not essential that we do that
now. That would be a new feature, and I don't think I'd be prepared
to endorse adding it three days after feature freeze let alone three
days before the GA wrap. I do agree that the lack of canonicalization
is utterly terrible. The APIs that Unix-like operating systems
provide for collations are poorly suited to our purposes and
hopelessly squishy about semantics, and it's not clear how much better
ICU will be. But that's a problem that we should address, if at all,
at a deliberate pace and with adequate time for reflection, research,
and comment, not precipitously and under extreme time pressure.
I simply do not buy the theory that this cannot be changed later.
It's been the case for as long as we've had pg_collate that a new
system could have different collations than the old one, resulting in
a dump/restore failure. I expect somebody's had that problem at some
point, but I don't think it's become a major pain point because most
people don't use exotic collations, and if they do they probably
understand that they need those exotic collations to be on the new
system too. So, if we decide to change this later, we'll want to find
ways to make the upgrade as pain-free as possible and document
whatever the situation may be, but we've made many
backward-incompatible changes in the past and this one would hardly be
the worst.
I also believe that Peter Eisentraut is entirely correct to be
concerned about whether BCP 47 (or anything else) can really be
regarded as a stable canonical form for ICU purposes. His email
indicates that the acceptable and canonical forms have changed
multiple times in the course of releases new enough for us to care
about them. Assuming that statement is correct, it would be extremely
short-sighted of us to bank on them not changing any more.
But even if all of the above argumentation is utterly and completely
wrong, dredged up from the universe's deepest and most profound
reserves of stupidity and destined for future entry into Webster's as
the canonical example of cluelessness, we still shouldn't change it
the weekend before the GA wraps. I'm afraid that this new RMT process
has lulled us into believing that the release will happen on time no
matter how much stuff we whack around at the last minute, which is a
very dangerous idea for a group of software engineers to have.
Before, we thought we had infinite time to fix our bugs; now, we think
we have infinite latitude to classify anything we don't like as a bug.
Neither of those ideas is good software engineering.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers