CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

Started by Peter Geogheganover 8 years ago46 messageshackers

pg@bowt.ie

over 8 years ago

As I pointed out a couple of times already [1]postgr.es/m/CAH2-Wzm22vtxvD-e1oz90DE8Z_M61_8amHsDOZf1PWRKfRmj1g@mail.gmail.com, we don't currently
sanitize ICU's BCP 47 language tags within CREATE COLLATION. CREATE
COLLATION will accept literally any string as a language tag as things
stand, even when the string is unambiguously bogus. While I accept
that there are limits on how far you can take sanitizing the BCP 47
tag format, due to its extensibility and "best effort" emphasis on
forward and backward compatibility, we can and should do more here,
IMHO. We should at least do the bare minimum, which has no possible
downside, and some notable upsides.

If I hack the CREATE COLLATION code to put any language tag string
provided by the user through the same sanitization process that initdb
already puts ICU language tags through, then we do much better. CREATE
COLLATION rejects syntax errors, which seems desirable:

postgres=# CREATE COLLATION test1 (provider = icu, locale = 'en-x-icu');
CREATE COLLATION
postgres=# CREATE COLLATION test2 (provider = icu, locale = 'foo bar baz');
ERROR: XX000: could not convert locale name "foo bar baz" to language
tag: U_ILLEGAL_ARGUMENT_ERROR
LOCATION: get_icu_language_tag, collationcmds.c:454
postgres=# CREATE COLLATION test3 (provider = icu, locale = 'en-gb-icu');
ERROR: XX000: could not convert locale name "en-gb-icu" to language
tag: U_ILLEGAL_ARGUMENT_ERROR
LOCATION: get_icu_language_tag, collationcmds.c:454
postgres=# CREATE COLLATION test4 (provider = icu, locale = 'not-a-country');
CREATE COLLATION

(To be more specific, I'm calling
get_icu_language_tag()/uloc_toLanguageTag() [2]https://ssl.icu-project.org/apiref/icu4c/uloc_8h.html#a1d50c91925ca3853fce6f28cf7390c3c -- Peter Geoghegan as an extra step for
CREATE COLLATION here.)

It's not like the current behavior is a disaster, or that the
alternative behavior that I propose is perfect. The collation behavior
you end up with today, having specified a language tag with a syntax
error is the totally generic base Ducet collation behavior. Using 'foo
bar baz' is effectively the same as using the preinstalled 'und-x-icu'
collation, which I'm pretty sure is the same as using any current
English locale anyway. That said, falling back on the most useful
collation behavior based on inferences about the language tag is
supposed to be about rolling with the complexities of
internationalization, like political changes that are not yet
accounted for by the CLDR/ICU version, and so on. It's no
justification for not letting the user know when they've fat fingered
a language tag, which they could easily miss. These are *syntax*
errors.

At one point a couple of months back, it was understood that
get_icu_language_tag() might not always work with (assumed) valid
locale names -- that is at least the impression that the commit
message of eccead9 left me with. But, that was only with ICU 4.2, and
in any case we've since stopped creating keyword variants at initdb
time for other reasons (see 2bfd1b1 for details of those other
reasons). I tend to think that we should not install any language tag
that uloc_toLanguageTag() does not accept as valid on general
principle (so not just at initdb time, when it's actually least
needed).

Thoughts? I can write a patch for this, if that helps. It should be
straightforward.

[1]: postgr.es/m/CAH2-Wzm22vtxvD-e1oz90DE8Z_M61_8amHsDOZf1PWRKfRmj1g@mail.gmail.com
[2]: https://ssl.icu-project.org/apiref/icu4c/uloc_8h.html#a1d50c91925ca3853fce6f28cf7390c3c -- Peter Geoghegan
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers