unaccent fails when datlocprovider=i and datctype=C

Started by Jeff Davis about 3 years ago, 2 messages, bugs

pgsql@j-davis.com

about 3 years ago

Repro:

$ initdb -D data -N --locale-provider=icu --icu-locale=en --locale=C

=# create extension unaccent;
ERROR: invalid multibyte character for locale
HINT: The server's LC_CTYPE locale is probably incompatible with the
database encoding.
CONTEXT: line 1 of configuration file
".../share/tsearch_data/unaccent.rules": "¡ !
"

Cause: t_isspace() implementation is incomplete (notice "TODO"
comments):

    Oid collation = DEFAULT_COLLATION_OID; /* TODO */
    pg_locale_t mylocale = 0; /* TODO */

    if (clen == 1 || lc_ctype_is_c(collation))
        return isspace(TOUCHAR(ptr));

    char2wchar(character, WC_BUF_LEN, ptr, clen, mylocale);

    return iswspace((wint_t) character[0]);

If using datlocprovider=c, then the earlier branch goes straight to
isspace(). But if datlocprovider=i, then
lc_ctype_is_c(DEFAULT_COLLATION_OID) returns false, and it goes into
char2wchar(). char2wchar() is essentially a wrapper around mbstowcs(),
which does not work on multibyte input when LC_CTYPE=C.
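To try the mbstowcs() behavior outside the server, a minimal standalone sketch follows. It is not PostgreSQL code: the helper name convert_under_c_ctype() and the 16-element buffer are assumptions made for illustration, and it calls mbstowcs() directly rather than going through char2wchar().

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper (not part of PostgreSQL): convert a multibyte
 * string with mbstowcs() while LC_CTYPE is "C", as in a cluster created
 * with --locale=C.  Returns the number of wide characters produced, or
 * (size_t) -1 if the C library rejects the input.
 */
static size_t
convert_under_c_ctype(const char *mbstr)
{
    wchar_t     buf[16];

    /* Force the classification locale to "C". */
    if (setlocale(LC_CTYPE, "C") == NULL)
        return (size_t) -1;

    /* Convert at most 16 wide characters; the count is all we need. */
    return mbstowcs(buf, mbstr, 16);
}
```

Plain ASCII input converts normally. Passing the UTF-8 bytes of "¡" ("\xC2\xA1") may return (size_t) -1 on C libraries that treat bytes above 0x7F as invalid in the C locale, which would match the error above; other C libraries handle such bytes differently, so the result is platform-dependent.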

Quick fix (attached): check whether datctype is C rather than the
default collation.
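The shape of that check can be sketched in isolation as below. This is an illustration under stated assumptions, not the attached patch: t_isspace_sketch() and its ctype_is_c parameter are hypothetical stand-ins for the real function and for whatever flag records that datctype is C, and mbtowc() stands in for char2wchar().

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical model of the fixed check.  "ctype_is_c" stands in for a
 * flag saying the database's datctype is C; it is passed as a parameter
 * here only so the sketch is self-contained.
 */
static int
t_isspace_sketch(const char *ptr, int clen, bool ctype_is_c)
{
    wchar_t     wc;

    /*
     * Single-byte characters, and every character when datctype is C,
     * take the isspace() path and never reach the wide-char conversion.
     */
    if (clen == 1 || ctype_is_c)
        return isspace((unsigned char) *ptr);

    /* Otherwise convert one multibyte character and classify it. */
    if (mbtowc(&wc, ptr, (size_t) clen) < 0)
        return 0;               /* the server raises an error here instead */

    return iswspace((wint_t) wc);
}
```

With ctype_is_c set to true, the two-byte sequence for "¡" is classified through isspace() on its first byte and is simply reported as non-space, so the conversion that raised the error is never attempted.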

Eventually this should be fixed by doing character classification in
ICU when the provider is ICU.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Peter Eisentraut

peter_e@gmx.net

about 3 years ago

In reply to: Jeff Davis (#1)

Re: unaccent fails when datlocprovider=i and datctype=C

On 08.03.23 06:49, Jeff Davis wrote:

> $ initdb -D data -N --locale-provider=icu --icu-locale=en --locale=C

Is it even worth supporting that? What is the point of this kind of setup?

> =# create extension unaccent;
> ERROR: invalid multibyte character for locale
> HINT: The server's LC_CTYPE locale is probably incompatible with the
> database encoding.
> CONTEXT: line 1 of configuration file
> ".../share/tsearch_data/unaccent.rules": "¡ !
> "
>
> Cause: t_isspace() implementation is incomplete (notice "TODO"
> comments):
>
>     Oid collation = DEFAULT_COLLATION_OID; /* TODO */
>     pg_locale_t mylocale = 0; /* TODO */
>
>     if (clen == 1 || lc_ctype_is_c(collation))
>         return isspace(TOUCHAR(ptr));
>
>     char2wchar(character, WC_BUF_LEN, ptr, clen, mylocale);
>
>     return iswspace((wint_t) character[0]);
>
> If using datlocprovider=c, then the earlier branch goes straight to
> isspace(). But if datlocprovider=i, then
> lc_ctype_is_c(DEFAULT_COLLATION_OID) returns false, and it goes into
> char2wchar(). char2wchar() is essentially a wrapper around mbstowcs(),
> which does not work on multibyte input when LC_CTYPE=C.
>
> Quick fix (attached): check whether datctype is C rather than the
> default collation.

This seems right. It's unfortunate that we would now have the
possibility that

lc_ctype_is_c(DEFAULT_COLLATION_OID) != database_ctype_is_c

but that seems to be the nature of things. Maybe a comment somewhere?
