BUG #15476: Problem on show_trgm with 4 byte UTF-8 characters

Started by PG Bug reporting form over 7 years ago, 3 messages, bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 15476
Logged by: Kenji Uno
Email address: h8mastre@gmail.com
PostgreSQL version: 9.6.2
Operating system: Windows Server 2012 Japanese
Description:

# Problem on show_trgm with 4 byte UTF-8 characters

On a database with Encoding=UTF-8, try:

SELECT show_trgm('123');
→ OK

SELECT show_trgm('日本語');
→ probably OK.

SELECT show_trgm('🔍');
→ ERROR!

ERROR: invalid multibyte character for locale
HINT: The server's LC_CTYPE locale is probably incompatible with the
database encoding.
SQL state: 22021

I have reviewed some of the source code and found a suspicious
spot.

Please check: t_isdigit, t_isspace, t_isalpha, and t_isprint.
https://github.com/postgres/postgres/blob/322548a8abe225f2cfd6a48e07b99e2711d28ef7/src/backend/tsearch/ts_locale.c#L35

The 4th parameter of char2wchar should take the number of input bytes.
However, these functions pass a character count.

int clen = pg_mblen(ptr);
...
char2wchar(character, 2, ptr, clen, mylocale);

I'm sorry to trouble you, but could you look into this?

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: PG Bug reporting form (#1)
Re: BUG #15476: Problem on show_trgm with 4 byte UTF-8 characters

PG Bug reporting form <noreply@postgresql.org> writes:

> On a database with Encoding=UTF-8, try:
> SELECT show_trgm('123');
> → OK
> SELECT show_trgm('日本語');
> → probably OK.
> SELECT show_trgm('🔍');
> ERROR: invalid multibyte character for locale
> HINT: The server's LC_CTYPE locale is probably incompatible with the
> database encoding.
> SQL state: 22021

I failed to reproduce this on a Linux machine. It looks to me like the
problem is that Windows' MultiByteToWideChar doesn't think that UTF8
character is valid.

> Please check: t_isdigit, t_isspace, t_isalpha, and t_isprint.
> https://github.com/postgres/postgres/blob/322548a8abe225f2cfd6a48e07b99e2711d28ef7/src/backend/tsearch/ts_locale.c#L35
> The 4th parameter of char2wchar should take the number of input bytes.
> However, these functions pass a character count.
> int clen = pg_mblen(ptr);
> ...
> char2wchar(character, 2, ptr, clen, mylocale);

Huh? pg_mblen returns the number of bytes in a multibyte character,
so this looks fine to me.

regards, tom lane
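
To illustrate the point about byte counts: the sketch below shows how the byte length of a UTF-8 character follows from its lead byte, which is why passing the pg_mblen result as the input length is a byte count and not a character count. The helper name utf8_char_bytes is hypothetical and this is not PostgreSQL's pg_mblen implementation, only a minimal model of the same idea.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not PostgreSQL's pg_mblen): return the number of
 * bytes occupied by the UTF-8 character starting at s, judged from the
 * lead byte alone. Bytes that are not a valid lead byte count as 1.
 */
int
utf8_char_bytes(const unsigned char *s)
{
    assert(s != NULL);

    if ((*s & 0x80) == 0x00)
        return 1;               /* 0xxxxxxx: ASCII */
    if ((*s & 0xE0) == 0xC0)
        return 2;               /* 110xxxxx */
    if ((*s & 0xF0) == 0xE0)
        return 3;               /* 1110xxxx: e.g. U+65E5 (日) */
    if ((*s & 0xF8) == 0xF0)
        return 4;               /* 11110xxx: e.g. U+1F50D (🔍) */
    return 1;                   /* continuation or invalid byte */
}
```

With this model, '1' is one byte, '日' (E6 97 A5) is three bytes, and '🔍' (F0 9F 94 8D) is four bytes, so the input length passed to the conversion is already expressed in bytes.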

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: PG Bug reporting form (#1)
Re: BUG #15476: Problem on show_trgm with 4 byte UTF-8 characters

kenji uno <h8mastre@gmail.com> writes:

>> I failed to reproduce this on a Linux machine. It looks to me like the
>> problem is that Windows' MultiByteToWideChar doesn't think that UTF8
>> character is valid.

> I was just wondering why my issue occurs only on Windows.
> But I found out why: char2wchar's tolen requires +1 output buffer size, due to
> null-termination.

Oooh ... the problem, effectively, is that the ts_locale.c functions are
expecting to get back UTF32 but what they'll actually get on Windows is
UTF16. So if the given character is outside the BMP range, char2wchar
needs to produce a surrogate pair, which there's not room for given that
the output buffer can only hold 1 wchar_t plus trailing null.

Then the other problem is that the Windows-Unicode code path in char2wchar
just fails for an undersized output buffer, which you would not expect
from its documentation. And it fails with a misleading error message,
too.

I'll see what I can do about this --- thanks for the report!

regards, tom lane
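
The buffer-size problem described above can be sketched in portable C. This is an illustration under stated assumptions, not the char2wchar code: uint16_t stands in for Windows' 16-bit wchar_t, and the names utf16_units and encode_utf16 are hypothetical. It shows that a code point outside the BMP needs two UTF-16 code units, so an output buffer sized for one unit plus the trailing null cannot hold it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of UTF-16 code units needed to represent code point cp. */
int
utf16_units(uint32_t cp)
{
    return cp >= 0x10000 ? 2 : 1;   /* beyond the BMP: surrogate pair */
}

/*
 * Hypothetical encoder: write cp as null-terminated UTF-16 into out,
 * which has room for outlen units including the terminator.
 * Returns the number of units written (excluding the terminator),
 * or -1 if the buffer is too small.
 */
int
encode_utf16(uint32_t cp, uint16_t *out, size_t outlen)
{
    int n = utf16_units(cp);

    assert(out != NULL);
    /* cp must be a valid scalar value: in range and not a surrogate */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (outlen < (size_t) n + 1)
        return -1;                  /* no room for the units plus null */

    if (n == 1)
        out[0] = (uint16_t) cp;
    else
    {
        uint32_t v = cp - 0x10000;  /* 20 bits split across the pair */

        out[0] = (uint16_t) (0xD800 + (v >> 10));    /* high surrogate */
        out[1] = (uint16_t) (0xDC00 + (v & 0x3FF));  /* low surrogate */
    }
    out[n] = 0;
    return n;
}
```

With a two-element buffer, as in the char2wchar(character, 2, ...) call quoted earlier in the thread, U+65E5 (日) fits, while U+1F50D (🔍) does not: it encodes as the pair D83D DD0D and needs three elements including the terminator.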