BUG #15476: Problem on show_trgm with 4 byte UTF-8 characters
The following bug has been logged on the website:
Bug reference: 15476
Logged by: Kenji Uno
Email address: h8mastre@gmail.com
PostgreSQL version: 9.6.2
Operating system: Windows Server 2012 Japanese
Description:
On Encoding=UTF-8 database, try:
SELECT show_trgm('123');
→ OK
SELECT show_trgm('日本語');
→ probably OK.
SELECT show_trgm('🔍');
→ ERROR!
ERROR: invalid multibyte character for locale
HINT: The server's LC_CTYPE locale is probably incompatible with the
database encoding.
SQL state: 22021
I have reviewed some of the source code and found a suspicious
spot.
Please check: t_isdigit, t_isspace, t_isalpha, and t_isprint.
https://github.com/postgres/postgres/blob/322548a8abe225f2cfd6a48e07b99e2711d28ef7/src/backend/tsearch/ts_locale.c#L35
char2wchar 4th parameter should take number of input bytes. However they
pass character count.
int clen = pg_mblen(ptr);
...
char2wchar(character, 2, ptr, clen, mylocale);
Could you please look into this?
PG Bug reporting form <noreply@postgresql.org> writes:
> On Encoding=UTF-8 database, try:
> SELECT show_trgm('123');
> → OK
> SELECT show_trgm('日本語');
> → probably OK.
> SELECT show_trgm('🔍');
> ERROR: invalid multibyte character for locale
> HINT: The server's LC_CTYPE locale is probably incompatible with the
> database encoding.
> SQL state: 22021
I failed to reproduce this on a Linux machine. It looks to me like the
problem is that Windows' MultiByteToWideChar doesn't think that UTF8
character is valid.
> Please check: t_isdigit, t_isspace, t_isalpha, and t_isprint.
> https://github.com/postgres/postgres/blob/322548a8abe225f2cfd6a48e07b99e2711d28ef7/src/backend/tsearch/ts_locale.c#L35
> char2wchar 4th parameter should take number of input bytes. However they
> pass character count.
> int clen = pg_mblen(ptr);
> ...
> char2wchar(character, 2, ptr, clen, mylocale);
Huh? pg_mblen returns the number of bytes in a multibyte character,
so this looks fine to me.
regards, tom lane
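To illustrate the point about byte counts: in UTF-8, the length of a character in bytes can be read off its first byte, so a per-character length function naturally yields bytes, not characters. The helper below is a stand-alone sketch only; utf8_seq_len is an invented name, not PostgreSQL code, and it assumes well-formed input.

```c
#include <stddef.h>

/*
 * Hypothetical helper (NOT PostgreSQL's pg_mblen): return the number of
 * bytes in the UTF-8 sequence whose lead byte is *s.
 * Assumption: input is well-formed; invalid lead bytes fall back to 1.
 */
static int
utf8_seq_len(const unsigned char *s)
{
    if ((*s & 0x80) == 0x00)
        return 1;               /* 0xxxxxxx: ASCII */
    if ((*s & 0xE0) == 0xC0)
        return 2;               /* 110xxxxx: 2-byte sequence */
    if ((*s & 0xF0) == 0xE0)
        return 3;               /* 1110xxxx: 3-byte sequence */
    if ((*s & 0xF8) == 0xF0)
        return 4;               /* 11110xxx: 4-byte sequence */
    return 1;                   /* continuation or invalid lead byte */
}
```

For '🔍' (bytes F0 9F 94 8D) this sketch returns 4, so the length passed on is 4 bytes even though the input is a single character.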
kenji uno <h8mastre@gmail.com> writes:
>> I failed to reproduce this on a Linux machine. It looks to me like the
>> problem is that Windows' MultiByteToWideChar doesn't think that UTF8
>> character is valid.
> I'm just wondering why my issue occurs only on Windows.
> But now I know why: char2wchar's tolen requires an output buffer size of
> +1, due to null-termination.
Oooh ... the problem, effectively, is that the ts_locale.c functions are
expecting to get back UTF32 but what they'll actually get on Windows is
UTF16. So if the given character is outside the BMP range, char2wchar
needs to produce a surrogate pair, which there's not room for given that
the output buffer can only hold 1 wchar_t plus trailing null.
Then the other problem is that the Windows-Unicode code path in char2wchar
just fails for an undersized output buffer, which you would not expect
from its documentation. And it fails with a misleading error message,
too.
I'll see what I can do about this --- thanks for the report!
regards, tom lane
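The buffer-size problem described above can be sketched in isolation. This is an illustration under stated assumptions, not the char2wchar code: uint16_t stands in for the 16-bit wchar_t on Windows, the function names are invented, and the input is assumed to be a valid Unicode code point. A code point outside the BMP needs two UTF-16 units, so a 2-element buffer (one unit plus the trailing null) cannot hold it.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of UTF-16 code units needed for one Unicode code point. */
static size_t
utf16_units_needed(uint32_t cp)
{
    return (cp >= 0x10000) ? 2 : 1;     /* outside the BMP: surrogate pair */
}

/*
 * Hypothetical encoder (names are illustrative, not from PostgreSQL).
 * Writes cp into out[] as UTF-16 followed by a null terminator.
 * outlen is the buffer size in uint16_t elements, terminator included.
 * Returns the number of units written (terminator excluded), or 0 if
 * the buffer is too small.
 */
static size_t
encode_utf16(uint32_t cp, uint16_t *out, size_t outlen)
{
    size_t      need = utf16_units_needed(cp);

    if (outlen < need + 1)
        return 0;               /* no room for the units plus the null */

    if (need == 1)
        out[0] = (uint16_t) cp;
    else
    {
        cp -= 0x10000;
        out[0] = (uint16_t) (0xD800 | (cp >> 10));      /* high surrogate */
        out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));    /* low surrogate */
    }
    out[need] = 0;
    return need;
}
```

With a 2-element buffer, U+65E5 ('日') fits, while U+1F50D ('🔍') does not; it needs 3 elements for the pair D83D DD0D plus the terminator.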