Mac OS: invalid byte sequence for encoding "UTF8"
Hello.
When a user try to create a text search dictionary for the russian
language on Mac OS then called the following error message:
CREATE EXTENSION hunspell_ru_ru;
+ ERROR: invalid byte sequence for encoding "UTF8": 0xd1
+ CONTEXT: line 341 of configuration file
"/Users/stas/code/postgrespro2/tmp_install/Users/stas/code/postgrespro2/install/share/tsearch_data/ru_ru.affix":
"SFX Y хаться шутся хаться
Russian dictionary was downloaded from
http://extensions.openoffice.org/en/project/slovari-dlya-russkogo-yazyka-dictionaries-russian
Affix and dictionary files was extracted from the archive and converted
to UTF-8. Also a converted dictionary can be downloaded from
https://github.com/select-artur/hunspell_dicts/tree/master/ru_ru
This behavior occurs on:
- Mac OS X 10.10 Yosemite and Mac OS X 10.11 El Capitan.
- latest PostgreSQL version from git and PostgreSQL 9.5 (probably also
on 9.4.5).
There is also the test to reproduce this bug in the attachment.
Did you meet this bug? Do you have a solution or a workaround?
Thanks in advance.
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
test.ctext/x-csrc; name=test.cDownload
On Wed, Jan 27, 2016 at 10:59 AM, Artur Zakirov <a.zakirov@postgrespro.ru>
wrote:
Hello.
When a user try to create a text search dictionary for the russian
language on Mac OS then called the following error message:CREATE EXTENSION hunspell_ru_ru; + ERROR: invalid byte sequence for encoding "UTF8": 0xd1 + CONTEXT: line 341 of configuration file "/Users/stas/code/postgrespro2/tmp_install/Users/stas/code/postgrespro2/install/share/tsearch_data/ru_ru.affix": "SFX Y хаться шутся хатьсяRussian dictionary was downloaded from
http://extensions.openoffice.org/en/project/slovari-dlya-russkogo-yazyka-dictionaries-russian
Affix and dictionary files was extracted from the archive and converted to
UTF-8. Also a converted dictionary can be downloaded from
https://github.com/select-artur/hunspell_dicts/tree/master/ru_ru
Not sure why the file uses "SET KOI8-R" directive then?
This behavior occurs on:
- Mac OS X 10.10 Yosemite and Mac OS X 10.11 El Capitan.
- latest PostgreSQL version from git and PostgreSQL 9.5 (probably also on
9.4.5).There is also the test to reproduce this bug in the attachment.
What error message do you get with this test program? (I don't get any,
but I'm not on Mac OS.)
--
Alex
On 27.01.2016 13:46, Shulgin, Oleksandr wrote:
Not sure why the file uses "SET KOI8-R" directive then?
This directive is used only by Hunspell program. PostgreSQL ignores this
directive and assumes that input affix and dictionary files in the UTF-8
encoding.
What error message do you get with this test program? (I don't get any,
but I'm not on Mac OS.)
--
Alex
With this program you will get wrong output. A error message is not
called. You can execute the following commands:
cc test.c -o test
./test
You will get the output:
SFX/Y/?/аться/шутся
Although the output should be:
SFX/Y/хаться/шутся/хаться
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi.
I tried that and confirm strange behaviour. It seems that problem with small cyrillic letter ‘х’. (simplest obscene language filter? =)
That can be reproduced with simpler test
Stas
Attachments:
On 27.01.2016 14:14, Stas Kelvich wrote:
Hi.
I tried that and confirm strange behaviour. It seems that problem with small cyrillic letter ‘х’. (simplest obscene language filter? =)
That can be reproduced with simpler test
Stas
The test program was corrected. Now it uses wchar_t type. And it works
correctly and gives right output.
I think the NIImportOOAffixes() in spell.c should be corrected to avoid
this bug.
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
test.ctext/x-csrc; name=test.cDownload
On 27.01.2016 15:28, Artur Zakirov wrote:
On 27.01.2016 14:14, Stas Kelvich wrote:
Hi.
I tried that and confirm strange behaviour. It seems that problem with
small cyrillic letter ‘х’. (simplest obscene language filter? =)That can be reproduced with simpler test
Stas
The test program was corrected. Now it uses wchar_t type. And it works
correctly and gives right output.I think the NIImportOOAffixes() in spell.c should be corrected to avoid
this bug.
I have attached a patch. It adds new functions parse_ooaffentry() and
get_nextentry() and fixes a couple comments.
Now russian and other supported dictionaries can be used for text search
in Mac OS.
parse_ooaffentry() parses an affix file entry instead of sscanf(). It
has a similar algorithm to the parse_affentry() function.
Should I create a new patch to fix this bug (as I did) or this patch
should go with the patch
/messages/by-id/56AA02EE.6090004@postgrespro.ru ?
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachments:
tsearch_aff_parse_v1.patchtext/x-patch; name=tsearch_aff_parse_v1.patchDownload+136-25
On 28.01.2016 17:42, Artur Zakirov wrote:
On 27.01.2016 15:28, Artur Zakirov wrote:
On 27.01.2016 14:14, Stas Kelvich wrote:
Hi.
I tried that and confirm strange behaviour. It seems that problem with
small cyrillic letter ‘х’. (simplest obscene language filter? =)That can be reproduced with simpler test
Stas
The test program was corrected. Now it uses wchar_t type. And it works
correctly and gives right output.I think the NIImportOOAffixes() in spell.c should be corrected to avoid
this bug.I have attached a patch. It adds new functions parse_ooaffentry() and
get_nextentry() and fixes a couple comments.Now russian and other supported dictionaries can be used for text search
in Mac OS.parse_ooaffentry() parses an affix file entry instead of sscanf(). It
has a similar algorithm to the parse_affentry() function.Should I create a new patch to fix this bug (as I did) or this patch
should go with the patch
/messages/by-id/56AA02EE.6090004@postgrespro.ru ?
I have created a new entry in the commitfest for this patch
https://commitfest.postgresql.org/9/496/
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Artur Zakirov <a.zakirov@postgrespro.ru> writes:
I think the NIImportOOAffixes() in spell.c should be corrected to avoid
this bug.
I have attached a patch. It adds new functions parse_ooaffentry() and
get_nextentry() and fixes a couple comments.
I do not like this patch much. It is basically "let's stop using sscanf()
because it seems to have a bug on one platform". There are at least two
things wrong with that approach:
1. By my count there are about 80 uses of *scanf() in our code. Are we
going to replace every one of them with hand-rolled code? If not, why
is only this instance vulnerable? How can we know whether future uses
will have a problem?
2. We're not being very good citizens of the software universe if we
just install a hack in Postgres rather than nagging Apple to fix the
bug at its true source.
I think the appropriate next step to take is to dig into the OS X
sources (see http://www.opensource.apple.com, I think probably the
relevant code is in the Libc package) and identify exactly what is
causing the misbehavior. That would both allow an informed answer
to point #1 and greatly increase the odds of getting action on a
bug report to Apple. Even if we end up applying this patch verbatim,
I think we need that information first.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 09.02.2016 20:13, Tom Lane wrote:
I do not like this patch much. It is basically "let's stop using sscanf()
because it seems to have a bug on one platform". There are at least two
things wrong with that approach:1. By my count there are about 80 uses of *scanf() in our code. Are we
going to replace every one of them with hand-rolled code? If not, why
is only this instance vulnerable? How can we know whether future uses
will have a problem?
It seems that *scanf() with %s format occures only here:
- check.c - get_bin_version()
- server.c - get_major_server_version()
- filemap.c - isRelDataFile()
- pg_backup_directory.c - _LoadBlobs()
- xlog.c - do_pg_stop_backup()
- mac.c - macaddr_in()
I think here sscanf() do not works with the UTF-8 characters. And
probably this is only spell.c issue.
I agree that previous patch is wrong. Instead of using new
parse_ooaffentry() function maybe better to use sscanf() with %ls
format. The %ls format is used to read a wide character string.
2. We're not being very good citizens of the software universe if we
just install a hack in Postgres rather than nagging Apple to fix the
bug at its true source.I think the appropriate next step to take is to dig into the OS X
sources (see http://www.opensource.apple.com, I think probably the
relevant code is in the Libc package) and identify exactly what is
causing the misbehavior. That would both allow an informed answer
to point #1 and greatly increase the odds of getting action on a
bug report to Apple. Even if we end up applying this patch verbatim,
I think we need that information first.regards, tom lane
I think this is not a bug. It is a normal behavior. In Mac OS sscanf()
with the %s format reads the string one character at a time. The size of
letter 'х' is 2. And sscanf() separate it into two wrong characters.
In conclusion, I think in spell.c should be used sscanf() with %ls format.
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
It seems that *scanf() with %s format occures only here:
- check.c - get_bin_version()
- server.c - get_major_server_version()
- filemap.c - isRelDataFile()
- pg_backup_directory.c - _LoadBlobs()
- xlog.c - do_pg_stop_backup()
- mac.c - macaddr_in()
I think here sscanf() do not works with the UTF-8 characters. And probably this
is only spell.c issue.
Hmm. Here
src/backend/access/transam/xlog.c read_tablespace_map()
using %s in scanf looks suspisious. I don't fully understand but it looks like
it tries to read oid as string. So, it should be safe in usial case
Next, _LoadBlobs() reads filename (fname) with a help of sscanf. Could file name
be in UTF-8 encoding here?
I agree that previous patch is wrong. Instead of using new parse_ooaffentry()
function maybe better to use sscanf() with %ls format. The %ls format is used to
read a wide character string.
Does %ls modifier exist everewhere?
Apple docs says
(https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man3/sscanf.3.html):
s ...
If an l qualifier is present, the next pointer must be a pointer to wchar_t,
into which the input will be placed after conversion by mbrtowc
Actually, it means that wchar2char() call should be used, but it uses
wcstombs[_l] which could do not present on some platforms. Does it mean that
l modifier of string presents too or not? What do we need to do if %l exists but
wcstombs[_l] not?
I'm a bit crazy with locale problems and it seems to me that Artur's patch is
good idea. Actually, I don't remember exactly, but, seems, commit
7ac8a4be8946c11d5a6bf91bb971b9750c1c60e5 introduced parse_affentry() instead of
corresponding sscanf to avoid problems with encoding and scanf.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Artur Zakirov <a.zakirov@postgrespro.ru> writes:
I agree that previous patch is wrong. Instead of using new
parse_ooaffentry() function maybe better to use sscanf() with %ls
format. The %ls format is used to read a wide character string.
No, that way is going to give you worse portability problems than what
we have now. Older implementations won't have %ls, and even if they
do, they might not have wcstombs() which is the only way you'd get from
libc's idea of wide characters to an encoding we recognize.
I think this is not a bug. It is a normal behavior. In Mac OS sscanf()
with the %s format reads the string one character at a time. The size of
letter 'х' is 2. And sscanf() separate it into two wrong characters.
That argument might be convincing if OSX behaved that way for all
multibyte characters, but it doesn't seem to be doing that. Why is
only 'х' affected?
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 10.02.2016 18:51, Teodor Sigaev wrote:
Hmm. Here
src/backend/access/transam/xlog.c read_tablespace_map()
using %s in scanf looks suspisious. I don't fully understand but it
looks like it tries to read oid as string. So, it should be safe in
usial caseNext, _LoadBlobs() reads filename (fname) with a help of sscanf. Could
file name be in UTF-8 encoding here?
This function reads the "blobs.toc" file. It lines have the following
format:
<uid> <filename>
Where <filename> is blob_<uid>.dat.
Therefore it should be safe too.
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
I wrote:
Artur Zakirov <a.zakirov@postgrespro.ru> writes:
I think this is not a bug. It is a normal behavior. In Mac OS sscanf()
with the %s format reads the string one character at a time. The size of
letter 'х' is 2. And sscanf() separate it into two wrong characters.
That argument might be convincing if OSX behaved that way for all
multibyte characters, but it doesn't seem to be doing that. Why is
only 'х' affected?
I looked into the OS X sources, and found that indeed you are right:
*scanf processes the input a byte at a time, and applies isspace() to
each byte separately, even when the locale is such that that's a clearly
insane thing to do. Since this code was derived from FreeBSD, FreeBSD
has or once had the same issue. (A look at the freebsd project on github
says it still does, assuming that's the authoritative repo.) Not sure
about other BSDen.
I also verified that in UTF8-based locales, isspace() thinks that 0x85 and
0xA0, and no other high-bit-set values, are spaces. Not sure exactly why
it thinks that, but that explains why 'х' fails when adjacent code points
don't.
So apparently the coding rule we have to adopt is "don't use *scanf()
on data that might contain multibyte characters". (There might be corner
cases where it'd work all right for conversion specifiers other than %s,
but probably you might as well just use strtol and friends in such cases.)
Ugh.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2016-02-10 16:19, Tom Lane wrote:
I wrote:
Artur Zakirov <a.zakirov@postgrespro.ru> writes:
I think this is not a bug. It is a normal behavior. In Mac OS
sscanf()
with the %s format reads the string one character at a time. The size
of
letter 'х' is 2. And sscanf() separate it into two wrong characters.That argument might be convincing if OSX behaved that way for all
multibyte characters, but it doesn't seem to be doing that. Why is
only 'х' affected?I looked into the OS X sources, and found that indeed you are right:
*scanf processes the input a byte at a time, and applies isspace() to
each byte separately, even when the locale is such that that's a
clearly
insane thing to do. Since this code was derived from FreeBSD, FreeBSD
has or once had the same issue. (A look at the freebsd project on
github
says it still does, assuming that's the authoritative repo.) Not sure
about other BSDen.I also verified that in UTF8-based locales, isspace() thinks that 0x85
and
0xA0, and no other high-bit-set values, are spaces. Not sure exactly
why
it thinks that, but that explains why 'х' fails when adjacent code
points
don't.So apparently the coding rule we have to adopt is "don't use *scanf()
on data that might contain multibyte characters". (There might be
corner
cases where it'd work all right for conversion specifiers other than
%s,
but probably you might as well just use strtol and friends in such
cases.)
Ugh.regards, tom lane
Definitive FreeBSD Sources:
https://svnweb.freebsd.org/base/
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: ler@lerctr.org
US Mail: 7011 W Parmer Ln, Apt 1115, Austin, TX 78729-6961
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Larry Rosenman <ler@lerctr.org> writes:
On 2016-02-10 16:19, Tom Lane wrote:
I looked into the OS X sources, and found that indeed you are right:
*scanf processes the input a byte at a time, and applies isspace() to
each byte separately, even when the locale is such that that's a
clearly insane thing to do. Since this code was derived from FreeBSD,
FreeBSD has or once had the same issue. (A look at the freebsd project
on github says it still does, assuming that's the authoritative repo.)
Not sure about other BSDen.
Definitive FreeBSD Sources:
https://svnweb.freebsd.org/base/
Ah, thanks for the link. I'm not totally sure which branch is most
current, but at least on this one, it's still clearly wrong:
https://svnweb.freebsd.org/base/stable/10/lib/libc/stdio/vfscanf.c?revision=291336&view=markup
convert_string(), which handles %s, applies isspace() to individual bytes
regardless of locale. convert_wstring(), which handles %ls, does it more
intelligently ... but as I said upthread, relying on %ls would just give
us a different set of portability problems.
It looks like Artur's patch is indeed what we need to do, along with
looking around for other *scanf() uses that are vulnerable.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2016-02-10 17:00, Tom Lane wrote:
Larry Rosenman <ler@lerctr.org> writes:
On 2016-02-10 16:19, Tom Lane wrote:
I looked into the OS X sources, and found that indeed you are right:
*scanf processes the input a byte at a time, and applies isspace() to
each byte separately, even when the locale is such that that's a
clearly insane thing to do. Since this code was derived from
FreeBSD,
FreeBSD has or once had the same issue. (A look at the freebsd
project
on github says it still does, assuming that's the authoritative
repo.)
Not sure about other BSDen.Definitive FreeBSD Sources:
https://svnweb.freebsd.org/base/Ah, thanks for the link. I'm not totally sure which branch is most
current, but at least on this one, it's still clearly wrong:
https://svnweb.freebsd.org/base/stable/10/lib/libc/stdio/vfscanf.c?revision=291336&view=markup
convert_string(), which handles %s, applies isspace() to individual
bytes
regardless of locale. convert_wstring(), which handles %ls, does it
more
intelligently ... but as I said upthread, relying on %ls would just
give
us a different set of portability problems.It looks like Artur's patch is indeed what we need to do, along with
looking around for other *scanf() uses that are vulnerable.regards, tom lane
that would be the current 10.x tree, production, and getting ready for
10.3 which is in code slush.
If you want, file a bug at https://bugs.freebsd.org/bugzilla
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: ler@lerctr.org
US Mail: 7011 W Parmer Ln, Apt 1115, Austin, TX 78729-6961
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Larry Rosenman <ler@lerctr.org> writes:
If you want, file a bug at https://bugs.freebsd.org/bugzilla
Probably not much point; the commit log shows pretty clearly that they
have been thinking about the code's behavior with multibyte characters,
so I assume they've intentionally decided to keep it like this.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Artur Zakirov <a.zakirov@postgrespro.ru> writes:
[ tsearch_aff_parse_v1.patch ]
I've pushed this with some corrections --- notably, I did not like the
lack of buffer-overrun prevention, and it did the wrong thing if a line
had more than one trailing space character.
We still need to look at other uses of *scanf(), but I think that any
other changes that might be needed should be separate patches anyhow.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 02/10/16 17:19, Tom Lane wrote:
I also verified that in UTF8-based locales, isspace() thinks that 0x85 and
0xA0, and no other high-bit-set values, are spaces. Not sure exactly why
Unicode NEXT LINE (NEL) and NO-BREAK SPACE, respectively.
http://unicode.org/standard/reports/tr13/tr13-5.html
http://unicode.org/charts/PDF/U0080.pdf
-Chap
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Chapman Flack <chap@anastigmatix.net> writes:
On 02/10/16 17:19, Tom Lane wrote:
I also verified that in UTF8-based locales, isspace() thinks that 0x85 and
0xA0, and no other high-bit-set values, are spaces. Not sure exactly why
Unicode NEXT LINE (NEL) and NO-BREAK SPACE, respectively.
Yeah, I got that --- what seems squishier is that none of the other C1
control characters are considered whitespace?
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers