regression test crashes at tsearch
Hi,
I see a regression test failure in my mingw-vista port
when I invoke the command
make check MULTIBYTE=euc_jp NO_LOCALE=yes
.
It causes a crash at tsearch.
The crash seems to occur when the server encoding isn't
UTF-8 with no locale.
The attached is a patch to avoid the crash.
regards,
Hiroshi Inoue
Attachments:
regress_tsearch.patchtext/plain; name=regress_tsearch.patchDownload
Index: backend/utils/mb/mbutils.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/mb/mbutils.c,v
retrieving revision 1.78
diff -c -r1.78 mbutils.c
*** backend/utils/mb/mbutils.c 22 Jan 2009 10:09:48 -0000 1.78
--- backend/utils/mb/mbutils.c 17 Feb 2009 21:59:26 -0000
***************
*** 575,580 ****
--- 575,584 ----
wchar2char(char *to, const wchar_t *from, size_t tolen)
{
size_t result;
+ #ifdef WIN32
+ int encoding = GetDatabaseEncoding();
+ bool useWcstombs = !(encoding == PG_UTF8 || lc_ctype_is_c());
+ #endif
if (tolen == 0)
return 0;
***************
*** 584,602 ****
* On Windows, the "Unicode" locales assume UTF16 not UTF8 encoding,
* and for some reason mbstowcs and wcstombs won't do this for us,
* so we use MultiByteToWideChar().
*/
! if (GetDatabaseEncoding() == PG_UTF8)
{
! result = WideCharToMultiByte(CP_UTF8, 0, from, -1, to, tolen,
NULL, NULL);
/* A zero return is failure */
! if (result <= 0)
result = -1;
else
{
- Assert(result <= tolen);
/* Microsoft counts the zero terminator in the result */
! result--;
}
}
else
--- 588,624 ----
* On Windows, the "Unicode" locales assume UTF16 not UTF8 encoding,
* and for some reason mbstowcs and wcstombs won't do this for us,
* so we use MultiByteToWideChar().
+ * Also note wcstombs/mbstowcs is unavailable when LC_CTYPE is C.
*/
! if (!useWcstombs)
{
! int utf8len = tolen;
! char *utf8str = to;
!
! if (encoding != PG_UTF8)
! {
! utf8len = pg_encoding_max_length(PG_UTF8) * tolen;
! utf8str = palloc(utf8len + 1);
! }
! utf8len = WideCharToMultiByte(CP_UTF8, 0, from, -1, utf8str, utf8len,
NULL, NULL);
/* A zero return is failure */
! if (utf8len <= 0)
result = -1;
else
{
/* Microsoft counts the zero terminator in the result */
! result = utf8len - 1;
! if (encoding != PG_UTF8)
! {
! char *mbstr = pg_do_encoding_conversion((unsigned char *) utf8str, result, PG_UTF8, encoding);
! result = strlcpy(to, mbstr, tolen);
! if (utf8str != to)
! pfree(utf8str);
! if (mbstr != utf8str)
! pfree(mbstr);
! }
! Assert(result <= tolen);
}
}
else
***************
*** 618,637 ****
char2wchar(wchar_t *to, size_t tolen, const char *from, size_t fromlen)
{
size_t result;
if (tolen == 0)
return 0;
#ifdef WIN32
! /* See WIN32 "Unicode" comment above */
! if (GetDatabaseEncoding() == PG_UTF8)
{
/* Win32 API does not work for zero-length input */
! if (fromlen == 0)
result = 0;
else
{
! result = MultiByteToWideChar(CP_UTF8, 0, from, fromlen, to, tolen - 1);
/* A zero return is failure */
if (result == 0)
result = -1;
--- 640,672 ----
char2wchar(wchar_t *to, size_t tolen, const char *from, size_t fromlen)
{
size_t result;
+ #ifdef WIN32
+ int encoding = GetDatabaseEncoding();
+ bool useMbstowcs = !(encoding == PG_UTF8 || lc_ctype_is_c());
+ #endif
if (tolen == 0)
return 0;
#ifdef WIN32
! if (!useMbstowcs)
{
+ int utf8len = fromlen;
+ unsigned char *utf8str = (unsigned char *) from;
+
+ if (encoding != PG_UTF8)
+ {
+ utf8str = pg_do_encoding_conversion(from, fromlen, encoding, PG_UTF8);
+ if (utf8str != from)
+ utf8len = strlen(utf8str);
+ }
+ /* See WIN32 "Unicode" comment above */
/* Win32 API does not work for zero-length input */
! if (utf8len == 0)
result = 0;
else
{
! result = MultiByteToWideChar(CP_UTF8, 0, utf8str, utf8len, to, tolen - 1);
/* A zero return is failure */
if (result == 0)
result = -1;
***************
*** 643,648 ****
--- 678,685 ----
/* Append trailing null wchar (MultiByteToWideChar() does not) */
to[result] = 0;
}
+ if (utf8str != from)
+ pfree(utf8str);
}
else
#endif /* WIN32 */
Hi.
Sorry late reaction.
I try check of CVS-HEAD now.
$ make check NO_LOCALE=true
...
=======================
All 120 tests passed.
=======================
However, same action as Inou-sane is seen.
make check MULTIBYTE=euc_jp NO_LOCALE=true
The differences that caused some tests to fail can be viewed in the
file "C:/MinGW/home/HIROSHI/pgsql/src/test/regress/regression.diffs". A copy of the test summary
that you see
above is saved in the file "C:/MinGW/home/HIROSHI/pgsql/src/test/regress/regression.out".
make[2]: *** [check] Error 1
make[2]: Leaving directory `/home/HIROSHI/pgsql/src/test/regress'
make[1]: *** [check] Error 2
make[1]: Leaving directory `/home/HIROSHI/pgsql/src/test'
make: *** [check] Error 2
http://winpg.jp/~saito/pg_bug/20090223-CVS-regression.diffs
http://winpg.jp/~saito/pg_bug/20090223-CVS-regression.out
Then, It is comfortable after applying Inoue-san patch.
make check MULTIBYTE=euc_jp NO_LOCALE=true
...
=======================
All 120 tests passed.
=======================
I think that Mr. Inoue's patch is right.
why isn't it taken into consideration yet?
Regards,
Hiroshi Saito
----- Original Message -----
From: "Hiroshi Inoue" <inoue@tpf.co.jp>
Show quoted text
Hi,
I see a regression test failure in my mingw-vista port
when I invoke the command
make check MULTIBYTE=euc_jp NO_LOCALE=yes
.
It causes a crash at tsearch.
The crash seems to occur when the server encoding isn't
UTF-8 with no locale.
The attached is a patch to avoid the crash.regards,
Hiroshi Inoue
I think that Mr. Inoue's patch is right.
why isn't it taken into consideration yet?
I can't check that patch because I don't have a Windows box.
But I did some investigations. As I understand, the patch prevents from calling
of wcstombs/mbstowcs with C locale and I checked trace for that. But
wcstombs/mbstowcs doesn't called at all with C-locale. With C locale char2wchar
calls pg_mb2wchar_with_len(), and char2wchar is called from single place:
TParserInit. wchar2char isn't called at all. Next, regression tests are
successfully passed until ispell_tst configuration was created. Looking at diff
SELECT to_tsquery('ispell_tst', 'footballyklubber:b & rebookings:A & sky');^M
to_tsquery
^M
!
-----------------------------------------------------------------------------------------------------^M
! 'f':B & 'o':B & 'b':B & 'l':B & 'y':B & 'l':B & 'b':B & 'e':B & 'r':A & 'b':A
& 'o':A & 'g':A & 'y'^M
I believe that even here parser is working correctly but there is problems in
ispell.
So, although patch fixes the problem, it seems to me patch doesn't do it in
right place. The patch converts multibyte string in non-utf encoding with
C-locale into wide string by multibyte->utf8->wide conversation instead of
pg_mb2wchar_with_len() call. Is output of pg_mb2wchar_with_len() compatible with
isalpha (and others) functions?
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev wrote:
I think that Mr. Inoue's patch is right.
why isn't it taken into consideration yet?I can't check that patch because I don't have a Windows box.
But I did some investigations. As I understand, the patch prevents from
calling of wcstombs/mbstowcs with C locale and I checked trace for that.
But wcstombs/mbstowcs doesn't called at all with C-locale. With C
locale char2wchar calls pg_mb2wchar_with_len(),
pg_mb2wchar_with_len() converts server encoded strings to pg_wchar
strings. But pg_wchar is typedef'd as unsigned int which is not the
same as wchar_t at least on Windows (unsigned short).
If pg_mb2wchar_with_len() works for wchar_t on Windows, we had better
call it in all cases on Windows to implement char2wchar().
and char2wchar is called from single place: TParserInit.
wchar2char isn't called at all.
I modified it corresponding to the change in char2wchar() so that
wchar2char(char2wchar(x)) becomes x. Though I'm not sure if it is
called or not, isn't it an principle?
If there's an effective function like pg_wchar2mb_with_len() which
converts wchar_t strings to server encoded strings, we had better
simply call it for char2wchar().
regards,
Hiroshi Inoue
pg_mb2wchar_with_len() converts server encoded strings to pg_wchar
strings. But pg_wchar is typedef'd as unsigned int which is not the
same as wchar_t at least on Windows (unsigned short).
Oops. The problem is here. TParserInit allocates twice less memory than needed.
And it happens if sizeof(wchar_t) < sizeof(pg_wchar) and C-locale for
non-Windows box. Also for Windows, encoding should be non-utf. So, all p_is*
functions are broken in this case because they work with wrong data.
.
I modified it corresponding to the change in char2wchar() so that
wchar2char(char2wchar(x)) becomes x. Though I'm not sure if it is
mbstowcs/wcstombs doesn't work with C-locale in other OSes too, so that's not
needed.
If there's an effective function like pg_wchar2mb_with_len() which
converts wchar_t strings to server encoded strings, we had better
simply call it for char2wchar().
I don't see a way to produce correct result of char2wchar with C-locale and
sizeof(wchar_t) = 2.
In summary, I suggest to remove support of C-locale from char2wchar function and
tsearch's parser should directly use pg_mb2wchar_with_len() in case of
C-locale and multibyte encoding. In all other places char2wchar is called only
for non-C locale.
Please, test attached patch.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachments:
Hi Teodor-san.
Sorry late reaction.
----- Original Message -----
From: "Teodor Sigaev" <teodor@sigaev.ru>
If there's an effective function like pg_wchar2mb_with_len() which
converts wchar_t strings to server encoded strings, we had better
simply call it for char2wchar().I don't see a way to produce correct result of char2wchar with C-locale and
sizeof(wchar_t) = 2.In summary, I suggest to remove support of C-locale from char2wchar function and
tsearch's parser should directly use pg_mb2wchar_with_len() in case of
C-locale and multibyte encoding. In all other places char2wchar is called only
for non-C locale.Please, test attached patch.
Um, I think your patch like the overkill reaction of C-locale...
However, I tried your patch.
make check MULTIBYTE=euc_jp NO_LOCALE=true
...
=======================
All 120 tests passed.
=======================
Anyway, either should be applied.
Thanks.
Regards,
Hiroshi Saito
Um, I think your patch like the overkill reaction of C-locale...
Patch makes char2wchar and wchar2char symmetric to C-locale.
However, I tried your patch.
make check MULTIBYTE=euc_jp NO_LOCALE=true
...
=======================
All 120 tests passed.
=======================Anyway, either should be applied. Thanks.
Thank you. Changes are committed up to 8.2
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Wed, Feb 25, 2009 at 6:44 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
mbstowcs/wcstombs doesn't work with C-locale in other OSes too, so that's
not needed.
Say what? What OSes is that?
--
greg
Say what? What OSes is that?
See attached test program. It tries to convert multibyte russian word in
UTF8 to wide char with C, ru_RU-KOI8-R and ru_RU.UTF-8 locales. The word
contains 6 letters.
FreeBSD 7.2 (short output):
========C==========
mbstowcs returns 12
========ru_RU.KOI8-R==========
mbstowcs returns 12
========ru_RU.UTF-8==========
mbstowcs returns 6
Linux 2.6.23 libc 2.5 (short output):
========C==========
mbstowcs returns -1
========ru_RU.KOI8-R==========
mbstowcs returns 12
========ru_RU.UTF-8==========
mbstowcs returns 6
The program also prints test of iswalpha:
Linux 2.6.23 libc 2.5 (full output):
========C==========
mbstowcs returns -1
ERROR
========ru_RU.KOI8-R==========
mbstowcs returns 12
0-th chacter is alpha
1-th chacter is NOT alpha
2-th chacter is alpha
3-th chacter is NOT alpha
4-th chacter is alpha
5-th chacter is NOT alpha
6-th chacter is alpha
7-th chacter is NOT alpha
8-th chacter is alpha
9-th chacter is NOT alpha
10-th chacter is alpha
11-th chacter is NOT alpha
========ru_RU.UTF-8==========
mbstowcs returns 6
0-th chacter is alpha
1-th chacter is alpha
2-th chacter is alpha
3-th chacter is alpha
4-th chacter is alpha
5-th chacter is alpha
Attachments:
Hmm well the KOI8 tests unsurprisingly produce random results on non-
KOI8 input. It's pure chance you didn't get EILSEQ.
What errno did you get for the C locale test? On which input character?
Perhaps it's sihnalljng EILSEQ for every byte >0x80 ? That seems
broken to me but perhaps not to a glibc pedant out there.
--
Greg
On 2 Mar 2009, at 19:17, Teodor Sigaev <teodor@sigaev.ru> wrote:
Show quoted text
Say what? What OSes is that?
See attached test program. It tries to convert multibyte russian
word in UTF8 to wide char with C, ru_RU-KOI8-R and ru_RU.UTF-8
locales. The word contains 6 letters.FreeBSD 7.2 (short output):
========C==========
mbstowcs returns 12
========ru_RU.KOI8-R==========
mbstowcs returns 12
========ru_RU.UTF-8==========
mbstowcs returns 6Linux 2.6.23 libc 2.5 (short output):
========C==========
mbstowcs returns -1
========ru_RU.KOI8-R==========
mbstowcs returns 12
========ru_RU.UTF-8==========
mbstowcs returns 6The program also prints test of iswalpha:
Linux 2.6.23 libc 2.5 (full output):
========C==========
mbstowcs returns -1
ERROR
========ru_RU.KOI8-R==========
mbstowcs returns 12
0-th chacter is alpha
1-th chacter is NOT alpha
2-th chacter is alpha
3-th chacter is NOT alpha
4-th chacter is alpha
5-th chacter is NOT alpha
6-th chacter is alpha
7-th chacter is NOT alpha
8-th chacter is alpha
9-th chacter is NOT alpha
10-th chacter is alpha
11-th chacter is NOT alpha
========ru_RU.UTF-8==========
mbstowcs returns 6
0-th chacter is alpha
1-th chacter is alpha
2-th chacter is alpha
3-th chacter is alpha
4-th chacter is alpha
5-th chacter is alpha
<t.c.gz>
Hmm well the KOI8 tests unsurprisingly produce random results on
non-KOI8 input. It's pure chance you didn't get EILSEQ.
Because KOI8 is not multibyte encoding.
What errno did you get for the C locale test? On which input
character?Perhaps it's sihnalljng EILSEQ for every byte >0x80 ? That
seems broken to me but perhaps not to a glibc pedant out there.
Linux
========C==========
mbstowcs returns -1 errno: 84 Invalid or incomplete multibyte or wide
character
========ru_RU.KOI8-R==========
mbstowcs returns 12 errno: 0 Success
========ru_RU.UTF-8==========
mbstowcs returns 6 errno: 0 Success
FreeBSD
========C==========
mbstowcs returns 12 errno: 0 Unknown error: 0
========ru_RU.KOI8-R==========
mbstowcs returns 12 errno: 0 Unknown error: 0
========ru_RU.UTF-8==========
mbstowcs returns 6 errno: 0 Unknown error: 0
In any case, we could not trust mbstowcs if current locale is not match
to encoding. And we could not check every pair of locale/encoding but
mbstowcs with C-locale and multibyte encoding obviously doesn't work.