upper()/lower() truncates the result under Japanese Windows
Hi,
The upper(), lower() and initcap() functions truncate their results
under Japanese Windows when, e.g., the server encoding is UTF-8
and the LC_CTYPE setting is Japanese_Japan.932.
Below is an example.
$ psql
psql (8.4devel)
Type "help" for help.
inoue=# \encoding sjis
inoue=# show server_encoding;
server_encoding
-----------------
UTF8
(1 row)
inoue=# show LC_CTYPE;
lc_ctype
--------------------
Japanese_Japan.932
(1 row)
inoue=# \set jpnstr '''カタカナ'''
inoue=# select char_length(:jpnstr);
char_length
-------------
4
(1 row)
inoue=# select upper(:jpnstr);
upper
--------
カタカ
(1 row)
inoue=# select char_length(upper(:jpnstr));
char_length
-------------
3
(1 row)
The output of the last command should be 4, not 3.
Attached is a patch to fix the bug.
After applying the patch the result is
inoue=# select upper(:jpnstr);
upper
----------
カタカナ
(1 row)
inoue=# select char_length(upper(:jpnstr));
char_length
-------------
4
(1 row)
regards,
Hiroshi Inoue
Attachments:
formatting.patch (text/plain, +39 -36)
Hiroshi Inoue <inoue@tpf.co.jp> writes:
> Upper(), lower() or initcap() function truncates the result
> under Japanese Windows with e.g. the server encoding=UTF-8
> and the LC_CTYPE setting Japanese_japan.932 .
Hmm, I guess that makes sense, since the LC_CTYPE implies an encoding
other than UTF-8; MB_CUR_MAX should be set according to LC_CTYPE.
The proposed patch seems pretty ugly though. Why don't we just stop
using MB_CUR_MAX altogether? These three functions are the only
references to it AFAICS.
regards, tom lane
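
The size mismatch discussed above can be illustrated with a little arithmetic. The following is only a sketch, not code from formatting.patch: the helper names are hypothetical, and it assumes that MB_CUR_MAX is 2 under a code page 932 locale, that each katakana character takes 3 bytes in UTF-8, and that 4 bytes is a safe per-character upper bound for UTF-8. How the truncation actually happens inside the server is not shown in this thread; the sketch only shows that a buffer sized from the locale's MB_CUR_MAX is too small for the UTF-8 result.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helpers for illustration only.
 *
 * result_buffer_size: bytes reserved for nchars characters plus a
 * terminating NUL, given an assumed maximum number of bytes per character.
 */
size_t result_buffer_size(size_t nchars, size_t max_bytes_per_char)
{
    assert(max_bytes_per_char > 0);
    return nchars * max_bytes_per_char + 1;
}

/*
 * max_whole_chars: upper limit on how many characters of char_bytes bytes
 * each can be stored in buf_size bytes (ignoring the terminator).
 */
size_t max_whole_chars(size_t buf_size, size_t char_bytes)
{
    assert(char_bytes > 0);
    return buf_size / char_bytes;
}
```

Under these assumptions, result_buffer_size(4, 2) reserves 9 bytes for the four-character string, and max_whole_chars(9, 3) shows that at most three 3-byte characters fit, which matches the three characters in the report. With a per-character bound of 4, result_buffer_size(4, 4) reserves 17 bytes, enough for all four.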
Tom Lane wrote:
> Hiroshi Inoue <inoue@tpf.co.jp> writes:
>> Upper(), lower() or initcap() function truncates the result
>> under Japanese Windows with e.g. the server encoding=UTF-8
>> and the LC_CTYPE setting Japanese_japan.932 .
> Hmm, I guess that makes sense, since the LC_CTYPE implies an encoding
> other than UTF-8; MB_CUR_MAX should be set according to LC_CTYPE.
> The proposed patch seems pretty ugly though. Why don't we just stop
> using MB_CUR_MAX altogether? These three functions are the only
> references to it AFAICS.
Although it looks ugly, it only follows what wchar2char() does.
Though I don't like to use MB_CUR_MAX, it seems safe as long as
wchar2char() calls wcstombs().
regards,
Hiroshi Inoue
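
The point about wcstombs() can be sketched as follows. This is a minimal standalone illustration, not the PostgreSQL code: upper_via_wcstombs is a hypothetical helper, and it assumes the conversion back to multibyte really goes through wcstombs(), so the output is in the current locale's encoding and each character takes at most MB_CUR_MAX bytes. Only under that assumption is nchars * MB_CUR_MAX + 1 a sufficient buffer size.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: upper-case a wide string and convert it to
 * multibyte with wcstombs(), sizing the output buffer by MB_CUR_MAX.
 * Returns a malloc'd NUL-terminated string, or NULL on failure.
 */
char *upper_via_wcstombs(const wchar_t *src)
{
    size_t nchars = wcslen(src);
    wchar_t *work = malloc((nchars + 1) * sizeof(wchar_t));
    char *result;
    size_t size;
    size_t i;

    if (work == NULL)
        return NULL;

    /* Case-fold in the wide-character domain. */
    for (i = 0; i < nchars; i++)
        work[i] = (wchar_t) towupper((wint_t) src[i]);
    work[nchars] = L'\0';

    /* Sufficient only because wcstombs() emits the locale's encoding. */
    size = nchars * MB_CUR_MAX + 1;
    result = malloc(size);
    if (result != NULL)
    {
        size_t written = wcstombs(result, work, size);

        if (written == (size_t) -1)
        {
            /* A character could not be represented in the locale. */
            free(result);
            result = NULL;
        }
        else
        {
            assert(written < size);
            result[written] = '\0';
        }
    }
    free(work);
    return result;
}
```

If the conversion instead targets an encoding that differs from the locale's (as with a UTF-8 server encoding under a code page 932 locale), the MB_CUR_MAX bound no longer applies and the buffer has to be sized by some other limit.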