upper()/lower() truncates the result under Japanese Windows

Started by Hiroshi Inoue over 17 years ago, 3 messages, hackers
#1 Hiroshi Inoue
Inoue@tpf.co.jp

Hi,

Upper(), lower() or initcap() function truncates the result
under Japanese Windows with e.g. the server encoding=UTF-8
and the LC_CTYPE setting Japanese_japan.932 .

Below is an example.

$ psql
psql (8.4devel)
Type "help" for help.

inoue=# \encoding sjis

inoue=# show server_encoding;
server_encoding
-----------------
UTF8
(1 行)

inoue=# show LC_CTYPE;
lc_ctype
--------------------
Japanese_Japan.932
(1 行)

inoue=# \set jpnstr '''カタカナ'''
inoue=# select char_length(:jpnstr);
char_length
-------------
4
(1 行)

inoue=# select upper(:jpnstr);
upper
--------
カタカ
(1 行)

inoue=# select char_length(upper(:jpnstr));
char_length
-------------
3
(1 行)

The output of the last command should be 4 not 3.
Attached is a patch to fix the bug.
After applying the patch the result is

inoue=# select upper(:jpnstr);
upper
----------
カタカナ
(1 行)

inoue=# select char_length(upper(:jpnstr));
char_length
-------------
4
(1 行)

regards,
Hiroshi Inoue

Attachments:

formatting.patch (text/plain; name=formatting.patch), +39 -36

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#1)
Re: upper()/lower() truncates the result under Japanese Windows

Hiroshi Inoue <inoue@tpf.co.jp> writes:

> Upper(), lower() or initcap() function truncates the result
> under Japanese Windows with e.g. the server encoding=UTF-8
> and the LC_CTYPE setting Japanese_japan.932 .

Hmm, I guess that makes sense, since the LC_CTYPE implies an encoding
other than UTF-8; MB_CUR_MAX should be set according to LC_CTYPE.

The proposed patch seems pretty ugly though. Why don't we just stop
using MB_CUR_MAX altogether? These three functions are the only
references to it AFAICS.

regards, tom lane

#3 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#2)
Re: upper()/lower() truncates the result under Japanese Windows

Tom Lane wrote:

> Hiroshi Inoue <inoue@tpf.co.jp> writes:
>
>> Upper(), lower() or initcap() function truncates the result
>> under Japanese Windows with e.g. the server encoding=UTF-8
>> and the LC_CTYPE setting Japanese_japan.932 .
>
> Hmm, I guess that makes sense, since the LC_CTYPE implies an encoding
> other than UTF-8; MB_CUR_MAX should be set according to LC_CTYPE.
>
> The proposed patch seems pretty ugly though. Why don't we just stop
> using MB_CUR_MAX altogether? These three functions are the only
> references to it AFAICS.

Although it looks ugly, it only follows what wchar2char() does.
Though I don't like to use MB_CUR_MAX, it seems safe as long as
wchar2char() calls wcstombs().

regards,
Hiroshi Inoue