client libpq multibyte support

Started by SAKAIDA Masaaki, about 26 years ago, 9 messages, hackers

sakaida@psn.co.jp

about 26 years ago

Hi,

A client application using a libpq built without MULTIBYTE
cannot talk to a server built with MULTIBYTE.

(Example)
------------------------------------------------------------
A_server(non-MULTIBYTE)   B_server(--enable-multibyte=EUC_JP)
          |                          |
        --+----------+---------------+--  network
                     |
          C_server(non-MULTIBYTE)

By using the C_server's psql(+non-MULTIBYTE-libpq),
prompt> psql -h B_server
admin=# set client_encoding='SJIS';
SET VARIABLE
admin=# \dt
     List of relations
    Name    | Type  | Owner
------------+-------+-------
 SJIS_KANJI | table | admin
(1 row)

admin=# select * from SJIS_KANJI ;
\: extra argument ';' ignored
\: extra argument ';' ignored
Invalid command \. Try \? for help.

(Here, "SJIS_KANJI" is SJIS multibyte code.)
-----------------------------------------------------------

Is this the intended behavior (by specification)?

I hope that the 7.0 client libpq and applications will always be
built with "configure --enable-multibyte", even if MULTIBYTE isn't
necessary for the backend. If so, the above problem would be solved.

--
Regards,
SAKAIDA Masaaki -- Osaka, Japan

Tom Lane

tgl@sss.pgh.pa.us

about 26 years ago

In reply to: SAKAIDA Masaaki (#1)

Re: client libpq multibyte support

SAKAIDA Masaaki <sakaida@psn.co.jp> writes:

A client application using a libpq built without MULTIBYTE
cannot talk to a server built with MULTIBYTE.

admin=# select * from SJIS_KANJI ;
\: extra argument ';' ignored
\: extra argument ';' ignored
Invalid command \. Try \? for help.

Ugh :-(. We have not seen this reported before --- do you know exactly
where it's coming from? (I suspect it may be a psql issue not a libpq
issue, but hard to say without more info.)

I hope that the 7.0 client libpq and applications will always be
built with "configure --enable-multibyte", even if MULTIBYTE isn't
necessary for the backend. If so, the above problem would be solved.

I do not think that will go over well with people who don't need
multibyte support, since the MULTIBYTE code is a good deal larger
and slower. Also, AFAIK we didn't have any such problem in 6.5, so
perhaps this is just a small bug not requiring such a sledgehammer
solution. We need to look more closely.

regards, tom lane

Tatsuo Ishii

ishii@postgresql.org

about 26 years ago

In reply to: Tom Lane (#2)

Re: client libpq multibyte support

admin=# select * from SJIS_KANJI ;
\: extra argument ';' ignored
\: extra argument ';' ignored
Invalid command \. Try \? for help.

Ugh :-(. We have not seen this reported before --- do you know exactly
where it's coming from? (I suspect it may be a psql issue not a libpq
issue, but hard to say without more info.)

That's because a non-MB client does not understand that "Shift JIS
kanji" consist of characters with different byte widths. A similar
problem would also happen with the Big5 character set (traditional
Chinese). Unlike other character sets, these must be treated
carefully, since they include the same bit patterns as ASCII, and that
confuses non-MB clients.

I do not think that will go over well with people who don't need
multibyte support, since the MULTIBYTE code is a good deal larger
and slower. Also, AFAIK we didn't have any such problem in 6.5, so
perhaps this is just a small bug not requiring such a sledgehammer
solution. We need to look more closely.

No, 6.5 (and earlier versions) have exactly the same "bug." The reason
you haven't heard of it until now is simply that nobody had tried to mix
MB/non-MB client/server configurations until Masaaki came up with
pgbash :-) Anyway, I can hardly imagine that such configurations
would actually exist in the real world. Masaaki, could you tell me
what the advantages of, or reasons for, this configuration are?

As for Tom's comment that "the MULTIBYTE code is a good deal larger and
slower": IMHO it's the price of i18n (I don't claim my implementation of
MB is the most efficient one, though). Today almost every OS and
application is evolving to be "i18n ready." Look at Lamar's new RPM.
The multibyte and locale functionality is now enabled by
default in it.

In the near future, PostgreSQL will have true i18n functionality
(NATIONAL CHARACTER and friends), and I look forward to joining the work.
I hope PostgreSQL will be i18n ready by default by then.
--
Tatsuo Ishii

Tatsuo Ishii

ishii@postgresql.org

about 26 years ago

In reply to: Tatsuo Ishii (#3)

Re: client libpq multibyte support

That's because a non-MB client does not understand that "Shift JIS
kanji" consist of characters with different byte widths. A similar
problem would also happen with the Big5 character set (traditional
Chinese). Unlike other character sets, these must be treated
carefully, since they include the same bit patterns as ASCII, and that
confuses non-MB clients.

I'm confused though, this would mean that somewhere in the string
`SJIS_KANJI' a backslash was found. But that's all ASCII characters.
Aren't the characters 0-127 always identical in any character set?

Not always. In Shift JIS and Big5, bytes in the 0-127 range can also
appear as part of a multibyte character. So "how to distinguish them
from ASCII?", you might ask. Here are the rules for Shift JIS:

1. Parse from the beginning byte of the string in question. If it is
0-127 then it's ASCII (a single byte letter).

2. if it's between 0xa1 and 0xdf, it's a "1 byte kana" (single byte
letter).

3. otherwise it's a "kanji" (double byte letter). In this case the
second byte might be in range of 0-127 (this is the source of the
problem).
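
The three rules above can be sketched in C roughly as follows. This is
only an illustration: the name sjis_mblen is made up for this example
and is not the actual PostgreSQL function, and it assumes the caller
points it at the first byte of a character.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (name invented for illustration): return the byte
 * length of the Shift JIS character that starts at s, following the
 * three rules listed above.
 */
int
sjis_mblen(const unsigned char *s)
{
    assert(s != NULL);

    if (*s < 0x80)
        return 1;               /* rule 1: ASCII, single byte */
    if (*s >= 0xa1 && *s <= 0xdf)
        return 1;               /* rule 2: "1 byte kana", single byte */
    return 2;                   /* rule 3: kanji, second byte may be 0-127 */
}
```

For example, the kanji 表 is encoded in Shift JIS as the two bytes
0x95 0x5c, and 0x5c is also the ASCII code for backslash. A client that
does not apply rule 3 sees that second byte as a separate "\" character.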

I think Big5 has a similar but slightly different rule (I don't
remember it precisely now).

Other encodings that use bytes in the 0-127 range (where those bytes
are not ASCII) include:

o UCS-2 and UCS-4 (Unicode)

o any 7 bit encoded ISO 2022 based charsets, for example ISO-2022-JP.
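
To illustrate the UCS-2 case, here is a small sketch in C. The helper
names are invented for this example, and big-endian byte order is an
assumption made only to keep it short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: store one UCS-2 code unit big-endian in out[0..1]. */
void
ucs2be_encode(unsigned int code_unit, unsigned char out[2])
{
    assert(code_unit <= 0xffff);
    out[0] = (unsigned char) (code_unit >> 8);
    out[1] = (unsigned char) (code_unit & 0xff);
}

/* Count how many bytes of buf fall in the 0-127 range. */
size_t
count_low_bytes(const unsigned char *buf, size_t len)
{
    size_t      n = 0;
    size_t      i;

    for (i = 0; i < len; i++)
    {
        if (buf[i] < 0x80)
            n++;
    }
    return n;
}
```

Encoding U+0041 ("A") this way gives the bytes 0x00 0x41, and a CJK
ideograph such as U+5C71 gives 0x5c 0x71. Both bytes of the latter are
in the 0-127 range even though neither is an ASCII character.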
--
Tatsuo Ishii

Tom Lane

tgl@sss.pgh.pa.us

about 26 years ago

In reply to: Tatsuo Ishii (#3)

Re: client libpq multibyte support

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

As for Tom's comment that "the MULTIBYTE code is a good deal larger and
slower": IMHO it's the price of i18n (I don't claim my implementation of
MB is the most efficient one, though). Today almost every OS and
application is evolving to be "i18n ready."

True, and in fact most of the performance problem in the client-side
MULTIBYTE code comes from the fact that it's not designed-in, but tries
to be a minimally intrusive patch. I think we could make it go faster
if we accepted that it was standard functionality. So I'm not averse to
going in that direction in the long term ... but I do object to turning
on MULTIBYTE by default just a couple days before release. We don't
really know how robust the MULTIBYTE-client-and-non-MULTIBYTE-server
combination is, and so I'm afraid to make it the default configuration
with hardly any testing.

regards, tom lane

Tatsuo Ishii

ishii@postgresql.org

about 26 years ago

In reply to: Tom Lane (#5)

Re: client libpq multibyte support

True, and in fact most of the performance problem in the client-side
MULTIBYTE code comes from the fact that it's not designed-in, but tries
to be a minimally intrusive patch. I think we could make it go faster
if we accepted that it was standard functionality. So I'm not averse to
going in that direction in the long term ...

Glad to hear that.

but I do object to turning
on MULTIBYTE by default just a couple days before release. We don't
really know how robust the MULTIBYTE-client-and-non-MULTIBYTE-server
combination is, and so I'm afraid to make it the default configuration
with hardly any testing.

Agreed.
--
Tatsuo Ishii

SAKAIDA Masaaki

sakaida@psn.co.jp

about 26 years ago

In reply to: Tatsuo Ishii (#4)

Re: client libpq multibyte support

Tatsuo Ishii <t-ishii@sra.co.jp> wrote:

admin=# select * from SJIS_KANJI ;
\: extra argument ';' ignored
\: extra argument ';' ignored
Invalid command \. Try \? for help.

(snip)

That's because a non-MB client does not understand that "Shift JIS
kanji" consist of characters with different byte widths. A similar
problem would also happen with the Big5 character set (traditional
Chinese). Unlike other character sets, these must be treated
carefully, since they include the same bit patterns as ASCII, and that
confuses non-MB clients.

Thank you for your reply.

(Probably, the direct cause of this error is PQmblen(): the
non-MULTIBYTE PQmblen() always returns 1.)
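
A rough sketch in C of why a length function that always returns 1
would produce stray backslashes. The function names below are
hypothetical stand-ins, not the real psql or libpq code, and the
scanner is deliberately simplified.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a non-MULTIBYTE build: every character is one byte. */
int
mblen_non_multibyte(const unsigned char *s)
{
    (void) s;
    return 1;
}

/* Stand-in for a Shift JIS aware length function. */
int
mblen_sjis(const unsigned char *s)
{
    if (*s < 0x80)
        return 1;               /* ASCII */
    if (*s >= 0xa1 && *s <= 0xdf)
        return 1;               /* 1 byte kana */
    return 2;                   /* kanji */
}

/*
 * Walk the buffer one character at a time, using the supplied length
 * function, and count characters whose first byte is a backslash (0x5c).
 */
int
count_backslashes(const unsigned char *s, size_t len,
                  int (*mblen) (const unsigned char *))
{
    int         found = 0;
    size_t      i = 0;

    assert(s != NULL && mblen != NULL);
    while (i < len)
    {
        if (s[i] == 0x5c)
            found++;
        i += (size_t) mblen(s + i);
    }
    return found;
}
```

Given the bytes 0x95 0x5c (one Shift JIS kanji), the scanner reports one
backslash with mblen_non_multibyte and none with mblen_sjis, which
matches the kind of "\" errors shown at the top of the thread.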

Anyway, I can hardly imagine that such configurations
would actually exist in the real world. Masaaki, could you tell me
what the advantages of, or reasons for, this configuration are?

# My poor English won't be able to explain the real world ;-).

If the client libpq is always built with "configure
--enable-multibyte", the advantages are:

1. In the case of SQL_ASCII, client application speed is
almost equal to non-MULTIBYTE, and the MULTIBYTE code is
not much larger.
2. When required, MULTIBYTE can be used at any time via
"set client_encoding=xxx".

--
Regards,
SAKAIDA Masaaki -- Osaka, Japan

SAKAIDA Masaaki

sakaida@psn.co.jp

about 26 years ago

In reply to: Tatsuo Ishii (#6)

Re: client libpq multibyte support

True, and in fact most of the performance problem in the client-side
MULTIBYTE code comes from the fact that it's not designed-in, but tries
to be a minimally intrusive patch. I think we could make it go faster
if we accepted that it was standard functionality. So I'm not averse to
going in that direction in the long term ...

Glad to hear that.

but I do object to turning
on MULTIBYTE by default just a couple days before release. We don't
really know how robust the MULTIBYTE-client-and-non-MULTIBYTE-server
combination is, and so I'm afraid to make it the default configuration
with hardly any testing.

Agreed.

Thank you for taking this on. I expect a good result will come of it.

--
Regards,
SAKAIDA Masaaki -- Osaka, Japan

SAKAIDA Masaaki

sakaida@psn.co.jp

about 26 years ago

In reply to: Tatsuo Ishii (#6)

Re: client libpq multibyte support

Please allow me to pick up this thread again.

True, and in fact most of the performance problem in the client-side
MULTIBYTE code comes from the fact that it's not designed-in, but tries
to be a minimally intrusive patch. I think we could make it go faster
if we accepted that it was standard functionality. So I'm not averse to
going in that direction in the long term ...

I have checked the performance problem.

(Environment)
- Hardware : P200pro CPU, 128MB, 5400rpm disk
- OS : Red Hat Linux 5.2
- Database version : postgresql-7.0RC1

(Tested software and data)
- Library : libpq
- Program : ecpg application program, psql
- SQL : insert, select
- Number of tuples : 100,000 tuples

(Test case)
(1) non-MULTIBYTE
(2) MULTIBYTE encoding=SQL_ASCII

An ecpg program and the psql were used in this test case.

(Result)
As for the result, there was no difference in the speed of (1)
and (2). I could *not* find the performance problem.

(Improvement)
However, a performance problem might appear in a test with
10,000,000 tuples, because PQmblen() carries a small function-call
overhead. Therefore, if the MULTIBYTE PQmblen() is changed as
follows, the performance problem disappears *completely*.

# ifdef MULTIBYTE
int
PQmblen(const unsigned char *s, int encoding)
{
    /* Added line: SQL_ASCII is always one byte per character, so skip the call */
    if (encoding == SQL_ASCII)
        return 1;
    return pg_encoding_mblen(encoding, s);
}
# endif

(Conclusion)
A client library/application should be built with "configure
--enable-multibyte[=SQL_ASCII]" even when PostgreSQL itself is built
with "configure" [non-MULTIBYTE].

(Reference of library size)
               non-MULTIBYTE   MULTIBYTE
libpq.a            69KB           91KB
libpq.so.2.0       52KB           52KB
libpq.so.2.1       60KB           78KB

--
Regards,
SAKAIDA Masaaki -- Osaka, Japan