Re: [HACKERS] Re: locales and MB (was: Postgres 6.5 beta2 and beta3 problem)

Started by Tom Lane over 26 years ago, 3 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

Currently the MB support allows several internal
encodings, including Unicode and mule-internal-code.
(Yes, you can do regexp/LIKE matching on Unicode data if MB support is
enabled.)

One of the things that bothers me about makeIndexable() is that it
doesn't seem to be multibyte-aware; does it really work in MB case?

regards, tom lane

#2 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#1)

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

Currently the MB support allows several internal
encodings, including Unicode and mule-internal-code.
(Yes, you can do regexp/LIKE matching on Unicode data if MB support is
enabled.)

One of the things that bothers me about makeIndexable() is that it
doesn't seem to be multibyte-aware; does it really work in MB case?

Yes. This is because I carefully chose multibyte encodings for
the backend that have the following characteristics:

o if the 8th bit of a byte is off, then it is an ASCII character
o otherwise it is part of a non-ASCII multibyte character

With these assumptions, makeIndexable() works very well with multibyte
chars.
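
As a rough illustration of the two conditions above, the following
Python sketch checks them character by character. The function name
high_bit_safe is hypothetical, and Python's own codecs ("euc_jp",
"utf-8", "shift_jis") are used here only as stand-ins for the
backend's encodings; nothing in it comes from the PostgreSQL source.

```python
def high_bit_safe(text: str, encoding: str) -> bool:
    """Return True if, for this text, every byte with the 8th bit off
    is an ASCII character and every non-ASCII character is encoded
    using only bytes with the 8th bit on (hypothetical helper)."""
    for ch in text:
        encoded = ch.encode(encoding)
        if ord(ch) < 0x80:
            # An ASCII character must be stored as its own single byte.
            if encoded != bytes([ord(ch)]):
                return False
        else:
            # No byte of a multibyte character may look like ASCII.
            if any(b < 0x80 for b in encoded):
                return False
    return True


sample = "日本語 abc%"
print(high_bit_safe(sample, "euc_jp"))    # EUC-JP keeps the property
print(high_bit_safe(sample, "utf-8"))     # so does UTF-8
print(high_bit_safe("ソ", "shift_jis"))   # Shift-JIS does not
```

Under an encoding that passes this check, code that scans byte by byte
for ASCII characters can never land in the middle of a multibyte
character by mistake.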

Not all multibyte encodings satisfy the above conditions. For example,
SJIS (an encoding for Japanese) and Big5 (for traditional Chinese)
do not satisfy those requirements. In these encodings the first
byte of a double-byte character always has the 8th bit on. However, the
second byte sometimes has the 8th bit off: this means we cannot
distinguish it from ASCII, since it may accidentally match the bit
pattern of an ASCII char. This is why I do not allow SJIS and Big5 as
server encodings. Users can use SJIS and Big5 as the client encoding,
however.
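
To make the failure mode concrete, here is a small Python sketch. The
function bytewise_fixed_prefix is a hypothetical stand-in for a
byte-wise pattern scan; it is not the actual makeIndexable() logic,
which is not shown in this thread. It assumes a LIKE-style pattern in
which '%', '_' and backslash are special, and it uses Python's
"shift_jis", "big5" and "euc_jp" codecs purely for illustration.

```python
# Bytes assumed to be special in a LIKE-style pattern (assumption).
LIKE_SPECIAL = b"%_\\"


def bytewise_fixed_prefix(pattern: bytes) -> bytes:
    """Return the bytes before the first special byte, looking at one
    byte at a time with no knowledge of multibyte characters."""
    for i, b in enumerate(pattern):
        if b in LIKE_SPECIAL:
            return pattern[:i]
    return pattern


# In Shift-JIS the katakana "ソ" is 0x83 0x5C; 0x5C is also backslash.
sjis = "ソフト%".encode("shift_jis")
print(bytewise_fixed_prefix(sjis))        # b'\x83' (half a character)

# In Big5 the character "功" is 0xA5 0x5C, with the same problem.
big5 = "功%".encode("big5")
print(bytewise_fixed_prefix(big5))        # b'\xa5'

# In EUC-JP every byte of "ソフト" has the 8th bit on, so the scan
# stops only at the real '%'.
euc = "ソフト%".encode("euc_jp")
print(bytewise_fixed_prefix(euc) == "ソフト".encode("euc_jp"))  # True
```

The scan itself is identical in all three cases; only the encoding
decides whether a trailing byte can be mistaken for an ASCII character.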

You might ask why I don't make makeIndexable() multibyte-aware. It is
definitely possible. But you should know there are many places that
would need to be multibyte-aware in this sense. The parser is one
good example. Making everything in the backend multibyte-aware is not
worth doing, in my opinion.
---
Tatsuo Ishii

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#2)

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

Yes. This is because I carefully chose multibyte encodings for
the backend that have the following characteristics:
o if the 8th bit of a byte is off, then it is an ASCII character
o otherwise it is part of a non-ASCII multibyte character

Ah so.

You might ask why I don't make makeIndexable() multibyte-aware. It is
definitely possible. But you should know there are many places that
would need to be multibyte-aware in this sense. The parser is one
good example.

Right, it's much easier to dodge the problem by restricting backend
encodings, and since we have conversions, that doesn't hurt anyone.
Now that I think about it, all the explicitly MB-aware code that
I've seen is in frontend stuff.
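
A minimal sketch of that division of labor, assuming an SJIS client
and an EUC-JP backend: the text is converted at the boundary, so the
backend only ever sees bytes that satisfy the 8th-bit rule. The
function client_to_server is hypothetical, and Python's codecs stand
in for the real conversion routines, which this thread does not show.

```python
def client_to_server(data: bytes,
                     client_encoding: str = "shift_jis",
                     server_encoding: str = "euc_jp") -> bytes:
    """Re-encode client bytes into the server-side encoding
    (hypothetical helper, for illustration only)."""
    return data.decode(client_encoding).encode(server_encoding)


client_bytes = "ソフト".encode("shift_jis")   # contains a 0x5C trail byte
server_bytes = client_to_server(client_bytes)

# After conversion, every byte of this non-ASCII text has the 8th bit on.
print(all(b >= 0x80 for b in server_bytes))

# Converting back for the client restores the original bytes.
print(client_to_server(server_bytes, "euc_jp", "shift_jis") == client_bytes)
```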

Thanks for the clue...

regards, tom lane