Re: [PATCHES] char/varchar locale support

Started by Thomas G. Lockhart over 27 years ago, 10 messages
#1Thomas G. Lockhart
lockhart@alumni.caltech.edu

(moved to hackers list)

I am working on extending locale support for char/varchar types.
Q1. I touched ...src/include/utils/builtins.h to insert the following
macros:
-----
#ifdef USE_LOCALE
#define pgstrcmp(s1,s2,l) strcoll(s1,s2)
#else
#define pgstrcmp(s1,s2,l) strncmp(s1,s2,l)
#endif
-----
Is it the right place? I think so; am I wrong?

Probably the right place. Probably the wrong code; see below...

Q2. Bartunov told me I should read varlena.c. I read it and found
that before every strcoll() call, memory is allocated for both strings
(to make them null-terminated). Oleg said I need the same for
varchar.
Do I really need to allocate space for varchar? What about char? Is it
0-terminated already?

No, neither bpchar nor varchar is guaranteed to be null terminated.
Yes, you will need to allocate (palloc()) local memory for this. Your
pgstrcmp() macros are not equivalent, since strncmp() will stop the
comparison at the specified limit (l), whereas strcoll() requires a
null-terminated string.
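
A minimal sketch of what the locale case has to do with a counted string, assuming malloc()/free() as stand-ins for palloc()/pfree() so that it compiles outside the backend. The helper names counted_to_cstring() and counted_strcoll() are hypothetical, not existing routines:

```c
#include <stdlib.h>
#include <string.h>

/* Copy a counted (not null-terminated) buffer into a freshly allocated,
 * null-terminated string. Returns NULL if allocation fails.
 * Assumption: malloc() stands in for palloc() here. */
static char *
counted_to_cstring(const char *data, size_t len)
{
    char *s = malloc(len + 1);

    if (s == NULL)
        return NULL;
    memcpy(s, data, len);
    s[len] = '\0';
    return s;
}

/* Compare two counted buffers with strcoll(), each with its own length.
 * Returns <0, 0 or >0 like strcoll(). On allocation failure it returns 0;
 * a real routine would report an error instead. */
static int
counted_strcoll(const char *d1, size_t len1, const char *d2, size_t len2)
{
    char *s1 = counted_to_cstring(d1, len1);
    char *s2 = counted_to_cstring(d2, len2);
    int   result = 0;

    if (s1 != NULL && s2 != NULL)
        result = strcoll(s1, s2);
    free(s1);
    free(s2);
    return result;
}
```

Unlike the strncmp() macro, nothing here reads past the counted length: the bytes after len1 or len2 in the source buffers never reach strcoll().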

If you look in varlena.c you will find several places with
#if USE_LOCALE
...
#else
...
#endif

Those blocks will need to be replicated in varchar.c for both bpchar and
varchar support routines.

The first example I looked at in varlena.c looks a bit troublesome :(
In the code snippet below (from text_lt), both input strings are copied
using the same length, even though the input lengths can be different.
Looks wrong to me:

memcpy(a1p, VARDATA(arg1), len);
*(a1p + len) = '\0';
memcpy(a2p, VARDATA(arg2), len);
*(a2p + len) = '\0';

Instead of "len" in each expression it should probably be
len1 = VARSIZE(arg1)-VARHDRSZ
len2 = VARSIZE(arg2)-VARHDRSZ

Another possibility for implementation is to write a string comparison
routine (e.g. varlena_cmp()) which takes two arguments and returns -1,
0, or 1 for less than, equals, and greater than. All of the comparison
routines can call that one (which would have the #if USE_LOCALE), rather
than having USE_LOCALE spread through each comparison routine.
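
A rough sketch of such a shared routine, under some assumptions: it works on raw pointer/length pairs rather than on real varlena structures, malloc() stands in for palloc(), and the names counted_cmp() and counted_lt() are made up for illustration:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared comparison routine for counted strings.
 * Returns -1, 0 or 1. Compile with -DUSE_LOCALE for strcoll() ordering. */
static int
counted_cmp(const char *d1, size_t len1, const char *d2, size_t len2)
{
    int result;

#ifdef USE_LOCALE
    /* strcoll() needs null-terminated input, so copy each string
     * using its own length. malloc() stands in for palloc(). */
    char *s1 = malloc(len1 + 1);
    char *s2 = malloc(len2 + 1);

    if (s1 == NULL || s2 == NULL)
    {
        free(s1);
        free(s2);
        abort();            /* a backend routine would raise an error */
    }
    memcpy(s1, d1, len1);
    s1[len1] = '\0';
    memcpy(s2, d2, len2);
    s2[len2] = '\0';
    result = strcoll(s1, s2);
    free(s1);
    free(s2);
#else
    /* Byte-wise comparison over the common prefix; if that ties,
     * the shorter string sorts first. */
    size_t minlen = (len1 < len2) ? len1 : len2;

    result = memcmp(d1, d2, minlen);
    if (result == 0 && len1 != len2)
        result = (len1 < len2) ? -1 : 1;
#endif

    /* Normalize to -1, 0 or 1. */
    return (result > 0) - (result < 0);
}

/* Each operator then reduces to a one-line wrapper. */
static int
counted_lt(const char *d1, size_t len1, const char *d2, size_t len2)
{
    return counted_cmp(d1, len1, d2, len2) < 0;
}
```

With this shape the #ifdef appears once, and the lt/le/gt/ge/eq routines for text, bpchar and varchar only differ in how they test the result.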

- Tom

#2Oleg Broytmann
phd@comus.ru
In reply to: Thomas G. Lockhart (#1)
Re: [HACKERS] Re: [PATCHES] char/varchar locale support

Hi!

On Fri, 15 May 1998, Thomas G. Lockhart wrote:

Another possibility for implementation is to write a string comparison
routine (e.g. varlena_cmp()) which takes two arguments and returns -1,
0, or 1 for less than, equals, and greater than. All of the comparison
routines can call that one (which would have the #if USE_LOCALE), rather
than having USE_LOCALE spread through each comparison routine.

Yes, I thought about this recently. It seems the best solution, perhaps.
Thank you. I'll continue my work.

Oleg.
----
Oleg Broytmann http://members.tripod.com/~phd2/ phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.

#3Mattias Kregert
matti@algonet.se
In reply to: Oleg Broytmann (#2)
Re: [HACKERS] Re: [PATCHES] char/varchar locale support

Oleg Broytmann wrote:

Hi!

On Fri, 15 May 1998, Thomas G. Lockhart wrote:

Another possibility for implementation is to write a string comparison
routine (e.g. varlena_cmp()) which takes two arguments and returns -1,
0, or 1 for less than, equals, and greater than. All of the comparison
routines can call that one (which would have the #if USE_LOCALE), rather
than having USE_LOCALE spread through each comparison routine.

Yes, I thought about this recently. It seems the best solution, perhaps.
Thank you. I'll continue my work.

Oleg.
----
Oleg Broytmann http://members.tripod.com/~phd2/ phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.

Shouldn't this be done only for NATIONAL CHAR?

/* m */

#4Oleg Broytmann
phd@comus.ru
In reply to: Mattias Kregert (#3)
Re: [HACKERS] Re: [PATCHES] char/varchar locale support

Hi!

On Mon, 18 May 1998, Mattias Kregert wrote:

Another possibility for implementation is to write a string comparison
routine (e.g. varlena_cmp()) which takes two arguments and returns -1,
0, or 1 for less than, equals, and greater than. All of the comparison
routines can call that one (which would have the #if USE_LOCALE), rather
than having USE_LOCALE spread through each comparison routine.

Shouldn't this be done only for NATIONAL CHAR?

It is what USE_LOCALE is intended for, isn't it?

Oleg.
----
Oleg Broytmann http://members.tripod.com/~phd2/ phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.

#5Thomas G. Lockhart
lockhart@alumni.caltech.edu
In reply to: Oleg Broytmann (#4)
Re: [HACKERS] Re: [PATCHES] char/varchar locale support

Shouldn't this be done only for NATIONAL CHAR?

It is what USE_LOCALE is intended for, isn't it?

SQL92 defines NATIONAL CHAR/VARCHAR as the data type to support implicit
local character sets. The usual CHAR/VARCHAR would use the default
SQL_TEXT character set. I suppose we could extend it to include NATIONAL
TEXT also...

Additionally, SQL92 allows one to specify an explicit character set and
an explicit collating sequence. The standard is not explicit on how one
actually makes these known to the database, but Postgres should be well
suited to accomplishing this.

Anyway, I'm not certain how common and widespread the NATIONAL CHAR
usage is. Would users with installations having non-English data find
using NCHAR/NATIONAL CHAR/NATIONAL CHARACTER an inconvenience? Or would
most non-English installations find this better and more solid??

At the moment we have support for Russian and Japanese character sets,
and these would need the maintainers to agree to changes.

btw, if we do implement NATIONAL CHARACTER I would like to do so by
having it fit in with the full SQL92 character sets and collating
sequences capabilities. Then one could specify what NATIONAL CHAR means
for an installation or perhaps at run time without having to
recompile...

- Tom

#6Peter Mount
peter@maidstone.gov.uk
In reply to: Thomas G. Lockhart (#5)
RE: [HACKERS] Re: [PATCHES] char/varchar locale support

I can't remember what the outcome was, but what about UNICODE?

One of the partially implemented bits of JDBC is the handling of UNICODE
strings (which Java uses all the time).

--
Peter T Mount, peter@maidstone.gov.uk, peter@retep.org.uk
JDBC FAQ: http://www.retep.org.uk/postgres
Please note that this is from my works email. If you reply, please cc my
home address.

-----Original Message-----
From: owner-pgsql-hackers@hub.org [mailto:owner-pgsql-hackers@hub.org]On
Behalf Of Thomas G. Lockhart
Sent: Monday, May 18, 1998 4:43 PM
To: phd2@earthling.net
Cc: Mattias Kregert; Postgres Hackers List; oleg@sai.msu.su; Tatsuo
Ishii
Subject: Re: [HACKERS] Re: [PATCHES] char/varchar locale support

Shouldn't this be done only for NATIONAL CHAR?

It is what USE_LOCALE is intended for, isn't it?

SQL92 defines NATIONAL CHAR/VARCHAR as the data type to support implicit
local character sets. The usual CHAR/VARCHAR would use the default
SQL_TEXT character set. I suppose we could extend it to include NATIONAL
TEXT also...

Additionally, SQL92 allows one to specify an explicit character set and
an explicit collating sequence. The standard is not explicit on how one
actually makes these known to the database, but Postgres should be well
suited to accomplishing this.

Anyway, I'm not certain how common and widespread the NATIONAL CHAR
usage is. Would users with installations having non-English data find
using NCHAR/NATIONAL CHAR/NATIONAL CHARACTER an inconvenience? Or would
most non-English installations find this better and more solid??

At the moment we have support for Russian and Japanese character sets,
and these would need the maintainers to agree to changes.

btw, if we do implement NATIONAL CHARACTER I would like to do so by
having it fit in with the full SQL92 character sets and collating
sequences capabilities. Then one could specify what NATIONAL CHAR means
for an installation or perhaps at run time without having to
recompile...

- Tom

#7Thomas G. Lockhart
lockhart@alumni.caltech.edu
In reply to: Peter Mount (#6)
Re: [HACKERS] Re: [PATCHES] char/varchar locale support

I can't remember what the outcome was, but what about UNICODE?
One of the partially implemented bits of JDBC is the handling of
UNICODE strings (which Java uses all the time).

I can't remember the outcome either, but when this was discussed on the
list earlier I had posted a url reference to a character coding
discussion from the DocBook SGML folks. I vaguely recall that (for their
typesetting purposes) UNICODE didn't solve all problems.

I also vaguely recall that the most common extended-byte encoding
sequence is that used in Japan (EUC-jp?).

Are we ready to gear up for another discussion on this topic? If so,
someone should go through the archives and summarize the previous
discussions so we don't re-invent the wheel...

- Tom

#8Brett McCormick
brett@work.chicken.org
In reply to: Thomas G. Lockhart (#7)
Re: [HACKERS] Re: [PATCHES] char/varchar locale support

Speaking of archives, the digest archives are a little hard to use.
A standard mailing list archive would be grand -- I can probably point
to some software if need be. Are the postgres lists archived in the
standard majordomo way (as in, Berkeley mail format)?

On Mon, 18 May 1998, at 16:29:13, Thomas G. Lockhart wrote:

I can't remember what the outcome was, but what about UNICODE?
One of the partially implemented bits of JDBC is the handling of
UNICODE strings (which Java uses all the time).

I can't remember the outcome either, but when this was discussed on the
list earlier I had posted a url reference to a character coding
discussion from the DocBook SGML folks. I vaguely recall that (for their
typesetting purposes) UNICODE didn't solve all problems.

I also vaguely recall that the most common extended-byte encoding
sequence is that used in Japan (EUC-jp?).

Are we ready to gear up for another discussion on this topic? If so,
someone should go through the archives and summarize the previous
discussions so we don't re-invent the wheel...

- Tom

#9Michal Mosiewicz
mimo@interdata.com.pl
In reply to: Oleg Broytmann (#4)
Re: [HACKERS] Re: [PATCHES] char/varchar locale support

Thomas G. Lockhart wrote:

btw, if we do implement NATIONAL CHARACTER I would like to do so by
having it fit in with the full SQL92 character sets and collating
sequences capabilities. Then one could specify what NATIONAL CHAR means
for an installation or perhaps at run time without having to
recompile...

I fully agree that there should be a CREATE COLLATION syntax or similar,
with the ability to add a collation keyword in every place that needs a
character comparison, like btree indexes, ordering, or simply comparison
operators.

This means that we should probably start by creating three-parameter
comparison functions, with a third parameter added to select the collation.
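
As an illustration only, a three-parameter comparison could look like the sketch below. The collation_id enum, its members and the function names are hypothetical and do not correspond to anything in the Postgres sources; null-terminated strings are used to keep it short:

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical collation selectors, invented for this sketch. */
typedef enum
{
    COLL_BINARY,            /* plain byte order, strcmp() */
    COLL_CASE_INSENSITIVE,  /* case folded with tolower() */
    COLL_LOCALE             /* current LC_COLLATE, strcoll() */
} collation_id;

/* Case-insensitive comparison of two null-terminated strings. */
static int
ci_cmp(const char *a, const char *b)
{
    for (;; a++, b++)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return ca - cb;
        if (ca == 0)
            return 0;       /* both strings ended together */
    }
}

/* Three-parameter comparison: the third argument selects the collation.
 * Returns -1, 0 or 1. */
static int
cstring_cmp3(const char *a, const char *b, collation_id coll)
{
    int r;

    switch (coll)
    {
        case COLL_LOCALE:
            r = strcoll(a, b);
            break;
        case COLL_CASE_INSENSITIVE:
            r = ci_cmp(a, b);
            break;
        case COLL_BINARY:
        default:
            r = strcmp(a, b);
            break;
    }
    return (r > 0) - (r < 0);
}
```

The same two strings can then order differently depending on the selector, which is the property an index or sort would need to record along with the column.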

Additionally, it's worth noting that using strcoll is very expensive.
I've got some reports from people who used PostgreSQL with national
characters and noticed performance drops of up to 20 times (Linux). So
we need to create cheap comparison functions that preserve their
translation tables during sessions.
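
One way to picture a comparison that keeps its translation table for the whole session is sketched below. It is a deliberate simplification and rests on several assumptions: a single-byte encoding, one weight per byte (so no multi-level weights or multi-character collating elements), and a table built lazily from strcoll() on one-character strings. All names are hypothetical:

```c
#include <stdlib.h>
#include <string.h>

static unsigned char weight[256];   /* collation rank of each byte value */
static int weight_ready = 0;        /* table is built once, then reused */

/* qsort() callback: order two byte values by the current locale. */
static int
byte_coll_cmp(const void *pa, const void *pb)
{
    unsigned char ua = *(const unsigned char *) pa;
    unsigned char ub = *(const unsigned char *) pb;
    char a[2] = { (char) ua, '\0' };
    char b[2] = { (char) ub, '\0' };
    int  r = strcoll(a, b);

    if (r == 0)             /* tie-break on byte value for a total order */
        r = (int) ua - (int) ub;
    return r;
}

/* Build the rank table; the only place strcoll() is ever called. */
static void
build_weights(void)
{
    unsigned char order[256];
    int i;

    for (i = 0; i < 256; i++)
        order[i] = (unsigned char) i;
    qsort(order, 256, sizeof order[0], byte_coll_cmp);
    for (i = 0; i < 256; i++)
        weight[order[i]] = (unsigned char) i;
    weight_ready = 1;
}

/* Compare two counted strings by table lookup. Returns -1, 0 or 1. */
static int
table_cmp(const char *d1, size_t len1, const char *d2, size_t len2)
{
    size_t i;
    size_t minlen = (len1 < len2) ? len1 : len2;

    if (!weight_ready)
        build_weights();
    for (i = 0; i < minlen; i++)
    {
        int w1 = weight[(unsigned char) d1[i]];
        int w2 = weight[(unsigned char) d2[i]];

        if (w1 != w2)
            return (w1 < w2) ? -1 : 1;
    }
    if (len1 == len2)
        return 0;
    return (len1 < len2) ? -1 : 1;
}
```

After the first call, each comparison costs one table lookup per byte and needs no allocation or null termination. The result will not match strcoll() in locales whose ordering cannot be expressed as a single per-byte rank.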

Anyhow, if anybody wants to try the inefficient strcoll, a long time ago I
sent a patch to sort chars/varchars using it. But I don't recommend it.

Mike

--
WWW: http://www.lodz.pdi.net/~mimo tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND

#10Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Michal Mosiewicz (#9)
Re: [HACKERS] Re: [PATCHES] char/varchar locale support

Shouldn't this be done only for NATIONAL CHAR?

It is what USE_LOCALE is intended for, isn't it?

LOCALE is not very useful for multi-byte speakers.

SQL92 defines NATIONAL CHAR/VARCHAR as the data type to support implicit
local character sets. The usual CHAR/VARCHAR would use the default
SQL_TEXT character set. I suppose we could extend it to include NATIONAL
TEXT also...

Additionally, SQL92 allows one to specify an explicit character set and
an explicit collating sequence. The standard is not explicit on how one
actually makes these known to the database, but Postgres should be well
suited to accomplishing this.

Anyway, I'm not certain how common and widespread the NATIONAL CHAR
usage is. Would users with installations having non-English data find
using NCHAR/NATIONAL CHAR/NATIONAL CHARACTER an inconvenience? Or would
most non-English installations find this better and more solid??

The capability to specify implicit character sets for CHAR (that's
what MB does) looks sufficient for multi-byte speakers, except for
collating sequences.

One question about SQL92's NCHAR is how one can specify several
character sets at one time. As you might know, Japanese, Chinese, and
Korean use multiple character sets. For example, EUC_JP, a widely used
Japanese encoding system on Unix, includes 4 character sets: ASCII,
JISX0201, JISX0208 and JISX0212.
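
To show how the four character sets coexist in one byte stream, here is a small sketch that classifies an EUC_JP character by its lead byte. The enum and function names are made up for illustration, and the sketch does not validate the trailing bytes:

```c
#include <stddef.h>

/* Hypothetical labels for the character sets carried by EUC_JP. */
typedef enum
{
    CS_ASCII,       /* 0x00-0x7F, 1 byte */
    CS_JISX0201,    /* lead byte 0x8E (SS2), half-width kana, 2 bytes */
    CS_JISX0208,    /* lead byte 0xA1-0xFE, 2 bytes */
    CS_JISX0212,    /* lead byte 0x8F (SS3), 3 bytes */
    CS_INVALID      /* not a valid lead byte */
} eucjp_charset;

/* Classify a character by its lead byte and report its length in bytes
 * through *nbytes. Trailing bytes are not checked. */
static eucjp_charset
eucjp_classify(unsigned char lead, size_t *nbytes)
{
    if (lead < 0x80)
    {
        *nbytes = 1;
        return CS_ASCII;
    }
    if (lead == 0x8E)
    {
        *nbytes = 2;
        return CS_JISX0201;
    }
    if (lead == 0x8F)
    {
        *nbytes = 3;
        return CS_JISX0212;
    }
    if (lead >= 0xA1 && lead <= 0xFE)
    {
        *nbytes = 2;
        return CS_JISX0208;
    }
    *nbytes = 1;
    return CS_INVALID;
}
```

A single string can therefore switch character sets from one character to the next, which is why naming one character set per column does not fit this encoding well.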

At the moment we have support for Russian and Japanese character sets,
and these would need the maintainers to agree to changes.

Additionally we have support for Chinese and Korean. Moreover, if the mule
internal code or Unicode is preferred for the internal encoding system,
one could use almost any language in the world:-)

btw, if we do implement NATIONAL CHARACTER I would like to do so by
having it fit in with the full SQL92 character sets and collating
sequences capabilities. Then one could specify what NATIONAL CHAR means
for an installation or perhaps at run time without having to
recompile...

Collating sequences look very useful.
Also it would be nice if we could specify default character sets when
creating a database, table or fields.
--
Tatsuo Ishii
t-ishii@sra.co.jp