Unicode combining characters
Hi all,
while working on a new project involving PostgreSQL and running some
tests, I came across the following output from psql:
lang | length | length | text | text
------+--------+--------+-----------+-----------
isl | 7 | 6 | álíta | áleit
isl | 7 | 7 | álíta | álitum
isl | 7 | 7 | álíta | álitið
isl | 5 | 4 | maður | mann
isl | 5 | 7 | maður | mönnum
isl | 5 | 5 | maður | manna
isl | 5 | 4 | óska | -aði
[the misalignment is what I got, it's not a copy-paste error]
This is pasted from a UTF-8 xterm running psql under a UTF-8 locale,
querying a database created with -E UNICODE (by the way, these are
Icelandic words :) ).
What you see above is misleading, since it's not possible to see that
'á', 'í', 'ó' and 'ö' are using combining marks, while 'ð' is not.
As a reminder, combining marks in Unicode mean that á can actually be
encoded as a + ' (where ' stands for the combining acute accent).
Encoded in UTF-8, that is <61 cc 81> [UTF-16: 0061 0301],
instead of <c3 a1> [UTF-16: 00E1].
The "length" fields are what is returned by length(a.text) and
length(b.text).
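To make the difference concrete, here is a tiny stand-alone C sketch
(purely illustrative, not part of any patch): it counts UTF-8 code
points the naive way and reports 2 for the decomposed form of á and 1
for the precomposed one, even though both take a single display column.

#include <stdio.h>

/* count UTF-8 code points: every byte except the 10xxxxxx
 * continuation bytes starts a new code point */
static int utf8_codepoints(const unsigned char *s)
{
    int n = 0;
    for (; *s; s++)
        if ((*s & 0xc0) != 0x80)
            n++;
    return n;
}

int main(void)
{
    const unsigned char decomposed[]  = "\x61\xcc\x81"; /* a + combining acute */
    const unsigned char precomposed[] = "\xc3\xa1";     /* U+00E1, precomposed á */

    /* both render as one column, but the code point counts differ */
    printf("decomposed:  %d\n", utf8_codepoints(decomposed));  /* prints 2 */
    printf("precomposed: %d\n", utf8_codepoints(precomposed)); /* prints 1 */
    return 0;
}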
So, this shows two problems:
- length() on the server side doesn't handle Unicode correctly [I have
the same result with char_length()]: it returns the number of code
points (as it is, admittedly, advertised to do), rather than the number
of characters in the string.
- the psql frontend makes the same mistake (hence the misaligned
columns above).
I am using version 7.1.3 (Debian sid), so it may have been corrected
in the meantime (in that case, I apologise, but I have only recently
started using PostgreSQL again and I haven't followed -hackers long
enough).
=> I think fixing psql shouldn't be too complicated, as glibc should
provide the locale and return the right values (is this the case?
And what happens for combined Latin + Chinese characters, for
example? I'll have to try that later). If it's not fixed already,
do you want me to look at this? [it will take some time, as I haven't
set up any development environment for postgres yet, and I'm away for
one week from Thursday].
=> regarding the backend, it may be more complex, as the underlying
system may not provide any UTF-8 locale to use (which is different
from being UTF-8 aware: an administrator may have decided that UTF-8
locales are useless on a server, as only root connections are made,
and he wants only the C locale on the console - I've seen that quite
often ;) ).
This brings me to another subject: I will need to support the full
Unicode collation algorithm (UCA, as described in TR#10 [1] of the
Unicode Consortium), and I will need to be able to sort according to
locales which may not be installed on the backend server (some of
which may not even be recognised by GNU libc, which already supports
more than 140 locales -- artificial languages would be an example). I
will also need to be able to normalise the Unicode strings (TR#15 [2])
so that I don't have some characters as legacy codepoints [as 00E1
above], and others with combining marks.
There is already a Perl implementation of the needed functionality,
in Unicode::Collate and Unicode::Normalize (which I haven't tried yet
:( ). But as they are Perl modules, the untrusted version of Perl,
plperlu, will be needed, and that's a pity for what I consider a
future core functionality (not that plperlu isn't a good thing - I
can't wait for it! - but that an untrusted PL language is needed to
support normalisation and collation).
Note also that there are a lot of data associated with these
algorithms, as you could expect.
I was wondering if some people have already thought about this, or
already done something, or if some of you are interested in this. If
nobody does anything, I'll do something eventually, probably before
Christmas (I don't have much time for this, and I don't need the
functionality right now), but if there is an interest, I could team
with others and develop it faster :)
Anyway, I'm open to suggestions :
- implement it in C, in the core,
- implement it in C, as contributed custom functions,
- implement it in perl (by reusing Unicode:: work), in a trusted plperl,
- implement it in perl, calling Unicode:: modules, in an untrusted
plperl.
and then :
- provide the data in tables (system and/or user) - which should be
available across databases,
- load the data from the original text files provided in Unicode (and
other as needed), if the functionality is compiled into the server.
- I believe the basic unicode information should be standard, and the
locales should be provided as contrib/ files to be plugged in as
needed.
I can't really accept a solution which would rely on the underlying
libc, as it may not provide the necessary locales (or maybe, then,
have a way to override the collating tables with user tables -
actually, that would certainly be the best solution if it's in the
core, as the tables will put an extra burden on the distribution and
the installation footprint, especially since the tables are already
there for glibc, for perl 5.6+, and for other software dealing with
Unicode).
The main functions I foresee are:
- provide a normalisation function to all 4 forms,
- provide a collation_key(text, language) function; as the calculation
of the key may be expensive, some may want to index on the result (I
would :) - see the toy sketch below),
- provide a collation algorithm, using the two previous facilities,
which can do primary to tertiary collation (cf. TR#10 for a detailed
explanation).
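To show why a separate collation_key() is worth indexing, here is a toy
C sketch. The primary weights below are invented for the example (they
are not the real UCA tables), and a real key would also carry secondary
(accent) and tertiary (case) levels; the point is only that once
characters are mapped to weights, a plain memcmp() of the keys gives
the desired ordering.

#include <stdio.h>
#include <string.h>

/* invented primary weights: an accented letter gets the same primary
 * weight as its base letter, so both sort together at this level */
static unsigned char primary_weight(unsigned int cp)
{
    switch (cp)
    {
        case 'a':
        case 0x00E1:            /* á */
            return 0x10;
        case 'd':
            return 0x13;
        case 'e':
            return 0x14;
        default:
            return 0x7f;        /* everything else sorts last here */
    }
}

/* build a primary-level sort key from an array of code points */
static void collation_key(const unsigned int *cp, int n, unsigned char *key)
{
    int i;

    for (i = 0; i < n; i++)
        key[i] = primary_weight(cp[i]);
    key[n] = '\0';
}

int main(void)
{
    /* "ád" vs "ae": by code point value, U+00E1 sorts after 'e',
     * but at the primary level 'á' sorts with 'a', so "ád" comes first */
    unsigned int s1[] = { 0x00E1, 'd' };
    unsigned int s2[] = { 'a', 'e' };
    unsigned char k1[3], k2[3];

    collation_key(s1, 2, k1);
    collation_key(s2, 2, k2);
    printf("memcmp(key1, key2) = %d\n", memcmp(k1, k2, 3)); /* negative */
    return 0;
}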
I haven't looked at PostgreSQL code yet (shame !), so I may be
completely off-track, in which case I'll retract myself and won't
bother you again (on that subject, that is ;) )...
Comments ?
Patrice.
[1]: http://www.unicode.org/unicode/reports/tr10/
[2]: http://www.unicode.org/unicode/reports/tr15/
--
Patrice HÉDÉ ------------------------------- patrice à islande.org -----
-- Isn't it weird how scientists can imagine all the matter of the
universe exploding out of a dot smaller than the head of a pin, but they
can't come up with a more evocative name for it than "The Big Bang" ?
-- What would _you_ call the creation of the universe ?
-- "The HORRENDOUS SPACE KABLOOIE !" - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----
So, this shows two problems:
- length() on the server side doesn't handle Unicode correctly [I have
the same result with char_length()]: it returns the number of code
points (as it is, admittedly, advertised to do), rather than the number
of characters in the string.
This is a known limitation.
- the psql frontend makes the same mistake.
I am using version 7.1.3 (debian sid), so it may have been corrected
in the meantime (in this case, I apologise, but I have only recently
started again to use PostgreSQL and I haven't followed -hackers long
enough).
=> I think fixing psql shouldn't be too complicated, as the glibc
should be providing the locale, and return the right values (is this
the case ? and what happens for combined latin + chinese characters
for example ? I'll have to try that later). If it's not fixed already,
do you want me to look at this ? [it will take some time, as I haven't
set up any development environment for postgres yet, and I'm away for
one week from thursday].
Sounds great.
I was wondering if some people have already thought about this, or
already done something, or if some of you are interested in this. If
nobody does anything, I'll do something eventually, probably before
Christmas (I don't have much time for this, and I don't need the
functionality right now), but if there is an interest, I could team
with others and develop it faster :)
I'm very interested in your point. I will start studying [1][2] after
the beta freeze.
Anyway, I'm open to suggestions :
- implement it in C, in the core,
- implement it in C, as contributed custom functions,
This may be a good starting point.
I can't really accept a solution which would rely on the underlying
libc, as it may not provide the necessary locales (or maybe, then,
I totally agree here.
The main functions I foresee are :
- provide a normalisation function to all 4 forms,
- provide a collation_key(text, language) function, as the calculation
of the key may be expensive, some may want to index on the result (I
would :) ),
- provide a collation algorithm, using the two previous facilities,
which can do primary to tertiary collation (cf TR#10 for a detailed
explanation).
I haven't looked at PostgreSQL code yet (shame !), so I may be
completely off-track, in which case I'll retract myself and won't
bother you again (on that subject, that is ;) )...
Comments ?
--
Tatsuo Ishii
Looks like a good project for 7.3
Probably the best starting point would be to develop contrib/unicode
with a smooth transition to core.
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Hi,
* Tatsuo Ishii <t-ishii@sra.co.jp> [010925 18:18]:
So, this shows two problems :
- length() on the server side doesn't handle Unicode correctly [I
have the same result with char_length()]: it returns the number
of code points (as it is, admittedly, advertised to do), rather than
the number of characters in the string.
This is a known limitation.
To solve this, we could use wcwidth() (there is a custom
implementation for the systems which don't have it in the glibc). I'll
have a look at it later.
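For reference, a minimal stand-alone sketch of that idea (assuming a
UTF-8 locale is installed; the locale names below are only examples):
convert the decomposed "álíta" to wide characters with mbstowcs() and
ask wcswidth() for the display width, which gives 5 columns for 7 wide
characters.

#define _XOPEN_SOURCE 700
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    /* "álíta" in decomposed UTF-8: a, U+0301, l, i, U+0301, t, a */
    const char *s = "a\xcc\x81" "li\xcc\x81" "ta";
    wchar_t wbuf[64];
    size_t n;

    /* needs some UTF-8 locale; the names here are just examples */
    if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL &&
        setlocale(LC_CTYPE, "C.UTF-8") == NULL)
    {
        fprintf(stderr, "no UTF-8 locale available\n");
        return 1;
    }

    n = mbstowcs(wbuf, s, 64);
    if (n == (size_t) -1)
    {
        fprintf(stderr, "invalid multibyte string\n");
        return 1;
    }

    /* 7 wide characters but only 5 columns: the combining accents
     * have a column width of 0 */
    printf("wide chars: %zu, display columns: %d\n", n, wcswidth(wbuf, n));
    return 0;
}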
- the psql frontend makes the same mistake.
Same thing here.
I have just installed the CVS and downloaded the development version
(thanks Baldvin), tested that the stock version compiles fine, and
I'll now have a look at how to make this work. :) I'll send a patch
when I have this working here.
Sounds great.
[Unicode normalisation and collation in the backend]
I'm very interested in your point. I will start studying [1][2] after
the beta freeze.
Anyway, I'm open to suggestions :
- implement it in C, in the core,
- implement it in C, as contributed custom functions,
This may be a good starting point.
I can't really accept a solution which would rely on the underlying
libc, as it may not provide the necessary locales (or maybe, then,
I totally agree here.
As Oleg suggested, I will try to aim for 7.3, first with a version in
contrib, and later, if the implementation is fine, it could be moved
to the core (or not ? Though it would be nice to make sure every
PostgreSQL installation which supports unicode has it, so that users
won't need to have administrative rights to use the functionality).
I think I will go for a C version, and probably the collation and
normalisation data in tables, with some way to override the defaults
with secondary tables... I'll report as soon as I have something +/-
working.
Patrice.
--
Patrice HÉDÉ ------------------------------- patrice à islande org -----
- length() on the server side doesn't handle Unicode correctly [I
have the same result with char_length()]: it returns the number
of code points (as it is, admittedly, advertised to do), rather than
the number of characters in the string.
This is a known limitation.
To solve this, we could use wcwidth() (there is a custom
implementation for the systems which don't have it in the glibc). I'll
have a look at it later.
And wcwidth() depends on the locale. That is another reason we
could not use it.
As Oleg suggested, I will try to aim for 7.3, first with a version in
contrib, and later, if the implementation is fine, it could be moved
to the core (or not ? Though it would be nice to make sure every
PostgreSQL installation which supports unicode has it, so that users
won't need to have administrative rights to use the functionality).
I would like to see SQL99's charset and collate functionality in 7.3 (or
later). If this happens, the current multibyte implementation would be
dramatically changed. That would be a good time to merge your
Unicode stuff into the main source tree.
--
Tatsuo Ishii
I would like to see SQL99's charset, collate functionality for 7.3 (or
later). If this happens, current multibyte implementation would be
dramatically changed...
I'm *still* interested in working on this (an old story I know). I'm
working on date/time stuff for 7.2, but hopefully 7.3 will see some
advances in the SQL99 direction on charset etc.
- Thomas
I would like to see SQL99's charset, collate functionality for 7.3 (or
later). If this happens, current multibyte implementation would be
dramatically changed...
I'm *still* interested in working on this (an old story I know). I'm
working on date/time stuff for 7.2, but hopefully 7.3 will see some
advances in the SQL99 direction on charset etc.
BTW, I see "CHARACTER SET" in gram.y. Does current already support
that syntax?
--
Tatsuo Ishii
BTW, I see "CHARACTER SET" in gram.y. Does current already support
that syntax?
Yes and no. gram.y knows about CHARACTER SET, but only for the long
form; the clause is in the wrong position (it precedes the length
specification) and it does not do anything very useful (it generates a
data type based on the character set name which does not get recognized
farther back). Examples:
thomas=# create table t1 (c varchar(20) character set sql_ascii);
ERROR: parser: parse error at or near "character"
thomas=# create table t1 (c character varying character set sql_ascii
(20));
ERROR: Unable to locate type name 'varsql_ascii' in catalog
I'm pretty sure I'll get shift/reduce troubles when trying to move that
clause to *after* the length specifier. I'll try to do something with
the syntax for 7.2 once I've finished the date/time stuff.
- Thomas
Can someone give me TODO items for this discussion?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Can someone give me TODO items for this discussion?
What about:
Improve Unicode combined character handling
--
Tatsuo Ishii
Can someone give me TODO items for this discussion?
What about:
Improve Unicode combined character handling
Done. I can't update the web version because I don't have permission.
Also, have we decided if multibyte should be the configure default now?
--
Bruce Momjian
Also, have we decided if multibyte should be the configure default now?
Not sure.
Anyway, I have run a LIKE/REGEX query test using current. The queries
executed are:
explain analyze select '0000000 5089 474e...( 16475
bytes long text containing only 0-9a-z chars) like 'aaa';
and
explain analyze select '0000000 5089 474e...( 16475
bytes long text containing only 0-9a-z chars) ~ 'aaa';
Here is the result:
no MB with MB
LIKE 0.09 msec 0.08 msec
REGEX 0.09 msec 0.10 msec
LIKE with MB seemed to be reasonably fast, but REGEX with MB seemed a
little bit slow. Probably this is due to the wide character conversion
overhead.
--
Tatsuo Ishii
If no one can find a case where multibyte is slower, I think we should
enable it by default. Comments?
--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
If no one can find a case where multibyte is slower, I think we should
enable it by default. Comments?
Well, he just did point out such a case:
no MB with MB
LIKE 0.09 msec 0.08 msec
REGEX 0.09 msec 0.10 msec
But I agree with your conclusion. If the worst penalty we can find is
that a regex comparison operator is 10% slower, we may as well turn it
on by default. Most people will never notice the difference, and anyone
who really cares can always turn it off again.
regards, tom lane
But the strange thing is that LIKE is faster, perhaps meaning his
measurements can't even see the difference, or is it because the LIKE
optimization is off for multibyte?
--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
But the strange thing is that LIKE is faster, perhaps meaning his
measurements can't even see the difference,
Yeah, I suspect there's 10% or more noise in these numbers. But then
one could read the results as saying we can't reliably measure any
difference at all ...
I'd feel more confident if the measurements were done using operators
repeated enough times to yield multiple-second runtimes. I don't
trust fractional-second time measurements on Unix boxen; too much chance
of bogus results due to activity of other processes.
regards, tom lane
Tatsuo Ishii writes:
LIKE with MB seemed to be reasonably fast, but REGEX with MB seemed a
little bit slow. Probably this is due to the wide character conversion
overhead.
Could this conversion be optimized to recognize when it's dealing with a
single-byte character encoding?
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
Yeah, I suspect there's 10% or more noise in these numbers. But then
one could read the results as saying we can't reliably measure any
difference at all ...
I'd feel more confident if the measurements were done using operators
repeated enough times to yield multiple-second runtimes. I don't
trust fractional-second time measurements on Unix boxen; too much chance
of bogus results due to activity of other processes.
Any idea how to do that? I tried to do measurements using something like
"SELECT * FROM t1 WHERE very-long-string-column LIKE 'aaa'", but I'm
afraid the I/O time masks the difference...
--
Tatsuo Ishii
LIKE with MB seemed to be reasonably fast, but REGEX with MB seemed a
little bit slow. Probably this is due to the wide character conversion
overhead.
Could this conversion be optimized to recognize when it's dealing with a
single-byte character encoding?
Not sure, will look into...
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
I'd feel more confident if the measurements were done using operators
repeated enough times to yield multiple-second runtimes.
Any idea to do that?
Maybe something like this: declare a plpgsql function that takes two
text parameters and has a body like
for (i = 0 to a million)
boolvar := $1 like $2;
Then call it with strings of different lengths and see how the runtime
varies. You need to apply the LIKE to function parameters, else the
system will probably collapse the LIKE operation to a constant...
regards, tom lane
Maybe something like this: declare a plpgsql function that takes two
text parameters and has a body like
for (i = 0 to a million)
boolvar := $1 like $2;
Then call it with strings of different lengths and see how the runtime
varies. You need to apply the LIKE to function parameters, else the
system will probably collapse the LIKE operation to a constant...
Good idea. I did tests for both LIKE and REGEX using PL/pgSQL
functions (see source code below). Here are the results. What I did was
call the functions with target strings ranging from 32 bytes to
8192 bytes. Times are all in msec.
(1) LIKE
bytes Without MB With MB
32 8121.94 8094.73
64 8167.98 8105.24
128 8151.30 8108.61
256 8090.12 8098.20
512 8111.05 8101.07
1024 8110.49 8099.61
2048 8095.32 8106.00
4096 8094.88 8091.19
8192 8123.02 8121.63
(2) REGEX
bytes Without MB With MB
32 117.93 119.47
64 126.41 127.61
128 143.97 146.55
256 180.49 183.69
512 255.53 256.16
1024 410.59 409.22
2048 5176.38 5181.99
4096 6000.82 5627.84
8192 6529.15 6547.10
------------- shell script -------------------
for i in 32 64 128 256 512 1024 2048 4096 8192
do
psql -c "explain analyze select liketest(a,'aaa') from (select substring('very_long_text' from 0 for $i) as a) as a" test
done
------------- shell script -------------------
------------- functions -----------------
drop function liketest(text,text);
create function liketest(text,text) returns bool as '
declare
i int;
rtn boolean;
begin
i := 1000000;
while i > 0 loop
rtn := $1 like $2;
i := i - 1;
end loop;
return rtn;
end;
' language 'plpgsql';
drop function regextest(text,text);
create function regextest(text,text) returns bool as '
declare
i int;
rtn boolean;
begin
i := 10000;
while i > 0 loop
rtn := $1 ~ $2;
i := i - 1;
end loop;
return rtn;
end;
' language 'plpgsql';
------------- functions -----------------
------------- shell script -------------------
for i in 32 64 128 256 512 1024 2048 4096 8192
do
psql -c "explain analyze select liketest(a,'aaa') from
(select substring('very_long_text' from 0 for $i) as a) as a" test
done
------------- shell script -------------------
I don't think your search string is sufficient for a test.
With 'aaa' it actually knows that it only needs to look at the
first three characters of a. Imho you need to try something
like liketest(a,'%aaa%').
Andreas
I don't think your search string is sufficient for a test.
With 'aaa' it actually knows that it only needs to look at the
first three characters of a. Imho you need to try something
like liketest(a,'%aaa%').
Ok. I ran the modified test (now the iteration is reduced to 100000 in
liketest()). As you can see, there's a huge difference. MB seems up to
~8 times slower :-< There seem to be some problems in the
implementation. Considering REGEX is not so slow, maybe we should
employ the same design as REGEX, i.e. using wide characters, not
multibyte streams...
MB+LIKE
Total runtime: 1321.58 msec
Total runtime: 1718.03 msec
Total runtime: 2519.97 msec
Total runtime: 4187.05 msec
Total runtime: 7629.24 msec
Total runtime: 14456.45 msec
Total runtime: 17320.14 msec
Total runtime: 17323.65 msec
Total runtime: 17321.51 msec
noMB+LIKE
Total runtime: 964.90 msec
Total runtime: 993.09 msec
Total runtime: 1057.40 msec
Total runtime: 1192.68 msec
Total runtime: 1494.59 msec
Total runtime: 2078.75 msec
Total runtime: 2328.77 msec
Total runtime: 2326.38 msec
Total runtime: 2330.53 msec
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
... There seem to be some problems in the
implementation. Considering REGEX is not so slow, maybe we should
employ the same design as REGEX, i.e. using wide characters, not
multibyte streams...
Seems like a good thing to put on the to-do list. In the meantime,
we still have the question of whether to enable multibyte in the
default configuration. I'd still vote YES, as these results seem
to me to demonstrate that there is no wide-ranging performance penalty.
A problem confined to LIKE on long strings isn't a showstopper IMHO.
regards, tom lane
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
... There seem to be some problems in the
implementation. Considering REGEX is not so slow, maybe we should
employ the same design as REGEX, i.e. using wide characters, not
multibyte streams...
Seems like a good thing to put on the to-do list. In the meantime,
we still have the question of whether to enable multibyte in the
default configuration. I'd still vote YES, as these results seem
to me to demonstrate that there is no wide-ranging performance
penalty.
A problem confined to LIKE on long strings isn't a showstopper IMHO.
As I said, with a valid non-anchored LIKE expression the performance
difference was substantial; even for shorter strings it was 37%.
The test with "like 'aaa'" was not a good test case, and we should not
deduce anything from that.
Andreas
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
... There seem to be some problems in the
implementation. Considering REGEX is not so slow, maybe we should
employ the same design as REGEX, i.e. using wide characters, not
multibyte streams...
Seems like a good thing to put on the to-do list. In the meantime,
we still have the question of whether to enable multibyte in the
default configuration. I'd still vote YES, as these results seem
to me to demonstrate that there is no wide-ranging performance penalty.
A problem confined to LIKE on long strings isn't a showstopper IMHO.
Added to TODO:
* Use wide characters to evaluate regular expressions, for performance
(Tatsuo)
--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Added to TODO:
* Use wide characters to evaluate regular expressions, for performance
(Tatsuo)
Regexes are fine; it's LIKE that's slow.
regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Added to TODO:
* Use wide characters to evaluate regular expressions, for performance
(Tatsuo)
Regexes are fine; it's LIKE that's slow.
Oops, thanks. Changed to:
* Use wide characters to evaluate LIKE, for performance (Tatsuo)
--
Bruce Momjian
Peter Eisentraut <peter_e@gmx.net> writes:
Tom Lane writes:
In the meantime, we still have the question of whether to enable
multibyte in the default configuration.
Perhaps we could make it a release goal for 7.3
Yeah, that's probably the best way to proceed... it's awfully late
in the 7.2 cycle to be deciding to do this now...
regards, tom lane
Tom Lane writes:
In the meantime, we still have the question of whether to enable
multibyte in the default configuration.
This would make more sense if all of multibyte, locale, and NLS became
defaults in one release. I haven't quite sold people on the second item
yet, although I have a design for how to do that (see below). And the third,
well, who knows...
Perhaps we could make it a release goal for 7.3 to
* Optimize i18n stuff to have a minimal performance penalty when it's not
used. (locale=C etc.)
* Make i18n stuff sufficiently well-behaved to make it the default.
(Especially, add initdb options and GUC parameters to set the locale.
Don't rely on environment variables -- too complicated.)
Meanwhile, quadratic performance penalties (or so it seems) for LIKE
expressions aren't exactly a "minor" problem.
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
Added to TODO:
* Optimize locale to have minimal performance impact when not used
(Peter E)
--
Bruce Momjian
Ok. I ran the modified test (now the iteration is reduced to 100000 in
liketest()). As you can see, there's a huge difference. MB seems up to
~8 times slower :-< There seem to be some problems in the
implementation. Considering REGEX is not so slow, maybe we should
employ the same design as REGEX, i.e. using wide characters, not
multibyte streams...
MB+LIKE
Total runtime: 1321.58 msec
Total runtime: 1718.03 msec
Total runtime: 2519.97 msec
Total runtime: 4187.05 msec
Total runtime: 7629.24 msec
Total runtime: 14456.45 msec
Total runtime: 17320.14 msec
Total runtime: 17323.65 msec
Total runtime: 17321.51 msec
noMB+LIKE
Total runtime: 964.90 msec
Total runtime: 993.09 msec
Total runtime: 1057.40 msec
Total runtime: 1192.68 msec
Total runtime: 1494.59 msec
Total runtime: 2078.75 msec
Total runtime: 2328.77 msec
Total runtime: 2326.38 msec
Total runtime: 2330.53 msec
I did some trials with a wide-character implementation and saw
virtually no improvement. My guess is that the logic employed in LIKE is
too simple to hide the overhead of the multibyte and wide character
conversion. The reason why REGEX with MB is not so slow would be the
complexity of its logic, I think. As you can see in my previous
postings, the $1 ~ $2 operation (which is logically the same as a LIKE '%a%')
is, for example, almost 80 times slower than LIKE (remember that
liketest() loops 10 times more than regextest()).
So I decided to use a completely different approach. Now LIKE has two
matching engines, one for single-byte encodings (MatchText()), the
other for multibyte ones (MBMatchText()). MatchText() is identical
to the non-MB version, so there is virtually no performance penalty for
single-byte encodings. MBMatchText() is for multibyte encodings and is
identical to the one used in 7.1.
Here is the MB case result with SQL_ASCII encoding.
Total runtime: 901.69 msec
Total runtime: 939.08 msec
Total runtime: 993.60 msec
Total runtime: 1148.18 msec
Total runtime: 1434.92 msec
Total runtime: 2024.59 msec
Total runtime: 2288.50 msec
Total runtime: 2290.53 msec
Total runtime: 2316.00 msec
To accomplish this, I moved MatchText etc. to a separate file and now
like.c includes it *twice* (a similar technique is used in regexec()). This
makes like.o a little bit larger, but I believe this is worth it for the
optimization.
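For readers who want to see the shape of that change, here is a
compressed stand-alone sketch (not the patch itself): one shared
matching body with two "advance one character" rules. In the real code
the shared body is pulled in twice with #include, as regexec() does, so
the single-byte path pays no indirect-call cost; function pointers are
used below only to keep the example short.

#include <stdio.h>
#include <string.h>

typedef const unsigned char *(*nextchar_fn) (const unsigned char *);
typedef int (*chareq_fn) (const unsigned char *, const unsigned char *);

/* single-byte rules: a character is one byte */
static const unsigned char *
sb_next(const unsigned char *s)
{
    return s + 1;
}
static int
sb_eq(const unsigned char *a, const unsigned char *b)
{
    return *a == *b;
}

/* UTF-8 rules: a character is a whole sequence */
static int
utf8_len(const unsigned char *s)
{
    if ((*s & 0x80) == 0)
        return 1;
    if ((*s & 0xe0) == 0xc0)
        return 2;
    if ((*s & 0xf0) == 0xe0)
        return 3;
    if ((*s & 0xf8) == 0xf0)
        return 4;
    return 1;                   /* treat invalid lead bytes as one byte */
}
static const unsigned char *
mb_next(const unsigned char *s)
{
    return s + utf8_len(s);
}
static int
mb_eq(const unsigned char *a, const unsigned char *b)
{
    int len = utf8_len(a);

    return len == utf8_len(b) && memcmp(a, b, len) == 0;
}

/* shared matching body, handling only %, _ and literal characters */
static int
match(const unsigned char *t, const unsigned char *p,
      nextchar_fn next, chareq_fn eq)
{
    while (*t && *p)
    {
        if (*p == '%')
        {
            p++;
            for (;; t = next(t))
            {
                if (match(t, p, next, eq))
                    return 1;
                if (*t == '\0')
                    return 0;
            }
        }
        else if (*p == '_')
        {
            t = next(t);        /* skip one character, not one byte */
            p++;
        }
        else if (eq(t, p))
        {
            t = next(t);
            p = next(p);
        }
        else
            return 0;
    }
    while (*p == '%')           /* trailing %'s match the empty string */
        p++;
    return *t == '\0' && *p == '\0';
}

int main(void)
{
    const unsigned char text[] = "m\xc3\xb6nnum";  /* "mönnum" in UTF-8 */
    const unsigned char patt[] = "m_nnum";

    /* byte-wise: '_' eats only the first byte of the two-byte 'ö' -> 0 */
    printf("single-byte: %d\n", match(text, patt, sb_next, sb_eq));
    /* character-wise: '_' eats the whole 'ö' sequence -> 1 */
    printf("multibyte:   %d\n", match(text, patt, mb_next, mb_eq));
    return 0;
}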
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
To accomplish this, I moved MatchText etc. to a separate file and now
like.c includes it *twice* (a similar technique is used in regexec()). This
makes like.o a little bit larger, but I believe this is worth it for the
optimization.
That sounds great.
What's your feeling now about the original question: whether to enable
multibyte by default now, or not? I'm still thinking that Peter's
counsel is the wisest: plan to do it in 7.3, not today. But this fix
seems to eliminate the only hard reason we have not to do it today ...
regards, tom lane
What's your feeling now about the original question: whether to enable
multibyte by default now, or not? I'm still thinking that Peter's
counsel is the wisest: plan to do it in 7.3, not today. But this fix
seems to eliminate the only hard reason we have not to do it today ...
If SQL99's i18n stuff were added in 7.3, it would mean both the
multibyte support and the locale support might disappear, and the two
might be merged into it (not sure about NLS; is that compatible with, for
example, per-column charsets?).
So, for non-multibyte users, the multibyte support would be
one-release-only functionality: it suddenly appears in 7.2 and disappears
in 7.3. I'm afraid this would cause more trouble than usefulness.
What do you think?
--
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
What do you think?
I think that we were supposed to go beta a month ago, and so this is
no time to start adding new features to this release. Let's plan to
make this happen (one way or the other) in 7.3, instead.
regards, tom lane
I think that we were supposed to go beta a month ago, and so this is
no time to start adding new features to this release. Let's plan to
make this happen (one way or the other) in 7.3, instead.
Agreed.
--
Tatsuo Ishii
Ok. I ran the modified test (now the iteration is reduced to 100000 in
liketest()). As you can see, there's a huge difference. MB seems up to
~8 times slower :-< There seem to be some problems in the
implementation. Considering REGEX is not so slow, maybe we should
employ the same design as REGEX, i.e. using wide characters, not
multibyte streams...
Let me add I think our regex code is very slow. It is the standard BSD
regex library by Henry Spencer. He rewrote it a few years ago for TCL
8.X and said he was working on a standalone library version. I have
asked him several times via email over the years but he still has not
released a standalone version of the new optimized regex code. It is on
our TODO list.
--
Bruce Momjian
Hi,
I should have sent the patch earlier, but got delayed by other stuff.
Anyway, here is the patch:
- most of the functionality is only activated when MULTIBYTE is
defined,
- checks for valid UTF-8 characters, client-side only for now, and only on
output; you can still send invalid UTF-8 to the server (so it's
only partly compliant with Unicode 3.1, but that's better than
nothing).
- formats output with the correct number of columns (that's why I made it in
the first place, after all), but only for UNICODE. However, the code
allows plugging in routines for other encodings, as Tatsuo did for
the other multibyte functions.
- slightly corrects the UTF-8 code from Tatsuo to allow Unicode 3.1
characters (characters with values >= 0x10000, which are encoded in
four bytes).
- doesn't depend on the locale capabilities of the glibc (useful for
remote telnet).
I would like somebody to check it closely, as it is my first patch to
pgsql. Also, I created dummy .orig files, so that the two files I
created are included, I hope that's the right way.
Now, a lot of functionality is NOT included here, but I will keep that
for 7.3 :) That includes all string checking on the server side (which
will have to be a bit more optimised ;) ), and the input checking on
the client side for UTF-8, though that should not be difficult. It's
just a matter of sending the strings through mbvalidate() before sending
them to the server. Strong checking of UTF-8 strings is mandatory to be
compliant with Unicode 3.1+.
Do I have time to look at a patch to include iso-8859-15 for 7.2?
The euro is coming on 1 January 2002 (before 7.3!) and over 280
million people in Europe will need the euro sign, and only iso-8859-15
and iso-8859-16 have it (and unfortunately, I don't think all Unices
will switch to Unicode in the meantime)....
err... yes, I know that not every single person in Europe
uses PostgreSQL, so it's not exactly 280m, but it's just a matter of
time! ;)
I'll come back (on pgsql-hackers) later to ask a few questions
regarding the full unicode support (normalisation, collation,
regexes,...) on the server side :)
Here is the patch !
Patrice.
--
Patrice HÉDÉ ------------------------------- patrice à islande org -----
Attachments:
wcwidth.patch (text/plain; charset=iso-8859-15)
*** ./pgsql/src/backend/utils/mb/wchar.c.orig Wed Sep 26 01:09:05 2001
--- ./pgsql/src/backend/utils/mb/wchar.c Thu Oct 4 22:03:41 2001
***************
*** 259,268 ****
static int
pg_utf2wchar_with_len(const unsigned char *from, pg_wchar * to, int len)
{
! unsigned char c1,
! c2,
! c3;
! int cnt = 0;
while (len > 0 && *from)
{
--- 259,266 ----
static int
pg_utf2wchar_with_len(const unsigned char *from, pg_wchar * to, int len)
{
! unsigned char c1, c2, c3, c4;
! int cnt = 0;
while (len > 0 && *from)
{
***************
*** 270,275 ****
--- 268,274 ----
{
*to = *from++;
len--;
+ cnt++;
}
else if ((*from & 0xe0) == 0xc0 && len >= 2)
{
***************
*** 278,285 ****
*to = c1 << 6;
*to |= c2;
len -= 2;
}
! else if ((*from & 0xe0) == 0xe0 && len >= 3)
{
c1 = *from++ & 0x0f;
c2 = *from++ & 0x3f;
--- 277,285 ----
*to = c1 << 6;
*to |= c2;
len -= 2;
+ cnt++;
}
! else if ((*from & 0xf0) == 0xe0 && len >= 3)
{
c1 = *from++ & 0x0f;
c2 = *from++ & 0x3f;
***************
*** 288,301 ****
*to |= c2 << 6;
*to |= c3;
len -= 3;
}
else
{
! *to = *from++;
len--;
}
to++;
- cnt++;
}
*to = 0;
return (cnt);
--- 288,317 ----
*to |= c2 << 6;
*to |= c3;
len -= 3;
+ cnt++;
+ }
+ else if ((*from & 0xf8) == 0xf0 && len >= 4)
+ {
+ c1 = *from++ & 0x0f;
+ c2 = *from++ & 0x3f;
+ c3 = *from++ & 0x3f;
+ c4 = *from++ & 0x3f;
+ *to = c1 << 18;
+ *to |= c2 << 12;
+ *to |= c3 << 6;
+ *to |= c4;
+ len -= 4;
+ cnt++;
}
else
{
! /* the default MUST be to drop the unrecognised char and
! * not let it through.
! */
! from++;
len--;
}
to++;
}
*to = 0;
return (cnt);
***************
*** 313,320 ****
len = 1;
else if ((*s & 0xe0) == 0xc0)
len = 2;
! else if ((*s & 0xe0) == 0xe0)
len = 3;
return (len);
}
--- 329,338 ----
len = 1;
else if ((*s & 0xe0) == 0xc0)
len = 2;
! else if ((*s & 0xf0) == 0xe0)
len = 3;
+ else if ((*s & 0xf8) == 0xf0)
+ len = 4;
return (len);
}
***************
*** 463,469 ****
{pg_euccn2wchar_with_len, pg_euccn_mblen, 3}, /* 2; PG_EUC_CN */
{pg_euckr2wchar_with_len, pg_euckr_mblen, 3}, /* 3; PG_EUC_KR */
{pg_euctw2wchar_with_len, pg_euctw_mblen, 3}, /* 4; PG_EUC_TW */
! {pg_utf2wchar_with_len, pg_utf_mblen, 3}, /* 5; PG_UNICODE */
{pg_mule2wchar_with_len, pg_mule_mblen, 3}, /* 6; PG_MULE_INTERNAL */
{pg_latin12wchar_with_len, pg_latin1_mblen, 1}, /* 7; PG_LATIN1 */
{pg_latin12wchar_with_len, pg_latin1_mblen, 1}, /* 8; PG_LATIN2 */
--- 481,487 ----
{pg_euccn2wchar_with_len, pg_euccn_mblen, 3}, /* 2; PG_EUC_CN */
{pg_euckr2wchar_with_len, pg_euckr_mblen, 3}, /* 3; PG_EUC_KR */
{pg_euctw2wchar_with_len, pg_euctw_mblen, 3}, /* 4; PG_EUC_TW */
! {pg_utf2wchar_with_len, pg_utf_mblen, 4}, /* 5; PG_UNICODE */
{pg_mule2wchar_with_len, pg_mule_mblen, 3}, /* 6; PG_MULE_INTERNAL */
{pg_latin12wchar_with_len, pg_latin1_mblen, 1}, /* 7; PG_LATIN1 */
{pg_latin12wchar_with_len, pg_latin1_mblen, 1}, /* 8; PG_LATIN2 */
*** ./pgsql/src/bin/psql/Makefile.orig Wed Sep 26 01:14:40 2001
--- ./pgsql/src/bin/psql/Makefile Fri Oct 5 00:21:17 2001
***************
*** 19,25 ****
OBJS=command.o common.o help.o input.o stringutils.o mainloop.o \
copy.o startup.o prompt.o variables.o large_obj.o print.o describe.o \
! tab-complete.o pg_mb_utf8.o
all: submake psql
--- 19,25 ----
OBJS=command.o common.o help.o input.o stringutils.o mainloop.o \
copy.o startup.o prompt.o variables.o large_obj.o print.o describe.o \
! tab-complete.o mbprint.o
all: submake psql
*** ./pgsql/src/bin/psql/mbprint.c.orig Fri Oct 5 21:10:42 2001
--- ./pgsql/src/bin/psql/mbprint.c Mon Oct 8 20:58:10 2001
***************
*** 0 ****
--- 1,334 ----
+ /*
+ * psql - the PostgreSQL interactive terminal
+ *
+ * Copyright 2000 by PostgreSQL Global Development Group
+ *
+ * $Header: $
+ */
+
+ #include "postgres_fe.h"
+ #include "mbprint.h"
+
+ #ifdef MULTIBYTE
+
+ #include "mb/pg_wchar.h"
+ #include "settings.h"
+
+ /*
+ * This is an implementation of wcwidth() and wcswidth() as defined in
+ * "The Single UNIX Specification, Version 2, The Open Group, 1997"
+ * <http://www.UNIX-systems.org/online.html>
+ *
+ * Markus Kuhn -- 2001-09-08 -- public domain
+ *
+ * customised for PostgreSQL
+ *
+ * original available at : http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
+ */
+
+ struct mbinterval {
+ unsigned short first;
+ unsigned short last;
+ };
+
+ /* auxiliary function for binary search in interval table */
+ static int
+ mbbisearch(pg_wchar ucs, const struct mbinterval *table, int max)
+ {
+ int min = 0;
+ int mid;
+
+ if (ucs < table[0].first || ucs > table[max].last)
+ return 0;
+ while (max >= min) {
+ mid = (min + max) / 2;
+ if (ucs > table[mid].last)
+ min = mid + 1;
+ else if (ucs < table[mid].first)
+ max = mid - 1;
+ else
+ return 1;
+ }
+
+ return 0;
+ }
+
+
+ /* The following functions define the column width of an ISO 10646
+ * character as follows:
+ *
+ * - The null character (U+0000) has a column width of 0.
+ *
+ * - Other C0/C1 control characters and DEL will lead to a return
+ * value of -1.
+ *
+ * - Non-spacing and enclosing combining characters (general
+ * category code Mn or Me in the Unicode database) have a
+ * column width of 0.
+ *
+ * - Other format characters (general category code Cf in the Unicode
+ * database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
+ *
+ * - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
+ * have a column width of 0.
+ *
+ * - Spacing characters in the East Asian Wide (W) or East Asian
+ * FullWidth (F) category as defined in Unicode Technical
+ * Report #11 have a column width of 2.
+ *
+ * - All remaining characters (including all printable
+ * ISO 8859-1 and WGL4 characters, Unicode control characters,
+ * etc.) have a column width of 1.
+ *
+ * This implementation assumes that wchar_t characters are encoded
+ * in ISO 10646.
+ */
+
+ static int
+ ucs_wcwidth(pg_wchar ucs)
+ {
+ /* sorted list of non-overlapping intervals of non-spacing characters */
+ static const struct mbinterval combining[] = {
+ { 0x0300, 0x034E }, { 0x0360, 0x0362 }, { 0x0483, 0x0486 },
+ { 0x0488, 0x0489 }, { 0x0591, 0x05A1 }, { 0x05A3, 0x05B9 },
+ { 0x05BB, 0x05BD }, { 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 },
+ { 0x05C4, 0x05C4 }, { 0x064B, 0x0655 }, { 0x0670, 0x0670 },
+ { 0x06D6, 0x06E4 }, { 0x06E7, 0x06E8 }, { 0x06EA, 0x06ED },
+ { 0x070F, 0x070F }, { 0x0711, 0x0711 }, { 0x0730, 0x074A },
+ { 0x07A6, 0x07B0 }, { 0x0901, 0x0902 }, { 0x093C, 0x093C },
+ { 0x0941, 0x0948 }, { 0x094D, 0x094D }, { 0x0951, 0x0954 },
+ { 0x0962, 0x0963 }, { 0x0981, 0x0981 }, { 0x09BC, 0x09BC },
+ { 0x09C1, 0x09C4 }, { 0x09CD, 0x09CD }, { 0x09E2, 0x09E3 },
+ { 0x0A02, 0x0A02 }, { 0x0A3C, 0x0A3C }, { 0x0A41, 0x0A42 },
+ { 0x0A47, 0x0A48 }, { 0x0A4B, 0x0A4D }, { 0x0A70, 0x0A71 },
+ { 0x0A81, 0x0A82 }, { 0x0ABC, 0x0ABC }, { 0x0AC1, 0x0AC5 },
+ { 0x0AC7, 0x0AC8 }, { 0x0ACD, 0x0ACD }, { 0x0B01, 0x0B01 },
+ { 0x0B3C, 0x0B3C }, { 0x0B3F, 0x0B3F }, { 0x0B41, 0x0B43 },
+ { 0x0B4D, 0x0B4D }, { 0x0B56, 0x0B56 }, { 0x0B82, 0x0B82 },
+ { 0x0BC0, 0x0BC0 }, { 0x0BCD, 0x0BCD }, { 0x0C3E, 0x0C40 },
+ { 0x0C46, 0x0C48 }, { 0x0C4A, 0x0C4D }, { 0x0C55, 0x0C56 },
+ { 0x0CBF, 0x0CBF }, { 0x0CC6, 0x0CC6 }, { 0x0CCC, 0x0CCD },
+ { 0x0D41, 0x0D43 }, { 0x0D4D, 0x0D4D }, { 0x0DCA, 0x0DCA },
+ { 0x0DD2, 0x0DD4 }, { 0x0DD6, 0x0DD6 }, { 0x0E31, 0x0E31 },
+ { 0x0E34, 0x0E3A }, { 0x0E47, 0x0E4E }, { 0x0EB1, 0x0EB1 },
+ { 0x0EB4, 0x0EB9 }, { 0x0EBB, 0x0EBC }, { 0x0EC8, 0x0ECD },
+ { 0x0F18, 0x0F19 }, { 0x0F35, 0x0F35 }, { 0x0F37, 0x0F37 },
+ { 0x0F39, 0x0F39 }, { 0x0F71, 0x0F7E }, { 0x0F80, 0x0F84 },
+ { 0x0F86, 0x0F87 }, { 0x0F90, 0x0F97 }, { 0x0F99, 0x0FBC },
+ { 0x0FC6, 0x0FC6 }, { 0x102D, 0x1030 }, { 0x1032, 0x1032 },
+ { 0x1036, 0x1037 }, { 0x1039, 0x1039 }, { 0x1058, 0x1059 },
+ { 0x1160, 0x11FF }, { 0x17B7, 0x17BD }, { 0x17C6, 0x17C6 },
+ { 0x17C9, 0x17D3 }, { 0x180B, 0x180E }, { 0x18A9, 0x18A9 },
+ { 0x200B, 0x200F }, { 0x202A, 0x202E }, { 0x206A, 0x206F },
+ { 0x20D0, 0x20E3 }, { 0x302A, 0x302F }, { 0x3099, 0x309A },
+ { 0xFB1E, 0xFB1E }, { 0xFE20, 0xFE23 }, { 0xFEFF, 0xFEFF },
+ { 0xFFF9, 0xFFFB }
+ };
+
+ /* test for 8-bit control characters */
+ if (ucs == 0) {
+ return 0;
+ }
+
+ if (ucs < 32 || (ucs >= 0x7f && ucs < 0xa0) || ucs > 0x0010ffff) {
+ return -1;
+ }
+
+ /* binary search in table of non-spacing characters */
+ if (mbbisearch(ucs, combining,
+ sizeof(combining) / sizeof(struct mbinterval) - 1)) {
+ return 0;
+ }
+
+ /* if we arrive here, ucs is not a combining or C0/C1 control character */
+
+ return 1 +
+ (ucs >= 0x1100 &&
+ (ucs <= 0x115f || /* Hangul Jamo init. consonants */
+ (ucs >= 0x2e80 && ucs <= 0xa4cf && (ucs & ~0x0011) != 0x300a &&
+ ucs != 0x303f) || /* CJK ... Yi */
+ (ucs >= 0xac00 && ucs <= 0xd7a3) || /* Hangul Syllables */
+ (ucs >= 0xf900 && ucs <= 0xfaff) || /* CJK Compatibility Ideographs */
+ (ucs >= 0xfe30 && ucs <= 0xfe6f) || /* CJK Compatibility Forms */
+ (ucs >= 0xff00 && ucs <= 0xff5f) || /* Fullwidth Forms */
+ (ucs >= 0xffe0 && ucs <= 0xffe6) ||
+ (ucs >= 0x20000 && ucs <= 0x2ffff)));
+ }
+
+ pg_wchar
+ utf2ucs(const unsigned char *c)
+ {
+ /* one char version of pg_utf2wchar_with_len.
+ * no validation is done here; c must point to a sufficiently long, valid UTF-8 sequence
+ */
+ if ((*c & 0x80) == 0) {
+ return (pg_wchar)c[0];
+ }
+ else if ((*c & 0xe0) == 0xc0) {
+ return (pg_wchar)(((c[0] & 0x1f) << 6) |
+ (c[1] & 0x3f));
+ }
+ else if ((*c & 0xf0) == 0xe0) {
+ return (pg_wchar)(((c[0] & 0x0f) << 12) |
+ ((c[1] & 0x3f) << 6) |
+ (c[2] & 0x3f));
+ }
+ else if ((*c & 0xf8) == 0xf0) {
+ return (pg_wchar)(((c[0] & 0x07) << 18) |
+ ((c[1] & 0x3f) << 12) |
+ ((c[2] & 0x3f) << 6) |
+ (c[3] & 0x3f));
+ }
+ else {
+ /* that is an invalid code on purpose */
+ return 0xffffffff;
+ }
+ }
+
+ /* mb_utf_wcswidth : calculate the display width (in columns) of the UTF-8 string pwcs
+ */
+ static int
+ mb_utf_wcswidth(unsigned char *pwcs, int len)
+ {
+ int w, l = 0;
+ int width = 0;
+
+ for (;*pwcs && len > 0; pwcs+=l) {
+ l = pg_utf_mblen(pwcs);
+ if ((len < l) || ((w = ucs_wcwidth(utf2ucs(pwcs))) < 0)) {
+ return width;
+ }
+ len -= l;
+ width += w;
+ }
+ return width;
+ }
+
+ static int
+ utf_charcheck(const unsigned char *c)
+ {
+ /* Unicode 3.1 compliant validation :
+ * for each category, it checks the combination of each byte to make sure
+ * it maps to a valid range. It also returns -1 for the following UCS values:
+ * ucs > 0x10ffff
+ * ucs & 0xfffe = 0xfffe
+ * 0xfdd0 < ucs < 0xfdef
+ * ucs & 0xdb00 = 0xd800 (surrogates)
+ */
+ if ((*c & 0x80) == 0) {
+ return 1;
+ }
+ else if ((*c & 0xe0) == 0xc0) {
+ /* two-byte char */
+ if(((c[1] & 0xc0) == 0x80) && ((c[0] & 0x1f) > 0x01)) {
+ return 2;
+ }
+ return -1;
+ }
+ else if ((*c & 0xf0) == 0xe0) {
+ /* three-byte char */
+ if (((c[1] & 0xc0) == 0x80) &&
+ (((c[0] & 0x0f) != 0x00) || ((c[1] & 0x20) == 0x20)) &&
+ ((c[2] & 0xc0) == 0x80)) {
+ int z = c[0] & 0x0f;
+ int yx = ((c[1] & 0x3f) << 6) | (c[2] & 0x3f);
+ int lx = yx & 0x7f;
+
+ /* check 0xfffe/0xffff, 0xfdd0..0xfdef range, surrogates */
+ if (((z == 0x0f) &&
+ (((yx & 0xffe) == 0xffe) ||
+ (((yx & 0xf80) == 0xd80) && (lx >= 0x30) && (lx <= 0x4f)))) ||
+ ((z == 0x0d) && ((yx & 0xb00) == 0x800))) {
+ return -1;
+ }
+ return 3;
+ }
+ return -1;
+ }
+ else if ((*c & 0xf8) == 0xf0) {
+ int u = ((c[0] & 0x07) << 2) | ((c[1] & 0x30) >> 4);
+
+ /* four-byte char */
+ if (((c[1] & 0xc0) == 0x80) &&
+ (u > 0x00) && (u <= 0x10) &&
+ ((c[2] & 0xc0) == 0x80) && ((c[3] & 0xc0) == 0x80)) {
+ /* test for 0xzzzzfffe/0xzzzzffff */
+ if (((c[1] & 0x0f) == 0x0f) && ((c[2] & 0x3f) == 0x3f) &&
+ ((c[3] & 0x3e) == 0x3e)) {
+ return -1;
+ }
+ return 4;
+ }
+ return -1;
+ }
+ return -1;
+ }
+
+ static unsigned char *
+ mb_utf_validate(unsigned char *pwcs)
+ {
+ int l = 0;
+ unsigned char *p = pwcs;
+ unsigned char *p0 = pwcs;
+
+ while( *pwcs ) {
+ if ((l = utf_charcheck(pwcs)) > 0) {
+ if (p != pwcs) {
+ int i;
+ for( i = 0; i < l; i++) {
+ *p++ = *pwcs++;
+ }
+ }
+ else {
+ pwcs += l;
+ p += l;
+ }
+ }
+ else {
+ /* we skip the char */
+ pwcs++;
+ }
+ }
+ if (p != pwcs) {
+ *p = '\0';
+ }
+ return p0;
+ }
+
+ /*
+ * public functions : wcswidth and mbvalidate
+ */
+
+ int
+ pg_wcswidth(unsigned char *pwcs, int len) {
+ if (pset.encoding == PG_UTF8) {
+ return mb_utf_wcswidth(pwcs, len);
+ }
+ else {
+ /* obviously, other encodings may want to fix this, but I don't know them
+ * myself, unfortunately.
+ */
+ return len;
+ }
+ }
+
+ unsigned char *
+ mbvalidate(unsigned char *pwcs) {
+ if (pset.encoding == PG_UTF8) {
+ return mb_utf_validate(pwcs);
+ }
+ else {
+ /* other encodings needing validation should add their own routines here
+ */
+ return pwcs;
+ }
+ }
+ #else /* !MULTIBYTE */
+
+ /* in single-byte environment, all cells take 1 column */
+ int pg_wcswidth(unsigned char *pwcs, int len) {
+ return len;
+ }
+ #endif
+
+
*** ./pgsql/src/bin/psql/mbprint.h.orig Fri Oct 5 21:10:45 2001
--- ./pgsql/src/bin/psql/mbprint.h Fri Oct 5 00:37:11 2001
***************
*** 0 ****
--- 1,18 ----
+ /* $Id: $ */
+ #ifndef MBPRINT_H
+ #define MBPRINT_H
+
+
+ #ifdef MULTIBYTE
+
+ #include "mb/pg_wchar.h"
+
+ pg_wchar utf2ucs(const unsigned char *c);
+
+ unsigned char *mbvalidate(unsigned char *pwcs);
+ #endif /* MULTIBYTE */
+
+ int pg_wcswidth(unsigned char *pwcs, int len);
+
+ #endif /* MBPRINT_H */
+
*** ./pgsql/src/bin/psql/print.c.orig Fri Oct 5 01:18:36 2001
--- ./pgsql/src/bin/psql/print.c Fri Oct 5 19:16:35 2001
***************
*** 33,38 ****
--- 33,39 ----
#include <termios.h>
#endif
+ #include "mbprint.h"
/*************************/
/* Unaligned text */
***************
*** 209,218 ****
static void
print_aligned_text(const char *title, const char *const * headers,
const char *const * cells, const char *const * footers,
! const char *opt_align, bool opt_barebones, unsigned short int opt_border,
FILE *fout)
{
unsigned int col_count = 0;
unsigned int i,
tmp;
unsigned int *widths,
--- 210,226 ----
static void
print_aligned_text(const char *title, const char *const * headers,
const char *const * cells, const char *const * footers,
! const char *opt_align, bool opt_barebones,
! unsigned short int opt_border,
FILE *fout)
{
unsigned int col_count = 0;
+
+ #ifdef MULTIBYTE
+ unsigned int cell_count = 0;
+ unsigned int *head_w, *cell_w;
+ #endif
+
unsigned int i,
tmp;
unsigned int *widths,
***************
*** 230,244 ****
exit(EXIT_FAILURE);
}
/* calc column widths */
! for (i = 0; i < col_count; i++)
! if ((tmp = strlen(headers[i])) > widths[i])
! widths[i] = tmp; /* don't wanna call strlen twice */
! for (i = 0, ptr = cells; *ptr; ptr++, i++)
! if ((tmp = strlen(*ptr)) > widths[i % col_count])
widths[i % col_count] = tmp;
!
if (opt_border == 0)
total_w = col_count - 1;
else if (opt_border == 1)
--- 238,281 ----
exit(EXIT_FAILURE);
}
+ #ifdef MULTIBYTE
+ head_w = calloc(col_count, sizeof(*head_w));
+ if (!head_w) {
+ perror("calloc");
+ exit(EXIT_FAILURE);
+ }
+
+ /* count rows */
+ for (ptr = cells; *ptr; ptr++) {
+ cell_count++;
+ }
+
+ cell_w = calloc(cell_count, sizeof(*cell_w));
+ if (!cell_w) {
+ perror("calloc");
+ exit(EXIT_FAILURE);
+ }
+ #endif
+
+
/* calc column widths */
! for (i = 0; i < col_count; i++) {
! if ((tmp = pg_wcswidth((unsigned char *)headers[i], strlen(headers[i]))) > widths[i]) {
! widths[i] = tmp;
! }
! #ifdef MULTIBYTE
! head_w[i] = tmp;
! #endif
! }
! for (i = 0, ptr = cells; *ptr; ptr++, i++) {
! if ((tmp = pg_wcswidth((unsigned char *)*ptr, strlen(*ptr))) > widths[i % col_count]) {
widths[i % col_count] = tmp;
! }
! #ifdef MULTIBYTE
! cell_w[i] = tmp;
! #endif
! }
if (opt_border == 0)
total_w = col_count - 1;
else if (opt_border == 1)
***************
*** 250,261 ****
total_w += widths[i];
/* print title */
! if (title && !opt_barebones)
! {
! if (strlen(title) >= total_w)
fprintf(fout, "%s\n", title);
! else
! fprintf(fout, "%-*s%s\n", (int) (total_w - strlen(title)) / 2, "", title);
}
/* print headers */
--- 287,300 ----
total_w += widths[i];
/* print title */
! if (title && !opt_barebones) {
! int tlen;
! if ((tlen = pg_wcswidth((unsigned char *)title, strlen(title))) >= total_w) {
fprintf(fout, "%s\n", title);
! }
! else {
! fprintf(fout, "%-*s%s\n", (int) (total_w - tlen) / 2, "", title);
! }
}
/* print headers */
***************
*** 271,280 ****
for (i = 0; i < col_count; i++)
{
/* centered */
! fprintf(fout, "%-*s%s%-*s",
! (int) floor((widths[i] - strlen(headers[i])) / 2.0), "",
! headers[i], (int) ceil((widths[i] - strlen(headers[i])) / 2.0), "");
if (i < col_count - 1)
{
--- 310,325 ----
for (i = 0; i < col_count; i++)
{
+ int nbspace;
+ #ifdef MULTIBYTE
+ nbspace = widths[i] - head_w[i];
+ #else
+ nbspace = widths[i] - strlen(headers[i]);
+ #endif
+
/* centered */
! fprintf(fout, "%-*s%s%-*s",
! nbspace / 2, "", headers[i], (nbspace+1) / 2, "");
if (i < col_count - 1)
{
***************
*** 307,320 ****
}
/* content */
! if (opt_align[(i) % col_count] == 'r')
fprintf(fout, "%*s", widths[i % col_count], cells[i]);
else
{
if ((i + 1) % col_count == 0 && opt_border != 2)
fputs(cells[i], fout);
! else
fprintf(fout, "%-*s", widths[i % col_count], cells[i]);
}
/* divider */
--- 352,377 ----
}
/* content */
! if (opt_align[(i) % col_count] == 'r') {
! #ifdef MULTIBYTE
! fprintf(fout, "%*s%s",
! widths[i % col_count] - cell_w[i], "", cells[i] );
! #else
fprintf(fout, "%*s", widths[i % col_count], cells[i]);
+ #endif
+ }
else
{
if ((i + 1) % col_count == 0 && opt_border != 2)
fputs(cells[i], fout);
! else {
! #ifdef MULTIBYTE
! fprintf(fout, "%-s%*s", cells[i],
! widths[i % col_count] - cell_w[i], "" );
! #else
fprintf(fout, "%-*s", widths[i % col_count], cells[i]);
+ #endif
+ }
}
/* divider */
***************
*** 345,350 ****
--- 402,411 ----
fputc('\n', fout);
/* clean up */
+ #ifdef MULTIBYTE
+ free(cell_w);
+ free(head_w);
+ #endif
free(widths);
}
***************
*** 360,369 ****
unsigned int record = 1;
const char *const * ptr;
unsigned int i,
! tmp,
hwidth = 0,
dwidth = 0;
char *divider;
if (cells[0] == NULL)
{
--- 421,434 ----
unsigned int record = 1;
const char *const * ptr;
unsigned int i,
! tmp = 0,
hwidth = 0,
dwidth = 0;
char *divider;
+ #ifdef MULTIBYTE
+ unsigned int cell_count = 0;
+ unsigned int *cell_w,*head_w;
+ #endif
if (cells[0] == NULL)
{
***************
*** 371,376 ****
--- 436,475 ----
return;
}
+ #ifdef MULTIBYTE
+ /* pre-count headers */
+ for (ptr = headers; *ptr; ptr++) {
+ col_count++;
+ }
+ head_w = calloc(col_count, sizeof(*head_w));
+ if (!head_w) {
+ perror("calloc");
+ exit(EXIT_FAILURE);
+ }
+ for (i = 0; i < col_count; i++)
+ {
+ if ((tmp = pg_wcswidth((unsigned char *)headers[i], strlen(headers[i]))) > hwidth)
+ hwidth = tmp;
+ head_w[i] = tmp;
+ }
+ for (ptr = cells; *ptr; ptr++) {
+ cell_count++;
+ }
+
+ cell_w = calloc(cell_count, sizeof(*cell_w));
+ if (!cell_w) {
+ perror("calloc");
+ exit(EXIT_FAILURE);
+ }
+
+ /* find longest data cell */
+ for (i = 0, ptr = cells; *ptr; ptr++, i++) {
+ if ((tmp = pg_wcswidth((unsigned char *)*ptr, strlen(*ptr))) > dwidth) {
+ dwidth = tmp;
+ }
+ cell_w[i] = tmp;
+ }
+ #else
/* count columns and find longest header */
for (ptr = headers; *ptr; ptr++)
{
***************
*** 380,388 ****
}
/* find longest data cell */
! for (ptr = cells; *ptr; ptr++)
if ((tmp = strlen(*ptr)) > dwidth)
dwidth = tmp;
/* print title */
if (!opt_barebones && title)
--- 479,489 ----
}
/* find longest data cell */
! for (ptr = cells; *ptr; ptr++) {
if ((tmp = strlen(*ptr)) > dwidth)
dwidth = tmp;
+ }
+ #endif
/* print title */
if (!opt_barebones && title)
***************
*** 456,462 ****
--- 557,568 ----
if (opt_border == 2)
fputs("| ", fout);
+ #ifdef MULTIBYTE
+ fprintf(fout, "%-s%*s", headers[i % col_count],
+ hwidth - head_w[i % col_count], "");
+ #else
fprintf(fout, "%-*s", hwidth, headers[i % col_count]);
+ #endif
if (opt_border > 0)
fputs(" | ", fout);
***************
*** 465,472 ****
if (opt_border < 2)
fprintf(fout, "%s\n", *ptr);
! else
fprintf(fout, "%-*s |\n", dwidth, *ptr);
}
if (opt_border == 2)
--- 571,583 ----
if (opt_border < 2)
fprintf(fout, "%s\n", *ptr);
! else {
! #ifdef MULTIBYTE
! fprintf(fout, "%-s%*s |\n", *ptr, dwidth - cell_w[i], "");
! #else
fprintf(fout, "%-*s |\n", dwidth, *ptr);
+ #endif
+ }
}
if (opt_border == 2)
***************
*** 485,490 ****
--- 596,605 ----
fputc('\n', fout);
free(divider);
+
+ #ifdef MULTIBYTE
+ free(cell_w);
+ #endif
}
***************
*** 1037,1044 ****
exit(EXIT_FAILURE);
}
! for (i = 0; i < nfields; i++)
headers[i] = PQfname(result, i);
/* set cells */
--- 1152,1164 ----
exit(EXIT_FAILURE);
}
! for (i = 0; i < nfields; i++) {
! #ifdef MULTIBYTE
! headers[i] = mbvalidate(PQfname(result, i));
! #else
headers[i] = PQfname(result, i);
+ #endif
+ }
/* set cells */
***************
*** 1053,1060 ****
{
if (PQgetisnull(result, i / nfields, i % nfields))
cells[i] = opt->nullPrint ? opt->nullPrint : "";
! else
cells[i] = PQgetvalue(result, i / nfields, i % nfields);
}
/* set footers */
--- 1173,1185 ----
{
if (PQgetisnull(result, i / nfields, i % nfields))
cells[i] = opt->nullPrint ? opt->nullPrint : "";
! else {
! #ifdef MULTIBYTE
! cells[i] = mbvalidate(PQgetvalue(result, i / nfields, i % nfields));
! #else
cells[i] = PQgetvalue(result, i / nfields, i % nfields);
+ #endif
+ }
}
/* set footers */
- corrects a bit the UTF-8 code from Tatsuo to allow Unicode 3.1
characters (characters with values >= 0x10000, which are encoded on
four bytes).
After applying your patches, do 4-byte UTF-8 characters convert to UCS-2 (2
bytes) or UCS-4 (4 bytes) in pg_utf2wchar_with_len()? If it is 4
bytes, we are in trouble. The current regex implementation does not handle
4-byte-wide charsets.
--
Tatsuo Ishii
* Tatsuo Ishii <t-ishii@sra.co.jp> [011009 18:38]:
*sigh* yes, it does encode to four bytes :(
Three solutions then :
1) we support these supplementary characters, knowing that they won't
work with regexes,
2) I back out the change, but then anyone using these characters will
get something weird, since the decoding would be faulty (they would
be handled as 3-byte UTF-8 chars, and then the fourth byte would
become a "faulty char"... not very good, as the 3-byte version is
still not a valid UTF-8 code !),
3) we fix the regex engine within the next 24 hours, before the beta
deadline is activated :/
I must say that I doubt that anyone will use these characters in the
next few months : these are mostly Chinese extended characters, plus the
Old Italic, Deseret, and Gothic scripts, the Byzantine and Western
musical symbols, and the mathematical alphanumeric symbols.
I would prefer solution 1), as I think it is better to allow these
characters, even with a temporary restriction on the regex, than to
fail completely on them. As for solution 3), we may still work at it
in the next few months :) [I haven't even looked at the regex engine
yet, so I don't know the implications of what I have just said !]
What do you think ?
Patrice
--
Patrice Hédé
email: patrice hede à islande org
www : http://www.islande.org/
I think 2) is not very good, and we should reject these 4-byte UTF-8
strings. After all, we are not ready for them.
BTW, other part of your patches looks good. Peter, what do you think?
--
Tatsuo Ishii
I think 2) is not very good, and we should reject these 4-byte UTF-8
strings. After all, we are not ready for them.
If we still recognise them as 4-byte UTF-8 chars (in order to parse
the next char correctly) and reject them as invalid chars, that should
be OK :)
BTW, other part of your patches looks good. Peter, what do you think?
Nice to hear :)
Patrice
--
Patrice Hédé
email: patrice hede à islande org
www : http://www.islande.org/
Can I ask about the status of this?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
* Bruce Momjian <pgman@candle.pha.pa.us> [011011 22:49]:
Can I ask about the status of this?
I sent a patch a few days ago solving the client-side issue (on
the pgsql-patches mailing list) for review. I think Tatsuo said it
looked OK, but he should confirm or deny this.
There is still the issue about unicode characters which have code
points above U00FFFF, which probably should be rejected on the server
side. I have yet to update my patch for that. I'll probably do that
tomorrow, as I don't have more time tonight, but I think this will be
trivial, so maybe Tatsuo can do it, if he has some time before that :)
If there are other issues, I'd like to know :)
Regarding the implementation of Unicode functionality (normalisation,
collation, Unicode-aware regexes, uc/lc/tc (title-case) functions,...)
on the server side, it's definitely something for 7.3 (though it might
be available sooner). It will probably be just a contributed extension
first. I'm currently making an alpha version of the project I'm
working on in order to have sufficient "real-life" Unicode data to
work with, and make sure the design choices make sense :)
Patrice.
BTW, I tried to find web-accessible archives of pgsql-patches. Are
there any, or should each and every discussion be followed up on
pgsql-hackers (even though the description of pgsql-patches includes
discussions on patches) ?
--
Patrice Hédé
email: patrice hede à islande org
www : http://www.islande.org/
* Bruce Momjian <pgman@candle.pha.pa.us> [011011 22:49]:
Can I ask about the status of this?
I sent a patch a few days ago solving the client-side issue (on
the pgsql-patches mailing list) for review. I think Tatsuo said it
looked OK, but he should confirm or deny this.
I've been waiting for Peter's opinion. His understanding of psql is
much better than mine.
There is still the issue about unicode characters which have code
points above U00FFFF, which probably should be rejected on the server
side. I have yet to update my patch for that. I'll probably do that
tomorrow, as I don't have more time tonight, but I think this will be
trivial, so maybe Tatsuo can do it, if he has some time before that :)
Rejecting characters over U00FFFF is considered a bug fix, which means we could
fix it after the beta test begins :-)
BTW, have you tried my updates for supporting ISO 8859 characters?
Please let me know if you run into trouble, especially with the "euro". I
don't understand any of these charsets...
--
Tatsuo Ishii
* Bruce Momjian <pgman@candle.pha.pa.us> [011011 22:49]:
Can I ask about the status of this?
I sent a patch a few days ago solving the client-side issue (on
the pgsql-patches mailing list) for review. I think Tatsuo said it
looked OK, but he should confirm or deny this.
OK, I saw the client encoding function appear today. I assume Tatsuo
and friends will finish this up.
Regarding the implementation of Unicode functionality (normalisation,
collation, Unicode-aware regexes, uc/lc/tc (title-case) functions,...)
on the server side, it's definitely something for 7.3 (though it might
be available sooner). It will probably be just a contributed extension
first. I'm currently making an alpha version of the project I'm
working on in order to have sufficient "real-life" Unicode data to
work with, and make sure the design choices make sense :)
If you would like to add some TODO items, please let me know. Good to
document them even if you can't get to them for a while.
BTW, I tried to find web-accessible archives of pgsql-patches. Are
there any, or should each and every discussion be followed up on
pgsql-hackers (even though the description of pgsql-patches includes
discussions on patches) ?
I use:
http://fts.postgresql.org/db/mw/
You can discuss on patches or hackers. I usually do patch discussion on
patches unless I need a larger audience for comments.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
I have committed part of Patrice's patches with minor fixes.
Uncommitted changes are related to the backend side, and the reason
can be found in the previous discussions (basically this is due to
the fact that the current regex code does not support UTF-8 chars >=
0x10000). Instead, pg_verifymbstr() now rejects UTF-8 chars >= 0x10000.
--
Tatsuo Ishii