Unicode combining characters

Started by Patrice Hédé · over 24 years ago · 47 messages · hackers
#1 Patrice Hédé
phede-ml@islande.org

Hi all,

While working on a new project involving PostgreSQL and making some
tests, I have come up with the following output from psql :

 lang | length | length |  text  |  text
------+--------+--------+--------+--------
 isl  |      7 |      6 | álíta  | áleit
 isl  |      7 |      7 | álíta  | álitum
 isl  |      7 |      7 | álíta  | álitið
 isl  |      5 |      4 | maður  | mann
 isl  |      5 |      7 | maður  | mönnum
 isl  |      5 |      5 | maður  | manna
 isl  |      5 |      4 | óska   | -aði

[the misalignment is what I got, it's not a copy-paste error]

This is pasted from a UTF-8 xterm running psql under a UTF-8 locale,
querying a database created with -E UNICODE (by the way, these are
icelandic words :) ).

What you see above is misleading, since it's not possible to see that
'á', 'í', 'ö' and 'ó' are using combining marks, while 'ð' is not.

As a reminder, a combining mark in Unicode means that 'á' is actually
encoded as 'a' followed by U+0301 (the combining acute accent).

Encoded in UTF-8, it's then <61 cc 81> [UTF-16: 0061 0301],
instead of <c3 a1> [UTF-16: 00E1].

The "length" fields are what is returned by length(a.text) and
length(b.text).
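A minimal sketch of the difference being described (Python purely for illustration; the byte sequences are the ones given above, and the word is one of the Icelandic examples):

```python
import unicodedata

# "álíta" built from combining marks: a + U+0301, l, i + U+0301, t, a
decomposed = "a\u0301li\u0301ta"
precomposed = unicodedata.normalize("NFC", decomposed)  # legacy codepoints

print(len(decomposed))    # 7 code points -- what length() reports
print(len(precomposed))   # 5 user-perceived characters

# The first character, encoded in UTF-8:
print(decomposed.encode("utf-8")[:3].hex())   # 61cc81 = a + combining acute
print(precomposed.encode("utf-8")[:2].hex())  # c3a1   = U+00E1
```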

So, this shows two problems :

- length() on the server side doesn't handle Unicode correctly [I have
the same result with char_length()], and returns the number of chars
(as it is however advertised to do), rather than the length of the
string.

- the psql frontend makes the same mistake.

I am using version 7.1.3 (debian sid), so it may have been corrected
in the meantime (in this case, I apologise, but I have only recently
started again to use PostgreSQL and I haven't followed -hackers long
enough).

=> I think fixing psql shouldn't be too complicated, as the glibc
should provide the locale and return the right values (is this
the case ? and what happens for combined latin + chinese characters
for example ? I'll have to try that later). If it's not fixed already,
do you want me to look at this ? [it will take some time, as I haven't
set up any development environment for postgres yet, and I'm away for
one week from Thursday].

=> regarding the backend, it may be more complex, as the underlying
system may not provide any UTF-8 locale to use (!= from being UTF-8
aware : an administrator may have decided that UTF-8 locales are
useless on a server, as only root connections are made, and he wants
only the C locale on the console - I've seen that quite often ;) ).

This brings me to another subject : I will need to support the full
Unicode collation algorithm (UCA, as described in TR#10 [1] of the
Unicode consortium), and I will need to be able to sort according to
locales which may not be installed on the backend server (some of
which may not even be recognised by GNU libc, which already supports
more than 140 locales -- artificial languages would be an example). I
will also need to be able to normalise the unicode strings (TR#15 [2])
so that I don't have some characters in legacy codepoints [as 00E1
above], and others with combining marks.
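For reference, the four normalisation forms of TR#15 behave like this (a Python sketch using the standard unicodedata module, purely for illustration):

```python
import unicodedata

s = "a\u0301"  # 'á' as base letter + combining acute accent
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    result = unicodedata.normalize(form, s)
    print(form, [f"U+{ord(c):04X}" for c in result])
# NFC  -> ['U+00E1']            composed legacy codepoint
# NFD  -> ['U+0061', 'U+0301']  fully decomposed

# NFKC/NFKD additionally fold compatibility characters, e.g. the 'fi'
# ligature U+FB01 becomes the two letters f and i:
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKC", "\ufb01")])
```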

There is today a Perl implementation of the needed functionality,
in Unicode::Collate and Unicode::Normalize (which I haven't tried yet
:( ). But as they are Perl modules, the untrusted version of perl,
plperlu, will be needed, and it's a pity for what I consider core
functionality in the future (not that plperlu isn't a good thing - I
can't wait for it ! - but that an untrusted pl language is needed to
support normalisation and collation).

Note also that there are a lot of data associated with these
algorithms, as you could expect.

I was wondering if some people have already thought about this, or
already done something, or if some of you are interested in this. If
nobody does anything, I'll do something eventually, probably before
Christmas (I don't have much time for this, and I don't need the
functionality right now), but if there is an interest, I could team
with others and develop it faster :)

Anyway, I'm open to suggestions :

- implement it in C, in the core,

- implement it in C, as contributed custom functions,

- implement it in perl (by reusing Unicode:: work), in a trusted plperl,

- implement it in perl, calling Unicode:: modules, in an untrusted
plperl.

and then :

- provide the data in tables (system and/or user) - which should be
available across databases,

- load the data from the original text files provided in Unicode (and
others as needed), if the functionality is compiled into the server,

- I believe the basic unicode information should be standard, and the
locales should be provided as contrib/ files to be plugged in as
needed.

I can't really accept a solution which would rely on the underlying
libc, as it may not provide the necessary locales (or maybe, then,
have a way to override the collating tables by user tables - actually,
this would be certainly the best solution if it's in the core, as the
tables will put an extra burden on the distribution and the
installation footprint, especially if the tables are already there,
for glibc, for perl5.6+, for other software dealing with Unicode).

The main functions I foresee are :

- provide a normalisation function to all 4 forms,

- provide a collation_key(text, language) function, as the calculation
of the key may be expensive, some may want to index on the result (I
would :) ),

- provide a collation algorithm, using the two previous facilities,
which can do primary to tertiary collation (cf TR#10 for a detailed
explanation).
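A toy sketch of the collation_key() idea from TR#10 : NFD-normalise, then build a sort key from all primary weights (base letters) followed by all secondary weights (accents). The weight values below are made up for illustration; a real implementation would use the default collation element table from TR#10 :

```python
import unicodedata

# Toy weight tables (illustrative only; real UCA uses the DUCET of TR#10).
# Icelandic treats ð as a letter of its own, sorting after d.
PRIMARY = {"a": 1, "d": 2, "\u00f0": 3, "e": 4}
SECONDARY = {"\u0301": 1}  # combining acute sorts after the bare letter

def collation_key(text):
    primaries, secondaries = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            secondaries[-1] = SECONDARY.get(ch, 9)
        else:
            primaries.append(PRIMARY.get(ch, 99))
            secondaries.append(0)
    # Primary level first, a separator, then the secondary level:
    return tuple(primaries) + (0,) + tuple(secondaries)

words = ["\u00f0a", "da", "d\u00e1", "\u00e1d"]   # ða, da, dá, ád
print(sorted(words, key=collation_key))           # ['ád', 'da', 'dá', 'ða']
```

Note how "da" < "dá" is decided only at the secondary level, while "dá" < "ða" is decided at the primary level, which is exactly why the key is worth caching or indexing.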

I haven't looked at PostgreSQL code yet (shame !), so I may be
completely off-track, in which case I'll retract myself and won't
bother you again (on that subject, that is ;) )...

Comments ?

Patrice.

[1]: http://www.unicode.org/unicode/reports/tr10/

[2]: http://www.unicode.org/unicode/reports/tr15/

--
Patrice HÉDÉ ------------------------------- patrice @ islande.org -----
-- Isn't it weird how scientists can imagine all the matter of the
universe exploding out of a dot smaller than the head of a pin, but they
can't come up with a more evocative name for it than "The Big Bang" ?
-- What would _you_ call the creation of the universe ?
-- "The HORRENDOUS SPACE KABLOOIE !" - Calvin and Hobbes
------------------------------------------ http://www.islande.org/ -----

#2 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Patrice Hédé (#1)
Re: Unicode combining characters

> So, this shows two problems :
>
> - length() on the server side doesn't handle Unicode correctly [I have
> the same result with char_length()], and returns the number of chars
> (as it is however advertised to do), rather than the length of the
> string.

This is a known limitation.

> - the psql frontend makes the same mistake.
>
> I am using version 7.1.3 (debian sid), so it may have been corrected
> in the meantime (in this case, I apologise, but I have only recently
> started again to use PostgreSQL and I haven't followed -hackers long
> enough).
>
> => I think fixing psql shouldn't be too complicated, as the glibc
> should provide the locale and return the right values (is this
> the case ? and what happens for combined latin + chinese characters
> for example ? I'll have to try that later). If it's not fixed already,
> do you want me to look at this ? [it will take some time, as I haven't
> set up any development environment for postgres yet, and I'm away for
> one week from Thursday].

Sounds great.

> I was wondering if some people have already thought about this, or
> already done something, or if some of you are interested in this. If
> nobody does anything, I'll do something eventually, probably before
> Christmas (I don't have much time for this, and I don't need the
> functionality right now), but if there is an interest, I could team
> with others and develop it faster :)

I'm very interested in your point. I will start studying [1][2] after
the beta freeze.

> Anyway, I'm open to suggestions :
>
> - implement it in C, in the core,
>
> - implement it in C, as contributed custom functions,

This may be a good starting point.

> I can't really accept a solution which would rely on the underlying
> libc, as it may not provide the necessary locales (or maybe, then,

I totally agree here.

> The main functions I foresee are :
>
> - provide a normalisation function to all 4 forms,
>
> - provide a collation_key(text, language) function, as the calculation
> of the key may be expensive, some may want to index on the result (I
> would :) ),
>
> - provide a collation algorithm, using the two previous facilities,
> which can do primary to tertiary collation (cf TR#10 for a detailed
> explanation).
>
> I haven't looked at PostgreSQL code yet (shame !), so I may be
> completely off-track, in which case I'll retract myself and won't
> bother you again (on that subject, that is ;) )...
>
> Comments ?

--
Tatsuo Ishii

#3 Oleg Bartunov
oleg@sai.msu.su
In reply to: Patrice Hédé (#1)
Re: Unicode combining characters

Looks like a good project for 7.3.
Probably the best starting point would be to develop contrib/unicode
with a smooth transition to core.

Oleg
On Mon, 24 Sep 2001, Patrice Hédé wrote:


Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

#4 Patrice Hédé
phede-ml@islande.org
In reply to: Tatsuo Ishii (#2)
Re: Unicode combining characters

Hi,

* Tatsuo Ishii <t-ishii@sra.co.jp> [010925 18:18]:

>> So, this shows two problems :
>>
>> - length() on the server side doesn't handle Unicode correctly [I
>> have the same result with char_length()], and returns the number
>> of chars (as it is however advertised to do), rather than the length
>> of the string.
>
> This is a known limitation.

To solve this, we could use wcwidth() (there is a custom
implementation for the systems which don't have it in the glibc). I'll
have a look at it later.
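A sketch of the wcwidth()-style counting being proposed (a Python stand-in with a simplified width rule, since the real fix would call the C function; the real wcwidth() table has more cases):

```python
import unicodedata

def display_width(s):
    """Rough wcswidth()-style column count: combining marks occupy no
    column, East Asian wide/fullwidth characters occupy two."""
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue  # combining mark: zero columns
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

print(display_width("a\u0301li\u0301ta"))  # 5 columns, although 7 code points
print(display_width("\u6f22\u5b57"))       # 4 columns for two CJK characters
```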

>> - the psql frontend makes the same mistake.

Same thing here.

I have just installed the CVS and downloaded the development version
(thanks Baldvin), tested that the stock version compiles fine, and
I'll now have a look at how to make this work. :) I'll send a patch
when I have this working here.

> Sounds great.

[Unicode normalisation and collation in the backend]

> I'm very interested in your point. I will start studying [1][2] after
> the beta freeze.

>> Anyway, I'm open to suggestions :
>>
>> - implement it in C, in the core,
>>
>> - implement it in C, as contributed custom functions,
>
> This may be a good starting point.

>> I can't really accept a solution which would rely on the underlying
>> libc, as it may not provide the necessary locales (or maybe, then,
>
> I totally agree here.

As Oleg suggested, I will try to aim for 7.3, first with a version in
contrib, and later, if the implementation is fine, it could be moved
to the core (or not ? Though it would be nice to make sure every
PostgreSQL installation which supports unicode has it, so that users
won't need to have administrative rights to use the functionality).

I think I will go for a C version, and probably the collation and
normalisation data in tables, with some way to override the defaults
with secondary tables... I'll report as soon as I have something +/-
working.


Patrice.


#5 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Patrice Hédé (#4)
Re: Unicode combining characters

>>> - length() on the server side doesn't handle Unicode correctly [I
>>> have the same result with char_length()], and returns the number
>>> of chars (as it is however advertised to do), rather than the length
>>> of the string.
>>
>> This is a known limitation.
>
> To solve this, we could use wcwidth() (there is a custom
> implementation for the systems which don't have it in the glibc). I'll
> have a look at it later.

And wcwidth() depends on the locale. That is another reason we
could not use it.

> As Oleg suggested, I will try to aim for 7.3, first with a version in
> contrib, and later, if the implementation is fine, it could be moved
> to the core (or not ? Though it would be nice to make sure every
> PostgreSQL installation which supports unicode has it, so that users
> won't need to have administrative rights to use the functionality).

I would like to see SQL99's charset and collate functionality for 7.3 (or
later). If this happens, the current multibyte implementation would be
dramatically changed. That would be a good time to merge your
Unicode stuff into the main source tree.
--
Tatsuo Ishii

#6 Thomas Lockhart
lockhart@fourpalms.org
In reply to: Patrice Hédé (#1)
Re: Unicode combining characters

> I would like to see SQL99's charset and collate functionality for 7.3 (or
> later). If this happens, the current multibyte implementation would be
> dramatically changed...

I'm *still* interested in working on this (an old story I know). I'm
working on date/time stuff for 7.2, but hopefully 7.3 will see some
advances in the SQL99 direction on charset etc.

- Thomas

#7 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Thomas Lockhart (#6)
Re: Unicode combining characters

>> I would like to see SQL99's charset and collate functionality for 7.3 (or
>> later). If this happens, the current multibyte implementation would be
>> dramatically changed...
>
> I'm *still* interested in working on this (an old story I know). I'm
> working on date/time stuff for 7.2, but hopefully 7.3 will see some
> advances in the SQL99 direction on charset etc.

BTW, I see "CHARACTER SET" in gram.y. Does current already support
that syntax?
--
Tatsuo Ishii

#8 Thomas Lockhart
lockhart@fourpalms.org
In reply to: Patrice Hédé (#4)
Re: Unicode combining characters

> BTW, I see "CHARACTER SET" in gram.y. Does current already support
> that syntax?

Yes and no. gram.y knows about CHARACTER SET, but only for the long
form, the clause is in the wrong position (it precedes the length
specification), and it does not do much useful (it generates a data type
based on the character set name which does not get recognized farther
back). Examples:

thomas=# create table t1 (c varchar(20) character set sql_ascii);
ERROR: parser: parse error at or near "character"
thomas=# create table t1 (c character varying character set sql_ascii
(20));
ERROR: Unable to locate type name 'varsql_ascii' in catalog

I'm pretty sure I'll get shift/reduce troubles when trying to move that
clause to *after* the length specifier. I'll try to do something with
the syntax for 7.2 once I've finished the date/time stuff.

- Thomas

#9 Bruce Momjian
bruce@momjian.us
In reply to: Tatsuo Ishii (#2)
Re: Unicode combining characters

Can someone give me TODO items for this discussion?


-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#10 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Bruce Momjian (#9)
Re: Unicode combining characters

> Can someone give me TODO items for this discussion?

What about:

Improve Unicode combined character handling
--
Tatsuo Ishii



#11 Bruce Momjian
bruce@momjian.us
In reply to: Tatsuo Ishii (#10)
Re: Unicode combining characters

>> Can someone give me TODO items for this discussion?
>
> What about:
>
> Improve Unicode combined character handling

Done. I can't update the web version because I don't have permission.

Also, have we decided if multibyte should be the configure default now?

#12 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Bruce Momjian (#11)
Re: Unicode combining characters

> Also, have we decided if multibyte should be the configure default now?

Not sure.

Anyway, I have run a LIKE/REGEX query test using current. The queries
executed were:

explain analyze select '0000000 5089 474e...( 16475
bytes long text containing only 0-9a-z chars) like 'aaa';

and

explain analyze select '0000000 5089 474e...( 16475
bytes long text containing only 0-9a-z chars) ~ 'aaa';

Here is the result:

          no MB       with MB
LIKE      0.09 msec   0.08 msec
REGEX     0.09 msec   0.10 msec

LIKE with MB seemed to be reasonably fast, but REGEX with MB seemed a
little bit slow. Probably this is due to the wide character conversion
overhead.
--
Tatsuo Ishii

#13 Bruce Momjian
bruce@momjian.us
In reply to: Tatsuo Ishii (#12)
Re: Unicode combining characters

If no one can find a case where multibyte is slower, I think we should
enable it by default. Comments?

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#13)
Re: Unicode combining characters

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> If no one can find a case where multibyte is slower, I think we should
> enable it by default. Comments?

Well, he just did point out such a case:

          no MB       with MB
LIKE      0.09 msec   0.08 msec
REGEX     0.09 msec   0.10 msec

But I agree with your conclusion. If the worst penalty we can find is
that a regex comparison operator is 10% slower, we may as well turn it
on by default. Most people will never notice the difference, and anyone
who really cares can always turn it off again.

regards, tom lane

#15 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#14)
Re: Unicode combining characters

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>
>> If no one can find a case where multibyte is slower, I think we should
>> enable it by default. Comments?
>
> Well, he just did point out such a case:
>
>           no MB       with MB
> LIKE      0.09 msec   0.08 msec
> REGEX     0.09 msec   0.10 msec
>
> But I agree with your conclusion. If the worst penalty we can find is
> that a regex comparison operator is 10% slower, we may as well turn it
> on by default. Most people will never notice the difference, and anyone
> who really cares can always turn it off again.

But the strange thing is that LIKE is faster, perhaps meaning his
measurements can't even see the difference. Or is it because the LIKE
optimization is off for multibyte?

#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#15)
Re: Unicode combining characters

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> But the strange thing is that LIKE is faster, perhaps meaning his
> measurements can't even see the difference,

Yeah, I suspect there's 10% or more noise in these numbers. But then
one could read the results as saying we can't reliably measure any
difference at all ...

I'd feel more confident if the measurements were done using operators
repeated enough times to yield multiple-second runtimes. I don't
trust fractional-second time measurements on Unix boxen; too much chance
of bogus results due to activity of other processes.

regards, tom lane

#17Peter Eisentraut
peter_e@gmx.net
In reply to: Tatsuo Ishii (#12)
Re: Unicode combining characters

Tatsuo Ishii writes:

LIKE with MB seemed to be reasonably fast, but REGEX with MB seemed a
little bit slow. Probably this is due to the wide-character conversion
overhead.

Could this conversion be optimized to recognize when it's dealing with a
single-byte character encoding?

--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter

#18Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#16)
Re: Unicode combining characters

Yeah, I suspect there's 10% or more noise in these numbers. But then
one could read the results as saying we can't reliably measure any
difference at all ...

I'd feel more confident if the measurements were done using operators
repeated enough times to yield multiple-second runtimes. I don't
trust fractional-second time measurements on Unix boxen; too much chance
of bogus results due to activity of other processes.

Any idea how to do that? I tried to do measurements using something like
"SELECT * FROM t1 WHERE very-long-string-column LIKE 'aaa'", but I'm
afraid the I/O time masks the difference...
--
Tatsuo Ishii

#19Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Peter Eisentraut (#17)
Re: Unicode combining characters

LIKE with MB seemed to be reasonably fast, but REGEX with MB seemed a
little bit slow. Probably this is due to the wide-character conversion
overhead.

Could this conversion be optimized to recognize when it's dealing with a
single-byte character encoding?

Not sure, will look into...
--
Tatsuo Ishii

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#18)
Re: Unicode combining characters

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

I'd feel more confident if the measurements were done using operators
repeated enough times to yield multiple-second runtimes.

Any idea to do that?

Maybe something like this: declare a plpgsql function that takes two
text parameters and has a body like

for (i = 0 to a million)
boolvar := $1 like $2;

Then call it with strings of different lengths and see how the runtime
varies. You need to apply the LIKE to function parameters, else the
system will probably collapse the LIKE operation to a constant...

regards, tom lane
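
[Editor's note: Tom's pseudocode above could be fleshed out roughly as
follows. This is only a sketch; the function name, the million-iteration
count, and the test strings are arbitrary choices, and it uses the
quoted-body and ALIAS parameter syntax of PL/pgSQL from that era.]

```sql
-- Hypothetical benchmark function: repeats the LIKE a million times so
-- the per-operator cost dominates any timing noise.  Applying LIKE to
-- the function's parameters keeps the system from folding the
-- expression to a constant.
CREATE FUNCTION like_bench(text, text) RETURNS boolean AS '
DECLARE
    a ALIAS FOR $1;
    b ALIAS FOR $2;
    result boolean;
BEGIN
    FOR i IN 1..1000000 LOOP
        result := a LIKE b;
    END LOOP;
    RETURN result;
END;
' LANGUAGE 'plpgsql';

-- Then time calls with strings of different lengths, e.g.:
-- SELECT like_bench('aaaaaaaaaaaaaaaaaaaafoo', '%foo');
```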

#21Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#20)
#22Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tatsuo Ishii (#21)
#23Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Zeugswetter Andreas SB SD (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#23)
#25Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#24)
#26Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#24)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#26)
#28Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#27)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#28)
#30Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#24)
#31Bruce Momjian
bruce@momjian.us
In reply to: Peter Eisentraut (#30)
#32Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tatsuo Ishii (#23)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#32)
#34Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#33)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#34)
#36Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#35)
#37Bruce Momjian
bruce@momjian.us
In reply to: Tatsuo Ishii (#32)
#38Patrice Hédé
phede-ml@islande.org
In reply to: Tatsuo Ishii (#23)
#39Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Patrice Hédé (#38)
#40Patrice Hédé
phede-ml@islande.org
In reply to: Tatsuo Ishii (#39)
#41Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Patrice Hédé (#40)
#42Patrice Hédé
phede-ml@islande.org
In reply to: Tatsuo Ishii (#41)
#43Bruce Momjian
bruce@momjian.us
In reply to: Patrice Hédé (#42)
#44Patrice Hédé
phede-ml@islande.org
In reply to: Bruce Momjian (#43)
#45Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Patrice Hédé (#44)
#46Bruce Momjian
bruce@momjian.us
In reply to: Tatsuo Ishii (#45)
#47Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Patrice Hédé (#38)