Patch for collation using ICU
Hi!
I've put together a patch for using IBM's ICU package for collation.
If your OS does not have full support for collation or uppercase/lowercase
in multibyte locales, this might be useful. If you are using a multibyte
character encoding in your database and want collation, i.e. order by, and
also lower(), upper() and initcap() to work properly, this patch will do
just that.
This patch is needed for FreeBSD, since this OS has no support for
collation in, for example, Unicode locales (that is, wcscoll(3) does not do
what you expect if you set LC_ALL=sv_SE.UTF-8). AFAIK the
patch is *not* necessary for Linux, although IBM claims ICU collation to be
about twice as fast as glibc for simple western locales.
It adds a configure switch, `--with-icu', which will set up the code to use
ICU instead of wchar_t and wcscoll.
This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it
seems to run well. I've not had the time to do any comparative performance
tests yet, but it seems it is at least not slower than using LATIN1 with
sv_SE.ISO8859-1 locale, perhaps even faster.
I'd be delighted if some more experienced postgresql hackers would review
this stuff. The patch is pretty compact, so it's fast reading :) I'm
planning to add this patch as an option (tagged "experimental") to
FreeBSD's postgresql port. Any ideas about whether this is a good idea or
not?
Any thoughts or ideas are welcome!
Cheers,
Palle
Patch at:
<http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14.diff>
ICU at sourceforge: <http://icu.sf.net/>
Useful if it's going to support earlier releases of ICU....
Not all OSes come with ICU 3.2; Debian, for example, currently has 2.1 in
testing and 2.6 in unstable.
... John
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of
Palle Girgensohn
Sent: Friday, March 25, 2005 10:40 AM
To: pgsql-hackers@postgresql.org
Subject: [HACKERS] Patch for collation using ICU
--On Friday, March 25, 2005 16.34.41 +1100 John Hansen <john@geeknet.com.au>
wrote:
Useful if it's going to support earlier releases of ICU....
Not all OSes come with ICU 3.2; Debian, for example, currently has 2.1 in
testing and 2.6 in unstable.
Oh, OK. FreeBSD has only 3.2 as a port. I can check the older versions; I
doubt it would make too much difference. Some autoconf sorcery needed, perhaps.
/Palle
Naww, it's no biggie, we'll just need to include ICU with pg I think.
I tried that; several ICU functions that you use are not in ICU 2.1.
Dunno about 2.6.
However, ICU 3.2 compiles on Debian with a small change to the
debian/rules file:
debian/tmp/etc is missing, so add mkdir debian/tmp/etc
... John
OK, tested on Debian sarge with ICU 3.2:
UNICODE database, C locale.
upper() and lower() return an empty string for any input, including
7-bit ASCII, regardless of client_encoding, so something is obviously
broken.
Have you tested this patch on a UNICODE DB with locale C/POSIX?
... John
--On Friday, March 25, 2005 23.39.33 +1100 John Hansen <john@geeknet.com.au>
wrote:
OK, tested on Debian sarge with ICU 3.2:
UNICODE database, C locale.
upper() and lower() return an empty string for any input, including
7-bit ASCII, regardless of client_encoding, so something is obviously
broken.
Have you tested this patch on a UNICODE DB with locale C/POSIX?
No, honestly not. I mostly tested it for my own needs: sv_SE.UTF-8 and
UNICODE, and also de_DE.UTF-8.
How will PostgreSQL react to this combo? A database cluster initdb'ed with
locale=C/POSIX, and then a database in UNICODE (really UTF-8)
representation... hmm... I think I might have made a false assumption that
the locale string would contain the character encoding. I do something like
encoding = strchr(locale, '.') + 1... That code will be confused by a 'C'
locale, indeed. I'll check it out!
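To illustrate why the strchr() approach breaks: for a locale name without a '.', such as "C" or "POSIX", strchr() returns NULL, and adding 1 to that yields an invalid pointer. The following is only a minimal sketch of a guarded version, not code from the patch; the function name locale_codeset is hypothetical, and a modifier such as "@euro" is deliberately left unhandled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not taken from the patch): return the codeset part
 * of a locale name such as "sv_SE.UTF-8", or NULL when the name carries no
 * codeset, as with "C" and "POSIX". A trailing modifier like "@euro" is
 * not stripped.
 */
const char *
locale_codeset(const char *locale)
{
    const char *dot;

    assert(locale != NULL);

    dot = strchr(locale, '.');
    if (dot == NULL || dot[1] == '\0')
        return NULL;    /* no encoding in the name; caller must fall back */
    return dot + 1;
}
```

A caller would still need a fallback for the NULL case, so this only avoids the crash; it does not answer which encoding the database actually uses.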
/Palle
Palle Girgensohn <girgen@pingpong.net> writes:
hmm... I think I might have made a false assumption that
the locale string would contain the character encoding.
You certainly cannot assume that. Would that it were so easy to find
out the character set for a locale :-(.
There's some code in initdb that you might emulate, though I can't say
that I trust it a whole lot.
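One way to ask the C library for a locale's character set, rather than parsing the locale name, is the POSIX call nl_langinfo(CODESET). The sketch below assumes a platform that provides <langinfo.h>; whether it matches the initdb code mentioned above is an assumption, and the function name codeset_for_locale is hypothetical.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: select the given locale for LC_CTYPE and return the
 * codeset name the C library reports for it. Returns NULL if the locale
 * cannot be selected. The returned string is owned by the C library and
 * may be overwritten by later calls. Note that this changes the
 * process-wide LC_CTYPE setting.
 */
const char *
codeset_for_locale(const char *locale)
{
    assert(locale != NULL);

    if (setlocale(LC_CTYPE, locale) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

The names returned differ between platforms (the C locale alone is reported under several spellings), so the result would still have to be mapped onto a known encoding, which may be part of why this approach is hard to trust.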
regards, tom lane
Tom Lane wrote:
Palle Girgensohn <girgen@pingpong.net> writes:
hmm... I think I might have made a false assumption that
the locale string would contain the character encoding.
You certainly cannot assume that. Would that it were so easy to find
out the character set for a locale :-(.
There's some code in initdb that you might emulate, though I can't say
that I trust it a whole lot.
Me either. On FC3 I saw this the other day, haven't had time to dig into
what's going on:
[andrew inst]$ echo $LANG
en_US.UTF-8
[andrew inst]$ bin/initdb floobl
The files belonging to this database system will be owned by user "andrew".
This user must also own the server process.
The database cluster will be initialized with locale en_US.UTF-8.
initdb: could not find suitable encoding for locale "en_US.UTF-8"
Rerun initdb with the -E option.
Try "initdb --help" for more information.
cheers
andrew
--On Friday, March 25, 2005 09.53.38 -0500 Tom Lane <tgl@sss.pgh.pa.us>
wrote:
Palle Girgensohn <girgen@pingpong.net> writes:
hmm... I think I might have made a false assumption that
the locale string would contain the character encoding.
You certainly cannot assume that. Would that it were so easy to find
out the character set for a locale :-(.
There's some code in initdb that you might emulate, though I can't say
that I trust it a whole lot.
The best thing would of course be to check the encoding of the database. I
realized I've made a bad assumption that the initdb locale has anything to
do with the character encoding of the database - it doesn't. There must be
some existing, trusted way to get the character encoding of a
database?
/Palle
--On Friday, March 25, 2005 23.39.33 +1100 John Hansen <john@geeknet.com.au>
wrote:
OK, tested on Debian sarge with ICU 3.2:
UNICODE database, C locale.
upper() and lower() return an empty string for any input, including
7-bit ASCII, regardless of client_encoding, so something is obviously
broken.
Have you tested this patch on a UNICODE DB with locale C/POSIX?
FYI, I also found that initdb crashes with error 139 on any locale other
than C/POSIX.
... John
Hi!
There's a new patch to fix some reported problems.
<http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-26.diff>
This version uses the DatabaseEncoding and sets the ICU encoding at the
same time. I had to create a conversion table from PostgreSQL's own,
somewhat odd and non-standard, names of encodings into the preferred IANA
names. One or two of the odder ones might be slightly incorrect,
hopefully not too far off anyway?
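As an illustration of what such a conversion table could look like, here is a minimal sketch. It is not the table from the patch: the struct, the function name iana_encoding_name and the handful of entries are assumptions chosen for the example, and a real table would need to cover every encoding the server supports.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative subset only; the actual table in the patch is not shown here. */
struct encoding_alias
{
    const char *pg_name;    /* name as PostgreSQL spells it */
    const char *iana_name;  /* name handed to the conversion library */
};

static const struct encoding_alias encoding_aliases[] = {
    {"UNICODE", "UTF-8"},
    {"LATIN1", "ISO-8859-1"},
    {"LATIN2", "ISO-8859-2"},
    {"EUC_JP", "EUC-JP"},
    {"EUC_KR", "EUC-KR"},
};

/*
 * Hypothetical lookup: return the IANA-style name for a PostgreSQL
 * encoding name, or NULL if the table has no entry for it.
 */
const char *
iana_encoding_name(const char *pg_name)
{
    size_t i;

    assert(pg_name != NULL);

    for (i = 0; i < sizeof(encoding_aliases) / sizeof(encoding_aliases[0]); i++)
    {
        if (strcmp(encoding_aliases[i].pg_name, pg_name) == 0)
            return encoding_aliases[i].iana_name;
    }
    return NULL;
}
```

Returning NULL for unknown names lets the caller fall back to the non-ICU code path instead of guessing an encoding.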
I've noticed a couple of things about using the ICU patch vs. pristine
pg-8.0.1:
- ORDER BY is case insensitive when using ICU. This might break the SQL
standard (?), but sure is nice :)
- When the database is initialized using the C locale, upper() and lower()
normally do not work at all for non-ASCII characters, even if the
database's encoding is, say, LATIN1 or UNICODE. (They do not work for me
anyway, on FreeBSD, and this is probably correct since the locale is still
`C', I believe?) The ICU patch changes nothing for the LATIN1 case, since it
does not act on single-byte encodings, but for the UNICODE representation it
works and does what I expect it to: upper() and lower() neatly upper- or
lowercase characters with diacritics, e.g. lower('ÅÄÖ') -> 'åäö'.
This is a good thing, although I'm surprised that upper/lower is dragged
along with the LC_COLLATE fixation at initdb. I never run initdb in the C
locale, but only now do I realize how broken that really is if you need to
store anything other than English :-)
I'd be delighted to get more feedback about this stuff.
Thanks,
Palle
--On Saturday, March 26, 2005 10.42.19 +1100 John Hansen <john@geeknet.com.au>
wrote:
FYI, I also found that initdb crashes with error 139 on any locale other
than C/POSIX.
Odd, not for me, but I did make a bad assumption about character encoding.
Perhaps the new patch will help? (see previous mail)
What is error 139, anyone?
/Palle
-----Original Message-----
From: Palle Girgensohn [mailto:girgen@pingpong.net]
Sent: Saturday, March 26, 2005 1:10 PM
To: pgsql-hackers@postgresql.org
Cc: John Hansen; Andrew Dunstan
Subject: Re: [HACKERS] Patch for collation using ICU
I've noticed a couple of things about using the ICU patch vs. pristine
pg-8.0.1:
- ORDER BY is case insensitive when using ICU. This might break the SQL
standard (?), but sure is nice :)
This would mean that indexes are also case insensitive right?
Which makes it a Bad Thing(tm).
- When the database is initialized using the C locale, upper() and lower()
normally do not work at all for non-ASCII characters, even if the
database's encoding is, say, LATIN1 or UNICODE. (They do not work for me
anyway, on FreeBSD, and this is probably correct since the locale is still
`C', I believe?) The ICU patch changes nothing for the LATIN1 case, since it
does not act on single-byte encodings, but for the UNICODE representation it
works and does what I expect it to: upper() and lower() neatly upper- or
lowercase characters with diacritics, e.g. lower('ÅÄÖ') -> 'åäö'.
This is a good thing, although I'm surprised that upper/lower is dragged
along with the LC_COLLATE fixation at initdb. I never run initdb in the C
locale, but only now do I realize how broken that really is if you need to
store anything other than English :-)
That is what I would have expected. However, it probably won't work for the
more exotic cases, like the Turkish I, which depend on the locale.
Still doesn't work for me :(
UNICODE DB
C locale
set client_encoding = iso88591;
select upper('æ');
upper
-------
(1 row)
Trying to initdb with en_IN.utf8
/usr/lib/postgresql/bin/initdb -D /var/lib/postgres/data/ -E UNICODE --locale=en_IN.utf8
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale en_IN.utf8.
fixing permissions on existing directory /var/lib/postgres/data ... ok
creating directory /var/lib/postgres/data/global ... ok
creating directory /var/lib/postgres/data/pg_xlog ... ok
creating directory /var/lib/postgres/data/pg_xlog/archive_status ... ok
creating directory /var/lib/postgres/data/pg_clog ... ok
creating directory /var/lib/postgres/data/pg_subtrans ... ok
creating directory /var/lib/postgres/data/base ... ok
creating directory /var/lib/postgres/data/base/1 ... ok
creating directory /var/lib/postgres/data/pg_tblspc ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 1000
creating configuration files ... ok
creating template1 database in /var/lib/postgres/data/base/1 ... ok
initializing pg_shadow ... ok
enabling unlimited row size for system tables ... ok
initializing pg_depend ... ok
creating system views ... ok
loading pg_description ... ok
creating conversions ... ok
setting privileges on built-in objects ... ok
creating information schema ... child process exited with exit code 139
initdb: removing contents of data directory "/var/lib/postgres/data"
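On the exit code 139 in the log above: by shell convention, a command killed by signal N is reported with status 128 + N, so 139 would correspond to signal 11, which is SIGSEGV on Linux and the BSDs. The sketch below only shows that arithmetic; that initdb ran the failing child through a shell here is an assumption, and the function name is hypothetical.

```c
#include <assert.h>
#include <signal.h>

/*
 * Hypothetical helper: interpret a shell-style exit status. A status above
 * 128 conventionally means "terminated by signal (status - 128)". Returns
 * the signal number, or 0 if the status looks like an ordinary exit code.
 */
int
signal_from_shell_status(int status)
{
    assert(status >= 0 && status <= 255);

    return status > 128 ? status - 128 : 0;
}
```

Read this way, the failure during "creating information schema" looks like a segmentation fault in the backend rather than an ordinary error exit.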
... John
Show quoted text
-----Original Message-----
From: Palle Girgensohn [mailto:girgen@pingpong.net]
Sent: Saturday, March 26, 2005 1:10 PM
To: pgsql-hackers@postgresql.org
Cc: John Hansen; Andrew Dunstan
Subject: Re: [HACKERS] Patch for collation using ICU
Hi!
There's a new patch to fix some reported problems.
<http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-26.diff>
This version uses the DatabaseEncoding and sets the ICU
encoding at the same time. I had to create a conversion table
from PostgreSQL's own, somewhat odd and non-standard, names
of encodings, into the preferred IANA names. One or two of the
more odd ones might be slightly incorrect, hopefully not too
far off anyway?
I've noticed a couple of things about using the ICU patch vs. pristine
pg-8.0.1:
- ORDER BY is case insensitive when using ICU. This might
break the SQL standard (?), but sure is nice :)
- When the database is initialized using the C locale,
upper() and lower() normally do not work at all for
non-ASCII characters even if the database's encoding is say
LATIN1 or UNICODE. (does not work for me anyway, on FreeBSD,
and this is probably correct since the locale is still `C', I
believe?). The ICU patch changes nothing for the LATIN1 case,
since it does not act on single byte encodings, but for the
UNICODE representation, it works and does what I expect it
to, namely upper() and lower() neatly
upper- or lowercase diacritical characters, i.e. lower('ÅÄÖ')
-> 'åäö'.
This is a good thing, although I'm surprised that upper/lower
is dragged along with the LC_COLLATE fixation at initdb. I
never run initdb in the C locale, but only now do I realize
how broken that really is if you need to store anything other
than English :-)
I'd be delighted to get more feedback about this stuff.
Thanks,
Palle
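The upper()/lower() difference described in this message can be illustrated outside the database. The following is only a sketch in Python, not code from the patch: `ascii_only_lower` is a hypothetical helper that mimics a C-locale style mapping touching only A-Z, while Python's built-in `str.lower()` stands in for a Unicode-aware mapping such as the one ICU provides.

```python
def ascii_only_lower(text: str) -> str:
    """Lowercase only A-Z and leave every other character untouched.

    Hypothetical stand-in for what a C-locale tolower() achieves.
    """
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in text
    )


# Escapes are used so the sample is unambiguously precomposed: "ÅÄÖ ABC".
sample = "\u00c5\u00c4\u00d6 ABC"

c_locale_result = ascii_only_lower(sample)  # non-ASCII letters are unchanged
unicode_result = sample.lower()             # Unicode-aware case mapping

print(c_locale_result)
print(unicode_result)
```

The first result keeps ÅÄÖ in upper case, which is the behaviour reported for a C-locale cluster; the second lowercases all six letters.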
On Sat, 26 Mar 2005, Palle Girgensohn wrote:
I've noticed a couple of things about using the ICU patch vs. pristine
pg-8.0.1:
- ORDER BY is case insensitive when using ICU. This might break the SQL
standard (?), but sure is nice :)
Err, I think if your system implements strcoll correctly 8.0.1 can do this
if the chosen collation is set up that way (or at least naive tests I've
done seem to imply that). Or are you speaking about C locale?
--On Saturday, March 26, 2005 08.16.01 -0800 Stephan Szabo
<sszabo@megazone.bigpanda.com> wrote:
On Sat, 26 Mar 2005, Palle Girgensohn wrote:
I've noticed a couple of things about using the ICU patch vs. pristine
pg-8.0.1:
- ORDER BY is case insensitive when using ICU. This might break the SQL
standard (?), but sure is nice :)
Err, I think if your system implements strcoll correctly 8.0.1 can do this
if the chosen collation is set up that way (or at least naive tests I've
done seem to imply that). Or are you speaking about C locale?
No, I doubt this.
Example: set up a cluster:
$ initdb -E LATIN1 --locale=sv_SE.ISO8859-1
$ createdb foo
CREATE DATABASE
$ psql foo
foo=# create table bar (val text);
CREATE TABLE
foo=# insert into bar values ('aaa');
INSERT 18354409 1
foo=# insert into bar values ('BBB');
INSERT 18354412 1
foo=# select val from bar order by val;
val
-----
BBB
aaa
(2 rows)
Order by is not case insensitive. It shouldn't be for any system, AFAIK. As
John Hansen noted, this might be a bad thing. I'm not sure about that,
though...
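The ordering in the psql session above is what a plain code point comparison produces, since every ASCII capital sorts before every lowercase letter. As a rough sketch only (Python's built-in sort, not strcoll or ICU), the two candidate orderings can be reproduced like this; `str.casefold` is used here as a simple stand-in for a collation that compares letters before case.

```python
values = ["aaa", "BBB"]

# Code point order: "B" (0x42) sorts before "a" (0x61).
codepoint_order = sorted(values)

# Letters first, case ignored: a crude approximation of a
# dictionary-style collation.
caseless_order = sorted(values, key=str.casefold)

print(codepoint_order)
print(caseless_order)
```

Which of the two a given PostgreSQL cluster shows depends on what the platform's strcoll does for the locale chosen at initdb.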
As for general collation of unicode, the reason for me to use ICU is that
my system does not support strcoll correctly for multibyte locales, as I
mentioned earlier. I also noted that even for systems that do handle
strcoll correctly for unicode, IBM claims ICU collation to be about twice
as fast as glibc, so this patch might be useful for other systems (read
Linux) as well. See previous emails for details.
Regards,
Palle
On Sat, 2005-03-26 at 03:09 +0100, Palle Girgensohn wrote:
Hi!
...
I've noticed a couple of things about using the ICU patch vs. pristine
pg-8.0.1:
- ORDER BY is case insensitive when using ICU. This might break the SQL
standard (?), but sure is nice :)
How does your patch interact with the ability to use indexes for
anchored LIKE or regex (i.e. can "name LIKE 'start%'" still use index) ?
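For context on why the sort order matters for anchored LIKE: a prefix match can be answered from an ordered index as a range scan only if all strings sharing the prefix sit next to each other in the order the index uses. The sketch below is an illustration in Python under that assumption, not PostgreSQL's planner logic; `prefix_range_scan` is a hypothetical helper and the sorted list stands in for a btree.

```python
import bisect


def prefix_range_scan(sorted_values, prefix):
    """Return entries starting with `prefix` from a list sorted in code
    point order, using two binary searches instead of a full scan."""
    lo = bisect.bisect_left(sorted_values, prefix)
    # U+10FFFF is the highest code point, so this upper bound sorts after
    # any ordinary string that begins with the prefix.
    hi = bisect.bisect_right(sorted_values, prefix + "\U0010ffff")
    return sorted_values[lo:hi]


names = ["Start", "Zoo", "apple", "start", "started", "starting", "zebra"]

# Index kept in code point order: the range scan is valid.
index = sorted(names)
matches = prefix_range_scan(index, "start")
print(matches)

# An index ordered by a caseless key is no longer in code point order, so
# the bounds above cannot be assumed to delimit the matches.
caseless_index = sorted(names, key=str.casefold)
print(caseless_index != index)
```

Whether the patch keeps such a scan usable therefore depends on how the range bounds are derived under the ICU ordering, which is the question raised here.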
--
Hannu Krosing <hannu@tm.ee>
On Sun, 27 Mar 2005, Palle Girgensohn wrote:
--On Saturday, March 26, 2005 08.16.01 -0800 Stephan Szabo
<sszabo@megazone.bigpanda.com> wrote:
On Sat, 26 Mar 2005, Palle Girgensohn wrote:
I've noticed a couple of things about using the ICU patch vs. pristine
pg-8.0.1:
- ORDER BY is case insensitive when using ICU. This might break the SQL
standard (?), but sure is nice :)
Err, I think if your system implements strcoll correctly 8.0.1 can do this
if the chosen collation is set up that way (or at least naive tests I've
done seem to imply that). Or are you speaking about C locale?
No, I doubt this.
Example: set up a cluster:
$ initdb -E LATIN1 --locale=sv_SE.ISO8859-1
$ createdb foo
CREATE DATABASE
$ psql foo
foo=# create table bar (val text);
CREATE TABLE
foo=# insert into bar values ('aaa');
INSERT 18354409 1
foo=# insert into bar values ('BBB');
INSERT 18354412 1
foo=# select val from bar order by val;
val
-----
BBB
aaa
(2 rows)
Order by is not case insensitive. It shouldn't be for any system, AFAIK.
It is on my machine... for the same test:
foo=# select val from bar order by val;
val
-----
aaa
BBB
(2 rows)
I think this just implies even greater breakage of either the collation or
strcoll on the system you're trying on. ;) Which, of course, is a fairly
reasonable reason to offer an alternative. Especially if it's generically
useful.
--On Saturday, March 26, 2005 13.59.19 +1100 John Hansen <john@geeknet.com.au>
wrote:
- ORDER BY is case insensitive when using ICU. This might
break the SQL standard (?), but sure is nice :)
This would mean that indexes are also case insensitive right?
Which makes it a Bad Thing(tm).
Well, no, not really. Indices use collation rules, yes, but upper and lower
case strings are not considered *equal*, just "closer related". In
collation, characters are compared at four levels. See [1] for a good
explanation. This means that indices will use a case insensitive sort
order, but equality will not be different, so it shouldn't break anything.
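The distinction between "sorts together" and "is equal" can be sketched with a two-part sort key. This is a simplified model in Python, not ICU's algorithm: it uses only two levels instead of ICU's several, and `collation_key` is a hypothetical helper. The first part compares letters with case ignored; the second part only breaks ties, placing lowercase before uppercase.

```python
def collation_key(s: str):
    """Two-level key: letters ignoring case, then case as a tie-breaker."""
    primary = s.casefold()
    # False sorts before True, so lowercase comes before uppercase.
    case_level = [ch.isupper() for ch in s]
    return (primary, case_level)


words = ["b", "A", "a", "B"]
ordered = sorted(words, key=collation_key)
print(ordered)

# The keys still differ, so "a" and "A" are never treated as equal.
print(collation_key("a") == collation_key("A"))
```

With such a key the sort order interleaves upper and lower case, while a unique index or an equality lookup would still tell 'a' and 'A' apart.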
- When the database is initialized using the C locale,
upper() and lower() normally do not work at all for
non-ASCII characters even if the database's encoding is say
LATIN1 or UNICODE. (does not work for me anyway, on FreeBSD,
and this is probably correct since the locale is still `C', I
believe?). The ICU patch changes nothing for the LATIN1 case,
since it does not act on single byte encodings, but for the
UNICODE representation, it works and does what I expect it
to, namely upper() and lower() neatly
upper- or lowercase diacritical characters, i.e. lower('ÅÄÖ')
-> 'åäö'.
This is a good thing, although I'm surprised that upper/lower
is dragged along with the LC_COLLATE fixation at initdb. I
never run initdb in the C locale, but only now do I realize
how broken that really is if you need to store anything other
than English :-)
That is what I would have expected. However, it probably won't work for
the more exotic cases, like Turkish I, which depends on the locale.
Nope, Turkish must of course have its own locale to handle, for example,
its special capital "i". Let's just say it is less broken :)
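To make the Turkish case concrete: in Turkish, capital I lowercases to dotless ı (U+0131) and dotted capital İ (U+0130) lowercases to plain i, so a locale-independent mapping gets it wrong. The sketch below is an assumption-laden illustration in Python, not how ICU or the patch implements it; `lower_for_locale` and its `locale` argument are hypothetical.

```python
def lower_for_locale(text: str, locale: str = "root") -> str:
    """Lowercase `text`, applying the Turkish I rules when asked to."""
    if locale.startswith("tr"):
        # Apply the two Turkish-specific mappings before the generic one.
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


default_result = lower_for_locale("ISTANBUL")
turkish_result = lower_for_locale("ISTANBUL", "tr")

print(default_result)
print(turkish_result)
```

The default mapping yields "istanbul", while the Turkish one yields "ıstanbul" with a dotless first letter, which is why the locale still has to be known.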
/Palle
[1]: <http://icu.sourceforge.net/userguide/Collate_Concepts.html#Comparison_Levels>
--On Saturday, March 26, 2005 17.40.01 -0800 Stephan Szabo
<sszabo@megazone.bigpanda.com> wrote:
On Sun, 27 Mar 2005, Palle Girgensohn wrote:
--On Saturday, March 26, 2005 08.16.01 -0800 Stephan Szabo
<sszabo@megazone.bigpanda.com> wrote:
On Sat, 26 Mar 2005, Palle Girgensohn wrote:
I've noticed a couple of things about using the ICU patch vs. pristine
pg-8.0.1:
- ORDER BY is case insensitive when using ICU. This might break the
SQL standard (?), but sure is nice :)
Err, I think if your system implements strcoll correctly 8.0.1 can do
this if the chosen collation is set up that way (or at least naive
tests I've done seem to imply that). Or are you speaking about C
locale?
No, I doubt this.
Example: set up a cluster:
$ initdb -E LATIN1 --locale=sv_SE.ISO8859-1
$ createdb foo
CREATE DATABASE
$ psql foo
foo=# create table bar (val text);
CREATE TABLE
foo=# insert into bar values ('aaa');
INSERT 18354409 1
foo=# insert into bar values ('BBB');
INSERT 18354412 1
foo=# select val from bar order by val;
val
-----
BBB
aaa
(2 rows)
Order by is not case insensitive. It shouldn't be for any system, AFAIK.
It is on my machine... for the same test:
foo=# select val from bar order by val;
val
-----
aaa
BBB
(2 rows)
I think this just implies even greater breakage of either the collation or
strcoll on the system you're trying on. ;) Which, of course, is a fairly
reasonable reason to offer an alternative. Especially if it's generically
useful.
Interesting! Indeed, I just tried it on an old Red Hat Linux system... BTW,
that's pretty odd for a Unix system: "ls -l" sorts aaa before BBB. I've
never seen the likes of it! Call me old-fashioned if you like ;-)
Still, as you say, FreeBSD sorts capital letters first, and does not
handle collation for Unicode locales, so I need an alternative. Perhaps the
best way would be to inject ICU into BSD instead :-)
/Palle