Per-column collation, proof of concept

Started by Peter Eisentraut · 15 messages · pgsql-hackers
#1 Peter Eisentraut
peter_e@gmx.net

Here is a proof of concept for per-column collation support.

Here is how it works: When creating a table, an optional COLLATE clause
can specify a collation name, which is stored (by OID) in pg_attribute.
This becomes part of the type information and is propagated through the
expression parse analysis, like typmod. When an operator or function
call is parsed (transformed), the collations of the arguments are
unified, using some rules (like type analysis, but different in detail).
The collations of the function/operator arguments come either from Var
nodes which in turn got them from pg_attribute, or from other
function and operator calls, or you can override them with explicit
COLLATE clauses (not yet implemented, but will work a bit like
RelabelType). At the end, each function or operator call gets one
collation to use.

The function call itself can then look up the collation using the
fcinfo->flinfo->fn_expr field. (Works for operator calls, but doesn't
work for sort operations, needs more thought.)

In this implementation, a collation is defined as an lc_collate string
and an lc_ctype string. Functions interested in that information, such
as comparison operators or the upper and lower functions, take the
collation OID that is passed in, look up the locale string, and use the
xlocale.h interface (newlocale(), strcoll_l()) to compute the result.

(Note that the xlocale stuff is only 10 or so lines in this patch. It
should be feasible to allow other appropriate locale libraries to be
used.)

Loose ends:

- Support function calls (currently only operator calls) (easy)

- Implementation of sort clauses

- Indexing support/integration

- Domain support (should be straightforward)

- Make all expression node types deal with collation information
appropriately

- Explicit COLLATE clause on expressions

- Caching and not leaking memory of locale lookups

- I have typcollatable to mark which types can accept collation
information, but perhaps there should also be proicareaboutcollation
to skip collation resolution when none of the functions in the
expression tree care.

You can start by reading the collate.sql regression test file to see
what it can do. BTW, the regression tests only work with "make check
MULTIBYTE=UTF8". And it (probably) only works with glibc for now.

Attachments:

collate.patch (text/x-patch, +861/-257)
#2 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Peter Eisentraut (#1)
Re: Per-column collation, proof of concept

Peter Eisentraut <peter_e@gmx.net> wrote:

Here is a proof of concept for per-column collation support.

Did you want a WIP review of that patch? (CF closing to new
submissions soon....)

-Kevin

#3 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Peter Eisentraut (#1)
Re: Per-column collation, proof of concept

Hello

I have only one question: if I understand correctly, you can use
collation just for sorting. What is your plan for range search
operations? Sorting is interesting and certainly important for
multilingual applications; for me, case-sensitive, case-insensitive,
accent-sensitive, and accent-insensitive filtering is more important -
do you have a plan for it?

Regards

Pavel Stehule

2010/7/13 Peter Eisentraut <peter_e@gmx.net>:

Here is a proof of concept for per-column collation support.

Here is how it works: When creating a table, an optional COLLATE clause
can specify a collation name, which is stored (by OID) in pg_attribute.
This becomes part of the type information and is propagated through the
expression parse analysis, like typmod.  When an operator or function
call is parsed (transformed), the collations of the arguments are
unified, using some rules (like type analysis, but different in detail).
The collations of the function/operator arguments come either from Var
nodes which in turn got them from pg_attribute, or from other
function and operator calls, or you can override them with explicit
COLLATE clauses (not yet implemented, but will work a bit like
RelabelType).  At the end, each function or operator call gets one
collation to use.

What about the DISTINCT clause, and maybe the GROUP BY clause?

regards

Pavel


#4 Peter Eisentraut
peter_e@gmx.net
In reply to: Pavel Stehule (#3)
Re: Per-column collation, proof of concept

On Wed, 2010-07-14 at 19:35 +0200, Pavel Stehule wrote:

I have only one question: if I understand correctly, you can use
collation just for sorting. What is your plan for range search operations?

My patch does range searches. Sorting uses the same operators, so both
will be supported. (Sorting is not yet implemented, as I had written.)

Sorting is interesting and certainly important for multilingual
applications; for me, case-sensitive, case-insensitive,
accent-sensitive, and accent-insensitive filtering is more important -
do you have a plan for it?

You may be able to do some of these by using appropriate locale
definitions. I'd need some examples to be able to tell for sure.

What about the DISTINCT clause, and maybe the GROUP BY clause?

DISTINCT and GROUP BY work with equality, which is not affected by
locales (at least under the current rules).

#5 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Peter Eisentraut (#4)
Re: Per-column collation, proof of concept

2010/7/14 Peter Eisentraut <peter_e@gmx.net>:

On Wed, 2010-07-14 at 19:35 +0200, Pavel Stehule wrote:

I have only one question: if I understand correctly, you can use
collation just for sorting. What is your plan for range search operations?

My patch does range searches.  Sorting uses the same operators, so both
will be supported.  (Sorting is not yet implemented, as I had written.)

Sorting is interesting and certainly important for multilingual
applications; for me, case-sensitive, case-insensitive,
accent-sensitive, and accent-insensitive filtering is more important -
do you have a plan for it?

You may be able to do some of these by using appropriate locale
definitions.  I'd need some examples to be able to tell for sure.

What about the DISTINCT clause, and maybe the GROUP BY clause?

DISTINCT and GROUP BY work with equality, which is not affected by
locales (at least under the current rules).

:( Maybe we have to enhance the locales - or do some work in this
direction. In Czech information systems, an operation like this is
relatively common:

name = 'Stěhule' COLLATION cs_CZ_cs_ai -- compare case insensitive,
accent insensitive

PostgreSQL is the last db that doesn't have integrated support for it.

Regards

Pavel

#6 Peter Eisentraut
peter_e@gmx.net
In reply to: Pavel Stehule (#5)
Re: Per-column collation, proof of concept

On Thu, 2010-07-15 at 05:57 +0200, Pavel Stehule wrote:

:( Maybe we have to enhance the locales - or do some work in this
direction. In Czech information systems, an operation like this is
relatively common:

name = 'Stěhule' COLLATION cs_CZ_cs_ai -- compare case insensitive,
accent insensitive

PostgreSQL is the last db that doesn't have integrated support for it.

Well, the comparison function varstr_cmp() contains this comment:

/*
* In some locales strcoll() can claim that nonidentical strings are
* equal. Believing that would be bad news for a number of reasons,
* so we follow Perl's lead and sort "equal" strings according to
* strcmp().
*/

This might not be strictly necessary, seeing that citext obviously
doesn't work that way, but resolving this is really an orthogonal issue.
If you fix that and you have a locale that does what you want, my patch
will help you get your example working.

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#6)
Re: Per-column collation, proof of concept

Peter Eisentraut <peter_e@gmx.net> writes:

Well, the comparison function varstr_cmp() contains this comment:

/*
* In some locales strcoll() can claim that nonidentical strings are
* equal. Believing that would be bad news for a number of reasons,
* so we follow Perl's lead and sort "equal" strings according to
* strcmp().
*/

This might not be strictly necessary, seeing that citext obviously
doesn't work that way, but resolving this is really an orthogonal issue.

The problem with not doing that is it breaks hashing --- hash joins and
hash aggregation being the real pain points.

citext works around this in a rather klugy fashion by decreeing that two
strings are equal iff their str_tolower() conversions are bitwise equal.
So it can hash the str_tolower() representation. But that's kinda slow
and it fails in the general case anyhow, I think.

regards, tom lane

#8 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#7)
Re: Per-column collation, proof of concept

On Thu, Jul 15, 2010 at 4:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The problem with not doing that is it breaks hashing --- hash joins and
hash aggregation being the real pain points.

citext works around this in a rather klugy fashion by decreeing that two
strings are equal iff their str_tolower() conversions are bitwise equal.
So it can hash the str_tolower() representation.  But that's kinda slow
and it fails in the general case anyhow, I think.

I think the general equivalent would be to call strxfrm and hash the
result of that.

--
greg

#9 Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Peter Eisentraut (#1)
Re: Per-column collation, proof of concept

On Tue, Jul 13, 2010 at 1:25 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

Here is a proof of concept for per-column collation support.

Hi,

I was looking at this.

Nowadays, CREATE DATABASE has an lc_collate clause. Is the new collate
clause similar to lc_collate?
I mean, is lc_collate what we will use as a default?

If yes, then probably we need to use pg_collation there too, because
lc_collate and the new collate clause use different collation names.
"""
postgres=# create database test with lc_collate 'en_US.UTF-8';
CREATE DATABASE
test=# create table t1 (col1 text collate "en_US.UTF-8");
ERROR: collation "en_US.UTF-8" does not exist
test=# create table t1 (col1 text collate "en_US.utf8");
CREATE TABLE
"""

Also, I got errors from the regression tests when MULTIBYTE=UTF8
(attached). It seems I was trying to create locales that weren't
defined in locales.txt (where was that file generated from?). I added a
line to that file (for es_EC.utf8), then created a table with a column
using that collation and executed "select * from t2 where col1 > 'n';",
and I got this error: "ERROR: could not create locale "es_EC.utf8""
(of course, that last part was me messing things up, but it shows we
shouldn't be using a locales.txt file, I think).

I can attach a collation to a domain, but I can't see where we are
storing that info (actually it says it's not collatable):

--
Jaime Casanova         www.2ndQuadrant.com
Soporte y capacitación de PostgreSQL

Attachments:

regression.diffs (application/octet-stream, +17/-13)
#10 Peter Eisentraut
peter_e@gmx.net
In reply to: Jaime Casanova (#9)
Re: Per-column collation, proof of concept

On Mon, 2010-08-02 at 01:43 -0500, Jaime Casanova wrote:

Nowadays, CREATE DATABASE has an lc_collate clause. Is the new collate
clause similar to lc_collate?
I mean, is lc_collate what we will use as a default?

Yes, if you do not specify anything per column, the database default is
used.

How to integrate the per-database or per-cluster configuration with the
new system is something to figure out in the future.

If yes, then probably we need to use pg_collation there too, because
lc_collate and the new collate clause use different collation names.
"""
postgres=# create database test with lc_collate 'en_US.UTF-8';
CREATE DATABASE
test=# create table t1 (col1 text collate "en_US.UTF-8");
ERROR: collation "en_US.UTF-8" does not exist
test=# create table t1 (col1 text collate "en_US.utf8");
CREATE TABLE
"""

This is something that libc does for you. The locale as listed by
locale -a is called "en_US.utf8", but apparently libc takes
"en_US.UTF-8" as well.

Also, I got errors from the regression tests when MULTIBYTE=UTF8
(attached). It seems I was trying to create locales that weren't
defined in locales.txt (where was that file generated from?). I added a
line to that file (for es_EC.utf8), then created a table with a column
using that collation and executed "select * from t2 where col1 > 'n';",
and I got this error: "ERROR: could not create locale "es_EC.utf8""
(of course, that last part was me messing things up, but it shows we
shouldn't be using a locales.txt file, I think).

It might be that you don't have those locales installed in your system.
locales.txt is created by using locale -a. Check what that gives you.

I can attach a collation to a domain, but I can't see where we are
storing that info (actually it says it's not collatable):

Domain support is not done yet.

#11 Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Peter Eisentraut (#10)
Re: Per-column collation, proof of concept

Hi,

Sorry for the delay...
BTW, the patch no longer applies cleanly, but most failures are just
offset hunks; the worst is in src/backend/catalog/namespace.c, because
FindConversionByName() is now called get_conversion_oid()... so maybe
this function should be named get_collation_oid(), I guess.

On Tue, Aug 3, 2010 at 11:32 AM, Peter Eisentraut <peter_e@gmx.net> wrote:

On Mon, 2010-08-02 at 01:43 -0500, Jaime Casanova wrote:

Nowadays, CREATE DATABASE has an lc_collate clause. Is the new collate
clause similar to lc_collate?
I mean, is lc_collate what we will use as a default?

Yes, if you do not specify anything per column, the database default is
used.

How to integrate the per-database or per-cluster configuration with the
new system is something to figure out in the future.

Well, at least pg_collation should be a shared catalog, no?
And I think we shouldn't be thinking about this without first thinking
about how to integrate it with at least per-database configuration.

If yes, then probably we need to use pg_collation there too, because
lc_collate and the new collate clause use different collation names.
"""
postgres=# create database test with lc_collate 'en_US.UTF-8';
CREATE DATABASE
test=# create table t1 (col1 text collate "en_US.UTF-8");
ERROR:  collation "en_US.UTF-8" does not exist
test=# create table t1 (col1 text collate "en_US.utf8");
CREATE TABLE
"""

This is something that libc does for you.  The locale as listed by
locale -a is called "en_US.utf8", but apparently libc takes
"en_US.UTF-8" as well.

OK, but at least this is confusing.

Also, it doesn't recognize the C collation although it is in locales.txt:
"""
test3=# create database test4 with template=template0 encoding 'utf-8'
lc_collate='C';
CREATE DATABASE
test3=# create table t3 (col1 text collate "C" );
ERROR: collation "C" does not exist
"""

BTW, why the double quotes?

Also, I got errors from the regression tests when MULTIBYTE=UTF8
(attached). It seems I was trying to create locales that weren't
defined in locales.txt (where was that file generated from?). I added a
line to that file (for es_EC.utf8), then created a table with a column
using that collation and executed "select * from t2 where col1 > 'n';",
and I got this error: "ERROR: could not create locale "es_EC.utf8""
(of course, that last part was me messing things up, but it shows we
shouldn't be using a locales.txt file, I think).

It might be that you don't have those locales installed in your system.
locales.txt is created by using locale -a.  Check what that gives you.

Sorry to state the obvious, but this doesn't work on Windows, does it?
And for some reason it also didn't work on CentOS 5 (this error
occurred when initdb'ing):
"""
loading system objects' descriptions ... ok
creating collations ...FATAL: invalid byte sequence for encoding
"UTF8": 0xe56c09
CONTEXT: COPY tmp_pg_collation, line 86
STATEMENT: COPY tmp_pg_collation FROM
E'/usr/local/pgsql/9.1/share/locales.txt';
"""

--
Jaime Casanova         www.2ndQuadrant.com
Soporte y capacitación de PostgreSQL

#12 Peter Eisentraut
peter_e@gmx.net
In reply to: Jaime Casanova (#11)
Re: Per-column collation, proof of concept

On Sat, 2010-08-14 at 02:05 -0500, Jaime Casanova wrote:

BTW, the patch no longer applies cleanly, but most failures are just
offset hunks; the worst is in src/backend/catalog/namespace.c, because
FindConversionByName() is now called get_conversion_oid()... so maybe
this function should be named get_collation_oid(), I guess.

OK, that will need to be adjusted.

Well, at least pg_collation should be a shared catalog, no?
And I think we shouldn't be thinking about this without first thinking
about how to integrate it with at least per-database configuration.

Good point. But one might also want to create "private" collations, so
a collation in a schema would also be useful. Tricky.

Also, it doesn't recognize the C collation although it is in locales.txt:
"""
test3=# create database test4 with template=template0 encoding 'utf-8'
lc_collate='C';
CREATE DATABASE
test3=# create table t3 (col1 text collate "C" );
ERROR: collation "C" does not exist
"""

I've fixed this in the meantime. Your version of the patch doesn't
support the C locale yet.

BTW, why the double quotes?

Because the name contains upper case letters?

Sorry to state the obvious, but this doesn't work on Windows, does it?

Probably not, but hopefully there is some similar API that could be used
on Windows.

And for some reason it also didn't work on CentOS 5 (this error
occurred when initdb'ing):
"""
loading system objects' descriptions ... ok
creating collations ...FATAL: invalid byte sequence for encoding
"UTF8": 0xe56c09
CONTEXT: COPY tmp_pg_collation, line 86
STATEMENT: COPY tmp_pg_collation FROM
E'/usr/local/pgsql/9.1/share/locales.txt';
"""

Hmm, what is in that file on that line?

#13 Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Peter Eisentraut (#12)
Re: Per-column collation, proof of concept

On Mon, Aug 16, 2010 at 10:56 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

On Sat, 2010-08-14 at 02:05 -0500, Jaime Casanova wrote:

BTW, why the double quotes?

Because the name contains upper case letters?

Why does everything seem so obvious once someone else states it? :)

Sorry to state the obvious, but this doesn't work on Windows, does it?

Probably not, but hopefully there is some similar API that could be used
on Windows.

Good luck with that! ;)
Seriously, maybe this helps:
http://msdn.microsoft.com/en-us/library/system.windows.forms.inputlanguage.installedinputlanguages.aspx
But you will probably need to write the code yourself... at least I
don't think there is something like "locale -a".

And for some reason it also didn't work on CentOS 5 (this error
occurred when initdb'ing):
"""
loading system objects' descriptions ... ok
creating collations ...FATAL:  invalid byte sequence for encoding
"UTF8": 0xe56c09
CONTEXT:  COPY tmp_pg_collation, line 86
STATEMENT:  COPY tmp_pg_collation FROM
E'/usr/local/pgsql/9.1/share/locales.txt';
"""

Hmm, what is in that file on that line?

bokmål ISO-8859-1

(I'm attaching the locales.txt just in case.)

--
Jaime Casanova         www.2ndQuadrant.com
Soporte y capacitación de PostgreSQL

Attachments:

locales.txt (text/plain)
#14 Peter Eisentraut
peter_e@gmx.net
In reply to: Jaime Casanova (#13)
Re: Per-column collation, proof of concept

On Tue, 2010-08-17 at 01:16 -0500, Jaime Casanova wrote:

creating collations ...FATAL: invalid byte sequence for encoding
"UTF8": 0xe56c09
CONTEXT: COPY tmp_pg_collation, line 86
STATEMENT: COPY tmp_pg_collation FROM
E'/usr/local/pgsql/9.1/share/locales.txt';
"""

Hmm, what is in that file on that line?

bokmål ISO-8859-1

Hey, that borders on genius: Use a non-ASCII letter in the name of a
locale whose purpose it is to configure how non-ASCII letters are
interpreted. :-/

Interestingly, I don't see this on a Debian system. Good thing to know
that this needs separate testing on different Linux variants.

#15 Jaime Casanova
jcasanov@systemguards.com.ec
In reply to: Peter Eisentraut (#14)
Re: Per-column collation, proof of concept

On Wed, Aug 18, 2010 at 11:29 AM, Peter Eisentraut <peter_e@gmx.net> wrote:

On Tue, 2010-08-17 at 01:16 -0500, Jaime Casanova wrote:

creating collations ...FATAL:  invalid byte sequence for encoding
"UTF8": 0xe56c09
CONTEXT:  COPY tmp_pg_collation, line 86
STATEMENT:  COPY tmp_pg_collation FROM
E'/usr/local/pgsql/9.1/share/locales.txt';
"""

Hmm, what is in that file on that line?

bokmål  ISO-8859-1

Hey, that borders on genius: Use a non-ASCII letter in the name of a
locale whose purpose it is to configure how non-ASCII letters are
interpreted. :-/

Interestingly, I don't see this on a Debian system.  Good thing to know
that this needs separate testing on different Linux variants.

Yeah! And when installing CentOS 5 I don't get a chance to choose
which locales I want; it just installs all of them.

--
Jaime Casanova         www.2ndQuadrant.com
Soporte y capacitación de PostgreSQL