[WIP] patch - Collation at database level

Started by Radek Strnadalmost 18 years ago31 messageshackers
Jump to latest
#1Radek Strnad
radek.strnad@gmail.com

Hello,

I'm sending part of the code that I've done and is available for reviewing
and I'm asking for your comments and some help because I'm new to
PostgreSQL.

Proposal: http://archives.postgresql.org/pgsql-hackers/2008-05/msg00857.php

The code contains changes of parser, pg_collation catalog, parsenodes.h for
CREATE COLLATION... and DROP COLLATION statements.

Questions that I would like to know answers:

1) Should I move DROP statement from DropCollationStmt to DropStmt? Some
statements are seperated and some are included in DropStmt? Is there any
reason for that?
2) What type should all names in CREATE and DROP statement in gram.y have?
I've chosen qualified_name but I know it's not the best choice.
3) All collations are created from existing collations. How do I ensure that
the collation already exists? Is there any possibility to define it in
gram.y?
4) For further functionality development is there anything more needed than
adding T_CreateCollationStmt and T_DropCollationStmt to
/src/backend/tcop/utility.c, and write functionality into collation.c?
5) Also can you look at the pg_catalog and tell me if anything is wrong with
it?

Thank you for all your replies in advance.

Regards

Radek Strnad

Attachments:

collation.patchtext/x-diff; name=collation.patchDownload+278-45
#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Radek Strnad (#1)
Re: [WIP] patch - Collation at database level

Radek Strnad escribi�:

2) What type should all names in CREATE and DROP statement in gram.y have?
I've chosen qualified_name but I know it's not the best choice.

I think it should be ColId.

3) All collations are created from existing collations. How do I ensure that
the collation already exists? Is there any possibility to define it in
gram.y?

Certainly not -- shouldn't they come from a catalog? In that case, it
must come in parse analysis (parser/analyze.c I guess) or perhaps later,
when you actually execute the function to create the new collation.

5) Also can you look at the pg_catalog and tell me if anything is wrong with
it?

Why does a collation have a schema?

What's the "existing collation"?

It seems a bit silly to have enum for what are basically boolean
variables. Why not just use "true" and "false"?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#2)
Re: [WIP] patch - Collation at database level

Alvaro Herrera <alvherre@commandprompt.com> writes:

Why does a collation have a schema?

Because the SQL spec says so. Also, if we don't put them in schemas,
we have no nice way to distinguish built-in and user-defined collations,
which creates a problem for pg_dump.

regards, tom lane

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#3)
Re: [WIP] patch - Collation at database level

Tom Lane escribi�:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Why does a collation have a schema?

Because the SQL spec says so. Also, if we don't put them in schemas,
we have no nice way to distinguish built-in and user-defined collations,
which creates a problem for pg_dump.

Oh, I see :-) In that case, qualified_name would seem the right symbol
to use in the parser.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#5Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#3)
Re: [WIP] patch - Collation at database level

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Why does a collation have a schema?

Because the SQL spec says so. Also, if we don't put them in schemas,
we have no nice way to distinguish built-in and user-defined collations,
which creates a problem for pg_dump.

Out of curiosity, what is a "user-defined collation"? Are there SQL statements
to go around declaring what order code points should be sorted in? That seems
like it would be... quite tedious!

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

#6Martijn van Oosterhout
kleptog@svana.org
In reply to: Bruce Momjian (#5)
Re: [WIP] patch - Collation at database level

On Wed, Jul 02, 2008 at 07:22:10PM +0100, Gregory Stark wrote:

Because the SQL spec says so. Also, if we don't put them in schemas,
we have no nice way to distinguish built-in and user-defined collations,
which creates a problem for pg_dump.

Out of curiosity, what is a "user-defined collation"? Are there SQL statements
to go around declaring what order code points should be sorted in? That seems
like it would be... quite tedious!

Not that we'll ever use it, but ICU for example allows users to say:
"use collation X but move this code point somewhere else", essentially
allowing users tweak the collation on a small scale. In any case,
whatever collation library is used, we're unlikely to predefine every
possible collation in the system, there's too many (assuming they're
denumerable).

Have a niceday,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.

#7Radek Strnad
radek.strnad@gmail.com
In reply to: Bruce Momjian (#5)
Re: [WIP] patch - Collation at database level

My patch should be "sort of" wrapper that will implement guts for further
development (collation at column level) like catalogs, creating collations
etc. When creating collation user will be able to choose which function to
use (by statement STRCOLFN - not in SQL standard). In the first stage I'll
implement function that will use system locales. Adding ICU or any other
library won't be that big deal.

Radek Strnad

On Wed, Jul 2, 2008 at 8:22 PM, Gregory Stark <stark@enterprisedb.com>
wrote:

Show quoted text

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Why does a collation have a schema?

Because the SQL spec says so. Also, if we don't put them in schemas,
we have no nice way to distinguish built-in and user-defined collations,
which creates a problem for pg_dump.

Out of curiosity, what is a "user-defined collation"? Are there SQL
statements
to go around declaring what order code points should be sorted in? That
seems
like it would be... quite tedious!

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#5)
Re: [WIP] patch - Collation at database level

Gregory Stark <stark@enterprisedb.com> writes:

Out of curiosity, what is a "user-defined collation"? Are there SQL statements
to go around declaring what order code points should be sorted in? That seems
like it would be... quite tedious!

Hm, that's a good point. SQL99 has

<collation definition> ::=
CREATE COLLATION <collation name> FOR
<character set specification>
FROM <existing collation name>
[ <pad characteristic> ]

<existing collation name> ::= <collation name>

<pad characteristic> ::=
NO PAD
| PAD SPACE

which seems pretty stupid if you ask me --- all the mechanism required
to manage a new object type, just to enable PAD SPACE or not?
(Especially when PAD SPACE itself is an utterly broken, useless concept
... but I digress.) You might as well just provide all the standard
collations in both variants and be done with it.

The statement looks the same in last year's 200n draft, so it's not
like they were just about to add some more capability.

We might be best off to treat collations like index access methods,
ie, they're theoretically add-able but there's no infrastructure for
managing them, and what's expected is that all the ones you need are
created by initdb.

regards, tom lane

#9Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#8)
Re: [WIP] patch - Collation at database level

Tom Lane napsal(a):

Gregory Stark <stark@enterprisedb.com> writes:

Out of curiosity, what is a "user-defined collation"? Are there SQL statements
to go around declaring what order code points should be sorted in? That seems
like it would be... quite tedious!

Hm, that's a good point. SQL99 has

<collation definition> ::=
CREATE COLLATION <collation name> FOR
<character set specification>
FROM <existing collation name>
[ <pad characteristic> ]

<existing collation name> ::= <collation name>

<pad characteristic> ::=
NO PAD
| PAD SPACE

which seems pretty stupid if you ask me --- all the mechanism required
to manage a new object type, just to enable PAD SPACE or not?
(Especially when PAD SPACE itself is an utterly broken, useless concept
... but I digress.) You might as well just provide all the standard
collations in both variants and be done with it.

The statement looks the same in last year's 200n draft, so it's not
like they were just about to add some more capability.

The proposed syntax of CREATE COLLATION is:

CREATE COLLATION <collation name> FOR <character set specification>
FROM <existing collation name> [STRCOLFN <fn name>]
[ <pad characteristic> ] [ <case sensitive> ] [ <accent sensitive> ]
[ LCCOLLATE <lc_collate> ] [ LCCTYPE <lc_ctype> ]

Which extends ANSI specification.

We might be best off to treat collations like index access methods,
ie, they're theoretically add-able but there's no infrastructure for
managing them, and what's expected is that all the ones you need are
created by initdb.

I think you cannot create all collation at bootstrap. You can only create record
for actual LC_COLLATION, because I there is not standard way how to obtain
complete list of supported collations and there is also problem if you install
new locales after initdb.

When I looked to another DB (MS SQL, MySQL, DB2, Firebird) then only Firebird
supports CREATE COLLATION command. Other databases has hard coded list of
locales. Hardcoded solution means to use some lib (e.g ICU) with unified names
or has locale name mapping for all supported OS.

I personally prefer open solution when I can create own collation and specify
collation function to handle it.

Zdenek

#10Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#8)
Re: [WIP] patch - Collation at database level

Tom Lane napsal(a):

Gregory Stark <stark@enterprisedb.com> writes:

Out of curiosity, what is a "user-defined collation"? Are there SQL statements
to go around declaring what order code points should be sorted in? That seems
like it would be... quite tedious!

<snip>

We might be best off to treat collations like index access methods,
ie, they're theoretically add-able but there's no infrastructure for
managing them, and what's expected is that all the ones you need are
created by initdb.

I though more about it and I discussed it with Radek yesterday. The problem is
that collation must be created before user want to use CREATE DATABASE ...
COLLATE ... command. It inclines to have have pg_collation as a global catalog,
but ANSI specifies to use schema name in collation specification and schemes are
database specific ... It means that pg_collation have to be non-shared catalog
and new database only inherits collation from template db. And CREATE DATABASE
have to check list of collation in template database :(.

My conclusion is that CREATE COLLATION does not make much sense. I see two
possible solutions:

1) have global an local catalog for collation and have modified variants of
create cmd:

CREATE COLLATION ... GLOBAL|LOCAL

CREATE DATABASE will use only collation from global catalog

Local catalog will be useful when full support of collation will be available
mostly for specifying case sensitivity of collation.

2) Use Tom's suggested approach. Create list of collations in initdb phase. But
there is problem how to obtain list of supported collation on the server. I
think, only what is possible to do is to use default locale for creating default
collation for template1 database.

Any suggestion?

thanks Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

#11Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Zdenek Kotala (#10)
Re: [WIP] patch - Collation at database level

Zdenek Kotala napsal(a):

I though more about it and I discussed it with Radek yesterday. The
problem is that collation must be created before user want to use CREATE
DATABASE ... COLLATE ... command. It inclines to have have pg_collation
as a global catalog, but ANSI specifies to use schema name in collation
specification and schemes are database specific ... It means that
pg_collation have to be non-shared catalog and new database only
inherits collation from template db. And CREATE DATABASE have to check
list of collation in template database :(.

thinking more ...
It must be shared catalog because pg_database will depend on it.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

#12Martijn van Oosterhout
kleptog@svana.org
In reply to: Zdenek Kotala (#11)
Re: [WIP] patch - Collation at database level

On Tue, Jul 08, 2008 at 11:27:35AM +0200, Zdenek Kotala wrote:

Zdenek Kotala napsal(a):

I though more about it and I discussed it with Radek yesterday. The
problem is that collation must be created before user want to use CREATE
DATABASE ... COLLATE ... command. It inclines to have have pg_collation
as a global catalog, but ANSI specifies to use schema name in collation
specification and schemes are database specific ... It means that
pg_collation have to be non-shared catalog and new database only
inherits collation from template db. And CREATE DATABASE have to check
list of collation in template database :(.

thinking more ...
It must be shared catalog because pg_database will depend on it.

Not necessarily. pg_class is not shared yet without it you can't even
find pg_database. Same deal with pg_type. All it means is that
pg_collation in template1 must contain all the collations used in
template1, which shouldn't be hard to arrange.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.

#13Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Martijn van Oosterhout (#12)
Re: [WIP] patch - Collation at database level

Martijn van Oosterhout napsal(a):

On Tue, Jul 08, 2008 at 11:27:35AM +0200, Zdenek Kotala wrote:

Zdenek Kotala napsal(a):

I though more about it and I discussed it with Radek yesterday. The
problem is that collation must be created before user want to use CREATE
DATABASE ... COLLATE ... command. It inclines to have have pg_collation
as a global catalog, but ANSI specifies to use schema name in collation
specification and schemes are database specific ... It means that
pg_collation have to be non-shared catalog and new database only
inherits collation from template db. And CREATE DATABASE have to check
list of collation in template database :(.

thinking more ...
It must be shared catalog because pg_database will depend on it.

Not necessarily. pg_class is not shared yet without it you can't even
find pg_database. Same deal with pg_type. All it means is that
pg_collation in template1 must contain all the collations used in
template1, which shouldn't be hard to arrange.

I think, Collation situation is different, becasue pg_database will contains
column colname. pg_class (and all bootstrap catalog) only contains row which
specify that shared table exists and content is cloned to the new database from
template database. In corner case you can get context specific dependency for
example if Czech collation will have oid=10 in database test01 and Swedish
collation will have oid=10 in database test02. How to handle CREATE DATABASE and
connect database? OK it shouldn't happen in normal situation but ...

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

#14Martijn van Oosterhout
kleptog@svana.org
In reply to: Zdenek Kotala (#13)
Re: [WIP] patch - Collation at database level

On Tue, Jul 08, 2008 at 12:00:34PM +0200, Zdenek Kotala wrote:

Not necessarily. pg_class is not shared yet without it you can't even
find pg_database. Same deal with pg_type. All it means is that
pg_collation in template1 must contain all the collations used in
template1, which shouldn't be hard to arrange.

I think, Collation situation is different, becasue pg_database will
contains column colname. pg_class (and all bootstrap catalog) only contains
row which specify that shared table exists and content is cloned to the new
database from template database. In corner case you can get context
specific dependency for example if Czech collation will have oid=10 in
database test01 and Swedish collation will have oid=10 in database test02.
How to handle CREATE DATABASE and connect database? OK it shouldn't happen
in normal situation but ...

Oh I see, you're referring to the storage of the default collation for
a database. I was jumping ahead to the per-column collation state, when
the collation default is attached to columns, types and domains, and
not at the database level. So there the problem does not exist.

To be honest, I'd suggest storing the collation in pg_database as a
string, rather than as an identifier. This sidesteps the problem
entirly.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.

#15Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Martijn van Oosterhout (#14)
Re: [WIP] patch - Collation at database level

Martijn van Oosterhout napsal(a):

Oh I see, you're referring to the storage of the default collation for
a database. I was jumping ahead to the per-column collation state, when
the collation default is attached to columns, types and domains, and
not at the database level. So there the problem does not exist.

Yeah, but you still need one source/one collation list for database, scheme,
table and column. And of course shared tables need also collation for their indexes.

To be honest, I'd suggest storing the collation in pg_database as a
string, rather than as an identifier. This sidesteps the problem
entirly.

I don't think that string is good idea. You need to use same approach on all
levels by my opinion.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zdenek Kotala (#13)
Re: [WIP] patch - Collation at database level

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

Martijn van Oosterhout napsal(a):

Not necessarily. pg_class is not shared yet without it you can't even
find pg_database. Same deal with pg_type. All it means is that
pg_collation in template1 must contain all the collations used in
template1, which shouldn't be hard to arrange.

I think, Collation situation is different,

All the argument here is based on the premise that we should have
database-level collation specifications, which AFAICS is not required
nor suggested by the SQL spec. I wonder why we are allowing a
nonstandard half-measure to drive our thinking, rather than solving the
real problem which is column-level collations.

regards, tom lane

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zdenek Kotala (#15)
Re: [WIP] patch - Collation at database level

Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:

... And of course shared tables need also collation for their indexes.

No, they don't, because the only textual indexes on shared catalogs are
on "name" columns, which are intentionally not locale aware, and
wouldn't be collation aware either.

regards, tom lane

#18Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#16)
Re: [WIP] patch - Collation at database level

Tom Lane wrote:

All the argument here is based on the premise that we should have
database-level collation specifications, which AFAICS is not required
nor suggested by the SQL spec. I wonder why we are allowing a
nonstandard half-measure to drive our thinking, rather than solving the
real problem which is column-level collations.

Agreed. Are we even sure that we want per-database collations as a
half-way house? Unless we can be sure that we want all the required
catalog changes for the full requirement, it seems to me a rather messy
way of getting to where we want to go.

cheers

andrew

#19Martijn van Oosterhout
kleptog@svana.org
In reply to: Andrew Dunstan (#18)
Re: [WIP] patch - Collation at database level

On Tue, Jul 08, 2008 at 10:54:28AM -0400, Andrew Dunstan wrote:

Agreed. Are we even sure that we want per-database collations as a
half-way house? Unless we can be sure that we want all the required
catalog changes for the full requirement, it seems to me a rather messy
way of getting to where we want to go.

Given that the current projected timeframe for full COLLATE support is
something like 2-3 major releases, I don't really see the problem with
doing this now. Who knows, it might prompt people to do something
sooner.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.

#20Josh Berkus
josh@agliodbs.com
In reply to: Andrew Dunstan (#18)
Re: [WIP] patch - Collation at database level

Andrew, Tom,

Agreed. Are we even sure that we want per-database collations as a
half-way house? Unless we can be sure that we want all the required
catalog changes for the full requirement, it seems to me a rather messy
way of getting to where we want to go.

Given that we don't have a delivery date for table or column level collations,
we don't want to turn down database-level collations. If nothing else,
Radek's work will expose what areas of our code are collation-dependant and
hopefully make the work of more granular collations easier. And if it takes
us 3 years to get more granular collations, at least people can use
database-level ones in the meantime so that they don't need to have separate
PostgreSQL binaries for every language they want to support fully.

Also ... this is a Summer of Code Project, which we accepted, which at least
in Google and the student's eyes means we're not going to discard the entire
premise of the patch. I'm not exaggerating when I say doing something like
that could get PostgreSQL permanently banned from Google SoC.

Tom, I think you need to be on the SoC committee in the future, just to raise
objections. Some 15+ PostgreSQL contributors on the SoC committee approved
Radek's project.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco

#21Michael Paesold
mpaesold@gmx.at
In reply to: Tom Lane (#16)
#22Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#17)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#20)
#24Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Tom Lane (#16)
#25Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Andrew Dunstan (#18)
#26Martijn van Oosterhout
kleptog@svana.org
In reply to: Zdenek Kotala (#24)
#27Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Martijn van Oosterhout (#26)
#28Peter Eisentraut
peter_e@gmx.net
In reply to: Radek Strnad (#1)
#29Bruce Momjian
bruce@momjian.us
In reply to: Peter Eisentraut (#28)
#30Radek Strnad
radek.strnad@gmail.com
In reply to: Peter Eisentraut (#28)
#31Martijn van Oosterhout
kleptog@svana.org
In reply to: Radek Strnad (#30)