WIP patch: Collation support
I'm implementing collation support. Proposal can be found at
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00557.php
Progress so far:
- created catalogs pg_collation a pg_charset which are filled with three
standard collations
- initdb changes rows called "DEFAULT" in both catalogs during the bki
bootstrap phase with current system LC_COLLATE and LC_CTYPE or those set by
command line.
- new collations can be defined with command CREATE COLLATION <collation
name> FOR <character set specification> FROM <existing collation name>
[STRCOLFN <fn name>]
[ <pad characteristic> ] [ <case sensitive> ] [ LCCOLLATE <lc_collate> ] [
LCCTYPE <lc_ctype> ]
- because of pg_collation and pg_charset are catalogs individual for each
database, if you want to create a database with collation other than
specified, create it in template1 and then create database
- when connecting to database, it retrieves locales from pg_database and
sets them
Design & functionality changes left:
- move retrieveing collation from pg_database to pg_type
- get case sensitivity and pad characteristic working
- wrap locale setting and text comparison into functions that will be
referenced from pg_collation as regproc (make the code nice)
- when creating database with different collation than database cluster, the
database has to be reindexed. Any idea how to do it? Function
ReindexDatabase works only when database is opened.
- minor tweaks & fixes
Ideas? Comments?
Regards
Radek Strnad
Attachments:
wip-collation.patchtext/x-diff; name=wip-collation.patchDownload+1199-235
Radek Strnad escribi�:
- when creating database with different collation than database cluster, the
database has to be reindexed. Any idea how to do it? Function
ReindexDatabase works only when database is opened.
We have this Todo item:
Set proper permissions on non-system schemas during db creation
Currently all schemas are owned by the super-user because they are
copied from the template1 database. However, since all objects are
inherited from the template database, it is not clear that setting
schemas to the db owner is correct.
When this was discussed years ago, one proposed idea was that on the
first connection the backend should, as a first task, ensure that
certain administrative chores be done. The first of those was changing
the ownership of schemas. It sounds like this reindexing you propose
falls into the same category.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Ok, so do you suggest to leave it with a notice "reindex database" or start
to solve it somehow?
Regards
Radek Strnad
On Mon, Sep 1, 2008 at 12:08 AM, Alvaro Herrera
<alvherre@commandprompt.com>wrote:
Show quoted text
Radek Strnad escribió:
- when creating database with different collation than database cluster,
the
database has to be reindexed. Any idea how to do it? Function
ReindexDatabase works only when database is opened.We have this Todo item:
Set proper permissions on non-system schemas during db creation
Currently all schemas are owned by the super-user because they are
copied from the template1 database. However, since all objects are
inherited from the template database, it is not clear that setting
schemas to the db owner is correct.When this was discussed years ago, one proposed idea was that on the
first connection the backend should, as a first task, ensure that
certain administrative chores be done. The first of those was changing
the ownership of schemas. It sounds like this reindexing you propose
falls into the same category.--
Alvaro Herrera
http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Radek Strnad wrote:
- new collations can be defined with command CREATE COLLATION <collation
name> FOR <character set specification> FROM <existing collation name>
[STRCOLFN <fn name>]
[ <pad characteristic> ] [ <case sensitive> ] [ LCCOLLATE <lc_collate> ]
[ LCCTYPE <lc_ctype> ]
How do you plan to make a collation case sensitive or accent sensitive?
I have previously commented that this is not a realistic view on how
collations work. Since you are apparently planning to use the system
locales, I don't see how you can make this work.
On Tue, Sep 02, 2008 at 02:50:47PM +0300, Peter Eisentraut wrote:
Radek Strnad wrote:
- new collations can be defined with command CREATE COLLATION <collation
name> FOR <character set specification> FROM <existing collation name>
[STRCOLFN <fn name>]
[ <pad characteristic> ] [ <case sensitive> ] [ LCCOLLATE <lc_collate> ]
[ LCCTYPE <lc_ctype> ]How do you plan to make a collation case sensitive or accent sensitive?
I have previously commented that this is not a realistic view on how
collations work. Since you are apparently planning to use the system
locales, I don't see how you can make this work.
While it's true POSIX locales don't handle this, other collation
libraries do and we should support them if the user wants.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.
I think at least case sensitivity can be done by comparing two strings
converted to upper case with toupper() function.
Regards
Radek Strnad
Show quoted text
On Tue, Sep 2, 2008 at 2:00 PM, Martijn van Oosterhout <kleptog@svana.org>wrote:
On Tue, Sep 02, 2008 at 02:50:47PM +0300, Peter Eisentraut wrote:
Radek Strnad wrote:
- new collations can be defined with command CREATE COLLATION
<collation
name> FOR <character set specification> FROM <existing collation name>
[STRCOLFN <fn name>]
[ <pad characteristic> ] [ <case sensitive> ] [ LCCOLLATE <lc_collate>]
[ LCCTYPE <lc_ctype> ]
How do you plan to make a collation case sensitive or accent sensitive?
I have previously commented that this is not a realistic view on how
collations work. Since you are apparently planning to use the system
locales, I don't see how you can make this work.While it's true POSIX locales don't handle this, other collation
libraries do and we should support them if the user wants.Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)iD8DBQFIvSrIIB7bNG8LQkwRAnkWAJ9FaiR9cOHFN2vkVmQaK5y7N9OJoQCbB+Ks
e0E4722hY/Q+Cz8tpzA0CGs=
=2Svh
-----END PGP SIGNATURE-----
Import Notes
Reply to msg id not found: de5165440809020502m184c4631i1d093c517b869a27@mail.gmail.com
Martijn van Oosterhout wrote:
On Tue, Sep 02, 2008 at 02:50:47PM +0300, Peter Eisentraut wrote:
Radek Strnad wrote:
- new collations can be defined with command CREATE COLLATION <collation
name> FOR <character set specification> FROM <existing collation name>
[STRCOLFN <fn name>]
[ <pad characteristic> ] [ <case sensitive> ] [ LCCOLLATE <lc_collate> ]
[ LCCTYPE <lc_ctype> ]How do you plan to make a collation case sensitive or accent sensitive?
I have previously commented that this is not a realistic view on how
collations work. Since you are apparently planning to use the system
locales, I don't see how you can make this work.While it's true POSIX locales don't handle this, other collation
libraries do and we should support them if the user wants.
Do they handle exactly those two attributes specifically? Can you point
out references? Or do you mean, other collation libraries allow their
collations to be configured/customized? I think linguistically it is a
very narrow view of the world to hardcode those two attributes.
On Tue, Sep 02, 2008 at 04:46:16PM +0300, Peter Eisentraut wrote:
While it's true POSIX locales don't handle this, other collation
libraries do and we should support them if the user wants.Do they handle exactly those two attributes specifically? Can you point
out references? Or do you mean, other collation libraries allow their
collations to be configured/customized? I think linguistically it is a
very narrow view of the world to hardcode those two attributes.
Well, yes. Accents and case are attributes of a character. (I'm using
the unicode model here). So, to do a case insensetive match you take
the characters, strip the attributes and then do the comparison. There
are specialised routines which handle the denormalisation of the string
for you so in theory you could even get specific about which accents
you ignore. In practice I don't think people do that.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.
Martijn van Oosterhout <kleptog@svana.org> writes:
On Tue, Sep 02, 2008 at 04:46:16PM +0300, Peter Eisentraut wrote:
While it's true POSIX locales don't handle this, other collation
libraries do and we should support them if the user wants.
I think that's backwards. We have to go with the lowest common denominator
functionality of those libraries if we're going to be portable. As long as
it's a superset of the SQL standard functionality. If we support features of
some of them that can't be emulated with others then users end up with SQL
code that will only work on some builds and not others. That might be worth it
for some features but I'm not sure this is one.
Well, yes. Accents and case are attributes of a character. (I'm using
the unicode model here). So, to do a case insensetive match you take
the characters, strip the attributes and then do the comparison. There
are specialised routines which handle the denormalisation of the string
for you so in theory you could even get specific about which accents
you ignore. In practice I don't think people do that.
I don't think composable unicode characters are really about collations. I
think it had more to do with representing glyphs in UTF32 before they gave up
on that. Does anyone still use composable characters?
Note that we don't currently support composable characters at all. I'm not
sure if that's a "nobody really cares" issue or a bug we should aim to fix
with real collation support.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
On Tue, Sep 02, 2008 at 05:42:13PM +0100, Gregory Stark wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:
On Tue, Sep 02, 2008 at 04:46:16PM +0300, Peter Eisentraut wrote:
While it's true POSIX locales don't handle this, other collation
libraries do and we should support them if the user wants.I think that's backwards. We have to go with the lowest common denominator
functionality of those libraries if we're going to be portable.
And I think that's backwards. Why can we only use a feature once every
OS out there implements it? We still run on systems that don't have SSL
support. LC_TYPE settings are not portable between systems, yet that
doesn't bother anyone. Why should we have a problem with collate
settings not being portable?
I don't think composable unicode characters are really about collations. I
think it had more to do with representing glyphs in UTF32 before they gave up
on that. Does anyone still use composable characters?
Lookup the various normalisations forms:
http://en.wikipedia.org/wiki/Unicode_normalization
In particular Normal Form D.
Sure, composable characters have nothing to do with collation, but they
provide a uniform way of doing accent insensetive collation.
Note that we don't currently support composable characters at all.
Any character which is an accent on a latin character is a decomposable
character. And last I checked we supported those.
Have a niceday,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.
Radek Strnad escribi�:
Ok, so do you suggest to leave it with a notice "reindex database" or start
to solve it somehow?
I don't know. If there are two tasks that need the same treatment, it
seems a safe conclusion that they need a common solution.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Martijn van Oosterhout <kleptog@svana.org> writes:
And I think that's backwards. Why can we only use a feature once every
OS out there implements it? We still run on systems that don't have SSL
support. LC_TYPE settings are not portable between systems, yet that
doesn't bother anyone. Why should we have a problem with collate
settings not being portable?
If we're going to approach it that way, we need a syntax for CREATE
COLLATION that doesn't hard-wire what the possible collation modifiers
are.
regards, tom lane
Radek Strnad wrote:
Progress so far:
- created catalogs pg_collation a pg_charset which are filled with three
standard collations
- initdb changes rows called "DEFAULT" in both catalogs during the bki
bootstrap phase with current system LC_COLLATE and LC_CTYPE or those set by
command line.
- new collations can be defined with command CREATE COLLATION <collation
name> FOR <character set specification> FROM <existing collation name>
[STRCOLFN <fn name>]
[ <pad characteristic> ] [ <case sensitive> ] [ LCCOLLATE <lc_collate> ] [
LCCTYPE <lc_ctype> ]
- because of pg_collation and pg_charset are catalogs individual for each
database, if you want to create a database with collation other than
specified, create it in template1 and then create database
I have to wonder, is all that really necessary? The feature you're
trying to implement is to support database-level collation at first, and
perhaps column-level collation later. We don't need support for
user-defined collations and charsets for that.
If leave all that out of the patch for now, we'll have a much slimmer,
and just as useful patch, implementing database-level collation. We can
add those catalogs later if we need them, but I don't think there's much
point in adding all that infrastructure if they just reflect the locales
installed in the operating system.
- when connecting to database, it retrieves locales from pg_database and
sets them
This is the real gist of this patch.
Design & functionality changes left:
- move retrieveing collation from pg_database to pg_type
I don't understand this item. What will you move?
- get case sensitivity and pad characteristic working
I feel we should leave this to the collation implementation.
- when creating database with different collation than database cluster, the
database has to be reindexed. Any idea how to do it? Function
ReindexDatabase works only when database is opened.
That's a tricky one. One idea is to prohibit choosing a different
collation than the one in the template database, unless we know it's
safe to do so without reindexing. The problem is that we don't know
whether it's safe. A simple but limiting solution would be to require
that the template database has the same collation as the database that's
being created, except that template0 can always be used as template.
template0 is safe, because there's no indexes on text columns there.
Note that we already have the same problem with encodings. If you create
a database with LATIN1 encoding, load it with data, and then use that as
a template for a database with UTF-8 encoding, the text data will be
incorrectly encoded. We should probably fix that too.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Wed, Sep 10, 2008 at 11:29:14AM +0300, Heikki Linnakangas wrote:
Radek Strnad wrote:
- because of pg_collation and pg_charset are catalogs individual for each
database, if you want to create a database with collation other than
specified, create it in template1 and then create databaseI have to wonder, is all that really necessary? The feature you're
trying to implement is to support database-level collation at first, and
perhaps column-level collation later. We don't need support for
user-defined collations and charsets for that.
Since the set of collations isn't exactly denumerable, we need some way
to allow the user to specify the collation they want. The only
collation PostgreSQL knows about is the C collation. Anything else is
user-defined.
Design & functionality changes left:
- move retrieveing collation from pg_database to pg_typeI don't understand this item. What will you move?
Long term, the collation is a property of the type, but I agree, I'm not
sure why this patch needs it.
That's a tricky one. One idea is to prohibit choosing a different
collation than the one in the template database, unless we know it's
safe to do so without reindexing.
But that put us back where we started: every database having the same
collation. We're trying to move away from that. Just reindex everything
and be done with it.
Note that we already have the same problem with encodings. If you create
a database with LATIN1 encoding, load it with data, and then use that as
a template for a database with UTF-8 encoding, the text data will be
incorrectly encoded. We should probably fix that too.
I'd say forbid more than one encoding in a cluster, but that's just my
opinion :)
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.
Martijn van Oosterhout wrote:
On Wed, Sep 10, 2008 at 11:29:14AM +0300, Heikki Linnakangas wrote:
Radek Strnad wrote:
- because of pg_collation and pg_charset are catalogs individual for each
database, if you want to create a database with collation other than
specified, create it in template1 and then create databaseI have to wonder, is all that really necessary? The feature you're
trying to implement is to support database-level collation at first, and
perhaps column-level collation later. We don't need support for
user-defined collations and charsets for that.Since the set of collations isn't exactly denumerable, we need some way
to allow the user to specify the collation they want. The only
collation PostgreSQL knows about is the C collation. Anything else is
user-defined.
Let's just use the name of the OS locale, like we do now. Having a
pg_collation catalog just moves the problem elsewhere: we'd still need
something in pg_collation to tie the collation to the OS locale.
Design & functionality changes left:
- move retrieveing collation from pg_database to pg_typeI don't understand this item. What will you move?
Long term, the collation is a property of the type, ...
You might want to provide a default collation for a type as well, but
the very finest grade is that you can specify collation for every (text)
comparison operator in your query. Of course you don't want to do that
for every query, which is why we should provide defaults at different
levels: columns, tables, database. And perhaps types as well, but that's
not the most interesting case.
I'm not sure what the SQL spec says about that, but I believe it
provides syntax and rules for all that.
That's a tricky one. One idea is to prohibit choosing a different
collation than the one in the template database, unless we know it's
safe to do so without reindexing.But that put us back where we started: every database having the same
collation. We're trying to move away from that. Just reindex everything
and be done with it.
That's easier said than done, unfortunately.
Note that we already have the same problem with encodings. If you create
a database with LATIN1 encoding, load it with data, and then use that as
a template for a database with UTF-8 encoding, the text data will be
incorrectly encoded. We should probably fix that too.I'd say forbid more than one encoding in a cluster, but that's just my
opinion :)
Yeah, that's pretty useless, at least without support for different
locales on different databases. But might as well keep it unless there's
a pressing reason to drop it.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas napsal(a):
Design & functionality changes left:
- move retrieveing collation from pg_database to pg_type
The problem there is that pg_collation is local catalog, but pg_database
is global catalog. IIRC, It was discussed during last commitfest. I
think it is bad idea to make this kind of dependency. It seems to me
better to implement something what we will use later to avoid useless
and confusing dependency.
Zdenek
On Wed, Sep 10, 2008 at 12:51:02PM +0300, Heikki Linnakangas wrote:
Since the set of collations isn't exactly denumerable, we need some way
to allow the user to specify the collation they want. The only
collation PostgreSQL knows about is the C collation. Anything else is
user-defined.Let's just use the name of the OS locale, like we do now. Having a
pg_collation catalog just moves the problem elsewhere: we'd still need
something in pg_collation to tie the collation to the OS locale.
There's not a one-to-one mapping between collation and locale name. A
locale name includes information about the charset and a collation may
have paramters like case-sensetivity and pad-attribute which are not
present in the locale name. You need a mapping anyway, which is what
this table is for.
The difference in collation between CHAR() and VARCHAR() is the usual
example here.
Long term, the collation is a property of the type, ...
I'm not sure what the SQL spec says about that, but I believe it
provides syntax and rules for all that.
The spec is quite detailed about and I posted code to do it years ago.
The point is that we don't need to go that far with this patch.
But that put us back where we started: every database having the same
collation. We're trying to move away from that. Just reindex everything
and be done with it.That's easier said than done, unfortunately.
I don't see an alternative.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.
Martijn van Oosterhout wrote:
On Wed, Sep 10, 2008 at 12:51:02PM +0300, Heikki Linnakangas wrote:
Since the set of collations isn't exactly denumerable, we need some way
to allow the user to specify the collation they want. The only
collation PostgreSQL knows about is the C collation. Anything else is
user-defined.Let's just use the name of the OS locale, like we do now. Having a
pg_collation catalog just moves the problem elsewhere: we'd still need
something in pg_collation to tie the collation to the OS locale.There's not a one-to-one mapping between collation and locale name. A
locale name includes information about the charset and a collation may
have paramters like case-sensetivity and pad-attribute which are not
present in the locale name. You need a mapping anyway, which is what
this table is for.
Ideally, we would delegate the case-sensitivity and padding to the
collation implementation (ie. OS setlocale() or ICU). That said, I don't
think operating systems normally ship case-insensitive variants of
locales by default, so I agree it would be nice if we could implement
that ourselves. Still, we could identify case-sensitive locale names for
example by a suffix, like "en_GB.UTF8.case-insensitive".
I agree we will eventually need a way to give shorthand names for
collations, and a pg_collation catalog will then come handy. But that
can wait until we have the basic infrastructure ready to support column
and query-level collation.
But that put us back where we started: every database having the same
collation. We're trying to move away from that. Just reindex everything
and be done with it.That's easier said than done, unfortunately.
I don't see an alternative.
Well, I proposed disallowing using a different collation than the source
database, except for using template0 as the source. That's pretty
limited, but is trivial to implement and still let's you have databases
with different collations in the same cluster.
I worked a bit on Radek's patch, stripping out all the pg_collate and
pg_charset catalog changes and commands, leaving just the core
functionality of database-level collations. It needs some cleanup and
documentation, but something like this I'd like to commit in this commit
fest. The new catalogs can wait until we have a real need for them.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
wip-collation-nocatalogs-1.patchtext/x-diff; name=wip-collation-nocatalogs-1.patchDownload+228-142
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Well, I proposed disallowing using a different collation than the source
database, except for using template0 as the source. That's pretty limited, but
is trivial to implement and still let's you have databases with different
collations in the same cluster.
+ if (strcmp(dbtemplate, "template0") != 0 &&
+ (strcmp(lc_collate, src_collation) || strcmp(lc_ctype, src_ctype)))
+ ereport(NOTICE,
+ (errmsg("database \"%s\" needs to be reindexed manually (REINDEX DATABASE)",
+ dbname)));
+
This isn't what you described but I think I prefer it this way as just a
warning not an error. I can easily imagine cases where the admin knows there
are either no indexes or no data in their database or where they're perfectly
happy to reindex. A flat prohibition seems annoying.
That said it's worth noting that this would tie our hands with proposals like
retail vacuum which risk database corruption if they fail to refind an index
pointer for a tuple...
Incidentally it seems like the warning should actually explain *why* it needs
to be reindexed manually and perhaps what the consequences are until it is.
AFAIK we can't easily connect to the new database and do some fiddling with
it, can we? If we could we could check if there are any non-empty indexes
which depend on the collation and only print the warning if we find any (and
even mark them invalid).
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
Gregory Stark napsal(a):
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
<snip>
AFAIK we can't easily connect to the new database and do some fiddling with
it, can we? If we could we could check if there are any non-empty indexes
which depend on the collation and only print the warning if we find any (and
even mark them invalid).
Autovacum uses InitPostgres function to swith to another database. I'm
not sure how much safe it is in create database command and when we are
already switched we can reindex affected indexes.
Zdenek