Per-column collation
Here is the next patch in this epic series. [0]
I have addressed most of the issues pointed out in previous reviews and
removed all major outstanding problems that were marked in the code. So
it might just almost really work.
The documentation now also covers everything that's interesting, so
newcomers can start with that.
For those who have previously reviewed this, two major changes:
* The locales to be loaded are now computed by initdb, no longer during
the build process.
* The regression test file has been removed from the main test set. To
run it, use
make check MULTIBYTE=UTF8 EXTRA_TESTS=collate
Stuff that still cannot be expected to work:
* no CREATE COLLATION yet, maybe later
* no support for regular expression searches
* no text search support
These would not be release blockers, I think.
[0]: http://archives.postgresql.org/message-id/1284583568.4696.20.camel@vanquo.pezone.net
Hello
I am checking a patch. I found a problem with initdb
[postgres@pavel-stehule postgresql]$ /usr/local/pgsql/bin/initdb -D
/usr/local/pgsql/data/
could not change directory to "/home/pavel/src/postgresql"
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale en_US.utf8.
The default database encoding has accordingly been set to UTF8.
The default text search configuration will be set to "english".
fixing permissions on existing directory /usr/local/pgsql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 24MB
creating configuration files ... ok
creating template1 database in /usr/local/pgsql/data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ...
initdb: locale name has non-ASCII characters, skipped: bokmål
initdb: locale name has non-ASCII characters, skipped: français
could not determine encoding for locale "hy_AM.armscii8": codeset is "ARMSCII-8"
could not determine encoding for locale "ka_GE": codeset is "GEORGIAN-PS"
could not determine encoding for locale "ka_GE.georgianps": codeset is
"GEORGIAN-PS"
could not determine encoding for locale "kk_KZ": codeset is "PT154"
could not determine encoding for locale "kk_KZ.pt154": codeset is "PT154"
could not determine encoding for locale "tg_TJ": codeset is "KOI8-T"
could not determine encoding for locale "tg_TJ.koi8t": codeset is "KOI8-T"
could not determine encoding for locale "thai": codeset is "TIS-620"
could not determine encoding for locale "th_TH": codeset is "TIS-620"
could not determine encoding for locale "th_TH.tis620": codeset is "TIS-620"
could not determine encoding for locale "vi_VN.tcvn": codeset is "TCVN5712-1"
FATAL: invalid byte sequence for encoding "UTF8": 0xe56c27
child process exited with exit code 1
initdb: removing contents of data directory "/usr/local/pgsql/data"
Tested on Fedora 13.
[postgres@pavel-stehule local]$ locale -a| wc -l
731
Regards
Pavel Stehule
2010/11/15 Peter Eisentraut <peter_e@gmx.net>:
On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote:
I am checking a patch. I found a problem with initdb
Ah, late night brain farts, it appears. Here is a corrected version.
Hello
2010/11/15 Peter Eisentraut <peter_e@gmx.net>:
On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote:
I am checking a patch. I found a problem with initdb
Ah, late night brain farts, it appears. Here is a corrected version.
Yes, it's OK now.
I see still a few issues:
a) The default encoding for a collation isn't the same as the default
encoding of the database. That is unfriendly, to say the least: the most
widely used encoding is UTF8, but in most cases users have to write locale.utf8.
b) There is a bug: the default collation (the database collation) is ignored.
postgres=# show lc_collate;
lc_collate
────────────
cs_CZ.UTF8
(1 row)
Time: 0.518 ms
postgres=# select * from jmena order by v;
v
───────────
Chromečka
Crha
Drobný
Čečetka
(4 rows)
postgres=# select * from jmena order by v collate "cs_CZ.utf8";
v
───────────
Crha
Čečetka
Drobný
Chromečka
(4 rows)
Both results should be the same.
Isn't there a problem with case-sensitive collation names? When I use the
lc_collate value, I get an error message:
postgres=# select * from jmena order by v collate "cs_CZ.UTF8";
ERROR: collation "cs_CZ.UTF8" for current database encoding "UTF8"
does not exist
LINE 1: select * from jmena order by v collate "cs_CZ.UTF8";
The problem occurs when the table is created without an explicit collation.
Regards
Pavel Stehule
On mån, 2010-11-15 at 23:13 +0100, Pavel Stehule wrote:
a) The default encoding for a collation isn't the same as the default
encoding of the database. That is unfriendly, to say the least: the most
widely used encoding is UTF8, but in most cases users have to write locale.utf8.
I don't understand what you are trying to say. Please provide more
detail.
b) There is a bug: the default collation (the database collation) is ignored.
postgres=# show lc_collate;
lc_collate
────────────
cs_CZ.UTF8
(1 row)
Time: 0.518 ms
postgres=# select * from jmena order by v;
v
───────────
Chromečka
Crha
Drobný
Čečetka
(4 rows)
postgres=# select * from jmena order by v collate "cs_CZ.utf8";
v
───────────
Crha
Čečetka
Drobný
Chromečka
(4 rows)
Both results should be the same.
I tried to reproduce this here but got the expected results. Could you
try to isolate a complete test script?
Isn't there a problem with case-sensitive collation names? When I use the
lc_collate value, I get an error message:
postgres=# select * from jmena order by v collate "cs_CZ.UTF8";
ERROR: collation "cs_CZ.UTF8" for current database encoding "UTF8"
does not exist
LINE 1: select * from jmena order by v collate "cs_CZ.UTF8";
The problem occurs when the table is created without an explicit collation.
Well, I agree it's not totally nice, but we have to do something, and I
think it's logical to use the locale names as collation names by
default, and collation names are SQL identifiers. Do you have any ideas
for improving this?
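For readers following the thread, the failure Pavel hit follows from ordinary SQL identifier rules; a minimal sketch (using the jmena table from the messages above, and assuming a cluster whose initdb imported cs_CZ.utf8):

```sql
-- Collation names derived from OS locale names are case-sensitive SQL
-- identifiers, so they must be double-quoted exactly as initdb stored them:
SELECT * FROM jmena ORDER BY v COLLATE "cs_CZ.utf8";  -- matches the catalog entry
SELECT * FROM jmena ORDER BY v COLLATE "cs_CZ.UTF8";  -- a different identifier: not found
```

The second statement fails not because of the encoding but because "cs_CZ.UTF8" and "cs_CZ.utf8" are distinct quoted identifiers.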
Hello
2010/11/16 Peter Eisentraut <peter_e@gmx.net>:
On mån, 2010-11-15 at 23:13 +0100, Pavel Stehule wrote:
a) The default encoding for a collation isn't the same as the default
encoding of the database. That is unfriendly, to say the least: the most
widely used encoding is UTF8, but in most cases users have to write locale.utf8.
I don't understand what you are trying to say. Please provide more
detail.
See below.
b) There is a bug: the default collation (the database collation) is ignored.
postgres=# show lc_collate;
lc_collate
────────────
cs_CZ.UTF8
(1 row)
Time: 0.518 ms
postgres=# select * from jmena order by v;
v
───────────
Chromečka
Crha
Drobný
Čečetka
(4 rows)
postgres=# select * from jmena order by v collate "cs_CZ.utf8";
v
───────────
Crha
Čečetka
Drobný
Chromečka
(4 rows)
Both results should be the same.
I tried to reproduce this here but got the expected results. Could you
try to isolate a complete test script?
I can't reproduce it now either, on a different system and machine.
Maybe I did something wrong. Sorry.
Isn't there a problem with case-sensitive collation names? When I use the
lc_collate value, I get an error message:
postgres=# select * from jmena order by v collate "cs_CZ.UTF8";
ERROR: collation "cs_CZ.UTF8" for current database encoding "UTF8"
does not exist
LINE 1: select * from jmena order by v collate "cs_CZ.UTF8";
The problem occurs when the table is created without an explicit collation.
Well, I agree it's not totally nice, but we have to do something, and I
think it's logical to use the locale names as collation names by
default, and collation names are SQL identifiers. Do you have any ideas
for improving this?
Yes. My first question is: why do we need to specify the encoding when only
one encoding is supported? I can't use cs_CZ.iso88592 when my database
uses UTF8. By the way, the message there is wrong:
yyy=# select * from jmena order by jmeno collate "cs_CZ.iso88592";
ERROR: collation "cs_CZ.iso88592" for current database encoding
"UTF8" does not exist
LINE 1: select * from jmena order by jmeno collate "cs_CZ.iso88592";
^
I don't know why, but the preferred encoding for Czech is iso88592 now;
yet I can't use it, so I can't use the names "czech" or "cs_CZ". I
always have to use the full name "cs_CZ.utf8". That's wrong. Worse, from
that moment my application depends on whichever encoding was used first:
I can't change the encoding without refactoring SQL statements, because the
encoding is hard-coded there (in the collation clause).
So I don't understand why you fill the pg_collation table with thousands
of collations that are impossible to use. If I use UTF8, then there
should be only UTF8-based collations. And if you need to support a wide
set of collations, then I am for preferring UTF8, at least for the central
European region: if somebody wants to use collations here, he will
use a combination of cs, de, and en, so he must use latin2 and latin1,
or UTF8. I think the encoding should not be part of the collation name
where possible.
Regards
Pavel
Show quoted text
On tis, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote:
Yes. My first question is: why do we need to specify the encoding when only
one encoding is supported? I can't use cs_CZ.iso88592 when my database
uses UTF8. By the way, the message there is wrong:
yyy=# select * from jmena order by jmeno collate "cs_CZ.iso88592";
ERROR: collation "cs_CZ.iso88592" for current database encoding
"UTF8" does not exist
LINE 1: select * from jmena order by jmeno collate "cs_CZ.iso88592";
^
Sorry, is there some mistake in that message?
I don't know why, but the preferred encoding for Czech is iso88592 now;
yet I can't use it, so I can't use the names "czech" or "cs_CZ". I
always have to use the full name "cs_CZ.utf8". That's wrong. Worse, from
that moment my application depends on whichever encoding was used first:
I can't change the encoding without refactoring SQL statements, because the
encoding is hard-coded there (in the collation clause).
I can only look at the locales that the operating system provides. We
could conceivably make some simplifications like stripping off the
".utf8", but then how far do we go and where do we stop? Locale names
on Windows look different too. But in general, how do you suppose we
should map an operating system locale name to an "acceptable" SQL
identifier? You might hope, for example, that we could look through the
list of operating system locale names and map, say,
cs_CZ -> "czech"
cs_CZ.iso88592 -> "czech"
cs_CZ.utf8 -> "czech"
czech -> "czech"
but we have no way to actually know that these are semantically similar,
so this illustrative mapping is AI-complete. We need to take the locale
names as is, and that may or may not carry encoding information.
So I don't understand why you fill the pg_collation table with thousands
of collations that are impossible to use. If I use UTF8, then there
should be only UTF8-based collations. And if you need to support a wide
set of collations, then I am for preferring UTF8, at least for the central
European region: if somebody wants to use collations here, he will
use a combination of cs, de, and en, so he must use latin2 and latin1,
or UTF8. I think the encoding should not be part of the collation name
where possible.
Different databases can have different encodings, but the pg_collation
catalog is copied from the template database in any case. We can't do
any changes in system catalogs as we create new databases, so the
"useless" collations have to be there. There are only a few hundred,
actually, so it's not really a lot of wasted space.
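To see what initdb actually loaded, one can inspect the catalog directly; a sketch, assuming the column names that ended up in the committed catalog (collname, collencoding), which may differ in this patch revision:

```sql
-- List every collation the patch imported, with the encoding each is bound
-- to; entries whose encoding does not match the current database exist in
-- the catalog but cannot be used there.
SELECT collname, collencoding FROM pg_collation ORDER BY 1;
```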
2010/11/16 Peter Eisentraut <peter_e@gmx.net>:
On tis, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote:
Yes. My first question is: why do we need to specify the encoding when only
one encoding is supported? I can't use cs_CZ.iso88592 when my database
uses UTF8. By the way, the message there is wrong:
yyy=# select * from jmena order by jmeno collate "cs_CZ.iso88592";
ERROR: collation "cs_CZ.iso88592" for current database encoding
"UTF8" does not exist
LINE 1: select * from jmena order by jmeno collate "cs_CZ.iso88592";
^
Sorry, is there some mistake in that message?
It is unclear; I expected something like "cannot use collation
cs_CZ.iso88592, because your database uses UTF8 encoding".
I don't know why, but the preferred encoding for Czech is iso88592 now;
yet I can't use it, so I can't use the names "czech" or "cs_CZ". I
always have to use the full name "cs_CZ.utf8". That's wrong. Worse, from
that moment my application depends on whichever encoding was used first:
I can't change the encoding without refactoring SQL statements, because the
encoding is hard-coded there (in the collation clause).
I can only look at the locales that the operating system provides. We
could conceivably make some simplifications like stripping off the
".utf8", but then how far do we go and where do we stop? Locale names
on Windows look different too. But in general, how do you suppose we
should map an operating system locale name to an "acceptable" SQL
identifier? You might hope, for example, that we could look through the
list of operating system locale names and map, say,
cs_CZ -> "czech"
cs_CZ.iso88592 -> "czech"
cs_CZ.utf8 -> "czech"
czech -> "czech"but we have no way to actually know that these are semantically similar,
so this illustrated mapping is AI complete. We need to take the locale
names as is, and that may or may not carry encoding information.So I don't understand, why you fill a table pg_collation with thousand
collated that are not possible to use? If I use a utf8, then there
should be just utf8 based collates. And if you need to work with wide
collates, then I am for a preferring utf8 - minimally for central
europe region. if somebody would to use a collates here, then he will
use a combination cs, de, en - so it must to use a latin2 and latin1
or utf8. I think so encoding should not be a part of collation when it
is possible.Different databases can have different encodings, but the pg_collation
catalog is copied from the template database in any case. We can't do
any changes in system catalogs as we create new databases, so the
"useless" collations have to be there. There are only a few hundred,
actually, so it's not really a lot of wasted space.
I don't have a problem with the size. I just think the current behavior
isn't practical. When the database encoding is UTF8, I expect that "cs_CZ"
or "czech" will be for UTF8. I understand that template0 must have all
the locales, and I understand why the current behavior is as it is, but it
is very user-unfriendly. Actually, only old applications in the Czech
Republic use latin2 and almost all use UTF-8, yet latin2 is now the
preferred one. This is bad and should be solved.
Regards
Pavel
I can only look at the locales that the operating system provides. We
could conceivably make some simplifications like stripping off the
".utf8", but then how far do we go and where do we stop? Locale names
on Windows look different too. But in general, how do you suppose we
should map an operating system locale name to an "acceptable" SQL
identifier? You might hope, for example, that we could look through the
It would be nice if we could have some mapping of locale names built
in, so one doesn't have to write alternative SQL depending on the DB
server OS:
select * from tab order by foo collate "Polish, Poland"
select * from tab order by foo collate "pl_PL.UTF-8"
(that's how it works now, correct?)
Greetings
Marcin Mańk
2010/11/16 marcin mank <marcin.mank@gmail.com>:
I can only look at the locales that the operating system provides. We
could conceivably make some simplifications like stripping off the
".utf8", but then how far do we go and where do we stop? Locale names
on Windows look different too. But in general, how do you suppose we
should map an operating system locale name to an "acceptable" SQL
identifier? You might hope, for example, that we could look through the
It would be nice if we could have some mapping of locale names built
in, so one doesn't have to write alternative SQL depending on the DB
server OS:
+1
Pavel
select * from tab order by foo collate "Polish, Poland"
select * from tab order by foo collate "pl_PL.UTF-8"
(that's how it works now, correct?)
Greetings
Marcin Mańk
On tis, 2010-11-16 at 20:59 +0100, Pavel Stehule wrote:
2010/11/16 Peter Eisentraut <peter_e@gmx.net>:
On tis, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote:
Yes. My first question is: why do we need to specify the encoding when only
one encoding is supported? I can't use cs_CZ.iso88592 when my database
uses UTF8. By the way, the message there is wrong:
yyy=# select * from jmena order by jmeno collate "cs_CZ.iso88592";
ERROR: collation "cs_CZ.iso88592" for current database encoding
"UTF8" does not exist
LINE 1: select * from jmena order by jmeno collate "cs_CZ.iso88592";
^
Sorry, is there some mistake in that message?
It is unclear; I expected something like "cannot use collation
cs_CZ.iso88592, because your database uses UTF8 encoding".
No, the namespace for collations is per encoding. (This is per SQL
standard.) So you *could* have a collation called "cs_CZ.iso88592" for
the UTF8 encoding.
I don't have a problem with the size. I just think the current behavior
isn't practical. When the database encoding is UTF8, I expect that "cs_CZ"
or "czech" will be for UTF8. I understand that template0 must have all
the locales, and I understand why the current behavior is as it is, but it
is very user-unfriendly. Actually, only old applications in the Czech
Republic use latin2 and almost all use UTF-8, yet latin2 is now the
preferred one. This is bad and should be solved.
Again, we can only look at the locale names that the operating system
gives us. Mapping that to the names you expect is an AI problem. If
you have a solution, please share it.
On tis, 2010-11-16 at 21:05 +0100, marcin mank wrote:
It would be nice if we could have some mapping of locale names bult
in, so one doesn`t have to write alternative sql depending on DB
server OS:
select * from tab order by foo collate "Polish, Poland"
select * from tab order by foo collate "pl_PL.UTF-8"
Sure that would be nice, but how do you hope to do that?
2010/11/16 Peter Eisentraut <peter_e@gmx.net>:
On tis, 2010-11-16 at 20:59 +0100, Pavel Stehule wrote:
2010/11/16 Peter Eisentraut <peter_e@gmx.net>:
On tis, 2010-11-16 at 20:00 +0100, Pavel Stehule wrote:
Yes. My first question is: why do we need to specify the encoding when only
one encoding is supported? I can't use cs_CZ.iso88592 when my database
uses UTF8. By the way, the message there is wrong:
yyy=# select * from jmena order by jmeno collate "cs_CZ.iso88592";
ERROR: collation "cs_CZ.iso88592" for current database encoding
"UTF8" does not exist
LINE 1: select * from jmena order by jmeno collate "cs_CZ.iso88592";
^
Sorry, is there some mistake in that message?
It is unclear; I expected something like "cannot use collation
cs_CZ.iso88592, because your database uses UTF8 encoding".
No, the namespace for collations is per encoding. (This is per SQL
standard.) So you *could* have a collation called "cs_CZ.iso88592" for
the UTF8 encoding.
I don't have a problem with the size. I just think the current behavior
isn't practical. When the database encoding is UTF8, I expect that "cs_CZ"
or "czech" will be for UTF8. I understand that template0 must have all
the locales, and I understand why the current behavior is as it is, but it
is very user-unfriendly. Actually, only old applications in the Czech
Republic use latin2 and almost all use UTF-8, yet latin2 is now the
preferred one. This is bad and should be solved.
Again, we can only look at the locale names that the operating system
gives us. Mapping that to the names you expect is an AI problem. If
you have a solution, please share it.
OK, then we should be able to define such an alias manually, something like
CREATE COLLATION "czech" FOR LOCALE "cs_CZ.UTF8"
or similar. Without this, applications or stored procedures can be
non-portable between UNIX and Windows.
Peter, initdb now checks the relationship between encoding and locale, and
that check is portable. Can we use this code?
Pavel
On tis, 2010-11-16 at 21:40 +0100, Pavel Stehule wrote:
OK, then we should be able to define such an alias manually, something like
CREATE COLLATION "czech" FOR LOCALE "cs_CZ.UTF8"
or similar. Without this, applications or stored procedures can be
non-portable between UNIX and Windows.
Yes, such a command will be provided. You can already do it manually.
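For the archives: the command that was eventually provided (in PostgreSQL 9.1) spells the alias Pavel wants roughly like this; syntax shown for illustration, not part of the patch under review:

```sql
-- Define an encoding-free alias for the operating system locale, so that
-- application SQL need not hard-code the encoding suffix:
CREATE COLLATION czech (LOCALE = 'cs_CZ.utf8');
SELECT * FROM jmena ORDER BY v COLLATE czech;
```

With such an alias in each database, switching the underlying locale only requires redefining the collation, not rewriting the SQL that references it.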
Peter, initdb now checks the relationship between encoding and locale, and
that check is portable. Can we use this code?
Hmm, not really, but something similar, I suppose. Only that the
mapping list would be much longer and more volatile.
On Tue, Nov 16, 2010 at 10:32:01PM +0200, Peter Eisentraut wrote:
On tis, 2010-11-16 at 21:05 +0100, marcin mank wrote:
It would be nice if we could have some mapping of locale names built
in, so one doesn't have to write alternative SQL depending on the DB
server OS:
select * from tab order by foo collate "Polish, Poland"
select * from tab order by foo collate "pl_PL.UTF-8"
Sure that would be nice, but how do you hope to do that?
Given that each operating system comes with a different set of
collations, it seems unlikely you could even find two collations on
different OSes that even correspond. There's not a lot of
standardisation here (well, except for the Unicode collation
algorithm, but that doesn't help with language variations).
I don't think this is a big deal for now, perhaps after per-column
collation is implemented we can work on the portability issues. Make it
work, then make it better.
</me ducks>
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Patriotism is when love of your own people comes first; nationalism,
when hate for people other than your own comes first.
- Charles de Gaulle
Martijn van Oosterhout <kleptog@svana.org> writes:
On Tue, Nov 16, 2010 at 10:32:01PM +0200, Peter Eisentraut wrote:
On tis, 2010-11-16 at 21:05 +0100, marcin mank wrote:
It would be nice if we could have some mapping of locale names built
in, so one doesn't have to write alternative SQL depending on the DB
server OS:
Sure that would be nice, but how do you hope to do that?
Given that each operating system comes with a different set of
collations, it seems unlikely you could even find two collations on
different OSes that even correspond.
Yeah, the *real* portability problem here is that the locale behavior is
likely to be different, not just the name. I don't think we'd be doing
people many favors by masking behavioral differences behind a forced
common name.
regards, tom lane
2010/11/16 Tom Lane <tgl@sss.pgh.pa.us>:
Martijn van Oosterhout <kleptog@svana.org> writes:
On Tue, Nov 16, 2010 at 10:32:01PM +0200, Peter Eisentraut wrote:
On tis, 2010-11-16 at 21:05 +0100, marcin mank wrote:
It would be nice if we could have some mapping of locale names built
in, so one doesn't have to write alternative SQL depending on the DB
server OS:
Sure that would be nice, but how do you hope to do that?
Given that each operating system comes with a different set of
collations, it seems unlikely you could even find two collations on
different OSes that even correspond.
Yeah, the *real* portability problem here is that the locale behavior is
likely to be different, not just the name. I don't think we'd be doing
people many favors by masking behavioral differences between a forced
common name.
No; at minimum, cs_CZ.utf8 and cs_CZ.iso88592 behave the same. But
without any "alias", the user has to modify source code when he changes
the encoding.
Pavel
On 15.11.2010 21:42, Peter Eisentraut wrote:
On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote:
I am checking a patch. I found a problem with initdb
Ah, late night brain farts, it appears. Here is a corrected version.
Some random comments:
In syntax.sgml:
+     The <literal>COLLATE</literal> clause overrides the collation of
+     an expression. It is appended to the expression at applies to:
That last sentence doesn't parse.
Would it be possible to eliminate the ExecEvalCollateClause function
somehow? It just calls through the argument. How about directly
returning the argument ExprState in ExecInitExpr?
get_collation_name() returns the plain name without schema, so it's not
good enough for use in ruleutils.c. pg_dump is also ignoring collation's
schema.
Have you done any performance testing? Functions like text_cmp can be a
hotspot in sorting, so any extra overhead there might show up in tests.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Nov 16, 2010 at 04:42, Peter Eisentraut <peter_e@gmx.net> wrote:
On mån, 2010-11-15 at 11:34 +0100, Pavel Stehule wrote:
I am checking a patch. I found a problem with initdb
Ah, late night brain farts, it appears. Here is a corrected version.
This version cannot be applied cleanly any more. Please update it.
(I think you don't have to include changes for catversion.h)
./src/backend/optimizer/util/plancat.c.rej
./src/backend/optimizer/plan/createplan.c.rej
./src/backend/optimizer/path/indxpath.c.rej
./src/include/catalog/catversion.h.rej
I didn't compile nor run the patched server, but I found a couple of
issues in the design and source code:
* COLLATE information must be explicitly passed by the caller in the patch,
but we might forget the handover when we write new code. Is it possible
to pass it automatically, say using a global variable? If we could do so,
existing extensions might work with collation without being rewritten.
* Did you check the regression test on Windows? We probably cannot use
en_US.utf8 on Windows. Also, some output of the test includes non-ASCII
characters. How will we test COLLATE feature on non-UTF8 databases?
[src/test/regress/sql/collate.sql]
+CREATE TABLE collate_test1 (
+ a int,
+ b text COLLATE "en_US.utf8" not null
+);
* Did you see any performance regression by collation?
I found a bug in lc_collate_is_c(); result >= 0 should be
checked before any other checks. SearchSysCache1() here
would be a performance regression.
[src/backend/utils/adt/pg_locale.c]
-lc_collate_is_c(void)
+lc_collate_is_c(Oid collation)
{
...
+ tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(collation));
...
HERE => if (result >= 0)
return (bool) result;
--
Itagaki Takahiro
On tor, 2010-11-18 at 21:37 +0200, Heikki Linnakangas wrote:
Have you done any performance testing? Functions like text_cmp can be
a hotspot in sorting, so any extra overhead there might show up in
tests.
Without having optimized it very much yet, the performance for a 1GB
ORDER BY is
* without COLLATE clause: about the same as without the patch
* with COLLATE clause: about 30%-50% slower
I can imagine that there is some optimization potential in the latter
case. But in any case, it's not awfully slow.