initdb in 8.3

Started by Tim Tassonisalmost 18 years ago14 messagesgeneral
Jump to latest
#1Tim Tassonis
timtas@cubic.ch

Hi

I just recently compiled and installed 8.3.1 on a System that has UTF-8
as the default characterset in the environment. Copied the binaries, run
initdb without parameters, the usual stuff.

As you probably are all aware of, this results now in a cluster that
will only allow you to create UTF-8 databases. I have read some posts
regarding this topic where it is explained that allowing LATIN1 on a
cluster initialized with UTF-8 will give you problems anyway etc and you
have to use locale=C to be able to create databases with different
charactersets.

My question is: Why then is --locale=C not the default for initdb, as I
do regard it as a rather big annoyance that a default installation on
probably almost any modern linux distribution results in a UTF-8 only
cluster, fixable only by dropping all databases, rerun initdb and the
reimporting them again.

Apart from that I'd of course like to restate that postgresql is a
wonderful software which I like very much.

Bye
Tim

#2Richard Huxton
dev@archonet.com
In reply to: Tim Tassonis (#1)
Re: initdb in 8.3

Tim Tassonis wrote:

Hi

I just recently compiled and installed 8.3.1 on a System that has UTF-8
as the default characterset in the environment. Copied the binaries, run
initdb without parameters, the usual stuff.

As you probably are all aware of, this results now in a cluster that
will only allow you to create UTF-8 databases. I have read some posts
regarding this topic where it is explained that allowing LATIN1 on a
cluster initialized with UTF-8 will give you problems anyway etc and you
have to use locale=C to be able to create databases with different
charactersets.

You can only have one locale per installation (initdb'd cluster).
You can create a database with any encoding that is compatible with that
locale (LATIN1, LATIN9, UTF-8, etc).

Now, the locale controls (amongst other things) all your sorting. If you
choose a locale of "C" you get a simple binary sorting. By default PG
tries to match whatever sorting you have set up on your operating-system.

See the difference below:

richardh@server3:db$ LANG=C sort /tmp/words.txt
apple
berry
Apple
apple
apples
the apple

richardh@server3:db$ LANG=en_GB.UTF-8 sort /tmp/words.txt
apple
apple
Apple
apples
berry
the apple

I think someone is looking at per-database locales for 8.4 - the issue
is more tricky than you might think because you need to worry about
system catalogue sort-order.

--
Richard Huxton
Archonet Ltd

#3Peter Eisentraut
peter_e@gmx.net
In reply to: Tim Tassonis (#1)
Re: initdb in 8.3

Am Mittwoch, 23. April 2008 schrieb Tim Tassonis:

My question is: Why then is --locale=C not the default for initdb, as I
do regard it as a rather big annoyance that a default installation on
probably almost any modern linux distribution results in a UTF-8 only
cluster, fixable only by dropping all databases, rerun initdb and the
reimporting them again.

Because the vast majority of users prefers UTF-8 encoded databases over C
locale databases.

#4Tim Tassonis
timtas@cubic.ch
In reply to: Peter Eisentraut (#3)
Re: initdb in 8.3

Peter Eisentraut wrote:

Am Mittwoch, 23. April 2008 schrieb Tim Tassonis:

My question is: Why then is --locale=C not the default for initdb, as I
do regard it as a rather big annoyance that a default installation on
probably almost any modern linux distribution results in a UTF-8 only
cluster, fixable only by dropping all databases, rerun initdb and the
reimporting them again.

Because the vast majority of users prefers UTF-8 encoded databases over C
locale databases.

Ok, let me put it in another way. If UTF-8 is chosen at initdb, only
UTF-8 databases can be created, if C is chosen, you can specify
different encodings (UTF-8, LATIN1 etc) for each database.

As I understood now, sorting will then still be in C style and not in
the locale specific way. Which leads me to the following questions:

If specifying a characterset different from the default locale for a
database is such a bad idea, why is it possible at all?

From how I understand you, if I wanted a postgres server machine
supporting databases with different charsets, I'm advised to initialise
one cluster per locale.

If specifying a characterset different from the default locale for a
database is not a bad idea, why does the default install forbid me to do
exactly this?

Regards
Tim

#5Martijn van Oosterhout
kleptog@svana.org
In reply to: Tim Tassonis (#4)
Re: initdb in 8.3

On Wed, Apr 23, 2008 at 04:35:04PM +0200, Tim Tassonis wrote:

Ok, let me put it in another way. If UTF-8 is chosen at initdb, only
UTF-8 databases can be created, if C is chosen, you can specify
different encodings (UTF-8, LATIN1 etc) for each database.

As I understood now, sorting will then still be in C style and not in
the locale specific way. Which leads me to the following questions:

If specifying a characterset different from the default locale for a
database is such a bad idea, why is it possible at all?

It isn't possible, that's the point. What is possible is that client
can use any encoding they like to talk to the server, but the server
will store and manage it all in one. What locale C means "I'm an
encoding wizard and will ensure all my programs can handle all the
encodings I want to use, because I understand the database will treat
everything I send as ASCII bytes no matter what encoding the clients
say it is".

From how I understand you, if I wanted a postgres server machine
supporting databases with different charsets, I'm advised to initialise
one cluster per locale.

If you want to control the *storage* charset, yes. If you just want
clients to think it's a LATIN9 DB, doing a:

ALTER DATABASE foo SET client_encoding=latin9;

If specifying a characterset different from the default locale for a
database is not a bad idea, why does the default install forbid me to do
exactly this?

It is a bad idea, because most normal the C library can only handle one
encoding at a time. Locale C is a backdoor because it has system
independant semantics and does not require libc. It's also not what
people usually want, and so not recommended.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.

#6Peter Eisentraut
peter_e@gmx.net
In reply to: Tim Tassonis (#4)
Re: initdb in 8.3

Am Mittwoch, 23. April 2008 schrieb Tim Tassonis:

If specifying a characterset different from the default locale for a
database is such a bad idea, why is it possible at all?

Because Japanese users need this functionality. Aside from spectacularly
bizarre niche applications, that is really the only reason. If you are
dealing with European-type languages, just forget about it and specify the
locale you want to initdb and let the encodings fall into place.

#7Tim Tassonis
timtas@cubic.ch
In reply to: Martijn van Oosterhout (#5)
Re: initdb in 8.3

Martijn van Oosterhout wrote:

On Wed, Apr 23, 2008 at 04:35:04PM +0200, Tim Tassonis wrote:

If specifying a characterset different from the default locale for a
database is such a bad idea, why is it possible at all?

It isn't possible, that's the point. What is possible is that client
can use any encoding they like to talk to the server, but the server
will store and manage it all in one. What locale C means "I'm an
encoding wizard and will ensure all my programs can handle all the
encodings I want to use, because I understand the database will treat
everything I send as ASCII bytes no matter what encoding the clients
say it is".

From how I understand you, if I wanted a postgres server machine
supporting databases with different charsets, I'm advised to initialise
one cluster per locale.

If you want to control the *storage* charset, yes. If you just want
clients to think it's a LATIN9 DB, doing a:

ALTER DATABASE foo SET client_encoding=latin9;

Ok, got it, it's really this setting that's interesting. If I have a
legacy application that defaults to latin1, I can leave the DB to UTF-8
,set the client_encoding to latin1 and then all my selects and inserts
can use latin1, but the data will be stored in utf-8.

Well, that's really all I need, sorry for the confusion.

Thanks a lot
Tim

#8Christopher Condit
condit@sdsc.edu
In reply to: Tim Tassonis (#7)
Re: initdb in 8.3

I have a question related to this issue:
Now that the locale has changed, it seems that the planner no longer
wants to use the indexes for running LIKE queries on varchar columns
unless I specify varchar_pattern_ops when creating the index. And if I
create the index with varchar_pattern_ops, then the planner won't use
it for = queries.

What's the correct solution to this problem (when using UTF-8 and
lc_collate and lc_ctype are both 1252)? Do I need to create two indexes?

Thanks,
-Chris

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tim Tassonis
Sent: Wednesday, April 23, 2008 12:22
To: Martijn van Oosterhout
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] initdb in 8.3

Martijn van Oosterhout wrote:

On Wed, Apr 23, 2008 at 04:35:04PM +0200, Tim Tassonis wrote:

If specifying a characterset different from the default locale for a
database is such a bad idea, why is it possible at all?

It isn't possible, that's the point. What is possible is that client
can use any encoding they like to talk to the server, but the server
will store and manage it all in one. What locale C means "I'm an
encoding wizard and will ensure all my programs can handle all the
encodings I want to use, because I understand the database will treat
everything I send as ASCII bytes no matter what encoding the clients
say it is".

From how I understand you, if I wanted a postgres server machine
supporting databases with different charsets, I'm advised to

initialise

one cluster per locale.

If you want to control the *storage* charset, yes. If you just want
clients to think it's a LATIN9 DB, doing a:

ALTER DATABASE foo SET client_encoding=latin9;

Ok, got it, it's really this setting that's interesting. If I have a
legacy application that defaults to latin1, I can leave the DB to UTF-8
,set the client_encoding to latin1 and then all my selects and inserts
can use latin1, but the data will be stored in utf-8.

Well, that's really all I need, sorry for the confusion.

Thanks a lot
Tim

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#9Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Tim Tassonis (#1)
Re: initdb in 8.3

On Wed, Apr 23, 2008 at 11:46:35AM +0200, Tim Tassonis wrote:

As you probably are all aware of, this results now in a cluster that
will only allow you to create UTF-8 databases. I have read some posts
regarding this topic where it is explained that allowing LATIN1 on a
cluster initialized with UTF-8 will give you problems

Ain't it the other way round ?

Creating UTF8 databases on LATIN1 clusters won't work ?

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

#10Tim Tassonis
timtas@cubic.ch
In reply to: Karsten Hilbert (#9)
Re: initdb in 8.3

Karsten Hilbert wrote:

On Wed, Apr 23, 2008 at 11:46:35AM +0200, Tim Tassonis wrote:

As you probably are all aware of, this results now in a cluster that
will only allow you to create UTF-8 databases. I have read some posts
regarding this topic where it is explained that allowing LATIN1 on a
cluster initialized with UTF-8 will give you problems

Ain't it the other way round ?

Creating UTF8 databases on LATIN1 clusters won't work ?

No it's this way round, but maybe the other way round, too.
You can only create databases with arbitrary encodings on C locale
clusters from 8.3 on.

Regards
Tim

Show quoted text

Karsten

#11Craig Ringer
craig@2ndquadrant.com
In reply to: Christopher Condit (#8)
Re: initdb in 8.3

Christopher Condit wrote:

I have a question related to this issue:
Now that the locale has changed, it seems that the planner no longer
wants to use the indexes for running LIKE queries on varchar columns
unless I specify varchar_pattern_ops when creating the index. And if I
create the index with varchar_pattern_ops, then the planner won't use
it for = queries.

What's the correct solution to this problem (when using UTF-8 and
lc_collate and lc_ctype are both 1252)? Do I need to create two indexes?

http://www.postgresql.org/docs/faqs.FAQ.html#item4.6
http://www.postgresql.org/docs/8.3/static/indexes-opclass.html

It should really also be mentioned in the section on the LIKE operator.

By the way, when I tried to leave a comment on the pattern matching
operator doc page I found that after logging in I was presented with a
blank comments form again. When I re-entered my comment and submitted I
got an error indicating that '' is not a valid NUMERIC. Going back to
the original comments form on the pattern matching operator doc page
(having logged in) and submitting there works fine.

--
Craig Ringer

#12Christopher Condit
condit@sdsc.edu
In reply to: Craig Ringer (#11)
Re: initdb in 8.3

Ahhh - I See. Thanks, Craig.
Although, once you've built the index with varchar_pattern_ops index,
the following two (essentially equivalent) queries will run at vastly
different speeds:
select * from A where A.value like 'Nacho';
select * from A where A.value = 'Nacho';

Seems that the optimizer should catch this...
C

-----Original Message-----
From: Craig Ringer [mailto:craig@postnewspapers.com.au]
Sent: Thursday, April 24, 2008 01:20
To: Christopher Condit
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] initdb in 8.3

Christopher Condit wrote:

I have a question related to this issue:
Now that the locale has changed, it seems that the planner no longer
wants to use the indexes for running LIKE queries on varchar columns
unless I specify varchar_pattern_ops when creating the index. And if I
create the index with varchar_pattern_ops, then the planner won't use
it for = queries.

What's the correct solution to this problem (when using UTF-8 and
lc_collate and lc_ctype are both 1252)? Do I need to create two

indexes?

http://www.postgresql.org/docs/faqs.FAQ.html#item4.6
http://www.postgresql.org/docs/8.3/static/indexes-opclass.html

It should really also be mentioned in the section on the LIKE operator.

By the way, when I tried to leave a comment on the pattern matching
operator doc page I found that after logging in I was presented with a
blank comments form again. When I re-entered my comment and submitted I
got an error indicating that '' is not a valid NUMERIC. Going back to
the original comments form on the pattern matching operator doc page
(having logged in) and submitting there works fine.

--
Craig Ringer

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christopher Condit (#12)
Re: initdb in 8.3

"Christopher Condit" <condit@sdsc.edu> writes:

Although, once you've built the index with varchar_pattern_ops index,
the following two (essentially equivalent) queries will run at vastly
different speeds:
select * from A where A.value like 'Nacho';
select * from A where A.value = 'Nacho';

Seems that the optimizer should catch this...

Yeah, it's on the to-do list ...
http://archives.postgresql.org/pgsql-hackers/2008-02/msg01003.php

regards, tom lane

#14Zdenek Kotala
Zdenek.Kotala@Sun.COM
In reply to: Richard Huxton (#2)
Re: initdb in 8.3

Richard Huxton napsal(a):

I think someone is looking at per-database locales for 8.4 - the issue
is more tricky than you might think because you need to worry about
system catalogue sort-order.

There is Google Soc project for implementing collation per database level. I
hope it will appear in 8.4 version.

Catalog sort-order is not affected by locale, because "name" data type uses
different operator then varchar.

Zdenek