initdb in 8.3

Started by Tim Tassonisabout 18 years ago14 messagesgeneral

timtas@cubic.ch

about 18 years ago

I just recently compiled and installed 8.3.1 on a System that has UTF-8
as the default characterset in the environment. Copied the binaries, run
initdb without parameters, the usual stuff.

As you probably are all aware of, this results now in a cluster that
will only allow you to create UTF-8 databases. I have read some posts
regarding this topic where it is explained that allowing LATIN1 on a
cluster initialized with UTF-8 will give you problems anyway etc and you
have to use locale=C to be able to create databases with different
charactersets.

My question is: Why then is --locale=C not the default for initdb, as I
do regard it as a rather big annoyance that a default installation on
probably almost any modern linux distribution results in a UTF-8 only
cluster, fixable only by dropping all databases, rerun initdb and the
reimporting them again.

Apart from that I'd of course like to restate that postgresql is a
wonderful software which I like very much.

Bye
Tim

Richard Huxton

dev@archonet.com

about 18 years ago

In reply to: Tim Tassonis (#1)

Re: initdb in 8.3

Tim Tassonis wrote:

Hi

I just recently compiled and installed 8.3.1 on a System that has UTF-8
as the default characterset in the environment. Copied the binaries, run
initdb without parameters, the usual stuff.

As you probably are all aware of, this results now in a cluster that
will only allow you to create UTF-8 databases. I have read some posts
regarding this topic where it is explained that allowing LATIN1 on a
cluster initialized with UTF-8 will give you problems anyway etc and you
have to use locale=C to be able to create databases with different
charactersets.

You can only have one locale per installation (initdb'd cluster).
You can create a database with any encoding that is compatible with that
locale (LATIN1, LATIN9, UTF-8, etc).

Now, the locale controls (amongst other things) all your sorting. If you
choose a locale of "C" you get a simple binary sorting. By default PG
tries to match whatever sorting you have set up on your operating-system.

See the difference below:

richardh@server3:db$ LANG=C sort /tmp/words.txt
apple
berry
Apple
apple
apples
the apple

richardh@server3:db$ LANG=en_GB.UTF-8 sort /tmp/words.txt
apple
apple
Apple
apples
berry
the apple

I think someone is looking at per-database locales for 8.4 - the issue
is more tricky than you might think because you need to worry about
system catalogue sort-order.

--
Richard Huxton
Archonet Ltd

Peter Eisentraut

peter_e@gmx.net

about 18 years ago

In reply to: Tim Tassonis (#1)

Re: initdb in 8.3

Am Mittwoch, 23. April 2008 schrieb Tim Tassonis:

My question is: Why then is --locale=C not the default for initdb, as I
do regard it as a rather big annoyance that a default installation on
probably almost any modern linux distribution results in a UTF-8 only
cluster, fixable only by dropping all databases, rerun initdb and the
reimporting them again.

Because the vast majority of users prefers UTF-8 encoded databases over C
locale databases.

Tim Tassonis

timtas@cubic.ch

about 18 years ago

In reply to: Peter Eisentraut (#3)

Re: initdb in 8.3

Peter Eisentraut wrote:

Am Mittwoch, 23. April 2008 schrieb Tim Tassonis:

My question is: Why then is --locale=C not the default for initdb, as I
do regard it as a rather big annoyance that a default installation on
probably almost any modern linux distribution results in a UTF-8 only
cluster, fixable only by dropping all databases, rerun initdb and the
reimporting them again.

Because the vast majority of users prefers UTF-8 encoded databases over C
locale databases.

Ok, let me put it in another way. If UTF-8 is chosen at initdb, only
UTF-8 databases can be created, if C is chosen, you can specify
different encodings (UTF-8, LATIN1 etc) for each database.

As I understood now, sorting will then still be in C style and not in
the locale specific way. Which leads me to the following questions:

If specifying a characterset different from the default locale for a
database is such a bad idea, why is it possible at all?

From how I understand you, if I wanted a postgres server machine
supporting databases with different charsets, I'm advised to initialise
one cluster per locale.

If specifying a characterset different from the default locale for a
database is not a bad idea, why does the default install forbid me to do
exactly this?

Regards
Tim

Martijn van Oosterhout

kleptog@svana.org

about 18 years ago

In reply to: Tim Tassonis (#4)

Re: initdb in 8.3

On Wed, Apr 23, 2008 at 04:35:04PM +0200, Tim Tassonis wrote:

Ok, let me put it in another way. If UTF-8 is chosen at initdb, only
UTF-8 databases can be created, if C is chosen, you can specify
different encodings (UTF-8, LATIN1 etc) for each database.

As I understood now, sorting will then still be in C style and not in
the locale specific way. Which leads me to the following questions:

If specifying a characterset different from the default locale for a
database is such a bad idea, why is it possible at all?

It isn't possible, that's the point. What is possible is that client
can use any encoding they like to talk to the server, but the server
will store and manage it all in one. What locale C means "I'm an
encoding wizard and will ensure all my programs can handle all the
encodings I want to use, because I understand the database will treat
everything I send as ASCII bytes no matter what encoding the clients
say it is".

From how I understand you, if I wanted a postgres server machine
supporting databases with different charsets, I'm advised to initialise
one cluster per locale.

If you want to control the *storage* charset, yes. If you just want
clients to think it's a LATIN9 DB, doing a:

ALTER DATABASE foo SET client_encoding=latin9;

If specifying a characterset different from the default locale for a
database is not a bad idea, why does the default install forbid me to do
exactly this?

It is a bad idea, because most normal the C library can only handle one
encoding at a time. Locale C is a backdoor because it has system
independant semantics and does not require libc. It's also not what
people usually want, and so not recommended.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.

Peter Eisentraut

peter_e@gmx.net

about 18 years ago

In reply to: Tim Tassonis (#4)

Re: initdb in 8.3

Am Mittwoch, 23. April 2008 schrieb Tim Tassonis:

If specifying a characterset different from the default locale for a
database is such a bad idea, why is it possible at all?

Because Japanese users need this functionality. Aside from spectacularly
bizarre niche applications, that is really the only reason. If you are
dealing with European-type languages, just forget about it and specify the
locale you want to initdb and let the encodings fall into place.

Tim Tassonis

timtas@cubic.ch

about 18 years ago

In reply to: Martijn van Oosterhout (#5)

Re: initdb in 8.3

Martijn van Oosterhout wrote:

On Wed, Apr 23, 2008 at 04:35:04PM +0200, Tim Tassonis wrote:

If specifying a characterset different from the default locale for a
database is such a bad idea, why is it possible at all?

It isn't possible, that's the point. What is possible is that client
can use any encoding they like to talk to the server, but the server
will store and manage it all in one. What locale C means "I'm an
encoding wizard and will ensure all my programs can handle all the
encodings I want to use, because I understand the database will treat
everything I send as ASCII bytes no matter what encoding the clients
say it is".

From how I understand you, if I wanted a postgres server machine
supporting databases with different charsets, I'm advised to initialise
one cluster per locale.

If you want to control the *storage* charset, yes. If you just want
clients to think it's a LATIN9 DB, doing a:

ALTER DATABASE foo SET client_encoding=latin9;

Ok, got it, it's really this setting that's interesting. If I have a
legacy application that defaults to latin1, I can leave the DB to UTF-8
,set the client_encoding to latin1 and then all my selects and inserts
can use latin1, but the data will be stored in utf-8.

Well, that's really all I need, sorry for the confusion.

Thanks a lot
Tim

Christopher Condit

condit@sdsc.edu

about 18 years ago

In reply to: Tim Tassonis (#7)

Re: initdb in 8.3

I have a question related to this issue:
Now that the locale has changed, it seems that the planner no longer
wants to use the indexes for running LIKE queries on varchar columns
unless I specify varchar_pattern_ops when creating the index. And if I
create the index with varchar_pattern_ops, then the planner won't use
it for = queries.

What's the correct solution to this problem (when using UTF-8 and
lc_collate and lc_ctype are both 1252)? Do I need to create two indexes?

Thanks,
-Chris

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Tim Tassonis
Sent: Wednesday, April 23, 2008 12:22
To: Martijn van Oosterhout
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] initdb in 8.3

Martijn van Oosterhout wrote:

On Wed, Apr 23, 2008 at 04:35:04PM +0200, Tim Tassonis wrote:

If specifying a characterset different from the default locale for a
database is such a bad idea, why is it possible at all?

It isn't possible, that's the point. What is possible is that client
can use any encoding they like to talk to the server, but the server
will store and manage it all in one. What locale C means "I'm an
encoding wizard and will ensure all my programs can handle all the
encodings I want to use, because I understand the database will treat
everything I send as ASCII bytes no matter what encoding the clients
say it is".

From how I understand you, if I wanted a postgres server machine
supporting databases with different charsets, I'm advised to

initialise

one cluster per locale.

If you want to control the *storage* charset, yes. If you just want
clients to think it's a LATIN9 DB, doing a:

ALTER DATABASE foo SET client_encoding=latin9;

Well, that's really all I need, sorry for the confusion.

Thanks a lot
Tim

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Karsten Hilbert

Karsten.Hilbert@gmx.net

about 18 years ago

In reply to: Tim Tassonis (#1)

Re: initdb in 8.3

On Wed, Apr 23, 2008 at 11:46:35AM +0200, Tim Tassonis wrote:

As you probably are all aware of, this results now in a cluster that
will only allow you to create UTF-8 databases. I have read some posts
regarding this topic where it is explained that allowing LATIN1 on a
cluster initialized with UTF-8 will give you problems

Ain't it the other way round ?

Creating UTF8 databases on LATIN1 clusters won't work ?

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

#10

Tim Tassonis

timtas@cubic.ch

about 18 years ago

In reply to: Karsten Hilbert (#9)

Re: initdb in 8.3

Karsten Hilbert wrote:

On Wed, Apr 23, 2008 at 11:46:35AM +0200, Tim Tassonis wrote:

As you probably are all aware of, this results now in a cluster that
will only allow you to create UTF-8 databases. I have read some posts
regarding this topic where it is explained that allowing LATIN1 on a
cluster initialized with UTF-8 will give you problems

Ain't it the other way round ?

Creating UTF8 databases on LATIN1 clusters won't work ?

No it's this way round, but maybe the other way round, too.
You can only create databases with arbitrary encodings on C locale
clusters from 8.3 on.

Regards
Tim

Show quoted text

Karsten

#11

Craig Ringer

craig@2ndquadrant.com

about 18 years ago

In reply to: Christopher Condit (#8)

Re: initdb in 8.3

Christopher Condit wrote:

I have a question related to this issue:
Now that the locale has changed, it seems that the planner no longer
wants to use the indexes for running LIKE queries on varchar columns
unless I specify varchar_pattern_ops when creating the index. And if I
create the index with varchar_pattern_ops, then the planner won't use
it for = queries.

What's the correct solution to this problem (when using UTF-8 and
lc_collate and lc_ctype are both 1252)? Do I need to create two indexes?

http://www.postgresql.org/docs/faqs.FAQ.html#item4.6
http://www.postgresql.org/docs/8.3/static/indexes-opclass.html

It should really also be mentioned in the section on the LIKE operator.

By the way, when I tried to leave a comment on the pattern matching
operator doc page I found that after logging in I was presented with a
blank comments form again. When I re-entered my comment and submitted I
got an error indicating that '' is not a valid NUMERIC. Going back to
the original comments form on the pattern matching operator doc page
(having logged in) and submitting there works fine.

--
Craig Ringer

#12

Christopher Condit

condit@sdsc.edu

about 18 years ago

In reply to: Craig Ringer (#11)

Re: initdb in 8.3

Ahhh - I See. Thanks, Craig.
Although, once you've built the index with varchar_pattern_ops index,
the following two (essentially equivalent) queries will run at vastly
different speeds:
select * from A where A.value like 'Nacho';
select * from A where A.value = 'Nacho';

Seems that the optimizer should catch this...
C

-----Original Message-----
From: Craig Ringer [mailto:craig@postnewspapers.com.au]
Sent: Thursday, April 24, 2008 01:20
To: Christopher Condit
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] initdb in 8.3

Christopher Condit wrote:

I have a question related to this issue:
Now that the locale has changed, it seems that the planner no longer
wants to use the indexes for running LIKE queries on varchar columns
unless I specify varchar_pattern_ops when creating the index. And if I
create the index with varchar_pattern_ops, then the planner won't use
it for = queries.

What's the correct solution to this problem (when using UTF-8 and
lc_collate and lc_ctype are both 1252)? Do I need to create two

indexes?

http://www.postgresql.org/docs/faqs.FAQ.html#item4.6
http://www.postgresql.org/docs/8.3/static/indexes-opclass.html

It should really also be mentioned in the section on the LIKE operator.

--
Craig Ringer

#13

Tom Lane

tgl@sss.pgh.pa.us

about 18 years ago

In reply to: Christopher Condit (#12)

Re: initdb in 8.3

"Christopher Condit" <condit@sdsc.edu> writes:

Although, once you've built the index with varchar_pattern_ops index,
the following two (essentially equivalent) queries will run at vastly
different speeds:
select * from A where A.value like 'Nacho';
select * from A where A.value = 'Nacho';

Seems that the optimizer should catch this...

Yeah, it's on the to-do list ...
http://archives.postgresql.org/pgsql-hackers/2008-02/msg01003.php

regards, tom lane

#14

Zdenek Kotala

Zdenek.Kotala@Sun.COM

about 18 years ago

In reply to: Richard Huxton (#2)

Re: initdb in 8.3

Richard Huxton napsal(a):

I think someone is looking at per-database locales for 8.4 - the issue
is more tricky than you might think because you need to worry about
system catalogue sort-order.

There is Google Soc project for implementing collation per database level. I
hope it will appear in 8.4 version.

Catalog sort-order is not affected by locale, because "name" data type uses
different operator then varchar.

Zdenek