Why do we let CREATE DATABASE reassign encoding?

Started by Tom Laneover 16 years ago16 messages

tgl@sss.pgh.pa.us

over 16 years ago

If I have locale set to C, I can do this:

regression=# create database u8 encoding 'utf8';
CREATE DATABASE
regression=# create database l1 encoding 'latin1' template u8;
CREATE DATABASE

Had I had any actual utf8 data in u8, l1 would now contain
encoding-corrupt information. Given that we've tried to
clamp down on encoding violations in recent releases, I wonder
why this case is still allowed.

(In non-C locales, this will typically fail because the two
different encodings can't both match the locale. But I don't
believe it's our policy to enforce encoding validity only for
non-C locales.)

We should presumably let the encoding be changed when cloning
from template0, and probably it's reasonable to trust the user
if either source or destination DB encoding is SQL_ASCII.
In other cases I'm thinking it should fail.

regards, tom lane

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 16 years ago

In reply to: Tom Lane (#1)

Re: Why do we let CREATE DATABASE reassign encoding?

Tom Lane wrote:

If I have locale set to C, I can do this:

regression=# create database u8 encoding 'utf8';
CREATE DATABASE
regression=# create database l1 encoding 'latin1' template u8;
CREATE DATABASE

Had I had any actual utf8 data in u8, l1 would now contain
encoding-corrupt information. Given that we've tried to
clamp down on encoding violations in recent releases, I wonder
why this case is still allowed.

Wow, I'm surprised we allow that. Never occurred to me to try.

We should presumably let the encoding be changed when cloning
from template0, and probably it's reasonable to trust the user
if either source or destination DB encoding is SQL_ASCII.
In other cases I'm thinking it should fail.

Agreed, that's exactly what we did with per-database collation.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

wmoran@potentialtech.com

over 16 years ago

In reply to: Tom Lane (#1)

Re: Why do we let CREATE DATABASE reassign encoding?

In response to Tom Lane <tgl@sss.pgh.pa.us>:

If I have locale set to C, I can do this:

regression=# create database u8 encoding 'utf8';
CREATE DATABASE
regression=# create database l1 encoding 'latin1' template u8;
CREATE DATABASE

Had I had any actual utf8 data in u8, l1 would now contain
encoding-corrupt information. Given that we've tried to
clamp down on encoding violations in recent releases, I wonder
why this case is still allowed.

(In non-C locales, this will typically fail because the two
different encodings can't both match the locale. But I don't
believe it's our policy to enforce encoding validity only for
non-C locales.)

We should presumably let the encoding be changed when cloning
from template0, and probably it's reasonable to trust the user
if either source or destination DB encoding is SQL_ASCII.
In other cases I'm thinking it should fail.

On a pedantic level, doesn't this remove the ability to have
databases on a single cluster that are different encodings? I mean,
if template1 is utf8, and I can't change that using CREATE
DATABASE, then I'm stuck with utf8 for all databases on that
cluster ... unless I'm missing something.

Granted, there's the potential for special cases with databases used
only for templates, but as I see it, this should be allowed, it should
just fail if any data in the template can't be converted to the
desired encoding. I mean, I can always alter template1 by inserting
non-utf8 data, and then try to use it to create a utf8 encoded
database ...

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

andrew@dunslane.net

over 16 years ago

In reply to: Tom Lane (#1)

Re: Why do we let CREATE DATABASE reassign encoding?

Tom Lane wrote:

If I have locale set to C, I can do this:

regression=# create database u8 encoding 'utf8';
CREATE DATABASE
regression=# create database l1 encoding 'latin1' template u8;
CREATE DATABASE

Had I had any actual utf8 data in u8, l1 would now contain
encoding-corrupt information. Given that we've tried to
clamp down on encoding violations in recent releases, I wonder
why this case is still allowed.

(In non-C locales, this will typically fail because the two
different encodings can't both match the locale. But I don't
believe it's our policy to enforce encoding validity only for
non-C locales.)

We should presumably let the encoding be changed when cloning
from template0, and probably it's reasonable to trust the user
if either source or destination DB encoding is SQL_ASCII.
In other cases I'm thinking it should fail.

Really? You want to forbid selecting an encoding when the source is
template1, which is the default, and template1 is not SQL_ASCII? So the
following sequence woiuld be illegal:

initdb -E latin1
createdb -E utf8

I think we have a bit more thinking to do on this - I don't have a
reasonable solution immediately in my head.

cheers

andrew

tgl@sss.pgh.pa.us

over 16 years ago

In reply to: Bill Moran (#3)

Re: Why do we let CREATE DATABASE reassign encoding?

Bill Moran <wmoran@potentialtech.com> writes:

In response to Tom Lane <tgl@sss.pgh.pa.us>:

We should presumably let the encoding be changed when cloning
from template0, and probably it's reasonable to trust the user
if either source or destination DB encoding is SQL_ASCII.
In other cases I'm thinking it should fail.

On a pedantic level, doesn't this remove the ability to have
databases on a single cluster that are different encodings? I mean,
if template1 is utf8, and I can't change that using CREATE
DATABASE, then I'm stuck with utf8 for all databases on that
cluster ... unless I'm missing something.

You're supposed to clone from template0, not template1, when creating
DBs that are different in either encoding or locale from the
installation default. We already enforce this except for having missed
the special case of C locale. (There might be some corner cases
involving UTF8 on Windows, too; not sure about that.) The reason is
that template0 is expected to contain only ASCII data, but template1
might not.

regards, tom lane

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 16 years ago

In reply to: Bill Moran (#3)

Re: Why do we let CREATE DATABASE reassign encoding?

Bill Moran wrote:

In response to Tom Lane <tgl@sss.pgh.pa.us>:

We should presumably let the encoding be changed when cloning
from template0, and probably it's reasonable to trust the user
if either source or destination DB encoding is SQL_ASCII.
In other cases I'm thinking it should fail.

On a pedantic level, doesn't this remove the ability to have
databases on a single cluster that are different encodings? I mean,
if template1 is utf8, and I can't change that using CREATE
DATABASE, then I'm stuck with utf8 for all databases on that
cluster ... unless I'm missing something.

You could still use template0 as template for a database with any
encoding, like:

CREATE DATABASE .. TEMPLATE = template0;

We can special case template0 because we know its contents are pure
7-bit ascii which is compatible with any server encoding.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

greg.stark@enterprisedb.com

over 16 years ago

In reply to: Bill Moran (#3)

Re: Why do we let CREATE DATABASE reassign encoding?

So it would still be possible to byass this check by cloning a
database into SQL_ASCII and then cloning it into the desired encoding?
Doesn't sound like it really accomplishes much.

I do seem to recall some discussion about this way back. I don't
recall the conclusion but I remember some talk about detecting an
empty template database and the ned to reindex.

--
Greg

On 23 Apr 2009, at 19:46, Bill Moran <wmoran@potentialtech.com> wrote:

Show quoted text

In response to Tom Lane <tgl@sss.pgh.pa.us>:

If I have locale set to C, I can do this:

regression=# create database u8 encoding 'utf8';
CREATE DATABASE
regression=# create database l1 encoding 'latin1' template u8;
CREATE DATABASE

Had I had any actual utf8 data in u8, l1 would now contain
encoding-corrupt information. Given that we've tried to
clamp down on encoding violations in recent releases, I wonder
why this case is still allowed.

(In non-C locales, this will typically fail because the two
different encodings can't both match the locale. But I don't
believe it's our policy to enforce encoding validity only for
non-C locales.)

We should presumably let the encoding be changed when cloning
from template0, and probably it's reasonable to trust the user
if either source or destination DB encoding is SQL_ASCII.
In other cases I'm thinking it should fail.

On a pedantic level, doesn't this remove the ability to have
databases on a single cluster that are different encodings? I mean,
if template1 is utf8, and I can't change that using CREATE
DATABASE, then I'm stuck with utf8 for all databases on that
cluster ... unless I'm missing something.

Granted, there's the potential for special cases with databases used
only for templates, but as I see it, this should be allowed, it should
just fail if any data in the template can't be converted to the
desired encoding. I mean, I can always alter template1 by inserting
non-utf8 data, and then try to use it to create a utf8 encoded
database ...

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

tgl@sss.pgh.pa.us

over 16 years ago

In reply to: Andrew Dunstan (#4)

Re: Why do we let CREATE DATABASE reassign encoding?

Andrew Dunstan <andrew@dunslane.net> writes:

So the following sequence woiuld be illegal:

initdb -E latin1
createdb -E utf8

Yes, that's rather the point. Note that it already *is* illegal
unless you happen to have selected C locale; AFAICS that is an
oversight and not intentional. For instance, going in the other
direction in en_US locale, I get

$ createdb -E latin1 l1
createdb: database creation failed: ERROR: encoding LATIN1 does not match locale en_US.utf8
DETAIL: The chosen LC_CTYPE setting requires encoding UTF8.

You can get around this by cloning template0 instead of template1
(we assume template0 contains nothing that's encoding-specific).
Possibly the docs will need to be improved to emphasize that.

regards, tom lane

andrew@dunslane.net

over 16 years ago

In reply to: Tom Lane (#8)

Re: Why do we let CREATE DATABASE reassign encoding?

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

So the following sequence woiuld be illegal:

initdb -E latin1
createdb -E utf8

Yes, that's rather the point. Note that it already *is* illegal
unless you happen to have selected C locale; AFAICS that is an
oversight and not intentional.

Ok, then I agree that we need a big warning on that in the docs.

cheers

andrew

wmoran@potentialtech.com

over 16 years ago

In reply to: Tom Lane (#5)

Re: Why do we let CREATE DATABASE reassign encoding?

In response to Tom Lane <tgl@sss.pgh.pa.us>:

Bill Moran <wmoran@potentialtech.com> writes:

In response to Tom Lane <tgl@sss.pgh.pa.us>:

We should presumably let the encoding be changed when cloning
from template0, and probably it's reasonable to trust the user
if either source or destination DB encoding is SQL_ASCII.
In other cases I'm thinking it should fail.

On a pedantic level, doesn't this remove the ability to have
databases on a single cluster that are different encodings? I mean,
if template1 is utf8, and I can't change that using CREATE
DATABASE, then I'm stuck with utf8 for all databases on that
cluster ... unless I'm missing something.

You're supposed to clone from template0, not template1, when creating
DBs that are different in either encoding or locale from the
installation default. We already enforce this except for having missed
the special case of C locale.

Ah ... was not aware of that. It hasn't come up in my usage.

(There might be some corner cases
involving UTF8 on Windows, too; not sure about that.) The reason is
that template0 is expected to contain only ASCII data, but template1
might not.

Makes sense, with that explanation.

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

tgl@sss.pgh.pa.us

over 16 years ago

In reply to: Greg Stark (#7)

Re: Why do we let CREATE DATABASE reassign encoding?

Greg Stark <greg.stark@enterprisedb.com> writes:

So it would still be possible to byass this check by cloning a
database into SQL_ASCII and then cloning it into the desired encoding?
Doesn't sound like it really accomplishes much.

Well, it accomplishes preventing stupid encoding violations. The point
came to mind when I saw a post a little bit ago on -general in which
the poster seemed to imagine that CREATE DATABASE would convert
encodings for him. Since that is not happening in the foreseeable
future, I think we need to prevent the system from acting like it
would work.

If we wanted to be entirely anal about this, we could allow SQL_ASCII
destination with a different source encoding, but not the reverse.
However, we currently consider that you're on your own to ensure sanity
when using SQL_ASCII as far as locale goes, so I'm not sure why the
policy would be different for encoding.

regards, tom lane

andrew@dunslane.net

over 16 years ago

In reply to: Tom Lane (#11)

Re: Why do we let CREATE DATABASE reassign encoding?

Tom Lane wrote:

If we wanted to be entirely anal about this, we could allow SQL_ASCII
destination with a different source encoding, but not the reverse.
However, we currently consider that you're on your own to ensure sanity
when using SQL_ASCII as far as locale goes, so I'm not sure why the
policy would be different for encoding.

The trouble is that people won't know the provenance of their database.
I think we should try to guarantee as far as possible that if a database
has encoding X then all the data in it is valid in that encoding.

cheers

andrew

David E. Wheeler

david@kineticode.com

over 16 years ago

In reply to: Tom Lane (#8)

Re: Why do we let CREATE DATABASE reassign encoding?

On Apr 23, 2009, at 12:00 PM, Tom Lane wrote:

You can get around this by cloning template0 instead of template1
(we assume template0 contains nothing that's encoding-specific).
Possibly the docs will need to be improved to emphasize that.

I was just about to suggest that. With this change, template0 is
suddenly going to be a lot more important for people to know about and
make use of.

Best,

David

Peter Eisentraut

peter_e@gmx.net

over 16 years ago

In reply to: Tom Lane (#8)

Re: Why do we let CREATE DATABASE reassign encoding?

On Thursday 23 April 2009 22:00:25 Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

So the following sequence woiuld be illegal:

initdb -E latin1
createdb -E utf8

Yes, that's rather the point. Note that it already is illegal
unless you happen to have selected C locale;

AFAIR, the only reason that we haven't disallowed this sort of stuff years and
years ago is that people use it; the Japanese in particular. I don't see what
is different now.

tgl@sss.pgh.pa.us

over 16 years ago

In reply to: Peter Eisentraut (#14)

Re: Why do we let CREATE DATABASE reassign encoding?

Peter Eisentraut <peter_e@gmx.net> writes:

AFAIR, the only reason that we haven't disallowed this sort of stuff
years and years ago is that people use it; the Japanese in particular.
I don't see what is different now.

What's different now is that 8.4 has already established the principle
that you have to clone template0 if you want to change the locale of a
database. I think this is a good time to establish the same principle
for encodings. (Or in other words, if we don't fix it now, when will
be a better time?)

regards, tom lane

tgl@sss.pgh.pa.us

over 16 years ago

In reply to: Tom Lane (#15)

Re: Why do we let CREATE DATABASE reassign encoding?

I wrote:

Peter Eisentraut <peter_e@gmx.net> writes:

AFAIR, the only reason that we haven't disallowed this sort of stuff
years and years ago is that people use it; the Japanese in particular.
I don't see what is different now.

What's different now is that 8.4 has already established the principle
that you have to clone template0 if you want to change the locale of a
database. I think this is a good time to establish the same principle
for encodings. (Or in other words, if we don't fix it now, when will
be a better time?)

Attached is a proposed patch (without documentation changes as yet)
for this. Since the code is already enforcing exact locale match when
cloning a non-template0 database, I just made it act the same for
encoding, without any strange exceptions for SQL_ASCII.

I found that mbregress.sh was already broken by the existing
restrictions, if you try to use it in a database whose default
locale isn't C. The patch adds switches to fix that.

The patch also incidentally fixes a few ereport's that were missing
errcode values.

Last chance for objections ...

regards, tom lane