pg_conversion seems rather strangely defined

Started by Tom Laneover 10 years ago10 messageshackers

tgl@sss.pgh.pa.us

over 10 years ago

What is the point of pg_conversion.condefault (the flag that says whether
a conversion is "default")? AFAICS, there is absolutely no way to invoke
a conversion that is not default, which means we might as well eliminate
the concept.

I do not see a lot of point in the namespacing of encoding conversions
either. Does anyone really need or use search-path-dependent lookup of
conversions? (If they do, it's probably broken anyway, since for example
we do not trouble to re-identify the client encoding conversion functions
when search_path changes.)

While actually removing pg_conversion.connamespace might not be worth
the trouble, it's mighty tempting to have just a single unique index on
(conforencoding, contoencoding), thereby enforcing that There Can Be Only
One conversion between any given pair of encodings, and then we can just
use that index to find the right entry without any fooling with search
path.

But in any case we should get rid of the concept of defaultness, because
it's pointless; all entries should be equally "default".

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

over 10 years ago

In reply to: Tom Lane (#1)

Re: pg_conversion seems rather strangely defined

I wrote:

What is the point of pg_conversion.condefault (the flag that says whether
a conversion is "default")? AFAICS, there is absolutely no way to invoke
a conversion that is not default, which means we might as well eliminate
the concept.

I do not see a lot of point in the namespacing of encoding conversions
either. Does anyone really need or use search-path-dependent lookup of
conversions? (If they do, it's probably broken anyway, since for example
we do not trouble to re-identify the client encoding conversion functions
when search_path changes.)

While actually removing pg_conversion.connamespace might not be worth
the trouble, it's mighty tempting to have just a single unique index on
(conforencoding, contoencoding), thereby enforcing that There Can Be Only
One conversion between any given pair of encodings, and then we can just
use that index to find the right entry without any fooling with search
path.

But in any case we should get rid of the concept of defaultness, because
it's pointless; all entries should be equally "default".

After further poking around, I realized that there *used* to be a way to
get at non-default encoding conversions, but it was removed here:

commit 02138357ffc8c41a3d646035368712a5394f1f5f
Author: Andrew Dunstan <andrew@dunslane.net>
Date: Mon Sep 24 01:29:30 2007 +0000

Remove "convert 'blah' using conversion_name" facility, because if it
produces text it is an encoding hole and if not it's incompatible
with the spec, whatever the spec means (which we're not sure about anyway).

Now, the big problem there was that the function could produce values of
type "text" that violate the current database encoding. We could have
retained the facility by making the function take and return bytea, as
convert() does today, but apparently nobody spoke up for the need then
or since.

Still, somebody might want it someday, and it would be pretty messy
to remove the concept of a default encoding only to have to put it
back later.

I still think however that search-path-based lookup of default encoding
conversions is a Bad Idea, and that we only ought to allow one default
conversion to exist for any pair of encodings.

I realized that we could implement that without too much trouble by
redefining pg_conversion.condefault as being true for default conversions
and NULL (not false) for non-default ones. With this definition, a
unique index on pg_conversion (conforencoding, contoencoding, condefault)
would have the behavior we want --- sort of a poor man's partial unique
index, except that it would work correctly on a system catalog whereas
a true partial index wouldn't.

Thoughts?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

over 10 years ago

In reply to: Tom Lane (#2)

Re: pg_conversion seems rather strangely defined

I wrote:

I still think however that search-path-based lookup of default encoding
conversions is a Bad Idea, and that we only ought to allow one default
conversion to exist for any pair of encodings.

I realized that we could implement that without too much trouble by
redefining pg_conversion.condefault as being true for default conversions
and NULL (not false) for non-default ones. With this definition, a
unique index on pg_conversion (conforencoding, contoencoding, condefault)
would have the behavior we want --- sort of a poor man's partial unique
index, except that it would work correctly on a system catalog whereas
a true partial index wouldn't.

Turns out that indeed that works just fine. See attached draft patch.

regards, tom lane

Noah Misch

noah@leadboat.com

over 10 years ago

In reply to: Tom Lane (#1)

Re: pg_conversion seems rather strangely defined

On Tue, Jan 05, 2016 at 01:46:51PM -0500, Tom Lane wrote:

I do not see a lot of point in the namespacing of encoding conversions
either. Does anyone really need or use search-path-dependent lookup of
conversions?

I have not issued CREATE CONVERSION except to experiment, and I have never
worked in a database in which someone else had created one. Among PGXN
distributions, CREATE CONVERSION appears only in the pyrseas test suite. It
could be hard to track down testimony on real-world usage patterns, but I
envision two credible patterns. First, you could change the default search
path to "corrected_conversions, pg_catalog, $user, public" and inject fixed
versions of the system conversions. One could use that to backport commit
8d3e090. Second, you could add conversions we omit entirely, like UTF8 ->
MULE_INTERNAL. Dropping search-path-dependent lookup would remove the
supported way to fix system conversions.

(If they do, it's probably broken anyway, since for example
we do not trouble to re-identify the client encoding conversion functions
when search_path changes.)

That's bad in principle, but I'll guess it's tolerable in practice. Switching
among implementations of a particular conversion might happen with O(weeks) or
longer period, like updating your system's iconv() conversion tables. I can't
easily envision one application switching between implementations over the
course of a session. (An application doing that today probably works around
the problem, perhaps with extra "SET client_encoding" calls.)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

over 10 years ago

In reply to: Noah Misch (#4)

Re: pg_conversion seems rather strangely defined

Noah Misch <noah@leadboat.com> writes:

On Tue, Jan 05, 2016 at 01:46:51PM -0500, Tom Lane wrote:

I do not see a lot of point in the namespacing of encoding conversions
either. Does anyone really need or use search-path-dependent lookup of
conversions?

I have not issued CREATE CONVERSION except to experiment, and I have never
worked in a database in which someone else had created one. Among PGXN
distributions, CREATE CONVERSION appears only in the pyrseas test suite. It
could be hard to track down testimony on real-world usage patterns, but I
envision two credible patterns. First, you could change the default search
path to "corrected_conversions, pg_catalog, $user, public" and inject fixed
versions of the system conversions. One could use that to backport commit
8d3e090. Second, you could add conversions we omit entirely, like UTF8 ->
MULE_INTERNAL. Dropping search-path-dependent lookup would remove the
supported way to fix system conversions.

The built-in conversions are very intentionally not pinned. So to my
mind, the supported way to replace one is to drop it and create your own.
The way you describe works only if an appropriate search path is installed
at the time a new session activates its client encoding conversions. TBH,
I have no idea whether we apply (for example) "ALTER ROLE SET search_path"
before that happens; but even if we do, there's no real guarantee that it
wouldn't change. We publish no documentation about the order of startup
actions. Moreover past attempts at defining dependencies between GUC
settings have been spectacular failures, so I really don't want to go
there in this context.

It would only be important to be able to do it like that if different
users of the same database had conflicting ideas about what was the
correct conversion between client and database encodings. I submit
that that's somewhere around epsilon probability, considering we have
not even heard of anyone replacing the system conversions at all.

(Adding conversions we don't supply is, of course, orthogonal to this.)

Moreover, we have precedent both for this approach being a bad idea
and for us changing it without many complaints. We used to have
search-path-dependent lookup of default index operator classes.
We found out that that was a bad idea and got rid of it, cf commit
3ac1ac58c. This situation looks much the same to me.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Noah Misch

noah@leadboat.com

over 10 years ago

In reply to: Tom Lane (#5)

Re: pg_conversion seems rather strangely defined

On Wed, Jan 06, 2016 at 11:56:14PM -0500, Tom Lane wrote:

Noah Misch <noah@leadboat.com> writes:

On Tue, Jan 05, 2016 at 01:46:51PM -0500, Tom Lane wrote:

I do not see a lot of point in the namespacing of encoding conversions
either. Does anyone really need or use search-path-dependent lookup of
conversions?

I have not issued CREATE CONVERSION except to experiment, and I have never
worked in a database in which someone else had created one. Among PGXN
distributions, CREATE CONVERSION appears only in the pyrseas test suite. It
could be hard to track down testimony on real-world usage patterns, but I
envision two credible patterns. First, you could change the default search
path to "corrected_conversions, pg_catalog, $user, public" and inject fixed
versions of the system conversions. One could use that to backport commit
8d3e090. Second, you could add conversions we omit entirely, like UTF8 ->
MULE_INTERNAL. Dropping search-path-dependent lookup would remove the
supported way to fix system conversions.

The built-in conversions are very intentionally not pinned. So to my
mind, the supported way to replace one is to drop it and create your own.

I just learned something. Interesting.

The way you describe works only if an appropriate search path is installed
at the time a new session activates its client encoding conversions. TBH,
I have no idea whether we apply (for example) "ALTER ROLE SET search_path"
before that happens; but even if we do, there's no real guarantee that it
wouldn't change. We publish no documentation about the order of startup
actions. Moreover past attempts at defining dependencies between GUC
settings have been spectacular failures, so I really don't want to go
there in this context.

It would only be important to be able to do it like that if different
users of the same database had conflicting ideas about what was the
correct conversion between client and database encodings. I submit
that that's somewhere around epsilon probability, considering we have
not even heard of anyone replacing the system conversions at all.

(Adding conversions we don't supply is, of course, orthogonal to this.)

Agreed on all those points. Even so, I don't find that the need to set GUCs
in a particular order makes a good case for removing this ancient capability.
I _would_ send a new feature back for redesign on the strength of such a
defect, but that is different.

Independent from that dearth of positive cause to restrict this, users taking
the "DROP CONVERSION pg_catalog.foo" route get dump/restore problems. pg_dump
doesn't notice that one dropped a pg_catalog conversion; the user would
manually repeat each drop before each restore. That's especially awkward for
pg_upgrade. I guess the user could drop each conversion in the new cluster's
template0, run pg_upgrade, and then recreate conversions in databases that had
not overridden them in the original cluster. That's a mess.

Moreover, we have precedent both for this approach being a bad idea
and for us changing it without many complaints. We used to have
search-path-dependent lookup of default index operator classes.
We found out that that was a bad idea and got rid of it, cf commit
3ac1ac58c. This situation looks much the same to me.

Per the 3ac1ac58c log message, "CREATE OPERATOR CLASS only allows one default
opclass per datatype regardless of schemas." That had been true since day one
for CREATE OPERATOR CLASS. It doesn't hold for conversions today, and that's
a crucial difference between that commit and this proposal.

Thanks,
nm

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

over 10 years ago

In reply to: Noah Misch (#6)

Re: pg_conversion seems rather strangely defined

Noah Misch <noah@leadboat.com> writes:

On Wed, Jan 06, 2016 at 11:56:14PM -0500, Tom Lane wrote:

Moreover, we have precedent both for this approach being a bad idea
and for us changing it without many complaints. We used to have
search-path-dependent lookup of default index operator classes.
We found out that that was a bad idea and got rid of it, cf commit
3ac1ac58c. This situation looks much the same to me.

Per the 3ac1ac58c log message, "CREATE OPERATOR CLASS only allows one default
opclass per datatype regardless of schemas." That had been true since day one
for CREATE OPERATOR CLASS. It doesn't hold for conversions today, and that's
a crucial difference between that commit and this proposal.

Well, the state of affairs is slightly different, but that doesn't mean
it's not equally broken. What I took away from the default-opclass fiasco
is that search-path-based lookup is a good idea only when you are looking
up something *by name*, so that the user can resolve any ambiguity by
schema-qualifying that name. Searches that aren't based on a
user-specified name shouldn't depend on search_path. This is important
because there are multiple conflicting concerns when you choose a
search_path setting, and you may not want to allow any particular search
to force your hand on what you put where in your search path. If, for
example, you don't really want to put "public" on the front of your search
path, the current code gives you no way to select a conversion that's in
"public".

If we need to cater for alternative conversions, my preference would be a
GUC or some other kind of setting that allows selecting a client I/O
conversion by name, whether it is "default" or not. But given the lack
of apparent demand, I don't really want to design and implement such a
feature right now.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tatsuo Ishii

t-ishii@sra.co.jp

over 10 years ago

In reply to: Tom Lane (#5)

Re: pg_conversion seems rather strangely defined

It would only be important to be able to do it like that if different
users of the same database had conflicting ideas about what was the
correct conversion between client and database encodings. I submit
that that's somewhere around epsilon probability, considering we have
not even heard of anyone replacing the system conversions at all.

I used to had a customer who needs to have different client and
database encoding than the default. That is, a slightly different
mapping between Shift-JIS and other database encoding. Due to
unfortunate historical reasons, there are several Shift-JIS variants
(in addition to the standard defined by government, there are IBM, NEC
and Microsoft versions). This is the reason why I wanted to have the
functionality at that time. I'm not sure the customer still uses the
functionality, but it is possible that there are other users who have
similar use cases, since the Shift-JIS variants are still used.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

over 10 years ago

In reply to: Tatsuo Ishii (#8)

Re: pg_conversion seems rather strangely defined

Tatsuo Ishii <ishii@postgresql.org> writes:

It would only be important to be able to do it like that if different
users of the same database had conflicting ideas about what was the
correct conversion between client and database encodings. I submit
that that's somewhere around epsilon probability, considering we have
not even heard of anyone replacing the system conversions at all.

I used to had a customer who needs to have different client and
database encoding than the default. That is, a slightly different
mapping between Shift-JIS and other database encoding. Due to
unfortunate historical reasons, there are several Shift-JIS variants
(in addition to the standard defined by government, there are IBM, NEC
and Microsoft versions). This is the reason why I wanted to have the
functionality at that time. I'm not sure the customer still uses the
functionality, but it is possible that there are other users who have
similar use cases, since the Shift-JIS variants are still used.

Hm. Odd that we've not heard complaints about the removal of
CONVERT(... USING ...), then.

I think it would be a good idea at least to put back some equivalent
of CONVERT(... USING ...), if for no other reason than that it would
ease testing. As I understood it, the argument to remove it was not
that the functionality was bad, but that we were using a SQL-standard
syntax for what we concluded was not SQL-standard functionality.
I'd propose putting it back with a syntax of, say,

convert_with(input bytea, conversion_name text) returns bytea

As for the client encoding conversion case, I still say a
search-path-based lookup is a horrible idea, and furthermore there
seems no very good reason why it has to be restricted to default
conversions. Aside from other arguments, that tends to push people
to mark *every* conversion as default, which is outright silly if
you have several competing ones.

As a sketch of an alternative, consider inventing a GUC named
preferred_conversions or some such, which is a list of
possibly-schema-qualified conversion names. When establishing an
original or new value of client_encoding, we look through the list
for the first entry that exists and performs the desired encoding
conversion (whether or not it is default). If there is no match,
look up the (unique) default conversion for the encoding pair, and
use that. (Obviously this has to be done twice, once for each
direction, when setting up client encoding conversions.) We could
apply the same rules for identifying which specific conversion to use
in convert() and siblings.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Tatsuo Ishii

t-ishii@sra.co.jp

over 10 years ago

In reply to: Tom Lane (#9)

Re: pg_conversion seems rather strangely defined

I used to had a customer who needs to have different client and
database encoding than the default. That is, a slightly different
mapping between Shift-JIS and other database encoding. Due to
unfortunate historical reasons, there are several Shift-JIS variants
(in addition to the standard defined by government, there are IBM, NEC
and Microsoft versions). This is the reason why I wanted to have the
functionality at that time. I'm not sure the customer still uses the
functionality, but it is possible that there are other users who have
similar use cases, since the Shift-JIS variants are still used.

Hm. Odd that we've not heard complaints about the removal of
CONVERT(... USING ...), then.

Yeah, I'm not sure the customer updated to the newer version of
PostgreSQL.

I think it would be a good idea at least to put back some equivalent
of CONVERT(... USING ...), if for no other reason than that it would
ease testing. As I understood it, the argument to remove it was not
that the functionality was bad, but that we were using a SQL-standard
syntax for what we concluded was not SQL-standard functionality.
I'd propose putting it back with a syntax of, say,

convert_with(input bytea, conversion_name text) returns bytea

As for the client encoding conversion case, I still say a
search-path-based lookup is a horrible idea, and furthermore there
seems no very good reason why it has to be restricted to default
conversions. Aside from other arguments, that tends to push people
to mark *every* conversion as default, which is outright silly if
you have several competing ones.

As a sketch of an alternative, consider inventing a GUC named
preferred_conversions or some such, which is a list of
possibly-schema-qualified conversion names. When establishing an
original or new value of client_encoding, we look through the list
for the first entry that exists and performs the desired encoding
conversion (whether or not it is default). If there is no match,
look up the (unique) default conversion for the encoding pair, and
use that. (Obviously this has to be done twice, once for each
direction, when setting up client encoding conversions.) We could
apply the same rules for identifying which specific conversion to use
in convert() and siblings.

Sounds good to me.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pg_conversion seems rather strangely defined

Attachments: