Improving collation-dependent indexes in system catalogs

Started by Tom Laneover 7 years ago4 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

Awhile back we noticed that a couple of system catalogs had
acquired indexes on "text" columns, which were unsafe because
their sort order was collation-dependent, so that cloning
template0 with a different database collation could yield
broken indexes. We fixed this in commit 0b28ea79 with a
bit of a hack: we said it's okay to have such indexes as long
as they use the text_pattern_ops opclass, making them
collation insensitive.

While fooling with the idea of making type "name" collation
aware, it occurred to me that there's a better, more general
answer, which is to insist that collation-aware system catalog
columns must be marked with C collation. This rule would apply
without modification to both "text" and "name" columns. In the
wake of commit 5e0928005, it also means that pg_statistic data
for such a column would port safely across a database collation
change, which up to now it does not. And I think we could have
the bootstrap code apply the rule automatically, making for one
less way to screw up when changing catalog definitions.

Thoughts, objections?

regards, tom lane

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#1)
Re: Improving collation-dependent indexes in system catalogs

I wrote:

While fooling with the idea of making type "name" collation
aware, it occurred to me that there's a better, more general
answer, which is to insist that collation-aware system catalog
columns must be marked with C collation. This rule would apply
without modification to both "text" and "name" columns. In the
wake of commit 5e0928005, it also means that pg_statistic data
for such a column would port safely across a database collation
change, which up to now it does not. And I think we could have
the bootstrap code apply the rule automatically, making for one
less way to screw up when changing catalog definitions.

Concretely, this ...

regards, tom lane

Attachments:

use-c-collation-in-catalogs-1.patchtext/x-diff; charset=us-ascii; name=use-c-collation-in-catalogs-1.patchDownload+56-43
#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: Improving collation-dependent indexes in system catalogs

On 2018-Dec-15, Tom Lane wrote:

I wrote:

While fooling with the idea of making type "name" collation
aware, it occurred to me that there's a better, more general
answer, which is to insist that collation-aware system catalog
columns must be marked with C collation.

Concretely, this ...

Looks sane in a quick once-over.

I notice that some information_schema view columns end up with C
collation after this patch, and others remain with default collation.
Is that sensible? (I think the only two cases where this might matter
at all are information_schema.parameters.parameter_name,
information_schema.routines.external_name and
information_schema.foreign_servers.foreign_server_type.)

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#3)
Re: Improving collation-dependent indexes in system catalogs

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I notice that some information_schema view columns end up with C
collation after this patch, and others remain with default collation.
Is that sensible? (I think the only two cases where this might matter
at all are information_schema.parameters.parameter_name,
information_schema.routines.external_name and
information_schema.foreign_servers.foreign_server_type.)

Yeah. Looking closer at that, there are no collation-sensitive indexes
in information_schema (if there were, the existing opr_sanity test would
have caught 'em). But there are collation-sensitive table columns, which
do have pg_statistic entries, and those entries are at least nominally
broken by copying them into a database with a different default collation.

We could think of two ways to deal with that. One is to plaster
COLLATE "C" on each textual table column in the information_schema.
A more aggressive approach is to attach COLLATE "C" to each of the
domain types that information_schema defines, which fixes the table
columns a fortiori, and also causes all of the exposed information_schema
view columns to acquire database-independent collations.

I tried both ways, as in the attached patches below (each meant to be
applied on top of my patch upthread), and they both pass check-world.

A possible advantage of the second approach is that it could end up
allowing comparisons on information_schema view columns to be translated
to indexable comparisons on the underlying "name" columns, which would
be a pleasant outcome. On the other hand, people might be annoyed by
the semantics change, if they'd previously been doing that with the
expectation of getting database-collation-based comparisons.

I'm not sure whether the SQL standard says anything that either patch
would be violating. I see that it does say that these domains have
CHARACTER SET SQL_TEXT, and that the collation of that character set
is implementation-defined, so I think we could get away with changing
so far as spec compliance is concerned.

regards, tom lane

Attachments:

info-schema-collations-solution-1.patchtext/x-diff; charset=us-ascii; name=info-schema-collations-solution-1.patchDownload+55-33
info-schema-collations-solution-2.patchtext/x-diff; charset=us-ascii; name=info-schema-collations-solution-2.patchDownload+25-3