Improving collation-dependent indexes in system catalogs

Started by Tom Laneover 7 years ago4 messageshackers

tgl@sss.pgh.pa.us

over 7 years ago

Awhile back we noticed that a couple of system catalogs had
acquired indexes on "text" columns, which were unsafe because
their sort order was collation-dependent, so that cloning
template0 with a different database collation could yield
broken indexes. We fixed this in commit 0b28ea79 with a
bit of a hack: we said it's okay to have such indexes as long
as they use the text_pattern_ops opclass, making them
collation insensitive.

While fooling with the idea of making type "name" collation
aware, it occurred to me that there's a better, more general
answer, which is to insist that collation-aware system catalog
columns must be marked with C collation. This rule would apply
without modification to both "text" and "name" columns. In the
wake of commit 5e0928005, it also means that pg_statistic data
for such a column would port safely across a database collation
change, which up to now it does not. And I think we could have
the bootstrap code apply the rule automatically, making for one
less way to screw up when changing catalog definitions.

Thoughts, objections?

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

over 7 years ago

In reply to: Tom Lane (#1)

Re: Improving collation-dependent indexes in system catalogs

I wrote:

While fooling with the idea of making type "name" collation
aware, it occurred to me that there's a better, more general
answer, which is to insist that collation-aware system catalog
columns must be marked with C collation. This rule would apply
without modification to both "text" and "name" columns. In the
wake of commit 5e0928005, it also means that pg_statistic data
for such a column would port safely across a database collation
change, which up to now it does not. And I think we could have
the bootstrap code apply the rule automatically, making for one
less way to screw up when changing catalog definitions.

Concretely, this ...

regards, tom lane

Alvaro Herrera

alvherre@2ndquadrant.com

over 7 years ago

In reply to: Tom Lane (#2)

Re: Improving collation-dependent indexes in system catalogs

On 2018-Dec-15, Tom Lane wrote:

I wrote:

While fooling with the idea of making type "name" collation
aware, it occurred to me that there's a better, more general
answer, which is to insist that collation-aware system catalog
columns must be marked with C collation.

Concretely, this ...

Looks sane in a quick once-over.

I notice that some information_schema view columns end up with C
collation after this patch, and others remain with default collation.
Is that sensible? (I think the only two cases where this might matter
at all are information_schema.parameters.parameter_name,
information_schema.routines.external_name and
information_schema.foreign_servers.foreign_server_type.)

--
ï¿½lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Tom Lane

tgl@sss.pgh.pa.us

over 7 years ago

In reply to: Alvaro Herrera (#3)

Re: Improving collation-dependent indexes in system catalogs

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I notice that some information_schema view columns end up with C
collation after this patch, and others remain with default collation.
Is that sensible? (I think the only two cases where this might matter
at all are information_schema.parameters.parameter_name,
information_schema.routines.external_name and
information_schema.foreign_servers.foreign_server_type.)

Yeah. Looking closer at that, there are no collation-sensitive indexes
in information_schema (if there were, the existing opr_sanity test would
have caught 'em). But there are collation-sensitive table columns, which
do have pg_statistic entries, and those entries are at least nominally
broken by copying them into a database with a different default collation.

We could think of two ways to deal with that. One is to plaster
COLLATE "C" on each textual table column in the information_schema.
A more aggressive approach is to attach COLLATE "C" to each of the
domain types that information_schema defines, which fixes the table
columns a fortiori, and also causes all of the exposed information_schema
view columns to acquire database-independent collations.

I tried both ways, as in the attached patches below (each meant to be
applied on top of my patch upthread), and they both pass check-world.

A possible advantage of the second approach is that it could end up
allowing comparisons on information_schema view columns to be translated
to indexable comparisons on the underlying "name" columns, which would
be a pleasant outcome. On the other hand, people might be annoyed by
the semantics change, if they'd previously been doing that with the
expectation of getting database-collation-based comparisons.

I'm not sure whether the SQL standard says anything that either patch
would be violating. I see that it does say that these domains have
CHARACTER SET SQL_TEXT, and that the collation of that character set
is implementation-defined, so I think we could get away with changing
so far as spec compliance is concerned.

regards, tom lane

Improving collation-dependent indexes in system catalogs

Attachments:

Attachments: