Improving collation-dependent indexes in system catalogs
Awhile back we noticed that a couple of system catalogs had
acquired indexes on "text" columns, which were unsafe because
their sort order was collation-dependent, so that cloning
template0 with a different database collation could yield
broken indexes. We fixed this in commit 0b28ea79 with a
bit of a hack: we said it's okay to have such indexes as long
as they use the text_pattern_ops opclass, making them
collation insensitive.
While fooling with the idea of making type "name" collation
aware, it occurred to me that there's a better, more general
answer, which is to insist that collation-aware system catalog
columns must be marked with C collation. This rule would apply
without modification to both "text" and "name" columns. In the
wake of commit 5e0928005, it also means that pg_statistic data
for such a column would port safely across a database collation
change, which up to now it does not. And I think we could have
the bootstrap code apply the rule automatically, making for one
less way to screw up when changing catalog definitions.
Thoughts, objections?
regards, tom lane
I wrote:
While fooling with the idea of making type "name" collation
aware, it occurred to me that there's a better, more general
answer, which is to insist that collation-aware system catalog
columns must be marked with C collation. This rule would apply
without modification to both "text" and "name" columns. In the
wake of commit 5e0928005, it also means that pg_statistic data
for such a column would port safely across a database collation
change, which up to now it does not. And I think we could have
the bootstrap code apply the rule automatically, making for one
less way to screw up when changing catalog definitions.
Concretely, this ...
regards, tom lane
Attachments:
use-c-collation-in-catalogs-1.patchtext/x-diff; charset=us-ascii; name=use-c-collation-in-catalogs-1.patchDownload+56-43
On 2018-Dec-15, Tom Lane wrote:
I wrote:
While fooling with the idea of making type "name" collation
aware, it occurred to me that there's a better, more general
answer, which is to insist that collation-aware system catalog
columns must be marked with C collation.
Concretely, this ...
Looks sane in a quick once-over.
I notice that some information_schema view columns end up with C
collation after this patch, and others remain with default collation.
Is that sensible? (I think the only two cases where this might matter
at all are information_schema.parameters.parameter_name,
information_schema.routines.external_name and
information_schema.foreign_servers.foreign_server_type.)
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
I notice that some information_schema view columns end up with C
collation after this patch, and others remain with default collation.
Is that sensible? (I think the only two cases where this might matter
at all are information_schema.parameters.parameter_name,
information_schema.routines.external_name and
information_schema.foreign_servers.foreign_server_type.)
Yeah. Looking closer at that, there are no collation-sensitive indexes
in information_schema (if there were, the existing opr_sanity test would
have caught 'em). But there are collation-sensitive table columns, which
do have pg_statistic entries, and those entries are at least nominally
broken by copying them into a database with a different default collation.
We could think of two ways to deal with that. One is to plaster
COLLATE "C" on each textual table column in the information_schema.
A more aggressive approach is to attach COLLATE "C" to each of the
domain types that information_schema defines, which fixes the table
columns a fortiori, and also causes all of the exposed information_schema
view columns to acquire database-independent collations.
I tried both ways, as in the attached patches below (each meant to be
applied on top of my patch upthread), and they both pass check-world.
A possible advantage of the second approach is that it could end up
allowing comparisons on information_schema view columns to be translated
to indexable comparisons on the underlying "name" columns, which would
be a pleasant outcome. On the other hand, people might be annoyed by
the semantics change, if they'd previously been doing that with the
expectation of getting database-collation-based comparisons.
I'm not sure whether the SQL standard says anything that either patch
would be violating. I see that it does say that these domains have
CHARACTER SET SQL_TEXT, and that the collation of that character set
is implementation-defined, so I think we could get away with changing
so far as spec compliance is concerned.
regards, tom lane