Collatability of type "name"

Started by Tom Laneover 7 years ago5 messageshackers

tgl@sss.pgh.pa.us

over 7 years ago

I've been experimenting with the task proposed in [1]/messages/by-id/5978.1544030694@sss.pgh.pa.us of expanding
the text_ops operator family to include type "name" as well as
cross-type text vs. name operators. These operators would need to
offer collation-aware sorting, since that's exactly the difference
between text_ops and the non-collation-aware name_ops opfamily.
I ran into a nasty stumbling block almost immediately: the proposed
name vs. name comparison operators fail, because the parser sees
that both inputs are of noncollatable types so it doesn't assign
any collation to the operator node.

I experimented with leaving out the name vs. name operators and
just adding cross-type text vs. name and name vs. text operators.
That turns out not to work well at all. Aside from the fact that
opr_sanity whines about an incomplete operator family, I found
various situations where the planner fails, complaining about
things like "missing operator 1(19,19) in opfamily 1994". The
root of that mess seems to be that we've supposed that if an
equality operator is marked mergejoinable then it is mergejoinable
in every opfamily that it's a member of. But that isn't true in
an opfamily structure like this. For instance "text = name" should
be mergejoinable in the name_ops opclass, since we know how to sort
both text and name in non-collation-aware ways. But it's not
mergejoinable in the text_ops opclass if text_ops doesn't provide
collation-aware name vs. name operators to sort the name input with.

We could probably fix that, at the cost of about tripling the work
needed to detect whether an operator is really mergejoinable, but
I have little confidence that there aren't more problems lurking
behind it. There are a lot of aspects of EquivalenceClass processing
that look pretty questionable if we're trying to support operators
that act this way. For instance, if we derive "a = c" given "a = b"
and "b = c", the equality operator in "a = c" might be mergejoinable
in a different set of opclasses than the other two operators are,
making it debatable whether it can be thought to belong to the same
EquivalenceClass at all.

So the other approach I'm contemplating is to mark type name as
collatable (with "C" as its typcollation, probably). There are
two plausible sub-approaches:

1. The regular name comparison operators remain non-collation-aware.
This would be the least invasive way but it'd have the odd side-effect
that expressions like "namecoll1 < namecoll2 COLLATE something"
would be accepted but the collation would be ignored. Also, we'd
have to invent some new names for the collation-aware name-vs-name
operators, and I don't see any obvious candidate for that.

2. Upgrade the name comparison operators to be collation-aware,
with (probably) all the same optimizations for C collation as we
have for text. This'd be a cleaner end result but it seems like
there are a lot of potential side-effects, e.g. syscache lookups
would have to be prepared to pass the right collation argument
to name comparisons.

I feel like #2 is probably really the Right Thing, but it's also
sounding like significantly more work than I thought this was going
to involve. Not sure if it's worth the effort right now.

Also, I think that either solution would lead to some subtle changes
in semantics. For example, right now if you compare a name column
to a text value, you get a text (collation-aware) comparison using
the database's default collation. It looks like if name columns
are marked with attcollation = 'C', that would win and the comparison
would now have 'C' collation unless you explicitly override it with
a COLLATE clause. I'm not sure this is a bad thing --- it'd be more
likely to match the sort order of the index on the column --- but it
could surprise people.

Thoughts?

regards, tom lane

[1]: /messages/by-id/5978.1544030694@sss.pgh.pa.us

Pavel Stehule

pavel.stehule@gmail.com

over 7 years ago

In reply to: Tom Lane (#1)

Re: Collatability of type "name"

ne 9. 12. 2018 v 18:50 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:

I've been experimenting with the task proposed in [1] of expanding
the text_ops operator family to include type "name" as well as
cross-type text vs. name operators. These operators would need to
offer collation-aware sorting, since that's exactly the difference
between text_ops and the non-collation-aware name_ops opfamily.
I ran into a nasty stumbling block almost immediately: the proposed
name vs. name comparison operators fail, because the parser sees
that both inputs are of noncollatable types so it doesn't assign
any collation to the operator node.

I experimented with leaving out the name vs. name operators and
just adding cross-type text vs. name and name vs. text operators.
That turns out not to work well at all. Aside from the fact that
opr_sanity whines about an incomplete operator family, I found
various situations where the planner fails, complaining about
things like "missing operator 1(19,19) in opfamily 1994". The
root of that mess seems to be that we've supposed that if an
equality operator is marked mergejoinable then it is mergejoinable
in every opfamily that it's a member of. But that isn't true in
an opfamily structure like this. For instance "text = name" should
be mergejoinable in the name_ops opclass, since we know how to sort
both text and name in non-collation-aware ways. But it's not
mergejoinable in the text_ops opclass if text_ops doesn't provide
collation-aware name vs. name operators to sort the name input with.

We could probably fix that, at the cost of about tripling the work
needed to detect whether an operator is really mergejoinable, but
I have little confidence that there aren't more problems lurking
behind it. There are a lot of aspects of EquivalenceClass processing
that look pretty questionable if we're trying to support operators
that act this way. For instance, if we derive "a = c" given "a = b"
and "b = c", the equality operator in "a = c" might be mergejoinable
in a different set of opclasses than the other two operators are,
making it debatable whether it can be thought to belong to the same
EquivalenceClass at all.

So the other approach I'm contemplating is to mark type name as
collatable (with "C" as its typcollation, probably). There are
two plausible sub-approaches:

1. The regular name comparison operators remain non-collation-aware.
This would be the least invasive way but it'd have the odd side-effect
that expressions like "namecoll1 < namecoll2 COLLATE something"
would be accepted but the collation would be ignored. Also, we'd
have to invent some new names for the collation-aware name-vs-name
operators, and I don't see any obvious candidate for that.

2. Upgrade the name comparison operators to be collation-aware,
with (probably) all the same optimizations for C collation as we
have for text. This'd be a cleaner end result but it seems like
there are a lot of potential side-effects, e.g. syscache lookups
would have to be prepared to pass the right collation argument
to name comparisons.

I feel like #2 is probably really the Right Thing, but it's also
sounding like significantly more work than I thought this was going
to involve. Not sure if it's worth the effort right now.

Also, I think that either solution would lead to some subtle changes
in semantics. For example, right now if you compare a name column
to a text value, you get a text (collation-aware) comparison using
the database's default collation. It looks like if name columns
are marked with attcollation = 'C', that would win and the comparison
would now have 'C' collation unless you explicitly override it with
a COLLATE clause. I'm not sure this is a bad thing --- it'd be more
likely to match the sort order of the index on the column --- but it
could surprise people.

The sort of table's names is not too common operation. I don't see a C
collate for names as any risk.

Regards

Pavel

Show quoted text

Thoughts?

regards, tom lane

[1] /messages/by-id/5978.1544030694@sss.pgh.pa.us

Robert Haas

robertmhaas@gmail.com

over 7 years ago

In reply to: Tom Lane (#1)

Re: Collatability of type "name"

On Mon, Dec 10, 2018 at 2:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I feel like #2 is probably really the Right Thing,

I think so, too.

Also, I think that either solution would lead to some subtle changes
in semantics. For example, right now if you compare a name column
to a text value, you get a text (collation-aware) comparison using
the database's default collation. It looks like if name columns
are marked with attcollation = 'C', that would win and the comparison
would now have 'C' collation unless you explicitly override it with
a COLLATE clause. I'm not sure this is a bad thing --- it'd be more
likely to match the sort order of the index on the column --- but it
could surprise people.

It's not great to change the semantics of stuff like this, but it
doesn't sound all that bad.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Tom Lane

tgl@sss.pgh.pa.us

over 7 years ago

In reply to: Robert Haas (#3)

Re: Collatability of type "name"

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Dec 10, 2018 at 2:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Also, I think that either solution would lead to some subtle changes
in semantics. For example, right now if you compare a name column
to a text value, you get a text (collation-aware) comparison using
the database's default collation. It looks like if name columns
are marked with attcollation = 'C', that would win and the comparison
would now have 'C' collation unless you explicitly override it with
a COLLATE clause. I'm not sure this is a bad thing --- it'd be more
likely to match the sort order of the index on the column --- but it
could surprise people.

It's not great to change the semantics of stuff like this, but it
doesn't sound all that bad.

I had an epiphany after committing 6b0faf723: if we're forcing system
catalog columns to have "C" collation, there's no critical need for
type "name" to do that for itself. We could upgrade "name" to be
collatable with typcollation = DEFAULT_COLLATION_OID, and then its
comparison semantics would be *exactly the same as text*. Only the
physical representation is different.

This should mean that it's semantically trivial to unify the name_ops
opfamily with text_ops (not text_pattern_ops, as I'd previously supposed)
and add all the requisite cross-type operators. I haven't actually
done that yet, but I have made a patch to make "name" fully
collation-aware, as attached.

This approach does have some minuses, though:

* There are assorted user-defined "name" columns in the regression
tests, which may introduce locale dependencies that weren't there
before. I found a couple by running check-world under various locales,
and patched those in the attached, but it's definitely possible that
there are more issues in locales I didn't try.

* If any end users are using columns of type "name", they'd likewise
see behavioral changes, plus their indexes would be broken. We
discourage people from using that type, so I don't think this is a
deal-breaker, but we'd at least have to add intelligence to pg_upgrade
to make it notice user-defined indexes on name columns and arrange
to reindex them.

We could eliminate those two problems if we made "name" have
typcollation "C" rather than "default", so that its semantics
wouldn't change without explicit collation specs. This feels
like pretty much of a wart to me, but maybe it's worth doing
in the name of avoiding compatibility issues. We could still
unify name_ops with text_ops, but now "name" would act more like
a domain with an explicit collation spec.

Thoughts?

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

over 7 years ago

In reply to: Tom Lane (#4)

Re: Collatability of type "name"

I wrote:

We could eliminate those two problems if we made "name" have
typcollation "C" rather than "default", so that its semantics
wouldn't change without explicit collation specs. This feels
like pretty much of a wart to me, but maybe it's worth doing
in the name of avoiding compatibility issues. We could still
unify name_ops with text_ops, but now "name" would act more like
a domain with an explicit collation spec.

Here's a variant patch that does it like that. On reflection this seems
like a safer way to proceed. It feels like a wart because it violates
the system's original assumption that collatable base types all have
DEFAULT_COLLATION_OID, but as far as I can tell that doesn't have any
really severe consequences. The main ugliness is that CREATE TYPE can
only set typcollation to 0 or DEFAULT_COLLATION_OID for new base types,
meaning that it's impossible to duplicate the behavior of type "name"
in a user-defined type, which seems like an extensibility failure.
But it's not one that I'm sufficiently excited about to wish to fix.
Another point is that there are places in parse_collate.c that suppose
that "domain's typcollation is different from DEFAULT_COLLATION_OID"
is equivalent to "domain's collation was explicitly specified", which
would not be the case for domains over type name. But this seems to be
isomorphic to the situation where "name" is a domain with COLLATE "C"
over some anonymous base type, so I don't think that any fundamental
semantic breakage ensues.

Barring objections I'm going to push forward with committing this and
unifying name_ops with text_ops.

regards, tom lane

Collatability of type "name"

Attachments:

Attachments: