Wrong results from inner-unique joins caused by collation mismatch

Started by Richard Guo2 months ago5 messageshackers

guofenglinux@gmail.com

2 months ago

While working on the wrong results issue caused by collation mismatch
in GROUP BY and havingQual [1]/messages/by-id/CAMbWs48Dn2wW6XM94GZsoyMiH42=KgMo+WcobPKuWvGYnWaPOQ@mail.gmail.com, I noticed $subject:

create collation ci (provider = icu, locale = 'und-u-ks-level2',
deterministic = false);

create table t (a text);
insert into t values ('A'), ('a');
create unique index on t (a);

-- wrong results: should be 4 rows
select * from t t1 join t t2 on t1.a = t2.a collate ci;
a | a
---+---
A | A
a | a
(2 rows)

The root cause is explained by the XXX comment in
relation_has_unique_index_for():

/*
* XXX at some point we may need to check collations here too.
* For the moment we assume all collations reduce to the same
* notion of equality.
*/

That assumption stopped being safe when nondeterministic collations
were introduced in PG 12. A unique index enforces uniqueness under
its own collation; if a query's equality clause uses a different
collation, and either side is nondeterministic, the index's uniqueness
does not imply uniqueness under the clause.

Several planner optimizations use this uniqueness proof, and all of
them can yield wrong results in this scenario. These include
inner-unique join execution, left-join removal, semijoin-to-innerjoin
reduction, and self-join elimination.

My first thought was to fix this by:

+  if (!IndexCollMatchesExprColl(ind->indexcollations[c],
+                                exprInputCollation((Node *) rinfo->clause)))
+      continue;

However, this caused an unexpected plan diff in join.out where a
left-join removal over (name, text) stopped working, because name and
text use different collations. So this check is too strict: a
mismatch between two deterministic collations should be OK for
uniqueness proof, as a deterministic collation treats two strings as
equal iff they are byte-wise equal (see CREATE COLLATION).

Hence, I got attached patch. Thoughts?

[1]: /messages/by-id/CAMbWs48Dn2wW6XM94GZsoyMiH42=KgMo+WcobPKuWvGYnWaPOQ@mail.gmail.com

- Richard

Tom Lane

tgl@sss.pgh.pa.us

2 months ago

In reply to: Richard Guo (#1)

Re: Wrong results from inner-unique joins caused by collation mismatch

Richard Guo <guofenglinux@gmail.com> writes:

My first thought was to fix this by:

+  if (!IndexCollMatchesExprColl(ind->indexcollations[c],
+                                exprInputCollation((Node *) rinfo->clause)))
+      continue;

However, this caused an unexpected plan diff in join.out where a
left-join removal over (name, text) stopped working, because name and
text use different collations. So this check is too strict: a
mismatch between two deterministic collations should be OK for
uniqueness proof, as a deterministic collation treats two strings as
equal iff they are byte-wise equal (see CREATE COLLATION).

Yes, we'd be taking a serious performance hit if we insisted on
exact collation matches for this purpose. I agree that disallowing
non-matching non-deterministic collations is the right fix.

Hence, I got attached patch. Thoughts?

I don't love doing it like this, for two reasons:

1. I think there are other places in the planner that will need
substantially this same logic. I recommend breaking out a
subroutine defined more or less as "do these collations have
equivalent notions of equality".

2. I find the test next to unreadable as written --- for example,
it's more difficult than it should be to figure out what happens
if one collation is deterministic and the other not. Using a
subroutine would help here by letting you break down the test
into multiple steps.

regards, tom lane

Richard Guo

guofenglinux@gmail.com

2 months ago

In reply to: Tom Lane (#2)

Re: Wrong results from inner-unique joins caused by collation mismatch

On Fri, Apr 24, 2026 at 11:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Richard Guo <guofenglinux@gmail.com> writes:

My first thought was to fix this by:
+  if (!IndexCollMatchesExprColl(ind->indexcollations[c],
+                                exprInputCollation((Node *) rinfo->clause)))
+      continue;
However, this caused an unexpected plan diff in join.out where a
left-join removal over (name, text) stopped working, because name and
text use different collations. So this check is too strict: a
mismatch between two deterministic collations should be OK for
uniqueness proof, as a deterministic collation treats two strings as
equal iff they are byte-wise equal (see CREATE COLLATION).

Yes, we'd be taking a serious performance hit if we insisted on
exact collation matches for this purpose. I agree that disallowing
non-matching non-deterministic collations is the right fix.

Thanks for taking a look!

Hence, I got attached patch. Thoughts?

I don't love doing it like this, for two reasons:

1. I think there are other places in the planner that will need
substantially this same logic. I recommend breaking out a
subroutine defined more or less as "do these collations have
equivalent notions of equality".

Right. I just found several other places that need this same logic,
which are in query_is_distinct_for(). Without it, we produce wrong
results:

select * from t t1 join
(select distinct a from t) t2 on t1.a = t2.a COLLATE "ci";
a | a
---+---
A | a
a | a
(2 rows)

select * from t t1 join
(select a from t group by a) t2 on t1.a = t2.a COLLATE "ci";
a | a
---+---
A | a
a | a
(2 rows)

2. I find the test next to unreadable as written --- for example,
it's more difficult than it should be to figure out what happens
if one collation is deterministic and the other not. Using a
subroutine would help here by letting you break down the test
into multiple steps.

Agreed. Will wrap the logic in a subroutine.

- Richard

Richard Guo

guofenglinux@gmail.com

about 2 months ago

In reply to: Richard Guo (#3)

Re: Wrong results from inner-unique joins caused by collation mismatch

On Sat, Apr 25, 2026 at 12:44 AM Richard Guo <guofenglinux@gmail.com> wrote:

On Fri, Apr 24, 2026 at 11:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

1. I think there are other places in the planner that will need
substantially this same logic. I recommend breaking out a
subroutine defined more or less as "do these collations have
equivalent notions of equality".

Right. I just found several other places that need this same logic,
which are in query_is_distinct_for(). Without it, we produce wrong
results:

0001 wrapped the logic in subroutine collations_are_compatible().

(I'm a little unsure about the InvalidOid cases. The current
implementation treats InvalidOid on either side as compatible, as
absence of a collation can't conflict with the other side. This
generalizes the asymmetric treatment in IndexCollMatchesExprColl().)

0002 fixed query_is_distinct_for(), using that subroutine.

- Richard

Richard Guo

guofenglinux@gmail.com

about 2 months ago

In reply to: Richard Guo (#4)

Re: Wrong results from inner-unique joins caused by collation mismatch

On Sat, Apr 25, 2026 at 6:24 PM Richard Guo <guofenglinux@gmail.com> wrote:

0001 wrapped the logic in subroutine collations_are_compatible().

I don't think that name is good. It sounds like a general claim about
the two collations, but what the subroutine actually checks is much
narrower: whether the two collations agree on what counts as equal.
It has nothing to say about ordering, and two deterministic collations
agree on = but can disagree on <.

I renamed it to collations_agree_on_equality(), which seems a better
name to me. And then I committed this patch and back-patched it to
all supported branches.

0002 fixed query_is_distinct_for(), using that subroutine.

This patch changes the signature of query_is_distinct_for, which would
be an ABI break on stable branches. So in back-patches I added a
local function query_is_distinct_for_with_collations, which is a
collation-aware verson of query_is_distinct_for, and retained
query_is_distinct_for as a thin wrapper that calls that new local
function.

I also committed and back-patched this patch.

- Richard

Wrong results from inner-unique joins caused by collation mismatch

Attachments:

Attachments: