Wrong results from inner-unique joins caused by collation mismatch

Started by Richard Guo8 days ago4 messageshackers
Jump to latest
#1Richard Guo
guofenglinux@gmail.com

While working on the wrong results issue caused by collation mismatch
in GROUP BY and havingQual [1]/messages/by-id/CAMbWs48Dn2wW6XM94GZsoyMiH42=KgMo+WcobPKuWvGYnWaPOQ@mail.gmail.com, I noticed $subject:

create collation ci (provider = icu, locale = 'und-u-ks-level2',
deterministic = false);

create table t (a text);
insert into t values ('A'), ('a');
create unique index on t (a);

-- wrong results: should be 4 rows
select * from t t1 join t t2 on t1.a = t2.a collate ci;
a | a
---+---
A | A
a | a
(2 rows)

The root cause is explained by the XXX comment in
relation_has_unique_index_for():

/*
* XXX at some point we may need to check collations here too.
* For the moment we assume all collations reduce to the same
* notion of equality.
*/

That assumption stopped being safe when nondeterministic collations
were introduced in PG 12. A unique index enforces uniqueness under
its own collation; if a query's equality clause uses a different
collation, and either side is nondeterministic, the index's uniqueness
does not imply uniqueness under the clause.

Several planner optimizations use this uniqueness proof, and all of
them can yield wrong results in this scenario. These include
inner-unique join execution, left-join removal, semijoin-to-innerjoin
reduction, and self-join elimination.

My first thought was to fix this by:

+  if (!IndexCollMatchesExprColl(ind->indexcollations[c],
+                                exprInputCollation((Node *) rinfo->clause)))
+      continue;

However, this caused an unexpected plan diff in join.out where a
left-join removal over (name, text) stopped working, because name and
text use different collations. So this check is too strict: a
mismatch between two deterministic collations should be OK for
uniqueness proof, as a deterministic collation treats two strings as
equal iff they are byte-wise equal (see CREATE COLLATION).

Hence, I got attached patch. Thoughts?

[1]: /messages/by-id/CAMbWs48Dn2wW6XM94GZsoyMiH42=KgMo+WcobPKuWvGYnWaPOQ@mail.gmail.com

- Richard

Attachments:

v1-0001-Consider-collation-when-proving-uniqueness-from-u.patchapplication/octet-stream; name=v1-0001-Consider-collation-when-proving-uniqueness-from-u.patchDownload+192-4
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Guo (#1)
Re: Wrong results from inner-unique joins caused by collation mismatch

Richard Guo <guofenglinux@gmail.com> writes:

My first thought was to fix this by:

+  if (!IndexCollMatchesExprColl(ind->indexcollations[c],
+                                exprInputCollation((Node *) rinfo->clause)))
+      continue;

However, this caused an unexpected plan diff in join.out where a
left-join removal over (name, text) stopped working, because name and
text use different collations. So this check is too strict: a
mismatch between two deterministic collations should be OK for
uniqueness proof, as a deterministic collation treats two strings as
equal iff they are byte-wise equal (see CREATE COLLATION).

Yes, we'd be taking a serious performance hit if we insisted on
exact collation matches for this purpose. I agree that disallowing
non-matching non-deterministic collations is the right fix.

Hence, I got attached patch. Thoughts?

I don't love doing it like this, for two reasons:

1. I think there are other places in the planner that will need
substantially this same logic. I recommend breaking out a
subroutine defined more or less as "do these collations have
equivalent notions of equality".

2. I find the test next to unreadable as written --- for example,
it's more difficult than it should be to figure out what happens
if one collation is deterministic and the other not. Using a
subroutine would help here by letting you break down the test
into multiple steps.

regards, tom lane

#3Richard Guo
guofenglinux@gmail.com
In reply to: Tom Lane (#2)
Re: Wrong results from inner-unique joins caused by collation mismatch

On Fri, Apr 24, 2026 at 11:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Richard Guo <guofenglinux@gmail.com> writes:

My first thought was to fix this by:

+  if (!IndexCollMatchesExprColl(ind->indexcollations[c],
+                                exprInputCollation((Node *) rinfo->clause)))
+      continue;

However, this caused an unexpected plan diff in join.out where a
left-join removal over (name, text) stopped working, because name and
text use different collations. So this check is too strict: a
mismatch between two deterministic collations should be OK for
uniqueness proof, as a deterministic collation treats two strings as
equal iff they are byte-wise equal (see CREATE COLLATION).

Yes, we'd be taking a serious performance hit if we insisted on
exact collation matches for this purpose. I agree that disallowing
non-matching non-deterministic collations is the right fix.

Thanks for taking a look!

Hence, I got attached patch. Thoughts?

I don't love doing it like this, for two reasons:

1. I think there are other places in the planner that will need
substantially this same logic. I recommend breaking out a
subroutine defined more or less as "do these collations have
equivalent notions of equality".

Right. I just found several other places that need this same logic,
which are in query_is_distinct_for(). Without it, we produce wrong
results:

select * from t t1 join
(select distinct a from t) t2 on t1.a = t2.a COLLATE "ci";
a | a
---+---
A | a
a | a
(2 rows)

select * from t t1 join
(select a from t group by a) t2 on t1.a = t2.a COLLATE "ci";
a | a
---+---
A | a
a | a
(2 rows)

2. I find the test next to unreadable as written --- for example,
it's more difficult than it should be to figure out what happens
if one collation is deterministic and the other not. Using a
subroutine would help here by letting you break down the test
into multiple steps.

Agreed. Will wrap the logic in a subroutine.

- Richard

#4Richard Guo
guofenglinux@gmail.com
In reply to: Richard Guo (#3)
Re: Wrong results from inner-unique joins caused by collation mismatch

On Sat, Apr 25, 2026 at 12:44 AM Richard Guo <guofenglinux@gmail.com> wrote:

On Fri, Apr 24, 2026 at 11:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

1. I think there are other places in the planner that will need
substantially this same logic. I recommend breaking out a
subroutine defined more or less as "do these collations have
equivalent notions of equality".

Right. I just found several other places that need this same logic,
which are in query_is_distinct_for(). Without it, we produce wrong
results:

0001 wrapped the logic in subroutine collations_are_compatible().

(I'm a little unsure about the InvalidOid cases. The current
implementation treats InvalidOid on either side as compatible, as
absence of a collation can't conflict with the other side. This
generalizes the asymmetric treatment in IndexCollMatchesExprColl().)

0002 fixed query_is_distinct_for(), using that subroutine.

- Richard

Attachments:

v2-0001-Consider-collation-when-proving-uniqueness-from-u.patchapplication/octet-stream; name=v2-0001-Consider-collation-when-proving-uniqueness-from-u.patchDownload+211-5
v2-0002-Consider-collation-when-proving-subquery-uniquene.patchapplication/octet-stream; name=v2-0002-Consider-collation-when-proving-subquery-uniquene.patchDownload+326-56