BUG #19460: FULL JOIN rewriting issue on empty queries
The following bug has been logged on the website:
Bug reference: 19460
Logged by: François Jehl
Email address: francois.jehl@pigment.com
PostgreSQL version: 17.9
Operating system: Linux
Description:
Good evening,
After migrating from version 15 to 17.9, the following query fails with:
ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join
conditions
This is weird because the FULL JOIN is a basic equi-join that succeeds on
PG15 (we tested it; it also fails on 16).
It requires a table t(id UUID PRIMARY KEY) but here is a repro on DbFiddle
https://www.db-fiddle.com/f/hCq5S13Zs3EV8f86Mxxh3B/3.
SELECT COALESCE(lhs.id, rhs.id) AS id
FROM (SELECT gen_random_uuid() AS id) AS lhs
FULL OUTER JOIN (
SELECT sub.id
FROM (
SELECT empty_source.id
FROM (SELECT NULL::UUID AS id WHERE FALSE) AS empty_source
LEFT OUTER JOIN (
SELECT t.id FROM t WHERE t.id =
'26c5112c-0a8f-4315-9ff5-7dcb59b8359e'::UUID
) AS sub ON sub.id = empty_source.id
) AS sub
) AS rhs ON rhs.id = lhs.id;
Adding OFFSET 0 to the empty subquery on the RHS prevents the error,
suggesting the query rewriter is doing something it should not!
Another thing: removing the PK constraint on t.id, or removing/changing the
WHERE filter to a non-PK column, makes the query succeed. The PK equality
filter is required to trigger the failure (maybe because it generates some
additional inlining, which then allows the empty subquery to collapse).
We're happy to provide more context or test patches if helpful.
PG Bug reporting form <noreply@postgresql.org> writes:
After migrating from version 15 to 17.9, the following query fails with:
ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join
conditions
Thanks for the report!
This turns out to be because somebody long ago thought that outer join
removal could be lazy about how much of the planner's data structures
it needs to update. Specifically, when the lower LEFT OUTER JOIN
gets removed, we failed to remove the associated relids from the
left_relids and right_relids of the upper "ON rhs.id = lhs.id" clause,
and that blocks recognition of the applicability of a hash or merge
join, because clause_sides_match_join() fails.
The fix seems pretty trivial, as attached. (While I'm only certain
that we have to fix left_relids and right_relids, this discovery
makes it seem like it'd be pretty foolish not to fix all the relid
sets of a RestrictInfo.) I didn't make a regression test case yet,
but we need one since no existing test results change (!?).
I'm feeling a tad nervous about pushing this into released branches.
It seems likely that it might enable quite a few join plans that were
previously not considered, and people tend not to like plan changes in
stable branches. However, (a) it's hard to argue that this isn't a
regression from pre-v16, and (b) since this change affects no existing
test, maybe the blast radius isn't as big as I fear.
Thoughts?
regards, tom lane
Attachments:
v1-fix-missing-RestrictInfo-updates.patchtext/x-diff; charset=us-ascii; name=v1-fix-missing-RestrictInfo-updates.patchDownload+17-1
On Mon, Apr 20, 2026 at 6:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
This turns out to be because somebody long ago thought that outer join
removal could be lazy about how much of the planner's data structures
it needs to update. Specifically, when the lower LEFT OUTER JOIN
gets removed, we failed to remove the associated relids from the
left_relids and right_relids of the upper "ON rhs.id = lhs.id" clause,
and that blocks recognition of the applicability of a hash or merge
join, because clause_sides_match_join() fails.
I came to the same conclusion.
The fix seems pretty trivial, as attached. (While I'm only certain
that we have to fix left_relids and right_relids, this discovery
makes it seem like it'd be pretty foolish not to fix all the relid
sets of a RestrictInfo.) I didn't make a regression test case yet,
but we need one since no existing test results change (!?).
This fix LGTM. I think it'd be better to have a regression test case.
How about this one:
create table t (id int unique);
explain (costs off)
select t1.*
from t t1 full join
(select 1 as x
from t t2 left join t t3 on t2.id = t3.id
) sub on t1.id = sub.x;
ERROR: FULL JOIN is only supported with merge-joinable or
hash-joinable join conditions
I'm feeling a tad nervous about pushing this into released branches.
It seems likely that it might enable quite a few join plans that were
previously not considered, and people tend not to like plan changes in
stable branches. However, (a) it's hard to argue that this isn't a
regression from pre-v16, and (b) since this change affects no existing
test, maybe the blast radius isn't as big as I fear.
Fair points on both sides. I'd lean slightly toward back-patching
this fix, mostly because of your points (a) and (b). Without a
back-patch, users like François would need to adjust affected queries
when upgrading from pre-v16 to v16–v18, which feels a bit unfortunate.
- Richard
Richard Guo <guofenglinux@gmail.com> writes:
On Mon, Apr 20, 2026 at 6:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
This turns out to be because somebody long ago thought that outer join
removal could be lazy about how much of the planner's data structures
it needs to update. Specifically, when the lower LEFT OUTER JOIN
gets removed, we failed to remove the associated relids from the
left_relids and right_relids of the upper "ON rhs.id = lhs.id" clause,
and that blocks recognition of the applicability of a hash or merge
join, because clause_sides_match_join() fails.
I came to the same conclusion.
Thanks for looking at it! There is a loose end still bothering me:
if you remove the lower "WHERE t.id = ..." clause, or change it to be
something other than an equality constraint on t.id, the bug doesn't
manifest. The reason for that is un-obvious. I suppose it's somehow
related to the code that tries to push equality-to-a-constant through
outer join clauses, but that code shouldn't be able to produce any new
clauses here, so why is there a visible effect? I'm too tired to look
right now, and was planning to study it more tomorrow. But if you
are interested in digging before that, feel free.
regards, tom lane
On Mon, Apr 20, 2026 at 10:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thanks for looking at it! There is a loose end still bothering me:
if you remove the lower "WHERE t.id = ..." clause, or change it to be
something other than an equality constraint on t.id, the bug doesn't
manifest. The reason for that is un-obvious.
The reason seems to be that the equality constraint is a restriction
clause for the inner relation 't', and is needed to determine that the
relation has a matching unique index and is therefore distinct. If we
remove it, or change it to something that isn't mergejoinable, we
won't be able to prove the inner side of the left join is distinct,
and thus won't be able to remove that left join.
I think the qual clause "sub.id = empty_source.id" might be confusing,
because empty_source.id is constant NULL, and this clause would be
simplified to constant NULL during const-folding.
- Richard
Richard Guo <guofenglinux@gmail.com> writes:
On Mon, Apr 20, 2026 at 10:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thanks for looking at it! There is a loose end still bothering me:
if you remove the lower "WHERE t.id = ..." clause, or change it to be
something other than an equality constraint on t.id, the bug doesn't
manifest. The reason for that is un-obvious.
The reason seems to be that the equality constraint is a restriction
clause for the inner relation 't', and is needed to determine that the
relation has a matching unique index and is therefore distinct. If we
remove it, or change it to something that isn't mergejoinable, we
won't be able to prove the inner side of the left join is distinct,
and thus won't be able to remove that left join.
Hmm. The bug also goes away if "t" doesn't have a unique/pkey
constraint, and I find that easy to understand: we can't apply outer
join removal unless rel_supports_distinctness/rel_is_distinct_for
succeed, so that this buggy code in remove_rel_from_restrictinfo
is not reached. But that logic doesn't consider WHERE constraints
AFAICS. So I think there is some other code path involved.
It might turn out to not be all that interesting to run this to
ground, but I want to do so because it might inform our estimate
of the patch's blast radius.
regards, tom lane
On Mon, Apr 20, 2026 at 11:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <guofenglinux@gmail.com> writes:
On Mon, Apr 20, 2026 at 10:26 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thanks for looking at it! There is a loose end still bothering me:
if you remove the lower "WHERE t.id = ..." clause, or change it to be
something other than an equality constraint on t.id, the bug doesn't
manifest. The reason for that is un-obvious.
The reason seems to be that the equality constraint is a restriction
clause for the inner relation 't', and is needed to determine that the
relation has a matching unique index and is therefore distinct. If we
remove it, or change it to something that isn't mergejoinable, we
won't be able to prove the inner side of the left join is distinct,
and thus won't be able to remove that left join.
Hmm. The bug also goes away if "t" doesn't have a unique/pkey
constraint, and I find that easy to understand: we can't apply outer
join removal unless rel_supports_distinctness/rel_is_distinct_for
succeed, so that this buggy code in remove_rel_from_restrictinfo
is not reached. But that logic doesn't consider WHERE constraints
AFAICS. So I think there is some other code path involved.
Hmm, relation_has_unique_index_for does consider the lower "WHERE t.id
= ..." clause, as that clause is a restriction clause for "t", and
relation_has_unique_index_for automatically adds any usable
restriction clauses for the rel.
- Richard
Richard Guo <guofenglinux@gmail.com> writes:
On Mon, Apr 20, 2026 at 11:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hmm. The bug also goes away if "t" doesn't have a unique/pkey
constraint, and I find that easy to understand: we can't apply outer
join removal unless rel_supports_distinctness/rel_is_distinct_for
succeed, so that this buggy code in remove_rel_from_restrictinfo
is not reached. But that logic doesn't consider WHERE constraints
AFAICS. So I think there is some other code path involved.
Hmm, relation_has_unique_index_for does consider the lower "WHERE t.id
= ..." clause, as that clause is a restriction clause for "t", and
relation_has_unique_index_for automatically adds any usable
restriction clauses for the rel.
Ah, I finally got it through my head that there are two distinct proof
paths by which we might reach the conclusion that the lower left join
is removable. I had been thinking that we were proving that from the
combination of the "sub.id = empty_source.id" clause with the unique
index on t.id. But we're not, in the query as-submitted, because
we pull up the NULL::UUID constant and const-fold that clause to
NULL. Instead, it's the lowest "WHERE t.id = ..." that is combined
with the unique index to make the proof. So without an equality
test there, we don't think the inner side is unique and don't do
join removal, thus dodging the bug. The WHERE FALSE bit masks this
omission because it causes us to reduce the outer join to a dummy
relation anyway, later on. But if you take that out, you can see
that join removal is not being performed.
I thought it was worth memorializing these two variants in separate
test queries, so I did that. The variant without the lowest WHERE
has a non-null constant in the LOJ's left-hand side, so it's able to
make the removal proof from the "sub.id = empty_source.id" clause.
Pushed at cfcd57111 et al. Thanks again for the report!
regards, tom lane
I wrote:
Pushed at cfcd57111 et al. Thanks again for the report!
Hmm, skink seems unhappy with this. Looking...
regards, tom lane
I wrote:
Pushed at cfcd57111 et al. Thanks again for the report!
Hmm, skink seems unhappy with this. Looking...
Ah: equivclass.c doesn't mind letting em->em_relids be an alias
for the left_relids or right_relids of some source RestrictInfo.
That's not problematic as long as those are all constants after
construction of the EquivalenceClass, but when remove_rel_from_eclass
is trying to change things, it's a big problem.
This seems to do the trick to fix it, although I'm going to wait
for a valgrind regression run to finish before deciding this
is enough:
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index bfb1af614c2..03056bdf3e0 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -783,6 +783,8 @@ remove_rel_from_eclass(EquivalenceClass *ec, int relid, int ojrelid)
bms_is_member(ojrelid, cur_em->em_relids))
{
Assert(!cur_em->em_is_const);
+ /* em_relids is likely to be shared with some RestrictInfo */
+ cur_em->em_relids = bms_copy(cur_em->em_relids);
cur_em->em_relids = bms_del_member(cur_em->em_relids, relid);
cur_em->em_relids = bms_del_member(cur_em->em_relids, ojrelid);
if (bms_is_empty(cur_em->em_relids))
This discovery may help explain why we'd seen so few trouble
reports up to now. At least some RestrictInfos' left/right_relids
would have indirectly gotten "fixed" by the above.
BTW, the case that is crashing the regression tests is where the above
bit reduces em_relids to empty, allowing bms_del_member to pfree it.
Now, the source RestrictInfo's left/right_relids is pointing at
garbage. The reason this didn't cause trouble before is that if
em_relids becomes empty, we remove that EquivalenceMember altogether,
and apparently that's enough to keep us from consulting the source
RestrictInfo anymore. But the loop over ec_sources just below does
see it, and now it is needing the left/right_relids to be valid.
regards, tom lane
On Tue, Apr 21, 2026 at 7:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Ah: equivclass.c doesn't mind letting em->em_relids be an alias
for the left_relids or right_relids of some source RestrictInfo.
That's not problematic as long as those are all constants after
construction of the EquivalenceClass, but when remove_rel_from_eclass
is trying to change things, it's a big problem.
ha, I just came to the same conclusion.
This seems to do the trick to fix it, although I'm going to wait
for a valgrind regression run to finish before deciding this
is enough:
This seems safe enough to me. LGTM.
- Richard
Thanks Tom and Richard for the quick diagnosis and fix, I should be the one
saying thanks here!
François
On Tue, Apr 21, 2026 at 1:04 AM Richard Guo <guofenglinux@gmail.com> wrote:
Show quoted text
On Tue, Apr 21, 2026 at 7:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Ah: equivclass.c doesn't mind letting em->em_relids be an alias
for the left_relids or right_relids of some source RestrictInfo.
That's not problematic as long as those are all constants after
construction of the EquivalenceClass, but when remove_rel_from_eclass
is trying to change things, it's a big problem.ha, I just came to the same conclusion.
This seems to do the trick to fix it, although I'm going to wait
for a valgrind regression run to finish before deciding this
is enough:This seems safe enough to me. LGTM.
- Richard