Remove inner joins based on foreign keys
Currently, the planner can remove useless left joins if the join
condition cannot match more than one RHS row, and the RHS rel is not
referenced above the join. I'd like to propose a similar optimization
for inner joins.
For an inner join to be removable, we need to make sure that the join
condition matches exactly one (no more, no less) RHS row. We can
leverage foreign key constraints to achieve that. But we need to be
careful about the case where the referencing relation's foreign key
columns are null. We can inject an IS NOT NULL filter for any foreign
key columns that are not defined as NOT NULL.
Here is an example to show this removal:
CREATE TABLE users (id int primary key, name text);
CREATE TABLE orders (id int, user_id int references users(id), amount int);
EXPLAIN (COSTS OFF)
SELECT o.* FROM orders o JOIN users u ON o.user_id = u.id;
QUERY PLAN
---------------------------------
Seq Scan on orders o
Filter: (user_id IS NOT NULL)
(2 rows)
One interesting aspect of this patch is that we can actually allow the
usage of the referenced key columns in ECs even above the join, which
can help bridge multi-hop joins. This is pretty useful for the
brand-new graph queries.
As an example, consider:
CREATE TABLE knows (
edge_id int primary key,
src_id int not null references users(id),
dst_id int not null references users(id),
creation_date date
);
CREATE PROPERTY GRAPH social_graph
VERTEX TABLES (users)
EDGE TABLES (knows
SOURCE KEY (src_id) REFERENCES users(id)
DESTINATION KEY (dst_id) REFERENCES users(id)
);
Suppose we only want to find the creation dates of a 3-hop connection
chain:
EXPLAIN (COSTS OFF)
SELECT e1_date, e2_date, e3_date
FROM GRAPH_TABLE (social_graph
MATCH (u1 IS users)
-[e1 IS knows]-> (u2 IS users)
-[e2 IS knows]-> (u3 IS users)
-[e3 IS knows]-> (u4 IS users)
COLUMNS (
e1.creation_date AS e1_date,
e2.creation_date AS e2_date,
e3.creation_date AS e3_date
)
);
QUERY PLAN
----------------------------------------------------
Hash Join
Hash Cond: (knows_1.dst_id = knows_2.src_id)
-> Hash Join
Hash Cond: (knows.dst_id = knows_1.src_id)
-> Seq Scan on knows
-> Hash
-> Seq Scan on knows knows_1
-> Hash
-> Seq Scan on knows knows_2
(9 rows)
As a result, all user nodes are removed, and this query is reduced
from a 7-table join down to a 3-table join.
Please see the draft commit message in the attached patch for the
implementation details. Any thoughts and feedback are very welcome.
- Richard
Attachments:
v1-0001-Remove-inner-joins-based-on-foreign-keys.patchapplication/octet-stream; name=v1-0001-Remove-inner-joins-based-on-foreign-keys.patchDownload+1140-71
On Sat, Mar 21, 2026 at 11:47:03AM +0900, Richard Guo wrote:
Please see the draft commit message in the attached patch for the
implementation details. Any thoughts and feedback are very welcome.
You did not mention anything about the version that would be impacted
by this change. At this stage of the release cycle, is it right to
imply that this would be for v20 and not v19?
--
Michael
On Sat, Mar 21, 2026 at 4:42 PM Michael Paquier <michael@paquier.xyz> wrote:
You did not mention anything about the version that would be impacted
by this change. At this stage of the release cycle, is it right to
imply that this would be for v20 and not v19?
Right. This is for v20.
- Richard
On Sat, 21 Mar 2026 at 15:47, Richard Guo <guofenglinux@gmail.com> wrote:
Currently, the planner can remove useless left joins if the join
condition cannot match more than one RHS row, and the RHS rel is not
referenced above the join. I'd like to propose a similar optimization
for inner joins.
I tried this many years ago and it was pretty much a dead end with how
the current foreign key implementation deferring the cascade of the
foreign key until the end of the query.
There's plenty of discussion. See [1]/messages/by-id/CAApHDvpDXXvKE+=ug1kA--nKfa=bjrjvK8Gp9G8UYwv6nHckVg@mail.gmail.com and [2]/messages/by-id/CAApHDvocUEYdt1uT+DLDPs2xEu=v3qJGT6HeXKonQM4rY_OsSA@mail.gmail.com. In particular, read and
follow along from what Heikki mentions in [3]/messages/by-id/54240C51.8040304@vmware.com. You should read all of
that and understand it to prevent prompting the same discussions all
over again.
It doesn't look like your patch has anything to protect against any of
the issues mentioned in [1]/messages/by-id/CAApHDvpDXXvKE+=ug1kA--nKfa=bjrjvK8Gp9G8UYwv6nHckVg@mail.gmail.com, so I assume you weren't aware of that
work.
David
[1]: /messages/by-id/CAApHDvpDXXvKE+=ug1kA--nKfa=bjrjvK8Gp9G8UYwv6nHckVg@mail.gmail.com
[2]: /messages/by-id/CAApHDvocUEYdt1uT+DLDPs2xEu=v3qJGT6HeXKonQM4rY_OsSA@mail.gmail.com
[3]: /messages/by-id/54240C51.8040304@vmware.com
On Sun, Mar 22, 2026 at 6:09 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Sat, 21 Mar 2026 at 15:47, Richard Guo <guofenglinux@gmail.com> wrote:
Currently, the planner can remove useless left joins if the join
condition cannot match more than one RHS row, and the RHS rel is not
referenced above the join. I'd like to propose a similar optimization
for inner joins.
I tried this many years ago and it was pretty much a dead end with how
the current foreign key implementation deferring the cascade of the
foreign key until the end of the query.
Thanks for pointing this out! I failed to find the prior work, and I
missed the fatal flaw introduced by the AFTER ROW trigger mechanism
for foreign key constraints. I had been making a mental analogy to
UNIQUE and NOT NULL constraints, but those are enforced immediately at
the heap/B-tree level.
Just for the sake of archives, the timeline of the trap during a
cascading delete looks like this:
T0: DELETE FROM users WHERE id = 1;
T1: The executor finds users row 1 and sets its xmax, physically
marking it as dead.
T2: [The Gap] The executor pushes the RI trigger into a queue to deal
with orders later. Right now, the orders row still exists, but its
referenced row in users is dead.
T3: The statement finishes, the trigger fires, and the orders row is
finally deleted.
The "T2 Gap" is small, but there are several ways to execute user code
inside that window, such as RETURNING clauses, volatile functions, or
user-defined AFTER ROW triggers.
Since the planner operates on static logical schema guarantees and
cannot predict dynamic execution-time trigger queues, it seems any
plan-time optimization that relies on foreign keys for correctness is
effectively a dead end. Maybe the only solution would be to handle
the join removal in the executor (where the trigger state is known),
but I noticed you explored that a decade ago and it seems far too
invasive.
Thanks again for the save. You saved me a lot of time and effort
chasing a dead end.
- Richard
Hi
po 23. 3. 2026 v 7:13 odesílatel Richard Guo <guofenglinux@gmail.com>
napsal:
On Sun, Mar 22, 2026 at 6:09 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Sat, 21 Mar 2026 at 15:47, Richard Guo <guofenglinux@gmail.com>
wrote:
Currently, the planner can remove useless left joins if the join
condition cannot match more than one RHS row, and the RHS rel is not
referenced above the join. I'd like to propose a similar optimization
for inner joins.I tried this many years ago and it was pretty much a dead end with how
the current foreign key implementation deferring the cascade of the
foreign key until the end of the query.Thanks for pointing this out! I failed to find the prior work, and I
missed the fatal flaw introduced by the AFTER ROW trigger mechanism
for foreign key constraints. I had been making a mental analogy to
UNIQUE and NOT NULL constraints, but those are enforced immediately at
the heap/B-tree level.Just for the sake of archives, the timeline of the trap during a
cascading delete looks like this:T0: DELETE FROM users WHERE id = 1;
T1: The executor finds users row 1 and sets its xmax, physically
marking it as dead.T2: [The Gap] The executor pushes the RI trigger into a queue to deal
with orders later. Right now, the orders row still exists, but its
referenced row in users is dead.T3: The statement finishes, the trigger fires, and the orders row is
finally deleted.The "T2 Gap" is small, but there are several ways to execute user code
inside that window, such as RETURNING clauses, volatile functions, or
user-defined AFTER ROW triggers.Since the planner operates on static logical schema guarantees and
cannot predict dynamic execution-time trigger queues, it seems any
plan-time optimization that relies on foreign keys for correctness is
effectively a dead end. Maybe the only solution would be to handle
the join removal in the executor (where the trigger state is known),
but I noticed you explored that a decade ago and it seems far too
invasive.Thanks again for the save. You saved me a lot of time and effort
chasing a dead end.
Maybe you can push this analysis to some README in the code.
Regards
Pavel
Show quoted text
- Richard
On Mon, Mar 23, 2026 at 3:12 PM Richard Guo <guofenglinux@gmail.com> wrote:
Just for the sake of archives, the timeline of the trap during a
cascading delete looks like this:T0: DELETE FROM users WHERE id = 1;
T1: The executor finds users row 1 and sets its xmax, physically
marking it as dead.T2: [The Gap] The executor pushes the RI trigger into a queue to deal
with orders later. Right now, the orders row still exists, but its
referenced row in users is dead.T3: The statement finishes, the trigger fires, and the orders row is
finally deleted.The "T2 Gap" is small, but there are several ways to execute user code
inside that window, such as RETURNING clauses, volatile functions, or
user-defined AFTER ROW triggers.
I've been exploring ways to detect whether a query is executing within
the RI trigger gap. It seems one promising approach is to leverage
the lock manager.
Every DML statement acquires RowExclusiveLock on its target table
before modifying any rows, so if the trigger gap is active, the lock
is guaranteed to be held. We can verify this during planning by
calling CheckRelationOidLockedByMe(), which is a quite efficient
check. (As a point of precedent, the planner already relies on the
local lock manager's state in several existing code paths.)
For prepared statements and cached plans, a generic plan that was
built with FK join removal could be reused in a context where the
trigger gap is active. To handle this, we can modify
choose_custom_plan() to check whether any relation involved in an
FK-based join removal currently holds RowExclusiveLock, and choose
custom plan if so.
The trade-off is false positives: because RowExclusiveLock persists
for the entire transaction while the trigger gap is intra-statement,
the optimization is also skipped after the DML completes but within
the same transaction, after ROLLBACK TO a savepoint, or when
RowExclusiveLock is held for other reasons (e.g., LOCK TABLE). These
seem like uncommon cases, and the query simply falls back to executing
the join normally as it would without the optimization at all, so I
think this is an acceptable trade-off.
Please see the v2 patch for the implementation details.
I didn't find any mention of this approach in the 2014 thread. I'd
appreciate any thoughts or feedback on this direction.
- Richard
Attachments:
v2-0001-Remove-inner-joins-based-on-foreign-keys.patchapplication/octet-stream; name=v2-0001-Remove-inner-joins-based-on-foreign-keys.patchDownload+1349-71
On Wed, Apr 1, 2026 at 6:45 PM Richard Guo <guofenglinux@gmail.com> wrote:
Please see the v2 patch for the implementation details.
I didn't find any mention of this approach in the 2014 thread. I'd
appreciate any thoughts or feedback on this direction.
Here is a new rebase over 20efbdffe. It also tightens up some checks
in inner_join_is_removable().
I've reconsidered the trigger-gap issue and I think I now have a
cleaner understanding. A snapshot captured inside the trigger-gap
window can outlive the window's closure -- for example via a STABLE
function that inherits an older snapshot from its caller, as Tom
pointed out back in 2015 [1]/messages/by-id/32139.1427667410@sss.pgh.pa.us. A query executed against such a stale
snapshot would still observe the inconsistency even after the trigger
queue has drained. So the predicate guarding the optimization has to
remain positive at least as long as any in-transaction snapshot could
still be referenced.
I went through several alternatives (a per-statement counter
incremented in ExecInitModifyTable, AfterTriggerPendingOnRel, etc.)
and convinced myself that all of them go silent before the relevant
snapshots are gone. The lock-based predicate is still the best
correct approach I can think of: RowExclusiveLock is released only at
end of transaction, which trivially exceeds the lifetime of any
in-transaction snapshot. By the time the lock is released, every
in-transaction snapshot has been released, so no stale gap-window
snapshot can still be referenced. Maybe the false positives are just
the price we need to pay for that lifetime guarantee.
[1]: /messages/by-id/32139.1427667410@sss.pgh.pa.us
- Richard
Attachments:
v3-0001-Remove-inner-joins-based-on-foreign-keys.patchapplication/octet-stream; name=v3-0001-Remove-inner-joins-based-on-foreign-keys.patchDownload+1511-47
Hi, Richard,
Richard Guo <guofenglinux@gmail.com> 于2026年4月28日周二 18:09写道:
On Wed, Apr 1, 2026 at 6:45 PM Richard Guo <guofenglinux@gmail.com> wrote:
Please see the v2 patch for the implementation details.
Thanks for this optimization. I look through this patch, and I have
two questions about this implementation.
This is my test case:
CREATE TABLE users (id int primary key, name text);
CREATE TABLE orders (id int primary key, user_id int references
users(id), amount int);
create table nation (id int, name text);
postgres=# explain select n.* from nation n join (orders o join users
u on o.user_id = u.id) on n.id = o.id;
QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=55.78..81.82 rows=1264 width=36)
Hash Cond: (n.id = o.id)
-> Seq Scan on nation n (cost=0.00..22.70 rows=1270 width=36)
-> Hash (cost=30.40..30.40 rows=2030 width=8)
-> Seq Scan on orders o (cost=0.00..30.40 rows=2030 width=8)
Filter: (user_id IS NOT NULL)
(6 rows)
The above plan is what we want. The inner join between orders and
users is removed thanks to the FK constraint.
But if I change the most left-side inner join to an outer join, the
plan doesn't seem to be what we want. Maybe I'm wrong.
postgres=# explain select n.* from nation n left join (orders o join
users u on o.user_id = u.id) on n.id = o.id;
QUERY PLAN
-----------------------------------------------------------------------------------
Hash Left Join (cost=99.85..140.01 rows=1270 width=36)
Hash Cond: (n.id = o.id)
-> Seq Scan on nation n (cost=0.00..22.70 rows=1270 width=36)
-> Hash (cost=74.35..74.35 rows=2040 width=4)
-> Hash Join (cost=38.58..74.35 rows=2040 width=4)
Hash Cond: (o.user_id = u.id)
-> Seq Scan on orders o (cost=0.00..30.40 rows=2040 width=8)
-> Hash (cost=22.70..22.70 rows=1270 width=4)
-> Seq Scan on users u (cost=0.00..22.70
rows=1270 width=4)
(9 rows)
The inner join between orders and users is not removed. It appears
this can be removed in this query. This is my first question.
The second question may not be related to this $SUBJECT.
I did some research about why the inner join cannot be removed in the
second query.
In the inner_join_is_removable(), we have the following check:
...
/*
* If the referenced relation has any restriction clauses or non-equality
* join clauses, they act as explicit filters. Since we cannot perform
* variable substitution to rewrite these clauses, we must abort.
*/
if (ref_rel->baserestrictinfo || ref_rel->joininfo)
return false;
...
The ref_rel is the users relation, and ref_rel->baserestrictinfo is
NIL, but the ref_rel->joininfo is not NIL.
In this query, the ref_rel->joininfo is (n.id = o.id), because it is a
left join, so n.id may not equal o.id.
So we have "non-equality join clauses" in the above comments.
My question is that the (n.id = o.id) clause seems not related to
users. Why does the planner add it to the users' RelOptInfo's
joininfo?
I only know that the thing happens in the
reconsider_outer_join_clauses(), the restrictinfo->required_relids
contains {1,2,3}, where the rtindex = 3 is the users relation.
--
Thanks,
Tender Wang
On Thu, Apr 30, 2026 at 4:50 PM Tender Wang <tndrwang@gmail.com> wrote:
The inner join between orders and users is not removed. It appears
this can be removed in this query. This is my first question.
Right. This is a mis-optimization case I was aware of. The inner
join can be removed in this case. inner_join_is_removable() is
currently overly conservative about ref_rel->joininfo: it bails as
long as the list is non-empty. We can relax this to bail only when a
clause actually references ref_rel via clause_relids. A clause can
appear in ref_rel->joininfo without its expression referencing
ref_rel, which connects to your second question.
The second question may not be related to this $SUBJECT.
My question is that the (n.id = o.id) clause seems not related to
users. Why does the planner add it to the users' RelOptInfo's
joininfo?I only know that the thing happens in the
reconsider_outer_join_clauses(), the restrictinfo->required_relids
contains {1,2,3}, where the rtindex = 3 is the users relation.
Because the clause "n.id = o.id" is an outer join ON clause that is
non-degenerate (one that references the non-nullable side of the join)
and we need to force it to be evaluated exactly at the level of the
outer join. To handle that, we add the join's minimum input relid set
to required_relids. That's why it appears in users' joininfo without
referencing users.
Attached patch relaxes the check against ref_rel->joininfo. Nothing
else has changed.
- Richard
Attachments:
v4-0001-Remove-inner-joins-based-on-foreign-keys.patchapplication/octet-stream; name=v4-0001-Remove-inner-joins-based-on-foreign-keys.patchDownload+1607-47
Richard Guo <guofenglinux@gmail.com> 于2026年5月1日周五 16:25写道:
Because the clause "n.id = o.id" is an outer join ON clause that is
non-degenerate (one that references the non-nullable side of the join)
and we need to force it to be evaluated exactly at the level of the
outer join. To handle that, we add the join's minimum input relid set
to required_relids. That's why it appears in users' joininfo without
referencing users.
Thanks for the explanation.
Attached patch relaxes the check against ref_rel->joininfo. Nothing
else has changed.
CREATE TABLE users (id int primary key, name text);
CREATE TABLE orders (id int primary key, user_id int not null
references users(id), amount int);
CREATE TABLE nation (id int primary key, name text);
postgres=# explain select n.* from nation n left join (orders o join
users u on o.user_id = u.id) on n.id = o.id;
QUERY PLAN
-------------------------------------------------------------------------
Hash Right Join (cost=38.58..74.34 rows=1270 width=36)
Hash Cond: (o.id = n.id)
-> Seq Scan on orders o (cost=0.00..30.40 rows=2040 width=8)
-> Hash (cost=22.70..22.70 rows=1270 width=36)
-> Seq Scan on nation n (cost=0.00..22.70 rows=1270 width=36)
(5 rows)
postgres=# explain select n.* from nation n left join orders o on n.id = o.id;
QUERY PLAN
-------------------------------------------------------------
Seq Scan on nation n (cost=0.00..22.70 rows=1270 width=36)
(1 row)
Recently, I encountered the two plans above. The first plan can
continue to transform into the second plan after inner-join removing.
But in current logic, we cannot do this. Because we do left-join
removable first. Then we do other join(semi-join,
self-join,inner-join) removable.
We can only remove outer-join, whose min-righthand is single.
Maybe we can call remove_useless_joins() again to remove the outer
join that the function couldn't remove in the first call.
I'm not sure it is worth doing this.
Any thoughts?
Hi Richard,
On Fri, May 1, 2026 at 1:25 AM Richard Guo <guofenglinux@gmail.com> wrote:
Attached patch relaxes the check against ref_rel->joininfo. Nothing
else has changed.
Thanks for the patch! I've tested and reviewed v4. Overall, this looks
good to me. Please find individual feedback below.
On Wed, Apr 1, 2026 at 2:46 AM Richard Guo <guofenglinux@gmail.com> wrote:
On Mon, Mar 23, 2026 at 3:12 PM Richard Guo <guofenglinux@gmail.com>
wrote:
Just for the sake of archives, the timeline of the trap during a
cascading delete looks like this:T0: DELETE FROM users WHERE id = 1;
T1: The executor finds users row 1 and sets its xmax, physically
marking it as dead.T2: [The Gap] The executor pushes the RI trigger into a queue to deal
with orders later. Right now, the orders row still exists, but its
referenced row in users is dead.T3: The statement finishes, the trigger fires, and the orders row is
finally deleted.The "T2 Gap" is small, but there are several ways to execute user code
inside that window, such as RETURNING clauses, volatile functions, or
user-defined AFTER ROW triggers.I've been exploring ways to detect whether a query is executing within
the RI trigger gap. It seems one promising approach is to leverage
the lock manager.Every DML statement acquires RowExclusiveLock on its target table
before modifying any rows, so if the trigger gap is active, the lock
is guaranteed to be held. We can verify this during planning by
calling CheckRelationOidLockedByMe(), which is a quite efficient
check. (As a point of precedent, the planner already relies on the
local lock manager's state in several existing code paths.)For prepared statements and cached plans, a generic plan that was
built with FK join removal could be reused in a context where the
trigger gap is active. To handle this, we can modify
choose_custom_plan() to check whether any relation involved in an
FK-based join removal currently holds RowExclusiveLock, and choose
custom plan if so.The trade-off is false positives: because RowExclusiveLock persists
for the entire transaction while the trigger gap is intra-statement,
the optimization is also skipped after the DML completes but within
the same transaction, after ROLLBACK TO a savepoint, or when
RowExclusiveLock is held for other reasons (e.g., LOCK TABLE). These
seem like uncommon cases, and the query simply falls back to executing
the join normally as it would without the optimization at all, so I
think this is an acceptable trade-off.
On Tue, Apr 28, 2026 at 3:09 AM Richard Guo <guofenglinux@gmail.com> wrote:
On Wed, Apr 1, 2026 at 6:45 PM Richard Guo <guofenglinux@gmail.com> wrote:
Please see the v2 patch for the implementation details.
I didn't find any mention of this approach in the 2014 thread. I'd
appreciate any thoughts or feedback on this direction.Here is a new rebase over 20efbdffe. It also tightens up some checks
in inner_join_is_removable().I've reconsidered the trigger-gap issue and I think I now have a
cleaner understanding. A snapshot captured inside the trigger-gap
window can outlive the window's closure -- for example via a STABLE
function that inherits an older snapshot from its caller, as Tom
pointed out back in 2015 [1]. A query executed against such a stale
snapshot would still observe the inconsistency even after the trigger
queue has drained. So the predicate guarding the optimization has to
remain positive at least as long as any in-transaction snapshot could
still be referenced.I went through several alternatives (a per-statement counter
incremented in ExecInitModifyTable, AfterTriggerPendingOnRel, etc.)
and convinced myself that all of them go silent before the relevant
snapshots are gone. The lock-based predicate is still the best
correct approach I can think of: RowExclusiveLock is released only at
end of transaction, which trivially exceeds the lifetime of any
in-transaction snapshot. By the time the lock is released, every
in-transaction snapshot has been released, so no stale gap-window
snapshot can still be referenced. Maybe the false positives are just
the price we need to pay for that lifetime guarantee.
Checking RowExclusiveLock makes sense to me.
One nitpick on wording: the RowExclusiveLock acquired at the beginning
of the DML command can be released early by ROLLBACK TO a savepoint
established before that command, so it doesn't strictly "exceed the
lifetime of any in-transaction snapshot." In practice this doesn't
matter, though, because the lock is guaranteed to outlive any snapshot
taken during the trigger gap of the DML command, which is the only
window where the FK invariant can be violated.
IIUC, as long as we ensure that the snapshot of the join statement --
whether it is an after-row trigger, RETURNING clause, cursor, etc. --
is taken outside of the trigger gap, which IS guarded by
RowExclusiveLock, removing the join from its plan should be safe.
So, setting aside the false positives you mentioned, which I think are
acceptable, I believe this approach is correct.
I then tried to find edge cases similar to the 2014 threads and found
one bug with TABLESAMPLE:
-- If the patch removes the join when the referenced table uses TABLESAMPLE,
-- it will return MORE rows than correct (all child rows instead of only
-- those whose parent appeared in the sample).
CREATE TABLE exp_parent (id int PRIMARY KEY, data text);
CREATE TABLE exp_child (
id int PRIMARY KEY,
parent_id int NOT NULL REFERENCES exp_parent(id)
);
INSERT INTO exp_parent SELECT g, 'row_' || g FROM generate_series(1, 100) g;
INSERT INTO exp_child SELECT g, g FROM generate_series(1, 100) g;
ANALYZE exp_parent;
ANALYZE exp_child;
-- This join should NOT be removed because TABLESAMPLE limits the parent
rows
EXPLAIN (COSTS OFF)
SELECT c.id
FROM exp_child c
INNER JOIN exp_parent p TABLESAMPLE BERNOULLI(1) REPEATABLE(42) ON
c.parent_id = p.id;
-- Verify: if join removal fires, this would return ~100 rows.
-- With the join, it should return ~1 row (1% sample).
SELECT count(*)
FROM exp_child c
INNER JOIN exp_parent p TABLESAMPLE BERNOULLI(1) REPEATABLE(42) ON
c.parent_id = p.id;
We should add a condition for TABLESAMPLE in inner_join_is_removable()
to disallow join removal.
Minor things I think might be worth mentioning, but I don't have
strong opinions:
1. Self-join removal has a GUC: enable_self_join_elimination. I didn't
investigate why that was needed, and I don't know in what cases people
would want to turn it off, maybe for debugging purposes? Do you think
we need a similar GUC for inner joins? Disclaimer: I'm not a fan of
unnecessary GUCs, just putting the question out there.
2. Nitpick: should we consider renaming remove_useless_joins() to
remove_useless_left_joins() to reflect what it actually does?
Best,
Alex
--
Alexandra Wang
EDB: https://www.enterprisedb.com
(CC'ing folks who discussed this with me at PGConf.dev last week;
thanks again for the conversations.)
After several discussions at PGConf.dev last week, there are concerns
with the lock-based predicate this thread has been converging on. I'd
like to lay out the underlying issue and the available approaches in
detail on-list, so the wider community can weigh in before we go
further. I'm not advocating any particular option here. My goal in
this email is to make sure everyone evaluating the patch is starting
from the same picture.
1. The trigger gap, demonstrated
--------------------------------
On master, no patch applied:
CREATE TABLE users (id int PRIMARY KEY, name text);
CREATE TABLE orders (id int PRIMARY KEY,
user_id int REFERENCES users(id) ON DELETE CASCADE);
INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO orders VALUES (10, 1), (11, 1), (20, 2);
CREATE FUNCTION show_gap() RETURNS void LANGUAGE plpgsql AS $$
DECLARE u_rows text; o_rows text;
BEGIN
SELECT string_agg(format('%s (id=%s)', name, id),
', ' ORDER BY id)
INTO u_rows FROM users;
SELECT string_agg(format('(id=%s, user_id=%s)', id, user_id),
', ' ORDER BY id)
INTO o_rows FROM orders;
RAISE NOTICE 'users: %', u_rows;
RAISE NOTICE 'orders: %', o_rows;
END $$;
DELETE FROM users WHERE id = 1 RETURNING show_gap();
NOTICE: users: Bob (id=2)
NOTICE: orders: (id=10, user_id=1), (id=11, user_id=1), (id=20, user_id=2)
show_gap() runs inside the DELETE's RETURNING clause, after Alice has
been removed from users but before the cascade trigger has deleted her
orders. In show_gap()'s snapshot, Alice is gone, but her two orders
(10 and 11) are still there, pointing at a user_id that no longer
exists. The FK invariant is locally false.
Why this happens: FK constraints are enforced by AFTER ROW triggers
that fire at end of statement, after RETURNING evaluation. Between
when the DELETE modifies the heap and when the cascade trigger fires,
the heap is transiently inconsistent. Any user code that runs in that
window with a fresh snapshot observes the inconsistency. In the demo
above, plpgsql's SPI call takes such a snapshot.
For the FK-based inner-join-removal optimization, this matters because
the rewrite from 'child JOIN parent ON c.fk = p.id' to 'child WHERE
c.fk IS NOT NULL' assumes the FK invariant holds for the executing
snapshot. In show_gap()'s snapshot it doesn't, and the rewrite would
return Alice's orders (10 and 11) as if Alice were still present, when
the joined form correctly excludes them.
2. Approaches
-------------
What PG promises about FK constraints today is what the SQL standard
requires: INITIALLY IMMEDIATE constraints are satisfied at end-of-
statement, INITIALLY DEFERRED at commit. Neither the standard nor the
PG docs explicitly promise that the FK invariant holds for every
snapshot a query inside the database can use. The optimization
proposed in this thread needs the stronger property: it rewrites a
join on the assumption that every visible child row with non-null FK
has a corresponding visible parent in the same snapshot. The trigger
gap is where that stronger property fails.
The approaches I have considered are listed below, with (B) being
the predicate currently in the v4 patch.
(A) Drop the optimization.
Don't pursue FK-based inner-join removal, or any other optimization
whose correctness depends on the FK invariant.
(B) Lock-based predicate.
inner_join_is_removable() and choose_custom_plan() consult
CheckRelationOidLockedByMe(RowExclusiveLock) for the FK rels; skip the
rewrite if held. RowExclusiveLock is xact-scoped and exceeds the
lifetime of any in-xact snapshot, so the lock being present is
sufficient to block the optimization in any context where a
gap-bearing snapshot might exist.
Pros: No new infrastructure. Predicate is local to two functions.
CheckRelationOidLockedByMe() is a cheap local hash lookup. Correct
against every gap-bearing snapshot path.
Cons: Xact-scoped, therefore pessimistic. Once a DML on either FK rel
runs in a transaction, the rewrite is suppressed for the rest of that
transaction, even when subsequent statements take fresh snapshots that
the FK invariant *does* hold for. For cached plans that previously
used FK removal, this also produces a planner-replan storm:
choose_custom_plan trips on every consultation in the writing xact and
replans each time.
(C) Snapshot-anchored predicate.
Add a "captured during a gap window" bit to SnapshotData, populated in
CopySnapshot when AfterTriggerPendingOnAnyRel() returns true at
capture time. inner_join_is_removable() and choose_custom_plan()
consult the bit on the active snapshot.
Pros: Precise. Only snapshots actually captured inside a gap suppress
the optimization. Sequential DML-then-SELECT in the same xact keeps
the rewrite. Eliminates the replan storm and most of the plan-shape
cliff that (B) introduces.
Cons: SnapshotData, one of PG's most foundational abstractions, gains
a field that encodes trigger-subsystem state. Every future reader of
snapshot.h has to learn what the gap is.
(D) Close the gap.
Make the FK invariant hold for every in-xact snapshot. Either by
enforcing RI synchronously inside the DML rather than via AFTER ROW
triggers, or by arranging that the deleted parent tuple and the
trigger's modifications to children become visible atomically.
Pros: Strengthens what PG promises about constraints. Benefits any
future optimization that needs the FK invariant. No gating predicate
is needed in the planner or plan cache.
Cons: Invasive and difficult.
(E) Apply the optimization without gap handling, and document the
corner cases.
Skip the predicate entirely. The optimization fires whenever the FK
constraint structurally permits it. Queries that execute against a
gap-bearing snapshot can return wrong results, documented as a known
limitation.
Pros: Smallest patch. Writing a join query that runs during a trigger
gap is unusual in practice for most user code.
Cons: Can have wrong results.
(F) Something else.
3. On splitting the patch
-------------------------
Independently of which option above wins, I'd like to separate the
patch into two parts so the optimization mechanics and the
gap-handling can be reviewed independently (see v5 patches):
Part 1. The structural inner-join-removal logic, assuming the FK
invariant holds.
Part 2. Whatever gap-handling we converge on from section 2 above. If
(A) wins, Part 2 is replaced by dropping Part 1 entirely.
Part 1 alone would be unsafe to commit by itself; the value of the
split is reviewing-order, not commit-order.
(Attached v5 is the split version of v4, plus addressing Alex's two
comments. 0002 is still the lock-based predicate from v4, posted as
the concrete reference for option (B); it can be swapped for whichever
approach the gap-handling discussion settles on.)
I'd like input on which of the approaches in section 2 people would be
willing to live with. Pointers to prior discussion I haven't found,
or design considerations I've missed, are also welcome.
- Richard