Eliminating SPI / SQL from some RI triggers - take 3

Started by Amit Langote · about 1 year ago · 17 messages
#1 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp

Hi,

We discussed $subject at [1] and [2] and I'd like to continue that
work with the hope to commit some part of it for v18.

In short, performing the RI checks for inserts and updates of a
referencing table as direct scans of the PK index results in up to 40%
improvement in their performance, especially when they are done in a
bulk manner as shown in the following example:

create unlogged table pk (a int primary key);
create unlogged table fk (a int references pk);
insert into pk select generate_series(1, 10000000);
insert into fk select generate_series(1, 10000000);

On my machine, the last query took 20 seconds with master versus 12
seconds with the patches. With master, a significant portion of the
time is spent in ExecutorStart() and ExecutorEnd() on the plan for the
RI query, which adds up as it's done for each row in a bulk load. The
patch avoids that overhead because it calls the index AM directly.

The patches haven't changed in the basic design since the last update
at [2], though there are a few changes:

1. I noticed a few additions to the RI trigger functions the patch
touches, such as those to support temporal foreign keys. I decided to
leave the SQL for temporal queries in place, as the plan for those
doesn't look, at a glance, as simple as a plain index scan.

2. As I mentioned in [3], the new way of doing the PK lookup didn't
have a way to recheck the PK tuple after detecting concurrent updates
of the PK, so would cause an error under READ COMMITTED isolation
level. The old way of executing an SQL plan would deal with that
using the EvalPlanQual() mechanism in the executor. In the updated
patch, I've added an equivalent rechecking function that's called in
the same situations as EvalPlanQual() would get called in the old
method.
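For illustration, the recheck can be modeled in standalone C. Everything here (PKTuple, ri_recheck_latest) is an illustrative toy, not the patch's actual code: on detecting a concurrent update, follow the tuple's version chain to its latest version and re-test the key, which is essentially what EvalPlanQual arranges inside the executor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy version chain: an updated tuple points at its successor. */
typedef struct PKTuple
{
    int         key;
    struct PKTuple *newer;      /* NULL if this is the latest version */
} PKTuple;

/*
 * When the probe finds that the PK tuple was concurrently updated,
 * walk to the latest version and test whether it still satisfies the
 * key being checked, instead of erroring out under READ COMMITTED.
 */
static bool
ri_recheck_latest(const PKTuple *t, int fk_key)
{
    while (t->newer != NULL)
        t = t->newer;
    return t->key == fk_key;
}
```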

3. I reordered the patches as Robert suggested at [5]. Mainly because
the patch set includes changes to address a bug where PK lookups could
return incorrect results under the REPEATABLE READ isolation level.
This issue arises because RI lookups on partitioned PK tables
manipulate ActiveSnapshot to pass the snapshot that's used by
find_inheritance_children() to determine the visibility of
detach-pending partitions to these RI lookups. To address this, the
patch set introduces refactoring of the PartitionDesc interface,
included in patch 0001. This refactoring eliminates the need to
manipulate ActiveSnapshot by explicitly passing the correct snapshot
for detach-pending visibility handling. The main patch (0002+0003),
which focuses on improving performance by avoiding SQL queries for RI
checks, builds upon these refactoring changes to pass the snapshot
directly instead of manipulating the ActiveSnapshot. Reordering the
patches this way ensures a logical progression of changes, as Robert
suggested, while avoiding any impression that the bug was introduced
by the ri_triggers.c changes.

However, I need to spend some time addressing Robert's feedback on the
basic design, as outlined at [5]. Specifically, the new PK lookup
function could benefit significantly from caching information rather
than recomputing it for each row. This implies that the PlanCreate
function should create a struct to store reusable information across
PlanExecute calls for different rows being checked.
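As a rough sketch of that split (all names here are hypothetical, not taken from the patch): the create step resolves and stores the per-constraint metadata once, and the execute step reuses it for every row checked.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-constraint "plan": everything stable across rows. */
typedef struct RI_Plan
{
    int         nkeys;          /* number of key columns */
    int        *cmp_proc_oids;  /* resolved comparison function per column */
} RI_Plan;

/* PlanCreate: done once per constraint, caches resolved metadata. */
static RI_Plan *
ri_PlanCreate(int nkeys, const int *cmp_proc_oids)
{
    RI_Plan    *plan = malloc(sizeof(RI_Plan));

    plan->nkeys = nkeys;
    plan->cmp_proc_oids = malloc(nkeys * sizeof(int));
    for (int i = 0; i < nkeys; i++)
        plan->cmp_proc_oids[i] = cmp_proc_oids[i];
    return plan;
}

/* PlanExecute: called per row, reuses the cached metadata. */
static int
ri_PlanExecute(const RI_Plan *plan, const int *fk_values,
               const int *pk_values)
{
    for (int i = 0; i < plan->nkeys; i++)
    {
        /* stand-in for calling the cached comparison function */
        (void) plan->cmp_proc_oids[i];
        if (fk_values[i] != pk_values[i])
            return 0;           /* no matching PK value: RI violation */
    }
    return 1;                   /* matching PK row found */
}
```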

Beyond implementing these changes, I also need to confirm that the new
plan execution preserves all operations performed by the SQL plan for
the same checks, particularly those affecting user-visible behavior.
I've already verified that permission checks are preserved: revoking
access to the PK table during the checks causes them to fail, as
expected. This behavior is maintained because permission checks are
performed during each execution. The planned changes to separate the
"plan" and "execute" steps should continue to uphold this and other
behaviors that might need to be preserved.

--
Thanks, Amit Langote

[1]: Simplifying foreign key/RI checks: /messages/by-id/CA+HiwqG5e8pk8s7+7zhr1Nc_PGyhEdM5f=pHkMOdK1RYWXfJsg@mail.gmail.com

[2]: Eliminating SPI from RI triggers - take 2: /messages/by-id/CA+HiwqG5e8pk8s7+7zhr1Nc_PGyhEdM5f=pHkMOdK1RYWXfJsg@mail.gmail.com

[3]: /messages/by-id/CA+TgmoaiTNj4DgQy42OT9JmTTP1NWcMV+ke0i=+a7=VgnzqGXw@mail.gmail.com

[4]: /messages/by-id/CA+Tgmoa1DCQ0MdojD9o6Ppbfj=abXxe4FUkwA4O_6qBHwOMVjw@mail.gmail.com

[5]: /messages/by-id/CA+TgmoaiTNj4DgQy42OT9JmTTP1NWcMV+ke0i=+a7=VgnzqGXw@mail.gmail.com

Attachments:

v1-0001-Explicitly-pass-snapshot-necessary-for-omit_detac.patch (+109 -61)
v1-0002-Avoid-using-SPI-in-RI-trigger-functions.patch (+488 -110)
v1-0003-Avoid-using-an-SQL-query-for-some-RI-checks.patch (+918 -207)
#2 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Langote (#1)
Re: Eliminating SPI / SQL from some RI triggers - take 3

On Fri, Dec 20, 2024 at 1:23 PM Amit Langote <amitlangote09@gmail.com> wrote:

We discussed $subject at [1] and [2] and I'd like to continue that
work with the hope to commit some part of it for v18.

I did not get a chance to do any further work on this in this cycle,
but plan to start working on it after beta release, so moving this to
the next CF. I will post a rebased patch after the freeze to keep the
bots green for now.

--
Thanks, Amit Langote

#3 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Amit Langote (#2)
Re: Eliminating SPI / SQL from some RI triggers - take 3

On Thu, Apr 3, 2025 at 7:19 PM Amit Langote <amitlangote09@gmail.com> wrote:

On Fri, Dec 20, 2024 at 1:23 PM Amit Langote <amitlangote09@gmail.com> wrote:

We discussed $subject at [1] and [2] and I'd like to continue that
work with the hope to commit some part of it for v18.

I did not get a chance to do any further work on this in this cycle,
but plan to start working on it after beta release, so moving this to
the next CF. I will post a rebased patch after the freeze to keep the
bots green for now.

Sorry for the inactivity. I've moved the patch entry in the CF app to
PG19-Drafts, since I don't plan to work on it myself in the immediate
future. However, Junwang Zhao has expressed interest in taking this
work forward, and I look forward to working with him on it.

--
Thanks, Amit Langote

#4 Pavel Stehule
pavel.stehule@gmail.com
In reply to: Amit Langote (#3)
Re: Eliminating SPI / SQL from some RI triggers - take 3

Hi

On Tue, Oct 21, 2025 at 6:07 AM Amit Langote <amitlangote09@gmail.com>
wrote:

On Thu, Apr 3, 2025 at 7:19 PM Amit Langote <amitlangote09@gmail.com>
wrote:

On Fri, Dec 20, 2024 at 1:23 PM Amit Langote <amitlangote09@gmail.com>

wrote:

We discussed $subject at [1] and [2] and I'd like to continue that
work with the hope to commit some part of it for v18.

I did not get a chance to do any further work on this in this cycle,
but plan to start working on it after beta release, so moving this to
the next CF. I will post a rebased patch after the freeze to keep the
bots green for now.

Sorry for the inactivity. I've moved the patch entry in the CF app to
PG19-Drafts, since I don't plan to work on it myself in the immediate
future. However, Junwang Zhao has expressed interest in taking this
work forward, and I look forward to working with him on it.

This is a very interesting and important feature - I can help with
testing and review if necessary

Regards

Pavel


--
Thanks, Amit Langote

#5 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Pavel Stehule (#4)
Re: Eliminating SPI / SQL from some RI triggers - take 3

On Tue, Oct 21, 2025 at 2:10 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:

On Tue, Oct 21, 2025 at 6:07 AM Amit Langote <amitlangote09@gmail.com> wrote:

On Thu, Apr 3, 2025 at 7:19 PM Amit Langote <amitlangote09@gmail.com> wrote:

On Fri, Dec 20, 2024 at 1:23 PM Amit Langote <amitlangote09@gmail.com> wrote:

We discussed $subject at [1] and [2] and I'd like to continue that
work with the hope to commit some part of it for v18.

I did not get a chance to do any further work on this in this cycle,
but plan to start working on it after beta release, so moving this to
the next CF. I will post a rebased patch after the freeze to keep the
bots green for now.

Sorry for the inactivity. I've moved the patch entry in the CF app to
PG19-Drafts, since I don't plan to work on it myself in the immediate
future. However, Junwang Zhao has expressed interest in taking this
work forward, and I look forward to working with him on it.

This is very interesting and important feature - I can help with testing and review if it will be necessary

Thanks for the interest.

Just to add a quick note on the current direction I’ve been discussing
off-list with Junwang:

The next iteration of this work will likely follow a hybrid "fast-path
+ fallback" design rather than the original pure fast-path approach.
The idea is to keep the optimization for straightforward cases where
the foreign key and referenced key can be verified by a direct index
probe, while falling back to the existing SPI path only when the
runtime behavior of the executor is non-trivial to replicate -- such
as visibility rechecks under concurrent updates -- or when the
constraint itself involves richer semantics, like temporal foreign
keys that require range and aggregation logic. That keeps the
optimization safe without changing the meaning of constraint
enforcement.

This direction comes partly in response to the feedback from Robert
and Tom in the earlier Eliminating SPI threads, who raised concerns
that a fast path might silently diverge from what the executor does at
runtime in subtle cases. The fallback design aims to address that
directly: it keeps the optimization where it’s clearly safe, but
defers to the existing SPI-based implementation whenever correctness
might depend on executor behavior that would otherwise be difficult or
risky to reproduce locally.

In practice, this means adding a guarded fast path that performs the
index probe and tuple lock directly under the same snapshot and
security context that SPI would use, while caching stable metadata
such as index descriptors, scan keys, and operator information per
constraint or per statement. The fallback to SPI remains for the few
cases that either depend on executor behavior or need features beyond
a simple index probe:

* Concurrent updates or deletes: If table_tuple_lock() reports that
the target tuple was updated or deleted, we delegate to the SPI path
so that EvalPlanQual and visibility rules are applied as today.

* Partitioned parents: Skipped in v1 for simplicity, since they
require routing the probe through the correct partition using
PartitionDirectory. This can be added later as a separate patch once
the core mechanism is stable.

* Temporal foreign keys: These use range overlap and containment
semantics (&&, <@, range_agg()) that inherently involve aggregation
and multiple-row reasoning, so they stay on the SPI path.

Everything else -- multi-column keys, cross-type equality supported by
the index opfamily, collation matching, and RLS/ACL enforcement --
will be handled directly in the fast path. The security behavior will
mirror the existing SPI path by temporarily switching to the parent
table's owner with SECURITY_LOCAL_USERID_CHANGE | SECURITY_NOFORCE_RLS
around the probe, like ri_PerformCheck() does.

For concurrency, the fast path locks the located parent tuple with
LockTupleKeyShare under GetActiveSnapshot(). If that succeeds (TM_Ok),
the check passes immediately. While non-TM_Ok cases fall back for now,
a later refinement could follow the update chain with
table_tuple_fetch_row_version() under the current snapshot and re-lock
the visible version, making the fast path fully self-contained.
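The gating above boils down to two small decisions, sketched here as standalone C with illustrative names (LockResult merely mirrors the executor's TM_Result for the toy's purposes; none of this is the patch's actual code):

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the idea of TM_Ok / TM_Updated / TM_Deleted, for illustration. */
typedef enum
{
    TM_OK,
    TM_UPDATED,
    TM_DELETED
} LockResult;

typedef enum
{
    RI_FAST_PATH,
    RI_SPI_FALLBACK
} RI_CheckPath;

/*
 * Static properties of the constraint decide whether the fast path may
 * be tried at all: no partitioned parents, no temporal FKs in v1.
 */
static RI_CheckPath
ri_choose_path(bool pk_is_partitioned, bool is_temporal)
{
    if (pk_is_partitioned || is_temporal)
        return RI_SPI_FALLBACK;
    return RI_FAST_PATH;
}

/*
 * After the index probe, the fast path only completes on a clean tuple
 * lock; any concurrent update/delete delegates to SPI so that
 * EvalPlanQual and visibility rules apply unchanged.
 */
static bool
ri_fast_path_done(LockResult res)
{
    return res == TM_OK;
}
```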

That’s the direction Junwang and I plan to explore next.

--
Thanks, Amit Langote

#6 Junwang Zhao
zhjwpku@gmail.com
In reply to: Amit Langote (#5)
Re: Eliminating SPI / SQL from some RI triggers - take 3

Hi,

On Wed, Oct 22, 2025 at 9:56 PM Amit Langote <amitlangote09@gmail.com> wrote:

On Tue, Oct 21, 2025 at 2:10 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:

On Tue, Oct 21, 2025 at 6:07 AM Amit Langote <amitlangote09@gmail.com> wrote:

On Thu, Apr 3, 2025 at 7:19 PM Amit Langote <amitlangote09@gmail.com> wrote:

On Fri, Dec 20, 2024 at 1:23 PM Amit Langote <amitlangote09@gmail.com> wrote:

We discussed $subject at [1] and [2] and I'd like to continue that
work with the hope to commit some part of it for v18.

I did not get a chance to do any further work on this in this cycle,
but plan to start working on it after beta release, so moving this to
the next CF. I will post a rebased patch after the freeze to keep the
bots green for now.

Sorry for the inactivity. I've moved the patch entry in the CF app to
PG19-Drafts, since I don't plan to work on it myself in the immediate
future. However, Junwang Zhao has expressed interest in taking this
work forward, and I look forward to working with him on it.

This is very interesting and important feature - I can help with testing and review if it will be necessary

Thanks for the interest.

Just to add a quick note on the current direction I’ve been discussing
off-list with Junwang:

The next iteration of this work will likely follow a hybrid "fast-path
+ fallback" design rather than the original pure fast-path approach.
The idea is to keep the optimization for straightforward cases where
the foreign key and referenced key can be verified by a direct index
probe, while falling back to the existing SPI path only when the
runtime behavior of the executor is non-trivial to replicate -- such
as visibility rechecks under concurrent updates -- or when the
constraint itself involves richer semantics, like temporal foreign
keys that require range and aggregation logic. That keeps the
optimization safe without changing the meaning of constraint
enforcement.

This direction comes partly in response to the feedback from Robert
and Tom in the earlier Eliminating SPI threads, who raised concerns
that a fast path might silently diverge from what the executor does at
runtime in subtle cases. The fallback design aims to address that
directly: it keeps the optimization where it’s clearly safe, but
defers to the existing SPI-based implementation whenever correctness
might depend on executor behavior that would otherwise be difficult or
risky to reproduce locally.

In practice, this means adding a guarded fast path that performs the
index probe and tuple lock directly under the same snapshot and
security context that SPI would use, while caching stable metadata
such as index descriptors, scan keys, and operator information per
constraint or per statement. The fallback to SPI remains for the few
cases that either depend on executor behavior or need features beyond
a simple index probe:

* Concurrent updates or deletes: If table_tuple_lock() reports that
the target tuple was updated or deleted, we delegate to the SPI path
so that EvalPlanQual and visibility rules are applied as today.

* Partitioned parents: Skipped in v1 for simplicity, since they
require routing the probe through the correct partition using
PartitionDirectory. This can be added later as a separate patch once
the core mechanism is stable.

* Temporal foreign keys: These use range overlap and containment
semantics (&&, <@, range_agg()) that inherently involve aggregation
and multiple-row reasoning, so they stay on the SPI path.

Everything else -- multi-column keys, cross-type equality supported by
the index opfamily, collation matching, and RLS/ACL enforcement --
will be handled directly in the fast path. The security behavior will
mirror the existing SPI path by temporarily switching to the parent
table's owner with SECURITY_LOCAL_USERID_CHANGE | SECURITY_NOFORCE_RLS
around the probe, like ri_PerformCheck() does.

For concurrency, the fast path locks the located parent tuple with
LockTupleKeyShare under GetActiveSnapshot(). If that succeeds (TM_Ok),
the check passes immediately. While non-TM_Ok cases fall back for now,
a later refinement could follow the update chain with
table_tuple_fetch_row_version() under the current snapshot and re-lock
the visible version, making the fast path fully self-contained.

That’s the direction Junwang and I plan to explore next.

--
Thanks, Amit Langote

As Amit has already stated, we are approaching a hybrid "fast-path + fallback"
design.

0001 adds a fast-path optimization for foreign key constraint checks
that bypasses the SPI executor. The fast path applies when the
referenced table is not partitioned and the constraint does not
involve temporal semantics.

With the following test:

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);

head:

[local] zhjwpku@postgres:5432-90419=# insert into fk select
generate_series(1, 2000000, 2);
INSERT 0 1000000
Time: 13516.177 ms (00:13.516)

[local] zhjwpku@postgres:5432-90419=# update fk set a = a + 1;
UPDATE 1000000
Time: 15057.638 ms (00:15.058)

patched:

[local] zhjwpku@postgres:5432-98673=# insert into fk select
generate_series(1, 2000000, 2);
INSERT 0 1000000
Time: 8248.777 ms (00:08.249)

[local] zhjwpku@postgres:5432-98673=# update fk set a = a + 1;
UPDATE 1000000
Time: 10117.002 ms (00:10.117)

0002 caches fast-path metadata used by the index probe; at the moment
only comparison operator hash entries, operator function OIDs, and
strategy numbers and subtypes for index scans. But this cache doesn't
buy any performance improvement by itself.

Caching additional metadata should improve performance for foreign key checks.

Amit suggested introducing a mechanism for ri_triggers.c to register a
cleanup callback in the EState, which AfterTriggerEndQuery() could then
invoke to release per-statement cached metadata (such as the IndexScanDesc).
However, I haven't been able to implement this mechanism yet.
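The mechanism Amit suggested can be modeled as a small registration list that the end-of-query hook walks and then clears. The names below are illustrative stand-ins, with the real thing presumably hanging off EState and being invoked from AfterTriggerEndQuery():

```c
#include <assert.h>
#include <stddef.h>

typedef void (*CleanupCallback) (void *arg);

#define MAX_CALLBACKS 8         /* illustrative fixed capacity */

/* Toy stand-in for a per-EState list of cleanup callbacks. */
typedef struct
{
    CleanupCallback cbs[MAX_CALLBACKS];
    void       *args[MAX_CALLBACKS];
    int         ncbs;
} CleanupList;

/* What ri_triggers.c would call to register per-statement cleanup. */
static void
register_cleanup(CleanupList *list, CleanupCallback cb, void *arg)
{
    assert(list->ncbs < MAX_CALLBACKS);
    list->cbs[list->ncbs] = cb;
    list->args[list->ncbs] = arg;
    list->ncbs++;
}

/* What the end-of-query hook would do: run each callback, then clear. */
static void
run_cleanups(CleanupList *list)
{
    for (int i = 0; i < list->ncbs; i++)
        list->cbs[i] (list->args[i]);
    list->ncbs = 0;
}
```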

Amit and I agree that we can post the patches here for review now. We are
continuing to work on improving the metadata cache implementation.

--
Regards
Junwang Zhao

Attachments:

v2-0001-Add-fast-path-for-foreign-key-constraint-checks.patch (+635 -74)
v2-0002-Cache-fast-path-metadata-for-foreign-key-checks.patch (+65 -26)
#7 Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Junwang Zhao (#6)
Re: Eliminating SPI / SQL from some RI triggers - take 3

Hi Junwang,

On Mon, Dec 1, 2025 at 3:09 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

As Amit has already stated, we are approaching a hybrid "fast-path + fallback"
design.

0001 adds a fast path optimization for foreign key constraint checks
that bypasses the SPI executor, the fast path applies when the referenced
table is not partitioned, and the constraint does not involve temporal
semantics.

With the following test:

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);

head:

[local] zhjwpku@postgres:5432-90419=# insert into fk select
generate_series(1, 2000000, 2);
INSERT 0 1000000
Time: 13516.177 ms (00:13.516)

[local] zhjwpku@postgres:5432-90419=# update fk set a = a + 1;
UPDATE 1000000
Time: 15057.638 ms (00:15.058)

patched:

[local] zhjwpku@postgres:5432-98673=# insert into fk select
generate_series(1, 2000000, 2);
INSERT 0 1000000
Time: 8248.777 ms (00:08.249)

[local] zhjwpku@postgres:5432-98673=# update fk set a = a + 1;
UPDATE 1000000
Time: 10117.002 ms (00:10.117)

0002 cache fast-path metadata used by the index probe, at the current
time only comparison operator hash entries, operator function OIDs
and strategy numbers and subtypes for index scans. But this cache
doesn't buy any performance improvement.

Caching additional metadata should improve performance for foreign key checks.

Amit suggested introducing a mechanism for ri_triggers.c to register a
cleanup callback in the EState, which AfterTriggerEndQuery() could then
invoke to release per-statement cached metadata (such as the IndexScanDesc).
However, I haven't been able to implement this mechanism yet.

Thanks for working on this. I've taken your patches as a starting
point and reworked the series into two patches (attached): 1st is your
0001+0002 as the core patch that adds a gated fast-path alternative to
SPI and 2nd where I added per-statement resource caching. Doing the
latter turned out to be not so hard thanks to the structure you chose
to build the core fast path. Good call on adding the RLS and ACL test
cases, btw.

So, 0001 is a functionally complete fast path: concurrency handling,
REPEATABLE READ crosscheck, cross-type operators, security context,
and metadata caching. 0002 implements the per-statement resource
caching we discussed, though instead of sharing the EState between
trigger.c and ri_triggers.c it uses a new AfterTriggerBatchCallback
mechanism that fires at the end of each trigger-firing cycle
(per-statement for immediate constraints, or until COMMIT for deferred
ones). It layers resource caching on top so that the PK relation,
index, scan descriptor, and snapshot stay open across all FK trigger
invocations within a single trigger-firing cycle rather than being
opened and closed per row.

Note that the previous 0002 (metadata caching) is folded into 0001,
and most of the new fast-path logic added in 0001 now lives in
ri_FastPathCheck() rather than inline in RI_FKey_check(), so the
RI_FKey_check diff is just the gating call and SPI fallback.

I re-ran the benchmarks (same test as yours, different machine):

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);
insert into fk select generate_series(1, 2000000, 2);

master: 2444 ms (median of 3 runs)
0001: 1382 ms (43% faster)
0001+0002: 1202 ms (51% faster, 13% over 0001 alone)

Also, with int PK / int FK (1M rows):

create table pk (a int primary key);
create table fk (a int references pk);
insert into pk select generate_series(1, 1000000);
insert into fk select generate_series(1, 1000000);

master: 1000 ms
0001: 520 ms (48% faster)
0001+0002: 432 ms (57% faster, 17% over 0001 alone)

The incremental gain from 0002 comes from eliminating per-row relation
open/close, scan begin/end, slot alloc/free, and replacing per-row
GetSnapshotData() with only curcid adjustment on the registered
snapshot copy in the cache.

The two current limitations are partitioned referenced tables and
temporal foreign keys. Partitioned PKs are relatively uncommon in
practice, so the non-partitioned case should cover most FK workloads;
I'm not sure it's worth the added complexity to support them.
Temporal FKs are inherently multi-row, so they're a poor fit for a
single-probe fast path.

David Rowley mentioned off-list that it might be worth batching
multiple FK values into a single index probe, leveraging the
ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
to buffer FK values across trigger invocations in the per-constraint
cache (0002 already has the right structure for this), build a
SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
pages in one sorted traversal instead of one tree descent per row. The
locking and recheck would still be per-tuple, but the index traversal
cost drops significantly. Single-column FKs are the obvious starting
point. That seems worth exploring but can be done as a separate patch
on top of this.
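As a toy model of the batching idea (RI_BATCH_SIZE, FKBatch, and fk_batch_add are illustrative, not anything in the patches): buffer incoming FK values across trigger invocations and, when the batch fills, sort them so a single SK_SEARCHARRAY-style probe can verify them all in one ordered index traversal.

```c
#include <assert.h>
#include <stdlib.h>

#define RI_BATCH_SIZE 4         /* illustrative; a real patch would tune this */

typedef struct
{
    int         values[RI_BATCH_SIZE];
    int         nvalues;
} FKBatch;

static int
cmp_int(const void *a, const void *b)
{
    int         ia = *(const int *) a;
    int         ib = *(const int *) b;

    return (ia > ib) - (ia < ib);
}

/*
 * Buffer one FK value.  Returns 0 while buffering; when the batch
 * fills, sorts it (so the probe can walk the matching leaf pages in
 * one ordered traversal rather than one tree descent per row) and
 * returns the count ready to probe.  The caller probes
 * batch->values[0..n-1], then resets nvalues to reuse the batch.
 */
static int
fk_batch_add(FKBatch *batch, int value)
{
    batch->values[batch->nvalues++] = value;
    if (batch->nvalues < RI_BATCH_SIZE)
        return 0;
    qsort(batch->values, batch->nvalues, sizeof(int), cmp_int);
    return batch->nvalues;
}
```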

I think the series is in reasonable shape but would appreciate extra
eyeballs, especially on the concurrency handling in ri_LockPKTuple()
in 0001 and the snapshot lifecycle in 0002. Or anything else that
catches one's eye.

--
Thanks, Amit Langote

Attachments:

v3-0001-Add-fast-path-for-foreign-key-constraint-checks.patch (+665 -13)
v3-0002-Cache-per-batch-resources-for-fast-path-foreign-k.patch (+435 -15)
#8 Junwang Zhao
zhjwpku@gmail.com
In reply to: Amit Langote (#7)
Re: Eliminating SPI / SQL from some RI triggers - take 3

Hi Amit,

On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <amitlangote09@gmail.com> wrote:

Hi Junwang,

On Mon, Dec 1, 2025 at 3:09 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

As Amit has already stated, we are approaching a hybrid "fast-path + fallback"
design.

0001 adds a fast path optimization for foreign key constraint checks
that bypasses the SPI executor, the fast path applies when the referenced
table is not partitioned, and the constraint does not involve temporal
semantics.

With the following test:

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);

head:

[local] zhjwpku@postgres:5432-90419=# insert into fk select
generate_series(1, 2000000, 2);
INSERT 0 1000000
Time: 13516.177 ms (00:13.516)

[local] zhjwpku@postgres:5432-90419=# update fk set a = a + 1;
UPDATE 1000000
Time: 15057.638 ms (00:15.058)

patched:

[local] zhjwpku@postgres:5432-98673=# insert into fk select
generate_series(1, 2000000, 2);
INSERT 0 1000000
Time: 8248.777 ms (00:08.249)

[local] zhjwpku@postgres:5432-98673=# update fk set a = a + 1;
UPDATE 1000000
Time: 10117.002 ms (00:10.117)

0002 cache fast-path metadata used by the index probe, at the current
time only comparison operator hash entries, operator function OIDs
and strategy numbers and subtypes for index scans. But this cache
doesn't buy any performance improvement.

Caching additional metadata should improve performance for foreign key checks.

Amit suggested introducing a mechanism for ri_triggers.c to register a
cleanup callback in the EState, which AfterTriggerEndQuery() could then
invoke to release per-statement cached metadata (such as the IndexScanDesc).
However, I haven't been able to implement this mechanism yet.

Thanks for working on this. I've taken your patches as a starting
point and reworked the series into two patches (attached): 1st is your
0001+0002 as the core patch that adds a gated fast-path alternative to
SPI and 2nd where I added per-statement resource caching. Doing the
latter turned out to be not so hard thanks to the structure you chose
to build the core fast path. Good call on adding the RLS and ACL test
cases, btw.

So, 0001 is a functionally complete fast path: concurrency handling,
REPEATABLE READ crosscheck, cross-type operators, security context,
and metadata caching. 0002 implements the per-statement resource
caching we discussed, though instead of sharing the EState between
trigger.c and ri_triggers.c it uses a new AfterTriggerBatchCallback
mechanism that fires at the end of each trigger-firing cycle
(per-statement for immediate constraints, or until COMMIT for deferred
ones). It layers resource caching on top so that the PK relation,
index, scan descriptor, and snapshot stay open across all FK trigger
invocations within a single trigger-firing cycle rather than being
opened and closed per row.

Note that the previous 0002 (metadata caching) is folded into 0001,
and most of the new fast-path logic added in 0001 now lives in
ri_FastPathCheck() rather than inline in RI_FKey_check(), so the
RI_FKey_check diff is just the gating call and SPI fallback.

I re-ran the benchmarks (same test as yours, different machine):

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);
insert into fk select generate_series(1, 2000000, 2);

master: 2444 ms (median of 3 runs)
0001: 1382 ms (43% faster)
0001+0002: 1202 ms (51% faster, 13% over 0001 alone)

I can get a similar improvement on my old Intel Mac:

master: 12963.993 ms
0001: 6641.692 ms, 48.8% faster
0001+0002: 5771.703 ms, 55.5% faster

Also, with int PK / int FK (1M rows):

create table pk (a int primary key);
create table fk (a int references pk);
insert into pk select generate_series(1, 1000000);
insert into fk select generate_series(1, 1000000);

master: 1000 ms
0001: 520 ms (48% faster)
0001+0002: 432 ms (57% faster, 17% over 0001 alone)

master: 11134.583 ms
0001: 5240.298 ms, 52.9% faster
0001+0002: 4554.215 ms, 59.1% faster

The incremental gain from 0002 comes from eliminating per-row relation
open/close, scan begin/end, slot alloc/free, and replacing per-row
GetSnapshotData() with only curcid adjustment on the registered
snapshot copy in the cache.

The two current limitations are partitioned referenced tables and
temporal foreign keys. Partitioned PKs are relatively uncommon in
practice, so the non-partitioned case should cover most FK workloads,
so I'm not sure it's worth the added complexity to support them.
Temporal FKs are inherently multi-row, so they're a poor fit for a
single-probe fast path.

David Rowley mentioned off-list that it might be worth batching
multiple FK values into a single index probe, leveraging the
ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
to buffer FK values across trigger invocations in the per-constraint
cache (0002 already has the right structure for this), build a
SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
pages in one sorted traversal instead of one tree descent per row. The
locking and recheck would still be per-tuple, but the index traversal
cost drops significantly. Single-column FKs are the obvious starting
point. That seems worth exploring but can be done as a separate patch
on top of this.

I will take a look at this in the following weeks.

I think the series is in reasonable shape but would appreciate extra
eyeballs, especially on the concurrency handling in ri_LockPKTuple()
in 0001 and the snapshot lifecycle in 0002. Or anything else that
catches one's eye.

--
Thanks, Amit Langote

I don't have any additional comments on the patch except one minor
nit: maybe merge the following two if conditions into one. Not a
strong opinion, though.

if (use_cache)
{
    /*
     * The snapshot was registered once when the cache entry was created.
     * We just patch curcid to reflect the new command counter.
     * SnapshotSetCommandId() only patches process-global statics, not
     * registered copies, so we do it directly.
     *
     * The xmin/xmax/xip fields don't need refreshing: within a single
     * statement batch, only curcid changes between rows.
     */
    Assert(fpentry && fpentry->snapshot != NULL);
    snapshot = fpentry->snapshot;
    snapshot->curcid = GetCurrentCommandId(false);
}
else
    snapshot = RegisterSnapshot(GetLatestSnapshot());

if (use_cache)
{
    pk_rel = fpentry->pk_rel;
    idx_rel = fpentry->idx_rel;
    scandesc = fpentry->scandesc;
    slot = fpentry->slot;
}
else
{
    pk_rel = table_open(riinfo->pk_relid, RowShareLock);
    idx_rel = index_open(riinfo->conindid, AccessShareLock);
    scandesc = index_beginscan(pk_rel, idx_rel,
                               snapshot, NULL,
                               riinfo->nkeys, 0);
    slot = table_slot_create(pk_rel, NULL);
}

--
Regards
Junwang Zhao

#9Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Junwang Zhao (#8)
Re: Eliminating SPI / SQL from some RI triggers - take 3

Hi Junwang,

On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <amitlangote09@gmail.com> wrote:

I re-ran the benchmarks (same test as yours, different machine):

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);
insert into fk select generate_series(1, 2000000, 2);

master: 2444 ms (median of 3 runs)
0001: 1382 ms (43% faster)
0001+0002: 1202 ms (51% faster, 13% over 0001 alone)

I can get similar improvement on my old mac intel chip:

master: 12963.993 ms
0001: 6641.692 ms, 48.8% faster
0001+0002: 5771.703 ms, 55.5% faster

Also, with int PK / int FK (1M rows):

create table pk (a int primary key);
create table fk (a int references pk);
insert into pk select generate_series(1, 1000000);
insert into fk select generate_series(1, 1000000);

master: 1000 ms
0001: 520 ms (48% faster)
0001+0002: 432 ms (57% faster, 17% over 0001 alone)

master: 11134.583 ms
0001: 5240.298 ms, 52.9% faster
0001+0002: 4554.215 ms, 59.1% faster

Thanks for testing, good to see similar numbers. I had forgotten to
note that these results are for the case where the PK index probes
don't do any I/O, though you might be aware of that. Below, I report
some numbers that Tomas Vondra shared with me off-list where the probes
do have to perform I/O, and there the benefits from this patch set
alone are only marginal.

I don't have any additional comments on the patch except one minor nit,
maybe merge the following two if conditions into one, not a strong opinion
though.

if (use_cache)
{
    /*
     * The snapshot was registered once when the cache entry was created.
     * We just patch curcid to reflect the new command counter.
     * SnapshotSetCommandId() only patches process-global statics, not
     * registered copies, so we do it directly.
     *
     * The xmin/xmax/xip fields don't need refreshing: within a single
     * statement batch, only curcid changes between rows.
     */
    Assert(fpentry && fpentry->snapshot != NULL);
    snapshot = fpentry->snapshot;
    snapshot->curcid = GetCurrentCommandId(false);
}
else
    snapshot = RegisterSnapshot(GetLatestSnapshot());

if (use_cache)
{
    pk_rel = fpentry->pk_rel;
    idx_rel = fpentry->idx_rel;
    scandesc = fpentry->scandesc;
    slot = fpentry->slot;
}
else
{
    pk_rel = table_open(riinfo->pk_relid, RowShareLock);
    idx_rel = index_open(riinfo->conindid, AccessShareLock);
    scandesc = index_beginscan(pk_rel, idx_rel,
                               snapshot, NULL,
                               riinfo->nkeys, 0);
    slot = table_slot_create(pk_rel, NULL);
}

Good idea, done.

While polishing 0002, I revisited the snapshot caching semantics. The
previous commit message hand-waved about only curcid changing between
rows, but GetLatestSnapshot() also reflects other backends' commits,
so reusing the snapshot is a deliberate semantic change from the SPI
path. I think it's safe: curcid is all we need for intra-statement
visibility; concurrent commits either happened before our snapshot (and
are visible) or are racing with our statement and wouldn't be seen
reliably even with per-row snapshots, since the order in which FK rows
are checked is nondeterministic; and LockTupleKeyShare prevents the PK
row from disappearing regardless. In
essence, we're treating all the FK checks within a trigger-firing
cycle as a single plan execution that happens to scan N rows, rather
than N independent SPI queries each taking a fresh snapshot. That's
the natural model -- a normal SELECT ... FOR KEY SHARE plan doesn't
re-take GetLatestSnapshot() between rows either.

Similarly, the permission check (schema USAGE + table SELECT) is now
done once at cache entry creation in ri_FastPathGetEntry() rather than
on every flush. The RI check runs as the PK table owner, so we're
verifying that the owner can access their own table -- a condition
that won't change unless someone explicitly revokes from the owner,
which would also break the SPI path.

David Rowley mentioned off-list that it might be worth batching
multiple FK values into a single index probe, leveraging the
ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
to buffer FK values across trigger invocations in the per-constraint
cache (0002 already has the right structure for this), build a
SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
pages in one sorted traversal instead of one tree descent per row. The
locking and recheck would still be per-tuple, but the index traversal
cost drops significantly. Single-column FKs are the obvious starting
point. That seems worth exploring but can be done as a separate patch
on top of this.

I will take a look at this in the following weeks.

I ended up going ahead with the batching and SAOP idea that David
mentioned -- I had a proof-of-concept working shortly after posting v3
and kept iterating on it. So attached set is now:

0001 - Core fast path (your 0001+0002 reworked, as before)

0002 - Per-batch resource caching (PK relation, index, scandesc, snapshot)

0003 - FK row buffering: materialize FK tuples into a per-constraint
batch buffer (64 rows), flush when full or at batch end

0004 - SK_SEARCHARRAY for single-column FKs: build an array from the
buffered FK values and do one index scan instead of 64 separate tree
descents. Multi-column FKs fall back to a per-row loop.

0003 is pure infrastructure -- it doesn't improve performance on its
own because the per-row index descent still dominates. The payoff
comes in 0004.

Numbers (same machine as before, median of 3 runs):

numeric PK / bigint FK, 1M rows:
master: 2487 ms
0001..0004: 1168 ms (2.1x)

int PK / int FK, 500K rows:
master: 1043 ms
0001..0004: 335 ms (3.1x)

The int/int case benefits most because the per-row cost is lower, so
the SAOP traversal savings are a larger fraction of the total. The
numeric/bigint case still sees a solid improvement despite the
cross-type cast overhead.

Tomas Vondra also tested with an I/O-intensive workload (dataset
larger than shared_buffers, combined with his and Peter Geoghegan's
I/O prefetching patches) and confirmed that the batching + SAOP
approach helps there too, not just in the CPU-bound / memory-resident
case. In fact he showed that the patches here don't make a big dent
when the main bottleneck is I/O as shown in numbers that he shared in
an off-list email:

master: 161617 ms
ri-check (0001..0004): 149446 ms (1.08x)
ri-check + i/o prefetching: 50885 ms (3.2x)

So the RI patches alone only give ~8% here since most time is waiting
on reads. But the batching gives the prefetch machinery a window of
upcoming probes to issue readahead against, so the two together yield
3.2x.

Tomas also caught a memory context bug in the batch flush path: the
cached scandesc lives in TopTransactionContext, but the btree AM
defers _bt_preprocess_keys allocation to the first getnext call, which
pallocs into CurrentMemoryContext. If that's a short-lived
per-trigger-row context, the scandesc has dangling pointers on the
next rescan. Fixed by switching to TopTransactionContext before the
probe loop.

Finally, I've fixed a number of other small and not-so-small bugs
found while polishing the old patches and made other stylistic
improvements. One notable change is that I introduced a FastPathMeta
struct to store the fast path metadata instead of dumping those arrays
in the RI_ConstraintInfo. It's allocated lazily on first use and holds
the per-key compare entries, operator procedures, and index strategy
info needed by the scan key construction, so RI_ConstraintInfo doesn't
pay for them when the fast path isn't used.

On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

Hi Amit,

On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <amitlangote09@gmail.com> wrote:

Hi Junwang,

On Mon, Dec 1, 2025 at 3:09 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

As Amit has already stated, we are approaching a hybrid "fast-path + fallback"
design.

0001 adds a fast path optimization for foreign key constraint checks
that bypasses the SPI executor, the fast path applies when the referenced
table is not partitioned, and the constraint does not involve temporal
semantics.

With the following test:

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);

head:

[local] zhjwpku@postgres:5432-90419=# insert into fk select
generate_series(1, 2000000, 2);
INSERT 0 1000000
Time: 13516.177 ms (00:13.516)

[local] zhjwpku@postgres:5432-90419=# update fk set a = a + 1;
UPDATE 1000000
Time: 15057.638 ms (00:15.058)

patched:

[local] zhjwpku@postgres:5432-98673=# insert into fk select
generate_series(1, 2000000, 2);
INSERT 0 1000000
Time: 8248.777 ms (00:08.249)

[local] zhjwpku@postgres:5432-98673=# update fk set a = a + 1;
UPDATE 1000000
Time: 10117.002 ms (00:10.117)

0002 caches fast-path metadata used by the index probe; at the current
time only comparison operator hash entries, operator function OIDs
and strategy numbers and subtypes for index scans. But this cache
doesn't buy any performance improvement.

Caching additional metadata should improve performance for foreign key checks.

Amit suggested introducing a mechanism for ri_triggers.c to register a
cleanup callback in the EState, which AfterTriggerEndQuery() could then
invoke to release per-statement cached metadata (such as the IndexScanDesc).
However, I haven't been able to implement this mechanism yet.

Thanks for working on this. I've taken your patches as a starting
point and reworked the series into two patches (attached): 1st is your
0001+0002 as the core patch that adds a gated fast-path alternative to
SPI and 2nd where I added per-statement resource caching. Doing the
latter turned out to be not so hard thanks to the structure you chose
to build the core fast path. Good call on adding the RLS and ACL test
cases, btw.

So, 0001 is a functionally complete fast path: concurrency handling,
REPEATABLE READ crosscheck, cross-type operators, security context,
and metadata caching. 0002 implements the per-statement resource
caching we discussed, though instead of sharing the EState between
trigger.c and ri_triggers.c it uses a new AfterTriggerBatchCallback
mechanism that fires at the end of each trigger-firing cycle
(per-statement for immediate constraints, or until COMMIT for deferred
ones). It layers resource caching on top so that the PK relation,
index, scan descriptor, and snapshot stay open across all FK trigger
invocations within a single trigger-firing cycle rather than being
opened and closed per row.

Note that the previous 0002 (metadata caching) is folded into 0001,
and most of the new fast-path logic added in 0001 now lives in
ri_FastPathCheck() rather than inline in RI_FKey_check(), so the
RI_FKey_check diff is just the gating call and SPI fallback.

I re-ran the benchmarks (same test as yours, different machine):

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);
insert into fk select generate_series(1, 2000000, 2);

master: 2444 ms (median of 3 runs)
0001: 1382 ms (43% faster)
0001+0002: 1202 ms (51% faster, 13% over 0001 alone)

I can get similar improvement on my old mac intel chip:

master: 12963.993 ms
0001: 6641.692 ms, 48.8% faster
0001+0002: 5771.703 ms, 55.5% faster

Also, with int PK / int FK (1M rows):

create table pk (a int primary key);
create table fk (a int references pk);
insert into pk select generate_series(1, 1000000);
insert into fk select generate_series(1, 1000000);

master: 1000 ms
0001: 520 ms (48% faster)
0001+0002: 432 ms (57% faster, 17% over 0001 alone)

master: 11134.583 ms
0001: 5240.298 ms, 52.9% faster
0001+0002: 4554.215 ms, 59.1% faster

The incremental gain from 0002 comes from eliminating per-row relation
open/close, scan begin/end, slot alloc/free, and replacing per-row
GetSnapshotData() with only curcid adjustment on the registered
snapshot copy in the cache.

The two current limitations are partitioned referenced tables and
temporal foreign keys. Partitioned PKs are relatively uncommon in
practice, so the non-partitioned case should cover most FK workloads;
I'm not sure it's worth the added complexity to support them.
Temporal FKs are inherently multi-row, so they're a poor fit for a
single-probe fast path.

David Rowley mentioned off-list that it might be worth batching
multiple FK values into a single index probe, leveraging the
ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
to buffer FK values across trigger invocations in the per-constraint
cache (0002 already has the right structure for this), build a
SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
pages in one sorted traversal instead of one tree descent per row. The
locking and recheck would still be per-tuple, but the index traversal
cost drops significantly. Single-column FKs are the obvious starting
point. That seems worth exploring but can be done as a separate patch
on top of this.

I will take a look at this in the following weeks.

I think the series is in reasonable shape but would appreciate extra
eyeballs, especially on the concurrency handling in ri_LockPKTuple()
in 0001 and the snapshot lifecycle in 0002. Or anything else that
catches one's eye.

--
Thanks, Amit Langote

I don't have any additional comments on the patch except one minor nit,
maybe merge the following two if conditions into one, not a strong opinion
though.

if (use_cache)
{
    /*
     * The snapshot was registered once when the cache entry was created.
     * We just patch curcid to reflect the new command counter.
     * SnapshotSetCommandId() only patches process-global statics, not
     * registered copies, so we do it directly.
     *
     * The xmin/xmax/xip fields don't need refreshing: within a single
     * statement batch, only curcid changes between rows.
     */
    Assert(fpentry && fpentry->snapshot != NULL);
    snapshot = fpentry->snapshot;
    snapshot->curcid = GetCurrentCommandId(false);
}
else
    snapshot = RegisterSnapshot(GetLatestSnapshot());

if (use_cache)
{
    pk_rel = fpentry->pk_rel;
    idx_rel = fpentry->idx_rel;
    scandesc = fpentry->scandesc;
    slot = fpentry->slot;
}
else
{
    pk_rel = table_open(riinfo->pk_relid, RowShareLock);
    idx_rel = index_open(riinfo->conindid, AccessShareLock);
    scandesc = index_beginscan(pk_rel, idx_rel,
                               snapshot, NULL,
                               riinfo->nkeys, 0);
    slot = table_slot_create(pk_rel, NULL);
}

--
Regards
Junwang Zhao

--
Thanks, Amit Langote

Attachments:

0001-Add-fast-path-for-foreign-key-constraint-checks.patch
0002-Cache-per-batch-resources-for-fast-path-foreign-key-.patch
0003-Buffer-FK-rows-for-batched-fast-path-probing.patch
0004-Use-SK_SEARCHARRAY-for-batched-fast-path-FK-probes.patch
#10Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Langote (#9)
Re: Eliminating SPI / SQL from some RI triggers - take 3

On 2/28/26 08:08, Amit Langote wrote:

Hi Junwang,

...

Tomas Vondra also tested with an I/O-intensive workload (dataset
larger than shared_buffers, combined with his and Peter Geoghegan's
I/O prefetching patches) and confirmed that the batching + SAOP
approach helps there too, not just in the CPU-bound / memory-resident
case. In fact he showed that the patches here don't make a big dent
when the main bottleneck is I/O as shown in numbers that he shared in
an off-list email:

master: 161617 ms
ri-check (0001..0004): 149446 ms (1.08x)
ri-check + i/o prefetching: 50885 ms (3.2x)

So the RI patches alone only give ~8% here since most time is waiting
on reads. But the batching gives the prefetch machinery a window of
upcoming probes to issue readahead against, so the two together yield
3.2x.

I tested this (with the index prefetching v11 patch), because I wanted
to check if the revised API works fine for other use cases, not just the
regular index scans. Turns out the answer is "yes", the necessary tweaks
to the FK batching patch were pretty minimal, and at the same time it
did help quite a bit for cases bottle-necked on I/O.

FWIW I wonder how difficult it would be to do something like this for
inserts into indexes. It's an orthogonal issue to FK checks (especially
for the CPU-bound cases this thread focuses on), but it's a bit similar
to the I/O-bound case. In fact, I now realize I actually did a PoC for
that in 2023-11 [1], but it went stale ...

benchmarks
----------

Anyway, thinking about the CPU-bound case, I decided to do a bit of
testing on my own. I was wondering about three things:

(a) how does the improvement depend on data distribution
(b) could it cause regressions for small inserts
(c) how sensitive is the batch size

So I devised two simple benchmarks:

1) run-pattern.sh - Inserts batches of values into a table, both the
batch and table can be either random or sequential. It's either 100k or
1M rows, logged or unlogged, etc.

2) run-pgbench.sh - Runs short pgbench inserting data into a table,
similar to (1), but with very few rows - so the timing approach is not
suitable to measure this.

Both scripts run against master, and then patched branch with three
batch sizes (default 64, 16 and 256).

results
-------

The results are very positive - see the attached PDF files comparing the
patched builds to master.

I have not found a single case where the batching causes regressions.
This surprised me a bit; I'd expected small regressions for single-row
inserts in the pgbench test, but even that shows a small (~5%) gain.
Even just 2-row inserts show +25% improvement in pgbench throughput.

There are a couple cases where it matches master, I assume that's for
I/O bound cases where the CPU optimizations do not really matter. That's
expected, of course.

I don't see much sensitivity on the batch size. The 256 batches seem to
be a bit slower, but there's little difference between 16 and 64. So I'd
say 64 seems reasonable.

Overall, I think these results look quite good. I haven't looked at the
code very closely, not beyond adjusting it to work with index prefetch.

[1]: https://commitfest.postgresql.org/patch/4622/

--
Tomas Vondra

Attachments:

fk-batch-insert.csv
fk-batch-pgbench.csv
run-pattern.sh
run-pgbench.sh
fk-batching-patterns.pdf
fk-batching-pgbench.pdf
#11Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Tomas Vondra (#10)
Re: Eliminating SPI / SQL from some RI triggers - take 3

Hi Tomas,

Thanks for the thorough benchmarking.

On Sun, Mar 1, 2026 at 9:22 PM Tomas Vondra <tomas@vondra.me> wrote:

On 2/28/26 08:08, Amit Langote wrote:

Tomas Vondra also tested with an I/O-intensive workload (dataset
larger than shared_buffers, combined with his and Peter Geoghegan's
I/O prefetching patches) and confirmed that the batching + SAOP
approach helps there too, not just in the CPU-bound / memory-resident
case. In fact he showed that the patches here don't make a big dent
when the main bottleneck is I/O as shown in numbers that he shared in
an off-list email:

master: 161617 ms
ri-check (0001..0004): 149446 ms (1.08x)
ri-check + i/o prefetching: 50885 ms (3.2x)

So the RI patches alone only give ~8% here since most time is waiting
on reads. But the batching gives the prefetch machinery a window of
upcoming probes to issue readahead against, so the two together yield
3.2x.

I tested this (with the index prefetching v11 patch), because I wanted
to check if the revised API works fine for other use cases, not just the
regular index scans. Turns out the answer is "yes", the necessary tweaks
to the FK batching patch were pretty minimal, and at the same time it
did help quite a bit for cases bottle-necked on I/O.

Do you think those changes to the FK batching are only necessary for
making it work with your patch or is that worth including with the set
here because it's generally applicable?

FWIW I wonder how difficult it would be to do something like this for
inserts into indexes. It's an orthogonal issue to FK checks (especially
for the CPU-bound cases this thread focuses on), but it's a bit similar
to the I/O-bound case. In fact, I now realize I actually did a PoC for
that in 2023-11 [1], but it went stale ...

Interesting. I hadn't seen your earlier PoC. Does the current I/O
prefetching infrastructure simplify that approach, or are they
independent paths? The old patch calls PrefetchBuffer() directly on
the leaf, which seems orthogonal to the scan-side prefetching. Either
way, would be nice to see more paths benefit from batching.

benchmarks
----------

Anyway, thinking about the CPU-bound case, I decided to do a bit of
testing on my own. I was wondering about three things:

(a) how does the improvement depend on data distribution
(b) could it cause regressions for small inserts
(c) how sensitive is the batch size

So I devised two simple benchmarks:

1) run-pattern.sh - Inserts batches of values into a table, both the
batch and table can be either random or sequential. It's either 100k or
1M rows, logged or unlogged, etc.

2) run-pgbench.sh - Runs short pgbench inserting data into a table,
similar to (1), but with very few rows - so the timing approach is not
suitable to measure this.

Both scripts run against master, and then patched branch with three
batch sizes (default 64, 16 and 256).

results
-------

The results are very positive - see the attached PDF files comparing the
patched builds to master.

I have not found a single case where the batching causes regressions.
This surprised me a bit; I'd expected small regressions for single-row
inserts in the pgbench test, but even that shows a small (~5%) gain.
Even just 2-row inserts show +25% improvement in pgbench throughput.

This is reassuring. I too was half-expecting the batching
infrastructure to add measurable overhead for single-row inserts, but
it looks like the SPI bypass alone more than covers it.

There are a couple cases where it matches master, I assume that's for
I/O bound cases where the CPU optimizations do not really matter. That's
expected, of course.

I don't see much sensitivity on the batch size. The 256 batches seem to
be a bit slower, but there's little difference between 16 and 64. So I'd
say 64 seems reasonable.

Agreed. Interesting that 16 is consistently a little better than 64 in
the patterns benchmark. I'd guess that's the per-PK-index-match linear
scan over the batch cost showing up, since it's O(batch_size) per PK
match. 256 being noticeably worse fits that picture. 64 seems like a
good middle ground since the pgbench numbers show virtually no
difference between 16 and 64.

The best-case numbers are striking -- when both the PK table and the
FK values being inserted are in sequential order, the unlogged
patterns case hits 4-5x, wow. I guess that makes sense because
sequential FK values turn into a sorted SAOP array that walks
consecutive leaf pages, so it's essentially a single sequential scan
of the relevant index portion.

Overall, I think these results look quite good. I haven't looked at the
code very closely, not beyond adjusting it to work with index prefetch.

If you get a chance, I'd welcome a closer look. Your memory context
catch was a real bug that I'd missed entirely. The area that would
benefit most from a second pair of eyes is the snapshot and permission
caching semantics in 0002. The argument for why reusing the snapshot
and checking permissions once per batch is safe rather than per-row is
sound, I think, but the effects are global and hard to validate by
testing alone.

--
Thanks, Amit Langote

#12Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Amit Langote (#11)
Re: Eliminating SPI / SQL from some RI triggers - take 3

On 3/2/26 08:49, Amit Langote wrote:

Hi Tomas,

Thanks for the thorough benchmarking.

On Sun, Mar 1, 2026 at 9:22 PM Tomas Vondra <tomas@vondra.me> wrote:

On 2/28/26 08:08, Amit Langote wrote:

Tomas Vondra also tested with an I/O-intensive workload (dataset
larger than shared_buffers, combined with his and Peter Geoghegan's
I/O prefetching patches) and confirmed that the batching + SAOP
approach helps there too, not just in the CPU-bound / memory-resident
case. In fact he showed that the patches here don't make a big dent
when the main bottleneck is I/O as shown in numbers that he shared in
an off-list email:

master: 161617 ms
ri-check (0001..0004): 149446 ms (1.08x)
ri-check + i/o prefetching: 50885 ms (3.2x)

So the RI patches alone only give ~8% here since most time is waiting
on reads. But the batching gives the prefetch machinery a window of
upcoming probes to issue readahead against, so the two together yield
3.2x.

I tested this (with the index prefetching v11 patch), because I wanted
to check if the revised API works fine for other use cases, not just the
regular index scans. Turns out the answer is "yes", the necessary tweaks
to the FK batching patch were pretty minimal, and at the same time it
did help quite a bit for cases bottle-necked on I/O.

Do you think those changes to the FK batching are only necessary for
making it work with your patch or is that worth including with the set
here because it's generally applicable?

All the changes were specific to the index prefetching patch, due to
small changes in API (function name, new argument, ...).

FWIW I wonder how difficult it would be to do something like this for
inserts into indexes. It's an orthogonal issue to FK checks (especially
for the CPU-bound cases this thread focuses on), but it's a bit similar
to the I/O-bound case. In fact, I now realize I actually did a PoC for
that in 2023-11 [1], but it went stale ...

Interesting. I hadn't seen your earlier PoC. Does the current I/O
prefetching infrastructure simplify that approach, or are they
independent paths? The old patch calls PrefetchBuffer() directly on
the leaf, which seems orthogonal to the scan-side prefetching. Either
way, would be nice to see more paths benefit from batching.

I think it's mostly independent, I don't see much code overlap. The
index insert prefetch needs to happen much earlier, not in the "after
trigger" phase (which is where FK checks happen, right?).

It also works with individual tuples by default - it prefetches leaf
pages for all indexes that one insert needs to touch. There is some
batching concept e.g. for COPY.

Anyway, it's a separate patch. Let's not discuss that here, so that we
don't distract people interested in the FK stuff.

benchmarks
----------

Anyway, thinking about the CPU-bound case, I decided to do a bit of
testing on my own. I was wondering about three things:

(a) how does the improvement depend on data distribution
(b) could it cause regressions for small inserts
(c) how sensitive is the batch size

So I devised two simple benchmarks:

1) run-pattern.sh - Inserts batches of values into a table, both the
batch and table can be either random or sequential. It's either 100k or
1M rows, logged or unlogged, etc.

2) run-pgbench.sh - Runs short pgbench inserting data into a table,
similar to (1), but with very few rows - so the timing approach is not
suitable to measure this.

Both scripts run against master, and then patched branch with three
batch sizes (default 64, 16 and 256).

results
-------

The results are very positive - see the attached PDF files comparing the
patched builds to master.

I have not found a single case where the batching causes regressions.
This surprised me a bit; I'd expected small regressions for single-row
inserts in the pgbench test, but even that shows a small (~5%) gain.
Even just 2-row inserts show +25% improvement in pgbench throughput.

This is reassuring. I too was half-expecting the batching
infrastructure to add measurable overhead for single-row inserts, but
it looks like the SPI bypass alone more than covers it.

Seems like it.

There are a couple cases where it matches master, I assume that's for
I/O bound cases where the CPU optimizations do not really matter. That's
expected, of course.

I don't see much sensitivity on the batch size. The 256 batches seem to
be a bit slower, but there's little difference between 16 and 64. So I'd
say 64 seems reasonable.

Agreed. Interesting that 16 is consistently a little better than 64 in
the patterns benchmark. I'd guess that's the per-PK-index-match linear
scan over the batch cost showing up, since it's O(batch_size) per PK
match. 256 being noticeably worse fits that picture. 64 seems like a
good middle ground since the pgbench numbers show virtually no
difference between 16 and 64.

The best-case numbers are striking -- when both the PK table and the
FK values being inserted are in sequential order, the unlogged
patterns case hits 4-5x, wow. I guess that makes sense because
sequential FK values turn into a sorted SAOP array that walks
consecutive leaf pages, so it's essentially a single sequential scan
of the relevant index portion.

Right. I think the sequential case maximizes the amount of CPU time
spent in the FK checks, so the reduction is much more visible.

Overall, I think these results look quite good. I haven't looked at the
code very closely, not beyond adjusting it to work with index prefetch.

If you get a chance, I'd welcome a closer look. Your memory context
catch was a real bug that I'd missed entirely. The area that would
benefit most from a second pair of eyes is the snapshot and permission
caching semantics in 0002. The argument for why reusing the snapshot
and checking permissions once per batch is safe rather than per-row is
sound, I think, but the effects are global and hard to validate by
testing alone.

TBH I hadn't noticed the memory context issue myself; I only noticed
because the builds with index prefetch started crashing. But I'll try to
take a look this week.

regards

--
Tomas Vondra

#13Junwang Zhao
zhjwpku@gmail.com
In reply to: Amit Langote (#9)
Re: Eliminating SPI / SQL from some RI triggers - take 3

On Sat, Feb 28, 2026 at 3:08 PM Amit Langote <amitlangote09@gmail.com> wrote:

Hi Junwang,

On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <amitlangote09@gmail.com> wrote:

I re-ran the benchmarks (same test as yours, different machine):

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);
insert into fk select generate_series(1, 2000000, 2);

master: 2444 ms (median of 3 runs)
0001: 1382 ms (43% faster)
0001+0002: 1202 ms (51% faster, 13% over 0001 alone)

I can get similar improvement on my old mac intel chip:

master: 12963.993 ms
0001: 6641.692 ms, 48.8% faster
0001+0002: 5771.703 ms, 55.5% faster

Also, with int PK / int FK (1M rows):

create table pk (a int primary key);
create table fk (a int references pk);
insert into pk select generate_series(1, 1000000);
insert into fk select generate_series(1, 1000000);

master: 1000 ms
0001: 520 ms (48% faster)
0001+0002: 432 ms (57% faster, 17% over 0001 alone)

master: 11134.583 ms
0001: 5240.298 ms, 52.9% faster
0001+0002: 4554.215 ms, 59.1% faster

Thanks for testing, good to see similar numbers. I had forgotten to
note that these results are when these PK index probes don't do any
I/O, though you might be aware of that. Below, I report some numbers
that Tomas Vondra shared with me off-list where the probes do have to
perform I/O, and there the benefits from this patch set alone are
only marginal.

I don't have any additional comments on the patch except one minor
nit: maybe merge the following two if conditions into one; not a
strong opinion though.

if (use_cache)
{
    /*
     * The snapshot was registered once when the cache entry was created.
     * We just patch curcid to reflect the new command counter.
     * SnapshotSetCommandId() only patches process-global statics, not
     * registered copies, so we do it directly.
     *
     * The xmin/xmax/xip fields don't need refreshing: within a single
     * statement batch, only curcid changes between rows.
     */
    Assert(fpentry && fpentry->snapshot != NULL);
    snapshot = fpentry->snapshot;
    snapshot->curcid = GetCurrentCommandId(false);
}
else
    snapshot = RegisterSnapshot(GetLatestSnapshot());

if (use_cache)
{
    pk_rel = fpentry->pk_rel;
    idx_rel = fpentry->idx_rel;
    scandesc = fpentry->scandesc;
    slot = fpentry->slot;
}
else
{
    pk_rel = table_open(riinfo->pk_relid, RowShareLock);
    idx_rel = index_open(riinfo->conindid, AccessShareLock);
    scandesc = index_beginscan(pk_rel, idx_rel,
                               snapshot, NULL,
                               riinfo->nkeys, 0);
    slot = table_slot_create(pk_rel, NULL);
}

Good idea, done.

While polishing 0002, I revisited the snapshot caching semantics. The
previous commit message hand-waved about only curcid changing between
rows, but GetLatestSnapshot() also reflects other backends' commits,
so reusing the snapshot is a deliberate semantic change from the SPI
path. I think it's safe because curcid is all we need for
intra-statement visibility, concurrent commits either already happened
before our snapshot (and are visible) or are racing with our statement
and wouldn't be seen reliably even with per-row snapshots since the
order in which FK rows are checked is nondeterministic, and
LockTupleKeyShare prevents the PK row from disappearing regardless. In
essence, we're treating all the FK checks within a trigger-firing
cycle as a single plan execution that happens to scan N rows, rather
than N independent SPI queries each taking a fresh snapshot. That's
the natural model -- a normal SELECT ... FOR KEY SHARE plan doesn't
re-take GetLatestSnapshot() between rows either.

Similarly, the permission check (schema USAGE + table SELECT) is now
done once at cache entry creation in ri_FastPathGetEntry() rather than
on every flush.

nice improvement.

The RI check runs as the PK table owner, so we're
verifying that the owner can access their own table -- a condition
that won't change unless someone explicitly revokes from the owner,
which would also break the SPI path.

David Rowley mentioned off-list that it might be worth batching
multiple FK values into a single index probe, leveraging the
ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
to buffer FK values across trigger invocations in the per-constraint
cache (0002 already has the right structure for this), build a
SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
pages in one sorted traversal instead of one tree descent per row. The
locking and recheck would still be per-tuple, but the index traversal
cost drops significantly. Single-column FKs are the obvious starting
point. That seems worth exploring but can be done as a separate patch
on top of this.

I will take a look at this in the following weeks.

I ended up going ahead with the batching and SAOP idea that David
mentioned -- I had a proof-of-concept working shortly after posting v3
and kept iterating on it. So the attached set is now:

0001 - Core fast path (your 0001+0002 reworked, as before)

0002 - Per-batch resource caching (PK relation, index, scandesc, snapshot)

0003 - FK row buffering: materialize FK tuples into a per-constraint
batch buffer (64 rows), flush when full or at batch end

0004 - SK_SEARCHARRAY for single-column FKs: build an array from the
buffered FK values and do one index scan instead of 64 separate tree
descents. Multi-column FKs fall back to a per-row loop.

0003 is pure infrastructure -- it doesn't improve performance on its
own because the per-row index descent still dominates. The payoff
comes in 0004.

Numbers (same machine as before, median of 3 runs):

numeric PK / bigint FK, 1M rows:
master: 2487 ms
0001..0004: 1168 ms (2.1x)

int PK / int FK, 500K rows:
master: 1043 ms
0001..0004: 335 ms (3.1x)

The int/int case benefits most because the per-row cost is lower, so
the SAOP traversal savings are a larger fraction of the total. The
numeric/bigint case still sees a solid improvement despite the
cross-type cast overhead.

Tomas Vondra also tested with an I/O-intensive workload (dataset
larger than shared_buffers, combined with his and Peter Geoghegan's
I/O prefetching patches) and confirmed that the batching + SAOP
approach helps there too, not just in the CPU-bound / memory-resident
case. He also showed that the patches here alone don't make a big
dent when the main bottleneck is I/O, as seen in numbers that he
shared in an off-list email:

master: 161617 ms
ri-check (0001..0004): 149446 ms (1.08x)
ri-check + i/o prefetching: 50885 ms (3.2x)

So the RI patches alone only give ~8% here since most time is waiting
on reads. But the batching gives the prefetch machinery a window of
upcoming probes to issue readahead against, so the two together yield
3.2x.

impressive!

Tomas also caught a memory context bug in the batch flush path: the
cached scandesc lives in TopTransactionContext, but the btree AM
defers _bt_preprocess_keys allocation to the first getnext call, which
pallocs into CurrentMemoryContext. If that's a short-lived
per-trigger-row context, the scandesc has dangling pointers on the
next rescan. Fixed by switching to TopTransactionContext before the
probe loop.

Finally, I've fixed a number of other small and not-so-small bugs
found while polishing the old patches and made other stylistic
improvements. One notable change is that I introduced a FastPathMeta

Yeah, this is much better than the fpmeta_valid field.

struct to store the fast path metadata instead of dumping those arrays
in the RI_ConstraintInfo. It's allocated lazily on first use and holds
the per-key compare entries, operator procedures, and index strategy
info needed by the scan key construction, so RI_ConstraintInfo doesn't
pay for them when the fast path isn't used.

On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

Hi Amit,

On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <amitlangote09@gmail.com> wrote:

Hi Junwang,

On Mon, Dec 1, 2025 at 3:09 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

As Amit has already stated, we are approaching a hybrid "fast-path + fallback"
design.

0001 adds a fast path optimization for foreign key constraint checks
that bypasses the SPI executor; the fast path applies when the
referenced table is not partitioned and the constraint does not
involve temporal semantics.

With the following test:

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);

head:

[local] zhjwpku@postgres:5432-90419=# insert into fk select
generate_series(1, 2000000, 2);
INSERT 0 1000000
Time: 13516.177 ms (00:13.516)

[local] zhjwpku@postgres:5432-90419=# update fk set a = a + 1;
UPDATE 1000000
Time: 15057.638 ms (00:15.058)

patched:

[local] zhjwpku@postgres:5432-98673=# insert into fk select
generate_series(1, 2000000, 2);
INSERT 0 1000000
Time: 8248.777 ms (00:08.249)

[local] zhjwpku@postgres:5432-98673=# update fk set a = a + 1;
UPDATE 1000000
Time: 10117.002 ms (00:10.117)

0002 caches fast-path metadata used by the index probe; at the
moment, only comparison operator hash entries, operator function
OIDs, and strategy numbers and subtypes for index scans. But this
cache alone doesn't buy any performance improvement.

Caching additional metadata should improve performance for foreign key checks.

Amit suggested introducing a mechanism for ri_triggers.c to register a
cleanup callback in the EState, which AfterTriggerEndQuery() could then
invoke to release per-statement cached metadata (such as the IndexScanDesc).
However, I haven't been able to implement this mechanism yet.

Thanks for working on this. I've taken your patches as a starting
point and reworked the series into two patches (attached): 1st is your
0001+0002 as the core patch that adds a gated fast-path alternative to
SPI and 2nd where I added per-statement resource caching. Doing the
latter turned out to be not so hard thanks to the structure you chose
to build the core fast path. Good call on adding the RLS and ACL test
cases, btw.

So, 0001 is a functionally complete fast path: concurrency handling,
REPEATABLE READ crosscheck, cross-type operators, security context,
and metadata caching. 0002 implements the per-statement resource
caching we discussed, though instead of sharing the EState between
trigger.c and ri_triggers.c it uses a new AfterTriggerBatchCallback
mechanism that fires at the end of each trigger-firing cycle
(per-statement for immediate constraints, or until COMMIT for deferred
ones). It layers resource caching on top so that the PK relation,
index, scan descriptor, and snapshot stay open across all FK trigger
invocations within a single trigger-firing cycle rather than being
opened and closed per row.

Note that the previous 0002 (metadata caching) is folded into 0001,
and most of the new fast-path logic added in 0001 now lives in
ri_FastPathCheck() rather than inline in RI_FKey_check(), so the
RI_FKey_check diff is just the gating call and SPI fallback.

I re-ran the benchmarks (same test as yours, different machine):

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);
insert into fk select generate_series(1, 2000000, 2);

master: 2444 ms (median of 3 runs)
0001: 1382 ms (43% faster)
0001+0002: 1202 ms (51% faster, 13% over 0001 alone)

I can get similar improvement on my old mac intel chip:

master: 12963.993 ms
0001: 6641.692 ms, 48.8% faster
0001+0002: 5771.703 ms, 55.5% faster

Also, with int PK / int FK (1M rows):

create table pk (a int primary key);
create table fk (a int references pk);
insert into pk select generate_series(1, 1000000);
insert into fk select generate_series(1, 1000000);

master: 1000 ms
0001: 520 ms (48% faster)
0001+0002: 432 ms (57% faster, 17% over 0001 alone)

master: 11134.583 ms
0001: 5240.298 ms, 52.9% faster
0001+0002: 4554.215 ms, 59.1% faster

The incremental gain from 0002 comes from eliminating per-row relation
open/close, scan begin/end, slot alloc/free, and replacing per-row
GetSnapshotData() with only curcid adjustment on the registered
snapshot copy in the cache.

The two current limitations are partitioned referenced tables and
temporal foreign keys. Partitioned PKs are relatively uncommon in
practice, so the non-partitioned case should cover most FK workloads;
I'm not sure it's worth the added complexity to support them.
Temporal FKs are inherently multi-row, so they're a poor fit for a
single-probe fast path.

David Rowley mentioned off-list that it might be worth batching
multiple FK values into a single index probe, leveraging the
ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
to buffer FK values across trigger invocations in the per-constraint
cache (0002 already has the right structure for this), build a
SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
pages in one sorted traversal instead of one tree descent per row. The
locking and recheck would still be per-tuple, but the index traversal
cost drops significantly. Single-column FKs are the obvious starting
point. That seems worth exploring but can be done as a separate patch
on top of this.

I will take a look at this in the following weeks.

I think the series is in reasonable shape but would appreciate extra
eyeballs, especially on the concurrency handling in ri_LockPKTuple()
in 0001 and the snapshot lifecycle in 0002. Or anything else that
catches one's eye.

--
Thanks, Amit Langote

I don't have any additional comments on the patch except one minor nit,
maybe merge the following two if conditions into one, not a strong opinion
though.

if (use_cache)
{
    /*
     * The snapshot was registered once when the cache entry was created.
     * We just patch curcid to reflect the new command counter.
     * SnapshotSetCommandId() only patches process-global statics, not
     * registered copies, so we do it directly.
     *
     * The xmin/xmax/xip fields don't need refreshing: within a single
     * statement batch, only curcid changes between rows.
     */
    Assert(fpentry && fpentry->snapshot != NULL);
    snapshot = fpentry->snapshot;
    snapshot->curcid = GetCurrentCommandId(false);
}
else
    snapshot = RegisterSnapshot(GetLatestSnapshot());

if (use_cache)
{
    pk_rel = fpentry->pk_rel;
    idx_rel = fpentry->idx_rel;
    scandesc = fpentry->scandesc;
    slot = fpentry->slot;
}
else
{
    pk_rel = table_open(riinfo->pk_relid, RowShareLock);
    idx_rel = index_open(riinfo->conindid, AccessShareLock);
    scandesc = index_beginscan(pk_rel, idx_rel,
                               snapshot, NULL,
                               riinfo->nkeys, 0);
    slot = table_slot_create(pk_rel, NULL);
}

--
Regards
Junwang Zhao

--
Thanks, Amit Langote

--
Regards
Junwang Zhao

#14Junwang Zhao
zhjwpku@gmail.com
In reply to: Junwang Zhao (#13)
Re: Eliminating SPI / SQL from some RI triggers - take 3

Hi,

On Mon, Mar 2, 2026 at 11:30 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

I had an offline discussion with Amit today. There were a few small things
that could be improved, so I posted a new version of the patch set.

1.

+ if (ri_fastpath_is_applicable(riinfo))
+ {
+     bool found = ri_FastPathCheck(riinfo, fk_rel, newslot);
+
+     if (found)
+         return PointerGetDatum(NULL);
+
+     /*
+      * ri_FastPathCheck opens pk_rel internally; we need it for
+      * ri_ReportViolation.  Re-open briefly.
+      */
+     pk_rel = table_open(riinfo->pk_relid, RowShareLock);
+     ri_ReportViolation(riinfo, pk_rel, fk_rel,
+                        newslot, NULL,
+                        RI_PLAN_CHECK_LOOKUPPK, false, false);
+ }

Move ri_ReportViolation into ri_FastPathCheck, so table_open is no
longer needed, and ri_FastPathCheck now returns void. Since Amit
agreed this is the right approach, I included it directly in v5-0001.

2.

After adding the batch fast path, the original ri_FastPathCheck is only
used by the ALTER TABLE validation path. This path cannot use the
cache because the registered AfterTriggerBatch callback will never run.
Therefore, the use_cache branch can be removed.

I made this change in v5-0004 and also updated some related comments.
Once we agree the changes are correct, it can be merged into v5-0003.

3.

+ fk_slot = MakeSingleTupleTableSlot(RelationGetDescr(fk_rel),
+                                    &TTSOpsHeapTuple);

ri_FastPathBatchFlush creates a new fk_slot but does not cache it in
RI_FastPathEntry. I tried caching it in v5-0006 and ran some benchmarks,
it didn't show much improvement. This might be because the slot creation
function is called once per batch rather than once per row, so the overall
impact is minimal. I'm posting this here for Amit to take a look and decide
whether we should adopt it or drop it, since I mentioned the idea to
him earlier.

4.

ri_FastPathFlushArray currently uses SK_SEARCHARRAY only for
single-column checks. I asked whether this could be extended to support
multi-column cases, and Amit encouraged me to look into it.

After a brief investigation, it seems that ScanKeyEntryInitialize only allows
passing a single subtype/collation/procedure, which makes it difficult to
handle multiple types. Based on this, my current understanding is that
SK_SEARCHARRAY may not work for multi-column checks.

--
Regards
Junwang Zhao

Attachments:

v5-0005-Use-SK_SEARCHARRAY-for-batched-fast-path-FK-probe.patchapplication/octet-stream; name=v5-0005-Use-SK_SEARCHARRAY-for-batched-fast-path-FK-probe.patchDownload+262-61
v5-0006-Reuse-FK-tuple-slot-across-fast-path-batches.patchapplication/octet-stream; name=v5-0006-Reuse-FK-tuple-slot-across-fast-path-batches.patchDownload+10-12
v5-0002-Cache-per-batch-resources-for-fast-path-foreign-k.patchapplication/octet-stream; name=v5-0002-Cache-per-batch-resources-for-fast-path-foreign-k.patchDownload+532-39
v5-0004-Refine-fast-path-FK-validation-path.patchapplication/octet-stream; name=v5-0004-Refine-fast-path-FK-validation-path.patchDownload+23-51
v5-0003-Buffer-FK-rows-for-batched-fast-path-probing.patchapplication/octet-stream; name=v5-0003-Buffer-FK-rows-for-batched-fast-path-probing.patchDownload+253-9
v5-0001-Add-fast-path-for-foreign-key-constraint-checks.patchapplication/octet-stream; name=v5-0001-Add-fast-path-for-foreign-key-constraint-checks.patchDownload+751-13
#15Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Junwang Zhao (#14)
Re: Eliminating SPI / SQL from some RI triggers - take 3

Hi Junwang,

Thanks for sending the new version.

On Tue, Mar 10, 2026 at ... Junwang Zhao <zhjwpku@gmail.com> wrote:

1.
Move ri_ReportViolation into ri_FastPathCheck, so table_open is no
longer needed, and ri_FastPathCheck now returns void.

Good, kept.

2.
After adding the batch fast path, the original ri_FastPathCheck is only
used by the ALTER TABLE validation path. This path cannot use the
cache because the registered AfterTriggerBatch callback will never run.
Therefore, the use_cache branch can be removed.

Agreed. I went a step further and restructured 0002 to avoid the
use_cache branching entirely. Instead of adding if/else blocks to
ri_FastPathCheck, 0002 now adds a separate ri_FastPathCheckCached()
function with its own resource lifecycle. 0003 then replaces it with
ri_FastPathBatchAdd() -- a clean swap rather than completely undoing
what 0002 added. This also removes the use_cache parameter from
ri_FastPathProbeOne; the memory context switch to
TopTransactionContext is now the caller's responsibility.

3.
ri_FastPathBatchFlush creates a new fk_slot but does not cache it in
RI_FastPathEntry. I tried caching it in v5-0006 and ran some benchmarks,
it didn't show much improvement.

I put the fk_slot in the cache entry since it's a small change.

4.
ri_FastPathFlushArray currently uses SK_SEARCHARRAY only for
single-column checks. [...] my current understanding is that
SK_SEARCHARRAY may not work for multi-column checks.

Right, I haven't investigated this deeply either. The FlushLoop
fallback is the right approach for now. If we want to explore a
SEARCHARRAY approach for multi-column keys in a follow-up, it would be
worth checking with Peter Geoghegan or someone else familiar with the
btree SAOP internals on how multiple array keys across columns
are iterated and whether that's usable at all for this use case.

Attached is v6, now three patches -- I combined the old 0003 (buffering)
and 0004 (SK_SEARCHARRAY) into a single 0003, since the buffering alone
brings little to no performance benefit and the split added unnecessary
diff/rebase churn.

The biggest change in this version is the snapshot handling. Looking
more carefully at what the SPI path actually does for RI_FKey_check
(non-partitioned PK, detectNewRows = false), I found that
ri_PerformCheck passes InvalidSnapshot to SPI_execute_snapshot, and
_SPI_execute_plan ends up doing
PushActiveSnapshot(GetTransactionSnapshot()). So the SPI path scans
with the transaction snapshot, not the latest
snapshot.

So I've changed the fast path to match: ri_FastPathCheck now uses
GetTransactionSnapshot() directly. Under READ COMMITTED this is a
fresh snapshot; under REPEATABLE READ it's the frozen
transaction-start snapshot, so PK rows committed after the transaction
started are simply not visible. This means the second index probe
(the IsolationUsesXactSnapshot crosscheck block) is no longer needed
and is removed. The existing fk-snapshot isolation test confirms this
is the correct behavior.
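The isolation-level behavior described above can be sketched as a toy standalone model (plain C, not PostgreSQL code; MiniSnapshot, MiniXact, and the commit-counter scheme are all invented for illustration): under READ COMMITTED each call hands back a fresh snapshot, while under REPEATABLE READ the transaction's first snapshot is frozen and reused, so later commits stay invisible.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy visibility model (NOT PostgreSQL code): a snapshot "sees" only
 * commits that happened at or before the moment it was taken.
 */
typedef struct MiniSnapshot
{
	int			taken_at;	/* logical commit counter at snapshot time */
} MiniSnapshot;

static bool
row_visible(const MiniSnapshot *snap, int row_commit_time)
{
	return row_commit_time <= snap->taken_at;
}

/*
 * GetTransactionSnapshot() analogue: READ COMMITTED takes a fresh
 * snapshot on each call; REPEATABLE READ freezes the first snapshot
 * of the transaction and keeps returning it.
 */
typedef struct MiniXact
{
	bool		repeatable_read;
	bool		have_snapshot;
	MiniSnapshot frozen;
} MiniXact;

static MiniSnapshot
get_transaction_snapshot(MiniXact *xact, int now)
{
	if (xact->repeatable_read)
	{
		if (!xact->have_snapshot)
		{
			xact->frozen.taken_at = now;
			xact->have_snapshot = true;
		}
		return xact->frozen;
	}
	return (MiniSnapshot) {.taken_at = now};	/* fresh each call */
}
```

In this model a PK row committed after a REPEATABLE READ transaction started is simply never visible to it, which is why the second crosscheck probe becomes unnecessary.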

Other changes since v5:

* Fixed the batch callback firing during nested SPI: another AFTER
trigger doing DML via SPI would call AfterTriggerEndQuery at a nested
level, tearing down our cache mid-batch. Fixed by checking
query_depth inside FireAfterTriggerBatchCallbacks. Added a test case
with a trigger that auto-provisions PK rows via SPI.

* Security context is now restored before ri_ReportViolation, not
after (the ereport doesn't return).

* search_vals[] and matched[] moved from RI_FastPathEntry to stack
locals in ri_FastPathFlushArray -- they're rewritten from scratch on
every flush with no state carried between calls.

* Various comment fixes.
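The query_depth guard in the first item above can be modeled standalone (plain C, invented names, not the patch's actual code): a batch callback remembers the depth at which it was registered, and the end-of-query hook fires it only when control is back at that depth, so a nested SPI query ending cannot tear the cache down mid-batch.

```c
#include <assert.h>

/*
 * Standalone model (NOT PostgreSQL code) of the nested-SPI guard:
 * callbacks registered at an outer query depth are skipped when a
 * deeper (nested SPI) query ends.
 */
typedef struct BatchCallback
{
	int			registered_depth;	/* query_depth at registration */
	int			fired;
} BatchCallback;

static int	query_depth = 0;

static void
begin_query(void)
{
	query_depth++;
}

/* Modeled on FireAfterTriggerBatchCallbacks being invoked from the
 * end-of-query path: only fire callbacks that belong to this depth. */
static void
end_query(BatchCallback *cbs, int ncbs)
{
	for (int i = 0; i < ncbs; i++)
	{
		if (cbs[i].registered_depth == query_depth)
			cbs[i].fired++;
	}
	query_depth--;
}
```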

I think this is getting close to committable shape. That said, another
pair of eyes would be reassuring before I pull the trigger. Tomas, if
you've had a chance to look, would welcome your thoughts.

--
Thanks, Amit Langote

Attachments:

v6-0003-Batch-FK-rows-and-use-SK_SEARCHARRAY-for-fast-pat.patch (+421 -49)
v6-0001-Add-fast-path-for-foreign-key-constraint-checks.patch (+722 -13)
v6-0002-Cache-per-batch-resources-for-fast-path-foreign-k.patch (+549 -4)
#16Haibo Yan
tristan.yim@gmail.com
In reply to: Junwang Zhao (#14)
Re: Eliminating SPI / SQL from some RI triggers - take 3

Hi, Amit and Junwang

Thanks for the latest patch. I think the overall direction makes sense, and the single-column SK_SEARCHARRAY path looks like one of the most valuable optimizations here. The patch also seems to cover several important cases, including deferred constraints, duplicate FK values, and multi-column fallback behavior.

After reading through the patch, I have one major comment and a few smaller ones.

1. TopTransactionContext usage during batched flush may be too coarse-grained
My biggest concern is the use of TopTransactionContext around the batched flush path.
As written, ri_FastPathBatchFlush() switches to TopTransactionContext before calling ri_FastPathFlushArray() / ri_FastPathFlushLoop(). That seems broad enough that temporary allocations made during the flush may end up there.
In particular, in ri_FastPathFlushArray(), I think the objects worth checking carefully are the pass-by-reference Datums returned by the per-element cast call and stored in search_vals[], e.g.

search_vals[i] = FunctionCall3(&entry->cast_func_finfo, ...);
If those cast results are separately allocated in the current memory context, then pfree(arr) only frees the constructed array object itself; it does not obviously free those intermediate cast results. If so, those allocations could survive until end of transaction rather than just until the end of the current flush.
Maybe this is harmless in practice, but I think it needs a closer look. It might be better to use a dedicated short-lived context for per-flush temporary allocations, reset it after each flush, or otherwise separate allocations that really need transaction lifetime from those that are only needed transiently during batched processing.

2. RI_FastPathEntry comment mentions the wrong function name
The comment above RI_FastPathEntry says it contains resources needed by ri_FastPathFlushBatch(), but the function is named ri_FastPathBatchFlush().

3. RI_FASTPATH_BATCH_SIZE needs some rationale
RI_FASTPATH_BATCH_SIZE = 64 may well be a reasonable compromise, but right now it reads like a magic number.
This choice seems especially relevant because the patch has two opposing effects:
3-1. larger batches should amortize the array-scan work better,
3-2. but the matched[] bookkeeping in ri_FastPathFlushArray() is O(batch_size^2) in the worst case.
So I think it would help to include at least a brief rationale in a comment or in the commit message.

4. Commit message says the entry stashes fk_relid, but the code actually stashes riinfo
The commit message says the entry stashes fk_relid and can reopen the relation if needed. Unless I am misreading it, the code actually stores riinfo and later uses riinfo->fk_relid. The distinction is small, but I think the wording should match the implementation more closely.
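To make the quadratic bookkeeping of point 3 concrete, here is a small standalone model (plain C, invented names, not the patch's code): the array scan returns each distinct PK value once, and for each returned value the flush sweeps the whole batch to mark every FK row carrying it (duplicate FK values share one probe), giving O(batch_size^2) comparisons in the worst case.

```c
#include <assert.h>
#include <stdbool.h>

#define BATCH_SIZE 64			/* mirrors RI_FASTPATH_BATCH_SIZE */

/*
 * Standalone model (NOT PostgreSQL code) of matched[] bookkeeping:
 * for every PK value the index scan returned, sweep the entire batch
 * of buffered FK values and mark each row that carries it.  Returns
 * the number of comparisons performed, which is nreturned * nbatch --
 * up to BATCH_SIZE^2 when every batched value is distinct and found.
 */
static long
mark_matches(const long *fk_vals, bool *matched, int nbatch,
			 const long *returned, int nreturned)
{
	long		comparisons = 0;

	for (int r = 0; r < nreturned; r++)
	{
		for (int i = 0; i < nbatch; i++)
		{
			comparisons++;
			if (fk_vals[i] == returned[r])
				matched[i] = true;
		}
	}
	return comparisons;
}
```

At a batch size of 64 the worst case is 4096 cheap comparisons per flush, which is presumably why the constant was chosen small; a rationale comment in the code would still help.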

Thanks again for working on this.

Best regards,

Haibo Yan


On Mar 10, 2026, at 5:28 AM, Junwang Zhao <zhjwpku@gmail.com> wrote:

Hi,

On Mon, Mar 2, 2026 at 11:30 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

On Sat, Feb 28, 2026 at 3:08 PM Amit Langote <amitlangote09@gmail.com> wrote:

Hi Junwang,

On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <amitlangote09@gmail.com> wrote:

I re-ran the benchmarks (same test as yours, different machine):

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);
insert into fk select generate_series(1, 2000000, 2);

master: 2444 ms (median of 3 runs)
0001: 1382 ms (43% faster)
0001+0002: 1202 ms (51% faster, 13% over 0001 alone)

I can get similar improvement on my old mac intel chip:

master: 12963.993 ms
0001: 6641.692 ms, 48.8% faster
0001+0002: 5771.703 ms, 55.5% faster

Also, with int PK / int FK (1M rows):

create table pk (a int primary key);
create table fk (a int references pk);
insert into pk select generate_series(1, 1000000);
insert into fk select generate_series(1, 1000000);

master: 1000 ms
0001: 520 ms (48% faster)
0001+0002: 432 ms (57% faster, 17% over 0001 alone)

master: 11134.583 ms
0001: 5240.298 ms, 52.9% faster
0001+0002: 4554.215 ms, 59.1% faster

Thanks for testing, good to see similar numbers. I had forgotten to
note that these results are when these PK index probes don't do any
I/O, though you might be aware of that. Below, I report some numbers
that Tomas Vondra shared with me off-list where the probes do have to
perform I/O and there the benefits from only this patch set are only
marginal.

I don't have any additional comments on the patch except one minor nit,
maybe merge the following two if conditions into one, not a strong opinion
though.

if (use_cache)
{
/*
* The snapshot was registered once when the cache entry was created.
* We just patch curcid to reflect the new command counter.
* SnapshotSetCommandId() only patches process-global statics, not
* registered copies, so we do it directly.
*
* The xmin/xmax/xip fields don't need refreshing: within a single
* statement batch, only curcid changes between rows.
*/
Assert(fpentry && fpentry->snapshot != NULL);
snapshot = fpentry->snapshot;
snapshot->curcid = GetCurrentCommandId(false);
}
else
snapshot = RegisterSnapshot(GetLatestSnapshot());

if (use_cache)
{
pk_rel = fpentry->pk_rel;
idx_rel = fpentry->idx_rel;
scandesc = fpentry->scandesc;
slot = fpentry->slot;
}
else
{
pk_rel = table_open(riinfo->pk_relid, RowShareLock);
idx_rel = index_open(riinfo->conindid, AccessShareLock);
scandesc = index_beginscan(pk_rel, idx_rel,
snapshot, NULL,
riinfo->nkeys, 0);
slot = table_slot_create(pk_rel, NULL);
}

Good idea, done.

While polishing 0002, I revisited the snapshot caching semantics. The
previous commit message hand-waved about only curcid changing between
rows, but GetLatestSnapshot() also reflects other backends' commits,
so reusing the snapshot is a deliberate semantic change from the SPI
path. I think it's safe because curcid is all we need for
intra-statement visibility, concurrent commits either already happened
before our snapshot (and are visible) or are racing with our statement
and wouldn't be seen reliably even with per-row snapshots since the
order in which FK rows are checked is nondeterministic, and
LockTupleKeyShare prevents the PK row from disappearing regardless. In
essence, we're treating all the FK checks within a trigger-firing
cycle as a single plan execution that happens to scan N rows, rather
than N independent SPI queries each taking a fresh snapshot. That's
the natural model -- a normal SELECT ... FOR KEY SHARE plan doesn't
re-take GetLatestSnapshot() between rows either.

Similarly, the permission check (schema USAGE + table SELECT) is now
done once at cache entry creation in ri_FastPathGetEntry() rather than
on every flush.

nice improvement.

The RI check runs as the PK table owner, so we're
verifying that the owner can access their own table -- a condition
that won't change unless someone explicitly revokes from the owner,
which would also break the SPI path.

David Rowley mentioned off-list that it might be worth batching
multiple FK values into a single index probe, leveraging the
ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
to buffer FK values across trigger invocations in the per-constraint
cache (0002 already has the right structure for this), build a
SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
pages in one sorted traversal instead of one tree descent per row. The
locking and recheck would still be per-tuple, but the index traversal
cost drops significantly. Single-column FKs are the obvious starting
point. That seems worth exploring but can be done as a separate patch
on top of this.

I will take a look at this in the following weeks.

I ended up going ahead with the batching and SAOP idea that David
mentioned -- I had a proof-of-concept working shortly after posting v3
and kept iterating on it. So attached set is now:

0001 - Core fast path (your 0001+0002 reworked, as before)

0002 - Per-batch resource caching (PK relation, index, scandesc, snapshot)

0003 - FK row buffering: materialize FK tuples into a per-constraint
batch buffer (64 rows), flush when full or at batch end

0004 - SK_SEARCHARRAY for single-column FKs: build an array from the
buffered FK values and do one index scan instead of 64 separate tree
descents. Multi-column FKs fall back to a per-row loop.

0003 is pure infrastructure -- it doesn't improve performance on its
own because the per-row index descent still dominates. The payoff
comes in 0004.

Numbers (same machine as before, median of 3 runs):

numeric PK / bigint FK, 1M rows:
master: 2487 ms
0001..0004: 1168 ms (2.1x)

int PK / int FK, 500K rows:
master: 1043 ms
0001..0004: 335 ms (3.1x)

The int/int case benefits most because the per-row cost is lower, so
the SAOP traversal savings are a larger fraction of the total. The
numeric/bigint case still sees a solid improvement despite the
cross-type cast overhead.

Tomas Vondra also tested with an I/O-intensive workload (dataset
larger than shared_buffers, combined with his and Peter Geoghegan's
I/O prefetching patches) and confirmed that the batching + SAOP
approach helps there too, not just in the CPU-bound / memory-resident
case. In fact he showed that the patches here don't make a big dent
when the main bottleneck is I/O as shown in numbers that he shared in
an off-list email:

master: 161617 ms
ri-check (0001..0004): 149446 ms (1.08x)
ri-check + i/o prefetching: 50885 ms (3.2x)

So the RI patches alone only give ~8% here since most time is waiting
on reads. But the batching gives the prefetch machinery a window of
upcoming probes to issue readahead against, so the two together yield
3.2x.

impressive!

Tomas also caught a memory context bug in the batch flush path: the
cached scandesc lives in TopTransactionContext, but the btree AM
defers _bt_preprocess_keys allocation to the first getnext call, which
pallocs into CurrentMemoryContext. If that's a short-lived
per-trigger-row context, the scandesc has dangling pointers on the
next rescan. Fixed by switching to TopTransactionContext before the
probe loop.

Finally, I've fixed a number of other small and not-so-small bugs
found while polishing the old patches and made other stylistic
improvements. One notable change is that I introduced a FastPathMeta

Yeah, this is much better than the fpmeta_valid field.

struct to store the fast path metadata instead of dumping those arrays
in the RI_ConstraintInfo. It's allocated lazily on first use and holds
the per-key compare entries, operator procedures, and index strategy
info needed by the scan key construction, so RI_ConstraintInfo doesn't
pay for them when the fast path isn't used.

On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

Hi Amit,

On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <amitlangote09@gmail.com> wrote:

Hi Junwang,

On Mon, Dec 1, 2025 at 3:09 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

As Amit has already stated, we are approaching a hybrid "fast-path + fallback"
design.

0001 adds a fast path optimization for foreign key constraint checks
that bypasses the SPI executor, the fast path applies when the referenced
table is not partitioned, and the constraint does not involve temporal
semantics.

With the following test:

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);

head:

[local] zhjwpku@postgres:5432-90419=# insert into fk select
generate_series(1, 2000000, 2);
INSERT 0 1000000
Time: 13516.177 ms (00:13.516)

[local] zhjwpku@postgres:5432-90419=# update fk set a = a + 1;
UPDATE 1000000
Time: 15057.638 ms (00:15.058)

patched:

[local] zhjwpku@postgres:5432-98673=# insert into fk select
generate_series(1, 2000000, 2);
INSERT 0 1000000
Time: 8248.777 ms (00:08.249)

[local] zhjwpku@postgres:5432-98673=# update fk set a = a + 1;
UPDATE 1000000
Time: 10117.002 ms (00:10.117)

0002 cache fast-path metadata used by the index probe, at the current
time only comparison operator hash entries, operator function OIDs
and strategy numbers and subtypes for index scans. But this cache
doesn't buy any performance improvement.

Caching additional metadata should improve performance for foreign key checks.

Amit suggested introducing a mechanism for ri_triggers.c to register a
cleanup callback in the EState, which AfterTriggerEndQuery() could then
invoke to release per-statement cached metadata (such as the IndexScanDesc).
However, I haven't been able to implement this mechanism yet.

Thanks for working on this. I've taken your patches as a starting
point and reworked the series into two patches (attached): 1st is your
0001+0002 as the core patch that adds a gated fast-path alternative to
SPI and 2nd where I added per-statement resource caching. Doing the
latter turned out to be not so hard thanks to the structure you chose
to build the core fast path. Good call on adding the RLS and ACL test
cases, btw.

So, 0001 is a functionally complete fast path: concurrency handling,
REPEATABLE READ crosscheck, cross-type operators, security context,
and metadata caching. 0002 implements the per-statement resource
caching we discussed, though instead of sharing the EState between
trigger.c and ri_triggers.c it uses a new AfterTriggerBatchCallback
mechanism that fires at the end of each trigger-firing cycle
(per-statement for immediate constraints, or until COMMIT for deferred
ones). It layers resource caching on top so that the PK relation,
index, scan descriptor, and snapshot stay open across all FK trigger
invocations within a single trigger-firing cycle rather than being
opened and closed per row.

Note that the previous 0002 (metadata caching) is folded into 0001,
and most of the new fast-path logic added in 0001 now lives in
ri_FastPathCheck() rather than inline in RI_FKey_check(), so the
RI_FKey_check diff is just the gating call and SPI fallback.

I re-ran the benchmarks (same test as yours, different machine):

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);
insert into fk select generate_series(1, 2000000, 2);

master: 2444 ms (median of 3 runs)
0001: 1382 ms (43% faster)
0001+0002: 1202 ms (51% faster, 13% over 0001 alone)

I can get similar improvement on my old mac intel chip:

master: 12963.993 ms
0001: 6641.692 ms, 48.8% faster
0001+0002: 5771.703 ms, 55.5% faster

Also, with int PK / int FK (1M rows):

create table pk (a int primary key);
create table fk (a int references pk);
insert into pk select generate_series(1, 1000000);
insert into fk select generate_series(1, 1000000);

master: 1000 ms
0001: 520 ms (48% faster)
0001+0002: 432 ms (57% faster, 17% over 0001 alone)

master: 11134.583 ms
0001: 5240.298 ms, 52.9% faster
0001+0002: 4554.215 ms, 59.1% faster

The incremental gain from 0002 comes from eliminating per-row relation
open/close, scan begin/end, slot alloc/free, and replacing per-row
GetSnapshotData() with only curcid adjustment on the registered
snapshot copy in the cache.

The two current limitations are partitioned referenced tables and
temporal foreign keys. Partitioned PKs are relatively uncommon in
practice, so the non-partitioned case should cover most FK workloads,
so I'm not sure it's worth the added complexity to support them.
Temporal FKs are inherently multi-row, so they're a poor fit for a
single-probe fast path.

David Rowley mentioned off-list that it might be worth batching
multiple FK values into a single index probe, leveraging the
ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
to buffer FK values across trigger invocations in the per-constraint
cache (0002 already has the right structure for this), build a
SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
pages in one sorted traversal instead of one tree descent per row. The
locking and recheck would still be per-tuple, but the index traversal
cost drops significantly. Single-column FKs are the obvious starting
point. That seems worth exploring but can be done as a separate patch
on top of this.

I will take a look at this in the following weeks.

I think the series is in reasonable shape but would appreciate extra
eyeballs, especially on the concurrency handling in ri_LockPKTuple()
in 0001 and the snapshot lifecycle in 0002. Or anything else that
catches one's eye.

--
Thanks, Amit Langote

I don't have any additional comments on the patch except one minor nit,
maybe merge the following two if conditions into one, not a strong opinion
though.

if (use_cache)
{
/*
* The snapshot was registered once when the cache entry was created.
* We just patch curcid to reflect the new command counter.
* SnapshotSetCommandId() only patches process-global statics, not
* registered copies, so we do it directly.
*
* The xmin/xmax/xip fields don't need refreshing: within a single
* statement batch, only curcid changes between rows.
*/
Assert(fpentry && fpentry->snapshot != NULL);
snapshot = fpentry->snapshot;
snapshot->curcid = GetCurrentCommandId(false);
}
else
snapshot = RegisterSnapshot(GetLatestSnapshot());

if (use_cache)
{
pk_rel = fpentry->pk_rel;
idx_rel = fpentry->idx_rel;
scandesc = fpentry->scandesc;
slot = fpentry->slot;
}
else
{
pk_rel = table_open(riinfo->pk_relid, RowShareLock);
idx_rel = index_open(riinfo->conindid, AccessShareLock);
scandesc = index_beginscan(pk_rel, idx_rel,
snapshot, NULL,
riinfo->nkeys, 0);
slot = table_slot_create(pk_rel, NULL);
}

--
Regards
Junwang Zhao

--
Thanks, Amit Langote

--
Regards
Junwang Zhao

I had an offline discussion with Amit today. There were a few small things
that could be improved, so I posted a new version of the patch set.

1.

+ if (ri_fastpath_is_applicable(riinfo))
+ {
+ bool found = ri_FastPathCheck(riinfo, fk_rel, newslot);
+
+ if (found)
+ return PointerGetDatum(NULL);
+
+ /*
+ * ri_FastPathCheck opens pk_rel internally; we need it for
+ * ri_ReportViolation.  Re-open briefly.
+ */
+ pk_rel = table_open(riinfo->pk_relid, RowShareLock);
+ ri_ReportViolation(riinfo, pk_rel, fk_rel,
+    newslot, NULL,
+    RI_PLAN_CHECK_LOOKUPPK, false, false);
+ }

Move ri_ReportViolation into ri_FastPathCheck, so table_open is no
longer needed, and ri_FastPathCheck now returns void. Since Amit
agreed this is the right approach, I included it directly in v5-0001.

2.

After adding the batch fast path, the original ri_FastPathCheck is only
used by the ALTER TABLE validation path. This path cannot use the
cache because the registered AfterTriggerBatch callback will never run.
Therefore, the use_cache branch can be removed.

I made this change in v5-0004 and also updated some related comments.
Once we agree the changes are correct, it can be merged into v5-0003.

3.

+ fk_slot = MakeSingleTupleTableSlot(RelationGetDescr(fk_rel),
+    &TTSOpsHeapTuple);

ri_FastPathBatchFlush creates a new fk_slot but does not cache it in
RI_FastPathEntry. I tried caching it in v5-0006 and ran some benchmarks,
it didn't show much improvement. This might be because the slot creation
function is called once per batch rather than once per row, so the overall
impact is minimal. I'm posting this here for Amit to take a look and decide
whether we should adopt it or drop it, since I mentioned the idea to
him earlier.

4.

ri_FastPathFlushArray currently uses SK_SEARCHARRAY only for
single-column checks. I asked whether this could be extended to support
multi-column cases, and Amit encouraged me to look into it.

After a brief investigation, it seems that ScanKeyEntryInitialize only allows
passing a single subtype/collation/procedure, which makes it difficult to
handle multiple types. Based on this, my current understanding is that
SK_SEARCHARRAY may not work for multi-column checks.

--
Regards
Junwang Zhao
<v5-0005-Use-SK_SEARCHARRAY-for-batched-fast-path-FK-probe.patch><v5-0006-Reuse-FK-tuple-slot-across-fast-path-batches.patch><v5-0002-Cache-per-batch-resources-for-fast-path-foreign-k.patch><v5-0004-Refine-fast-path-FK-validation-path.patch><v5-0003-Buffer-FK-rows-for-batched-fast-path-probing.patch><v5-0001-Add-fast-path-for-foreign-key-constraint-checks.patch>

#17Junwang Zhao
zhjwpku@gmail.com
In reply to: Haibo Yan (#16)
Re: Eliminating SPI / SQL from some RI triggers - take 3

Hi Haibo,

On Tue, Mar 17, 2026 at 8:28 AM Haibo Yan <tristan.yim@gmail.com> wrote:

Hi, Amit and Junwang

Thanks for the latest patch. I think the overall direction makes sense, and the single-column SK_SEARCHARRAY path looks like one of the most valuable optimizations here. The patch also seems to cover several important cases, including deferred constraints, duplicate FK values, and multi-column fallback behavior.

After reading through the patch, I have one major comment and a few smaller ones.

Thanks for your review.

1. TopTransactionContext usage during batched flush may be too coarse-grained
My biggest concern is the use of TopTransactionContext around the batched flush path.
As written, ri_FastPathBatchFlush() switches to TopTransactionContext before calling ri_FastPathFlushArray() / ri_FastPathFlushLoop(). That seems broad enough that temporary allocations made during the flush may end up there.
In particular, in ri_FastPathFlushArray(), I think the objects worth checking carefully are the pass-by-reference Datums returned by the per-element cast call and stored in search_vals[], e.g.

```search_vals[i] = FunctionCall3(&entry->cast_func_finfo, ...);```
If those cast results are separately allocated in the current memory context, then pfree(arr) only frees the constructed array object itself; it does not obviously free those intermediate cast results. If so, those allocations could survive until end of transaction rather than just until the end of the current flush.
Maybe this is harmless in practice, but I think it needs a closer look. It might be better to use a dedicated short-lived context for per-flush temporary allocations, reset it after each flush, or otherwise separate allocations that really need transaction lifetime from those that are only needed transiently during batched processing.

Yeah, that concern is reasonable. After a brief discussion with Amit,
we now replace the TopTransactionContext usage with two
purpose-specific contexts: scan_cxt for index AM allocations, freed
at teardown, and flush_cxt for per-flush transient work, reset after
each flush. TopTransactionContext is the parent of scan_cxt, and
scan_cxt is the parent of flush_cxt, so MemoryContextDelete(scan_cxt)
in teardown cleans up both.

These changes are in v7-0004.
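The reset-per-flush idea can be sketched with a toy bump allocator standing in for a MemoryContext (plain C, invented names, not PostgreSQL code): per-flush "cast result" allocations go into the flush context and a reset rewinds them wholesale, so memory use stays bounded by one batch instead of accumulating for the transaction's lifetime.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy bump allocator (NOT PostgreSQL's mcxt.c): alloc moves a cursor,
 * reset rewinds it.  Models "dedicated short-lived context, reset
 * after each flush".
 */
typedef struct Arena
{
	char		buf[8192];
	size_t		used;
	size_t		peak;
} Arena;

static void *
arena_alloc(Arena *a, size_t n)
{
	void	   *p = a->buf + a->used;

	a->used += n;
	if (a->used > a->peak)
		a->peak = a->used;
	return p;
}

static void
arena_reset(Arena *a)
{
	a->used = 0;
}

/* One batched flush: per-element cast results land in flush_cxt and
 * are released all at once when the flush is done. */
static void
flush_batch(Arena *flush_cxt, int batch_size)
{
	for (int i = 0; i < batch_size; i++)
		(void) arena_alloc(flush_cxt, 16);	/* a pass-by-ref Datum */
	arena_reset(flush_cxt);		/* frees all transient work */
}
```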

2. RI_FastPathEntry comment mentions the wrong function name
The comment above RI_FastPathEntry says it contains resources needed by ri_FastPathFlushBatch(), but the function is named ri_FastPathBatchFlush().

Fixed.

3. RI_FASTPATH_BATCH_SIZE needs some rationale
RI_FASTPATH_BATCH_SIZE = 64 may well be a reasonable compromise, but right now it reads like a magic number.
This choice seems especially relevant because the patch has two opposing effects:
3-1. larger batches should amortize the array-scan work better,
3-2. but the matched[] bookkeeping in ri_FastPathFlushArray() is O(batch_size^2) in the worst case.
So I think it would help to include at least a brief rationale in a comment or in the commit message.

Added.

4. Commit message says the entry stashes fk_relid, but the code actually stashes riinfo
The commit message says the entry stashes fk_relid and can reopen the relation if needed. Unless I am misreading it, the code actually stores riinfo and later uses riinfo->fk_relid. The distinction is small, but I think the wording should match the implementation more closely.

I changed the commit message to:

Since the FK relation may already be closed by flush time (e.g. for
deferred constraints at COMMIT), the flush path reopens the relation
using entry->riinfo->fk_relid if needed.

Thanks again for working on this.

Best regards,

Haibo Yan


int PK / int FK, 500K rows:
master: 1043 ms
0001..0004: 335 ms (3.1x)

The int/int case benefits most because the per-row cost is lower, so
the SAOP traversal savings are a larger fraction of the total. The
numeric/bigint case still sees a solid improvement despite the
cross-type cast overhead.

Tomas Vondra also tested an I/O-intensive workload (dataset larger
than shared_buffers, combined with his and Peter Geoghegan's I/O
prefetching patches). The upshot: on their own, the patches here
don't make a big dent when the main bottleneck is I/O, but the
batching + SAOP approach still pays off there once prefetching enters
the picture, per numbers he shared in an off-list email:

master: 161617 ms
ri-check (0001..0004): 149446 ms (1.08x)
ri-check + i/o prefetching: 50885 ms (3.2x)

So the RI patches alone only give ~8% here since most time is waiting
on reads. But the batching gives the prefetch machinery a window of
upcoming probes to issue readahead against, so the two together yield
3.2x.

impressive!

Tomas also caught a memory context bug in the batch flush path: the
cached scandesc lives in TopTransactionContext, but the btree AM
defers _bt_preprocess_keys allocation to the first getnext call, which
pallocs into CurrentMemoryContext. If that's a short-lived
per-trigger-row context, the scandesc has dangling pointers on the
next rescan. Fixed by switching to TopTransactionContext before the
probe loop.

Finally, I've fixed a number of other small and not-so-small bugs
found while polishing the old patches and made other stylistic
improvements. One notable change is that I introduced a FastPathMeta

Yeah, this is much better than the fpmeta_valid field.

struct to store the fast path metadata instead of dumping those arrays
in the RI_ConstraintInfo. It's allocated lazily on first use and holds
the per-key compare entries, operator procedures, and index strategy
info needed by the scan key construction, so RI_ConstraintInfo doesn't
pay for them when the fast path isn't used.

On Mon, Feb 23, 2026 at 10:45 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

Hi Amit,

On Thu, Feb 19, 2026 at 5:21 PM Amit Langote <amitlangote09@gmail.com> wrote:

Hi Junwang,

On Mon, Dec 1, 2025 at 3:09 PM Junwang Zhao <zhjwpku@gmail.com> wrote:

As Amit has already stated, we are approaching a hybrid "fast-path + fallback"
design.

0001 adds a fast path optimization for foreign key constraint checks
that bypasses the SPI executor, the fast path applies when the referenced
table is not partitioned, and the constraint does not involve temporal
semantics.

With the following test:

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);

head:

[local] zhjwpku@postgres:5432-90419=# insert into fk select
generate_series(1, 2000000, 2);
INSERT 0 1000000
Time: 13516.177 ms (00:13.516)

[local] zhjwpku@postgres:5432-90419=# update fk set a = a + 1;
UPDATE 1000000
Time: 15057.638 ms (00:15.058)

patched:

[local] zhjwpku@postgres:5432-98673=# insert into fk select
generate_series(1, 2000000, 2);
INSERT 0 1000000
Time: 8248.777 ms (00:08.249)

[local] zhjwpku@postgres:5432-98673=# update fk set a = a + 1;
UPDATE 1000000
Time: 10117.002 ms (00:10.117)

0002 cache fast-path metadata used by the index probe, at the current
time only comparison operator hash entries, operator function OIDs
and strategy numbers and subtypes for index scans. But this cache
doesn't buy any performance improvement.

Caching additional metadata should improve performance for foreign key checks.

Amit suggested introducing a mechanism for ri_triggers.c to register a
cleanup callback in the EState, which AfterTriggerEndQuery() could then
invoke to release per-statement cached metadata (such as the IndexScanDesc).
However, I haven't been able to implement this mechanism yet.

Thanks for working on this. I've taken your patches as a starting
point and reworked the series into two patches (attached): 1st is your
0001+0002 as the core patch that adds a gated fast-path alternative to
SPI and 2nd where I added per-statement resource caching. Doing the
latter turned out to be not so hard thanks to the structure you chose
to build the core fast path. Good call on adding the RLS and ACL test
cases, btw.

So, 0001 is a functionally complete fast path: concurrency handling,
REPEATABLE READ crosscheck, cross-type operators, security context,
and metadata caching. 0002 implements the per-statement resource
caching we discussed, though instead of sharing the EState between
trigger.c and ri_triggers.c it uses a new AfterTriggerBatchCallback
mechanism that fires at the end of each trigger-firing cycle
(per-statement for immediate constraints, or until COMMIT for deferred
ones). It layers resource caching on top so that the PK relation,
index, scan descriptor, and snapshot stay open across all FK trigger
invocations within a single trigger-firing cycle rather than being
opened and closed per row.

Note that the previous 0002 (metadata caching) is folded into 0001,
and most of the new fast-path logic added in 0001 now lives in
ri_FastPathCheck() rather than inline in RI_FKey_check(), so the
RI_FKey_check diff is just the gating call and SPI fallback.

I re-ran the benchmarks (same test as yours, different machine):

create table pk (a numeric primary key);
create table fk (a bigint references pk);
insert into pk select generate_series(1, 2000000);
insert into fk select generate_series(1, 2000000, 2);

master: 2444 ms (median of 3 runs)
0001: 1382 ms (43% faster)
0001+0002: 1202 ms (51% faster, 13% over 0001 alone)

I can get a similar improvement on my old Intel Mac:

master: 12963.993 ms
0001: 6641.692 ms, 48.8% faster
0001+0002: 5771.703 ms, 55.5% faster

Also, with int PK / int FK (1M rows):

create table pk (a int primary key);
create table fk (a int references pk);
insert into pk select generate_series(1, 1000000);
insert into fk select generate_series(1, 1000000);

master: 1000 ms
0001: 520 ms (48% faster)
0001+0002: 432 ms (57% faster, 17% over 0001 alone)

master: 11134.583 ms
0001: 5240.298 ms, 52.9% faster
0001+0002: 4554.215 ms, 59.1% faster

The incremental gain from 0002 comes from eliminating per-row relation
open/close, scan begin/end, slot alloc/free, and replacing per-row
GetSnapshotData() with only curcid adjustment on the registered
snapshot copy in the cache.

The two current limitations are partitioned referenced tables and
temporal foreign keys. Partitioned PKs are relatively uncommon in
practice, so the non-partitioned case should cover most FK workloads;
I'm not sure it's worth the added complexity to support them.
Temporal FKs are inherently multi-row, so they're a poor fit for a
single-probe fast path.

David Rowley mentioned off-list that it might be worth batching
multiple FK values into a single index probe, leveraging the
ScalarArrayOp btree improvements from PostgreSQL 17. The idea would be
to buffer FK values across trigger invocations in the per-constraint
cache (0002 already has the right structure for this), build a
SK_SEARCHARRAY scan key, and let the btree AM walk the matching leaf
pages in one sorted traversal instead of one tree descent per row. The
locking and recheck would still be per-tuple, but the index traversal
cost drops significantly. Single-column FKs are the obvious starting
point. That seems worth exploring but can be done as a separate patch
on top of this.

I will take a look at this in the following weeks.

I think the series is in reasonable shape but would appreciate extra
eyeballs, especially on the concurrency handling in ri_LockPKTuple()
in 0001 and the snapshot lifecycle in 0002. Or anything else that
catches one's eye.

--
Thanks, Amit Langote

I don't have any additional comments on the patch except one minor nit,
maybe merge the following two if conditions into one, not a strong opinion
though.

if (use_cache)
{
    /*
     * The snapshot was registered once when the cache entry was created.
     * We just patch curcid to reflect the new command counter.
     * SnapshotSetCommandId() only patches process-global statics, not
     * registered copies, so we do it directly.
     *
     * The xmin/xmax/xip fields don't need refreshing: within a single
     * statement batch, only curcid changes between rows.
     */
    Assert(fpentry && fpentry->snapshot != NULL);
    snapshot = fpentry->snapshot;
    snapshot->curcid = GetCurrentCommandId(false);
}
else
    snapshot = RegisterSnapshot(GetLatestSnapshot());

if (use_cache)
{
    pk_rel = fpentry->pk_rel;
    idx_rel = fpentry->idx_rel;
    scandesc = fpentry->scandesc;
    slot = fpentry->slot;
}
else
{
    pk_rel = table_open(riinfo->pk_relid, RowShareLock);
    idx_rel = index_open(riinfo->conindid, AccessShareLock);
    scandesc = index_beginscan(pk_rel, idx_rel,
                               snapshot, NULL,
                               riinfo->nkeys, 0);
    slot = table_slot_create(pk_rel, NULL);
}

--
Regards
Junwang Zhao

--
Thanks, Amit Langote

--
Regards
Junwang Zhao

I had an offline discussion with Amit today. There were a few small things
that could be improved, so I posted a new version of the patch set.

1.

+   if (ri_fastpath_is_applicable(riinfo))
+   {
+       bool found = ri_FastPathCheck(riinfo, fk_rel, newslot);
+
+       if (found)
+           return PointerGetDatum(NULL);
+
+       /*
+        * ri_FastPathCheck opens pk_rel internally; we need it for
+        * ri_ReportViolation.  Re-open briefly.
+        */
+       pk_rel = table_open(riinfo->pk_relid, RowShareLock);
+       ri_ReportViolation(riinfo, pk_rel, fk_rel,
+                          newslot, NULL,
+                          RI_PLAN_CHECK_LOOKUPPK, false, false);
+   }

Move ri_ReportViolation into ri_FastPathCheck, so table_open is no
longer needed, and ri_FastPathCheck now returns void. Since Amit
agreed this is the right approach, I included it directly in v5-0001.

2.

After adding the batch fast path, the original ri_FastPathCheck is only
used by the ALTER TABLE validation path. This path cannot use the
cache because the registered AfterTriggerBatch callback will never run.
Therefore, the use_cache branch can be removed.

I made this change in v5-0004 and also updated some related comments.
Once we agree the changes are correct, it can be merged into v5-0003.

3.

+   fk_slot = MakeSingleTupleTableSlot(RelationGetDescr(fk_rel),
+                                      &TTSOpsHeapTuple);

ri_FastPathBatchFlush creates a new fk_slot but does not cache it in
RI_FastPathEntry. I tried caching it in v5-0006 and ran some
benchmarks; it didn't show much improvement, presumably because the
slot creation function is called once per batch rather than once per
row, so the overall impact is minimal. I'm posting it here for Amit
to take a look and decide whether we should adopt or drop it, since I
mentioned the idea to him earlier.

4.

ri_FastPathFlushArray currently uses SK_SEARCHARRAY only for
single-column checks. I asked whether this could be extended to support
multi-column cases, and Amit encouraged me to look into it.

After a brief investigation, it seems that ScanKeyEntryInitialize only allows
passing a single subtype/collation/procedure, which makes it difficult to
handle multiple types. Based on this, my current understanding is that
SK_SEARCHARRAY may not work for multi-column checks.

--
Regards
Junwang Zhao
Attachments:

v5-0001-Add-fast-path-for-foreign-key-constraint-checks.patch
v5-0002-Cache-per-batch-resources-for-fast-path-foreign-k.patch
v5-0003-Buffer-FK-rows-for-batched-fast-path-probing.patch
v5-0004-Refine-fast-path-FK-validation-path.patch
v5-0005-Use-SK_SEARCHARRAY-for-batched-fast-path-FK-probe.patch
v5-0006-Reuse-FK-tuple-slot-across-fast-path-batches.patch

--
Regards
Junwang Zhao

Attachments:

v7-0001-Add-fast-path-for-foreign-key-constraint-checks.patch
v7-0002-Cache-per-batch-resources-for-fast-path-foreign-k.patch
v7-0003-Batch-FK-rows-and-use-SK_SEARCHARRAY-for-fast-pat.patch
v7-0004-Refactor-RI-fast-path-to-use-scan_cxt-and-flush_c.patch