AIO writes vs hint bits vs checksums

Started by Andres Freund over 1 year ago, 24 messages
#1 Andres Freund
andres@anarazel.de

Hi,

Currently we modify pages while just holding a share lock, for hint bit
writes. Writing a buffer out only requires a share lock. Because of that we
can't compute checksums and write out pages in-place, as a concurrent hint bit
write can easily corrupt the checksum.

That's not great, but not awful for our current synchronous buffered IO, as
we only ever have a single page being written out at a time.
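
For context, the current synchronous write path sidesteps the race by
checksumming a copy of the page rather than the shared buffer itself. Roughly
along these lines - a simplified sketch, the real code being
PageSetChecksumCopy() plus FlushBuffer():

/* simplified sketch of today's copy-then-checksum approach */
static char pageCopy[BLCKSZ];

static char *
checksum_stable_copy(Page page, BlockNumber blkno)
{
	/*
	 * Copy the page while other backends may still be flipping hint bits
	 * in the original, then compute the checksum over the stable copy.
	 * The copy, not the shared buffer, is what gets handed to the kernel.
	 */
	memcpy(pageCopy, (char *) page, BLCKSZ);
	((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
	return pageCopy;
}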

However, it becomes a problem even if we just want to write out in chunks
larger than a single page - we'd need to reserve not just one BLCKSZ sized
buffer for this, but make it PG_IOV_MAX * BLCKSZ sized. Perhaps still
tolerable.

With AIO this becomes a considerably bigger issue:
a) We can't just have one write in progress at a time, but many
b) To be able to implement AIO using workers the "source" or "target" memory
of the IO needs to be in shared memory

Besides that, the need to copy the buffers makes checkpoints with AIO
noticeably slower when checksums are enabled - it's not the checksum but the
copy that's the biggest source of the slowdown.

So far the AIO patchset has solved this by introducing a set of "bounce
buffers", which can be acquired and used as the source/target of IO when doing
it in-place into shared buffers isn't viable.

I am worried about that solution however, as either acquisition of bounce
buffers becomes a performance issue (that's how I did it at first, it was hard
to avoid regressions) or we reserve bounce buffers for each backend, in which
case the memory overhead for instances with relatively small amount of
shared_buffers and/or many connections can be significant.

Which led me down the path of trying to avoid the need for the copy in the
first place: what if we don't modify pages while they're undergoing IO?

The naive approach would be to not set hint bits with just a shared lock - but
that doesn't seem viable at all. For performance we rely on hint bits being
set and in many cases we'll only encounter the page in shared mode. We could
implement a conditional lock upgrade to an exclusive lock and do so while
setting hint bits, but that'd obviously be concerning from a concurrency point
of view.

What I suspect we might want instead is something in between a share and an
exclusive lock, which is taken while setting a hint bit and which conflicts
with having an IO in progress.

On first blush it might sound attractive to introduce this on the level of
lwlocks. However, I doubt that is a good idea - it'd make lwlock.c more
complicated which would imply overhead for other users, while the precise
semantics would be fairly specific to buffer locking. A variant of this would
be to generalize lwlock.c to allow implementing different kinds of locks more
easily. But that's a significant project on its own and doesn't really seem
necessary for this specific project.

What I'd instead like to propose is to implement the right to set hint bits as
a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
BM_SETTING_HINTS to be unset to start IO.
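
A minimal sketch of the acquisition side, assuming BM_SETTING_HINTS is carved
out of the buffer state bits (waiting, error handling and the StartBufferIO()
side are elided):

/* sketch only; the real thing would live in bufmgr.c */
static bool
BufferTryStartSettingHints(BufferDesc *buf)
{
	uint32		old_state = pg_atomic_read_u32(&buf->state);

	for (;;)
	{
		/* can't start setting hint bits while the buffer is being written */
		if (old_state & BM_IO_IN_PROGRESS)
			return false;

		if (pg_atomic_compare_exchange_u32(&buf->state, &old_state,
										   old_state | BM_SETTING_HINTS))
			return true;
		/* old_state was refreshed by the failed CAS, retry */
	}
}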

Naively implementing this, by acquiring and releasing the permission to set
hint bits in SetHintBits() unfortunately leads to a significant performance
regression. While the performance is unaffected for OLTPish workloads like
pgbench (both read and write), sequential scans of unhinted tables regress
significantly, due to the per-tuple lock acquisition this would imply.

But: We can address this and improve performance over the status quo! Today we
determine tuple visibility one-by-one, even when checking the
visibility of an entire page worth of tuples. That's not exactly free. I've
prototyped checking visibility of an entire page of tuples at once and it
indeed speeds up visibility checks substantially (in some cases seqscans are
over 20% faster!).

Once we have page-level visibility checks we can get the right to set hint
bits once for an entire page instead of doing it for every tuple - with that
in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
wins.
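
Roughly what I have in mind, as a hypothetical sketch (the helper and the way
the permission is threaded into the visibility code are made up here):

/* hypothetical sketch of a batched visibility check in page mode */
static void
heap_page_collect_visible(HeapScanDesc scan, Buffer buffer, Page page)
{
	OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
	OffsetNumber off;
	bool		can_hint;

	/* acquire the right to set hint bits once for the whole page */
	can_hint = BufferPrepareToSetHintBits(buffer);

	scan->rs_ntuples = 0;
	for (off = FirstOffsetNumber; off <= maxoff; off++)
	{
		ItemId		lpp = PageGetItemId(page, off);
		HeapTupleData tup;

		if (!ItemIdIsNormal(lpp))
			continue;

		tup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
		tup.t_len = ItemIdGetLength(lpp);
		tup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
		ItemPointerSet(&tup.t_self, BufferGetBlockNumber(buffer), off);

		/* the visibility code would only set hint bits if can_hint is true */
		if (HeapTupleSatisfiesVisibility(&tup, scan->rs_base.rs_snapshot, buffer))
			scan->rs_vistuples[scan->rs_ntuples++] = off;
	}

	if (can_hint)
		BufferFinishSetHintBits(buffer);
}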

Having a page level approach to setting hint bits has other advantages:

E.g. today, with wal_log_hints, we'll log hint bits on the first hint bit set
on the page and we don't mark a page dirty on hot standby, which often will
result in hint bits not being persistently set on replicas until the page is frozen.

Another issue is that we'll often WAL log hint bits for a page (due to hint
bits being set), just to then immediately log another WAL record for the same
page (e.g. for pruning), which is obviously wasteful. With a different
interface we could combine the WAL records for both.

I've not prototyped either, but I'm fairly confident they'd be helpful.

Does this sound like a reasonable idea? Counterpoints? If it does sound
reasonable, I'll clean up my pile of hacks into something
semi-understandable...

Greetings,

Andres Freund

#2 Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#1)
Re: AIO writes vs hint bits vs checksums

On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:

So far the AIO patchset has solved this by introducing a set of "bounce
buffers", which can be acquired and used as the source/target of IO when doing
it in-place into shared buffers isn't viable.

I am worried about that solution however, as either acquisition of bounce
buffers becomes a performance issue (that's how I did it at first, it was hard
to avoid regressions) or we reserve bounce buffers for each backend, in which
case the memory overhead for instances with relatively small amount of
shared_buffers and/or many connections can be significant.

But: We can address this and improve performance over the status quo! Today we
determine tuple visibility one-by-one, even when checking the
visibility of an entire page worth of tuples. That's not exactly free. I've
prototyped checking visibility of an entire page of tuples at once and it
indeed speeds up visibility checks substantially (in some cases seqscans are
over 20% faster!).

Nice! It sounds like you refactored the relationship between
heap_prepare_pagescan() and HeapTupleSatisfiesVisibility() to move the hint
bit setting upward or the iterate-over-tuples downward. Is that about right?

Once we have page-level visibility checks we can get the right to set hint
bits once for an entire page instead of doing it for every tuple - with that
in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
wins.

How did page-level+BM_SETTING_HINTS performance compare to performance of the
page-level change w/o the BM_SETTING_HINTS change?

Having a page level approach to setting hint bits has other advantages:

E.g. today, with wal_log_hints, we'll log hint bits on the first hint bit set
on the page and we don't mark a page dirty on hot standby, which often will
result in hint bits not being persistently set on replicas until the page is frozen.

Nice way to improve that.

Does this sound like a reasonable idea? Counterpoints?

I guess the main part left to discuss is index scans or other scan types where
we'd either not do page-level visibility or we'd do page-level visibility
including tuples we wouldn't otherwise use. BM_SETTING_HINTS likely won't
show up so readily in index scan profiles, but the cost is still there. How
should we think about comparing the distributed cost of the buffer header
manipulations during index scans vs. the costs of bounce buffers?

Thanks,
nm

#3 Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#2)
Re: AIO writes vs hint bits vs checksums

Hi,

On 2024-09-24 12:43:40 -0700, Noah Misch wrote:

On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:

So far the AIO patchset has solved this by introducing a set of "bounce
buffers", which can be acquired and used as the source/target of IO when doing
it in-place into shared buffers isn't viable.

I am worried about that solution however, as either acquisition of bounce
buffers becomes a performance issue (that's how I did it at first, it was hard
to avoid regressions) or we reserve bounce buffers for each backend, in which
case the memory overhead for instances with relatively small amount of
shared_buffers and/or many connections can be significant.

But: We can address this and improve performance over the status quo! Today we
determine tuple visibility one-by-one, even when checking the
visibility of an entire page worth of tuples. That's not exactly free. I've
prototyped checking visibility of an entire page of tuples at once and it
indeed speeds up visibility checks substantially (in some cases seqscans are
over 20% faster!).

Nice! It sounds like you refactored the relationship between
heap_prepare_pagescan() and HeapTupleSatisfiesVisibility() to move the hint
bit setting upward or the iterate-over-tuples downward. Is that about right?

I've tried about five variations, so I don't have one answer to this yet :).

One problem is that having repeated loops doing PageGetItemId(),
PageGetItem(), ItemIdGetLength() isn't exactly free. To some degree it can be
hidden by allowing for better superscalar execution, but not entirely.

I've been somewhat confused by the compiler generated code around ItemId
handling for a while, it looks way more expensive than it should - it
regularly is a bottleneck due to the sheer number of instructions being
executed leading to being frontend bound. But never quite looked into it
deeply enough to figure out what's causing it / how to fix it.

Once we have page-level visibility checks we can get the right to set hint
bits once for an entire page instead of doing it for every tuple - with that
in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
wins.

How did page-level+BM_SETTING_HINTS performance compare to performance of the
page-level change w/o the BM_SETTING_HINTS change?

Just ran that. There probably is a performance difference, but it's small
(<0.5%) making it somewhat hard to be certain. It looks like the main reason
for that is that ConditionVariableBroadcast() on the iocv shows up even though
nobody is waiting.

I've been fighting that with AIO as well, so maybe it's time to figure out the
memory ordering rules that'd allow checking that without a full spinlock
acquisition.

If we figure it out, we perhaps should use the chance to get rid of
BM_PIN_COUNT_WAITER...
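
For illustration, the kind of fast path I'm thinking of - whether a plain
memory barrier is actually sufficient here is exactly the open question:

/* hypothetical fast path; the ordering argument still needs to be made */
static void
ConditionVariableBroadcastIfWaiters(ConditionVariable *cv)
{
	/*
	 * Waiters add themselves to cv->wakeup before re-checking their exit
	 * condition.  If we can guarantee that we observe a wakeup list at
	 * least as new as the state change we just made, an empty list means
	 * there is nobody to wake and the spinlock can be skipped entirely.
	 */
	pg_memory_barrier();

	if (proclist_is_empty(&cv->wakeup))
		return;

	/* somebody might be waiting, fall back to the regular slow path */
	ConditionVariableBroadcast(cv);
}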

Does this sound like a reasonable idea? Counterpoints?

I guess the main part left to discuss is index scans or other scan types where
we'd either not do page-level visibility or we'd do page-level visibility
including tuples we wouldn't otherwise use. BM_SETTING_HINTS likely won't
show up so readily in index scan profiles, but the cost is still there.

I could indeed not make it show up in some simple index lookup heavy
workloads. I need to try some more extreme cases though (e.g. fetching all
tuples in a table via an index or having very long HOT chains).

If it's not visible cost-wise compared to all the other costs of index scans -
does it matter? If it's not visible it's either because it proportionally is
very small or because it's completely hidden by superscalar execution.

Another thing I forgot to mention that probably fits into the "tradeoffs"
bucket: Because BM_SETTING_HINTS would be just a single bit, one backend
setting hint bits would block out another backend setting hint bits. In most
situations that'll be fine, or even faster than not doing so due to reducing
cache line ping-pong, but in cases of multiple backends doing index lookups to
different unhinted tuples on the same page it could be a bit worse.

But I suspect that's fine because it's hard to believe that you could have
enough index lookups to unhinted tuples for that to be a bottleneck -
something has to produce all those unhinted tuples after all, and that's
rather more expensive. And for single-tuple visibility checks the window in
which hint bits are set is very small.

How should we think about comparing the distributed cost of the buffer
header manipulations during index scans vs. the costs of bounce buffers?

Well, the cost of bounce buffers would be borne as long as postgres is up,
whereas a not-measurable (if it indeed isn't) cost during index scans wouldn't
really show up. If there are cases where the cost of the more expensive hint
bit logic does show up, it'll get a lot harder to weigh.

So far my prototype uses the path that avoids hint bits being set while IO is
going on all the time, not just when checksums are enabled. We could change
that, I guess. However, our habit of modifying buffers while IO is going on is
causing issues with filesystem level checksums as well, as evidenced by the
fact that debug_io_direct = data on btrfs causes filesystem corruption. So I
tend to think it'd be better to just stop doing that altogether (we also do
that for WAL, when writing out a partial page, but a potential fix there would
be different, I think).

A thing I forgot to mention: Bounce buffers are kind of an architectural dead
end, in that we wouldn't need them in that form if we get to a threaded
postgres.

Zooming out (a lot) more: I like the idea of having a way to get the
permission to perform some kinds of modifications on a page without an
exclusive lock. While obviously a lot more work, I do think there's some
potential to have some fast-paths that perform work on a page level without
blocking out readers. E.g. some simple cases of insert could correctly be done
without blocking out readers (by ordering the update of the max page offset

Greetings,

Andres Freund

#4 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#3)
Re: AIO writes vs hint bits vs checksums

On Wed, Sep 25, 2024 at 8:30 AM Andres Freund <andres@anarazel.de> wrote:

Just ran that. There probably is a performance difference, but it's small
(<0.5%) making it somewhat hard to be certain. It looks like the main reason
for that is that ConditionVariableBroadcast() on the iocv shows up even though
nobody is waiting.

. o O { Gotta fix that. Memory barriers might be enough to check for
empty wait list?, and even in the slow path, atomic wait lists or
something better than spinlocks... }

However, our habit of modifying buffers while IO is going on is
causing issues with filesystem level checksums as well, as evidenced by the
fact that debug_io_direct = data on btrfs causes filesystem corruption. So I
tend to think it'd be better to just stop doing that altogether (we also do
that for WAL, when writing out a partial page, but a potential fix there would
be different, I think).

+many. Interesting point re the WAL variant. For the record, here's
some discussion and a repro for that problem, which Andrew currently
works around in a build farm animal with mount options:

/messages/by-id/CA+hUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw@mail.gmail.com

#5 Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#3)
Re: AIO writes vs hint bits vs checksums

On Tue, Sep 24, 2024 at 04:30:25PM -0400, Andres Freund wrote:

On 2024-09-24 12:43:40 -0700, Noah Misch wrote:

On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:

Besides that, the need to copy the buffers makes checkpoints with AIO
noticeably slower when checksums are enabled - it's not the checksum but the
copy that's the biggest source of the slowdown.

How big is that copy's contribution to the slowdown there? A measurable CPU
overhead on writes likely does outweigh the unmeasurable overhead on index
scans, but ...

Does this sound like a reasonable idea? Counterpoints?

How should we think about comparing the distributed cost of the buffer
header manipulations during index scans vs. the costs of bounce buffers?

Well, the cost of bounce buffers would be borne as long as postgres is up,
whereas a not-measurable (if it indeed isn't) cost during index scans wouldn't
really show up.

... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad
decision. From what I've heard so far of the performance effects, if it were
me, I would keep the bounce buffers. I'd pursue BM_SETTING_HINTS and bounce
buffer removal as a distinct project after the main AIO capability. Bounce
buffers have an implementation. They aren't harming other design decisions.
The AIO project is big, so I'd want to err on the side of not designating
other projects as its prerequisites.

Zooming out (a lot) more: I like the idea of having a way to get the
permission to perform some kinds of modifications on a page without an
exclusive lock. While obviously a lot more work, I do think there's some
potential to have some fast-paths that perform work on a page level without
blocking out readers. E.g. some simple cases of insert could correctly be done
without blocking out readers (by ordering the update of the max page offset

True.

#6 Antonin Houska
ah@cybertec.at
In reply to: Andres Freund (#1)
Re: AIO writes vs hint bits vs checksums

Andres Freund <andres@anarazel.de> wrote:

What I'd instead like to propose is to implement the right to set hint bits as
a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
BM_SETTING_HINTS to be unset to start IO.

Naively implementing this, by acquiring and releasing the permission to set
hint bits in SetHintBits() unfortunately leads to a significant performance
regression. While the performance is unaffected for OLTPish workloads like
pgbench (both read and write), sequential scans of unhinted tables regress
significantly, due to the per-tuple lock acquisition this would imply.

An alternative approach: introduce a flag that tells that the checksum is
being computed, and disallow setting hint bits when that flag is set. As long
as the checksum computation takes much less time than the IO, fewer hint
bit updates should be rejected.

Of course, SetHintBits() would have to update the checksum too. But if it
could determine where the hint bit is located in the buffer and if "some
intermediate state" of the computation was maintained for each page in shared
buffers, then the checksum update might be cheaper than the initial
computation. But I'm not sure I understand the algorithm enough.
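
For illustration, the first part could look roughly like this, with a
hypothetical BM_CHECKSUM_IN_PROGRESS flag in the buffer state (the incremental
checksum update is left out, since it depends on the algorithm cooperating):

/* heavily simplified sketch of a SetHintBits() variant with such a flag */
static void
SetHintBitsIfAllowed(HeapTupleHeader tuple, Buffer buffer, uint16 infomask)
{
	BufferDesc *buf = GetBufferDescriptor(buffer - 1);

	/* losing a hint bit is always acceptable, so just give up */
	if (pg_atomic_read_u32(&buf->state) & BM_CHECKSUM_IN_PROGRESS)
		return;

	tuple->t_infomask |= infomask;
	MarkBufferDirtyHint(buffer, true);
}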

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#7 Antonin Houska
ah@cybertec.at
In reply to: Antonin Houska (#6)
Re: AIO writes vs hint bits vs checksums

Antonin Houska <ah@cybertec.at> wrote:

Andres Freund <andres@anarazel.de> wrote:

What I'd instead like to propose is to implement the right to set hint bits as
a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
BM_SETTING_HINTS to be unset to start IO.

Naively implementing this, by acquiring and releasing the permission to set
hint bits in SetHintBits() unfortunately leads to a significant performance
regression. While the performance is unaffected for OLTPish workloads like
pgbench (both read and write), sequential scans of unhinted tables regress
significantly, due to the per-tuple lock acquisition this would imply.

An alternative approach: introduce a flag that tells that the checksum is
being computed, and disallow setting hint bits when that flag is set. As long
as the checksum computation takes much less time than the IO, fewer hint
bit updates should be rejected.

Well, the checksum actually should not be computed during the IO, so the IO
would still disallow hint bit updates :-(

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#8 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#4)
Re: AIO writes vs hint bits vs checksums

On Wed, Sep 25, 2024 at 12:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Sep 25, 2024 at 8:30 AM Andres Freund <andres@anarazel.de> wrote:

However, our habit of modifying buffers while IO is going on is
causing issues with filesystem level checksums as well, as evidenced by the
fact that debug_io_direct = data on btrfs causes filesystem corruption. So I
tend to think it'd be better to just stop doing that altogether (we also do
that for WAL, when writing out a partial page, but a potential fix there would
be different, I think).

+many. Interesting point re the WAL variant. For the record, here's
some discussion and a repro for that problem, which Andrew currently
works around in a build farm animal with mount options:

/messages/by-id/CA+hUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw@mail.gmail.com

Here's an interesting new development in that area, this time from
OpenZFS, which committed its long awaited O_DIRECT support a couple of
weeks ago[1] and seems to have taken a different direction since that
last discussion. Clearly it has the same checksum stability problem
as BTRFS and PostgreSQL itself, so an O_DIRECT mode with the goal of
avoiding copying and caching must confront that and break *something*,
or accept something like bounce buffers and give up the zero-copy
goal. Curiously, they seem to have landed on two different solutions
with three different possible behaviours: (1) On FreeBSD, temporarily
make the memory non-writeable, (2) On Linux, they couldn't do that so
they have an extra checksum verification on write. I haven't fully
grokked all this yet, or even tried it, and it's not released or
anything, but it looks a bit like all three behaviours are bad for our
current hint bit design: on FreeBSD, setting a hint bit might crash
(?) if a write is in progress in another process, and on Linux,
depending on zfs_vdev_direct_write_verify, either the concurrent write
might fail (= checkpointer failing on EIO because someone concurrently
set a hint bit) or a later read might fail (= file is permanently
corrupted and you don't find out until later, like btrfs). I plan to
look more closely soon and see if I understood that right...

[1]: https://github.com/openzfs/zfs/pull/10018/commits/d7b861e7cfaea867ae28ab46ab11fba89a5a1fda

#9 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#3)
13 attachment(s)
Re: AIO writes vs hint bits vs checksums

Hi,

Attached is a, unfortunately long, series of patches implementing what I
described upthread.

0001 Add very basic test for kill_prior_tuples

We currently don't exercise this code path for gist and hash, which seems
somewhat criminal.

0002 heapam: Move logic to handle HEAP_MOVED into a helper function

A prep patch, which consolidates HEAP_MOVED handling into a helper. This
isn't strictly necessary, but I got bothered by it while writing this patch
series.

0003 bufmgr: Add BufferLockHeldByMe()

Useful for writing assertions in later patches.
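
The exact signature is a guess, but the intended use is along the lines of:

	/* e.g. in code that previously could only assert on the pin */
	Assert(BufferIsPinned(buffer));
	Assert(BufferLockHeldByMe(buffer));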

0004 heapam: Use exclusive lock on old page in CLUSTER

When I started looking into this I wanted to make sure that it's actually
safe to skip setting hint bits in all places. One place that's *not* safe
is cluster - we assume that all hint bits are set in rewrite_heap_tuple().

This might not strictly be required, but it seemed very fragile as
is. While using an exclusive lock does seem like it makes it a bit safer,
it's not a very good fix, as we still dirty the old buffer, which seems
like a shame when one uses VACUUM FULL.

0005 heapam: Only set tuple's block once per page in pagemode

Minor, but measurable, optimization.

0006 heapam: Add batch mode mvcc check and use it in page mode

This is a good bit faster on its own, but is required to avoid a
performance regression later, when setting hint bits only when no IO is
going on at the same time.

0007 bufmgr: Make it easier to change number of buffer state bits

This just makes it easier to change the division of bits in
BufferDesc->state.

0008 bufmgr: Add interface to acquire right to set hint bits

The main change. Adds BufferPrepareToSetHintBits() which, if true is
returned, gives the right to modify a buffer without an exclusive
lock. Once the modification is done BufferFinishSetHintBits() needs to be
called.

I dithered a bit on the name, we don't really call them hint bits in all
the places that modify buffers without an exclusive lock.

The new infra is unused as of this commit.
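
The intended usage pattern, with the argument lists guessed from the
description above:

	if (BufferPrepareToSetHintBits(buffer))
	{
		tuple->t_infomask |= infomask;	/* the hint-bit style modification */
		MarkBufferDirtyHint(buffer, true);
		BufferFinishSetHintBits(buffer);
	}
	/* otherwise skip the hint bit for now, which is always safe */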

0009 heapam: Acquire right to set hint bits

Moves over heapam to the new interface.

0010 Acquire right to set hint bits in the remaining places

Moves indexams and freespace/ over to the new interface.

0011 bufmgr: Detect some missing BufferPrepareToSetHintBits() calls

Now that we don't set hint bits without calling
BufferPrepareToSetHintBits() first, we can detect many missing calls by
adding an assertion to MarkBufferDirtyHint().
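
Conceptually something like this inside MarkBufferDirtyHint() - whether the
check looks exactly like that is a guess:

	BufferDesc *buf = GetBufferDescriptor(buffer - 1);

	/* callers must hold the hint-bit permission or an exclusive lock */
	Assert((pg_atomic_read_u32(&buf->state) & BM_SETTING_HINTS) ||
		   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(buf),
								LW_EXCLUSIVE));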

0012 bufmgr: Don't copy pages while writing out

Finally remove the copy of the buffer while writing it out.

0013 WIP: bufmgr: Detect some bad buffer accesses

This is a lot more thorough infrastructure for detecting modifications of
buffers without appropriate locks, by using mprotect() to detect such
cases. Only works when not using huge pages, and not on windows (not sure
how mprotect() is spelled there) and not arm macos (due to 16kB pages).

Has two modes, one just disallows modifications, and is reasonably
cheap. The more expensive mode disallows any access when the buffer is
unpinned. The commit message has some performance numbers.

This has proven to be a good investment as far as this thread goes - I
wasn't initially aware that freespace modified buffers without a lock and
this quickly found that.
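
Conceptually something like the below, ignoring the details about page
alignment, huge pages and where exactly the protection changes happen:

#include <sys/mman.h>

/* conceptual sketch; assumes shared buffers are suitably aligned */
static void
BufferProtect(BufferDesc *buf, bool allow_read, bool allow_write)
{
	int			prot = PROT_NONE;

	if (allow_read)
		prot |= PROT_READ;
	if (allow_write)
		prot |= PROT_WRITE;

	if (mprotect(BufHdrGetBlock(buf), BLCKSZ, prot) != 0)
		elog(PANIC, "mprotect failed: %m");
}

/*
 * Cheap mode: buffers stay read-only unless an exclusive lock (or the
 * hint-bit permission) is held.  Expensive mode: PROT_NONE whenever the
 * buffer isn't pinned, so any access without a pin faults immediately.
 */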

Greetings,

Andres Freund

Attachments:

v1-0001-Add-very-basic-test-for-kill_prior_tuples.patch (text/x-diff; charset=us-ascii)
From 13ce1a3c5840e3d76a25fdcbeeda24fe4d2d4017 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 Oct 2024 15:48:50 -0400
Subject: [PATCH v1 01/14] Add very basic test for kill_prior_tuples

Previously our tests did not exercise kill_prior_tuples for hash and gist. The
latter only because it does not work if all the dead tuples are on one
page (or something like that).

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 .../isolation/expected/index-killtuples.out   | 338 ++++++++++++++++++
 .../isolation/specs/index-killtuples.spec     | 124 +++++++
 2 files changed, 462 insertions(+)
 create mode 100644 src/test/isolation/expected/index-killtuples.out
 create mode 100644 src/test/isolation/specs/index-killtuples.spec

diff --git a/src/test/isolation/expected/index-killtuples.out b/src/test/isolation/expected/index-killtuples.out
new file mode 100644
index 00000000000..1361b60c267
--- /dev/null
+++ b/src/test/isolation/expected/index-killtuples.out
@@ -0,0 +1,338 @@
+unused step name: delete_1
+Parsed test spec with 1 sessions
+
+starting permutation: create_table fill_500 create_btree flush disable_seq disable_bitmap measure access flush result measure access flush result delete flush measure access flush result measure access flush result drop_table
+step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
+step fill_500: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 500) g(i);
+step create_btree: CREATE INDEX kill_prior_tuple_btree ON kill_prior_tuple USING btree (key);
+step flush: SELECT FROM pg_stat_force_next_flush();
+step disable_seq: SET enable_seqscan = false;
+step disable_bitmap: SET enable_bitmapscan = false;
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                         
+-----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_btree on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                            
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                         
+-----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_btree on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                            
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step delete: DELETE FROM kill_prior_tuple;
+step flush: SELECT FROM pg_stat_force_next_flush();
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                         
+-----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_btree on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                            
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                         
+-----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_btree on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                            
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                0
+(1 row)
+
+step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
+
+starting permutation: create_table fill_500 create_ext_btree_gist create_gist flush disable_seq disable_bitmap measure access flush result measure access flush result delete flush measure access flush result measure access flush result drop_table drop_ext_btree_gist
+step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
+step fill_500: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 500) g(i);
+step create_ext_btree_gist: CREATE EXTENSION btree_gist;
+step create_gist: CREATE INDEX kill_prior_tuple_gist ON kill_prior_tuple USING gist (key);
+step flush: SELECT FROM pg_stat_force_next_flush();
+step disable_seq: SET enable_seqscan = false;
+step disable_bitmap: SET enable_bitmapscan = false;
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step delete: DELETE FROM kill_prior_tuple;
+step flush: SELECT FROM pg_stat_force_next_flush();
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                0
+(1 row)
+
+step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
+step drop_ext_btree_gist: DROP EXTENSION btree_gist;
+
+starting permutation: create_table fill_10 create_ext_btree_gist create_gist flush disable_seq disable_bitmap measure access flush result measure access flush result delete flush measure access flush result measure access flush result drop_table drop_ext_btree_gist
+step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
+step fill_10: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 10) g(i);
+step create_ext_btree_gist: CREATE EXTENSION btree_gist;
+step create_gist: CREATE INDEX kill_prior_tuple_gist ON kill_prior_tuple USING gist (key);
+step flush: SELECT FROM pg_stat_force_next_flush();
+step disable_seq: SET enable_seqscan = false;
+step disable_bitmap: SET enable_bitmapscan = false;
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step delete: DELETE FROM kill_prior_tuple;
+step flush: SELECT FROM pg_stat_force_next_flush();
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
+step drop_ext_btree_gist: DROP EXTENSION btree_gist;
+
+starting permutation: create_table fill_500 create_hash flush disable_seq disable_bitmap measure access flush result measure access flush result delete flush measure access flush result measure access flush result drop_table
+step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
+step fill_500: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 500) g(i);
+step create_hash: CREATE INDEX kill_prior_tuple_hash ON kill_prior_tuple USING hash (key);
+step flush: SELECT FROM pg_stat_force_next_flush();
+step disable_seq: SET enable_seqscan = false;
+step disable_bitmap: SET enable_bitmapscan = false;
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_hash on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_hash on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step delete: DELETE FROM kill_prior_tuple;
+step flush: SELECT FROM pg_stat_force_next_flush();
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_hash on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_hash on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                0
+(1 row)
+
+step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
+
+starting permutation: create_table fill_500 create_ext_btree_gin create_gin flush disable_seq delete flush measure access flush result measure access flush result drop_table drop_ext_btree_gin
+step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
+step fill_500: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 500) g(i);
+step create_ext_btree_gin: CREATE EXTENSION btree_gin;
+step create_gin: CREATE INDEX kill_prior_tuple_gin ON kill_prior_tuple USING gin (key);
+step flush: SELECT FROM pg_stat_force_next_flush();
+step disable_seq: SET enable_seqscan = false;
+step delete: DELETE FROM kill_prior_tuple;
+step flush: SELECT FROM pg_stat_force_next_flush();
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                             
+-----------------------------------------------------------------------
+Bitmap Heap Scan on kill_prior_tuple (actual rows=0 loops=1)           
+  Recheck Cond: (key = 1)                                              
+  Heap Blocks: exact=1                                                 
+  ->  Bitmap Index Scan on kill_prior_tuple_gin (actual rows=1 loops=1)
+        Index Cond: (key = 1)                                          
+(5 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                             
+-----------------------------------------------------------------------
+Bitmap Heap Scan on kill_prior_tuple (actual rows=0 loops=1)           
+  Recheck Cond: (key = 1)                                              
+  Heap Blocks: exact=1                                                 
+  ->  Bitmap Index Scan on kill_prior_tuple_gin (actual rows=1 loops=1)
+        Index Cond: (key = 1)                                          
+(5 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
+step drop_ext_btree_gin: DROP EXTENSION btree_gin;
diff --git a/src/test/isolation/specs/index-killtuples.spec b/src/test/isolation/specs/index-killtuples.spec
new file mode 100644
index 00000000000..d597be02c23
--- /dev/null
+++ b/src/test/isolation/specs/index-killtuples.spec
@@ -0,0 +1,124 @@
+# Basic testing of killtuples / kill_prior_tuples / all_dead testing
+# for various index AMs
+#
+# This tests just enough to ensure that the kill* routines are actually
+# executed and does something approximately reasonable. It's *not* sufficient
+# testing for adding killitems support to a new AM!
+
+setup
+{
+    CREATE TABLE counter(heap_accesses int);
+    INSERT INTO counter(heap_accesses) VALUES (0);
+}
+
+teardown
+{
+    DROP TABLE counter;
+}
+
+session s1
+# to ensure GUCs are reset
+setup { RESET ALL; }
+
+step disable_seq { SET enable_seqscan = false; }
+
+step disable_bitmap { SET enable_bitmapscan = false; }
+
+# use a temporary table to make sure no other session can interfere with
+# visibility determinations
+step create_table { CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null); }
+
+step fill_10 { INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 10) g(i); }
+
+step fill_500 { INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 500) g(i); }
+
+# column-less select to make output easier to read
+step flush { SELECT FROM pg_stat_force_next_flush(); }
+
+step measure { UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple'); }
+
+step result { SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple'; }
+
+step access { EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1; }
+
+step delete { DELETE FROM kill_prior_tuple; }
+step delete_1 { DELETE FROM kill_prior_tuple WHERE cat = 1; }
+
+step drop_table { DROP TABLE IF EXISTS kill_prior_tuple; }
+
+### steps for testing btree indexes ###
+step create_btree { CREATE INDEX kill_prior_tuple_btree ON kill_prior_tuple USING btree (key); }
+
+### steps for testing gist indexes ###
+# Creating the extensions takes time, so we don't want to do so when testing
+# other AMs
+step create_ext_btree_gist { CREATE EXTENSION btree_gist; }
+step drop_ext_btree_gist { DROP EXTENSION btree_gist; }
+step create_gist { CREATE INDEX kill_prior_tuple_gist ON kill_prior_tuple USING gist (key); }
+
+### steps for testing gin indexes ###
+# See create_ext_btree_gist
+step create_ext_btree_gin { CREATE EXTENSION btree_gin; }
+step drop_ext_btree_gin { DROP EXTENSION btree_gin; }
+step create_gin { CREATE INDEX kill_prior_tuple_gin ON kill_prior_tuple USING gin (key); }
+
+### steps for testing hash indexes ###
+step create_hash { CREATE INDEX kill_prior_tuple_hash ON kill_prior_tuple USING hash (key); }
+
+
+# test killtuples with btree index
+permutation
+  create_table fill_500 create_btree flush
+  disable_seq disable_bitmap
+  # show each access to non-deleted tuple increments heap_blks_*
+  measure access flush result
+  measure access flush result
+  delete flush
+  # first access after accessing deleted tuple still needs to access heap
+  measure access flush result
+  # but after kill_prior_tuple did its thing, we shouldn't access heap anymore
+  measure access flush result
+  drop_table
+
+# Same as first permutation, except testing gist
+permutation
+  create_table fill_500 create_ext_btree_gist create_gist flush
+  disable_seq disable_bitmap
+  measure access flush result
+  measure access flush result
+  delete flush
+  measure access flush result
+  measure access flush result
+  drop_table drop_ext_btree_gist
+
+# Test gist, but with fewer rows - shows that killitems doesn't work anymore!
+permutation
+  create_table fill_10 create_ext_btree_gist create_gist flush
+  disable_seq disable_bitmap
+  measure access flush result
+  measure access flush result
+  delete flush
+  measure access flush result
+  measure access flush result
+  drop_table drop_ext_btree_gist
+
+# Same as first permutation, except testing hash
+permutation
+  create_table fill_500 create_hash flush
+  disable_seq disable_bitmap
+  measure access flush result
+  measure access flush result
+  delete flush
+  measure access flush result
+  measure access flush result
+  drop_table
+
+# Similar to first permutation, except that gin does not have killtuples support
+permutation
+  create_table fill_500 create_ext_btree_gin create_gin flush
+  disable_seq
+  delete flush
+  measure access flush result
+  # will still fetch from heap
+  measure access flush result
+  drop_table drop_ext_btree_gin
-- 
2.46.0.519.g2e7b89e038

v1-0002-heapam-Move-logic-to-handle-HEAP_MOVED-into-a-hel.patch (text/x-diff; charset=us-ascii)
From cddf2ae3d5d4300a99e8bff6a98ac8fee82fe169 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 23 Sep 2024 12:23:33 -0400
Subject: [PATCH v1 02/14] heapam: Move logic to handle HEAP_MOVED into a
 helper function

Before, we dealt with this in six near-identical copies and one very similar one.

The helper function errors out when encountering a
HEAP_MOVED_IN/HEAP_MOVED_OUT tuple with xvac considered current or
in-progress. It'd be preferable to do that change separately, but otherwise
it'd not be possible to deduplicate the handling in
HeapTupleSatisfiesVacuum().

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/heap/heapam_visibility.c | 307 ++++----------------
 1 file changed, 61 insertions(+), 246 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 9243feed01f..057b88767db 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -144,6 +144,55 @@ HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
 	SetHintBits(tuple, buffer, infomask, xid);
 }
 
+/*
+ * If HEAP_MOVED_OFF or HEAP_MOVED_IN are set on the tuple, remove them and
+ * adjust hint bits. See the comment for SetHintBits() for more background.
+ *
+ * This helper returns false if the row ought to be invisible, true otherwise.
+ */
+static inline bool
+HeapTupleCleanMoved(HeapTupleHeader tuple, Buffer buffer)
+{
+	TransactionId xvac;
+
+	/* only used by pre-9.0 binary upgrades */
+	if (likely(!(tuple->t_infomask & (HEAP_MOVED_OFF | HEAP_MOVED_IN))))
+		return true;
+
+	xvac = HeapTupleHeaderGetXvac(tuple);
+
+	if (TransactionIdIsCurrentTransactionId(xvac))
+		elog(ERROR, "encountered tuple with HEAP_MOVED considered current");
+
+	if (TransactionIdIsInProgress(xvac))
+		elog(ERROR, "encountered tuple with HEAP_MOVED considered in-progress");
+
+	if (tuple->t_infomask & HEAP_MOVED_OFF)
+	{
+		if (TransactionIdDidCommit(xvac))
+		{
+			SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
+						InvalidTransactionId);
+			return false;
+		}
+		SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
+					InvalidTransactionId);
+	}
+	else if (tuple->t_infomask & HEAP_MOVED_IN)
+	{
+		if (TransactionIdDidCommit(xvac))
+			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
+						InvalidTransactionId);
+		else
+		{
+			SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
+						InvalidTransactionId);
+			return false;
+		}
+	}
+
+	return true;
+}
 
 /*
  * HeapTupleSatisfiesSelf
@@ -179,45 +228,8 @@ HeapTupleSatisfiesSelf(HeapTuple htup, Snapshot snapshot, Buffer buffer)
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return false;
 
-		/* Used by pre-9.0 binary upgrades */
-		if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return false;
-			if (!TransactionIdIsInProgress(xvac))
-			{
-				if (TransactionIdDidCommit(xvac))
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			}
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (!TransactionIdIsCurrentTransactionId(xvac))
-			{
-				if (TransactionIdIsInProgress(xvac))
-					return false;
-				if (TransactionIdDidCommit(xvac))
-					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-								InvalidTransactionId);
-				else
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-			}
-		}
+		if (!HeapTupleCleanMoved(tuple, buffer))
+			return false;
 		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
 		{
 			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
@@ -372,45 +384,8 @@ HeapTupleSatisfiesToast(HeapTuple htup, Snapshot snapshot,
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return false;
 
-		/* Used by pre-9.0 binary upgrades */
-		if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return false;
-			if (!TransactionIdIsInProgress(xvac))
-			{
-				if (TransactionIdDidCommit(xvac))
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			}
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (!TransactionIdIsCurrentTransactionId(xvac))
-			{
-				if (TransactionIdIsInProgress(xvac))
-					return false;
-				if (TransactionIdDidCommit(xvac))
-					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-								InvalidTransactionId);
-				else
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-			}
-		}
+		if (!HeapTupleCleanMoved(tuple, buffer))
+			return false;
 
 		/*
 		 * An invalid Xmin can be left behind by a speculative insertion that
@@ -468,45 +443,8 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return TM_Invisible;
 
-		/* Used by pre-9.0 binary upgrades */
-		if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return TM_Invisible;
-			if (!TransactionIdIsInProgress(xvac))
-			{
-				if (TransactionIdDidCommit(xvac))
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return TM_Invisible;
-				}
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			}
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (!TransactionIdIsCurrentTransactionId(xvac))
-			{
-				if (TransactionIdIsInProgress(xvac))
-					return TM_Invisible;
-				if (TransactionIdDidCommit(xvac))
-					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-								InvalidTransactionId);
-				else
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return TM_Invisible;
-				}
-			}
-		}
+		else if (!HeapTupleCleanMoved(tuple, buffer))
+			return TM_Invisible;
 		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
 		{
 			if (HeapTupleHeaderGetCmin(tuple) >= curcid)
@@ -756,45 +694,8 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return false;
 
-		/* Used by pre-9.0 binary upgrades */
-		if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return false;
-			if (!TransactionIdIsInProgress(xvac))
-			{
-				if (TransactionIdDidCommit(xvac))
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			}
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (!TransactionIdIsCurrentTransactionId(xvac))
-			{
-				if (TransactionIdIsInProgress(xvac))
-					return false;
-				if (TransactionIdDidCommit(xvac))
-					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-								InvalidTransactionId);
-				else
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-			}
-		}
+		if (!HeapTupleCleanMoved(tuple, buffer))
+			return false;
 		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
 		{
 			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
@@ -970,45 +871,8 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return false;
 
-		/* Used by pre-9.0 binary upgrades */
-		if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return false;
-			if (!XidInMVCCSnapshot(xvac, snapshot))
-			{
-				if (TransactionIdDidCommit(xvac))
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			}
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (!TransactionIdIsCurrentTransactionId(xvac))
-			{
-				if (XidInMVCCSnapshot(xvac, snapshot))
-					return false;
-				if (TransactionIdDidCommit(xvac))
-					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-								InvalidTransactionId);
-				else
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-			}
-		}
+		if (!HeapTupleCleanMoved(tuple, buffer))
+			return false;
 		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
 		{
 			if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)
@@ -1213,57 +1077,8 @@ HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer, TransactionId *de
 	{
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return HEAPTUPLE_DEAD;
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return HEAPTUPLE_DELETE_IN_PROGRESS;
-			if (TransactionIdIsInProgress(xvac))
-				return HEAPTUPLE_DELETE_IN_PROGRESS;
-			if (TransactionIdDidCommit(xvac))
-			{
-				SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-							InvalidTransactionId);
-				return HEAPTUPLE_DEAD;
-			}
-			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-						InvalidTransactionId);
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return HEAPTUPLE_INSERT_IN_PROGRESS;
-			if (TransactionIdIsInProgress(xvac))
-				return HEAPTUPLE_INSERT_IN_PROGRESS;
-			if (TransactionIdDidCommit(xvac))
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			else
-			{
-				SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-							InvalidTransactionId);
-				return HEAPTUPLE_DEAD;
-			}
-		}
-		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
-		{
-			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
-				return HEAPTUPLE_INSERT_IN_PROGRESS;
-			/* only locked? run infomask-only check first, for performance */
-			if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) ||
-				HeapTupleHeaderIsOnlyLocked(tuple))
-				return HEAPTUPLE_INSERT_IN_PROGRESS;
-			/* inserted and then deleted by same xact */
-			if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple)))
-				return HEAPTUPLE_DELETE_IN_PROGRESS;
-			/* deleting subtransaction must have aborted */
-			return HEAPTUPLE_INSERT_IN_PROGRESS;
-		}
+		else if (!HeapTupleCleanMoved(tuple, buffer))
+			return HEAPTUPLE_DEAD;
 		else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuple)))
 		{
 			/*
-- 
2.46.0.519.g2e7b89e038

v1-0003-bufmgr-Add-BufferLockHeldByMe.patch (text/x-diff; charset=us-ascii)
From aa054c44b41669c44f238f127c20bd48dff3af11 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 23 Sep 2024 12:51:37 -0400
Subject: [PATCH v1 03/14] bufmgr: Add BufferLockHeldByMe()

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/bufmgr.h        |  1 +
 src/backend/storage/buffer/bufmgr.c | 41 +++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index eb0fba4230b..ded46a57889 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -291,6 +291,7 @@ extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern bool BufferLockHeldByMe(Buffer buffer, int mode);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0f02bf62fa3..88d18e85d64 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5492,6 +5492,47 @@ IsBufferCleanupOK(Buffer buffer)
 	return false;
 }
 
+/*
+ * BufferLockHeldByMe - does the backend have the buffer locked?
+ *
+ * This likely should only be used for asserts etc.
+ *
+ * Note that this can only be called for non-temp buffers - there is no
+ * correct value to return for temporary buffers. One might think that just
+ * returning true for temp buffers would work, but the caller might assert
+ * that a lock is *not* held.
+ */
+bool
+BufferLockHeldByMe(Buffer buffer, int mode)
+{
+	BufferDesc *buf;
+	LWLockMode	lwmode;
+
+	/*
+	 * Can't hold a lock without a pin, there never should be uncertainty
+	 * about having a pin.
+	 */
+	Assert(BufferIsPinned(buffer));
+
+	/* there'd be no correct value to return */
+	Assert(!BufferIsLocal(buffer));
+
+	buf = GetBufferDescriptor(buffer - 1);
+
+	if (mode == BUFFER_LOCK_EXCLUSIVE)
+		lwmode = LW_EXCLUSIVE;
+	else if (mode == BUFFER_LOCK_SHARE)
+		lwmode = LW_SHARED;
+	else
+	{
+		Assert(false);
+		pg_unreachable();
+		lwmode = LW_EXCLUSIVE;  /* assuage compiler */
+	}
+
+	return LWLockHeldByMeInMode(BufferDescriptorGetContentLock(buf), lwmode);
+}
+
 
 /*
  *	Functions for buffer I/O handling
-- 
2.46.0.519.g2e7b89e038

v1-0004-heapam-Use-exclusive-lock-on-old-page-in-CLUSTER.patch (text/x-diff; charset=us-ascii)
From 08c1b45cfd7c223b146d228e997d16624274d0b4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 11:45:36 -0400
Subject: [PATCH v1 04/14] heapam: Use exclusive lock on old page in CLUSTER

To be able to guarantee that we can set the hint bit, acquire an exclusive
lock on the old buffer. We need the hint bits to be set as otherwise
reform_and_rewrite_tuple() -> rewrite_heap_tuple() -> heap_freeze_tuple() will
get confused.

It'd be better if we somehow could avoid setting hint bits on the old
page. One reason to use VACUUM FULL is very bloated tables - rewriting most
of the old table during VACUUM FULL doesn't exactly help with that.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/heap/heapam_handler.c    | 13 ++++++++++++-
 src/backend/access/heap/heapam_visibility.c |  7 +++++++
 src/backend/storage/buffer/bufmgr.c         |  2 +-
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c1..009601445aa 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -833,7 +833,18 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		tuple = ExecFetchSlotHeapTuple(slot, false, NULL);
 		buf = hslot->buffer;
 
-		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		/*
+		 * To be able to guarantee that we can set the hint bit, acquire an
+		 * exclusive lock on the old buffer. We need the hint bits to be set
+		 * as otherwise reform_and_rewrite_tuple() -> rewrite_heap_tuple() ->
+		 * heap_freeze_tuple() will get confused.
+		 *
+		 * It'd be better if we somehow could avoid setting hint bits on the
+		 * old page. One reason to use VACUUM FULL is very bloated tables -
+		 * rewriting most of the old table during VACUUM FULL doesn't
+		 * exactly help...
+		 */
+		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 		switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
 		{
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 057b88767db..b7aa8bb7a52 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -141,6 +141,13 @@ void
 HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
 					 uint16 infomask, TransactionId xid)
 {
+	/*
+	 * The uses from heapam.c rely on being able to perform the hint bit
+	 * updates, which can only be guaranteed if we are holding an exclusive
+	 * lock on the buffer - which all callers are doing.
+	 */
+	Assert(BufferIsLocal(buffer) || BufferLockHeldByMe(buffer, BUFFER_LOCK_EXCLUSIVE));
+
 	SetHintBits(tuple, buffer, infomask, xid);
 }
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 88d18e85d64..f56f7f9f441 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5527,7 +5527,7 @@ BufferLockHeldByMe(Buffer buffer, int mode)
 	{
 		Assert(false);
 		pg_unreachable();
-		lwmode = LW_EXCLUSIVE;  /* assuage compiler */
+		lwmode = LW_EXCLUSIVE;	/* assuage compiler */
 	}
 
 	return LWLockHeldByMeInMode(BufferDescriptorGetContentLock(buf), lwmode);
-- 
2.46.0.519.g2e7b89e038

v1-0005-heapam-Only-set-tuple-s-block-once-per-page-in-pa.patch (text/x-diff; charset=us-ascii)
From 9b56778c495de3ed7b179083c926598ab5fbaf57 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 11:50:13 -0400
Subject: [PATCH v1 05/14] heapam: Only set tuple's block once per page in
 pagemode

Due to splitting the block id into two 16 bit integers, BlockIdSet() is more
expensive than one might think. Doing it once per returned tuple shows up as a
small but reliably reproducible cost.  It's simple enough to set the block
number only once per block in pagemode, so do so.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/heap/heapam.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1748eafa100..d0cb4b1e29b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -981,6 +981,9 @@ heapgettup_pagemode(HeapScanDesc scan,
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
 
+		/* set block once per page, instead of doing so for every tuple */
+		BlockIdSet(&tuple->t_self.ip_blkid, scan->rs_cblock);
+
 		/* lineindex now references the next or previous visible tid */
 continue_page:
 
@@ -995,7 +998,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
+			tuple->t_self.ip_posid = lineoff;
 
 			/* skip any tuples that don't match the scan key */
 			if (key != NULL &&
-- 
2.46.0.519.g2e7b89e038
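
For readers not familiar with the on-disk tid layout: the block number inside
an ItemPointer is stored as two 16 bit halves, which is what makes
BlockIdSet() comparatively expensive. A rough paraphrase of the relevant
upstream definitions (a sketch, not a verbatim copy of block.h):

typedef struct BlockIdData
{
	uint16		bi_hi;
	uint16		bi_lo;
} BlockIdData;

static inline void
BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber)
{
	/* a shift, a mask and two 16 bit stores for every call */
	blockId->bi_hi = blockNumber >> 16;
	blockId->bi_lo = blockNumber & 0xffff;
}

Setting t_self.ip_blkid once per page and only updating the 16 bit ip_posid
per returned tuple avoids most of that work in pagemode.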

v1-0006-heapam-Add-batch-mode-mvcc-check-and-use-it-in-pa.patch (text/x-diff; charset=us-ascii)
From ed74f0f2c6dfc4e4afe31ac162200c0783bf2240 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 13:16:36 -0400
Subject: [PATCH v1 06/14] heapam: Add batch mode mvcc check and use it in page
 mode

There are two reasons for doing so:

1) It is generally faster to perform checks in a batched fashion and making
   sequential scans faster is nice.

2) We would like to stop setting hint bits while pages are being written
   out. The overhead of the necessary interlock becomes noticeable for page
   mode scans if it is acquired for every tuple. With batching the overhead
   can be amortized to only happen once per page.

There are substantial further optimization opportunities along these
lines:

- Right now HeapTupleSatisfiesMVCCBatch() simply uses the single-tuple
  HeapTupleSatisfiesMVCC(), relying on the compiler to inline it. We could
  instead write an explicitly optimized version that avoids repeated xid
  tests.

- Introduce batched version of the serializability test

- Introduce batched version of HeapTupleSatisfiesVacuum

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/access/heapam.h                 |  28 +++++
 src/backend/access/heap/heapam.c            | 110 ++++++++++++++++----
 src/backend/access/heap/heapam_visibility.c |  50 +++++++++
 src/tools/pgindent/typedefs.list            |   1 +
 4 files changed, 166 insertions(+), 23 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 96cf82f97b7..d4a790251cc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -417,6 +417,34 @@ extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
 extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
 
+/*
+ * FIXME: define to be removed
+ *
+ * Without this I see worse performance. But it's a bit ugly, so I thought
+ * it'd be useful to leave a way in for others to experiment with this.
+ */
+#define BATCHMVCC_FEWER_ARGS
+
+#ifdef BATCHMVCC_FEWER_ARGS
+typedef struct BatchMVCCState
+{
+	HeapTupleData tuples[MaxHeapTuplesPerPage];
+	bool		visible[MaxHeapTuplesPerPage];
+} BatchMVCCState;
+#endif
+
+extern int	HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
+										int ntups,
+#ifdef BATCHMVCC_FEWER_ARGS
+										BatchMVCCState *batchmvcc,
+#else
+										HeapTupleData *tuples,
+										bool *visible,
+#endif
+										OffsetNumber *vistuples_dense);
+
+
+
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
  * details this is implemented in reorderbuffer.c not heapam_visibility.c
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d0cb4b1e29b..99656a1aac5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -437,42 +437,106 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
 					BlockNumber block, int lines,
 					bool all_visible, bool check_serializable)
 {
+	Oid			relid = RelationGetRelid(scan->rs_base.rs_rd);
+#ifdef BATCHMVCC_FEWER_ARGS
+	BatchMVCCState batchmvcc;
+#else
+	HeapTupleData tuples[MaxHeapTuplesPerPage];
+	bool		visible[MaxHeapTuplesPerPage];
+#endif
 	int			ntup = 0;
-	OffsetNumber lineoff;
+	int			nvis = 0;
 
-	for (lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
+	/* page at a time should have been disabled otherwise */
+	Assert(IsMVCCSnapshot(snapshot));
+
+	/* first find all tuples on the page */
+	for (OffsetNumber lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
 	{
 		ItemId		lpp = PageGetItemId(page, lineoff);
-		HeapTupleData loctup;
-		bool		valid;
+		HeapTuple	tup;
 
-		if (!ItemIdIsNormal(lpp))
+		if (unlikely(!ItemIdIsNormal(lpp)))
 			continue;
 
-		loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
-		loctup.t_len = ItemIdGetLength(lpp);
-		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
-		ItemPointerSet(&(loctup.t_self), block, lineoff);
-
-		if (all_visible)
-			valid = true;
-		else
-			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
-
-		if (check_serializable)
-			HeapCheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
-												&loctup, buffer, snapshot);
-
-		if (valid)
+		/*
+		 * If the page is not all-visible or we need to check serializability,
+		 * maintain enough state to be able to refind the tuple efficiently,
+		 * without again needing to extract it from the page.
+		 */
+		if (!all_visible || check_serializable)
 		{
-			scan->rs_vistuples[ntup] = lineoff;
-			ntup++;
+#ifdef BATCHMVCC_FEWER_ARGS
+			tup = &batchmvcc.tuples[ntup];
+#else
+			tup = &tuples[ntup];
+#endif
+
+			tup->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+			tup->t_len = ItemIdGetLength(lpp);
+			tup->t_tableOid = relid;
+			ItemPointerSet(&(tup->t_self), block, lineoff);
 		}
+
+		/*
+		 * If the page is all visible, these fields won't otherwise be
+		 * populated by the batch visibility check below.
+		 */
+		if (all_visible)
+		{
+			if (check_serializable)
+			{
+#ifdef BATCHMVCC_FEWER_ARGS
+				batchmvcc.visible[ntup] = true;
+#else
+				visible[ntup] = true;
+#endif
+			}
+			scan->rs_vistuples[ntup] = lineoff;
+		}
+
+		ntup++;
 	}
 
 	Assert(ntup <= MaxHeapTuplesPerPage);
 
-	return ntup;
+	/* unless the page is all visible, test visibility for all tuples in one go */
+	if (all_visible)
+		nvis = ntup;
+	else
+		nvis = HeapTupleSatisfiesMVCCBatch(snapshot, buffer,
+										   ntup,
+#ifdef BATCHMVCC_FEWER_ARGS
+										   &batchmvcc,
+#else
+										   tuples, visible,
+#endif
+										   scan->rs_vistuples
+			);
+
+	/*
+	 * So far we don't have a batch API for testing serializability, so do it
+	 * one-by-one.
+	 */
+	if (check_serializable)
+	{
+		for (int i = 0; i < ntup; i++)
+		{
+#ifdef BATCHMVCC_FEWER_ARGS
+			HeapCheckForSerializableConflictOut(batchmvcc.visible[i],
+												scan->rs_base.rs_rd,
+												&batchmvcc.tuples[i],
+												buffer, snapshot);
+#else
+			HeapCheckForSerializableConflictOut(visible[i],
+												scan->rs_base.rs_rd,
+												&tuples[i],
+												buffer, snapshot);
+#endif
+		}
+	}
+
+	return nvis;
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index b7aa8bb7a52..4baafc01a2c 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1575,6 +1575,56 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 		return true;
 }
 
+/*
+ * Perform HeapTupleSatisfiesMVCC() on each passed-in tuple. This is more
+ * efficient than doing HeapTupleSatisfiesMVCC() one-by-one.
+ *
+ * The tuples to be checked are passed via BatchMVCCState->tuples. Each tuple's
+ * visibility is set in batchmvcc->visible[]. In addition, ->vistuples_dense
+ * is set to contain the offsets of visible tuples.
+ *
+ * Returns the number of visible tuples.
+ */
+int
+HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
+							int ntups,
+#ifdef BATCHMVCC_FEWER_ARGS
+							BatchMVCCState *batchmvcc,
+#else
+							HeapTupleData *tuples,
+							bool *visible,
+#endif
+							OffsetNumber *vistuples_dense)
+{
+	int			nvis = 0;
+
+	Assert(IsMVCCSnapshot(snapshot));
+
+	for (int i = 0; i < ntups; i++)
+	{
+		bool		valid;
+#ifdef BATCHMVCC_FEWER_ARGS
+		HeapTuple	tup = &batchmvcc->tuples[i];
+#else
+		HeapTuple	tup = &tuples[i];
+#endif
+
+		valid = HeapTupleSatisfiesMVCC(tup, snapshot, buffer);
+#ifdef BATCHMVCC_FEWER_ARGS
+		batchmvcc->visible[i] = valid;
+#else
+		visible[i] = valid;
+#endif
+		if (likely(valid))
+		{
+			vistuples_dense[nvis] = tup->t_self.ip_posid;
+			nvis++;
+		}
+	}
+
+	return nvis;
+}
+
 /*
  * HeapTupleSatisfiesVisibility
  *		True iff heap tuple satisfies a time qual.
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 171a7dd5d2b..15c7673465b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -244,6 +244,7 @@ Barrier
 BaseBackupCmd
 BaseBackupTargetHandle
 BaseBackupTargetType
+BatchMVCCState
 BeginDirectModify_function
 BeginForeignInsert_function
 BeginForeignModify_function
-- 
2.46.0.519.g2e7b89e038

v1-0007-bufmgr-Make-it-easier-to-change-number-of-buffer-.patch (text/x-diff; charset=us-ascii)
From 7a3db51e47865fb2d3691e02f39cda790c9f6d10 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 Oct 2024 18:01:14 -0400
Subject: [PATCH v1 07/14] bufmgr: Make it easier to change number of buffer
 state bits

In an upcoming commit I'd like to change the number of bits for the usage
count (the current max is 5, fitting in three bits, but we reserve four
bits). Until now that required adjusting a bunch of magic constants; now the
constants are derived from the number of bits reserved.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/buf_internals.h | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index eda6c699212..d8444651736 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -39,12 +39,19 @@
  *
  * The definition of buffer state components is below.
  */
+#define BUF_REFCOUNT_BITS 18
+#define BUF_USAGECOUNT_BITS 4
+#define BUF_FLAG_BITS 10
+
+StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
+				 "buffer bit counts need to equal 32");
+
 #define BUF_REFCOUNT_ONE 1
-#define BUF_REFCOUNT_MASK ((1U << 18) - 1)
-#define BUF_USAGECOUNT_MASK 0x003C0000U
-#define BUF_USAGECOUNT_ONE (1U << 18)
-#define BUF_USAGECOUNT_SHIFT 18
-#define BUF_FLAG_MASK 0xFFC00000U
+#define BUF_REFCOUNT_MASK ((1U << BUF_REFCOUNT_BITS) - 1)
+#define BUF_USAGECOUNT_MASK (((1U << BUF_USAGECOUNT_BITS) - 1) << (BUF_REFCOUNT_BITS))
+#define BUF_USAGECOUNT_ONE (1U << BUF_REFCOUNT_BITS)
+#define BUF_USAGECOUNT_SHIFT BUF_REFCOUNT_BITS
+#define BUF_FLAG_MASK (((1U << BUF_FLAG_BITS) - 1) << (BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS))
 
 /* Get refcount and usagecount from buffer state */
 #define BUF_STATE_GET_REFCOUNT(state) ((state) & BUF_REFCOUNT_MASK)
@@ -77,6 +84,9 @@
  */
 #define BM_MAX_USAGE_COUNT	5
 
+StaticAssertDecl(BM_MAX_USAGE_COUNT < ((1 << BUF_USAGECOUNT_BITS) - 1),
+				 "BM_MAX_USAGE_COUNT doesn't fit in BUF_USAGECOUNT_BITS bits");
+
 /*
  * Buffer tag identifies which disk block the buffer contains.
  *
-- 
2.46.0.519.g2e7b89e038
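
To double-check the derivation (my arithmetic, not part of the patch): with 18
refcount bits, 4 usage-count bits and 10 flag bits, the generic definitions
reduce to the previous magic constants, e.g.:

/* BUF_REFCOUNT_MASK   = (1U << 18) - 1               = 0x0003FFFF */
/* BUF_USAGECOUNT_ONE  = 1U << 18                     = 0x00040000 */
/* BUF_USAGECOUNT_MASK = ((1U << 4) - 1) << 18        = 0x003C0000 */
/* BUF_FLAG_MASK       = ((1U << 10) - 1) << (18 + 4) = 0xFFC00000 */

StaticAssertDecl(BUF_USAGECOUNT_MASK == 0x003C0000U,
				 "usage count mask matches previous constant");
StaticAssertDecl(BUF_FLAG_MASK == 0xFFC00000U,
				 "flag mask matches previous constant");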

v1-0008-bufmgr-Add-interface-to-acquire-right-to-set-hint.patch (text/x-diff; charset=us-ascii)
From ab2a7ea025a32053f6f201d14ee764a235c480ce Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 13:59:13 -0400
Subject: [PATCH v1 08/14] bufmgr: Add interface to acquire right to set hint
 bits

At the moment hint bits can be set with just a share lock on a page (and in
one place even without any lock). Because of this we need to copy pages while
writing them out, as otherwise the checksum could be corrupted.

The need to copy the page is problematic for the AIO patchset:

1) Instead of just needing a single buffer for a copied page we need one for
   each page that's potentially undergoing IO
2) To be able to use the "worker" AIO implementation the copied page needs to
   reside in shared memory.

Even without AIO copying the page isn't free...

This commit starts to address that by adding BufferPrepareToSetHintBits(),
which needs to be called before setting hint bits on a
buffer. BufferPrepareToSetHintBits() only allows hint bit writes if there's no
ongoing IO and while hint bits are being set no IO is allowed to be started.

To know that a buffer is undergoing IO a new BufferDesc state flag is used,
BM_SETTING_HINTS. Theoretically it'd be possible to reuse BM_IO_IN_PROGRESS,
but that'd make it harder to debug the system.

The new interface is not yet used; that will happen in subsequent commits, to
make review a bit easier. Therefore we cannot yet rely on not needing a copy
of the buffer during IO.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/buf_internals.h           |  22 ++-
 src/include/storage/bufmgr.h                  |   3 +
 src/backend/storage/buffer/README             |   6 +-
 src/backend/storage/buffer/bufmgr.c           | 177 +++++++++++++++++-
 .../utils/activity/wait_event_names.txt       |   1 +
 5 files changed, 198 insertions(+), 11 deletions(-)

diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index d8444651736..1f8a0482e90 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -31,8 +31,8 @@
  * Buffer state is a single 32-bit variable where following data is combined.
  *
  * - 18 bits refcount
- * - 4 bits usage count
- * - 10 bits of flags
+ * - 3 bits usage count
+ * - 11 bits of flags
  *
  * Combining these values allows to perform some operations without locking
  * the buffer header, by modifying them together with a CAS loop.
@@ -40,8 +40,8 @@
  * The definition of buffer state components is below.
  */
 #define BUF_REFCOUNT_BITS 18
-#define BUF_USAGECOUNT_BITS 4
-#define BUF_FLAG_BITS 10
+#define BUF_USAGECOUNT_BITS 3
+#define BUF_FLAG_BITS 11
 
 StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
 				 "buffer bit counts need to equal 32");
@@ -63,6 +63,7 @@ StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
  * Note: BM_TAG_VALID essentially means that there is a buffer hashtable
  * entry associated with the buffer's tag.
  */
+#define BM_SETTING_HINTS		(1U << 21)	/* hint bits are being set */
 #define BM_LOCKED				(1U << 22)	/* buffer header is locked */
 #define BM_DIRTY				(1U << 23)	/* data needs writing */
 #define BM_VALID				(1U << 24)	/* data is valid */
@@ -396,6 +397,7 @@ extern PGDLLIMPORT CkptSortItem *CkptBufferIds;
 /* ResourceOwner callbacks to hold buffer I/Os and pins */
 extern PGDLLIMPORT const ResourceOwnerDesc buffer_io_resowner_desc;
 extern PGDLLIMPORT const ResourceOwnerDesc buffer_pin_resowner_desc;
+extern PGDLLIMPORT const ResourceOwnerDesc buffer_setting_hints_resowner_desc;
 
 /* Convenience wrappers over ResourceOwnerRemember/Forget */
 static inline void
@@ -418,6 +420,18 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 {
 	ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
 }
+static inline void
+ResourceOwnerRememberBufferSettingHints(ResourceOwner owner, Buffer buffer)
+{
+	ResourceOwnerRemember(owner, Int32GetDatum(buffer),
+						  &buffer_setting_hints_resowner_desc);
+}
+static inline void
+ResourceOwnerForgetBufferSettingHints(ResourceOwner owner, Buffer buffer)
+{
+	ResourceOwnerForget(owner, Int32GetDatum(buffer),
+						&buffer_setting_hints_resowner_desc);
+}
 
 /*
  * Internal buffer management routines
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ded46a57889..a37872377bc 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -300,6 +300,9 @@ extern void LimitAdditionalLocalPins(uint32 *additional_pins);
 
 extern bool EvictUnpinnedBuffer(Buffer buf);
 
+extern bool BufferPrepareToSetHintBits(Buffer buf);
+extern void BufferFinishSetHintBits(Buffer buf);
+
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
 extern Size BufferManagerShmemSize(void);
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 011af7aff3e..756ad542910 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -58,7 +58,8 @@ tuple while they are doing visibility checks.
 4. It is considered OK to update tuple commit status bits (ie, OR the
 values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
 HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
-pin on a buffer.  This is OK because another backend looking at the tuple
+pin on a buffer after calling BufferPrepareToSetHintBits().
+This is OK because another backend looking at the tuple
 at about the same time would OR the same bits into the field, so there
 is little or no risk of conflicting update; what's more, if there did
 manage to be a conflict it would merely mean that one bit-update would
@@ -68,6 +69,9 @@ great harm is done if they get reset to zero by conflicting updates.
 Note, however, that a tuple is frozen by setting both HEAP_XMIN_INVALID
 and HEAP_XMIN_COMMITTED; this is a critical update and accordingly requires
 an exclusive buffer lock (and it must also be WAL-logged).
+Calling BufferPrepareToSetHintBits() ensures that write IO cannot happen at
+the same time as hint bits are being modified, which could otherwise cause the
+checksum computed at the start of the write to no longer match the page.
 
 5. To physically remove a tuple or compact free space on a page, one
 must hold a pin and an exclusive lock, *and* observe while holding the
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f56f7f9f441..ee8b898121c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -222,6 +222,8 @@ static void ResOwnerReleaseBufferIO(Datum res);
 static char *ResOwnerPrintBufferIO(Datum res);
 static void ResOwnerReleaseBufferPin(Datum res);
 static char *ResOwnerPrintBufferPin(Datum res);
+static void ResOwnerReleaseBufferSettingHints(Datum res);
+static char *ResOwnerPrintBufferSettingHints(Datum res);
 
 const ResourceOwnerDesc buffer_io_resowner_desc =
 {
@@ -241,6 +243,15 @@ const ResourceOwnerDesc buffer_pin_resowner_desc =
 	.DebugPrint = ResOwnerPrintBufferPin
 };
 
+const ResourceOwnerDesc buffer_setting_hints_resowner_desc =
+{
+	.name = "buffer setting hints",
+	.release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
+	.release_priority = RELEASE_PRIO_BUFFER_IOS,
+	.ReleaseResource = ResOwnerReleaseBufferSettingHints,
+	.DebugPrint = ResOwnerPrintBufferSettingHints
+};
+
 /*
  * Ensure that the PrivateRefCountArray has sufficient space to store one more
  * entry. This has to be called before using NewPrivateRefCountEntry() to fill
@@ -1737,7 +1748,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	/* some sanity checks while we hold the buffer header lock */
 	Assert(BUF_STATE_GET_REFCOUNT(victim_buf_state) == 1);
-	Assert(!(victim_buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY | BM_IO_IN_PROGRESS)));
+	Assert(!(victim_buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY
+								 | BM_IO_IN_PROGRESS | BM_SETTING_HINTS)));
 
 	victim_buf_hdr->tag = newTag;
 
@@ -2092,7 +2104,8 @@ again:
 	buf_state = pg_atomic_read_u32(&buf_hdr->state);
 
 	Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 1);
-	Assert(!(buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY)));
+	Assert(!(buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY
+						  | BM_IO_IN_PROGRESS | BM_SETTING_HINTS)));
 
 	CheckBufferIsPinnedOnce(buf);
 #endif
@@ -5544,7 +5557,8 @@ BufferLockHeldByMe(Buffer buffer, int mode)
  */
 
 /*
- * WaitIO -- Block until the IO_IN_PROGRESS flag on 'buf' is cleared.
+ * WaitIO -- Block until the IO_IN_PROGRESS and SETTING_HINTS flags on 'buf'
+ * are cleared.
  */
 static void
 WaitIO(BufferDesc *buf)
@@ -5555,6 +5569,7 @@ WaitIO(BufferDesc *buf)
 	for (;;)
 	{
 		uint32		buf_state;
+		uint32		wait_event;
 
 		/*
 		 * It may not be necessary to acquire the spinlock to check the flag
@@ -5564,9 +5579,21 @@ WaitIO(BufferDesc *buf)
 		buf_state = LockBufHdr(buf);
 		UnlockBufHdr(buf, buf_state);
 
-		if (!(buf_state & BM_IO_IN_PROGRESS))
+		if (likely((buf_state & (BM_IO_IN_PROGRESS | BM_SETTING_HINTS)) == 0))
 			break;
-		ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_IO);
+		else if (buf_state & BM_IO_IN_PROGRESS)
+		{
+			Assert(!(buf_state & BM_SETTING_HINTS));
+			wait_event = WAIT_EVENT_BUFFER_IO;
+		}
+		else
+		{
+			Assert(buf_state & BM_SETTING_HINTS);
+			Assert(!(buf_state & BM_IO_IN_PROGRESS));
+			wait_event = WAIT_EVENT_BUFFER_SETTING_HINTS;
+		}
+
+		ConditionVariableSleep(cv, wait_event);
 	}
 	ConditionVariableCancelSleep();
 }
@@ -5606,7 +5633,7 @@ StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 	{
 		buf_state = LockBufHdr(buf);
 
-		if (!(buf_state & BM_IO_IN_PROGRESS))
+		if (!(buf_state & (BM_IO_IN_PROGRESS | BM_SETTING_HINTS)))
 			break;
 		UnlockBufHdr(buf, buf_state);
 		if (nowait)
@@ -5623,6 +5650,8 @@ StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 		return false;
 	}
 
+	Assert(!(buf_state & BM_SETTING_HINTS));
+
 	buf_state |= BM_IO_IN_PROGRESS;
 	UnlockBufHdr(buf, buf_state);
 
@@ -5661,6 +5690,7 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
 	buf_state = LockBufHdr(buf);
 
 	Assert(buf_state & BM_IO_IN_PROGRESS);
+	Assert(!(buf_state & BM_SETTING_HINTS));
 
 	buf_state &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
 	if (clear_dirty && !(buf_state & BM_JUST_DIRTIED))
@@ -5697,6 +5727,7 @@ AbortBufferIO(Buffer buffer)
 
 	buf_state = LockBufHdr(buf_hdr);
 	Assert(buf_state & (BM_IO_IN_PROGRESS | BM_TAG_VALID));
+	Assert(!(buf_state & BM_SETTING_HINTS));
 
 	if (!(buf_state & BM_VALID))
 	{
@@ -6122,6 +6153,23 @@ ResOwnerPrintBufferPin(Datum res)
 	return DebugPrintBufferRefcount(DatumGetInt32(res));
 }
 
+
+static void
+ResOwnerReleaseBufferSettingHints(Datum res)
+{
+	Buffer		buffer = DatumGetInt32(res);
+
+	BufferFinishSetHintBits(buffer);
+}
+
+static char *
+ResOwnerPrintBufferSettingHints(Datum res)
+{
+	Buffer		buffer = DatumGetInt32(res);
+
+	return psprintf("lost track of setting hint bits on buffer %d", buffer);
+}
+
 /*
  * Try to evict the current block in a shared buffer.
  *
@@ -6184,3 +6232,120 @@ EvictUnpinnedBuffer(Buffer buf)
 
 	return result;
 }
+
+/*
+ * Try to acquire the right to set hint bits for buf.
+ *
+ * It's only permitted to set hint bits in a buffer if the buffer is not
+ * undergoing IO. Otherwise the page level checksum could be corrupted. This
+ * could happen both for PG's checksum and on the OS/filesystem
+ * level. E.g. btrfs with direct-io relies on the page to not change while IO
+ * is going on.
+ *
+ * We can't just check if IO is going on at the time BufferPrepareToSetHintBits()
+ * is called; we also need to block IO from starting before we're done setting
+ * hints. This is achieved by setting BM_SETTING_HINTS for the buffer and
+ * having StartBufferIO()/WaitIO() wait for that.  We could combine
+ * BM_SETTING_HINTS and BM_IO_IN_PROGRESS into one, as they can never be set
+ * at the same time, but that seems unnecessarily confusing.
+ *
+ * Because we use only a single bit (BM_SETTING_HINTS) to track whether hint
+ * bits are currently being set, we cannot allow multiple backends to set
+ * hints at the same time - it'd be unknown whether BM_SETTING_HINTS would
+ * need to be remain set when a backend finishes setting hint bits.  In almost
+ * all situations the two backends would just set the same hint bits anyway,
+ * so this is unlikely to be a problem.
+ *
+ * If allowed to set hint bits, the caller needs to call
+ * BufferFinishSetHintBits() once done setting hint bits. In case of an error
+ * occurring before BufferFinishSetHintBits() is reached, the cleanup will be
+ * done via resowner.c.
+ *
+ * It can make sense to amortize the cost of BufferPrepareToSetHintBits() +
+ * BufferFinishSetHintBits() over multiple hint bit sets on a page.
+ *
+ * Returns whether caller is allowed to set hint bits.
+ */
+bool
+BufferPrepareToSetHintBits(Buffer buf)
+{
+	BufferDesc *desc;
+	uint32		old_buf_state;
+
+	Assert(BufferIsPinned(buf));
+
+	if (BufferIsLocal(buf))
+		return true;
+
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+
+	desc = GetBufferDescriptor(buf - 1);
+
+	/*
+	 * We could use LockBufHdr() instead, but a CAS loop turns out to be
+	 * slightly faster and has better concurrency behaviour due to not
+	 * blocking other backends.
+	 */
+	old_buf_state = pg_atomic_read_u32(&desc->state);
+
+	for (;;)
+	{
+		uint32		new_buf_state;
+
+		if (unlikely(old_buf_state & BM_LOCKED))
+			old_buf_state = WaitBufHdrUnlocked(desc);
+
+		if (unlikely((old_buf_state & (BM_SETTING_HINTS | BM_IO_IN_PROGRESS)) != 0))
+			return false;
+
+		new_buf_state = old_buf_state;
+
+		new_buf_state |= BM_SETTING_HINTS;
+
+		if (pg_atomic_compare_exchange_u32(&desc->state, &old_buf_state,
+										   new_buf_state))
+		{
+			ResourceOwnerRememberBufferSettingHints(CurrentResourceOwner, buf);
+			return true;
+		}
+	}
+}
+
+/*
+ * Release the permission to set hint bits on the current page.
+ */
+void
+BufferFinishSetHintBits(Buffer buf)
+{
+	BufferDesc *desc;
+	uint32		buf_state;
+	uint32		new_buf_state;
+
+	if (BufferIsLocal(buf))
+		return;
+
+	desc = GetBufferDescriptor(buf - 1);
+
+	buf_state = LockBufHdr(desc);
+	new_buf_state = buf_state;
+	new_buf_state &= ~BM_SETTING_HINTS;
+	UnlockBufHdr(desc, new_buf_state);
+
+	Assert(buf_state & (BM_TAG_VALID));
+	Assert(buf_state & (BM_VALID));
+	Assert(buf_state & (BM_SETTING_HINTS));
+	Assert(!(buf_state & (BM_IO_IN_PROGRESS)));
+
+	ResourceOwnerForgetBufferSettingHints(CurrentResourceOwner, buf);
+
+	/*
+	 * Wake everyone that might be waiting for a chance to perform IO (i.e.
+	 * WaitIO()).
+	 *
+	 * XXX: Right now this is somewhat expensive, due to
+	 * ConditionVariableBroadcast() acquiring its spinlock unconditionally. I
+	 * don't see a good way to avoid that from the bufmgr.c side, we don't
+	 * know if there is somebody waiting.
+	 */
+	ConditionVariableBroadcast(BufferDescriptorGetIOCV(desc));
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8efb4044d6f..a1222a1cfbb 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -111,6 +111,7 @@ BGWORKER_SHUTDOWN	"Waiting for background worker to shut down."
 BGWORKER_STARTUP	"Waiting for background worker to start up."
 BTREE_PAGE	"Waiting for the page number needed to continue a parallel B-tree scan to become available."
 BUFFER_IO	"Waiting for buffer I/O to complete."
+BUFFER_SETTING_HINTS	"Waiting for hint bit setting to complete."
 CHECKPOINT_DELAY_COMPLETE	"Waiting for a backend that blocks a checkpoint from completing."
 CHECKPOINT_DELAY_START	"Waiting for a backend that blocks a checkpoint from starting."
 CHECKPOINT_DONE	"Waiting for a checkpoint to complete."
-- 
2.46.0.519.g2e7b89e038
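
To make the intended usage concrete, here is a minimal sketch of how a
share-locked caller would use the new pair of functions; this mirrors how the
later patches in the series use it, and the specific hint bit shown is just an
example:

	/* share lock and pin held on "buffer"; the hint is only an optimization */
	if (BufferPrepareToSetHintBits(buffer))
	{
		tuple->t_infomask |= HEAP_XMIN_COMMITTED;	/* example hint bit */
		MarkBufferDirtyHint(buffer, true);
		BufferFinishSetHintBits(buffer);
	}
	/* else: IO is in progress, skip setting the hint and move on */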

v1-0009-heapam-Acquire-right-to-set-hint-bits.patch (text/x-diff; charset=us-ascii)
From 1271da941a9b093abb4565ddb2893f23eafbb33b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 14:05:15 -0400
Subject: [PATCH v1 09/14] heapam: Acquire right to set hint bits

This commit starts to use BufferPrepareToSetHintBits(), introduced in a
previous commit, for the reasons explained in said commit.

To amortize the cost of BufferPrepareToSetHintBits() for cases where hint bits
are set at a high frequency, HeapTupleSatisfiesMVCCBatch() uses the new
SetHintBitsExt() which defers BufferFinishSetHintBits() until all hint bits on
a page have been set.

It's likely worth introducing additional batch visibility routines, e.g. for
vacuuming, but I did not find a regression with the state as of this
commit. So that's left for later.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/heap/heapam_visibility.c | 116 +++++++++++++++++---
 src/tools/pgindent/typedefs.list            |   1 +
 2 files changed, 99 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 4baafc01a2c..61fbac6e7f5 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -80,10 +80,35 @@
 
 
 /*
- * SetHintBits()
+ * To be allowed to set hint bits SetHintBits() needs to call
+ * BufferPrepareToSetHintBits(). However, that's not free, and some callsites
+ * call SetHintBits() on numerous tuples in a row. For those it makes sense to
+ * amortize the cost of BufferPrepareToSetHintBits(). Additionally it's
+ * desirable to defer the cost of BufferPrepareToSetHintBits() until a hint
+ * bit needs to actually be set. This enum serves as the necessary state space
+ * passed to SetHintbitsExt().
+ */
+typedef enum SetHintBitsState
+{
+	/* not yet checked if hint bits may be set */
+	SHB_INITIAL,
+	/* failed to get permission to set hint bits, don't check again */
+	SHB_DISABLED,
+	/* allowed to set hint bits */
+	SHB_ENABLED,
+} SetHintBitsState;
+
+/*
+ * SetHintBitsExt()
  *
  * Set commit/abort hint bits on a tuple, if appropriate at this time.
  *
+ * To be allowed to set a hint bit on a tuple, the page must not be undergoing
+ * IO at this time (otherwise we e.g. could corrupt PG's page checksum or even
+ * the filesystem's, as is known to happen with btrfs). The right to set a
+ * hint bit is acquired on a page level with BufferPrepareToSetHintBits().
+ * Only a single backend gets the right to set hint bits at a time.
+ *
  * It is only safe to set a transaction-committed hint bit if we know the
  * transaction's commit record is guaranteed to be flushed to disk before the
  * buffer, or if the table is temporary or unlogged and will be obliterated by
@@ -111,9 +136,45 @@
  * InvalidTransactionId if no check is needed.
  */
 static inline void
-SetHintBits(HeapTupleHeader tuple, Buffer buffer,
-			uint16 infomask, TransactionId xid)
+SetHintBitsExt(HeapTupleHeader tuple, Buffer buffer,
+			   uint16 infomask, TransactionId xid, SetHintBitsState *state)
 {
+	/*
+	 * If we are in batched mode and previously did not get permission to set
+	 * hint bits, don't try again - in all likelihood IO is still going on.
+	 */
+	if (state && *state == SHB_DISABLED)
+		return;
+
+	/*
+	 * If not batching or this is the first hint that we'd like to set on the
+	 * page, check if we are allowed to do so.
+	 *
+	 * XXX: Should we defer this until after the TransactionIdIsValid() check
+	 * below?
+	 */
+	if ((!state || *state == SHB_INITIAL))
+	{
+		if (BufferPrepareToSetHintBits(buffer))
+		{
+			if (state)
+				*state = SHB_ENABLED;
+		}
+		else
+		{
+			/*
+			 * If we hold an exclusive lock nobody else should be able to
+			 * prevent us from setting hint bits. This is important as there
+			 * are a few hint bit sets that are important for correctness -
+			 * all those happen with the buffer exclusively locked.
+			 */
+			Assert(!BufferLockHeldByMe(buffer, BUFFER_LOCK_EXCLUSIVE));
+			if (state)
+				*state = SHB_DISABLED;
+			return;
+		}
+	}
+
 	if (TransactionIdIsValid(xid))
 	{
 		/* NB: xid must be known committed here! */
@@ -123,12 +184,27 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer,
 			BufferGetLSNAtomic(buffer) < commitLSN)
 		{
 			/* not flushed and no LSN interlock, so don't set hint */
-			return;
+			goto out;
 		}
 	}
 
 	tuple->t_infomask |= infomask;
 	MarkBufferDirtyHint(buffer, true);
+
+out:
+	if (!state)
+		BufferFinishSetHintBits(buffer);
+}
+
+/*
+ * Simple wrapper around SetHintBitsExt(), used when operating on a single
+ * tuple.
+ */
+static inline void
+SetHintBits(HeapTupleHeader tuple, Buffer buffer,
+			uint16 infomask, TransactionId xid)
+{
+	SetHintBitsExt(tuple, buffer, infomask, xid, NULL);
 }
 
 /*
@@ -864,9 +940,9 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
  * inserting/deleting transaction was still running --- which was more cycles
  * and more contention on ProcArrayLock.
  */
-static bool
+static inline bool
 HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
-					   Buffer buffer)
+					   Buffer buffer, SetHintBitsState *state)
 {
 	HeapTupleHeader tuple = htup->t_data;
 
@@ -912,8 +988,8 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
 			if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
 			{
 				/* deleting subtransaction must have aborted */
-				SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
-							InvalidTransactionId);
+				SetHintBitsExt(tuple, buffer, HEAP_XMAX_INVALID,
+							   InvalidTransactionId, state);
 				return true;
 			}
 
@@ -925,13 +1001,13 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
 		else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
 			return false;
 		else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
-			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-						HeapTupleHeaderGetRawXmin(tuple));
+			SetHintBitsExt(tuple, buffer, HEAP_XMIN_COMMITTED,
+						   HeapTupleHeaderGetRawXmin(tuple), state);
 		else
 		{
 			/* it must have aborted or crashed */
-			SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-						InvalidTransactionId);
+			SetHintBitsExt(tuple, buffer, HEAP_XMIN_INVALID,
+						   InvalidTransactionId, state);
 			return false;
 		}
 	}
@@ -994,14 +1070,14 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
 		if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuple)))
 		{
 			/* it must have aborted or crashed */
-			SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
-						InvalidTransactionId);
+			SetHintBitsExt(tuple, buffer, HEAP_XMAX_INVALID,
+						   InvalidTransactionId, state);
 			return true;
 		}
 
 		/* xmax transaction committed */
-		SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED,
-					HeapTupleHeaderGetRawXmax(tuple));
+		SetHintBitsExt(tuple, buffer, HEAP_XMAX_COMMITTED,
+					   HeapTupleHeaderGetRawXmax(tuple), state);
 	}
 	else
 	{
@@ -1597,6 +1673,7 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
 							OffsetNumber *vistuples_dense)
 {
 	int			nvis = 0;
+	SetHintBitsState state = SHB_INITIAL;
 
 	Assert(IsMVCCSnapshot(snapshot));
 
@@ -1609,7 +1686,7 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
 		HeapTuple	tup = &tuples[i];
 #endif
 
-		valid = HeapTupleSatisfiesMVCC(tup, snapshot, buffer);
+		valid = HeapTupleSatisfiesMVCC(tup, snapshot, buffer, &state);
 #ifdef BATCHMVCC_FEWER_ARGS
 		batchmvcc->visible[i] = valid;
 #else
@@ -1622,6 +1699,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
 		}
 	}
 
+	if (state == SHB_ENABLED)
+		BufferFinishSetHintBits(buffer);
+
 	return nvis;
 }
 
@@ -1641,7 +1721,7 @@ HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot, Buffer buffer)
 	switch (snapshot->snapshot_type)
 	{
 		case SNAPSHOT_MVCC:
-			return HeapTupleSatisfiesMVCC(htup, snapshot, buffer);
+			return HeapTupleSatisfiesMVCC(htup, snapshot, buffer, NULL);
 		case SNAPSHOT_SELF:
 			return HeapTupleSatisfiesSelf(htup, snapshot, buffer);
 		case SNAPSHOT_ANY:
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 15c7673465b..04b4f5cb240 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2604,6 +2604,7 @@ SetConstraintStateData
 SetConstraintTriggerData
 SetExprState
 SetFunctionReturnMode
+SetHintBitsState
 SetOp
 SetOpCmd
 SetOpPath
-- 
2.46.0.519.g2e7b89e038
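
Condensed from the hunks above, the caller-side shape of the amortization in
HeapTupleSatisfiesMVCCBatch() is roughly the following (a simplified sketch
that ignores the BATCHMVCC_FEWER_ARGS variants):

	SetHintBitsState state = SHB_INITIAL;

	for (int i = 0; i < ntups; i++)
	{
		/* the first hint bit set on the page acquires the permission lazily */
		visible[i] = HeapTupleSatisfiesMVCC(&tuples[i], snapshot, buffer, &state);
	}

	/* release the permission once, after the whole page has been processed */
	if (state == SHB_ENABLED)
		BufferFinishSetHintBits(buffer);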

v1-0010-Acquire-right-to-set-hint-bits-in-the-remaining-p.patch (text/x-diff; charset=us-ascii)
From 853d9e57765e89db711ac6d9751061a2327a571f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 Oct 2024 16:44:11 -0400
Subject: [PATCH v1 10/14] Acquire right to set hint bits in the remaining
 places

Use BufferPrepareToSetHintBits() in the remaining places that modify buffers
without an exclusive lock. The remaining places are indexes with support for
kill_prior_tuples and the freespace code.

After this we do not need to copy buffers to write them out anymore. That
change is done separately however.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/gist/gistget.c         | 12 +++++----
 src/backend/access/hash/hashutil.c        |  6 +++++
 src/backend/access/nbtree/nbtinsert.c     | 31 +++++++++++++++--------
 src/backend/access/nbtree/nbtutils.c      |  7 +++++
 src/backend/storage/freespace/freespace.c | 14 ++++++----
 src/backend/storage/freespace/fsmpage.c   |  8 +++++-
 6 files changed, 57 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index b35b8a97577..c0a39768fdf 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -57,17 +57,16 @@ gistkillitems(IndexScanDesc scan)
 	gistcheckpage(scan->indexRelation, buffer);
 	page = BufferGetPage(buffer);
 
+	if (!BufferPrepareToSetHintBits(buffer))
+		goto unlock;
+
 	/*
 	 * If page LSN differs it means that the page was modified since the last
 	 * read. killedItems could be not valid so LP_DEAD hints applying is not
 	 * safe.
 	 */
 	if (BufferGetLSNAtomic(buffer) != so->curPageLSN)
-	{
-		UnlockReleaseBuffer(buffer);
-		so->numKilled = 0;		/* reset counter */
-		return;
-	}
+		goto unlock;
 
 	Assert(GistPageIsLeaf(page));
 
@@ -89,6 +88,9 @@ gistkillitems(IndexScanDesc scan)
 		MarkBufferDirtyHint(buffer, true);
 	}
 
+	BufferFinishSetHintBits(buffer);
+
+unlock:
 	UnlockReleaseBuffer(buffer);
 
 	/*
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 20028f5cd14..a9adf95bafa 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -572,6 +572,9 @@ _hash_kill_items(IndexScanDesc scan)
 	else
 		buf = _hash_getbuf(rel, blkno, HASH_READ, LH_OVERFLOW_PAGE);
 
+	if (!BufferPrepareToSetHintBits(so->currPos.buf))
+		goto unlock_page;
+
 	page = BufferGetPage(buf);
 	opaque = HashPageGetOpaque(page);
 	maxoff = PageGetMaxOffsetNumber(page);
@@ -613,6 +616,9 @@ _hash_kill_items(IndexScanDesc scan)
 		MarkBufferDirtyHint(buf, true);
 	}
 
+	BufferFinishSetHintBits(so->currPos.buf);
+
+unlock_page:
 	if (so->hashso_bucket_buf == so->currPos.buf ||
 		havePin)
 		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 99043da8412..eef498ada7c 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -679,20 +679,31 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				{
 					/*
 					 * The conflicting tuple (or all HOT chains pointed to by
-					 * all posting list TIDs) is dead to everyone, so mark the
-					 * index entry killed.
+					 * all posting list TIDs) is dead to everyone, so try to
+					 * mark the index entry killed. It's ok if we're not
+					 * allowed to - this isn't required for correctness.
 					 */
-					ItemIdMarkDead(curitemid);
-					opaque->btpo_flags |= BTP_HAS_GARBAGE;
+					Buffer		buf;
 
-					/*
-					 * Mark buffer with a dirty hint, since state is not
-					 * crucial. Be sure to mark the proper buffer dirty.
-					 */
+					/* Be sure to operate on the proper buffer */
 					if (nbuf != InvalidBuffer)
-						MarkBufferDirtyHint(nbuf, true);
+						buf = nbuf;
 					else
-						MarkBufferDirtyHint(insertstate->buf, true);
+						buf = insertstate->buf;
+
+					if (BufferPrepareToSetHintBits(buf))
+					{
+						ItemIdMarkDead(curitemid);
+						opaque->btpo_flags |= BTP_HAS_GARBAGE;
+
+						/*
+						 * Mark buffer with a dirty hint, since state is not
+						 * crucial.
+						 */
+						MarkBufferDirtyHint(buf, true);
+
+						BufferFinishSetHintBits(buf);
+					}
 				}
 
 				/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 76be65123c8..ea618a5565d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -4210,6 +4210,10 @@ _bt_killitems(IndexScanDesc scan)
 		}
 	}
 
+	/* if we're not able to set hint bits, there's no point continuing */
+	if (!BufferPrepareToSetHintBits(so->currPos.buf))
+		goto unlock_page;
+
 	opaque = BTPageGetOpaque(page);
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
@@ -4329,6 +4333,9 @@ _bt_killitems(IndexScanDesc scan)
 		MarkBufferDirtyHint(so->currPos.buf, true);
 	}
 
+	BufferFinishSetHintBits(so->currPos.buf);
+
+unlock_page:
 	_bt_unlockbuf(scan->indexRelation, so->currPos.buf);
 }
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index b1262ac3b69..864c7e231ee 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -904,12 +904,16 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
 	max_avail = fsm_get_max_avail(page);
 
 	/*
-	 * Reset the next slot pointer. This encourages the use of low-numbered
-	 * pages, increasing the chances that a later vacuum can truncate the
-	 * relation.  We don't bother with a lock here, nor with marking the page
-	 * dirty if it wasn't already, since this is just a hint.
+	 * Try to reset the next slot pointer. This encourages the use of
+	 * low-numbered pages, increasing the chances that a later vacuum can
+	 * truncate the relation.  We don't bother with a lock here, nor with
+	 * marking the page dirty if it wasn't already, since this is just a hint.
 	 */
-	((FSMPage) PageGetContents(page))->fp_next_slot = 0;
+	if (BufferPrepareToSetHintBits(buf))
+	{
+		((FSMPage) PageGetContents(page))->fp_next_slot = 0;
+		BufferFinishSetHintBits(buf);
+	}
 
 	ReleaseBuffer(buf);
 
diff --git a/src/backend/storage/freespace/fsmpage.c b/src/backend/storage/freespace/fsmpage.c
index ba85cad0697..84bfa410dc0 100644
--- a/src/backend/storage/freespace/fsmpage.c
+++ b/src/backend/storage/freespace/fsmpage.c
@@ -300,7 +300,13 @@ restart:
 	 *
 	 * Wrap-around is handled at the beginning of this function.
 	 */
-	fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);
+	if (exclusive_lock_held || BufferPrepareToSetHintBits(buf))
+	{
+		fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);
+
+		if (!exclusive_lock_held)
+			BufferFinishSetHintBits(buf);
+	}
 
 	return slot;
 }
-- 
2.46.0.519.g2e7b89e038

v1-0011-bufmgr-Detect-some-missing-BufferPrepareToSetHint.patch
From 1199f480b9d4f4be9095307bbd41dee3d92d338e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 Oct 2024 18:11:23 -0400
Subject: [PATCH v1 11/14] bufmgr: Detect some missing
 BufferPrepareToSetHintBits() calls

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/buffer/bufmgr.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ee8b898121c..5c7aa6662c5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5018,6 +5018,23 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 	/* here, either share or exclusive lock is OK */
 	Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
 
+	/*
+	 * A share-locked buffer may only be marked dirty for hints after having
+	 * gotten the permission to do so with BufferPrepareToSetHintBits(). It'd
+	 * perhaps be cheap enough to test this even outside of assert enabled
+	 * builds, but LWLockHeldByMeInMode() says "debug support only".
+	 */
+#ifdef USE_ASSERT_CHECKING
+	{
+		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+		if (!LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE))
+		{
+			Assert(buf_state & BM_SETTING_HINTS);
+		}
+	}
+#endif
+
 	/*
 	 * This routine might get called many times on the same page, if we are
 	 * making the first scan after commit of an xact that added/deleted many
-- 
2.46.0.519.g2e7b89e038

v1-0012-bufmgr-Don-t-copy-pages-while-writing-out.patch
From c8f3dd69d09a1e6bb2f393e091aff02edc47a5f2 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 14:14:35 -0400
Subject: [PATCH v1 12/14] bufmgr: Don't copy pages while writing out

After the preceding series of commits introducing and using
BufferPrepareToSetHintBits(), hint bits cannot be set while IO is going
on. Therefore we no longer need to copy pages while they are being written
out.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/bufpage.h         |  3 +-
 src/backend/access/hash/hashpage.c    |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 23 +++++--------
 src/backend/storage/buffer/localbuf.c |  2 +-
 src/backend/storage/page/bufpage.c    | 48 ++++-----------------------
 src/backend/storage/smgr/bulk_write.c |  2 +-
 6 files changed, 18 insertions(+), 62 deletions(-)

diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 6222d46e535..e8cadc73aad 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -507,7 +507,6 @@ extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
 extern void PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum);
 extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 									Item newtup, Size newsize);
-extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
-extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern void PageSetChecksum(Page page, BlockNumber blkno);
 
 #endif							/* BUFPAGE_H */
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index d09c349e28f..2becf48bf06 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -1029,7 +1029,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 					zerobuf.data,
 					true);
 
-	PageSetChecksumInplace(page, lastblock);
+	PageSetChecksum(page, lastblock);
 	smgrextend(RelationGetSmgr(rel), MAIN_FORKNUM, lastblock, zerobuf.data,
 			   false);
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5c7aa6662c5..a123e4e7f40 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1996,9 +1996,8 @@ again:
 	/*
 	 * If the buffer was dirty, try to write it out.  There is a race
 	 * condition here, in that someone might dirty it after we released the
-	 * buffer header lock above, or even while we are writing it out (since
-	 * our share-lock won't prevent hint-bit updates).  We will recheck the
-	 * dirty bit after re-locking the buffer header.
+	 * buffer header lock above.  We will recheck the dirty bit after
+	 * re-locking the buffer header.
 	 */
 	if (buf_state & BM_DIRTY)
 	{
@@ -3786,9 +3785,8 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
  *
  * The caller must hold a pin on the buffer and have share-locked the
  * buffer contents.  (Note: a share-lock does not prevent updates of
- * hint bits in the buffer, so the page could change while the write
- * is in progress, but we assume that that will not invalidate the data
- * written.)
+ * hint bits in the buffer, but hint bits are not allowed to be set after
+ * StartBufferIO() succeeded.)
  *
  * If the caller has an smgr reference for the buffer's relation, pass it
  * as the second parameter.  If not, pass NULL.
@@ -3801,7 +3799,6 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	ErrorContextCallback errcallback;
 	instr_time	io_start;
 	Block		bufBlock;
-	char	   *bufToWrite;
 	uint32		buf_state;
 
 	/*
@@ -3867,12 +3864,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 */
 	bufBlock = BufHdrGetBlock(buf);
 
-	/*
-	 * Update page checksum if desired.  Since we have only shared lock on the
-	 * buffer, other processes might be updating hint bits in it, so we must
-	 * copy the page to private storage if we do checksumming.
-	 */
-	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+	/* Update page checksum if desired. */
+	PageSetChecksum((Page) bufBlock, buf->tag.blockNum);
 
 	io_start = pgstat_prepare_io_time(track_io_timing);
 
@@ -3882,7 +3875,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	smgrwrite(reln,
 			  BufTagGetForkNum(&buf->tag),
 			  buf->tag.blockNum,
-			  bufToWrite,
+			  bufBlock,
 			  false);
 
 	/*
@@ -4531,7 +4524,7 @@ FlushRelationBuffers(Relation rel)
 				errcallback.previous = error_context_stack;
 				error_context_stack = &errcallback;
 
-				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
+				PageSetChecksum(localpage, bufHdr->tag.blockNum);
 
 				io_start = pgstat_prepare_io_time(track_io_timing);
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 8da7dd6c98a..e75a1960d10 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -242,7 +242,7 @@ GetLocalVictimBuffer(void)
 		/* Find smgr relation for buffer */
 		oreln = smgropen(BufTagGetRelFileLocator(&bufHdr->tag), MyProcNumber);
 
-		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
+		PageSetChecksum(localpage, bufHdr->tag.blockNum);
 
 		io_start = pgstat_prepare_io_time(track_io_timing);
 
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index be6f1f62d29..2930fd936cd 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1495,51 +1495,15 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 /*
  * Set checksum for a page in shared buffers.
  *
- * If checksums are disabled, or if the page is not initialized, just return
- * the input.  Otherwise, we must make a copy of the page before calculating
- * the checksum, to prevent concurrent modifications (e.g. setting hint bits)
- * from making the final checksum invalid.  It doesn't matter if we include or
- * exclude hints during the copy, as long as we write a valid page and
- * associated checksum.
+ * If checksums are disabled, or if the page is not initialized, just
+ * return. Otherwise compute and set the checksum.
  *
- * Returns a pointer to the block-sized data that needs to be written. Uses
- * statically-allocated memory, so the caller must immediately write the
- * returned page and not refer to it again.
- */
-char *
-PageSetChecksumCopy(Page page, BlockNumber blkno)
-{
-	static char *pageCopy = NULL;
-
-	/* If we don't need a checksum, just return the passed-in data */
-	if (PageIsNew(page) || !DataChecksumsEnabled())
-		return (char *) page;
-
-	/*
-	 * We allocate the copy space once and use it over on each subsequent
-	 * call.  The point of palloc'ing here, rather than having a static char
-	 * array, is first to ensure adequate alignment for the checksumming code
-	 * and second to avoid wasting space in processes that never call this.
-	 */
-	if (pageCopy == NULL)
-		pageCopy = MemoryContextAllocAligned(TopMemoryContext,
-											 BLCKSZ,
-											 PG_IO_ALIGN_SIZE,
-											 0);
-
-	memcpy(pageCopy, (char *) page, BLCKSZ);
-	((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
-	return pageCopy;
-}
-
-/*
- * Set checksum for a page in private memory.
- *
- * This must only be used when we know that no other process can be modifying
- * the page buffer.
+ * In the past this needed to be done on a copy of the page, due to the
+ * possibility of e.g. hint bits being set concurrently. However, this is not
+ * necessary anymore as hint bits won't be set while IO is going on.
  */
 void
-PageSetChecksumInplace(Page page, BlockNumber blkno)
+PageSetChecksum(Page page, BlockNumber blkno)
 {
 	/* If we don't need a checksum, just return */
 	if (PageIsNew(page) || !DataChecksumsEnabled())
diff --git a/src/backend/storage/smgr/bulk_write.c b/src/backend/storage/smgr/bulk_write.c
index f0a65bfe242..2c294ddab1f 100644
--- a/src/backend/storage/smgr/bulk_write.c
+++ b/src/backend/storage/smgr/bulk_write.c
@@ -277,7 +277,7 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
 		BlockNumber blkno = pending_writes[i].blkno;
 		Page		page = pending_writes[i].buf->data;
 
-		PageSetChecksumInplace(page, blkno);
+		PageSetChecksum(page, blkno);
 
 		if (blkno >= bulkstate->pages_written)
 		{
-- 
2.46.0.519.g2e7b89e038

v1-0013-WIP-bufmgr-Detect-some-bad-buffer-accesses.patch
From d4e63de7705ec094affa0f877c905713a7dc6c32 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 Oct 2024 18:03:53 -0400
Subject: [PATCH v1 13/14] WIP: bufmgr: Detect some bad buffer accesses

I wrote this mainly to ensure that I did not miss converting any hint bit sets
to BufferPrepareToSetHintBits(), but it seems like it might be more generally
useful.

If we do want to include it, it needs a bit more polishing.

On my workstation, the performance effect of this test infrastructure is as
follows:

base:
real	1m4.613s
user	4m31.409s
sys	3m20.445s

ENFORCE_BUFFER_PROT

real	1m11.912s
user	4m27.332s
sys	3m28.063s

ENFORCE_BUFFER_PROT + ENFORCE_BUFFER_PROT_READ
real	1m33.455s
user	4m32.188s
sys	3m41.275s

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/pg_config_manual.h        |  24 ++++
 src/include/storage/buf_internals.h   |   9 ++
 src/include/storage/pg_shmem.h        |   2 +
 src/backend/storage/buffer/buf_init.c |  10 ++
 src/backend/storage/buffer/bufmgr.c   | 154 ++++++++++++++++++++++++--
 src/backend/utils/misc/guc_tables.c   |   2 +-
 6 files changed, 191 insertions(+), 10 deletions(-)

diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index e49eb13e43c..8ea96bd1f1b 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -361,3 +361,27 @@
  * Enable tracing of syncscan operations (see also the trace_syncscan GUC var).
  */
 /* #define TRACE_SYNCSCAN */
+
+
+/*
+ * On some operating systems we know how to change whether pages are
+ * readable/writeable. We can use that to verify that we are following buffer
+ * locking rules, we can make pages inaccessible or read-only when we don't
+ * have sufficient locks etc. This obviously is fairly expensive, so by
+ * default we only do so in assert enabled builds.
+ */
+#if defined(USE_ASSERT_CHECKING) && !defined(WIN32)
+#define ENFORCE_BUFFER_PROT
+#endif
+
+/*
+ * Protecting pages against being modified without an exclusive lock /
+ * BufferPrepareToSetHintBits() is reasonably cheap, neither happens *that*
+ * often. Pinning/unpinning buffers is a lot more common, making it more
+ * expensive to call mprotect() that often.
+ *
+ * Therefore disable this by default, even in assert enabled builds.
+ */
+#ifdef ENFORCE_BUFFER_PROT
+/* #define ENFORCE_BUFFER_PROT_READ */
+#endif
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 1f8a0482e90..f542c245e27 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -457,6 +457,15 @@ extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
 extern bool have_free_buffer(void);
 
+#ifdef ENFORCE_BUFFER_PROT
+extern void SetBufferProtection(Buffer buf, bool allow_reads, bool allow_writes);
+#else
+static inline void
+SetBufferProtection(Buffer buf, bool allow_reads, bool allow_writes)
+{
+}
+#endif
+
 /* buf_table.c */
 extern Size BufTableShmemSize(int size);
 extern void InitBufTable(int size);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 3065ff5be71..42a12688b9a 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -63,6 +63,8 @@ typedef enum
 	SHMEM_TYPE_MMAP,
 }			PGShmemType;
 
+extern PGDLLIMPORT int huge_pages_status;
+
 #ifndef WIN32
 extern PGDLLIMPORT unsigned long UsedShmemSegID;
 #else
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 56761a8eedc..0e02d8e1e9b 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -135,6 +135,16 @@ BufferManagerShmemInit(void)
 							 LWTRANCHE_BUFFER_CONTENT);
 
 			ConditionVariableInit(BufferDescriptorGetIOCV(buf));
+
+			/*
+			 * Unused buffers are inaccessible. But if we're not enforcing
+			 * making buffers inaccessible without a pin, we won't mark
+			 * buffers as accessible during pinning, therefore we better don't
+			 * make them initially inaccessible.
+			 */
+#if defined(ENFORCE_BUFFER_PROT) && defined(ENFORCE_BUFFER_PROT_READ)
+			SetBufferProtection(i + 1, false, false);
+#endif
 		}
 
 		/* Correct last entry of linked list */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a123e4e7f40..50947b1021f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,10 @@
 #include <sys/file.h>
 #include <unistd.h>
 
+#ifdef ENFORCE_BUFFER_PROT
+#include <sys/mman.h>			/* for mprotect() */
+#endif							/* ENFORCE_BUFFER_PROT */
+
 #include "access/tableam.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
@@ -53,6 +57,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/read_stream.h"
 #include "storage/smgr.h"
@@ -1065,8 +1070,6 @@ ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
 
 	if (need_to_zero)
 	{
-		memset(BufferGetPage(buffer), 0, BLCKSZ);
-
 		/*
 		 * Grab the buffer content lock before marking the page as valid, to
 		 * make sure that no other backend sees the zeroed page before the
@@ -1079,7 +1082,12 @@ ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
 		 * already valid.)
 		 */
 		if (!isLocalBuf)
+		{
 			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
+			SetBufferProtection(buffer, true, true);
+		}
+
+		memset(BufferGetPage(buffer), 0, BLCKSZ);
 
 		if (isLocalBuf)
 		{
@@ -1505,6 +1513,9 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 		io_first_block = blocknum + i;
 		io_buffers_len = 1;
 
+		if (persistence != RELPERSISTENCE_TEMP)
+			SetBufferProtection(io_buffers[0], true, true);
+
 		/*
 		 * How many neighboring-on-disk blocks can we scatter-read into other
 		 * buffers at the same time?  In this case we don't wait if we see an
@@ -1520,7 +1531,13 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 				   BufferGetBlockNumber(buffers[i]) + 1);
 
 			io_buffers[io_buffers_len] = buffers[++i];
-			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+			io_pages[io_buffers_len] = BufferGetBlock(buffers[i]);
+
+			/* smgrreadv() needs to modify the buffer */
+			if (persistence != RELPERSISTENCE_TEMP)
+				SetBufferProtection(io_buffers[io_buffers_len], true, true);
+
+			io_buffers_len++;
 		}
 
 		io_start = pgstat_prepare_io_time(track_io_timing);
@@ -1580,6 +1597,9 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 				TerminateBufferIO(bufHdr, false, BM_VALID, true);
 			}
 
+			if (persistence != RELPERSISTENCE_TEMP)
+				SetBufferProtection(io_buffers[j], true, false);
+
 			/* Report I/Os as completing individually. */
 			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
 											  operation->smgr->smgr_rlocator.locator.spcOid,
@@ -2233,7 +2253,9 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 		buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
 
 		/* new buffers are zero-filled */
+		SetBufferProtection(buffers[i], true, true);
 		MemSet((char *) buf_block, 0, BLCKSZ);
+		SetBufferProtection(buffers[i], true, false);
 	}
 
 	/*
@@ -2460,7 +2482,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 		}
 
 		if (lock)
-			LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
+			LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 		TerminateBufferIO(buf_hdr, false, BM_VALID, true);
 	}
@@ -2719,6 +2741,10 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 				 * non-accessible in any case.
 				 */
 				VALGRIND_MAKE_MEM_DEFINED(BufHdrGetBlock(buf), BLCKSZ);
+
+#if defined(ENFORCE_BUFFER_PROT) && defined(ENFORCE_BUFFER_PROT_READ)
+				SetBufferProtection(BufferDescriptorGetBuffer(buf), true, false);
+#endif
 				break;
 			}
 		}
@@ -2791,6 +2817,10 @@ PinBuffer_Locked(BufferDesc *buf)
 	 */
 	VALGRIND_MAKE_MEM_DEFINED(BufHdrGetBlock(buf), BLCKSZ);
 
+#if defined(ENFORCE_BUFFER_PROT) && defined(ENFORCE_BUFFER_PROT_READ)
+	SetBufferProtection(BufferDescriptorGetBuffer(buf), true, false);
+#endif
+
 	/*
 	 * Since we hold the buffer spinlock, we can update the buffer state and
 	 * release the lock in one operation.
@@ -2850,6 +2880,10 @@ UnpinBufferNoOwner(BufferDesc *buf)
 		 */
 		VALGRIND_MAKE_MEM_NOACCESS(BufHdrGetBlock(buf), BLCKSZ);
 
+#if defined(ENFORCE_BUFFER_PROT) && defined(ENFORCE_BUFFER_PROT_READ)
+		SetBufferProtection(BufferDescriptorGetBuffer(buf), false, false);
+#endif
+
 		/* I'd better not still hold the buffer content lock */
 		Assert(!LWLockHeldByMe(BufferDescriptorGetContentLock(buf)));
 
@@ -3670,6 +3704,69 @@ CheckForBufferLeaks(void)
 #endif
 }
 
+/*
+ * To verify that we are following buffer locking rules, we can make pages
+ * inaccessible or read-only when we don't have sufficient locks etc.
+ *
+ * XXX: It might be possible to fold the VALGRIND_MAKE_MEM_NOACCESS() /
+ * VALGRIND_MAKE_MEM_DEFINED() calls into this.
+ */
+#ifdef ENFORCE_BUFFER_PROT
+void
+SetBufferProtection(Buffer buf, bool allow_reads, bool allow_writes)
+{
+	static long	pagesz = 0;
+	int			prot = PROT_NONE;
+	int			rc;
+
+	Assert(huge_pages_status != HUGE_PAGES_UNKNOWN &&
+		   huge_pages_status != HUGE_PAGES_TRY);
+	StaticAssertStmt(PROT_NONE == 0, "that can't be right");
+
+	if (unlikely(pagesz == 0))
+	{
+		pagesz = sysconf(_SC_PAGESIZE);
+
+		elog(DEBUG1, "sysconf(_SC_PAGESIZE) = %ld", pagesz);
+		if (pagesz == -1)
+		{
+			elog(ERROR, "sysconf(_SC_PAGESIZE) failed: %m");
+		}
+		else if (pagesz > BLCKSZ)
+		{
+			elog(DEBUG1, "pagesz > BLCKSZ, disabling buffer protection mode");
+			pagesz = -1;
+		}
+		else if(BLCKSZ % pagesz != 0)
+		{
+			elog(DEBUG1, "BLCKSZ %% pagesz != 0, disabling buffer protection mode");
+			pagesz = -1;
+		}
+		else if (huge_pages_status == HUGE_PAGES_ON)
+		{
+			/* can't set status in a granular enough way */
+			elog(DEBUG1, "huge pages enabled, disabling buffer protection mode");
+			pagesz = -1;
+		}
+	}
+
+	/* disabled */
+	if (pagesz == -1)
+		return;
+
+	if (allow_reads)
+		prot |= PROT_READ;
+
+	if (allow_writes)
+		prot |= PROT_WRITE;
+
+	rc = mprotect(BufferGetBlock(buf), BLCKSZ, prot);
+
+	if (rc != 0)
+		elog(ERROR, "mprotect(%d, %d) failed: %m", buf, prot);
+}
+#endif							/* ENFORCE_BUFFER_PROT */
+
 /*
  * Helper routine to issue warnings when a buffer is unexpectedly pinned
  */
@@ -3865,7 +3962,10 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	bufBlock = BufHdrGetBlock(buf);
 
 	/* Update page checksum if desired. */
+	SetBufferProtection(BufferDescriptorGetBuffer(buf), true, true);
 	PageSetChecksum((Page) bufBlock, buf->tag.blockNum);
+	/* FIXME: could theoretically be exclusively locked */
+	SetBufferProtection(BufferDescriptorGetBuffer(buf), true, false);
 
 	io_start = pgstat_prepare_io_time(track_io_timing);
 
@@ -5181,19 +5281,38 @@ void
 LockBuffer(Buffer buffer, int mode)
 {
 	BufferDesc *buf;
+	LWLock	   *content_lock;
 
 	Assert(BufferIsPinned(buffer));
 	if (BufferIsLocal(buffer))
 		return;					/* local buffers need no lock */
 
 	buf = GetBufferDescriptor(buffer - 1);
+	content_lock = BufferDescriptorGetContentLock(buf);
 
 	if (mode == BUFFER_LOCK_UNLOCK)
-		LWLockRelease(BufferDescriptorGetContentLock(buf));
+	{
+#ifdef ENFORCE_BUFFER_PROT
+		bool		was_exclusive;
+
+		was_exclusive = LWLockHeldByMeInMode(content_lock, LW_EXCLUSIVE);
+#endif							/* ENFORCE_BUFFER_PROT */
+
+		LWLockRelease(content_lock);
+
+#ifdef ENFORCE_BUFFER_PROT
+		if (was_exclusive)
+			SetBufferProtection(buffer, true, false);
+#endif							/* ENFORCE_BUFFER_PROT */
+	}
 	else if (mode == BUFFER_LOCK_SHARE)
-		LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_SHARED);
+		LWLockAcquire(content_lock, LW_SHARED);
 	else if (mode == BUFFER_LOCK_EXCLUSIVE)
-		LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
+	{
+		LWLockAcquire(content_lock, LW_EXCLUSIVE);
+
+		SetBufferProtection(buffer, true, true);
+	}
 	else
 		elog(ERROR, "unrecognized buffer lock mode: %d", mode);
 }
@@ -5207,6 +5326,7 @@ bool
 ConditionalLockBuffer(Buffer buffer)
 {
 	BufferDesc *buf;
+	bool		ret;
 
 	Assert(BufferIsPinned(buffer));
 	if (BufferIsLocal(buffer))
@@ -5214,8 +5334,13 @@ ConditionalLockBuffer(Buffer buffer)
 
 	buf = GetBufferDescriptor(buffer - 1);
 
-	return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
-									LW_EXCLUSIVE);
+	ret = LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+								   LW_EXCLUSIVE);
+
+	if (ret)
+		SetBufferProtection(buffer, true, true);
+
+	return ret;
 }
 
 /*
@@ -6316,6 +6441,12 @@ BufferPrepareToSetHintBits(Buffer buf)
 										   new_buf_state))
 		{
 			ResourceOwnerRememberBufferSettingHints(CurrentResourceOwner, buf);
+
+#ifdef ENFORCE_BUFFER_PROT
+			if (!LWLockHeldByMeInMode(BufferDescriptorGetContentLock(desc), LW_EXCLUSIVE))
+				SetBufferProtection(buf, true, true);
+#endif							/* ENFORCE_BUFFER_PROT */
+
 			return true;
 		}
 	}
@@ -6358,4 +6489,9 @@ BufferFinishSetHintBits(Buffer buf)
 	 * know if there is somebody waiting.
 	 */
 	ConditionVariableBroadcast(BufferDescriptorGetIOCV(desc));
+
+#ifdef ENFORCE_BUFFER_PROT
+	if (!LWLockHeldByMeInMode(BufferDescriptorGetContentLock(desc), LW_EXCLUSIVE))
+		SetBufferProtection(buf, true, false);
+#endif							/* ENFORCE_BUFFER_PROT */
 }
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8a67f01200c..4c94c20f0cb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -561,7 +561,7 @@ static int	ssl_renegotiation_limit;
  */
 int			huge_pages = HUGE_PAGES_TRY;
 int			huge_page_size;
-static int	huge_pages_status = HUGE_PAGES_UNKNOWN;
+int			huge_pages_status = HUGE_PAGES_UNKNOWN;
 
 /*
  * These variables are all dummies that don't do anything, except in some
-- 
2.46.0.519.g2e7b89e038

#10Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andres Freund (#9)
Re: AIO writes vs hint bits vs checksums

On 30/10/2024 04:21, Andres Freund wrote:

Attached is a, unfortunately long, series of patches implementing what I
described upthread.

Review of the preparatory patches:

0001 Add very basic test for kill_prior_tuples

We currently don't exercise this patch for gist and hash, which seems
somewhat criminal.

Interesting to use the isolationtester for this. There's just one
session, so you're just using it to define reusable steps with handy
names. I'm fine with that, but please add a comment to explain it. I
wonder if it'd be more straightforward to make it a regular pg_regress
test though. There would be some repetition, but would it be so bad?

You forgot to add the new test to 'isolation_schedule'.

typos:
"inex" -> "index"
"does something approximately reasonble" -> "do something
approximately reasonable"

0002 heapam: Move logic to handle HEAP_MOVED into a helper function

A prep patch, which consolidates HEAP_MOVED handling into a helper. This
isn't strictly necessary, but I got bothered by it while writing this patch
series.

+1

+/*
+ * If HEAP_MOVED_OFF or HEAP_MOVED_IN are set on the tuple, remove them and
+ * adjust hint bits. See the comment for SetHintBits() for more background.
+ *
+ * This helper returns false if the row ought to be invisible, true otherwise.
+ */
+static inline bool
+HeapTupleCleanMoved(HeapTupleHeader tuple, Buffer buffer)
+{
+	TransactionId xvac;
+
+	/* only used by pre-9.0 binary upgrades */
+	if (likely(!(tuple->t_infomask & (HEAP_MOVED_OFF | HEAP_MOVED_IN))))
+		return true;
+
+ 	...
+}

This is so unlikely these days that this function probably should not be
inlined anymore.

0003 bufmgr: Add BufferLockHeldByMe()

Useful for writing assertions in later patches.

+	if (mode == BUFFER_LOCK_EXCLUSIVE)
+		lwmode = LW_EXCLUSIVE;
+	else if (mode == BUFFER_LOCK_SHARE)
+		lwmode = LW_SHARED;
+	else
+	{
+		Assert(false);
+		pg_unreachable();
+		lwmode = LW_EXCLUSIVE;  /* assuage compiler */
+	}

I just learned a new word :-).

This is fine, but to avoid the assuaging, you could also do this:

if (mode == BUFFER_LOCK_EXCLUSIVE)
lwmode = LW_EXCLUSIVE;
else
{
Assert(mode == BUFFER_LOCK_SHARE);
lwmode = LW_SHARED;
}

0004 heapam: Use exclusive lock on old page in CLUSTER

When I started looking into this I wanted to make sure that it's actually
safe to skip setting hint bits in all places. One place that's *not* safe
is cluster - we assume that all hint bits are set in rewrite_heap_tuple().

This might not strictly be required, but it seemed very fragile as
is. While using an exclusive lock does seem like it makes it a bit safer,
it's not a very good fix, as we still dirty the old buffer, which seems
like a shame when one uses VACUUM FULL.

+		/*
+		 * To be able to guarantee that we can set the hint bit, acquire an
+		 * exclusive lock on the old buffer. We need the hint bits to be set
+		 * as otherwise reform_and_rewrite_tuple() -> rewrite_heap_tuple() ->
+		 * heap_freeze_tuple() will get confused.
+		 *

Hmm, I don't see any comments in the reform_and_rewrite_tuple() ->
rewrite_heap_tuple() -> heap_freeze_tuple() path about requiring hint
bits to be set. How does it get confused?

0005 heapam: Only set tuple's block once per page in pagemode

Minor, but measurable, optimization.

+1, this makes sense independently of all the other patches.

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1748eafa100..d0cb4b1e29b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -981,6 +981,9 @@ heapgettup_pagemode(HeapScanDesc scan,
linesleft = scan->rs_ntuples;
lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
+		/* set block once per page, instead of doing so for every tuple */
+		BlockIdSet(&tuple->t_self.ip_blkid, scan->rs_cblock);
+
/* lineindex now references the next or previous visible tid */
continue_page:

@@ -995,7 +998,7 @@ continue_page:

tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
+			tuple->t_self.ip_posid = lineoff;

/* skip any tuples that don't match the scan key */
if (key != NULL &&

Should probably use ItemPointerSetBlockNumber and
ItemPointerSetOffsetNumber.
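
I.e. something like this (just sketching that suggestion, untested):

/* set block once per page, instead of doing so for every tuple */
ItemPointerSetBlockNumber(&tuple->t_self, scan->rs_cblock);
...
ItemPointerSetOffsetNumber(&tuple->t_self, lineoff);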

0006 heapam: Add batch mode mvcc check and use it in page mode

This is a good bit faster on its own, but is required to avoid a
performance regression later, when setting hint bits only when no IO is
going on at the same time.

The BATCHMVCC_FEWER_ARGS stuff is pretty weird. Hard to imagine that
it'd make any difference how you pass those arguments.

+		if (likely(valid))
+		{
+			vistuples_dense[nvis] = tup->t_self.ip_posid;
+			nvis++;
+		}

Is the likely(valid) hint important for performance here? It would be a
bad hint for a table with lots of dead tuples. I feel we'd better rely
on the branch predictor for this one.

0007 bufmgr: Make it easier to change number of buffer state bits

This just makes it easier to change the division of bits in
BufferDesc->state.

Looks good to me.

--
Heikki Linnakangas
Neon (https://neon.tech)

#11Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andres Freund (#1)
Re: AIO writes vs hint bits vs checksums

On 24/09/2024 18:55, Andres Freund wrote:

What I'd instead like to propose is to implement the right to set hint bits as
a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
BM_SETTING_HINTS to be unset to start IO.

Naively implementing this, by acquiring and releasing the permission to set
hint bits in SetHintBits() unfortunately leads to a significant performance
regression. While the performance is unaffected for OLTPish workloads like
pgbench (both read and write), sequential scans of unhinted tables regress
significantly, due to the per-tuple lock acquisition this would imply.

It'd be nice to implement this without having to set and clear
BM_SETTING_HINTS every time. Could we put the overhead on the
FlushBuffer() instead?

How about something like this:

To set hint bits:

1. Lock page in SHARED mode.
2. Read BM_IO_IN_PROGRESS
3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
4. Unlock page

To flush a buffer:

1. Lock page in SHARED mode
2. Set BM_IO_IN_PROGRESS
3. Read the lock count on the buffer lock, to see if we're the only locker.
4. If anyone else is holding the lock, upgrade it to exclusive mode, and
immediately downgrade back to share mode.
5. calculate CRC, flush the buffer
6. Clear BM_IO_IN_PROGRESS and unlock page.

This goes back to the idea of adding LWLock support for this, but the
amount of changes could be pretty small. The missing operation we don't
have today is reading the share-lock count on the lock in step 3. That
seems simple to add.

Acquiring the exclusive lock in step 4 is just a way to wait for all the
existing share-lockers to release the lock. You wouldn't need to block
new share-lockers. We have LW_WAIT_UNTIL_FREE, which is almost what we
need, but it currently ignores share-lockers. So doing this "properly"
would require more changes to LWLocks. Briefly acquiring an exclusive
lock seems acceptable though.
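
In rough pseudo-C the flush side could look like this (a sketch only;
LWLockShareLockCount() is a made-up primitive for step 3, and the "upgrade" in
step 4 is approximated by a release plus an exclusive acquire, relying on
BM_IO_IN_PROGRESS to keep newly arriving share lockers from setting hints in
the meantime):

static void
FlushBufferSketch(BufferDesc *buf)
{
	LWLock	   *content_lock = BufferDescriptorGetContentLock(buf);
	uint32		buf_state;

	LWLockAcquire(content_lock, LW_SHARED);			/* 1 */

	buf_state = LockBufHdr(buf);
	buf_state |= BM_IO_IN_PROGRESS;					/* 2 */
	UnlockBufHdr(buf, buf_state);

	if (LWLockShareLockCount(content_lock) > 1)		/* 3, hypothetical */
	{
		/* 4: wait out the pre-existing share lockers */
		LWLockRelease(content_lock);
		LWLockAcquire(content_lock, LW_EXCLUSIVE);
		LWLockRelease(content_lock);
		LWLockAcquire(content_lock, LW_SHARED);
	}

	/* 5: compute the checksum, write the buffer out */
	/* 6: clear BM_IO_IN_PROGRESS, wake waiters, unlock the page */
	LWLockRelease(content_lock);
}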

But: We can address this and improve performance over the status quo! Today we
determine tuple visibility one-by-one, even when checking the
visibility of an entire page worth of tuples.

Yes, batching the visibility checks makes sense in any case.

--
Heikki Linnakangas
Neon (https://neon.tech)

#12Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#11)
Re: AIO writes vs hint bits vs checksums

Hi,

Thanks for the quick review!

On 2024-10-30 14:16:51 +0200, Heikki Linnakangas wrote:

On 24/09/2024 18:55, Andres Freund wrote:

What I'd instead like to propose is to implement the right to set hint bits as
a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
BM_SETTING_HINTS to be unset to start IO.

Naively implementing this, by acquiring and releasing the permission to set
hint bits in SetHintBits() unfortunately leads to a significant performance
regression. While the performance is unaffected for OLTPish workloads like
pgbench (both read and write), sequential scans of unhinted tables regress
significantly, due to the per-tuple lock acquisition this would imply.

It'd be nice to implement this without having to set and clear
BM_SETTING_HINTS every time.

True - but it's worth noting that we only need to set BM_SETTING_HINTS if
there are any hint bits that need to be set. So in the most common paths we
won't need to deal with it at all.
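
For illustration, the fast path can then stay entirely free of the new
machinery - roughly like this (a sketch, not the actual patch; the xid and
checksum handling that SetHintBits() also does is omitted):

static inline void
SetHintBits(HeapTupleHeader tuple, Buffer buffer,
			uint16 infomask, TransactionId xid)
{
	/* common path: bits already set, BM_SETTING_HINTS is never touched */
	if ((tuple->t_infomask & infomask) == infomask)
		return;

	/* only now acquire the permission to set hint bits */
	if (!BufferPrepareToSetHintBits(buffer))
		return;					/* IO in progress - skip, it's only a hint */

	tuple->t_infomask |= infomask;
	MarkBufferDirtyHint(buffer, true);
	BufferFinishSetHintBits(buffer);
}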

Could we put the overhead on the FlushBuffer()
instead?

How about something like this:

To set hint bits:

1. Lock page in SHARED mode.
2. Read BM_IO_IN_PROGRESS
3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
4. Unlock page

To flush a buffer:

1. Lock page in SHARED mode
2. Set BM_IO_IN_PROGRESS
3. Read the lock count on the buffer lock, to see if we're the only locker.
4. If anyone else is holding the lock, upgrade it to exclusive mode, and
immediately downgrade back to share mode.
5. calculate CRC, flush the buffer
6. Clear BM_IO_IN_PROGRESS and unlock page.

This goes back to the idea of adding LWLock support for this, but the amount
of changes could be pretty small. The missing operation we don't have today
is reading the share-lock count on the lock in step 3. That seems simple to
add.

I've played around with a bunch of ideas like this. There are two main
reasons I didn't like them that much in the end:

1) The worst case latency impacts seemed to make them not that
interesting. A buffer that is heavily contended with share locks might not
get down to zero share lockers for quite a while. That's not a problem for
individual FlushBuffer() calls, but it could very well add up to a decent
sized delay for something like a checkpoint that has to flush a lot of
buffers.

Waiting for all pre-existing share lockers is easier said than done. We
don't record the acquisition order anywhere and a share-lock release won't
wake anybody if the lockcount doesn't reach zero. Changing that wouldn't
exactly be free and the cost would be borne by all lwlock users.

2) They are based on holding an lwlock. But it's actually quite conceivable
that we'd want to set something hint-bit-like without any lock, as we
e.g. currently do for freespace/. That would work with something like the
approach I chose here, but not if we rely on lwlocks.

E.g. with a bit of work we could actually do sequential scans without
blocking concurrent buffer modifications by carefully ordering when
pd_lower is modified and making sure that tuple header fields are written
in a sensible order.
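
As a very rough illustration of that ordering (purely a sketch, not part of
the patchset, with made-up variable names), the writing side would publish a
new tuple along these lines, with matching pg_read_barrier() calls on the
reading side:

/* tuple contents first, then the line pointer, then make it reachable */
memcpy(tuplepos, heaptup->t_data, heaptup->t_len);
pg_write_barrier();
ItemIdSetNormal(itemId, offset, heaptup->t_len);
pg_write_barrier();
((PageHeader) page)->pd_lower = new_pd_lower;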

Acquiring the exclusive lock in step 4 is just a way to wait for all the
existing share-lockers to release the lock. You wouldn't need to block new
share-lockers. We have LW_WAIT_UNTIL_FREE, which is almost what we need, but
it currently ignores share-lockers. So doing this "properly" would require
more changes to LWLocks. Briefly acquiring an exclusive lock seems
acceptable though.

I don't think LW_WAIT_UNTIL_FREE is that easily extended to cover something
like this, because share lockers can acquire/release the lock without waking
anybody. And changing that would be rather expensive for everyone...

But: We can address this and improve performance over the status quo! Today we
determine tuple visibility one-by-one, even when checking the
visibility of an entire page worth of tuples.

Yes, batching the visibility checks makes sense in any case.

Cool. The wins by doing that are noticeable...

Greetings,

Andres Freund

#13Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#10)
Re: AIO writes vs hint bits vs checksums

Hi,

On 2024-10-30 13:29:01 +0200, Heikki Linnakangas wrote:

On 30/10/2024 04:21, Andres Freund wrote:

Attached is a, unfortunately long, series of patches implementing what I
described upthread.

Review of the preparatory patches:

0001 Add very basic test for kill_prior_tuples

We currently don't exercise this patch for gist and hash, which seems
somewhat criminal.

Interesting to use the isolationtester for this. There's just one session,
so you're just using it to define reusable steps with handy names.

Yea. I had started out writing it as a pg_regress style test and it quickly got very
hard to understand.

I'm fine with that, but please add a comment to explain it.

Makes sense.

I wonder if it'd be more straightforward to make it a regular pg_regress
test though. There would be some repetition, but would it be so bad?

I found it to be quite bad. If testing just one AM it's ok-ish, but once you
test 2-3 it gets very long and repetitive. I guess we could use functions or
such to make it a bit less painful - but at that point, is it actually simpler?

You forgot to add the new test to 'isolation_schedule'.

Oops.

typos:
"inex" -> "index"
"does something approximately reasonble" -> "do something approximately
reasonable"

Oops^2.

+/*
+ * If HEAP_MOVED_OFF or HEAP_MOVED_IN are set on the tuple, remove them and
+ * adjust hint bits. See the comment for SetHintBits() for more background.
+ *
+ * This helper returns false if the row ought to be invisible, true otherwise.
+ */
+static inline bool
+HeapTupleCleanMoved(HeapTupleHeader tuple, Buffer buffer)
+{
+	TransactionId xvac;
+
+	/* only used by pre-9.0 binary upgrades */
+	if (likely(!(tuple->t_infomask & (HEAP_MOVED_OFF | HEAP_MOVED_IN))))
+		return true;
+
+ 	...
+}

This is so unlikely these days that this function probably should not be
inlined anymore.

Because I put the check for HEAP_MOVED in the function, it is important that
it's inlined. The compiler can then see that the rest of the function is
unlikely to be executed and place it together with other rarely executed code.

0003 bufmgr: Add BufferLockHeldByMe()

Useful for writing assertions in later patches.

+	if (mode == BUFFER_LOCK_EXCLUSIVE)
+		lwmode = LW_EXCLUSIVE;
+	else if (mode == BUFFER_LOCK_SHARE)
+		lwmode = LW_SHARED;
+	else
+	{
+		Assert(false);
+		pg_unreachable();
+		lwmode = LW_EXCLUSIVE;  /* assuage compiler */
+	}

I just learned a new word :-).

Heh.

This is fine, but to avoid the assuaging, you could also do this:

if (mode == BUFFER_LOCK_EXCLUSIVE)
lwmode = LW_EXCLUSIVE;
else
{
Assert(mode == BUFFER_LOCK_SHARE);
lwmode = LW_SHARED;
}

I have no opinion about which version is better...

0004 heapam: Use exclusive lock on old page in CLUSTER

When I started looking into this I wanted to make sure that it's actually
safe to skip setting hint bits in all places. One place that's *not* safe
is cluster - we assume that all hint bits are set in rewrite_heap_tuple().

This might not strictly be required, but it seemed very fragile as
is. While using an exclusive lock does seem like it makes it a bit safer,
it's not a very good fix, as we still dirty the old buffer, which seems
like a shame when one uses VACUUM FULL.

+		/*
+		 * To be able to guarantee that we can set the hint bit, acquire an
+		 * exclusive lock on the old buffer. We need the hint bits to be set
+		 * as otherwise reform_and_rewrite_tuple() -> rewrite_heap_tuple() ->
+		 * heap_freeze_tuple() will get confused.
+		 *

Hmm, I don't see any comments in the reform_and_rewrite_tuple() ->
rewrite_heap_tuple() -> heap_freeze_tuple() path about requiring hint bits
to be set.

There's this comment in heap_prepare_freeze_tuple():

* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
* (else we should be removing the tuple, not freezing it).

I don't remember the precise details of what all goes wrong, I think there was
more than one avenue.

One example is stuff like this:
/*
* If the tuple has been updated, check the old-to-new mapping hash table.
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
!HeapTupleHeaderIndicatesMovedPartitions(old_tuple->t_data) &&
!(ItemPointerEquals(&(old_tuple->t_self),
&(old_tuple->t_data->t_ctid))))

if the original tuple had an xmax but we didn't unset it in
HeapTupleSatisfiesVacuum(), due to not being allowed to write a hint bit,
we'll behave differently here. But at the same time
heapam_relation_copy_for_cluster() will treat the tuple as dead and call
rewrite_heap_dead_tuple().

How does it get confused?

Corrupted visibility information after a CLUSTER.

0006 heapam: Add batch mode mvcc check and use it in page mode

This is a good bit faster on its own, but is required to avoid a
performance regression later, when setting hint bits only when no IO is
going on at the same time.

The BATCHMVCC_FEWER_ARGS stuff is pretty weird. Hard to imagine that it'd
make any difference how you pass those arguments.

I unfortunately found it to make a difference :(, even after trying to account
for binary layout differences by randomizing section order via linker flags.

+		if (likely(valid))
+		{
+			vistuples_dense[nvis] = tup->t_self.ip_posid;
+			nvis++;
+		}

Is the likely(valid) hint important for performance here? It would be a bad
hint for a table with lots of dead tuples. I feel we'd better rely on the
branch predictor for this one.

It did :(

Greetings,

Andres Freund

#14Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#1)
Re: AIO writes vs hint bits vs checksums

On Tue, 2024-09-24 at 11:55 -0400, Andres Freund wrote:

What I suspect we might want instead is something inbetween a share
and an
exclusive lock, which is taken while setting a hint bit and which
conflicts
with having an IO in progress.

I am starting to wonder if shared content locks are really the right
concept at all. They make sense for simple mutexes, but we are already
more complex than that, and you are suggesting adding on to that
complexity. Which I agree is a good idea; I'm just wondering if we
could go even further.

The README states that a shared lock is necessary for visibility
checking, but can we just be more careful with the ordering and
atomicity of visibility changes in the page?

* carefully order reads and writes of xmin/xmax/hints (would
that create too many read barriers in the tqual.c code?)
* write line pointer after tuple is written

We would still have pins and cleanup locks to prevent data removal.
We'd have the logic you suggest that would prevent modification during
IO. And there would still need to be an exclusive content locks so that
two inserts don't try to allocate the same line pointer, or lock the
same tuple.

If PD_ALL_VISIBLE is set it's even simpler.

Am I missing some major hazards?

Regards,
Jeff Davis

#15Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#14)
Re: AIO writes vs hint bits vs checksums

Hi,

On 2024-10-30 10:47:35 -0700, Jeff Davis wrote:

On Tue, 2024-09-24 at 11:55 -0400, Andres Freund wrote:

What I suspect we might want instead is something inbetween a share
and an
exclusive lock, which is taken while setting a hint bit and which
conflicts
with having an IO in progress.

I am starting to wonder if a shared content locks are really the right
concept at all. It makes sense for simple mutexes, but we are already
more complex than that, and you are suggesting adding on to that
complexity.

What I am proposing isn't making the content lock more complicated; it's
orthogonal to the content lock.

Which I agree is a good idea, I'm just wondering if we could go even
further.

The README states that a shared lock is necessary for visibility
checking, but can we just be more careful with the ordering and
atomicity of visibility changes in the page?

* carefully order reads and writes of xmin/xmax/hints (would
that create too many read barriers in the tqual.c code?)
* write line pointer after tuple is written

It's possible, but it'd be a lot of work. And you'd need to do this not just
for heap, but for all the indexes too, to make progress on the
don't-set-hint-bits-during-IO front. So I don't think it makes sense to tie
these things together.

I do think that it's an argument for not importing all the complexity into
lwlock.c though.

We would still have pins and cleanup locks to prevent data removal.

As-is cleanup locks only work in coordination with content locks. While
cleanup is ongoing we need to prevent anybody from starting to look at the
page - without acquiring something like a shared lock that's not easy.

We'd have the logic you suggest that would prevent modification during
IO. And there would still need to be an exclusive content locks so that
two inserts don't try to allocate the same line pointer, or lock the
same tuple.

If PD_ALL_VISIBLE is set it's even simpler.

Am I missing some major hazards?

I don't think anything fundamental, but it's a decidedly nontrivial amount of
work.

Greetings,

Andres Freund

#16Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#12)
Re: AIO writes vs hint bits vs checksums

Hi,

On 2024-10-30 09:58:30 -0400, Andres Freund wrote:

On 2024-10-30 14:16:51 +0200, Heikki Linnakangas wrote:

Could we put the overhead on the FlushBuffer()
instead?

How about something like this:

To set hint bits:

1. Lock page in SHARED mode.
2. Read BM_IO_IN_PROGRESS
3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
4. Unlock page

To flush a buffer:

1. Lock page in SHARED mode
2. Set BM_IO_IN_PROGRESS
3. Read the lock count on the buffer lock, to see if we're the only locker.
4. If anyone else is holding the lock, upgrade it to exclusive mode, and
immediately downgrade back to share mode.
5. calculate CRC, flush the buffer
6. Clear BM_IO_IN_PROGRESS and unlock page.

This goes back to the idea of adding LWLock support for this, but the amount
of changes could be pretty small. The missing operation we don't have today
is reading the share-lock count on the lock in step 3. That seems simple to
add.

I've played around with a bunch of ideas like this. There are two main
reasons I didn't like them that much in the end:

1) The worst case latency impacts seemed to make them not that
interesting. A buffer that is heavily contended with share locks might not
get down to zero share lockers for quite a while. That's not a problem for
individual FlushBuffer() calls, but it could very well add up to a decent
sized delay for something like a checkpoint that has to flush a lot of
buffers.

Waiting for all pre-existing share lockers is easier said than done. We
don't record the acquisition order anywhere and a share-lock release won't
wake anybody if the lockcount doesn't reach zero. Changing that wouldn't
exactly be free and the cost would be borne by all lwlock users.

2) They are based on holding an lwlock. But it's actually quite conceivable
that we'd want to set something hint-bit-like without any lock, as we
e.g. currently do for freespace/. That would work with something like the
approach I chose here, but not if we rely on lwlocks.

E.g. with a bit of work we could actually do sequential scans without
blocking concurrent buffer modifications by carefully ordering when
pd_lower is modified and making sure that tuple header fields are written
in a sensible order.

I still don't like this idea a whole lot - but perhaps we could reduce the
overhead of my proposal some, to get closer to yours. When setting hint bits
for many tuples on a page the overhead of my approach is negligible, but when
doing it for individual tuples it's a bit less so.

We can reduce the efficiency difference substantially by adding a bufmgr.c API
that sets hint bits on a page. That function can set the hint bits while holding the
buffer header lock, and therefore doesn't need to set BM_SETTING_HINTS and
thus also doesn't need to do resowner.c accounting.
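
A minimal sketch of what such a helper could look like (the name and details
are assumptions, not the attached patch; the WAL logging needed for hints on
checksummed relations is ignored and only shared buffers are considered):

/*
 * Set hint bits at *ptr while holding the buffer header spinlock, so no
 * BM_SETTING_HINTS flag and no resowner accounting is needed. The caller is
 * assumed to hold at least a share lock on the buffer contents.
 */
void
BufferSetHintBits16(Buffer buffer, uint16 *ptr, uint16 bits)
{
	BufferDesc *desc = GetBufferDescriptor(buffer - 1);
	uint32		buf_state;

	buf_state = LockBufHdr(desc);

	/* the page must not be modified while it is being written out */
	if (!(buf_state & BM_IO_IN_PROGRESS))
	{
		*ptr |= bits;
		buf_state |= BM_DIRTY | BM_JUST_DIRTIED;
	}

	UnlockBufHdr(desc, buf_state);
}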

To see the worst case overhead, I
a) disabled the "batch" optimization
b) disabled checksums, as that otherwise would hide small efficiency
differences
c) used an unlogged table

and measured the performance difference for a previously-unhinted sequential
scan of a narrow table that immediately discards all tuples due to OFFSET -
afaict the worst case for the proposed new behaviour.

Previously this was 30.8% slower than master. Now it's only 1.9% slower.

With the batch optimization enabled, the patchset is 7.5% faster.

I also looked at the performance impact on scans that cannot use the batched
approach. The worst case I could think of was a large ordered indexscan of a
previously unhinted table.

For an IOS, the performance difference is a slowdown of 0.65%.

But the difference being so small is partially just due to IOS being
considerably slower than a plain index scan when all tuples need to be fetched
from the table (we need to address that...). Forcing a plain (non-IOS) index scan by disabling
enable_indexonlyscan, I get a slowdown of 5.0%.

Given that this is an intentionally pathological workload - there's no IO /
the workload is CPU bound, yet the data is only ever accessed once, with a
query that doesn't ever actually look at the data - I'm quite happy with that.

As soon as I make it a query that actually uses the data, the difference
vanishes in the noise.

Greetings,

Andres Freund

#17Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#16)
14 attachment(s)
Re: AIO writes vs hint bits vs checksums

Hi,

On 2024-11-01 14:10:54 -0400, Andres Freund wrote:

I still don't like this idea a whole lot - but perhaps we could reduce the
overhead of my proposal some, to get closer to yours. When setting hint bits
for many tuples on a page the overhead of my approach is negligible, but when
doing it for individual tuples it's a bit less so.

We can reduce the efficiency difference substantially by adding a bufmgr.c API
that sets hint bits on a page. That function can set the hint bits while holding the
buffer header lock, and therefore doesn't need to set BM_SETTING_HINTS and
thus also doesn't need to do resowner.c accounting.

To see the worst case overhead, I
a) disabled the "batch" optimization
b) disabled checksums, as that otherwise would hide small efficiency
differences
c) used an unlogged table

and measured the performance difference for a previously-unhinted sequential
scan of a narrow table that immediately discards all tuples due to OFFSET -
afaict the worst case for the proposed new behaviour.

Previously this was 30.8% slower than master. Now it's only 1.9% slower.

With the batch optimization enabled, the patchset is 7.5% faster.

I also looked at the performance impact on scans that cannot use the batched
approach. The worst case I could think of was a large ordered indexscan of a
previously unhinted table.

For an IOS, the performance difference is a slowdown of 0.65%.

But the difference being so small is partially just due to IOS being
considerably slower than a plain index scan when all tuples need to be fetched
from the table (we need to address that...). Forcing a plain (non-IOS) index scan by disabling
enable_indexonlyscan, I get a slowdown of 5.0%.

The attached patchset implements this approach and a few additional
optimizations:

- The hint bit functions now take care of marking the buffer dirty; this saves
one trip through bufmgr.c and eventually could be used to remove one atomic
operation. It also looks cleaner imo.

- I had not removed the page copying from XLogSaveBufferForHint(); changing
that does make the patchset faster even in the index-scan case, if a hint
bit WAL record needs to be emitted.

XLogSaveBufferForHint() should probably use XLogRegisterBuffer() instead of
XLogRegisterBlock(), but we'd need to adjust assertions...

- I added a patch to optimize LockBufHdr() by avoiding init_local_spin_delay()
in the common path. With this applied on *both sides*, the regression
vanishes (but see more below).
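
Roughly, the LockBufHdr() change from the last point amounts to something like
this (a sketch under assumptions; LockBufHdrSlow() is a made-up name for the
contended path that keeps today's perform_spin_delay() loop):

uint32
LockBufHdr(BufferDesc *desc)
{
	uint32		old_buf_state;

	/* fast path: try to set BM_LOCKED without any spin-delay setup */
	old_buf_state = pg_atomic_fetch_or_u32(&desc->state, BM_LOCKED);
	if (likely(!(old_buf_state & BM_LOCKED)))
		return old_buf_state | BM_LOCKED;

	/* contended: fall back to init_local_spin_delay() and the retry loop */
	return LockBufHdrSlow(desc);
}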

I've not fully polished the later patches, I would like to get some agreement
on the approach before doing that.

I spent an unhealthily large amount of time trying to benchmark this, largely
because there's some very odd bimodal performance distribution that I can't
figure out. I benchmarked a whole-table indexscan on an unhinted relation. IFF
testing a WAL-logged relation, there are two different "centers" on a graph of
latencies of individual benchmark runs.

This happens even on master, on a quiesced system, with the database on tmpfs,
turbo mode disabled etc. On two different systems. Unfortunately there are
fairly long stretches of time where I see one duration, which makes it hard to
just tackle this by running the benchmark for long.

I settled for just comparing both the slow and the fast times separately.

Greetings,

Andres Freund

Attachments:

v2-0001-Add-very-basic-test-for-kill_prior_tuples.patch
From ea6c984bc10e19e4ae3c36281c37802943fd132a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 Oct 2024 15:48:50 -0400
Subject: [PATCH v2 01/15] Add very basic test for kill_prior_tuples

Previously our tests did not exercise kill_prior_tuples for hash and gist. The
latter only because it does not work if all the dead tuples are on one
page (or something like that).

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 .../isolation/expected/index-killtuples.out   | 337 ++++++++++++++++++
 .../isolation/specs/index-killtuples.spec     | 127 +++++++
 2 files changed, 464 insertions(+)
 create mode 100644 src/test/isolation/expected/index-killtuples.out
 create mode 100644 src/test/isolation/specs/index-killtuples.spec

diff --git a/src/test/isolation/expected/index-killtuples.out b/src/test/isolation/expected/index-killtuples.out
new file mode 100644
index 00000000000..af46a3c8ead
--- /dev/null
+++ b/src/test/isolation/expected/index-killtuples.out
@@ -0,0 +1,337 @@
+Parsed test spec with 1 sessions
+
+starting permutation: create_table fill_500 create_btree flush disable_seq disable_bitmap measure access flush result measure access flush result delete flush measure access flush result measure access flush result drop_table
+step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
+step fill_500: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 500) g(i);
+step create_btree: CREATE INDEX kill_prior_tuple_btree ON kill_prior_tuple USING btree (key);
+step flush: SELECT FROM pg_stat_force_next_flush();
+step disable_seq: SET enable_seqscan = false;
+step disable_bitmap: SET enable_bitmapscan = false;
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                         
+-----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_btree on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                            
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                         
+-----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_btree on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                            
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step delete: DELETE FROM kill_prior_tuple;
+step flush: SELECT FROM pg_stat_force_next_flush();
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                         
+-----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_btree on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                            
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                         
+-----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_btree on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                            
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                0
+(1 row)
+
+step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
+
+starting permutation: create_table fill_500 create_ext_btree_gist create_gist flush disable_seq disable_bitmap measure access flush result measure access flush result delete flush measure access flush result measure access flush result drop_table drop_ext_btree_gist
+step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
+step fill_500: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 500) g(i);
+step create_ext_btree_gist: CREATE EXTENSION btree_gist;
+step create_gist: CREATE INDEX kill_prior_tuple_gist ON kill_prior_tuple USING gist (key);
+step flush: SELECT FROM pg_stat_force_next_flush();
+step disable_seq: SET enable_seqscan = false;
+step disable_bitmap: SET enable_bitmapscan = false;
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step delete: DELETE FROM kill_prior_tuple;
+step flush: SELECT FROM pg_stat_force_next_flush();
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                0
+(1 row)
+
+step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
+step drop_ext_btree_gist: DROP EXTENSION btree_gist;
+
+starting permutation: create_table fill_10 create_ext_btree_gist create_gist flush disable_seq disable_bitmap measure access flush result measure access flush result delete flush measure access flush result measure access flush result drop_table drop_ext_btree_gist
+step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
+step fill_10: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 10) g(i);
+step create_ext_btree_gist: CREATE EXTENSION btree_gist;
+step create_gist: CREATE INDEX kill_prior_tuple_gist ON kill_prior_tuple USING gist (key);
+step flush: SELECT FROM pg_stat_force_next_flush();
+step disable_seq: SET enable_seqscan = false;
+step disable_bitmap: SET enable_bitmapscan = false;
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step delete: DELETE FROM kill_prior_tuple;
+step flush: SELECT FROM pg_stat_force_next_flush();
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
+step drop_ext_btree_gist: DROP EXTENSION btree_gist;
+
+starting permutation: create_table fill_500 create_hash flush disable_seq disable_bitmap measure access flush result measure access flush result delete flush measure access flush result measure access flush result drop_table
+step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
+step fill_500: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 500) g(i);
+step create_hash: CREATE INDEX kill_prior_tuple_hash ON kill_prior_tuple USING hash (key);
+step flush: SELECT FROM pg_stat_force_next_flush();
+step disable_seq: SET enable_seqscan = false;
+step disable_bitmap: SET enable_bitmapscan = false;
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_hash on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_hash on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step delete: DELETE FROM kill_prior_tuple;
+step flush: SELECT FROM pg_stat_force_next_flush();
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_hash on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_hash on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                0
+(1 row)
+
+step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
+
+starting permutation: create_table fill_500 create_ext_btree_gin create_gin flush disable_seq delete flush measure access flush result measure access flush result drop_table drop_ext_btree_gin
+step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
+step fill_500: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 500) g(i);
+step create_ext_btree_gin: CREATE EXTENSION btree_gin;
+step create_gin: CREATE INDEX kill_prior_tuple_gin ON kill_prior_tuple USING gin (key);
+step flush: SELECT FROM pg_stat_force_next_flush();
+step disable_seq: SET enable_seqscan = false;
+step delete: DELETE FROM kill_prior_tuple;
+step flush: SELECT FROM pg_stat_force_next_flush();
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                             
+-----------------------------------------------------------------------
+Bitmap Heap Scan on kill_prior_tuple (actual rows=0 loops=1)           
+  Recheck Cond: (key = 1)                                              
+  Heap Blocks: exact=1                                                 
+  ->  Bitmap Index Scan on kill_prior_tuple_gin (actual rows=1 loops=1)
+        Index Cond: (key = 1)                                          
+(5 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                             
+-----------------------------------------------------------------------
+Bitmap Heap Scan on kill_prior_tuple (actual rows=0 loops=1)           
+  Recheck Cond: (key = 1)                                              
+  Heap Blocks: exact=1                                                 
+  ->  Bitmap Index Scan on kill_prior_tuple_gin (actual rows=1 loops=1)
+        Index Cond: (key = 1)                                          
+(5 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
+step drop_ext_btree_gin: DROP EXTENSION btree_gin;
diff --git a/src/test/isolation/specs/index-killtuples.spec b/src/test/isolation/specs/index-killtuples.spec
new file mode 100644
index 00000000000..3896c935290
--- /dev/null
+++ b/src/test/isolation/specs/index-killtuples.spec
@@ -0,0 +1,127 @@
+# Basic testing of killtuples / kill_prior_tuples / all_dead handling
+# for various index AMs
+#
+# This tests just enough to ensure that the kill* routines are actually
+# executed and does something approximately reasonble. It's *not* sufficient
+# testing for adding killitems support to a new AM!
+#
+# This doesn't really need to be an isolation test; it could be written as a
+# regular regression test. However, writing it as an isolation test ends up a
+# *lot* less verbose.
+
+setup
+{
+    CREATE TABLE counter(heap_accesses int);
+    INSERT INTO counter(heap_accesses) VALUES (0);
+}
+
+teardown
+{
+    DROP TABLE counter;
+}
+
+session s1
+# to ensure GUCs are reset
+setup { RESET ALL; }
+
+step disable_seq { SET enable_seqscan = false; }
+
+step disable_bitmap { SET enable_bitmapscan = false; }
+
+# use a temporary table to make sure no other session can interfere with
+# visibility determinations
+step create_table { CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null); }
+
+step fill_10 { INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 10) g(i); }
+
+step fill_500 { INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 500) g(i); }
+
+# column-less select to make output easier to read
+step flush { SELECT FROM pg_stat_force_next_flush(); }
+
+step measure { UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple'); }
+
+step result { SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple'; }
+
+step access { EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1; }
+
+step delete { DELETE FROM kill_prior_tuple; }
+
+step drop_table { DROP TABLE IF EXISTS kill_prior_tuple; }
+
+### steps for testing btree indexes ###
+step create_btree { CREATE INDEX kill_prior_tuple_btree ON kill_prior_tuple USING btree (key); }
+
+### steps for testing gist indexes ###
+# Creating the extensions takes time, so we don't want to do so when testing
+# other AMs
+step create_ext_btree_gist { CREATE EXTENSION btree_gist; }
+step drop_ext_btree_gist { DROP EXTENSION btree_gist; }
+step create_gist { CREATE INDEX kill_prior_tuple_gist ON kill_prior_tuple USING gist (key); }
+
+### steps for testing gin indexes ###
+# See create_ext_btree_gist
+step create_ext_btree_gin { CREATE EXTENSION btree_gin; }
+step drop_ext_btree_gin { DROP EXTENSION btree_gin; }
+step create_gin { CREATE INDEX kill_prior_tuple_gin ON kill_prior_tuple USING gin (key); }
+
+### steps for testing hash indexes ###
+step create_hash { CREATE INDEX kill_prior_tuple_hash ON kill_prior_tuple USING hash (key); }
+
+
+# test killtuples with btree index
+permutation
+  create_table fill_500 create_btree flush
+  disable_seq disable_bitmap
+  # show each access to non-deleted tuple increments heap_blks_*
+  measure access flush result
+  measure access flush result
+  delete flush
+  # first access after accessing deleted tuple still needs to access heap
+  measure access flush result
+  # but after kill_prior_tuple did its thing, we shouldn't access heap anymore
+  measure access flush result
+  drop_table
+
+# Same as first permutation, except testing gist
+permutation
+  create_table fill_500 create_ext_btree_gist create_gist flush
+  disable_seq disable_bitmap
+  measure access flush result
+  measure access flush result
+  delete flush
+  measure access flush result
+  measure access flush result
+  drop_table drop_ext_btree_gist
+
+# Test gist, but with fewer rows - shows that killitems doesn't work anymore!
+permutation
+  create_table fill_10 create_ext_btree_gist create_gist flush
+  disable_seq disable_bitmap
+  measure access flush result
+  measure access flush result
+  delete flush
+  measure access flush result
+  measure access flush result
+  drop_table drop_ext_btree_gist
+
+# Same as first permutation, except testing hash
+permutation
+  create_table fill_500 create_hash flush
+  disable_seq disable_bitmap
+  measure access flush result
+  measure access flush result
+  delete flush
+  measure access flush result
+  measure access flush result
+  drop_table
+
+# Similar to first permutation, except that gin does not have killtuples support
+permutation
+  create_table fill_500 create_ext_btree_gin create_gin flush
+  disable_seq
+  delete flush
+  measure access flush result
+  # will still fetch from heap
+  measure access flush result
+  drop_table drop_ext_btree_gin
-- 
2.45.2.746.g06e570c0df.dirty

v2-0002-heapam-Move-logic-to-handle-HEAP_MOVED-into-a-hel.patch (text/x-diff; charset=us-ascii)
From 99feca5829e028ae43c11910abd280283ce96b2a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 23 Sep 2024 12:23:33 -0400
Subject: [PATCH v2 02/15] heapam: Move logic to handle HEAP_MOVED into a
 helper function

Before, we dealt with this in 6 near-identical copies and one very similar one.

The helper function errors out when encountering a
HEAP_MOVED_IN/HEAP_MOVED_OFF tuple with xvac considered current or
in-progress. It'd be preferable to do that change separately, but otherwise
it'd not be possible to deduplicate the handling in
HeapTupleSatisfiesVacuum().

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/heap/heapam_visibility.c | 307 ++++----------------
 1 file changed, 61 insertions(+), 246 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 9243feed01f..057b88767db 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -144,6 +144,55 @@ HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
 	SetHintBits(tuple, buffer, infomask, xid);
 }
 
+/*
+ * If HEAP_MOVED_OFF or HEAP_MOVED_IN are set on the tuple, remove them and
+ * adjust hint bits. See the comment for SetHintBits() for more background.
+ *
+ * This helper returns false if the row ought to be invisible, true otherwise.
+ */
+static inline bool
+HeapTupleCleanMoved(HeapTupleHeader tuple, Buffer buffer)
+{
+	TransactionId xvac;
+
+	/* only used by pre-9.0 binary upgrades */
+	if (likely(!(tuple->t_infomask & (HEAP_MOVED_OFF | HEAP_MOVED_IN))))
+		return true;
+
+	xvac = HeapTupleHeaderGetXvac(tuple);
+
+	if (TransactionIdIsCurrentTransactionId(xvac))
+		elog(ERROR, "encountered tuple with HEAP_MOVED considered current");
+
+	if (TransactionIdIsInProgress(xvac))
+		elog(ERROR, "encountered tuple with HEAP_MOVED considered in-progress");
+
+	if (tuple->t_infomask & HEAP_MOVED_OFF)
+	{
+		if (TransactionIdDidCommit(xvac))
+		{
+			SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
+						InvalidTransactionId);
+			return false;
+		}
+		SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
+					InvalidTransactionId);
+	}
+	else if (tuple->t_infomask & HEAP_MOVED_IN)
+	{
+		if (TransactionIdDidCommit(xvac))
+			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
+						InvalidTransactionId);
+		else
+		{
+			SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
+						InvalidTransactionId);
+			return false;
+		}
+	}
+
+	return true;
+}
 
 /*
  * HeapTupleSatisfiesSelf
@@ -179,45 +228,8 @@ HeapTupleSatisfiesSelf(HeapTuple htup, Snapshot snapshot, Buffer buffer)
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return false;
 
-		/* Used by pre-9.0 binary upgrades */
-		if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return false;
-			if (!TransactionIdIsInProgress(xvac))
-			{
-				if (TransactionIdDidCommit(xvac))
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			}
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (!TransactionIdIsCurrentTransactionId(xvac))
-			{
-				if (TransactionIdIsInProgress(xvac))
-					return false;
-				if (TransactionIdDidCommit(xvac))
-					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-								InvalidTransactionId);
-				else
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-			}
-		}
+		if (!HeapTupleCleanMoved(tuple, buffer))
+			return false;
 		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
 		{
 			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
@@ -372,45 +384,8 @@ HeapTupleSatisfiesToast(HeapTuple htup, Snapshot snapshot,
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return false;
 
-		/* Used by pre-9.0 binary upgrades */
-		if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return false;
-			if (!TransactionIdIsInProgress(xvac))
-			{
-				if (TransactionIdDidCommit(xvac))
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			}
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (!TransactionIdIsCurrentTransactionId(xvac))
-			{
-				if (TransactionIdIsInProgress(xvac))
-					return false;
-				if (TransactionIdDidCommit(xvac))
-					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-								InvalidTransactionId);
-				else
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-			}
-		}
+		if (!HeapTupleCleanMoved(tuple, buffer))
+			return false;
 
 		/*
 		 * An invalid Xmin can be left behind by a speculative insertion that
@@ -468,45 +443,8 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return TM_Invisible;
 
-		/* Used by pre-9.0 binary upgrades */
-		if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return TM_Invisible;
-			if (!TransactionIdIsInProgress(xvac))
-			{
-				if (TransactionIdDidCommit(xvac))
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return TM_Invisible;
-				}
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			}
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (!TransactionIdIsCurrentTransactionId(xvac))
-			{
-				if (TransactionIdIsInProgress(xvac))
-					return TM_Invisible;
-				if (TransactionIdDidCommit(xvac))
-					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-								InvalidTransactionId);
-				else
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return TM_Invisible;
-				}
-			}
-		}
+		else if (!HeapTupleCleanMoved(tuple, buffer))
+			return TM_Invisible;
 		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
 		{
 			if (HeapTupleHeaderGetCmin(tuple) >= curcid)
@@ -756,45 +694,8 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return false;
 
-		/* Used by pre-9.0 binary upgrades */
-		if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return false;
-			if (!TransactionIdIsInProgress(xvac))
-			{
-				if (TransactionIdDidCommit(xvac))
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			}
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (!TransactionIdIsCurrentTransactionId(xvac))
-			{
-				if (TransactionIdIsInProgress(xvac))
-					return false;
-				if (TransactionIdDidCommit(xvac))
-					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-								InvalidTransactionId);
-				else
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-			}
-		}
+		if (!HeapTupleCleanMoved(tuple, buffer))
+			return false;
 		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
 		{
 			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
@@ -970,45 +871,8 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return false;
 
-		/* Used by pre-9.0 binary upgrades */
-		if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return false;
-			if (!XidInMVCCSnapshot(xvac, snapshot))
-			{
-				if (TransactionIdDidCommit(xvac))
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			}
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (!TransactionIdIsCurrentTransactionId(xvac))
-			{
-				if (XidInMVCCSnapshot(xvac, snapshot))
-					return false;
-				if (TransactionIdDidCommit(xvac))
-					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-								InvalidTransactionId);
-				else
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-			}
-		}
+		if (!HeapTupleCleanMoved(tuple, buffer))
+			return false;
 		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
 		{
 			if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)
@@ -1213,57 +1077,8 @@ HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer, TransactionId *de
 	{
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return HEAPTUPLE_DEAD;
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return HEAPTUPLE_DELETE_IN_PROGRESS;
-			if (TransactionIdIsInProgress(xvac))
-				return HEAPTUPLE_DELETE_IN_PROGRESS;
-			if (TransactionIdDidCommit(xvac))
-			{
-				SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-							InvalidTransactionId);
-				return HEAPTUPLE_DEAD;
-			}
-			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-						InvalidTransactionId);
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return HEAPTUPLE_INSERT_IN_PROGRESS;
-			if (TransactionIdIsInProgress(xvac))
-				return HEAPTUPLE_INSERT_IN_PROGRESS;
-			if (TransactionIdDidCommit(xvac))
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			else
-			{
-				SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-							InvalidTransactionId);
-				return HEAPTUPLE_DEAD;
-			}
-		}
-		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
-		{
-			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
-				return HEAPTUPLE_INSERT_IN_PROGRESS;
-			/* only locked? run infomask-only check first, for performance */
-			if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) ||
-				HeapTupleHeaderIsOnlyLocked(tuple))
-				return HEAPTUPLE_INSERT_IN_PROGRESS;
-			/* inserted and then deleted by same xact */
-			if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple)))
-				return HEAPTUPLE_DELETE_IN_PROGRESS;
-			/* deleting subtransaction must have aborted */
-			return HEAPTUPLE_INSERT_IN_PROGRESS;
-		}
+		else if (!HeapTupleCleanMoved(tuple, buffer))
+			return HEAPTUPLE_DEAD;
 		else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuple)))
 		{
 			/*
-- 
2.45.2.746.g06e570c0df.dirty

v2-0003-bufmgr-Add-BufferLockHeldByMe.patch (text/x-diff; charset=us-ascii)
From 368a687b4526b47186ed739f84eb6c93e75dbc87 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 23 Sep 2024 12:51:37 -0400
Subject: [PATCH v2 03/15] bufmgr: Add BufferLockHeldByMe()

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/bufmgr.h        |  1 +
 src/backend/storage/buffer/bufmgr.c | 41 +++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index eb0fba4230b..ded46a57889 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -291,6 +291,7 @@ extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern bool BufferLockHeldByMe(Buffer buffer, int mode);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0f02bf62fa3..88d18e85d64 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5492,6 +5492,47 @@ IsBufferCleanupOK(Buffer buffer)
 	return false;
 }
 
+/*
+ * BufferLockHeldByMe - does the backend have the buffer locked?
+ *
+ * This likely should only be used for asserts etc.
+ *
+ * Note that this can only be called for non-temp buffers - there is no
+ * correct value to return for temporary buffers. One might think that just
+ * returning true for temp buffers would work, but the caller might assert
+ * that a lock is *not* held.
+ */
+bool
+BufferLockHeldByMe(Buffer buffer, int mode)
+{
+	BufferDesc *buf;
+	LWLockMode	lwmode;
+
+	/*
+	 * Can't hold a lock without a pin, there never should be uncertainty
+	 * about having a pin.
+	 */
+	Assert(BufferIsPinned(buffer));
+
+	/* there'd be no correct value to return */
+	Assert(!BufferIsLocal(buffer));
+
+	buf = GetBufferDescriptor(buffer - 1);
+
+	if (mode == BUFFER_LOCK_EXCLUSIVE)
+		lwmode = LW_EXCLUSIVE;
+	else if (mode == BUFFER_LOCK_SHARE)
+		lwmode = LW_SHARED;
+	else
+	{
+		Assert(false);
+		pg_unreachable();
+		lwmode = LW_EXCLUSIVE;  /* assuage compiler */
+	}
+
+	return LWLockHeldByMeInMode(BufferDescriptorGetContentLock(buf), lwmode);
+}
+
 
 /*
  *	Functions for buffer I/O handling
-- 
2.45.2.746.g06e570c0df.dirty

v2-0004-heapam-Use-exclusive-lock-on-old-page-in-CLUSTER.patch (text/x-diff; charset=us-ascii)
From 74fb75c2c20a4881bf3f8341d085ff9f964a0c40 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 11:45:36 -0400
Subject: [PATCH v2 04/15] heapam: Use exclusive lock on old page in CLUSTER

To be able to guarantee that we can set the hint bit, acquire an exclusive
lock on the old buffer. We need the hint bits to be set as otherwise
reform_and_rewrite_tuple() -> rewrite_heap_tuple() -> heap_freeze_tuple() will
get confused.

It'd be better if we somehow could avoid setting hint bits on the old
page. One reason to use VACUUM FULL is very bloated tables - rewriting most
of the old table during VACUUM FULL doesn't exactly help...

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/heap/heapam_handler.c    | 13 ++++++++++++-
 src/backend/access/heap/heapam_visibility.c |  7 +++++++
 src/backend/storage/buffer/bufmgr.c         |  2 +-
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c1..009601445aa 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -833,7 +833,18 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		tuple = ExecFetchSlotHeapTuple(slot, false, NULL);
 		buf = hslot->buffer;
 
-		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		/*
+		 * To be able to guarantee that we can set the hint bit, acquire an
+		 * exclusive lock on the old buffer. We need the hint bits to be set
+		 * as otherwise reform_and_rewrite_tuple() -> rewrite_heap_tuple() ->
+		 * heap_freeze_tuple() will get confused.
+		 *
+		 * It'd be better if we somehow could avoid setting hint bits on the
+		 * old page. One reason to use VACUUM FULL is very bloated tables -
+		 * rewriting most of the old table during VACUUM FULL doesn't
+		 * exactly help...
+		 */
+		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 		switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
 		{
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 057b88767db..b7aa8bb7a52 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -141,6 +141,13 @@ void
 HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
 					 uint16 infomask, TransactionId xid)
 {
+	/*
+	 * The uses from heapam.c rely on being able to perform the hint bit
+	 * updates, which can only be guaranteed if we are holding an exclusive
+	 * lock on the buffer - which all callers are doing.
+	 */
+	Assert(BufferIsLocal(buffer) || BufferLockHeldByMe(buffer, BUFFER_LOCK_EXCLUSIVE));
+
 	SetHintBits(tuple, buffer, infomask, xid);
 }
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 88d18e85d64..f56f7f9f441 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5527,7 +5527,7 @@ BufferLockHeldByMe(Buffer buffer, int mode)
 	{
 		Assert(false);
 		pg_unreachable();
-		lwmode = LW_EXCLUSIVE;  /* assuage compiler */
+		lwmode = LW_EXCLUSIVE;	/* assuage compiler */
 	}
 
 	return LWLockHeldByMeInMode(BufferDescriptorGetContentLock(buf), lwmode);
-- 
2.45.2.746.g06e570c0df.dirty

v2-0005-heapam-Only-set-tuple-s-block-once-per-page-in-pa.patch (text/x-diff; charset=us-ascii)
From acc54c0839baede43d821a8e43e67d6b0ff1e4ed Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 11:50:13 -0400
Subject: [PATCH v2 05/15] heapam: Only set tuple's block once per page in
 pagemode

Due to splitting the block id into two 16 bit integers, BlockIdSet() is more
expensive than one might think. Doing it once per returned tuple shows up as a
small but reliably reproducible cost.  It's simple enough to set the block
number just once per block in pagemode, so do so.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/heap/heapam.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d00300c5dcb..f9086369728 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -981,6 +981,9 @@ heapgettup_pagemode(HeapScanDesc scan,
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
 
+		/* set block once per page, instead of doing so for every tuple */
+		BlockIdSet(&tuple->t_self.ip_blkid, scan->rs_cblock);
+
 		/* lineindex now references the next or previous visible tid */
 continue_page:
 
@@ -995,7 +998,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
+			ItemPointerSetOffsetNumber(&tuple->t_self, lineoff);
 
 			/* skip any tuples that don't match the scan key */
 			if (key != NULL &&
-- 
2.45.2.746.g06e570c0df.dirty
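
For anyone wondering why BlockIdSet() shows up at all: the block number is
stored as two 16 bit halves (keeping ItemPointerData at 6 bytes), so setting it
means a shift and two stores, while the per-tuple work after this patch is just
a single 16 bit store of the offset. Roughly, as a from-memory paraphrase of
block.h/itemptr.h rather than the actual headers:

#include <stdint.h>

/* mirrors the layout of BlockIdData/ItemPointerData, paraphrased */
typedef struct BlockIdData
{
	uint16_t	bi_hi;
	uint16_t	bi_lo;
} BlockIdData;

typedef struct ItemPointerData
{
	BlockIdData ip_blkid;
	uint16_t	ip_posid;
} ItemPointerData;

/* setting the block splits a 32 bit block number into two halves ... */
static inline void
BlockIdSet(BlockIdData *blockId, uint32_t blockNumber)
{
	blockId->bi_hi = blockNumber >> 16;
	blockId->bi_lo = blockNumber & 0xffff;
}

/* ... whereas setting just the offset is a single 16 bit store */
static inline void
ItemPointerSetOffsetNumber(ItemPointerData *tid, uint16_t offNum)
{
	tid->ip_posid = offNum;
}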

v2-0006-bufmgr-Separate-slow-fast-path-of-LockBufHdr.patch (text/x-diff; charset=us-ascii)
From 6c6178d9c19b51711cf6f3ccd7e18809fb955cce Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 19 Nov 2024 12:08:38 -0500
Subject: [PATCH v2 06/15] bufmgr: Separate slow/fast path of LockBufHdr

This is mainly beneficial for some upcoming patches that use LockBufHdr() in
reasonably common paths. But I do see a very small improvement when scanning a
large table fully resident in the kernel page cache.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/buf_internals.h | 33 ++++++++++++++++++++++++++++-
 src/backend/storage/buffer/bufmgr.c |  8 ++++---
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index eda6c699212..e2ac85c8d97 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -355,7 +355,38 @@ BufferDescriptorGetContentLock(const BufferDesc *bdesc)
  * Functions for acquiring/releasing a shared buffer header's spinlock.  Do
  * not apply these to local buffers!
  */
-extern uint32 LockBufHdr(BufferDesc *desc);
+extern uint32 LockBufHdrSlow(BufferDesc *desc);
+
+/*
+ * Lock buffer header - set BM_LOCKED in buffer state.
+ *
+ * Implemented as a static inline in the header due to freelist.c using it in
+ * a reasonably common path.
+ */
+static inline uint32
+LockBufHdr(BufferDesc *desc)
+{
+	uint32		old_buf_state;
+
+	Assert(!BufferIsLocal(BufferDescriptorGetBuffer(desc)));
+
+	/* fast path, intended to be inlined into calling functions */
+
+	/* set BM_LOCKED flag */
+	old_buf_state = pg_atomic_fetch_or_u32(&desc->state, BM_LOCKED);
+	if (likely(!(old_buf_state & BM_LOCKED)))
+		return old_buf_state | BM_LOCKED;
+
+	/*
+	 * The overhead of the spin delay code and the retry loop itself would be
+	 * noticeable due to the buffer header lock being taken very
+	 * frequently. Therefore it is moved into a separate function marked
+	 * pg_noinline.
+	 */
+
+	/* slow path with loop etc, not intended to be inlined */
+	return LockBufHdrSlow(desc);
+}
 
 static inline void
 UnlockBufHdr(BufferDesc *desc, uint32 buf_state)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f56f7f9f441..6ad686d9820 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5796,10 +5796,12 @@ rlocator_comparator(const void *p1, const void *p2)
 }
 
 /*
- * Lock buffer header - set BM_LOCKED in buffer state.
+ * Slow path helper for LockBufHdr().
+ *
+ * See comment in LockBufHdr() for the reason this is moved out of line.
  */
-uint32
-LockBufHdr(BufferDesc *desc)
+pg_noinline uint32
+LockBufHdrSlow(BufferDesc *desc)
 {
 	SpinDelayStatus delayStatus;
 	uint32		old_buf_state;
-- 
2.45.2.746.g06e570c0df.dirty
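
As an aside, the split is a generally useful shape: the uncontended case is a
single atomic fetch-or that's cheap enough to inline everywhere, while the
spin/retry loop stays out of line so it doesn't bloat every call site. A
minimal standalone sketch of the idea using C11 atomics instead of the
pg_atomic API (the names here are made up for illustration, this is not the
patch's code):

#include <stdatomic.h>
#include <sched.h>

#define STATE_LOCKED (1u << 31)

/* slow path: retry loop, deliberately kept out of line */
static __attribute__((noinline)) unsigned
lock_state_slow(_Atomic unsigned *state)
{
	for (;;)
	{
		unsigned	old = atomic_fetch_or(state, STATE_LOCKED);

		if (!(old & STATE_LOCKED))
			return old | STATE_LOCKED;	/* acquired */
		sched_yield();			/* stand-in for a real spin-delay */
	}
}

/* fast path: one atomic op, intended to be inlined into callers */
static inline unsigned
lock_state(_Atomic unsigned *state)
{
	unsigned	old = atomic_fetch_or(state, STATE_LOCKED);

	if (!(old & STATE_LOCKED))
		return old | STATE_LOCKED;
	return lock_state_slow(state);
}

/* unlock: write back the new state with the lock bit cleared */
static inline void
unlock_state(_Atomic unsigned *state, unsigned new_state)
{
	atomic_store(state, new_state & ~STATE_LOCKED);
}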

v2-0007-heapam-Add-batch-mode-mvcc-check-and-use-it-in-pa.patch (text/x-diff; charset=us-ascii)
From 55f7dfa4b6ac10ef9f73264e781164df39e33c40 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 13:16:36 -0400
Subject: [PATCH v2 07/15] heapam: Add batch mode mvcc check and use it in page
 mode

There are two reasons for doing so:

1) It is generally faster to perform checks in a batched fashion and making
   sequential scans faster is nice.

2) We would like to stop setting hint bits while pages are being written
   out. The overhead of the necessary locking becomes noticeable for page mode
   scans if done for every tuple. With batching it can be amortized to only
   happen once per page.

There are substantial further optimization opportunities along these
lines:

- Right now HeapTupleSatisfiesMVCCBatch() simply uses the single-tuple
  HeapTupleSatisfiesMVCC(), relying on the compiler to inline it. We could
  instead write an explicitly optimized version that avoids repeated xid
  tests.

- Introduce batched version of the serializability test

- Introduce batched version of HeapTupleSatisfiesVacuum

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/access/heapam.h                 |  28 ++++++
 src/backend/access/heap/heapam.c            | 104 ++++++++++++++++----
 src/backend/access/heap/heapam_visibility.c |  50 ++++++++++
 src/tools/pgindent/typedefs.list            |   1 +
 4 files changed, 163 insertions(+), 20 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 96cf82f97b7..d4a790251cc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -417,6 +417,34 @@ extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
 extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
 
+/*
+ * FIXME: define to be removed
+ *
+ * Without this I see worse performance. But it's a bit ugly, so I thought
+ * it'd be useful to leave a way in for others to experiment with this.
+ */
+#define BATCHMVCC_FEWER_ARGS
+
+#ifdef BATCHMVCC_FEWER_ARGS
+typedef struct BatchMVCCState
+{
+	HeapTupleData tuples[MaxHeapTuplesPerPage];
+	bool		visible[MaxHeapTuplesPerPage];
+} BatchMVCCState;
+#endif
+
+extern int	HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
+										int ntups,
+#ifdef BATCHMVCC_FEWER_ARGS
+										BatchMVCCState *batchmvcc,
+#else
+										HeapTupleData *tuples,
+										bool *visible,
+#endif
+										OffsetNumber *vistuples_dense);
+
+
+
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
  * details this is implemented in reorderbuffer.c not heapam_visibility.c
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f9086369728..1781e2c0571 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -437,42 +437,106 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
 					BlockNumber block, int lines,
 					bool all_visible, bool check_serializable)
 {
+	Oid			relid = RelationGetRelid(scan->rs_base.rs_rd);
+#ifdef BATCHMVCC_FEWER_ARGS
+	BatchMVCCState batchmvcc;
+#else
+	HeapTupleData tuples[MaxHeapTuplesPerPage];
+	bool		visible[MaxHeapTuplesPerPage];
+#endif
 	int			ntup = 0;
-	OffsetNumber lineoff;
+	int			nvis = 0;
 
-	for (lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
+	/* page at a time should have been disabled otherwise */
+	Assert(IsMVCCSnapshot(snapshot));
+
+	/* first find all tuples on the page */
+	for (OffsetNumber lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
 	{
 		ItemId		lpp = PageGetItemId(page, lineoff);
-		HeapTupleData loctup;
-		bool		valid;
+		HeapTuple	tup;
 
-		if (!ItemIdIsNormal(lpp))
+		if (unlikely(!ItemIdIsNormal(lpp)))
 			continue;
 
-		loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
-		loctup.t_len = ItemIdGetLength(lpp);
-		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
-		ItemPointerSet(&(loctup.t_self), block, lineoff);
+		/*
+		 * If the page is not all-visible or we need to check serializability,
+		 * maintain enough state to be able to refind the tuple efficiently,
+		 * without again needing to extract it from the page.
+		 */
+		if (!all_visible || check_serializable)
+		{
+#ifdef BATCHMVCC_FEWER_ARGS
+			tup = &batchmvcc.tuples[ntup];
+#else
+			tup = &tuples[ntup];
+#endif
 
+			tup->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+			tup->t_len = ItemIdGetLength(lpp);
+			tup->t_tableOid = relid;
+			ItemPointerSet(&(tup->t_self), block, lineoff);
+		}
+
+		/*
+		 * If the page is all visible, these fields won't otherwise be
+		 * populated in the loop below.
+		 */
 		if (all_visible)
-			valid = true;
-		else
-			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
-
-		if (check_serializable)
-			HeapCheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
-												&loctup, buffer, snapshot);
-
-		if (valid)
 		{
+			if (check_serializable)
+			{
+#ifdef BATCHMVCC_FEWER_ARGS
+				batchmvcc.visible[ntup] = true;
+#else
+				visible[ntup] = true;
+#endif
+			}
 			scan->rs_vistuples[ntup] = lineoff;
-			ntup++;
 		}
+
+		ntup++;
 	}
 
 	Assert(ntup <= MaxHeapTuplesPerPage);
 
-	return ntup;
+	/* unless the page is all visible, test visibility for all tuples in one go */
+	if (all_visible)
+		nvis = ntup;
+	else
+		nvis = HeapTupleSatisfiesMVCCBatch(snapshot, buffer,
+										   ntup,
+#ifdef BATCHMVCC_FEWER_ARGS
+										   &batchmvcc,
+#else
+										   tuples, visible,
+#endif
+										   scan->rs_vistuples
+			);
+
+	/*
+	 * So far we don't have a batch API for testing serializability, so do so
+	 * one-by-one.
+	 */
+	if (check_serializable)
+	{
+		for (int i = 0; i < ntup; i++)
+		{
+#ifdef BATCHMVCC_FEWER_ARGS
+			HeapCheckForSerializableConflictOut(batchmvcc.visible[i],
+												scan->rs_base.rs_rd,
+												&batchmvcc.tuples[i],
+												buffer, snapshot);
+#else
+			HeapCheckForSerializableConflictOut(visible[i],
+												scan->rs_base.rs_rd,
+												&tuples[i],
+												buffer, snapshot);
+#endif
+		}
+	}
+
+	return nvis;
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index b7aa8bb7a52..4baafc01a2c 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1575,6 +1575,56 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 		return true;
 }
 
+/*
+ * Perform HeapTupleSatisfiesMVCC() on each passed-in tuple. This is more
+ * efficient than doing HeapTupleSatisfiesMVCC() one-by-one.
+ *
+ * The tuples to be checked are passed via BatchMVCCState->tuples. Each tuple's
+ * visibility is set in batchmvcc->visible[]. In addition, ->vistuples_dense
+ * is set to contain the offsets of visible tuples.
+ *
+ * Returns the number of visible tuples.
+ */
+int
+HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
+							int ntups,
+#ifdef BATCHMVCC_FEWER_ARGS
+							BatchMVCCState *batchmvcc,
+#else
+							HeapTupleData *tuples,
+							bool *visible,
+#endif
+							OffsetNumber *vistuples_dense)
+{
+	int			nvis = 0;
+
+	Assert(IsMVCCSnapshot(snapshot));
+
+	for (int i = 0; i < ntups; i++)
+	{
+		bool		valid;
+#ifdef BATCHMVCC_FEWER_ARGS
+		HeapTuple	tup = &batchmvcc->tuples[i];
+#else
+		HeapTuple	tup = &tuples[i];
+#endif
+
+		valid = HeapTupleSatisfiesMVCC(tup, snapshot, buffer);
+#ifdef BATCHMVCC_FEWER_ARGS
+		batchmvcc->visible[i] = valid;
+#else
+		visible[i] = valid;
+#endif
+		if (likely(valid))
+		{
+			vistuples_dense[nvis] = tup->t_self.ip_posid;
+			nvis++;
+		}
+	}
+
+	return nvis;
+}
+
 /*
  * HeapTupleSatisfiesVisibility
  *		True iff heap tuple satisfies a time qual.
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 08521d51a9b..2d278960078 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -244,6 +244,7 @@ Barrier
 BaseBackupCmd
 BaseBackupTargetHandle
 BaseBackupTargetType
+BatchMVCCState
 BeginDirectModify_function
 BeginForeignInsert_function
 BeginForeignModify_function
-- 
2.45.2.746.g06e570c0df.dirty
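
The shape of the batch check - first collect every candidate tuple from the
page, then run the visibility predicate over the whole array and emit a dense
list of the offsets that pass - is what later lets per-check setup (such as
acquiring the right to set hint bits) be amortized to once per page. Stripped
of the heapam specifics, the pattern is roughly this (types and names are
placeholders, not the real ones):

#include <stdbool.h>

typedef struct Item
{
	int			id;
	bool		deleted;
} Item;

static bool
item_is_visible(const Item *item)
{
	return !item->deleted;
}

/*
 * Check all collected items in one go. Fills visible[] per item, stores the
 * indexes of passing items densely into dense[], and returns their count.
 */
static int
check_batch(const Item *items, int nitems, bool *visible, int *dense)
{
	int			nvis = 0;

	/* per-batch setup (e.g. acquiring the right to set hint bits) goes here */

	for (int i = 0; i < nitems; i++)
	{
		visible[i] = item_is_visible(&items[i]);
		if (visible[i])
			dense[nvis++] = i;
	}

	/* per-batch teardown goes here */

	return nvis;
}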

v2-0008-bufmgr-Make-it-easier-to-change-number-of-buffer-.patch (text/x-diff; charset=us-ascii)
From d497dcc6b0ffde11038cb89ca4b718e601612e79 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 Oct 2024 18:01:14 -0400
Subject: [PATCH v2 08/15] bufmgr: Make it easier to change number of buffer
 state bits

In an upcoming commit I'd like to change the number of bits for the usage
count (the current max is 5, fitting in three bits, but we reserve four
bits). Until now that required adjusting a bunch of magic constants; now the
constants are defined based on the number of bits reserved.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/buf_internals.h | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index e2ac85c8d97..9f9e6bbf4e9 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -39,12 +39,19 @@
  *
  * The definition of buffer state components is below.
  */
+#define BUF_REFCOUNT_BITS 18
+#define BUF_USAGECOUNT_BITS 4
+#define BUF_FLAG_BITS 10
+
+StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
+				 "buffer bit counts need to equal 32");
+
 #define BUF_REFCOUNT_ONE 1
-#define BUF_REFCOUNT_MASK ((1U << 18) - 1)
-#define BUF_USAGECOUNT_MASK 0x003C0000U
-#define BUF_USAGECOUNT_ONE (1U << 18)
-#define BUF_USAGECOUNT_SHIFT 18
-#define BUF_FLAG_MASK 0xFFC00000U
+#define BUF_REFCOUNT_MASK ((1U << BUF_REFCOUNT_BITS) - 1)
+#define BUF_USAGECOUNT_MASK (((1U << BUF_USAGECOUNT_BITS) - 1) << (BUF_REFCOUNT_BITS))
+#define BUF_USAGECOUNT_ONE (1U << BUF_REFCOUNT_BITS)
+#define BUF_USAGECOUNT_SHIFT BUF_REFCOUNT_BITS
+#define BUF_FLAG_MASK (((1U << BUF_FLAG_BITS) - 1) << (BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS))
 
 /* Get refcount and usagecount from buffer state */
 #define BUF_STATE_GET_REFCOUNT(state) ((state) & BUF_REFCOUNT_MASK)
@@ -77,6 +84,9 @@
  */
 #define BM_MAX_USAGE_COUNT	5
 
+StaticAssertDecl(BM_MAX_USAGE_COUNT < ((1 << BUF_USAGECOUNT_BITS) - 1),
+				 "BM_MAX_USAGE_COUNT doesn't fit in BUF_USAGECOUNT_BITS bits");
+
 /*
  * Buffer tag identifies which disk block the buffer contains.
  *
-- 
2.45.2.746.g06e570c0df.dirty
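
For reviewers wanting to convince themselves that the derived constants match
the old literals: with 18 refcount bits, 4 usage-count bits and 10 flag bits
the three masks partition the 32 bit state word and come out to the previous
hard-coded values. A throwaway check program, not part of the patch:

#include <assert.h>
#include <stdio.h>

#define BUF_REFCOUNT_BITS 18
#define BUF_USAGECOUNT_BITS 4
#define BUF_FLAG_BITS 10

#define BUF_REFCOUNT_MASK ((1U << BUF_REFCOUNT_BITS) - 1)
#define BUF_USAGECOUNT_MASK (((1U << BUF_USAGECOUNT_BITS) - 1) << BUF_REFCOUNT_BITS)
#define BUF_FLAG_MASK (((1U << BUF_FLAG_BITS) - 1) << (BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS))

int
main(void)
{
	/* the three masks must partition the 32 bit state word */
	assert((BUF_REFCOUNT_MASK | BUF_USAGECOUNT_MASK | BUF_FLAG_MASK) == 0xFFFFFFFFU);
	assert((BUF_REFCOUNT_MASK & BUF_USAGECOUNT_MASK) == 0);
	assert((BUF_USAGECOUNT_MASK & BUF_FLAG_MASK) == 0);

	/* and match the old hard-coded literals */
	assert(BUF_REFCOUNT_MASK == ((1U << 18) - 1));
	assert(BUF_USAGECOUNT_MASK == 0x003C0000U);
	assert(BUF_FLAG_MASK == 0xFFC00000U);

	printf("refcount=%08X usage=%08X flags=%08X\n",
		   BUF_REFCOUNT_MASK, BUF_USAGECOUNT_MASK, BUF_FLAG_MASK);
	return 0;
}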

v2-0009-bufmgr-Add-interface-to-acquire-right-to-set-hint.patch (text/x-diff; charset=us-ascii)
From e86e84dfbf1c735331c426e4775ae7ae9f8c7dc3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 13:59:13 -0400
Subject: [PATCH v2 09/15] bufmgr: Add interface to acquire right to set hint
 bits

At the moment hint bits can be set with just a share lock on a page (and in
one place even without any lock). Because of this we need to copy pages while
writing them out, as otherwise the checksum could be corrupted.

The need to copy the page is problematic for the AIO patchset:

1) Instead of just needing a single buffer for a copied page we need one for
   each page that's potentially undergoing IO
2) To be able to use the "worker" AIO implementation the copied page needs to
   reside in shared memory.

Even without AIO copying the page isn't free...

This commit starts to address that by adding BufferPrepareToSetHintBits(),
which needs to be called before setting hint bits on a
buffer. BufferPrepareToSetHintBits() only allows hint bit writes if there's no
ongoing IO, and while hint bits are being set no IO is allowed to start.

To know that a buffer is undergoing IO a new BufferDesc state flag is used,
BM_SETTING_HINTS. Theoretically it'd be possible to reuse BM_IO_IN_PROGRESS,
but that'd make it harder to debug the system.

The new interface is not yet used; that will happen in subsequent commits, to
make review a bit easier. Therefore we cannot yet rely on not needing a copy
of the buffer during IO.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/buf_internals.h           |  22 +-
 src/include/storage/bufmgr.h                  |   6 +
 src/backend/storage/buffer/README             |   6 +-
 src/backend/storage/buffer/bufmgr.c           | 336 ++++++++++++++++--
 .../utils/activity/wait_event_names.txt       |   1 +
 5 files changed, 337 insertions(+), 34 deletions(-)

diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 9f9e6bbf4e9..a8dbcec666c 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -31,8 +31,8 @@
  * Buffer state is a single 32-bit variable where following data is combined.
  *
  * - 18 bits refcount
- * - 4 bits usage count
- * - 10 bits of flags
+ * - 3 bits usage count
+ * - 11 bits of flags
  *
  * Combining these values allows to perform some operations without locking
  * the buffer header, by modifying them together with a CAS loop.
@@ -40,8 +40,8 @@
  * The definition of buffer state components is below.
  */
 #define BUF_REFCOUNT_BITS 18
-#define BUF_USAGECOUNT_BITS 4
-#define BUF_FLAG_BITS 10
+#define BUF_USAGECOUNT_BITS 3
+#define BUF_FLAG_BITS 11
 
 StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
 				 "buffer bit counts need to equal 32");
@@ -63,6 +63,7 @@ StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
  * Note: BM_TAG_VALID essentially means that there is a buffer hashtable
  * entry associated with the buffer's tag.
  */
+#define BM_SETTING_HINTS		(1U << 21)	/* hint bits are being set */
 #define BM_LOCKED				(1U << 22)	/* buffer header is locked */
 #define BM_DIRTY				(1U << 23)	/* data needs writing */
 #define BM_VALID				(1U << 24)	/* data is valid */
@@ -427,6 +428,7 @@ extern PGDLLIMPORT CkptSortItem *CkptBufferIds;
 /* ResourceOwner callbacks to hold buffer I/Os and pins */
 extern PGDLLIMPORT const ResourceOwnerDesc buffer_io_resowner_desc;
 extern PGDLLIMPORT const ResourceOwnerDesc buffer_pin_resowner_desc;
+extern PGDLLIMPORT const ResourceOwnerDesc buffer_setting_hints_resowner_desc;
 
 /* Convenience wrappers over ResourceOwnerRemember/Forget */
 static inline void
@@ -449,6 +451,18 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 {
 	ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
 }
+static inline void
+ResourceOwnerRememberBufferSettingHints(ResourceOwner owner, Buffer buffer)
+{
+	ResourceOwnerRemember(owner, Int32GetDatum(buffer),
+						  &buffer_setting_hints_resowner_desc);
+}
+static inline void
+ResourceOwnerForgetBufferSettingHints(ResourceOwner owner, Buffer buffer)
+{
+	ResourceOwnerForget(owner, Int32GetDatum(buffer),
+						&buffer_setting_hints_resowner_desc);
+}
 
 /*
  * Internal buffer management routines
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ded46a57889..446616da102 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -300,6 +300,12 @@ extern void LimitAdditionalLocalPins(uint32 *additional_pins);
 
 extern bool EvictUnpinnedBuffer(Buffer buf);
 
+extern bool BufferPrepareToSetHintBits(Buffer buffer);
+extern void BufferFinishSetHintBits(Buffer buffer, bool mark_dirty, bool buffer_std);
+extern void BufferSetHintBits16(Buffer buffer,
+								uint16 *ptr, uint16 val);
+
+
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
 extern Size BufferManagerShmemSize(void);
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index 011af7aff3e..756ad542910 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -58,7 +58,8 @@ tuple while they are doing visibility checks.
 4. It is considered OK to update tuple commit status bits (ie, OR the
 values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
 HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
-pin on a buffer.  This is OK because another backend looking at the tuple
+pin on a buffer after calling BufferPrepareToSetHintBits().
+This is OK because another backend looking at the tuple
 at about the same time would OR the same bits into the field, so there
 is little or no risk of conflicting update; what's more, if there did
 manage to be a conflict it would merely mean that one bit-update would
@@ -68,6 +69,9 @@ great harm is done if they get reset to zero by conflicting updates.
 Note, however, that a tuple is frozen by setting both HEAP_XMIN_INVALID
 and HEAP_XMIN_COMMITTED; this is a critical update and accordingly requires
 an exclusive buffer lock (and it must also be WAL-logged).
+Calling BufferPrepareToSetHintBits() ensures that write IO cannot happen at
+the same time as hint bits are being modified, which could otherwise cause
+the checksum computed at the start of the write to no longer match.
 
 5. To physically remove a tuple or compact free space on a page, one
 must hold a pin and an exclusive lock, *and* observe while holding the
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6ad686d9820..dd94b9c16b0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -222,6 +222,8 @@ static void ResOwnerReleaseBufferIO(Datum res);
 static char *ResOwnerPrintBufferIO(Datum res);
 static void ResOwnerReleaseBufferPin(Datum res);
 static char *ResOwnerPrintBufferPin(Datum res);
+static void ResOwnerReleaseBufferSettingHints(Datum res);
+static char *ResOwnerPrintBufferSettingHints(Datum res);
 
 const ResourceOwnerDesc buffer_io_resowner_desc =
 {
@@ -241,6 +243,15 @@ const ResourceOwnerDesc buffer_pin_resowner_desc =
 	.DebugPrint = ResOwnerPrintBufferPin
 };
 
+const ResourceOwnerDesc buffer_setting_hints_resowner_desc =
+{
+	.name = "buffer setting hints",
+	.release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
+	.release_priority = RELEASE_PRIO_BUFFER_IOS,
+	.ReleaseResource = ResOwnerReleaseBufferSettingHints,
+	.DebugPrint = ResOwnerPrintBufferSettingHints
+};
+
 /*
  * Ensure that the PrivateRefCountArray has sufficient space to store one more
  * entry. This has to be called before using NewPrivateRefCountEntry() to fill
@@ -1737,7 +1748,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	/* some sanity checks while we hold the buffer header lock */
 	Assert(BUF_STATE_GET_REFCOUNT(victim_buf_state) == 1);
-	Assert(!(victim_buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY | BM_IO_IN_PROGRESS)));
+	Assert(!(victim_buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY
+								 | BM_IO_IN_PROGRESS | BM_SETTING_HINTS)));
 
 	victim_buf_hdr->tag = newTag;
 
@@ -2092,7 +2104,8 @@ again:
 	buf_state = pg_atomic_read_u32(&buf_hdr->state);
 
 	Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 1);
-	Assert(!(buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY)));
+	Assert(!(buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY
+						  | BM_IO_IN_PROGRESS | BM_SETTING_HINTS)));
 
 	CheckBufferIsPinnedOnce(buf);
 #endif
@@ -4971,34 +4984,16 @@ IncrBufferRefCount(Buffer buffer)
 }
 
 /*
- * MarkBufferDirtyHint
- *
- *	Mark a buffer dirty for non-critical changes.
- *
- * This is essentially the same as MarkBufferDirty, except:
- *
- * 1. The caller does not write WAL; so if checksums are enabled, we may need
- *	  to write an XLOG_FPI_FOR_HINT WAL record to protect against torn pages.
- * 2. The caller might have only share-lock instead of exclusive-lock on the
- *	  buffer's content lock.
- * 3. This function does not guarantee that the buffer is always marked dirty
- *	  (due to a race condition), so it cannot be used for important changes.
+ * Implementation of MarkBufferDirtyHint(), separate because
+ * BufferSetHintBits16() needs to be able to set hint bits without setting
+ * BM_SETTING_HINTS.
  */
-void
-MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
+static inline void
+MarkBufferDirtyHintImpl(Buffer buffer, bool buffer_std, bool single_hint_write)
 {
 	BufferDesc *bufHdr;
 	Page		page = BufferGetPage(buffer);
 
-	if (!BufferIsValid(buffer))
-		elog(ERROR, "bad buffer ID: %d", buffer);
-
-	if (BufferIsLocal(buffer))
-	{
-		MarkLocalBufferDirty(buffer);
-		return;
-	}
-
 	bufHdr = GetBufferDescriptor(buffer - 1);
 
 	Assert(GetPrivateRefCount(buffer) > 0);
@@ -5117,6 +5112,35 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 	}
 }
 
+/*
+ * MarkBufferDirtyHint
+ *
+ *	Mark a buffer dirty for non-critical changes.
+ *
+ * This is essentially the same as MarkBufferDirty, except:
+ *
+ * 1. The caller does not write WAL; so if checksums are enabled, we may need
+ *	  to write an XLOG_FPI_FOR_HINT WAL record to protect against torn pages.
+ * 2. The caller might have only share-lock instead of exclusive-lock on the
+ *	  buffer's content lock.
+ * 3. This function does not guarantee that the buffer is always marked dirty
+ *	  (due to a race condition), so it cannot be used for important changes.
+ */
+void
+MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
+{
+	if (!BufferIsValid(buffer))
+		elog(ERROR, "bad buffer ID: %d", buffer);
+
+	if (BufferIsLocal(buffer))
+	{
+		MarkLocalBufferDirty(buffer);
+		return;
+	}
+
+	MarkBufferDirtyHintImpl(buffer, buffer_std, false);
+}
+
 /*
  * Release buffer content locks for shared buffers.
  *
@@ -5544,7 +5568,8 @@ BufferLockHeldByMe(Buffer buffer, int mode)
  */
 
 /*
- * WaitIO -- Block until the IO_IN_PROGRESS flag on 'buf' is cleared.
+ * WaitIO -- Block until the IO_IN_PROGRESS and SETTING_HINTS flags on 'buf'
+ * are cleared.
  */
 static void
 WaitIO(BufferDesc *buf)
@@ -5555,6 +5580,7 @@ WaitIO(BufferDesc *buf)
 	for (;;)
 	{
 		uint32		buf_state;
+		uint32		wait_event;
 
 		/*
 		 * It may not be necessary to acquire the spinlock to check the flag
@@ -5564,9 +5590,21 @@ WaitIO(BufferDesc *buf)
 		buf_state = LockBufHdr(buf);
 		UnlockBufHdr(buf, buf_state);
 
-		if (!(buf_state & BM_IO_IN_PROGRESS))
+		if (likely((buf_state & (BM_IO_IN_PROGRESS | BM_SETTING_HINTS)) == 0))
 			break;
-		ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_IO);
+		else if (buf_state & BM_IO_IN_PROGRESS)
+		{
+			Assert(!(buf_state & BM_SETTING_HINTS));
+			wait_event = WAIT_EVENT_BUFFER_IO;
+		}
+		else
+		{
+			Assert(buf_state & BM_SETTING_HINTS);
+			Assert(!(buf_state & BM_IO_IN_PROGRESS));
+			wait_event = WAIT_EVENT_BUFFER_SETTING_HINTS;
+		}
+
+		ConditionVariableSleep(cv, wait_event);
 	}
 	ConditionVariableCancelSleep();
 }
@@ -5606,7 +5644,7 @@ StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 	{
 		buf_state = LockBufHdr(buf);
 
-		if (!(buf_state & BM_IO_IN_PROGRESS))
+		if (!(buf_state & (BM_IO_IN_PROGRESS | BM_SETTING_HINTS)))
 			break;
 		UnlockBufHdr(buf, buf_state);
 		if (nowait)
@@ -5623,6 +5661,8 @@ StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 		return false;
 	}
 
+	Assert(!(buf_state & BM_SETTING_HINTS));
+
 	buf_state |= BM_IO_IN_PROGRESS;
 	UnlockBufHdr(buf, buf_state);
 
@@ -5661,6 +5701,7 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
 	buf_state = LockBufHdr(buf);
 
 	Assert(buf_state & BM_IO_IN_PROGRESS);
+	Assert(!(buf_state & BM_SETTING_HINTS));
 
 	buf_state &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
 	if (clear_dirty && !(buf_state & BM_JUST_DIRTIED))
@@ -5697,6 +5738,7 @@ AbortBufferIO(Buffer buffer)
 
 	buf_state = LockBufHdr(buf_hdr);
 	Assert(buf_state & (BM_IO_IN_PROGRESS | BM_TAG_VALID));
+	Assert(!(buf_state & BM_SETTING_HINTS));
 
 	if (!(buf_state & BM_VALID))
 	{
@@ -6124,6 +6166,28 @@ ResOwnerPrintBufferPin(Datum res)
 	return DebugPrintBufferRefcount(DatumGetInt32(res));
 }
 
+static void
+ResOwnerReleaseBufferSettingHints(Datum res)
+{
+	Buffer		buffer = DatumGetInt32(res);
+
+	/*
+	 * At this point we don't know if the buffer was a standard buffer or
+	 * whether the caller wanted to dirty the buffer. But we don't need to
+	 * dirty the buffer, which also means we don't need to know if the buffer
+	 * was a standard buffer or not.
+	 */
+	BufferFinishSetHintBits(buffer, false, false);
+}
+
+static char *
+ResOwnerPrintBufferSettingHints(Datum res)
+{
+	Buffer		buffer = DatumGetInt32(res);
+
+	return psprintf("lost track of setting hint bits on buffer %d", buffer);
+}
+
 /*
  * Try to evict the current block in a shared buffer.
  *
@@ -6186,3 +6250,217 @@ EvictUnpinnedBuffer(Buffer buf)
 
 	return result;
 }
+
+/*
+ * Try to acquire the right to set hint bits for buf.
+ *
+ * It's only permitted to set hint bits in a buffer if the buffer is not
+ * undergoing IO. Otherwise the page level checksum could be corrupted. This
+ * could happen both for PG's checksum and on the OS/filesystem
+ * level. E.g. btrfs with direct-io relies on the page not changing while IO
+ * is in progress.
+ *
+ * We can't just check whether IO is going on when BufferPrepareToSetHintBits()
+ * is called; we also need to block IO from starting before we're done setting
+ * hints. This is achieved by setting BM_SETTING_HINTS for the buffer and
+ * having StartBufferIO()/WaitIO() wait for that.  We could combine
+ * BM_SETTING_HINTS and BM_IO_IN_PROGRESS into one, as they can never be set
+ * at the same time, but that seems unnecessarily confusing.
+ *
+ * Because we use only a single bit (BM_SETTING_HINTS) to track whether hint
+ * bits are currently being set, we cannot allow multiple backends to set
+ * hints at the same time - it'd be unknown whether BM_SETTING_HINTS would
+ * need to remain set when a backend finishes setting hint bits.  In almost
+ * all situations the two backends would just set the same hint bits anyway,
+ * so this is unlikely to be a problem.
+ *
+ * If allowed to set hint bits, the caller needs to call
+ * BufferFinishSetHintBits() once done setting hint bits. In case of an error
+ * occurring before BufferFinishSetHintBits() is reached, the cleanup will be
+ * done via resowner.c.
+ *
+ * It can make sense to amortize the cost of BufferPrepareToSetHintBits() +
+ * BufferFinishSetHintBits() over multiple hint bit sets on a page.
+ *
+ * Returns whether caller is allowed to set hint bits.
+ */
+bool
+BufferPrepareToSetHintBits(Buffer buffer)
+{
+	BufferDesc *desc;
+	uint32		old_buf_state;
+
+	Assert(BufferIsPinned(buffer));
+
+	if (BufferIsLocal(buffer))
+		return true;
+
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+
+	desc = GetBufferDescriptor(buffer - 1);
+
+	/*
+	 * We could use LockBufHdr() instead, but a CAS loop turns out to be
+	 * slightly faster and has better concurrency behaviour due to not
+	 * blocking other backends.
+	 */
+	old_buf_state = pg_atomic_read_u32(&desc->state);
+
+	for (;;)
+	{
+		uint32		new_buf_state;
+
+		if (unlikely(old_buf_state & BM_LOCKED))
+			old_buf_state = WaitBufHdrUnlocked(desc);
+
+		if (unlikely((old_buf_state & (BM_SETTING_HINTS | BM_IO_IN_PROGRESS)) != 0))
+			return false;
+
+		new_buf_state = old_buf_state;
+
+		new_buf_state |= BM_SETTING_HINTS;
+
+		if (likely(pg_atomic_compare_exchange_u32(&desc->state, &old_buf_state,
+												  new_buf_state)))
+		{
+			ResourceOwnerRememberBufferSettingHints(CurrentResourceOwner, buffer);
+
+			return true;
+		}
+	}
+}
+
+/*
+ * Release the permission to set hint bits on the current page.
+ */
+void
+BufferFinishSetHintBits(Buffer buffer, bool mark_dirty, bool buffer_std)
+{
+	BufferDesc *desc;
+	uint32		new_buf_state;
+	uint32		old_buf_state;
+	bool		is_dirty = false;
+
+	Assert(BufferIsPinned(buffer));
+
+	if (BufferIsLocal(buffer))
+	{
+		MarkLocalBufferDirty(buffer);
+		return;
+	}
+
+	desc = GetBufferDescriptor(buffer - 1);
+
+	/*
+	 * We could use LockBufHdr() instead, but a CAS loop turns out to be
+	 * slightly faster and has better concurrency behaviour due to not
+	 * blocking other backends.
+	 */
+	old_buf_state = pg_atomic_read_u32(&desc->state);
+
+	Assert(old_buf_state & (BM_TAG_VALID));
+	Assert(old_buf_state & (BM_VALID));
+	Assert(old_buf_state & (BM_SETTING_HINTS));
+	Assert(!(old_buf_state & (BM_IO_IN_PROGRESS)));
+
+	for (;;)
+	{
+		if (unlikely(old_buf_state & BM_LOCKED))
+			old_buf_state = WaitBufHdrUnlocked(desc);
+
+		is_dirty = (old_buf_state & (BM_DIRTY | BM_JUST_DIRTIED)) ==
+				(BM_DIRTY | BM_JUST_DIRTIED);
+
+		new_buf_state = old_buf_state;
+		new_buf_state &= ~BM_SETTING_HINTS;
+
+		if (likely(pg_atomic_compare_exchange_u32(&desc->state, &old_buf_state,
+												  new_buf_state)))
+		{
+			break;
+		}
+	}
+
+	/*
+	 * It might be worth combining some of the atomic operations happening
+	 * inside MarkBufferDirtyHintImpl() with the above.
+	 */
+	if (mark_dirty && !is_dirty)
+		MarkBufferDirtyHintImpl(buffer, buffer_std, true);
+
+	ResourceOwnerForgetBufferSettingHints(CurrentResourceOwner, buffer);
+
+	/*
+	 * Wake everyone that might be waiting for a chance to perform IO (i.e.
+	 * WaitIO()).
+	 *
+	 * XXX: Right now this is somewhat expensive, due to
+	 * ConditionVariableBroadcast() acquiring its spinlock unconditionally. I
+	 * don't see a good way to avoid that from the bufmgr.c side, we don't
+	 * know if there is somebody waiting.
+	 */
+	ConditionVariableBroadcast(BufferDescriptorGetIOCV(desc));
+}
+
+/*
+ * Try to set 16 bit hint bit value.
+ *
+ * This is cheaper than BufferPrepareToSetHintBits() +
+ * BufferFinishSetHintBits() when doing a single hint bit write within a
+ * buffer, but more expensive when setting multiple hint bits.
+ *
+ * Assumes that the buffer has the standard layout. The buffer will be marked
+ * as dirty if possible.
+ */
+void
+BufferSetHintBits16(Buffer buffer, uint16 *ptr, uint16 val)
+{
+	BufferDesc	*desc;
+	uint32		buf_state;
+	bool		did_set = false;
+	bool		is_dirty = false;
+
+	Assert(BufferIsPinned(buffer));
+
+	/*
+	 * TODO: Add an assertion ensuring that ptr points into the buffer.
+	 */
+
+	if (unlikely(BufferIsLocal(buffer)))
+	{
+		desc = GetLocalBufferDescriptor(-buffer - 1);
+		buf_state = pg_atomic_read_u32(&desc->state);
+
+		if (likely((buf_state & BM_IO_IN_PROGRESS) == 0))
+		{
+			*ptr = val;
+
+			is_dirty = (buf_state & (BM_DIRTY | BM_JUST_DIRTIED)) ==
+				(BM_DIRTY | BM_JUST_DIRTIED);
+			did_set = true;
+		}
+
+		if (did_set && !is_dirty)
+			MarkLocalBufferDirty(buffer);
+		return;
+	}
+	else
+	{
+		desc = GetBufferDescriptor(buffer - 1);
+		buf_state = LockBufHdr(desc);
+
+		if (likely((buf_state & BM_IO_IN_PROGRESS) == 0))
+		{
+
+			*ptr = val;
+
+			did_set = true;
+		}
+
+		UnlockBufHdr(desc, buf_state);
+
+		is_dirty = (buf_state & (BM_DIRTY | BM_JUST_DIRTIED)) == (BM_DIRTY | BM_JUST_DIRTIED);
+		if (did_set && !is_dirty)
+			MarkBufferDirtyHintImpl(buffer, true, true);
+	}
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 16144c2b72d..03e4fcbe897 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -110,6 +110,7 @@ BGWORKER_SHUTDOWN	"Waiting for background worker to shut down."
 BGWORKER_STARTUP	"Waiting for background worker to start up."
 BTREE_PAGE	"Waiting for the page number needed to continue a parallel B-tree scan to become available."
 BUFFER_IO	"Waiting for buffer I/O to complete."
+BUFFER_SETTING_HINTS	"Waiting for hint bit setting to complete."
 CHECKPOINT_DELAY_COMPLETE	"Waiting for a backend that blocks a checkpoint from completing."
 CHECKPOINT_DELAY_START	"Waiting for a backend that blocks a checkpoint from starting."
 CHECKPOINT_DONE	"Waiting for a checkpoint to complete."
-- 
2.45.2.746.g06e570c0df.dirty
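
To make the calling convention concrete, here is a minimal sketch of a
hypothetical caller, assuming a pinned, share-locked standard buffer; the
actual hint modification is elided:

#include "postgres.h"
#include "storage/bufmgr.h"

static void
set_hints_example(Buffer buffer)
{
	/*
	 * Only proceed if no IO is in progress; this also prevents IO from
	 * starting until BufferFinishSetHintBits() is called.
	 */
	if (!BufferPrepareToSetHintBits(buffer))
		return;					/* hint bits are optional, just skip them */

	/* ... set one or more hint bits on the page here ... */

	/*
	 * Allow IO again. mark_dirty = true and buffer_std = true are
	 * appropriate for a standard page whose hint should eventually be
	 * written out.
	 */
	BufferFinishSetHintBits(buffer, true, true);
}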

v2-0010-heapam-Acquire-right-to-set-hint-bits.patchtext/x-diff; charset=us-asciiDownload
From d833b569691634ddec6dd01b1f31297491aaed5f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 14:05:15 -0400
Subject: [PATCH v2 10/15] heapam: Acquire right to set hint bits

This commit starts to use BufferPrepareToSetHintBits(), introduced in a
previous commit, for the reasons explained in said commit.

To amortize the cost of BufferPrepareToSetHintBits() for cases where hint bits
are set at a high frequency, HeapTupleSatisfiesMVCCBatch() uses the new
SetHintBitsExt() which defers BufferFinishSetHintBits() until all hint bits on
a page have been set.

It's likely worth introducing additional batch visibility routines, e.g. for
vacuuming, but I did not find a regression with the state as of this
commit. So that's left for later.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/heap/heapam_visibility.c | 114 ++++++++++++++++----
 src/tools/pgindent/typedefs.list            |   1 +
 2 files changed, 97 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 4baafc01a2c..a31ec9d835d 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -80,10 +80,35 @@
 
 
 /*
- * SetHintBits()
+ * To be allowed to set hint bits, SetHintBits() needs to call
+ * BufferPrepareToSetHintBits(). However, that's not free, and some callsites
+ * call SetHintBits() on numerous tuples in a row. For those it makes sense to
+ * amortize the cost of BufferPrepareToSetHintBits(). Additionally it's
+ * desirable to defer the cost of BufferPrepareToSetHintBits() until a hint
+ * bit actually needs to be set. This enum serves as the necessary state space
+ * passed to SetHintBitsExt().
+ */
+typedef enum SetHintBitsState
+{
+	/* not yet checked if hint bits may be set */
+	SHB_INITIAL,
+	/* failed to get permission to set hint bits, don't check again */
+	SHB_DISABLED,
+	/* allowed to set hint bits */
+	SHB_ENABLED,
+} SetHintBitsState;
+
+/*
+ * SetHintBitsExt()
  *
  * Set commit/abort hint bits on a tuple, if appropriate at this time.
  *
+ * To be allowed to set a hint bit on a tuple, the page must not be undergoing
+ * IO at this time (otherwise we e.g. could corrupt PG's page checksum or even
+ * the filesystem's, as is known to happen with btrfs). The right to set a
+ * hint bit is acquired on a page level with BufferPrepareToSetHintBits().
+ * Only a single backend gets the right to set hint bits at a time.
+ *
  * It is only safe to set a transaction-committed hint bit if we know the
  * transaction's commit record is guaranteed to be flushed to disk before the
  * buffer, or if the table is temporary or unlogged and will be obliterated by
@@ -111,9 +136,16 @@
  * InvalidTransactionId if no check is needed.
  */
 static inline void
-SetHintBits(HeapTupleHeader tuple, Buffer buffer,
-			uint16 infomask, TransactionId xid)
+SetHintBitsExt(HeapTupleHeader tuple, Buffer buffer,
+			   uint16 infomask, TransactionId xid, SetHintBitsState *state)
 {
+	/*
+	 * We're in batched mode and previously did not get permission to set hint
+	 * bits. Don't try again; in all likelihood IO is still going on.
+	 */
+	if (state && *state == SHB_DISABLED)
+		return;
+
 	if (TransactionIdIsValid(xid))
 	{
 		/* NB: xid must be known committed here! */
@@ -127,8 +159,50 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer,
 		}
 	}
 
+	if (!state)
+	{
+		BufferSetHintBits16(buffer,
+							&tuple->t_infomask,
+							tuple->t_infomask | infomask);
+		return;
+	}
+
+	/*
+	 * This is the first hint that we'd like to set on the page; check if we
+	 * are allowed to do so.
+	 */
+	if (*state == SHB_INITIAL)
+	{
+		if (BufferPrepareToSetHintBits(buffer))
+		{
+			*state = SHB_ENABLED;
+		}
+		else
+		{
+			/*
+			 * If we hold an exclusive lock nobody else should be able to
+			 * prevent us from setting hint bits. This is important as there
+			 * are a few hint bit sets that are important for correctness -
+			 * all those happen with the buffer exclusively locked.
+			 */
+			Assert(!BufferLockHeldByMe(buffer, BUFFER_LOCK_EXCLUSIVE));
+			*state = SHB_DISABLED;
+			return;
+		}
+	}
+
 	tuple->t_infomask |= infomask;
-	MarkBufferDirtyHint(buffer, true);
+}
+
+/*
+ * Simple wrapper around SetHintBitsExt(); use when operating on a single
+ * tuple.
+ */
+static inline void
+SetHintBits(HeapTupleHeader tuple, Buffer buffer,
+			uint16 infomask, TransactionId xid)
+{
+	SetHintBitsExt(tuple, buffer, infomask, xid, NULL);
 }
 
 /*
@@ -864,9 +938,9 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
  * inserting/deleting transaction was still running --- which was more cycles
  * and more contention on ProcArrayLock.
  */
-static bool
+static inline bool
 HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
-					   Buffer buffer)
+					   Buffer buffer, SetHintBitsState *state)
 {
 	HeapTupleHeader tuple = htup->t_data;
 
@@ -912,8 +986,8 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
 			if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
 			{
 				/* deleting subtransaction must have aborted */
-				SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
-							InvalidTransactionId);
+				SetHintBitsExt(tuple, buffer, HEAP_XMAX_INVALID,
+							   InvalidTransactionId, state);
 				return true;
 			}
 
@@ -925,13 +999,13 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
 		else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
 			return false;
 		else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
-			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-						HeapTupleHeaderGetRawXmin(tuple));
+			SetHintBitsExt(tuple, buffer, HEAP_XMIN_COMMITTED,
+						   HeapTupleHeaderGetRawXmin(tuple), state);
 		else
 		{
 			/* it must have aborted or crashed */
-			SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-						InvalidTransactionId);
+			SetHintBitsExt(tuple, buffer, HEAP_XMIN_INVALID,
+						   InvalidTransactionId, state);
 			return false;
 		}
 	}
@@ -994,14 +1068,14 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
 		if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuple)))
 		{
 			/* it must have aborted or crashed */
-			SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
-						InvalidTransactionId);
+			SetHintBitsExt(tuple, buffer, HEAP_XMAX_INVALID,
+						   InvalidTransactionId, state);
 			return true;
 		}
 
 		/* xmax transaction committed */
-		SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED,
-					HeapTupleHeaderGetRawXmax(tuple));
+		SetHintBitsExt(tuple, buffer, HEAP_XMAX_COMMITTED,
+					   HeapTupleHeaderGetRawXmax(tuple), state);
 	}
 	else
 	{
@@ -1597,6 +1671,7 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
 							OffsetNumber *vistuples_dense)
 {
 	int			nvis = 0;
+	SetHintBitsState state = SHB_INITIAL;
 
 	Assert(IsMVCCSnapshot(snapshot));
 
@@ -1609,7 +1684,7 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
 		HeapTuple	tup = &tuples[i];
 #endif
 
-		valid = HeapTupleSatisfiesMVCC(tup, snapshot, buffer);
+		valid = HeapTupleSatisfiesMVCC(tup, snapshot, buffer, &state);
 #ifdef BATCHMVCC_FEWER_ARGS
 		batchmvcc->visible[i] = valid;
 #else
@@ -1622,6 +1697,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
 		}
 	}
 
+	if (state == SHB_ENABLED)
+		BufferFinishSetHintBits(buffer, true, true);
+
 	return nvis;
 }
 
@@ -1641,7 +1719,7 @@ HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot, Buffer buffer)
 	switch (snapshot->snapshot_type)
 	{
 		case SNAPSHOT_MVCC:
-			return HeapTupleSatisfiesMVCC(htup, snapshot, buffer);
+			return HeapTupleSatisfiesMVCC(htup, snapshot, buffer, NULL);
 		case SNAPSHOT_SELF:
 			return HeapTupleSatisfiesSelf(htup, snapshot, buffer);
 		case SNAPSHOT_ANY:
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2d278960078..5415b85c385 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2606,6 +2606,7 @@ SetConstraintStateData
 SetConstraintTriggerData
 SetExprState
 SetFunctionReturnMode
+SetHintBitsState
 SetOp
 SetOpCmd
 SetOpPath
-- 
2.45.2.746.g06e570c0df.dirty
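
The shape of the batching added here can be summarized with the following
sketch. SetHintBitsState mirrors the file-local enum from the patch, and the
per-tuple callback stands in for HeapTupleSatisfiesMVCC() with its new state
argument; everything else is invented for illustration:

#include "postgres.h"
#include "access/htup.h"
#include "storage/bufmgr.h"
#include "utils/snapshot.h"

/* mirrors the file-local enum added in heapam_visibility.c */
typedef enum SetHintBitsState
{
	SHB_INITIAL,
	SHB_DISABLED,
	SHB_ENABLED,
} SetHintBitsState;

/* stand-in for a per-tuple visibility check taking the new state argument */
typedef bool (*visibility_check_fn) (HeapTuple tup, Snapshot snapshot,
									 Buffer buffer, SetHintBitsState *state);

static int
batch_visibility_sketch(visibility_check_fn check, Buffer buffer,
						HeapTupleData *tuples, int ntuples, Snapshot snapshot)
{
	SetHintBitsState state = SHB_INITIAL;
	int			nvis = 0;

	for (int i = 0; i < ntuples; i++)
	{
		/* the first attempted hint set flips state to SHB_ENABLED or SHB_DISABLED */
		if (check(&tuples[i], snapshot, buffer, &state))
			nvis++;
	}

	/* pay the release cost once per page, and only if we acquired the right */
	if (state == SHB_ENABLED)
		BufferFinishSetHintBits(buffer, true, true);

	return nvis;
}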

v2-0011-Acquire-right-to-set-hint-bits-in-the-remaining-p.patchtext/x-diff; charset=us-asciiDownload
From 727f753b56640fff91a2bfcdbf58ea4edbd5b65e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 Oct 2024 16:44:11 -0400
Subject: [PATCH v2 11/15] Acquire right to set hint bits in the remaining
 places

Use BufferPrepareToSetHintBits() in the remaining places that modify buffers
without an exclusive lock. The remaining places are indexes with support for
kill_prior_tuples and the freespace code.

After this we do not need to copy buffers to write them out anymore. That
change is made separately, however.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/gist/gistget.c         | 15 +++++++-----
 src/backend/access/hash/hashutil.c        |  6 ++++-
 src/backend/access/nbtree/nbtinsert.c     | 29 +++++++++++++++--------
 src/backend/access/nbtree/nbtutils.c      |  7 +++++-
 src/backend/storage/freespace/freespace.c | 16 +++++++++----
 src/backend/storage/freespace/fsmpage.c   |  8 ++++++-
 6 files changed, 57 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index b35b8a97577..020993b309c 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -63,11 +63,7 @@ gistkillitems(IndexScanDesc scan)
 	 * safe.
 	 */
 	if (BufferGetLSNAtomic(buffer) != so->curPageLSN)
-	{
-		UnlockReleaseBuffer(buffer);
-		so->numKilled = 0;		/* reset counter */
-		return;
-	}
+		goto unlock;
 
 	Assert(GistPageIsLeaf(page));
 
@@ -77,6 +73,12 @@ gistkillitems(IndexScanDesc scan)
 	 */
 	for (i = 0; i < so->numKilled; i++)
 	{
+		if (!killedsomething)
+		{
+			if (!BufferPrepareToSetHintBits(buffer))
+				goto unlock;
+		}
+
 		offnum = so->killedItems[i];
 		iid = PageGetItemId(page, offnum);
 		ItemIdMarkDead(iid);
@@ -86,9 +88,10 @@ gistkillitems(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		GistMarkPageHasGarbage(page);
-		MarkBufferDirtyHint(buffer, true);
+		BufferFinishSetHintBits(buffer, true, true);
 	}
 
+unlock:
 	UnlockReleaseBuffer(buffer);
 
 	/*
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 20028f5cd14..a7fc98d8f96 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -593,6 +593,9 @@ _hash_kill_items(IndexScanDesc scan)
 
 			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
 			{
+				if (!BufferPrepareToSetHintBits(so->currPos.buf))
+					goto unlock_page;
+
 				/* found the item */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
@@ -610,9 +613,10 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(buf, true);
+		BufferFinishSetHintBits(so->currPos.buf, true, true);
 	}
 
+unlock_page:
 	if (so->hashso_bucket_buf == so->currPos.buf ||
 		havePin)
 		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 99043da8412..75160d53e59 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -679,20 +679,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				{
 					/*
 					 * The conflicting tuple (or all HOT chains pointed to by
-					 * all posting list TIDs) is dead to everyone, so mark the
-					 * index entry killed.
+					 * all posting list TIDs) is dead to everyone, so try to
+					 * mark the index entry killed. It's OK if we're not
+					 * allowed to; this isn't required for correctness.
 					 */
-					ItemIdMarkDead(curitemid);
-					opaque->btpo_flags |= BTP_HAS_GARBAGE;
+					Buffer		buf;
 
-					/*
-					 * Mark buffer with a dirty hint, since state is not
-					 * crucial. Be sure to mark the proper buffer dirty.
-					 */
+					/* Be sure to operate on the proper buffer */
 					if (nbuf != InvalidBuffer)
-						MarkBufferDirtyHint(nbuf, true);
+						buf = nbuf;
 					else
-						MarkBufferDirtyHint(insertstate->buf, true);
+						buf = insertstate->buf;
+
+					if (BufferPrepareToSetHintBits(buf))
+					{
+						ItemIdMarkDead(curitemid);
+						opaque->btpo_flags |= BTP_HAS_GARBAGE;
+
+						/*
+						 * Mark buffer with a dirty hint, since state is not
+						 * crucial.
+						 */
+						BufferFinishSetHintBits(buf, true, true);
+					}
 				}
 
 				/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index d7603250279..417b7c11d6f 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -4318,6 +4318,10 @@ _bt_killitems(IndexScanDesc scan)
 			 */
 			if (killtuple && !ItemIdIsDead(iid))
 			{
+				/* if we're not able to set hint bits, there's no point continuing */
+				if (!killedsomething && !BufferPrepareToSetHintBits(so->currPos.buf))
+					goto unlock_page;
+
 				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
@@ -4337,9 +4341,10 @@ _bt_killitems(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->btpo_flags |= BTP_HAS_GARBAGE;
-		MarkBufferDirtyHint(so->currPos.buf, true);
+		BufferFinishSetHintBits(so->currPos.buf, true, true);
 	}
 
+unlock_page:
 	_bt_unlockbuf(scan->indexRelation, so->currPos.buf);
 }
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index b1262ac3b69..90b306a65cd 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -904,12 +904,18 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
 	max_avail = fsm_get_max_avail(page);
 
 	/*
-	 * Reset the next slot pointer. This encourages the use of low-numbered
-	 * pages, increasing the chances that a later vacuum can truncate the
-	 * relation.  We don't bother with a lock here, nor with marking the page
-	 * dirty if it wasn't already, since this is just a hint.
+	 * Try to reset the next slot pointer. This encourages the use of
+	 * low-numbered pages, increasing the chances that a later vacuum can
+	 * truncate the relation.  We don't bother with a lock here, nor with
+	 * marking the page dirty if it wasn't already, since this is just a hint.
+	 *
+	 * FIXME: Should probably use BufferSetHintBits16()
 	 */
-	((FSMPage) PageGetContents(page))->fp_next_slot = 0;
+	if (BufferPrepareToSetHintBits(buf))
+	{
+		((FSMPage) PageGetContents(page))->fp_next_slot = 0;
+		BufferFinishSetHintBits(buf, false, false);
+	}
 
 	ReleaseBuffer(buf);
 
diff --git a/src/backend/storage/freespace/fsmpage.c b/src/backend/storage/freespace/fsmpage.c
index ba85cad0697..257d4ed12c0 100644
--- a/src/backend/storage/freespace/fsmpage.c
+++ b/src/backend/storage/freespace/fsmpage.c
@@ -300,7 +300,13 @@ restart:
 	 *
 	 * Wrap-around is handled at the beginning of this function.
 	 */
-	fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);
+	if (exclusive_lock_held || BufferPrepareToSetHintBits(buf))
+	{
+		fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);
+
+		if (!exclusive_lock_held)
+			BufferFinishSetHintBits(buf, false, true);
+	}
 
 	return slot;
 }
-- 
2.45.2.746.g06e570c0df.dirty
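
The index AM and freespace changes above all follow the same shape; a
condensed sketch is below (names invented, AM-specific page flags and buffer
locking/unlocking elided):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"
#include "storage/off.h"

static void
kill_items_sketch(Buffer buf, Page page,
				  OffsetNumber *killed_offsets, int nkilled)
{
	bool		killedsomething = false;

	for (int i = 0; i < nkilled; i++)
	{
		/* ask for the right lazily, just before the first modification */
		if (!killedsomething && !BufferPrepareToSetHintBits(buf))
			return;				/* IO in progress; killing items is optional */

		ItemIdMarkDead(PageGetItemId(page, killed_offsets[i]));
		killedsomething = true;
	}

	/* this is also where an AM would set its "page has garbage" flag */
	if (killedsomething)
		BufferFinishSetHintBits(buf, true, true);
}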

v2-0012-bufmgr-Don-t-copy-pages-while-writing-out.patchtext/x-diff; charset=us-asciiDownload
From 09e03c3750f70da7e37fb6575c97511588b162d3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 14:14:35 -0400
Subject: [PATCH v2 12/15] bufmgr: Don't copy pages while writing out

After the series of preceding commits introducing and using
BufferPrepareToSetHintBits(), hint bits cannot be set while IO is going on.
Therefore we no longer need to copy pages while they are being written out.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/bufpage.h           |  3 +-
 src/backend/access/hash/hashpage.c      |  2 +-
 src/backend/access/transam/xloginsert.c | 28 +++++----------
 src/backend/storage/buffer/bufmgr.c     | 23 +++++-------
 src/backend/storage/buffer/localbuf.c   |  2 +-
 src/backend/storage/page/bufpage.c      | 48 ++++---------------------
 src/backend/storage/smgr/bulk_write.c   |  2 +-
 7 files changed, 26 insertions(+), 82 deletions(-)

diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 6222d46e535..e8cadc73aad 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -507,7 +507,6 @@ extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
 extern void PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum);
 extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 									Item newtup, Size newsize);
-extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
-extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern void PageSetChecksum(Page page, BlockNumber blkno);
 
 #endif							/* BUFPAGE_H */
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index d09c349e28f..2becf48bf06 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -1029,7 +1029,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 					zerobuf.data,
 					true);
 
-	PageSetChecksumInplace(page, lastblock);
+	PageSetChecksum(page, lastblock);
 	smgrextend(RelationGetSmgr(rel), MAIN_FORKNUM, lastblock, zerobuf.data,
 			   false);
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index f92d0626082..6fecbb0fe09 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1052,6 +1052,8 @@ XLogCheckBufferNeedsBackup(Buffer buffer)
  * failures. So instead we copy the page and insert the copied data as normal
  * record data.
  *
+ * FIXME: outdated
+ *
  * We only need to do something if page has not yet been full page written in
  * this checkpoint round. The LSN of the inserted wal record is returned if we
  * had to write, InvalidXLogRecPtr otherwise.
@@ -1089,37 +1091,23 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
 	if (lsn <= RedoRecPtr)
 	{
 		int			flags = 0;
-		PGAlignedBlock copied_buffer;
 		char	   *origdata = (char *) BufferGetBlock(buffer);
 		RelFileLocator rlocator;
 		ForkNumber	forkno;
 		BlockNumber blkno;
 
-		/*
-		 * Copy buffer so we don't have to worry about concurrent hint bit or
-		 * lsn updates. We assume pd_lower/upper cannot be changed without an
-		 * exclusive lock, so the contents bkp are not racy.
-		 */
-		if (buffer_std)
-		{
-			/* Assume we can omit data between pd_lower and pd_upper */
-			Page		page = BufferGetPage(buffer);
-			uint16		lower = ((PageHeader) page)->pd_lower;
-			uint16		upper = ((PageHeader) page)->pd_upper;
-
-			memcpy(copied_buffer.data, origdata, lower);
-			memcpy(copied_buffer.data + upper, origdata + upper, BLCKSZ - upper);
-		}
-		else
-			memcpy(copied_buffer.data, origdata, BLCKSZ);
-
 		XLogBeginInsert();
 
 		if (buffer_std)
 			flags |= REGBUF_STANDARD;
 
 		BufferGetTag(buffer, &rlocator, &forkno, &blkno);
-		XLogRegisterBlock(0, &rlocator, forkno, blkno, copied_buffer.data, flags);
+
+		/*
+		 * XXX: Should probably be XLogRegisterBuffer(), but assertions make
+		 * that non-viable right now.
+		 */
+		XLogRegisterBlock(0, &rlocator, forkno, blkno, origdata, flags);
 
 		recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI_FOR_HINT);
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dd94b9c16b0..6f39cc14596 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1996,9 +1996,8 @@ again:
 	/*
 	 * If the buffer was dirty, try to write it out.  There is a race
 	 * condition here, in that someone might dirty it after we released the
-	 * buffer header lock above, or even while we are writing it out (since
-	 * our share-lock won't prevent hint-bit updates).  We will recheck the
-	 * dirty bit after re-locking the buffer header.
+	 * buffer header lock above.  We will recheck the dirty bit after
+	 * re-locking the buffer header.
 	 */
 	if (buf_state & BM_DIRTY)
 	{
@@ -3786,9 +3785,8 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
  *
  * The caller must hold a pin on the buffer and have share-locked the
  * buffer contents.  (Note: a share-lock does not prevent updates of
- * hint bits in the buffer, so the page could change while the write
- * is in progress, but we assume that that will not invalidate the data
- * written.)
+ * hint bits in the buffer, but hint bits are not allowed to be set after
+ * StartBufferIO() succeeded.)
  *
  * If the caller has an smgr reference for the buffer's relation, pass it
  * as the second parameter.  If not, pass NULL.
@@ -3801,7 +3799,6 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	ErrorContextCallback errcallback;
 	instr_time	io_start;
 	Block		bufBlock;
-	char	   *bufToWrite;
 	uint32		buf_state;
 
 	/*
@@ -3867,12 +3864,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 */
 	bufBlock = BufHdrGetBlock(buf);
 
-	/*
-	 * Update page checksum if desired.  Since we have only shared lock on the
-	 * buffer, other processes might be updating hint bits in it, so we must
-	 * copy the page to private storage if we do checksumming.
-	 */
-	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+	/* Update page checksum if desired. */
+	PageSetChecksum((Page) bufBlock, buf->tag.blockNum);
 
 	io_start = pgstat_prepare_io_time(track_io_timing);
 
@@ -3882,7 +3875,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	smgrwrite(reln,
 			  BufTagGetForkNum(&buf->tag),
 			  buf->tag.blockNum,
-			  bufToWrite,
+			  bufBlock,
 			  false);
 
 	/*
@@ -4531,7 +4524,7 @@ FlushRelationBuffers(Relation rel)
 				errcallback.previous = error_context_stack;
 				error_context_stack = &errcallback;
 
-				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
+				PageSetChecksum(localpage, bufHdr->tag.blockNum);
 
 				io_start = pgstat_prepare_io_time(track_io_timing);
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 6fd1a6418d2..b7cbbc57554 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -242,7 +242,7 @@ GetLocalVictimBuffer(void)
 		/* Find smgr relation for buffer */
 		oreln = smgropen(BufTagGetRelFileLocator(&bufHdr->tag), MyProcNumber);
 
-		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
+		PageSetChecksum(localpage, bufHdr->tag.blockNum);
 
 		io_start = pgstat_prepare_io_time(track_io_timing);
 
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index aa264f61b9c..f4822811331 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1484,51 +1484,15 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 /*
  * Set checksum for a page in shared buffers.
  *
- * If checksums are disabled, or if the page is not initialized, just return
- * the input.  Otherwise, we must make a copy of the page before calculating
- * the checksum, to prevent concurrent modifications (e.g. setting hint bits)
- * from making the final checksum invalid.  It doesn't matter if we include or
- * exclude hints during the copy, as long as we write a valid page and
- * associated checksum.
+ * If checksums are disabled, or if the page is not initialized, just
+ * return. Otherwise compute and set the checksum.
  *
- * Returns a pointer to the block-sized data that needs to be written. Uses
- * statically-allocated memory, so the caller must immediately write the
- * returned page and not refer to it again.
- */
-char *
-PageSetChecksumCopy(Page page, BlockNumber blkno)
-{
-	static char *pageCopy = NULL;
-
-	/* If we don't need a checksum, just return the passed-in data */
-	if (PageIsNew(page) || !DataChecksumsEnabled())
-		return (char *) page;
-
-	/*
-	 * We allocate the copy space once and use it over on each subsequent
-	 * call.  The point of palloc'ing here, rather than having a static char
-	 * array, is first to ensure adequate alignment for the checksumming code
-	 * and second to avoid wasting space in processes that never call this.
-	 */
-	if (pageCopy == NULL)
-		pageCopy = MemoryContextAllocAligned(TopMemoryContext,
-											 BLCKSZ,
-											 PG_IO_ALIGN_SIZE,
-											 0);
-
-	memcpy(pageCopy, (char *) page, BLCKSZ);
-	((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
-	return pageCopy;
-}
-
-/*
- * Set checksum for a page in private memory.
- *
- * This must only be used when we know that no other process can be modifying
- * the page buffer.
+ * In the past this needed to be done on a copy of the page, due to the
+ * possibility of e.g. hint bits being set concurrently. However, this is not
+ * necessary anymore as hint bits won't be set while IO is going on.
  */
 void
-PageSetChecksumInplace(Page page, BlockNumber blkno)
+PageSetChecksum(Page page, BlockNumber blkno)
 {
 	/* If we don't need a checksum, just return */
 	if (PageIsNew(page) || !DataChecksumsEnabled())
diff --git a/src/backend/storage/smgr/bulk_write.c b/src/backend/storage/smgr/bulk_write.c
index f0a65bfe242..2c294ddab1f 100644
--- a/src/backend/storage/smgr/bulk_write.c
+++ b/src/backend/storage/smgr/bulk_write.c
@@ -277,7 +277,7 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
 		BlockNumber blkno = pending_writes[i].blkno;
 		Page		page = pending_writes[i].buf->data;
 
-		PageSetChecksumInplace(page, blkno);
+		PageSetChecksum(page, blkno);
 
 		if (blkno >= bulkstate->pages_written)
 		{
-- 
2.45.2.746.g06e570c0df.dirty
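
The resulting in-place checksum path is short; since the hunk above is
truncated, the following is a reconstruction based on the former
PageSetChecksumInplace(), presented as a sketch rather than the verbatim
function body:

#include "postgres.h"
#include "access/xlog.h"			/* DataChecksumsEnabled() */
#include "storage/bufpage.h"
#include "storage/checksum.h"

static void
page_set_checksum_sketch(Page page, BlockNumber blkno)
{
	/* nothing to do if checksums are off or the page has never been used */
	if (PageIsNew(page) || !DataChecksumsEnabled())
		return;

	/* safe to compute directly on the shared buffer: hint bits are blocked */
	((PageHeader) page)->pd_checksum = pg_checksum_page((char *) page, blkno);
}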

v2-0013-bufmgr-Detect-some-missing-BufferPrepareToSetHint.patchtext/x-diff; charset=us-asciiDownload
From 6bd0809b5f5c83a464566eb0438b44fe5d1d2576 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 Oct 2024 18:11:23 -0400
Subject: [PATCH v2 13/15] bufmgr: Detect some missing
 BufferPrepareToSetHintBits() calls

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/buffer/bufmgr.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f39cc14596..a91e4a7eff5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4993,6 +4993,24 @@ MarkBufferDirtyHintImpl(Buffer buffer, bool buffer_std, bool single_hint_write)
 	/* here, either share or exclusive lock is OK */
 	Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
 
+	/*
+	 * A share-locked buffer may only be marked dirty for hints after having
+	 * gotten the permission to do so with BufferPrepareToSetHintBits(). It'd
+	 * perhaps be cheap enough to test this even outside of assert enabled
+	 * builds, but LWLockHeldByMeInMode() says "debug support only".
+	 */
+#ifdef USE_ASSERT_CHECKING
+	if (!single_hint_write)
+	{
+		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+		if (!LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE))
+		{
+			Assert(buf_state & BM_SETTING_HINTS);
+		}
+	}
+#endif
+
 	/*
 	 * This routine might get called many times on the same page, if we are
 	 * making the first scan after commit of an xact that added/deleted many
-- 
2.45.2.746.g06e570c0df.dirty
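
To illustrate what the new assertion catches, consider a hypothetical caller
that sets a hint under a mere share lock without first acquiring the right to
do so; in an assert-enabled build this now fails the BM_SETTING_HINTS check
in MarkBufferDirtyHintImpl():

#include "postgres.h"
#include "storage/bufmgr.h"

static void
buggy_hint_setter(Buffer buffer)
{
	LockBuffer(buffer, BUFFER_LOCK_SHARE);

	/* ... flip some hint bit on the page ... */

	/* missing: BufferPrepareToSetHintBits(buffer) */
	MarkBufferDirtyHint(buffer, true);	/* Assert(buf_state & BM_SETTING_HINTS) fires */

	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}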

v2-0014-WIP-bufmgr-Detect-some-bad-buffer-accesses.patchtext/x-diff; charset=us-asciiDownload
From 2118edc30758b2e23d11c5492699c17642a00902 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 Oct 2024 18:03:53 -0400
Subject: [PATCH v2 14/15] WIP: bufmgr: Detect some bad buffer accesses

I wrote this mainly to ensure that I did not miss converting any hint bit sets
to BufferPrepareToSetHintBits(), but it seems like it might be more generally
useful.

If we do want to include it, it needs a bit more polishing.

On my workstation, the performance effect of this test infrastructure is as
follows:

base:
real	1m4.613s
user	4m31.409s
sys	3m20.445s

ENFORCE_BUFFER_PROT

real	1m11.912s
user	4m27.332s
sys	3m28.063s

ENFORCE_BUFFER_PROT + ENFORCE_BUFFER_PROT_READ
real	1m33.455s
user	4m32.188s
sys	3m41.275s

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/pg_config_manual.h        |  24 ++++
 src/include/storage/buf_internals.h   |   9 ++
 src/include/storage/pg_shmem.h        |   2 +
 src/backend/storage/buffer/buf_init.c |  10 ++
 src/backend/storage/buffer/bufmgr.c   | 173 ++++++++++++++++++++++++--
 src/backend/utils/misc/guc_tables.c   |   2 +-
 6 files changed, 208 insertions(+), 12 deletions(-)

diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index e49eb13e43c..8ea96bd1f1b 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -361,3 +361,27 @@
  * Enable tracing of syncscan operations (see also the trace_syncscan GUC var).
  */
 /* #define TRACE_SYNCSCAN */
+
+
+/*
+ * On some operating systems we know how to change whether pages are
+ * readable/writeable. We can use that to verify that we are following buffer
+ * locking rules: we can make pages inaccessible or read-only when we don't
+ * have sufficient locks, etc. This obviously is fairly expensive, so by
+ * default we only do so in assert-enabled builds.
+ */
+#if defined(USE_ASSERT_CHECKING) && !defined(WIN32)
+#define ENFORCE_BUFFER_PROT
+#endif
+
+/*
+ * Protecting pages against being modified without an exclusive lock /
+ * BufferPrepareToSetHintBits() is reasonably cheap, as neither happens *that*
+ * often. Pinning/unpinning buffers is a lot more common, making it more
+ * expensive to call mprotect() that often.
+ *
+ * Therefore disable this by default, even in assert enabled builds.
+ */
+#ifdef ENFORCE_BUFFER_PROT
+/* #define ENFORCE_BUFFER_PROT_READ */
+#endif
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index a8dbcec666c..2fd451e0329 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -488,6 +488,15 @@ extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
 extern bool have_free_buffer(void);
 
+#ifdef ENFORCE_BUFFER_PROT
+extern void SetBufferProtection(Buffer buf, bool allow_reads, bool allow_writes);
+#else
+static inline void
+SetBufferProtection(Buffer buf, bool allow_reads, bool allow_writes)
+{
+}
+#endif
+
 /* buf_table.c */
 extern Size BufTableShmemSize(int size);
 extern void InitBufTable(int size);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 3065ff5be71..42a12688b9a 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -63,6 +63,8 @@ typedef enum
 	SHMEM_TYPE_MMAP,
 }			PGShmemType;
 
+extern PGDLLIMPORT int huge_pages_status;
+
 #ifndef WIN32
 extern PGDLLIMPORT unsigned long UsedShmemSegID;
 #else
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 56761a8eedc..0e02d8e1e9b 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -135,6 +135,16 @@ BufferManagerShmemInit(void)
 							 LWTRANCHE_BUFFER_CONTENT);
 
 			ConditionVariableInit(BufferDescriptorGetIOCV(buf));
+
+			/*
+			 * Unused buffers are inaccessible. But if we're not enforcing
+			 * that buffers are inaccessible without a pin, we won't mark
+			 * buffers as accessible during pinning, so we had better not
+			 * make them initially inaccessible.
+			 */
+#if defined(ENFORCE_BUFFER_PROT) && defined(ENFORCE_BUFFER_PROT_READ)
+			SetBufferProtection(i + 1, false, false);
+#endif
 		}
 
 		/* Correct last entry of linked list */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a91e4a7eff5..ee8078fc06d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,10 @@
 #include <sys/file.h>
 #include <unistd.h>
 
+#ifdef ENFORCE_BUFFER_PROT
+#include <sys/mman.h>			/* for mprotect() */
+#endif							/* ENFORCE_BUFFER_PROT */
+
 #include "access/tableam.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
@@ -53,6 +57,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/read_stream.h"
 #include "storage/smgr.h"
@@ -1065,8 +1070,6 @@ ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
 
 	if (need_to_zero)
 	{
-		memset(BufferGetPage(buffer), 0, BLCKSZ);
-
 		/*
 		 * Grab the buffer content lock before marking the page as valid, to
 		 * make sure that no other backend sees the zeroed page before the
@@ -1079,7 +1082,12 @@ ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
 		 * already valid.)
 		 */
 		if (!isLocalBuf)
+		{
 			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
+			SetBufferProtection(buffer, true, true);
+		}
+
+		memset(BufferGetPage(buffer), 0, BLCKSZ);
 
 		if (isLocalBuf)
 		{
@@ -1505,6 +1513,9 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 		io_first_block = blocknum + i;
 		io_buffers_len = 1;
 
+		if (persistence != RELPERSISTENCE_TEMP)
+			SetBufferProtection(io_buffers[0], true, true);
+
 		/*
 		 * How many neighboring-on-disk blocks can we scatter-read into other
 		 * buffers at the same time?  In this case we don't wait if we see an
@@ -1520,7 +1531,13 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 				   BufferGetBlockNumber(buffers[i]) + 1);
 
 			io_buffers[io_buffers_len] = buffers[++i];
-			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+			io_pages[io_buffers_len] = BufferGetBlock(buffers[i]);
+
+			/* smgrreadv() needs to modify the buffer */
+			if (persistence != RELPERSISTENCE_TEMP)
+				SetBufferProtection(io_buffers[io_buffers_len], true, true);
+
+			io_buffers_len++;
 		}
 
 		io_start = pgstat_prepare_io_time(track_io_timing);
@@ -1580,6 +1597,9 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 				TerminateBufferIO(bufHdr, false, BM_VALID, true);
 			}
 
+			if (persistence != RELPERSISTENCE_TEMP)
+				SetBufferProtection(io_buffers[j], true, false);
+
 			/* Report I/Os as completing individually. */
 			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
 											  operation->smgr->smgr_rlocator.locator.spcOid,
@@ -2233,7 +2253,9 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 		buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
 
 		/* new buffers are zero-filled */
+		SetBufferProtection(buffers[i], true, true);
 		MemSet((char *) buf_block, 0, BLCKSZ);
+		SetBufferProtection(buffers[i], true, false);
 	}
 
 	/*
@@ -2460,7 +2482,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 		}
 
 		if (lock)
-			LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
+			LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 		TerminateBufferIO(buf_hdr, false, BM_VALID, true);
 	}
@@ -2719,6 +2741,10 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 				 * non-accessible in any case.
 				 */
 				VALGRIND_MAKE_MEM_DEFINED(BufHdrGetBlock(buf), BLCKSZ);
+
+#if defined(ENFORCE_BUFFER_PROT) && defined(ENFORCE_BUFFER_PROT_READ)
+				SetBufferProtection(BufferDescriptorGetBuffer(buf), true, false);
+#endif
 				break;
 			}
 		}
@@ -2791,6 +2817,10 @@ PinBuffer_Locked(BufferDesc *buf)
 	 */
 	VALGRIND_MAKE_MEM_DEFINED(BufHdrGetBlock(buf), BLCKSZ);
 
+#if defined(ENFORCE_BUFFER_PROT) && defined(ENFORCE_BUFFER_PROT_READ)
+	SetBufferProtection(BufferDescriptorGetBuffer(buf), true, false);
+#endif
+
 	/*
 	 * Since we hold the buffer spinlock, we can update the buffer state and
 	 * release the lock in one operation.
@@ -2850,6 +2880,10 @@ UnpinBufferNoOwner(BufferDesc *buf)
 		 */
 		VALGRIND_MAKE_MEM_NOACCESS(BufHdrGetBlock(buf), BLCKSZ);
 
+#if defined(ENFORCE_BUFFER_PROT) && defined(ENFORCE_BUFFER_PROT_READ)
+		SetBufferProtection(BufferDescriptorGetBuffer(buf), false, false);
+#endif
+
 		/* I'd better not still hold the buffer content lock */
 		Assert(!LWLockHeldByMe(BufferDescriptorGetContentLock(buf)));
 
@@ -3670,6 +3704,69 @@ CheckForBufferLeaks(void)
 #endif
 }
 
+/*
+ * To verify that we are following buffer locking rules, we can make pages
+ * inaccessible or read-only when we don't have sufficient locks etc.
+ *
+ * XXX: It might be possible to fold the VALGRIND_MAKE_MEM_NOACCESS() /
+ * VALGRIND_MAKE_MEM_DEFINED() calls into this.
+ */
+#ifdef ENFORCE_BUFFER_PROT
+void
+SetBufferProtection(Buffer buf, bool allow_reads, bool allow_writes)
+{
+	static long	pagesz = 0;
+	int			prot = PROT_NONE;
+	int			rc;
+
+	Assert(huge_pages_status != HUGE_PAGES_UNKNOWN &&
+		   huge_pages_status != HUGE_PAGES_TRY);
+	StaticAssertStmt(PROT_NONE == 0, "that can't be right");
+
+	if (unlikely(pagesz == 0))
+	{
+		pagesz = sysconf(_SC_PAGESIZE);
+
+		elog(DEBUG1, "sysconf(_SC_PAGESIZE) = %ld", pagesz);
+		if (pagesz == -1)
+		{
+			elog(ERROR, "sysconf(_SC_PAGESIZE) failed: %m");
+		}
+		else if (pagesz > BLCKSZ)
+		{
+			elog(DEBUG1, "pagesz > BLCKSZ, disabling buffer protection mode");
+			pagesz = -1;
+		}
+		else if (BLCKSZ % pagesz != 0)
+		{
+			elog(DEBUG1, "BLCKSZ %% pagesz != 0, disabling buffer protection mode");
+			pagesz = -1;
+		}
+		else if (huge_pages_status == HUGE_PAGES_ON)
+		{
+			/* can't set status in a granular enough way */
+			elog(DEBUG1, "huge pages enabled, disabling buffer protection mode");
+			pagesz = -1;
+		}
+	}
+
+	/* disabled */
+	if (pagesz == -1)
+		return;
+
+	if (allow_reads)
+		prot |= PROT_READ;
+
+	if (allow_writes)
+		prot |= PROT_WRITE;
+
+	rc = mprotect(BufferGetBlock(buf), BLCKSZ, prot);
+
+	if (rc != 0)
+		elog(ERROR, "mprotect(%d, %d) failed: %m", buf, prot);
+}
+#endif							/* ENFORCE_BUFFER_PROT */
+
 /*
  * Helper routine to issue warnings when a buffer is unexpectedly pinned
  */
@@ -3865,7 +3962,10 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	bufBlock = BufHdrGetBlock(buf);
 
 	/* Update page checksum if desired. */
+	SetBufferProtection(BufferDescriptorGetBuffer(buf), true, true);
 	PageSetChecksum((Page) bufBlock, buf->tag.blockNum);
+	/* FIXME: could theoretically be exclusively locked */
+	SetBufferProtection(BufferDescriptorGetBuffer(buf), true, false);
 
 	io_start = pgstat_prepare_io_time(track_io_timing);
 
@@ -5193,19 +5293,38 @@ void
 LockBuffer(Buffer buffer, int mode)
 {
 	BufferDesc *buf;
+	LWLock	   *content_lock;
 
 	Assert(BufferIsPinned(buffer));
 	if (BufferIsLocal(buffer))
 		return;					/* local buffers need no lock */
 
 	buf = GetBufferDescriptor(buffer - 1);
+	content_lock = BufferDescriptorGetContentLock(buf);
 
 	if (mode == BUFFER_LOCK_UNLOCK)
-		LWLockRelease(BufferDescriptorGetContentLock(buf));
+	{
+#ifdef ENFORCE_BUFFER_PROT
+		bool		was_exclusive;
+
+		was_exclusive = LWLockHeldByMeInMode(content_lock, LW_EXCLUSIVE);
+#endif							/* ENFORCE_BUFFER_PROT */
+
+		LWLockRelease(content_lock);
+
+#ifdef ENFORCE_BUFFER_PROT
+		if (was_exclusive)
+			SetBufferProtection(buffer, true, false);
+#endif							/* ENFORCE_BUFFER_PROT */
+	}
 	else if (mode == BUFFER_LOCK_SHARE)
-		LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_SHARED);
+		LWLockAcquire(content_lock, LW_SHARED);
 	else if (mode == BUFFER_LOCK_EXCLUSIVE)
-		LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
+	{
+		LWLockAcquire(content_lock, LW_EXCLUSIVE);
+
+		SetBufferProtection(buffer, true, true);
+	}
 	else
 		elog(ERROR, "unrecognized buffer lock mode: %d", mode);
 }
@@ -5219,6 +5338,7 @@ bool
 ConditionalLockBuffer(Buffer buffer)
 {
 	BufferDesc *buf;
+	bool		ret;
 
 	Assert(BufferIsPinned(buffer));
 	if (BufferIsLocal(buffer))
@@ -5226,8 +5346,13 @@ ConditionalLockBuffer(Buffer buffer)
 
 	buf = GetBufferDescriptor(buffer - 1);
 
-	return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
-									LW_EXCLUSIVE);
+	ret = LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+								   LW_EXCLUSIVE);
+
+	if (ret)
+		SetBufferProtection(buffer, true, true);
+
+	return ret;
 }
 
 /*
@@ -6336,6 +6461,11 @@ BufferPrepareToSetHintBits(Buffer buffer)
 		{
 			ResourceOwnerRememberBufferSettingHints(CurrentResourceOwner, buffer);
 
+#ifdef ENFORCE_BUFFER_PROT
+			if (!LWLockHeldByMeInMode(BufferDescriptorGetContentLock(desc), LW_EXCLUSIVE))
+				SetBufferProtection(buffer, true, true);
+#endif							/* ENFORCE_BUFFER_PROT */
+
 			return true;
 		}
 	}
@@ -6411,6 +6541,11 @@ BufferFinishSetHintBits(Buffer buffer, bool mark_dirty, bool buffer_std)
 	 * know if there is somebody waiting.
 	 */
 	ConditionVariableBroadcast(BufferDescriptorGetIOCV(desc));
+
+#ifdef ENFORCE_BUFFER_PROT
+	if (!LWLockHeldByMeInMode(BufferDescriptorGetContentLock(desc), LW_EXCLUSIVE))
+		SetBufferProtection(buffer, true, false);
+#endif							/* ENFORCE_BUFFER_PROT */
 }
 
 /*
@@ -6457,14 +6592,25 @@ BufferSetHintBits16(Buffer buffer, uint16 *ptr, uint16 val)
 	}
 	else
 	{
+#ifdef ENFORCE_BUFFER_PROT
+		bool exclusive;
+#endif
+
 		desc = GetBufferDescriptor(buffer - 1);
+
+#ifdef ENFORCE_BUFFER_PROT
+		exclusive = LWLockHeldByMeInMode(BufferDescriptorGetContentLock(desc),
+										 LW_EXCLUSIVE);
+
+		if (!exclusive)
+			SetBufferProtection(buffer, true, true);
+#endif
+
 		buf_state = LockBufHdr(desc);
 
 		if (likely((buf_state & BM_IO_IN_PROGRESS) == 0))
 		{
-
 			*ptr = val;
-
 			did_set = true;
 		}
 
@@ -6473,5 +6619,10 @@ BufferSetHintBits16(Buffer buffer, uint16 *ptr, uint16 val)
 		is_dirty = (buf_state & (BM_DIRTY | BM_JUST_DIRTIED)) == (BM_DIRTY | BM_JUST_DIRTIED);
 		if (did_set && !is_dirty)
 			MarkBufferDirtyHintImpl(buffer, true, true);
+
+#ifdef ENFORCE_BUFFER_PROT
+		if (!exclusive)
+			SetBufferProtection(buffer, true, false);
+#endif
 	}
 }
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8a67f01200c..4c94c20f0cb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -561,7 +561,7 @@ static int	ssl_renegotiation_limit;
  */
 int			huge_pages = HUGE_PAGES_TRY;
 int			huge_page_size;
-static int	huge_pages_status = HUGE_PAGES_UNKNOWN;
+int			huge_pages_status = HUGE_PAGES_UNKNOWN;
 
 /*
  * These variables are all dummies that don't do anything, except in some
-- 
2.45.2.746.g06e570c0df.dirty

#18Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andres Freund (#17)
12 attachment(s)
Re: AIO writes vs hint bits vs checksums

Here's a rebase of these patches.

The last patch, to use mprotect() to detect bad buffer accesses, needed
some changes because of AIO. I disabled the protection completely for
I/O worker processes; it was difficult to be more fine-grained than
that. I didn't test it with io_uring.
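
For anyone who hasn't seen the technique before, here is a minimal,
self-contained sketch of the mprotect() idea - illustrative only, not
the patch's SetBufferProtection() - showing how a block can be toggled
between read-only and read-write so that a stray write faults
immediately. It assumes the memory is mmap()ed and that BLCKSZ is a
multiple of the OS page size; with huge pages the kernel can't change
protection at that granularity, which is why the mode gets disabled in
that case.

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(void)
{
	long		pagesz = sysconf(_SC_PAGESIZE);
	char	   *buf;

	/* protection only works at whole-OS-page granularity */
	if (pagesz <= 0 || BLCKSZ % pagesz != 0)
		return 0;

	buf = mmap(NULL, BLCKSZ, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* writes allowed, e.g. while holding an exclusive lock */
	memset(buf, 0, BLCKSZ);

	/* drop to read-only, e.g. after releasing the exclusive lock */
	if (mprotect(buf, BLCKSZ, PROT_READ) != 0)
		return 1;

	/* buf[0] = 1;  -- a stray write here would die with SIGSEGV */

	/* re-allow writes before a legitimate modification */
	if (mprotect(buf, BLCKSZ, PROT_READ | PROT_WRITE) != 0)
		return 1;

	munmap(buf, BLCKSZ);
	return 0;
}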

I went ahead and committed the "heapam: Only set tuple's block once per
page in pagemode" patch, because it was trivial and independent of the rest.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

0001-Add-very-basic-test-for-kill_prior_tuples.patchtext/x-patch; charset=UTF-8; name=0001-Add-very-basic-test-for-kill_prior_tuples.patchDownload
From f089fbd05f90e3f77407240428073c1d001e9fe8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 Oct 2024 15:48:50 -0400
Subject: [PATCH 01/12] Add very basic test for kill_prior_tuples

Previously our tests did not exercise kill_prior_tuples for hash and gist. The
latter only because it does not work if all the dead tuples are on one
page (or something like that).

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 .../isolation/expected/index-killtuples.out   | 337 ++++++++++++++++++
 .../isolation/specs/index-killtuples.spec     | 127 +++++++
 2 files changed, 464 insertions(+)
 create mode 100644 src/test/isolation/expected/index-killtuples.out
 create mode 100644 src/test/isolation/specs/index-killtuples.spec

diff --git a/src/test/isolation/expected/index-killtuples.out b/src/test/isolation/expected/index-killtuples.out
new file mode 100644
index 00000000000..af46a3c8ead
--- /dev/null
+++ b/src/test/isolation/expected/index-killtuples.out
@@ -0,0 +1,337 @@
+Parsed test spec with 1 sessions
+
+starting permutation: create_table fill_500 create_btree flush disable_seq disable_bitmap measure access flush result measure access flush result delete flush measure access flush result measure access flush result drop_table
+step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
+step fill_500: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 500) g(i);
+step create_btree: CREATE INDEX kill_prior_tuple_btree ON kill_prior_tuple USING btree (key);
+step flush: SELECT FROM pg_stat_force_next_flush();
+step disable_seq: SET enable_seqscan = false;
+step disable_bitmap: SET enable_bitmapscan = false;
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                         
+-----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_btree on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                            
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                         
+-----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_btree on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                            
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step delete: DELETE FROM kill_prior_tuple;
+step flush: SELECT FROM pg_stat_force_next_flush();
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                         
+-----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_btree on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                            
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                         
+-----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_btree on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                            
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                0
+(1 row)
+
+step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
+
+starting permutation: create_table fill_500 create_ext_btree_gist create_gist flush disable_seq disable_bitmap measure access flush result measure access flush result delete flush measure access flush result measure access flush result drop_table drop_ext_btree_gist
+step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
+step fill_500: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 500) g(i);
+step create_ext_btree_gist: CREATE EXTENSION btree_gist;
+step create_gist: CREATE INDEX kill_prior_tuple_gist ON kill_prior_tuple USING gist (key);
+step flush: SELECT FROM pg_stat_force_next_flush();
+step disable_seq: SET enable_seqscan = false;
+step disable_bitmap: SET enable_bitmapscan = false;
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step delete: DELETE FROM kill_prior_tuple;
+step flush: SELECT FROM pg_stat_force_next_flush();
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                0
+(1 row)
+
+step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
+step drop_ext_btree_gist: DROP EXTENSION btree_gist;
+
+starting permutation: create_table fill_10 create_ext_btree_gist create_gist flush disable_seq disable_bitmap measure access flush result measure access flush result delete flush measure access flush result measure access flush result drop_table drop_ext_btree_gist
+step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
+step fill_10: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 10) g(i);
+step create_ext_btree_gist: CREATE EXTENSION btree_gist;
+step create_gist: CREATE INDEX kill_prior_tuple_gist ON kill_prior_tuple USING gist (key);
+step flush: SELECT FROM pg_stat_force_next_flush();
+step disable_seq: SET enable_seqscan = false;
+step disable_bitmap: SET enable_bitmapscan = false;
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step delete: DELETE FROM kill_prior_tuple;
+step flush: SELECT FROM pg_stat_force_next_flush();
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_gist on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
+step drop_ext_btree_gist: DROP EXTENSION btree_gist;
+
+starting permutation: create_table fill_500 create_hash flush disable_seq disable_bitmap measure access flush result measure access flush result delete flush measure access flush result measure access flush result drop_table
+step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
+step fill_500: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 500) g(i);
+step create_hash: CREATE INDEX kill_prior_tuple_hash ON kill_prior_tuple USING hash (key);
+step flush: SELECT FROM pg_stat_force_next_flush();
+step disable_seq: SET enable_seqscan = false;
+step disable_bitmap: SET enable_bitmapscan = false;
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_hash on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_hash on kill_prior_tuple (actual rows=1 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step delete: DELETE FROM kill_prior_tuple;
+step flush: SELECT FROM pg_stat_force_next_flush();
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_hash on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                                        
+----------------------------------------------------------------------------------
+Index Scan using kill_prior_tuple_hash on kill_prior_tuple (actual rows=0 loops=1)
+  Index Cond: (key = 1)                                                           
+(2 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                0
+(1 row)
+
+step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
+
+starting permutation: create_table fill_500 create_ext_btree_gin create_gin flush disable_seq delete flush measure access flush result measure access flush result drop_table drop_ext_btree_gin
+step create_table: CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null);
+step fill_500: INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 500) g(i);
+step create_ext_btree_gin: CREATE EXTENSION btree_gin;
+step create_gin: CREATE INDEX kill_prior_tuple_gin ON kill_prior_tuple USING gin (key);
+step flush: SELECT FROM pg_stat_force_next_flush();
+step disable_seq: SET enable_seqscan = false;
+step delete: DELETE FROM kill_prior_tuple;
+step flush: SELECT FROM pg_stat_force_next_flush();
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                             
+-----------------------------------------------------------------------
+Bitmap Heap Scan on kill_prior_tuple (actual rows=0 loops=1)           
+  Recheck Cond: (key = 1)                                              
+  Heap Blocks: exact=1                                                 
+  ->  Bitmap Index Scan on kill_prior_tuple_gin (actual rows=1 loops=1)
+        Index Cond: (key = 1)                                          
+(5 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step measure: UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple');
+step access: EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1;
+QUERY PLAN                                                             
+-----------------------------------------------------------------------
+Bitmap Heap Scan on kill_prior_tuple (actual rows=0 loops=1)           
+  Recheck Cond: (key = 1)                                              
+  Heap Blocks: exact=1                                                 
+  ->  Bitmap Index Scan on kill_prior_tuple_gin (actual rows=1 loops=1)
+        Index Cond: (key = 1)                                          
+(5 rows)
+
+step flush: SELECT FROM pg_stat_force_next_flush();
+step result: SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple';
+new_heap_accesses
+-----------------
+                1
+(1 row)
+
+step drop_table: DROP TABLE IF EXISTS kill_prior_tuple;
+step drop_ext_btree_gin: DROP EXTENSION btree_gin;
diff --git a/src/test/isolation/specs/index-killtuples.spec b/src/test/isolation/specs/index-killtuples.spec
new file mode 100644
index 00000000000..3896c935290
--- /dev/null
+++ b/src/test/isolation/specs/index-killtuples.spec
@@ -0,0 +1,127 @@
+# Basic testing of killtuples / kill_prior_tuples / all_dead handling
+# for various index AMs
+#
+# This tests just enough to ensure that the kill* routines are actually
+# executed and does something approximately reasonble. It's *not* sufficient
+# testing for adding killitems support to a new AM!
+#
+# This doesn't really need to be an isolation test, it could be written as a
+# regular regression test. However, writing it as an isolation test ends up a
+# *lot* less verbose.
+
+setup
+{
+    CREATE TABLE counter(heap_accesses int);
+    INSERT INTO counter(heap_accesses) VALUES (0);
+}
+
+teardown
+{
+    DROP TABLE counter;
+}
+
+session s1
+# to ensure GUCs are reset
+setup { RESET ALL; }
+
+step disable_seq { SET enable_seqscan = false; }
+
+step disable_bitmap { SET enable_bitmapscan = false; }
+
+# use a temporary table to make sure no other session can interfere with
+# visibility determinations
+step create_table { CREATE TEMPORARY TABLE kill_prior_tuple(key int not null, cat text not null); }
+
+step fill_10 { INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 10) g(i); }
+
+step fill_500 { INSERT INTO kill_prior_tuple(key, cat) SELECT g.i, 'a' FROM generate_series(1, 500) g(i); }
+
+# column-less select to make output easier to read
+step flush { SELECT FROM pg_stat_force_next_flush(); }
+
+step measure { UPDATE counter SET heap_accesses = (SELECT heap_blks_read + heap_blks_hit FROM pg_statio_all_tables WHERE relname = 'kill_prior_tuple'); }
+
+step result { SELECT heap_blks_read + heap_blks_hit - counter.heap_accesses AS new_heap_accesses FROM counter, pg_statio_all_tables WHERE relname = 'kill_prior_tuple'; }
+
+step access { EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF, SUMMARY OFF) SELECT * FROM kill_prior_tuple WHERE key = 1; }
+
+step delete { DELETE FROM kill_prior_tuple; }
+
+step drop_table { DROP TABLE IF EXISTS kill_prior_tuple; }
+
+### steps for testing btree indexes ###
+step create_btree { CREATE INDEX kill_prior_tuple_btree ON kill_prior_tuple USING btree (key); }
+
+### steps for testing gist indexes ###
+# Creating the extensions takes time, so we don't want to do so when testing
+# other AMs
+step create_ext_btree_gist { CREATE EXTENSION btree_gist; }
+step drop_ext_btree_gist { DROP EXTENSION btree_gist; }
+step create_gist { CREATE INDEX kill_prior_tuple_gist ON kill_prior_tuple USING gist (key); }
+
+### steps for testing gin indexes ###
+# See create_ext_btree_gist
+step create_ext_btree_gin { CREATE EXTENSION btree_gin; }
+step drop_ext_btree_gin { DROP EXTENSION btree_gin; }
+step create_gin { CREATE INDEX kill_prior_tuple_gin ON kill_prior_tuple USING gin (key); }
+
+### steps for testing hash indexes ###
+step create_hash { CREATE INDEX kill_prior_tuple_hash ON kill_prior_tuple USING hash (key); }
+
+
+# test killtuples with btree index
+permutation
+  create_table fill_500 create_btree flush
+  disable_seq disable_bitmap
+  # show each access to non-deleted tuple increments heap_blks_*
+  measure access flush result
+  measure access flush result
+  delete flush
+  # first access after accessing deleted tuple still needs to access heap
+  measure access flush result
+  # but after kill_prior_tuple did its thing, we shouldn't access heap anymore
+  measure access flush result
+  drop_table
+
+# Same as first permutation, except testing gist
+permutation
+  create_table fill_500 create_ext_btree_gist create_gist flush
+  disable_seq disable_bitmap
+  measure access flush result
+  measure access flush result
+  delete flush
+  measure access flush result
+  measure access flush result
+  drop_table drop_ext_btree_gist
+
+# Test gist, but with fewer rows - shows that killitems doesn't work anymore!
+permutation
+  create_table fill_10 create_ext_btree_gist create_gist flush
+  disable_seq disable_bitmap
+  measure access flush result
+  measure access flush result
+  delete flush
+  measure access flush result
+  measure access flush result
+  drop_table drop_ext_btree_gist
+
+# Same as first permutation, except testing hash
+permutation
+  create_table fill_500 create_hash flush
+  disable_seq disable_bitmap
+  measure access flush result
+  measure access flush result
+  delete flush
+  measure access flush result
+  measure access flush result
+  drop_table
+
+# Similar to first permutation, except that gin does not have killtuples support
+permutation
+  create_table fill_500 create_ext_btree_gin create_gin flush
+  disable_seq
+  delete flush
+  measure access flush result
+  # will still fetch from heap
+  measure access flush result
+  drop_table drop_ext_btree_gin
-- 
2.39.5

0002-heapam-Move-logic-to-handle-HEAP_MOVED-into-a-helper.patchtext/x-patch; charset=UTF-8; name=0002-heapam-Move-logic-to-handle-HEAP_MOVED-into-a-helper.patchDownload
From 2fe4abc52aa0e4351fdcc05927280326af28b5b3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 23 Sep 2024 12:23:33 -0400
Subject: [PATCH 02/12] heapam: Move logic to handle HEAP_MOVED into a helper
 function

Previously this was handled in six nearly identical copies and one very
similar one.

The helper function errors out when encountering a
HEAP_MOVED_OFF/HEAP_MOVED_IN tuple with xvac considered current or
in-progress. It'd be preferable to do that change separately, but otherwise
it'd not be possible to deduplicate the handling in
HeapTupleSatisfiesVacuum().

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/heap/heapam_visibility.c | 307 ++++----------------
 1 file changed, 61 insertions(+), 246 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 05f6946fe60..4fefcbca5f5 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -144,6 +144,55 @@ HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
 	SetHintBits(tuple, buffer, infomask, xid);
 }
 
+/*
+ * If HEAP_MOVED_OFF or HEAP_MOVED_IN are set on the tuple, remove them and
+ * adjust hint bits. See the comment for SetHintBits() for more background.
+ *
+ * This helper returns false if the row ought to be invisible, true otherwise.
+ */
+static inline bool
+HeapTupleCleanMoved(HeapTupleHeader tuple, Buffer buffer)
+{
+	TransactionId xvac;
+
+	/* only used by pre-9.0 binary upgrades */
+	if (likely(!(tuple->t_infomask & (HEAP_MOVED_OFF | HEAP_MOVED_IN))))
+		return true;
+
+	xvac = HeapTupleHeaderGetXvac(tuple);
+
+	if (TransactionIdIsCurrentTransactionId(xvac))
+		elog(ERROR, "encountered tuple with HEAP_MOVED considered current");
+
+	if (TransactionIdIsInProgress(xvac))
+		elog(ERROR, "encountered tuple with HEAP_MOVED considered in-progress");
+
+	if (tuple->t_infomask & HEAP_MOVED_OFF)
+	{
+		if (TransactionIdDidCommit(xvac))
+		{
+			SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
+						InvalidTransactionId);
+			return false;
+		}
+		SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
+					InvalidTransactionId);
+	}
+	else if (tuple->t_infomask & HEAP_MOVED_IN)
+	{
+		if (TransactionIdDidCommit(xvac))
+			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
+						InvalidTransactionId);
+		else
+		{
+			SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
+						InvalidTransactionId);
+			return false;
+		}
+	}
+
+	return true;
+}
 
 /*
  * HeapTupleSatisfiesSelf
@@ -179,45 +228,8 @@ HeapTupleSatisfiesSelf(HeapTuple htup, Snapshot snapshot, Buffer buffer)
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return false;
 
-		/* Used by pre-9.0 binary upgrades */
-		if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return false;
-			if (!TransactionIdIsInProgress(xvac))
-			{
-				if (TransactionIdDidCommit(xvac))
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			}
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (!TransactionIdIsCurrentTransactionId(xvac))
-			{
-				if (TransactionIdIsInProgress(xvac))
-					return false;
-				if (TransactionIdDidCommit(xvac))
-					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-								InvalidTransactionId);
-				else
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-			}
-		}
+		if (!HeapTupleCleanMoved(tuple, buffer))
+			return false;
 		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
 		{
 			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
@@ -372,45 +384,8 @@ HeapTupleSatisfiesToast(HeapTuple htup, Snapshot snapshot,
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return false;
 
-		/* Used by pre-9.0 binary upgrades */
-		if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return false;
-			if (!TransactionIdIsInProgress(xvac))
-			{
-				if (TransactionIdDidCommit(xvac))
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			}
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (!TransactionIdIsCurrentTransactionId(xvac))
-			{
-				if (TransactionIdIsInProgress(xvac))
-					return false;
-				if (TransactionIdDidCommit(xvac))
-					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-								InvalidTransactionId);
-				else
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-			}
-		}
+		if (!HeapTupleCleanMoved(tuple, buffer))
+			return false;
 
 		/*
 		 * An invalid Xmin can be left behind by a speculative insertion that
@@ -468,45 +443,8 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return TM_Invisible;
 
-		/* Used by pre-9.0 binary upgrades */
-		if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return TM_Invisible;
-			if (!TransactionIdIsInProgress(xvac))
-			{
-				if (TransactionIdDidCommit(xvac))
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return TM_Invisible;
-				}
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			}
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (!TransactionIdIsCurrentTransactionId(xvac))
-			{
-				if (TransactionIdIsInProgress(xvac))
-					return TM_Invisible;
-				if (TransactionIdDidCommit(xvac))
-					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-								InvalidTransactionId);
-				else
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return TM_Invisible;
-				}
-			}
-		}
+		else if (!HeapTupleCleanMoved(tuple, buffer))
+			return TM_Invisible;
 		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
 		{
 			if (HeapTupleHeaderGetCmin(tuple) >= curcid)
@@ -756,45 +694,8 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return false;
 
-		/* Used by pre-9.0 binary upgrades */
-		if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return false;
-			if (!TransactionIdIsInProgress(xvac))
-			{
-				if (TransactionIdDidCommit(xvac))
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			}
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (!TransactionIdIsCurrentTransactionId(xvac))
-			{
-				if (TransactionIdIsInProgress(xvac))
-					return false;
-				if (TransactionIdDidCommit(xvac))
-					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-								InvalidTransactionId);
-				else
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-			}
-		}
+		if (!HeapTupleCleanMoved(tuple, buffer))
+			return false;
 		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
 		{
 			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
@@ -979,45 +880,8 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return false;
 
-		/* Used by pre-9.0 binary upgrades */
-		if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return false;
-			if (!XidInMVCCSnapshot(xvac, snapshot))
-			{
-				if (TransactionIdDidCommit(xvac))
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			}
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (!TransactionIdIsCurrentTransactionId(xvac))
-			{
-				if (XidInMVCCSnapshot(xvac, snapshot))
-					return false;
-				if (TransactionIdDidCommit(xvac))
-					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-								InvalidTransactionId);
-				else
-				{
-					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-								InvalidTransactionId);
-					return false;
-				}
-			}
-		}
+		if (!HeapTupleCleanMoved(tuple, buffer))
+			return false;
 		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
 		{
 			if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)
@@ -1222,57 +1086,8 @@ HeapTupleSatisfiesVacuumHorizon(HeapTuple htup, Buffer buffer, TransactionId *de
 	{
 		if (HeapTupleHeaderXminInvalid(tuple))
 			return HEAPTUPLE_DEAD;
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_OFF)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return HEAPTUPLE_DELETE_IN_PROGRESS;
-			if (TransactionIdIsInProgress(xvac))
-				return HEAPTUPLE_DELETE_IN_PROGRESS;
-			if (TransactionIdDidCommit(xvac))
-			{
-				SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-							InvalidTransactionId);
-				return HEAPTUPLE_DEAD;
-			}
-			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-						InvalidTransactionId);
-		}
-		/* Used by pre-9.0 binary upgrades */
-		else if (tuple->t_infomask & HEAP_MOVED_IN)
-		{
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);
-
-			if (TransactionIdIsCurrentTransactionId(xvac))
-				return HEAPTUPLE_INSERT_IN_PROGRESS;
-			if (TransactionIdIsInProgress(xvac))
-				return HEAPTUPLE_INSERT_IN_PROGRESS;
-			if (TransactionIdDidCommit(xvac))
-				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-							InvalidTransactionId);
-			else
-			{
-				SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-							InvalidTransactionId);
-				return HEAPTUPLE_DEAD;
-			}
-		}
-		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
-		{
-			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
-				return HEAPTUPLE_INSERT_IN_PROGRESS;
-			/* only locked? run infomask-only check first, for performance */
-			if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) ||
-				HeapTupleHeaderIsOnlyLocked(tuple))
-				return HEAPTUPLE_INSERT_IN_PROGRESS;
-			/* inserted and then deleted by same xact */
-			if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuple)))
-				return HEAPTUPLE_DELETE_IN_PROGRESS;
-			/* deleting subtransaction must have aborted */
-			return HEAPTUPLE_INSERT_IN_PROGRESS;
-		}
+		else if (!HeapTupleCleanMoved(tuple, buffer))
+			return HEAPTUPLE_DEAD;
 		else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuple)))
 		{
 			/*
-- 
2.39.5

0003-bufmgr-Add-BufferLockHeldByMe.patchtext/x-patch; charset=UTF-8; name=0003-bufmgr-Add-BufferLockHeldByMe.patchDownload
From 7c88c351bcca4b85fe0b56810e6d2f612fb622cf Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 23 Sep 2024 12:51:37 -0400
Subject: [PATCH 03/12] bufmgr: Add BufferLockHeldByMe()

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/buffer/bufmgr.c | 41 +++++++++++++++++++++++++++++
 src/include/storage/bufmgr.h        |  1 +
 2 files changed, 42 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f9681d09e1e..8be8908c39c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5873,6 +5873,47 @@ IsBufferCleanupOK(Buffer buffer)
 	return false;
 }
 
+/*
+ * BufferLockHeldByMe - does the backend have the buffer locked?
+ *
+ * This likely should only be used for asserts etc.
+ *
+ * Note that this can only be called for non-temp buffers - there is no
+ * correct value to return for temporary buffers. One might think that just
+ * returning true for temp buffers would work, but the caller might assert
+ * that a lock is *not* held.
+ */
+bool
+BufferLockHeldByMe(Buffer buffer, int mode)
+{
+	BufferDesc *buf;
+	LWLockMode	lwmode;
+
+	/*
+	 * Can't hold a lock without a pin, there never should be uncertainty
+	 * about having a pin.
+	 */
+	Assert(BufferIsPinned(buffer));
+
+	/* there'd be no correct value to return */
+	Assert(!BufferIsLocal(buffer));
+
+	buf = GetBufferDescriptor(buffer - 1);
+
+	if (mode == BUFFER_LOCK_EXCLUSIVE)
+		lwmode = LW_EXCLUSIVE;
+	else if (mode == BUFFER_LOCK_SHARE)
+		lwmode = LW_SHARED;
+	else
+	{
+		Assert(false);
+		pg_unreachable();
+		lwmode = LW_EXCLUSIVE;  /* assuage compiler */
+	}
+
+	return LWLockHeldByMeInMode(BufferDescriptorGetContentLock(buf), lwmode);
+}
+
 
 /*
  *	Functions for buffer I/O handling
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index f2192ceb271..7c6b6df29aa 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -294,6 +294,7 @@ extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern bool BufferLockHeldByMe(Buffer buffer, int mode);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
-- 
2.39.5

0004-heapam-Use-exclusive-lock-on-old-page-in-CLUSTER.patchtext/x-patch; charset=UTF-8; name=0004-heapam-Use-exclusive-lock-on-old-page-in-CLUSTER.patchDownload
From 8a3e8e8bd038ae9ff2f7972b29cd8a4f66c93813 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 11:45:36 -0400
Subject: [PATCH 04/12] heapam: Use exclusive lock on old page in CLUSTER

To be able to guarantee that we can set the hint bit, acquire an exclusive
lock on the old buffer. We need the hint bits to be set as otherwise
reform_and_rewrite_tuple() -> rewrite_heap_tuple() -> heap_freeze_tuple() will
get confused.

It'd be better if we somehow could avoid setting hint bits on the old
page. One reason to use VACUUM FULL is very bloated tables - rewriting most
of the old table during VACUUM FULL doesn't exactly help.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/heap/heapam_handler.c    | 13 ++++++++++++-
 src/backend/access/heap/heapam_visibility.c |  7 +++++++
 src/backend/storage/buffer/bufmgr.c         |  2 +-
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 24d3765aa20..d21d7d4cef7 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -837,7 +837,18 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 		tuple = ExecFetchSlotHeapTuple(slot, false, NULL);
 		buf = hslot->buffer;
 
-		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		/*
+		 * To be able to guarantee that we can set the hint bit, acquire an
+		 * exclusive lock on the old buffer. We need the hint bits to be set
+		 * as otherwise reform_and_rewrite_tuple() -> rewrite_heap_tuple() ->
+		 * heap_freeze_tuple() will get confused.
+		 *
+		 * It'd be better if we somehow could avoid setting hint bits on the
+		 * old page. One reason to use VACUUM FULL are very bloated tables -
+		 * rewriting most of the old table before during VACUUM FULL doesn't
+		 * exactly help...
+		 */
+		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 		switch (HeapTupleSatisfiesVacuum(tuple, OldestXmin, buf))
 		{
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 4fefcbca5f5..71a410502b3 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -141,6 +141,13 @@ void
 HeapTupleSetHintBits(HeapTupleHeader tuple, Buffer buffer,
 					 uint16 infomask, TransactionId xid)
 {
+	/*
+	 * The uses from heapam.c rely on being able to perform the hint bit
+	 * updates, which can only be guaranteed if we are holding an exclusive
+	 * lock on the buffer - which all callers are doing.
+	 */
+	Assert(BufferIsLocal(buffer) || BufferLockHeldByMe(buffer, BUFFER_LOCK_EXCLUSIVE));
+
 	SetHintBits(tuple, buffer, infomask, xid);
 }
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8be8908c39c..5187f261423 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5908,7 +5908,7 @@ BufferLockHeldByMe(Buffer buffer, int mode)
 	{
 		Assert(false);
 		pg_unreachable();
-		lwmode = LW_EXCLUSIVE;  /* assuage compiler */
+		lwmode = LW_EXCLUSIVE;	/* assuage compiler */
 	}
 
 	return LWLockHeldByMeInMode(BufferDescriptorGetContentLock(buf), lwmode);
-- 
2.39.5

0005-bufmgr-Separate-slow-fast-path-of-LockBufHdr.patchtext/x-patch; charset=UTF-8; name=0005-bufmgr-Separate-slow-fast-path-of-LockBufHdr.patchDownload
From 037f1af8ade2235f07edf0dd84f2519e6d92a5f7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 19 Nov 2024 12:08:38 -0500
Subject: [PATCH 05/12] bufmgr: Separate slow/fast path of LockBufHdr

This is mainly beneficial for some upcoming patches that use LockBufHdr() in
reasonably common paths. But I do see a very small improvement when scanning a
large table fully resident in the kernel page cache.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/buffer/bufmgr.c |  8 ++++---
 src/include/storage/buf_internals.h | 33 ++++++++++++++++++++++++++++-
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5187f261423..dd61ac547b4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -6215,10 +6215,12 @@ rlocator_comparator(const void *p1, const void *p2)
 }
 
 /*
- * Lock buffer header - set BM_LOCKED in buffer state.
+ * Slow path helper for LockBufHdr().
+ *
+ * See comment in LockBufHdr() for the reason this is moved out of line.
  */
-uint32
-LockBufHdr(BufferDesc *desc)
+pg_noinline uint32
+LockBufHdrSlow(BufferDesc *desc)
 {
 	SpinDelayStatus delayStatus;
 	uint32		old_buf_state;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 72b36a4af26..4972afa38fc 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -371,7 +371,38 @@ BufferDescriptorGetContentLock(const BufferDesc *bdesc)
  * Functions for acquiring/releasing a shared buffer header's spinlock.  Do
  * not apply these to local buffers!
  */
-extern uint32 LockBufHdr(BufferDesc *desc);
+extern uint32 LockBufHdrSlow(BufferDesc *desc);
+
+/*
+ * Lock buffer header - set BM_LOCKED in buffer state.
+ *
+ * Implemented as a static inline in the header due to freelist.c using it in
+ * a reasonably common path.
+ */
+static inline uint32
+LockBufHdr(BufferDesc *desc)
+{
+	uint32		old_buf_state;
+
+	Assert(!BufferIsLocal(BufferDescriptorGetBuffer(desc)));
+
+	/* fast path, intended to be inlined into calling functions */
+
+	/* set BM_LOCKED flag */
+	old_buf_state = pg_atomic_fetch_or_u32(&desc->state, BM_LOCKED);
+	if (likely(!(old_buf_state & BM_LOCKED)))
+		return old_buf_state | BM_LOCKED;
+
+	/*
+	 * The overhead of the spin delay code and the retry loop itself would be
+	 * noticeable due to the buffer header lock being taken very
+	 * frequently. Therefore it is moved into a separate function marked
+	 * pg_noinline.
+	 */
+
+	/* slow path with loop etc, not intended to be inlined */
+	return LockBufHdrSlow(desc);
+}
 
 static inline void
 UnlockBufHdr(BufferDesc *desc, uint32 buf_state)
-- 
2.39.5

0006-heapam-Add-batch-mode-mvcc-check-and-use-it-in-page-.patchtext/x-patch; charset=UTF-8; name=0006-heapam-Add-batch-mode-mvcc-check-and-use-it-in-page-.patchDownload
From 8acbbc7099bcdd419ed5461936f00acef498db97 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 13:16:36 -0400
Subject: [PATCH 06/12] heapam: Add batch mode mvcc check and use it in page
 mode

There are two reasons for doing so:

1) It is generally faster to perform checks in a batched fashion and making
   sequential scans faster is nice.

2) We would like to stop setting hint bits while pages are being written
   out. The necessary locking becomes visible for page mode scans if done for
   every tuple. With batching the overhead can be amortized to only happen
   once per page.

There are substantial further optimization opportunities along these
lines:

- Right now HeapTupleSatisfiesMVCCBatch() simply uses the single-tuple
  HeapTupleSatisfiesMVCC(), relying on the compiler to inline it. We could
  instead write an explicitly optimized version that avoids repeated xid
  tests.

- Introduce batched version of the serializability test

- Introduce batched version of HeapTupleSatisfiesVacuum

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/heap/heapam.c            | 104 ++++++++++++++++----
 src/backend/access/heap/heapam_visibility.c |  50 ++++++++++
 src/include/access/heapam.h                 |  28 ++++++
 src/tools/pgindent/typedefs.list            |   1 +
 4 files changed, 163 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index cedaa195cb6..1621f055822 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -508,42 +508,106 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
 					BlockNumber block, int lines,
 					bool all_visible, bool check_serializable)
 {
+	Oid			relid = RelationGetRelid(scan->rs_base.rs_rd);
+#ifdef BATCHMVCC_FEWER_ARGS
+	BatchMVCCState batchmvcc;
+#else
+	HeapTupleData tuples[MaxHeapTuplesPerPage];
+	bool		visible[MaxHeapTuplesPerPage];
+#endif
 	int			ntup = 0;
-	OffsetNumber lineoff;
+	int			nvis = 0;
+
+	/* page-at-a-time mode should have been disabled otherwise */
+	Assert(IsMVCCSnapshot(snapshot));
 
-	for (lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
+	/* first find all tuples on the page */
+	for (OffsetNumber lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
 	{
 		ItemId		lpp = PageGetItemId(page, lineoff);
-		HeapTupleData loctup;
-		bool		valid;
+		HeapTuple	tup;
 
-		if (!ItemIdIsNormal(lpp))
+		if (unlikely(!ItemIdIsNormal(lpp)))
 			continue;
 
-		loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
-		loctup.t_len = ItemIdGetLength(lpp);
-		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
-		ItemPointerSet(&(loctup.t_self), block, lineoff);
-
-		if (all_visible)
-			valid = true;
-		else
-			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
+		/*
+		 * If the page is not all-visible or we need to check serializability,
+		 * maintain enough state to be able to refind the tuple efficiently,
+		 * without again needing to extract it from the page.
+		 */
+		if (!all_visible || check_serializable)
+		{
+#ifdef BATCHMVCC_FEWER_ARGS
+			tup = &batchmvcc.tuples[ntup];
+#else
+			tup = &tuples[ntup];
+#endif
 
-		if (check_serializable)
-			HeapCheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
-												&loctup, buffer, snapshot);
+			tup->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+			tup->t_len = ItemIdGetLength(lpp);
+			tup->t_tableOid = relid;
+			ItemPointerSet(&(tup->t_self), block, lineoff);
+		}
 
-		if (valid)
+		/*
+		 * If the page is all-visible, these fields won't otherwise be
+		 * populated by the batch visibility check below.
+		 */
+		if (all_visible)
 		{
+			if (check_serializable)
+			{
+#ifdef BATCHMVCC_FEWER_ARGS
+				batchmvcc.visible[ntup] = true;
+#else
+				visible[ntup] = true;
+#endif
+			}
 			scan->rs_vistuples[ntup] = lineoff;
-			ntup++;
 		}
+
+		ntup++;
 	}
 
 	Assert(ntup <= MaxHeapTuplesPerPage);
 
-	return ntup;
+	/* unless the page is all-visible, test visibility for all tuples in one go */
+	if (all_visible)
+		nvis = ntup;
+	else
+		nvis = HeapTupleSatisfiesMVCCBatch(snapshot, buffer,
+										   ntup,
+#ifdef BATCHMVCC_FEWER_ARGS
+										   &batchmvcc,
+#else
+										   tuples, visible,
+#endif
+										   scan->rs_vistuples
+			);
+
+	/*
+	 * So far we don't have a batch API for testing serializability, so do so
+	 * one-by-one.
+	 */
+	if (check_serializable)
+	{
+		for (int i = 0; i < ntup; i++)
+		{
+#ifdef BATCHMVCC_FEWER_ARGS
+			HeapCheckForSerializableConflictOut(batchmvcc.visible[i],
+												scan->rs_base.rs_rd,
+												&batchmvcc.tuples[i],
+												buffer, snapshot);
+#else
+			HeapCheckForSerializableConflictOut(visible[i],
+												scan->rs_base.rs_rd,
+												&tuples[i],
+												buffer, snapshot);
+#endif
+		}
+	}
+
+	return nvis;
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 71a410502b3..8e91675e20a 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1584,6 +1584,56 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 		return true;
 }
 
+/*
+ * Perform HeapTupleSatisfiesMVCC() on each passed-in tuple. This is more
+ * efficient than doing HeapTupleSatisfiesMVCC() one-by-one.
+ *
+ * The tuples to be checked are passed via BatchMVCCState->tuples. Each tuple's
+ * visibility is set in batchmvcc->visible[]. In addition, ->vistuples_dense
+ * is set to contain the offsets of visible tuples.
+ *
+ * Returns the number of visible tuples.
+ */
+int
+HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
+							int ntups,
+#ifdef BATCHMVCC_FEWER_ARGS
+							BatchMVCCState *batchmvcc,
+#else
+							HeapTupleData *tuples,
+							bool *visible,
+#endif
+							OffsetNumber *vistuples_dense)
+{
+	int			nvis = 0;
+
+	Assert(IsMVCCSnapshot(snapshot));
+
+	for (int i = 0; i < ntups; i++)
+	{
+		bool		valid;
+#ifdef BATCHMVCC_FEWER_ARGS
+		HeapTuple	tup = &batchmvcc->tuples[i];
+#else
+		HeapTuple	tup = &tuples[i];
+#endif
+
+		valid = HeapTupleSatisfiesMVCC(tup, snapshot, buffer);
+#ifdef BATCHMVCC_FEWER_ARGS
+		batchmvcc->visible[i] = valid;
+#else
+		visible[i] = valid;
+#endif
+		if (likely(valid))
+		{
+			vistuples_dense[nvis] = tup->t_self.ip_posid;
+			nvis++;
+		}
+	}
+
+	return nvis;
+}
+
 /*
  * HeapTupleSatisfiesVisibility
  *		True iff heap tuple satisfies a time qual.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1640d9c32f7..a5e741dd1c0 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -425,6 +425,34 @@ extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
 extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
 
+/*
+ * FIXME: define to be removed
+ *
+ * Without this I see worse performance. But it's a bit ugly, so I thought
+ * it'd be useful to leave a way in for others to experiment with this.
+ */
+#define BATCHMVCC_FEWER_ARGS
+
+#ifdef BATCHMVCC_FEWER_ARGS
+typedef struct BatchMVCCState
+{
+	HeapTupleData tuples[MaxHeapTuplesPerPage];
+	bool		visible[MaxHeapTuplesPerPage];
+} BatchMVCCState;
+#endif
+
+extern int	HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
+										int ntups,
+#ifdef BATCHMVCC_FEWER_ARGS
+										BatchMVCCState *batchmvcc,
+#else
+										HeapTupleData *tuples,
+										bool *visible,
+#endif
+										OffsetNumber *vistuples_dense);
+
+
+
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
  * details this is implemented in reorderbuffer.c not heapam_visibility.c
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b66cecd8799..d67767ca1f8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -250,6 +250,7 @@ Barrier
 BaseBackupCmd
 BaseBackupTargetHandle
 BaseBackupTargetType
+BatchMVCCState
 BeginDirectModify_function
 BeginForeignInsert_function
 BeginForeignModify_function
-- 
2.39.5

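To make the new function easier to review in isolation, here is a minimal
caller sketch, assuming BATCHMVCC_FEWER_ARGS is defined (as it is by default
in this patch); the loop gathering the tuples is elided:

    BatchMVCCState batchmvcc;
    OffsetNumber vistuples[MaxHeapTuplesPerPage];
    int			ntup = 0;
    int			nvis;

    /* fill batchmvcc.tuples[0 .. ntup-1] with the normal tuples on the page */

    nvis = HeapTupleSatisfiesMVCCBatch(snapshot, buffer, ntup,
                                       &batchmvcc, vistuples);

    /*
     * vistuples[0 .. nvis-1] now contains the offsets of the visible tuples;
     * batchmvcc.visible[i] records per-tuple visibility, which is what the
     * one-by-one serializability check consumes.
     */
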
Attachment: 0007-bufmgr-Add-interface-to-acquire-right-to-set-hint-bi.patch (text/x-patch)
From 2a94931e401cf25224fff27e6249707812f78891 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 13:59:13 -0400
Subject: [PATCH 07/12] bufmgr: Add interface to acquire right to set hint bits

At the moment hint bits can be set with just a share lock on a page (and in
one place even without any lock). Because of this we need to copy pages while
writing them out, as otherwise the checksum could be corrupted.

The need to copy the page is problematic for the AIO patchset:

1) Instead of just needing a single buffer for a copied page we need one for
   each page that's potentially undergoing IO
2) To be able to use the "worker" AIO implementation the copied page needs to
   reside in shared memory.

Even without AIO copying the page isn't free...

This commit starts to address that by adding BufferPrepareToSetHintBits(),
which needs to be called before setting hint bits on a
buffer. BufferPrepareToSetHintBits() only allows hint bit writes if there's no
ongoing IO and while hint bits are being set no IO is allowed to be started.

To know that a buffer is undergoing IO a new BufferDesc state flag is used,
BM_SETTING_HINTS. Theoretically it'd be possible to reuse BM_IO_IN_PROGRESS,
but that'd make it harder to debug the system.

The new interface is not yet used; that will happen in subsequent commits, to
make review a bit easier. Therefore we cannot yet rely on not needing a copy
of the buffer during IO.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/buffer/README             |   6 +-
 src/backend/storage/buffer/bufmgr.c           | 365 +++++++++++++++---
 .../utils/activity/wait_event_names.txt       |   1 +
 src/include/storage/buf_internals.h           |  22 +-
 src/include/storage/bufmgr.h                  |   6 +
 5 files changed, 351 insertions(+), 49 deletions(-)

diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index a182fcd660c..c3249907e6b 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -58,7 +58,8 @@ tuple while they are doing visibility checks.
 4. It is considered OK to update tuple commit status bits (ie, OR the
 values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
 HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
-pin on a buffer.  This is OK because another backend looking at the tuple
+pin on a buffer after calling BufferPrepareToSetHintBits().
+This is OK because another backend looking at the tuple
 at about the same time would OR the same bits into the field, so there
 is little or no risk of conflicting update; what's more, if there did
 manage to be a conflict it would merely mean that one bit-update would
@@ -68,6 +69,9 @@ great harm is done if they get reset to zero by conflicting updates.
 Note, however, that a tuple is frozen by setting both HEAP_XMIN_INVALID
 and HEAP_XMIN_COMMITTED; this is a critical update and accordingly requires
 an exclusive buffer lock (and it must also be WAL-logged).
+Calling BufferPrepareToSetHintBits() ensures that write IO cannot happen while
+hint bits are being modified; otherwise the checksum computed at the start of
+the write could no longer match the data that is actually written out.
 
 5. To physically remove a tuple or compact free space on a page, one
 must hold a pin and an exclusive lock, *and* observe while holding the
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dd61ac547b4..5d397a330b9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -228,6 +228,8 @@ static void ResOwnerReleaseBufferIO(Datum res);
 static char *ResOwnerPrintBufferIO(Datum res);
 static void ResOwnerReleaseBufferPin(Datum res);
 static char *ResOwnerPrintBufferPin(Datum res);
+static void ResOwnerReleaseBufferSettingHints(Datum res);
+static char *ResOwnerPrintBufferSettingHints(Datum res);
 
 const ResourceOwnerDesc buffer_io_resowner_desc =
 {
@@ -247,6 +249,15 @@ const ResourceOwnerDesc buffer_pin_resowner_desc =
 	.DebugPrint = ResOwnerPrintBufferPin
 };
 
+const ResourceOwnerDesc buffer_setting_hints_resowner_desc =
+{
+	.name = "buffer setting hints",
+	.release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
+	.release_priority = RELEASE_PRIO_BUFFER_IOS,
+	.ReleaseResource = ResOwnerReleaseBufferSettingHints,
+	.DebugPrint = ResOwnerPrintBufferSettingHints
+};
+
 /*
  * Ensure that the PrivateRefCountArray has sufficient space to store one more
  * entry. This has to be called before using NewPrivateRefCountEntry() to fill
@@ -2133,7 +2144,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	/* some sanity checks while we hold the buffer header lock */
 	Assert(BUF_STATE_GET_REFCOUNT(victim_buf_state) == 1);
-	Assert(!(victim_buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY | BM_IO_IN_PROGRESS)));
+	Assert(!(victim_buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY
+								 | BM_IO_IN_PROGRESS | BM_SETTING_HINTS)));
 
 	victim_buf_hdr->tag = newTag;
 
@@ -2489,7 +2501,8 @@ again:
 	buf_state = pg_atomic_read_u32(&buf_hdr->state);
 
 	Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 1);
-	Assert(!(buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY)));
+	Assert(!(buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY
+						  | BM_IO_IN_PROGRESS | BM_SETTING_HINTS)));
 
 	CheckBufferIsPinnedOnce(buf);
 #endif
@@ -5341,34 +5354,16 @@ IncrBufferRefCount(Buffer buffer)
 }
 
 /*
- * MarkBufferDirtyHint
- *
- *	Mark a buffer dirty for non-critical changes.
- *
- * This is essentially the same as MarkBufferDirty, except:
- *
- * 1. The caller does not write WAL; so if checksums are enabled, we may need
- *	  to write an XLOG_FPI_FOR_HINT WAL record to protect against torn pages.
- * 2. The caller might have only share-lock instead of exclusive-lock on the
- *	  buffer's content lock.
- * 3. This function does not guarantee that the buffer is always marked dirty
- *	  (due to a race condition), so it cannot be used for important changes.
+ * Implementation of MarkBufferDirtyHint(), separate because
+ * BufferSetHintBits16() needs to be able to set hint bits without setting
+ * BM_SETTING_HINTS.
  */
-void
-MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
+static inline void
+MarkBufferDirtyHintImpl(Buffer buffer, bool buffer_std, bool single_hint_write)
 {
 	BufferDesc *bufHdr;
 	Page		page = BufferGetPage(buffer);
 
-	if (!BufferIsValid(buffer))
-		elog(ERROR, "bad buffer ID: %d", buffer);
-
-	if (BufferIsLocal(buffer))
-	{
-		MarkLocalBufferDirty(buffer);
-		return;
-	}
-
 	bufHdr = GetBufferDescriptor(buffer - 1);
 
 	Assert(GetPrivateRefCount(buffer) > 0);
@@ -5487,6 +5482,35 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 	}
 }
 
+/*
+ * MarkBufferDirtyHint
+ *
+ *	Mark a buffer dirty for non-critical changes.
+ *
+ * This is essentially the same as MarkBufferDirty, except:
+ *
+ * 1. The caller does not write WAL; so if checksums are enabled, we may need
+ *	  to write an XLOG_FPI_FOR_HINT WAL record to protect against torn pages.
+ * 2. The caller might have only share-lock instead of exclusive-lock on the
+ *	  buffer's content lock.
+ * 3. This function does not guarantee that the buffer is always marked dirty
+ *	  (due to a race condition), so it cannot be used for important changes.
+ */
+void
+MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
+{
+	if (!BufferIsValid(buffer))
+		elog(ERROR, "bad buffer ID: %d", buffer);
+
+	if (BufferIsLocal(buffer))
+	{
+		MarkLocalBufferDirty(buffer);
+		return;
+	}
+
+	MarkBufferDirtyHintImpl(buffer, buffer_std, false);
+}
+
 /*
  * Release buffer content locks for shared buffers.
  *
@@ -5922,7 +5946,8 @@ BufferLockHeldByMe(Buffer buffer, int mode)
  */
 
 /*
- * WaitIO -- Block until the IO_IN_PROGRESS flag on 'buf' is cleared.
+ * WaitIO -- Block until the IO_IN_PROGRESS and SETTING_HINTS flags on 'buf'
+ * are cleared.
  */
 static void
 WaitIO(BufferDesc *buf)
@@ -5950,32 +5975,44 @@ WaitIO(BufferDesc *buf)
 		iow = buf->io_wref;
 		UnlockBufHdr(buf, buf_state);
 
-		/* no IO in progress, we don't need to wait */
-		if (!(buf_state & BM_IO_IN_PROGRESS))
-			break;
+		if (likely((buf_state & (BM_IO_IN_PROGRESS | BM_SETTING_HINTS)) == 0))
+			break;		/* no need to wait */
 
 		/*
 		 * The buffer has asynchronous IO in progress, wait for it to
 		 * complete.
 		 */
-		if (pgaio_wref_valid(&iow))
+		if (buf_state & BM_IO_IN_PROGRESS)
 		{
-			pgaio_wref_wait(&iow);
+			Assert(!(buf_state & BM_SETTING_HINTS));
 
-			/*
-			 * The AIO subsystem internally uses condition variables and thus
-			 * might remove this backend from the BufferDesc's CV. While that
-			 * wouldn't cause a correctness issue (the first CV sleep just
-			 * immediately returns if not already registered), it seems worth
-			 * avoiding unnecessary loop iterations, given that we take care
-			 * to do so at the start of the function.
-			 */
-			ConditionVariablePrepareToSleep(cv);
+			if (pgaio_wref_valid(&iow))
+			{
+				pgaio_wref_wait(&iow);
+
+				/*
+				 * The AIO subsystem internally uses condition variables and thus
+				 * might remove this backend from the BufferDesc's CV. While that
+				 * wouldn't cause a correctness issue (the first CV sleep just
+				 * immediately returns if not already registered), it seems worth
+				 * avoiding unnecessary loop iterations, given that we take care
+				 * to do so at the start of the function.
+				 */
+				ConditionVariablePrepareToSleep(cv);
+				continue;
+			}
+
+			/* wait on BufferDesc->cv, e.g. for concurrent synchronous IO */
+			ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_IO);
+		}
+		else
+		{
+			/* Wait for hint-bit setting to complete */
+			Assert(buf_state & BM_SETTING_HINTS);
+			Assert(!(buf_state & BM_IO_IN_PROGRESS));
+			ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_SETTING_HINTS);
 			continue;
 		}
-
-		/* wait on BufferDesc->cv, e.g. for concurrent synchronous IO */
-		ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_IO);
 	}
 	ConditionVariableCancelSleep();
 }
@@ -6014,7 +6051,7 @@ StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 	{
 		buf_state = LockBufHdr(buf);
 
-		if (!(buf_state & BM_IO_IN_PROGRESS))
+		if (!(buf_state & (BM_IO_IN_PROGRESS | BM_SETTING_HINTS)))
 			break;
 		UnlockBufHdr(buf, buf_state);
 		if (nowait)
@@ -6031,6 +6068,8 @@ StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 		return false;
 	}
 
+	Assert(!(buf_state & BM_SETTING_HINTS));
+
 	buf_state |= BM_IO_IN_PROGRESS;
 	UnlockBufHdr(buf, buf_state);
 
@@ -6069,6 +6108,7 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,
 	buf_state = LockBufHdr(buf);
 
 	Assert(buf_state & BM_IO_IN_PROGRESS);
+	Assert(!(buf_state & BM_SETTING_HINTS));
 	buf_state &= ~BM_IO_IN_PROGRESS;
 
 	/* Clear earlier errors, if this IO failed, it'll be marked again */
@@ -6127,6 +6167,7 @@ AbortBufferIO(Buffer buffer)
 
 	buf_state = LockBufHdr(buf_hdr);
 	Assert(buf_state & (BM_IO_IN_PROGRESS | BM_TAG_VALID));
+	Assert(!(buf_state & BM_SETTING_HINTS));
 
 	if (!(buf_state & BM_VALID))
 	{
@@ -6606,6 +6647,220 @@ EvictUnpinnedBuffer(Buffer buf)
 	return result;
 }
 
+/*
+ * Try to acquire the right to set hint bits for buf.
+ *
+ * It's only permitted to set hint bits in a buffer if the buffer is not
+ * undergoing IO. Otherwise the page level checksum could be corrupted. This
+ * could happen both for PG's checksum and on the OS/filesystem
+ * level. E.g. btrfs with direct-io relies on the page to not change while IO
+ * is going on.
+ *
+ * We can't just check whether IO is going on when BufferPrepareToSetHintBits()
+ * is called; we also need to block IO from starting before we're done setting
+ * hints. This is achieved by setting BM_SETTING_HINTS for the buffer and
+ * having StartBufferIO()/WaitIO() wait for that.  We could combine
+ * BM_SETTING_HINTS and BM_IO_IN_PROGRESS into one, as they can never be set
+ * at the same time, but that seems unnecessarily confusing.
+ *
+ * Because we use only a single bit (BM_SETTING_HINTS) to track whether hint
+ * bits are currently being set, we cannot allow multiple backends to set
+ * hints at the same time - it'd be unknown whether BM_SETTING_HINTS would
+ * need to remain set when a backend finishes setting hint bits.  In almost
+ * all situations the two backends would just set the same hint bits anyway,
+ * so this is unlikely to be a problem.
+ *
+ * If allowed to set hint bits, the caller needs to call
+ * BufferFinishSetHintBits() once done setting hint bits. In case of an error
+ * occurring before BufferFinishSetHintBits() is reached, the cleanup will be
+ * done via resowner.c.
+ *
+ * It can make sense to amortize the cost of BufferPrepareToSetHintBits() +
+ * BufferFinishSetHintBits() over multiple hint bit sets on a page.
+ *
+ * Returns whether caller is allowed to set hint bits.
+ */
+bool
+BufferPrepareToSetHintBits(Buffer buffer)
+{
+	BufferDesc *desc;
+	uint32		old_buf_state;
+
+	Assert(BufferIsPinned(buffer));
+
+	if (BufferIsLocal(buffer))
+		return true;
+
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+
+	desc = GetBufferDescriptor(buffer - 1);
+
+	/*
+	 * We could use LockBufHdr() instead, but a CAS loop turns out to be
+	 * slightly faster and has better concurrency behaviour due to not
+	 * blocking other backends.
+	 */
+	old_buf_state = pg_atomic_read_u32(&desc->state);
+
+	for (;;)
+	{
+		uint32		new_buf_state;
+
+		if (unlikely(old_buf_state & BM_LOCKED))
+			old_buf_state = WaitBufHdrUnlocked(desc);
+
+		if (unlikely((old_buf_state & (BM_SETTING_HINTS | BM_IO_IN_PROGRESS)) != 0))
+			return false;
+
+		new_buf_state = old_buf_state;
+
+		new_buf_state |= BM_SETTING_HINTS;
+
+		if (likely(pg_atomic_compare_exchange_u32(&desc->state, &old_buf_state,
+												  new_buf_state)))
+		{
+			ResourceOwnerRememberBufferSettingHints(CurrentResourceOwner, buffer);
+
+			return true;
+		}
+	}
+}
+
+/*
+ * Release the permission to set hint bits on the current page.
+ */
+void
+BufferFinishSetHintBits(Buffer buffer, bool mark_dirty, bool buffer_std)
+{
+	BufferDesc *desc;
+	uint32		new_buf_state;
+	uint32		old_buf_state;
+	bool		is_dirty = false;
+
+	Assert(BufferIsPinned(buffer));
+
+	if (BufferIsLocal(buffer))
+	{
+		MarkLocalBufferDirty(buffer);
+		return;
+	}
+
+	desc = GetBufferDescriptor(buffer - 1);
+
+	/*
+	 * We could use LockBufHdr() instead, but a CAS loop turns out to be
+	 * slightly faster and has better concurrency behaviour due to not
+	 * blocking other backends.
+	 */
+	old_buf_state = pg_atomic_read_u32(&desc->state);
+
+	Assert(old_buf_state & (BM_TAG_VALID));
+	Assert(old_buf_state & (BM_VALID));
+	Assert(old_buf_state & (BM_SETTING_HINTS));
+	Assert(!(old_buf_state & (BM_IO_IN_PROGRESS)));
+
+	for (;;)
+	{
+		if (unlikely(old_buf_state & BM_LOCKED))
+			old_buf_state = WaitBufHdrUnlocked(desc);
+
+		is_dirty = (old_buf_state & (BM_DIRTY | BM_JUST_DIRTIED)) ==
+				(BM_DIRTY | BM_JUST_DIRTIED);
+
+		new_buf_state = old_buf_state;
+		new_buf_state &= ~BM_SETTING_HINTS;
+
+		if (likely(pg_atomic_compare_exchange_u32(&desc->state, &old_buf_state,
+												  new_buf_state)))
+		{
+			break;
+		}
+	}
+
+	/*
+	 * It might be worth combining some of the atomic operations happening
+	 * inside MarkBufferDirtyHintImpl() with the above.
+	 */
+	if (mark_dirty && !is_dirty)
+		MarkBufferDirtyHintImpl(buffer, buffer_std, true);
+
+	ResourceOwnerForgetBufferSettingHints(CurrentResourceOwner, buffer);
+
+	/*
+	 * Wake everyone that might be waiting for a chance to perform IO (i.e.
+	 * WaitIO()).
+	 *
+	 * XXX: Right now this is somewhat expensive, due to
+	 * ConditionVariableBroadcast() acquiring its spinlock unconditionally. I
+	 * don't see a good way to avoid that from the bufmgr.c side, we don't
+	 * know if there is somebody waiting.
+	 */
+	ConditionVariableBroadcast(BufferDescriptorGetIOCV(desc));
+}
+
+/*
+ * Try to set 16 bit hint bit value.
+ *
+ * This is cheaper than BufferPrepareToSetHintBits() +
+ * BufferFinishSetHintBits() when doing a single hint bit write within a
+ * buffer, but more expensive when setting multiple hint bits.
+ *
+ * Assumes that the buffer has the standard layout. The buffer will be marked
+ * as dirty if possible.
+ */
+void
+BufferSetHintBits16(Buffer buffer, uint16 *ptr, uint16 val)
+{
+	BufferDesc	*desc;
+	uint32		buf_state;
+	bool		did_set = false;
+	bool		is_dirty = false;
+
+	Assert(BufferIsPinned(buffer));
+
+	/*
+	 * TODO: Add assert ensuring ptr inside the buffer.
+	 */
+
+	if (unlikely(BufferIsLocal(buffer)))
+	{
+		desc = GetLocalBufferDescriptor(-buffer - 1);
+		buf_state = pg_atomic_read_u32(&desc->state);
+
+		if (likely((buf_state & BM_IO_IN_PROGRESS) == 0))
+		{
+			*ptr = val;
+
+			is_dirty = (buf_state & (BM_DIRTY | BM_JUST_DIRTIED)) ==
+				(BM_DIRTY | BM_JUST_DIRTIED);
+			did_set = true;
+		}
+
+		if (did_set && !is_dirty)
+			MarkLocalBufferDirty(buffer);
+		return;
+	}
+	else
+	{
+		desc = GetBufferDescriptor(buffer - 1);
+		buf_state = LockBufHdr(desc);
+
+		if (likely((buf_state & BM_IO_IN_PROGRESS) == 0))
+		{
+
+			*ptr = val;
+
+			did_set = true;
+		}
+
+		UnlockBufHdr(desc, buf_state);
+
+		is_dirty = (buf_state & (BM_DIRTY | BM_JUST_DIRTIED)) == (BM_DIRTY | BM_JUST_DIRTIED);
+		if (did_set && !is_dirty)
+			MarkBufferDirtyHintImpl(buffer, true, true);
+	}
+}
+
 /*
  * Generic implementation of the AIO handle staging callback for readv/writev
  * on local/shared buffers.
@@ -7200,6 +7455,28 @@ shared_buffer_readv_complete(PgAioHandle *ioh, PgAioResult prior_result,
 	return buffer_readv_complete(ioh, prior_result, cb_data, false);
 }
 
+static void
+ResOwnerReleaseBufferSettingHints(Datum res)
+{
+	Buffer		buffer = DatumGetInt32(res);
+
+	/*
+	 * At this point we don't know if the buffer was a standard buffer or
+	 * whether the caller wanted to dirty the buffer. But we don't need to
+	 * dirty the buffer, which also means we don't need to know if the buffer
+	 * was a standard buffer or not.
+	 */
+	BufferFinishSetHintBits(buffer, false, false);
+}
+
+static char *
+ResOwnerPrintBufferSettingHints(Datum res)
+{
+	Buffer		buffer = DatumGetInt32(res);
+
+	return psprintf("lost track of setting hint bits on buffer %d", buffer);
+}
+
 /*
  * We need a backend-local completion callback for shared buffers, to be able
  * to report checksum errors correctly. Unfortunately that can only safely
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 4f44648aca8..455c55428ee 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -112,6 +112,7 @@ BGWORKER_SHUTDOWN	"Waiting for background worker to shut down."
 BGWORKER_STARTUP	"Waiting for background worker to start up."
 BTREE_PAGE	"Waiting for the page number needed to continue a parallel B-tree scan to become available."
 BUFFER_IO	"Waiting for buffer I/O to complete."
+BUFFER_SETTING_HINTS	"Waiting for hint bit setting to complete."
 CHECKPOINT_DELAY_COMPLETE	"Waiting for a backend that blocks a checkpoint from completing."
 CHECKPOINT_DELAY_START	"Waiting for a backend that blocks a checkpoint from starting."
 CHECKPOINT_DONE	"Waiting for a checkpoint to complete."
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 4972afa38fc..0bf6a45b500 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -33,8 +33,8 @@
  * Buffer state is a single 32-bit variable where following data is combined.
  *
  * - 18 bits refcount
- * - 4 bits usage count
- * - 10 bits of flags
+ * - 3 bits usage count
+ * - 11 bits of flags
  *
  * Combining these values allows to perform some operations without locking
  * the buffer header, by modifying them together with a CAS loop.
@@ -42,8 +42,8 @@
  * The definition of buffer state components is below.
  */
 #define BUF_REFCOUNT_BITS 18
-#define BUF_USAGECOUNT_BITS 4
-#define BUF_FLAG_BITS 10
+#define BUF_USAGECOUNT_BITS 3
+#define BUF_FLAG_BITS 11
 
 StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
 				 "parts of buffer state space need to equal 32");
@@ -65,6 +65,7 @@ StaticAssertDecl(BUF_REFCOUNT_BITS + BUF_USAGECOUNT_BITS + BUF_FLAG_BITS == 32,
  * Note: BM_TAG_VALID essentially means that there is a buffer hashtable
  * entry associated with the buffer's tag.
  */
+#define BM_SETTING_HINTS		(1U << 21)	/* hint bits are being set */
 #define BM_LOCKED				(1U << 22)	/* buffer header is locked */
 #define BM_DIRTY				(1U << 23)	/* data needs writing */
 #define BM_VALID				(1U << 24)	/* data is valid */
@@ -433,6 +434,7 @@ extern PGDLLIMPORT CkptSortItem *CkptBufferIds;
 /* ResourceOwner callbacks to hold buffer I/Os and pins */
 extern PGDLLIMPORT const ResourceOwnerDesc buffer_io_resowner_desc;
 extern PGDLLIMPORT const ResourceOwnerDesc buffer_pin_resowner_desc;
+extern PGDLLIMPORT const ResourceOwnerDesc buffer_setting_hints_resowner_desc;
 
 /* Convenience wrappers over ResourceOwnerRemember/Forget */
 static inline void
@@ -455,6 +457,18 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 {
 	ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
 }
+static inline void
+ResourceOwnerRememberBufferSettingHints(ResourceOwner owner, Buffer buffer)
+{
+	ResourceOwnerRemember(owner, Int32GetDatum(buffer),
+						  &buffer_setting_hints_resowner_desc);
+}
+static inline void
+ResourceOwnerForgetBufferSettingHints(ResourceOwner owner, Buffer buffer)
+{
+	ResourceOwnerForget(owner, Int32GetDatum(buffer),
+						&buffer_setting_hints_resowner_desc);
+}
 
 /*
  * Internal buffer management routines
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7c6b6df29aa..1a5d9544f0d 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -307,6 +307,12 @@ extern void LimitAdditionalLocalPins(uint32 *additional_pins);
 
 extern bool EvictUnpinnedBuffer(Buffer buf);
 
+extern bool BufferPrepareToSetHintBits(Buffer buffer);
+extern void BufferFinishSetHintBits(Buffer buffer, bool mark_dirty, bool buffer_std);
+extern void BufferSetHintBits16(Buffer buffer,
+								uint16 *ptr, uint16 val);
+
+
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
 extern Size BufferManagerShmemSize(void);
-- 
2.39.5

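As the interface has no users yet, here is a minimal sketch of the intended
calling convention (callers are added in the following patches; the
mark_dirty/buffer_std arguments shown are just examples):

    if (BufferPrepareToSetHintBits(buffer))
    {
        /* ... set one or more hint bits on the page ... */

        /* release the right to set hints, optionally marking the buffer dirty */
        BufferFinishSetHintBits(buffer, /* mark_dirty */ true, /* buffer_std */ true);
    }
    else
    {
        /*
         * IO is in progress, or another backend is already setting hints.
         * Just skip - hint bits are only an optimization.
         */
    }
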
Attachment: 0008-heapam-Acquire-right-to-set-hint-bits.patch (text/x-patch)
From 1b5c96d86181d41d38e52d8e86ba0ddace57307f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 14:05:15 -0400
Subject: [PATCH 08/12] heapam: Acquire right to set hint bits

This commit starts to use BufferPrepareToSetHintBits(), introduced in a
previous commit, for the reasons explained in said commit.

To amortize the cost of BufferPrepareToSetHintBits() for cases where hint bits
are set at a high frequency, HeapTupleSatisfiesMVCCBatch() uses the new
SetHintBitsExt() which defers BufferFinishSetHintBits() until all hint bits on
a page have been set.

It's likely worth introducing additional batch visibility routines, e.g. for
vacuuming, but I did not find a regression with the state as of this
commit. So that's left for later.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/heap/heapam_visibility.c | 114 ++++++++++++++++----
 src/tools/pgindent/typedefs.list            |   1 +
 2 files changed, 97 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 8e91675e20a..db5b9c7cb5f 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -80,10 +80,35 @@
 
 
 /*
- * SetHintBits()
+ * To be allowed to set hint bits, SetHintBits() needs to call
+ * BufferPrepareToSetHintBits(). However, that's not free, and some callsites
+ * call SetHintBits() on numerous tuples in a row. For those it makes sense to
+ * amortize the cost of BufferPrepareToSetHintBits(). Additionally it's
+ * desirable to defer the cost of BufferPrepareToSetHintBits() until a hint
+ * bit needs to actually be set. This enum serves as the necessary state space
+ * passed to SetHintBitsExt().
+ */
+typedef enum SetHintBitsState
+{
+	/* not yet checked if hint bits may be set */
+	SHB_INITIAL,
+	/* failed to get permission to set hint bits, don't check again */
+	SHB_DISABLED,
+	/* allowed to set hint bits */
+	SHB_ENABLED,
+} SetHintBitsState;
+
+/*
+ * SetHintBitsExt()
  *
  * Set commit/abort hint bits on a tuple, if appropriate at this time.
  *
+ * To be allowed to set a hint bit on a tuple, the page must not be undergoing
+ * IO at this time (otherwise we could, e.g., corrupt PG's page checksum or even
+ * the filesystem's, as is known to happen with btrfs). The right to set a
+ * hint bit is acquired on a page level with BufferPrepareToSetHintBits().
+ * Only a single backend gets the right to set hint bits at a time.
+ *
  * It is only safe to set a transaction-committed hint bit if we know the
  * transaction's commit record is guaranteed to be flushed to disk before the
  * buffer, or if the table is temporary or unlogged and will be obliterated by
@@ -111,9 +136,16 @@
  * InvalidTransactionId if no check is needed.
  */
 static inline void
-SetHintBits(HeapTupleHeader tuple, Buffer buffer,
-			uint16 infomask, TransactionId xid)
+SetHintBitsExt(HeapTupleHeader tuple, Buffer buffer,
+			   uint16 infomask, TransactionId xid, SetHintBitsState *state)
 {
+	/*
+	 * If we're in batched mode and previously did not get permission to set
+	 * hint bits, don't try again; in all likelihood IO is still going on.
+	 */
+	if (state && *state == SHB_DISABLED)
+		return;
+
 	if (TransactionIdIsValid(xid))
 	{
 		/* NB: xid must be known committed here! */
@@ -127,8 +159,50 @@ SetHintBits(HeapTupleHeader tuple, Buffer buffer,
 		}
 	}
 
+	if (!state)
+	{
+		BufferSetHintBits16(buffer,
+							&tuple->t_infomask,
+							tuple->t_infomask | infomask);
+		return;
+	}
+
+	/*
+	 * If this is the first hint that we'd like to set on the page, check
+	 * whether we are allowed to do so.
+	 */
+	if (*state == SHB_INITIAL)
+	{
+		if (BufferPrepareToSetHintBits(buffer))
+		{
+			*state = SHB_ENABLED;
+		}
+		else
+		{
+			/*
+			 * If we hold an exclusive lock nobody else should be able to
+			 * If we hold an exclusive lock, nobody else should be able to
+			 * prevent us from setting hint bits. This matters because a few
+			 * hint bit sets are required for correctness - all of those
+			 * happen with the buffer exclusively locked.
+			Assert(!BufferLockHeldByMe(buffer, BUFFER_LOCK_EXCLUSIVE));
+			*state = SHB_DISABLED;
+			return;
+		}
+	}
+
 	tuple->t_infomask |= infomask;
-	MarkBufferDirtyHint(buffer, true);
+}
+
+/*
+ * Simple wrapper around SetHintBitsExt(), for use when operating on a single
+ * tuple.
+ */
+static inline void
+SetHintBits(HeapTupleHeader tuple, Buffer buffer,
+			uint16 infomask, TransactionId xid)
+{
+	SetHintBitsExt(tuple, buffer, infomask, xid, NULL);
 }
 
 /*
@@ -864,9 +938,9 @@ HeapTupleSatisfiesDirty(HeapTuple htup, Snapshot snapshot,
  * inserting/deleting transaction was still running --- which was more cycles
  * and more contention on ProcArrayLock.
  */
-static bool
+static inline bool
 HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
-					   Buffer buffer)
+					   Buffer buffer, SetHintBitsState *state)
 {
 	HeapTupleHeader tuple = htup->t_data;
 
@@ -921,8 +995,8 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
 			if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
 			{
 				/* deleting subtransaction must have aborted */
-				SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
-							InvalidTransactionId);
+				SetHintBitsExt(tuple, buffer, HEAP_XMAX_INVALID,
+							   InvalidTransactionId, state);
 				return true;
 			}
 
@@ -934,13 +1008,13 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
 		else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
 			return false;
 		else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
-			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
-						HeapTupleHeaderGetRawXmin(tuple));
+			SetHintBitsExt(tuple, buffer, HEAP_XMIN_COMMITTED,
+						   HeapTupleHeaderGetRawXmin(tuple), state);
 		else
 		{
 			/* it must have aborted or crashed */
-			SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
-						InvalidTransactionId);
+			SetHintBitsExt(tuple, buffer, HEAP_XMIN_INVALID,
+						   InvalidTransactionId, state);
 			return false;
 		}
 	}
@@ -1003,14 +1077,14 @@ HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
 		if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuple)))
 		{
 			/* it must have aborted or crashed */
-			SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
-						InvalidTransactionId);
+			SetHintBitsExt(tuple, buffer, HEAP_XMAX_INVALID,
+						   InvalidTransactionId, state);
 			return true;
 		}
 
 		/* xmax transaction committed */
-		SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED,
-					HeapTupleHeaderGetRawXmax(tuple));
+		SetHintBitsExt(tuple, buffer, HEAP_XMAX_COMMITTED,
+					   HeapTupleHeaderGetRawXmax(tuple), state);
 	}
 	else
 	{
@@ -1606,6 +1680,7 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
 							OffsetNumber *vistuples_dense)
 {
 	int			nvis = 0;
+	SetHintBitsState state = SHB_INITIAL;
 
 	Assert(IsMVCCSnapshot(snapshot));
 
@@ -1618,7 +1693,7 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
 		HeapTuple	tup = &tuples[i];
 #endif
 
-		valid = HeapTupleSatisfiesMVCC(tup, snapshot, buffer);
+		valid = HeapTupleSatisfiesMVCC(tup, snapshot, buffer, &state);
 #ifdef BATCHMVCC_FEWER_ARGS
 		batchmvcc->visible[i] = valid;
 #else
@@ -1631,6 +1706,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
 		}
 	}
 
+	if (state == SHB_ENABLED)
+		BufferFinishSetHintBits(buffer, true, true);
+
 	return nvis;
 }
 
@@ -1650,7 +1728,7 @@ HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot, Buffer buffer)
 	switch (snapshot->snapshot_type)
 	{
 		case SNAPSHOT_MVCC:
-			return HeapTupleSatisfiesMVCC(htup, snapshot, buffer);
+			return HeapTupleSatisfiesMVCC(htup, snapshot, buffer, NULL);
 		case SNAPSHOT_SELF:
 			return HeapTupleSatisfiesSelf(htup, snapshot, buffer);
 		case SNAPSHOT_ANY:
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d67767ca1f8..d4799e0baf4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2679,6 +2679,7 @@ SetConstraintStateData
 SetConstraintTriggerData
 SetExprState
 SetFunctionReturnMode
+SetHintBitsState
 SetOp
 SetOpCmd
 SetOpPath
-- 
2.39.5

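Condensed, the batched flow inside HeapTupleSatisfiesMVCCBatch() after this
patch looks roughly like the following (a summary sketch of the code above,
with the dense-offset bookkeeping elided):

    SetHintBitsState state = SHB_INITIAL;

    for (int i = 0; i < ntups; i++)
    {
        /*
         * On the first hint bit it wants to set, SetHintBitsExt() moves state
         * from SHB_INITIAL to SHB_ENABLED if BufferPrepareToSetHintBits()
         * succeeds, otherwise to SHB_DISABLED, after which further hint bit
         * sets on this page are skipped.
         */
        batchmvcc->visible[i] = HeapTupleSatisfiesMVCC(&batchmvcc->tuples[i],
                                                       snapshot, buffer, &state);
    }

    /* pay for releasing the right to set hints only once per page */
    if (state == SHB_ENABLED)
        BufferFinishSetHintBits(buffer, true, true);
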
Attachment: 0009-Acquire-right-to-set-hint-bits-in-the-remaining-plac.patch (text/x-patch)
From 69f35a6a4f6f99980de04c8d6e0a7c8226d30916 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 Oct 2024 16:44:11 -0400
Subject: [PATCH 09/12] Acquire right to set hint bits in the remaining places

Use BufferPrepareToSetHintBits() in the remaining places that modify buffers
without an exclusive lock. These are the index AMs that support
kill_prior_tuples and the freespace code.

After this we no longer need to copy buffers in order to write them out. That
change is made separately, however.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/gist/gistget.c         | 15 +++++++-----
 src/backend/access/hash/hashutil.c        |  6 ++++-
 src/backend/access/nbtree/nbtinsert.c     | 29 +++++++++++++++--------
 src/backend/access/nbtree/nbtutils.c      |  7 +++++-
 src/backend/storage/freespace/freespace.c | 16 +++++++++----
 src/backend/storage/freespace/fsmpage.c   |  8 ++++++-
 6 files changed, 57 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 387d9972345..8b89629d232 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -63,11 +63,7 @@ gistkillitems(IndexScanDesc scan)
 	 * safe.
 	 */
 	if (BufferGetLSNAtomic(buffer) != so->curPageLSN)
-	{
-		UnlockReleaseBuffer(buffer);
-		so->numKilled = 0;		/* reset counter */
-		return;
-	}
+		goto unlock;
 
 	Assert(GistPageIsLeaf(page));
 
@@ -77,6 +73,12 @@ gistkillitems(IndexScanDesc scan)
 	 */
 	for (i = 0; i < so->numKilled; i++)
 	{
+		if (!killedsomething)
+		{
+			if (!BufferPrepareToSetHintBits(buffer))
+				goto unlock;
+		}
+
 		offnum = so->killedItems[i];
 		iid = PageGetItemId(page, offnum);
 		ItemIdMarkDead(iid);
@@ -86,9 +88,10 @@ gistkillitems(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		GistMarkPageHasGarbage(page);
-		MarkBufferDirtyHint(buffer, true);
+		BufferFinishSetHintBits(buffer, true, true);
 	}
 
+unlock:
 	UnlockReleaseBuffer(buffer);
 
 	/*
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 66c39f60654..cb004432388 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -593,6 +593,9 @@ _hash_kill_items(IndexScanDesc scan)
 
 			if (ItemPointerEquals(&ituple->t_tid, &currItem->heapTid))
 			{
+				if (!BufferPrepareToSetHintBits(so->currPos.buf))
+					goto unlock_page;
+
 				/* found the item */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
@@ -610,9 +613,10 @@ _hash_kill_items(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
-		MarkBufferDirtyHint(buf, true);
+		BufferFinishSetHintBits(so->currPos.buf, true, true);
 	}
 
+unlock_page:
 	if (so->hashso_bucket_buf == so->currPos.buf ||
 		havePin)
 		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index aa82cede30a..30c8b64564a 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -679,20 +679,29 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 				{
 					/*
 					 * The conflicting tuple (or all HOT chains pointed to by
-					 * all posting list TIDs) is dead to everyone, so mark the
-					 * index entry killed.
+					 * all posting list TIDs) is dead to everyone, so try to
+					 * mark the index entry killed. It's OK if we're not
+					 * allowed to; this isn't required for correctness.
 					 */
-					ItemIdMarkDead(curitemid);
-					opaque->btpo_flags |= BTP_HAS_GARBAGE;
+					Buffer		buf;
 
-					/*
-					 * Mark buffer with a dirty hint, since state is not
-					 * crucial. Be sure to mark the proper buffer dirty.
-					 */
+					/* Be sure to operate on the proper buffer */
 					if (nbuf != InvalidBuffer)
-						MarkBufferDirtyHint(nbuf, true);
+						buf = nbuf;
 					else
-						MarkBufferDirtyHint(insertstate->buf, true);
+						buf = insertstate->buf;
+
+					if (BufferPrepareToSetHintBits(buf))
+					{
+						ItemIdMarkDead(curitemid);
+						opaque->btpo_flags |= BTP_HAS_GARBAGE;
+
+						/*
+						 * Mark buffer with a dirty hint, since state is not
+						 * crucial.
+						 */
+						BufferFinishSetHintBits(buf, true, true);
+					}
 				}
 
 				/*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 2aee9bbf67d..53ad7b5017d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -2492,6 +2492,10 @@ _bt_killitems(IndexScanDesc scan)
 			 */
 			if (killtuple && !ItemIdIsDead(iid))
 			{
+				/* if we're not able to set hint bits, there's no point continuing */
+				if (!killedsomething && !BufferPrepareToSetHintBits(so->currPos.buf))
+					goto unlock_page;
+
 				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
@@ -2511,9 +2515,10 @@ _bt_killitems(IndexScanDesc scan)
 	if (killedsomething)
 	{
 		opaque->btpo_flags |= BTP_HAS_GARBAGE;
-		MarkBufferDirtyHint(so->currPos.buf, true);
+		BufferFinishSetHintBits(so->currPos.buf, true, true);
 	}
 
+unlock_page:
 	_bt_unlockbuf(scan->indexRelation, so->currPos.buf);
 }
 
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 4773a9cc65e..f83c13653f1 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -904,12 +904,18 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
 	max_avail = fsm_get_max_avail(page);
 
 	/*
-	 * Reset the next slot pointer. This encourages the use of low-numbered
-	 * pages, increasing the chances that a later vacuum can truncate the
-	 * relation.  We don't bother with a lock here, nor with marking the page
-	 * dirty if it wasn't already, since this is just a hint.
+	 * Try to reset the next slot pointer. This encourages the use of
+	 * low-numbered pages, increasing the chances that a later vacuum can
+	 * truncate the relation.  We don't bother with a lock here, nor with
+	 * marking the page dirty if it wasn't already, since this is just a hint.
+	 *
+	 * FIXME: Should probably use BufferSetHintBits16()
 	 */
-	((FSMPage) PageGetContents(page))->fp_next_slot = 0;
+	if (BufferPrepareToSetHintBits(buf))
+	{
+		((FSMPage) PageGetContents(page))->fp_next_slot = 0;
+		BufferFinishSetHintBits(buf, false, false);
+	}
 
 	ReleaseBuffer(buf);
 
diff --git a/src/backend/storage/freespace/fsmpage.c b/src/backend/storage/freespace/fsmpage.c
index 66a5c80b5a6..16970597998 100644
--- a/src/backend/storage/freespace/fsmpage.c
+++ b/src/backend/storage/freespace/fsmpage.c
@@ -300,7 +300,13 @@ restart:
 	 *
 	 * Wrap-around is handled at the beginning of this function.
 	 */
-	fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);
+	if (exclusive_lock_held || BufferPrepareToSetHintBits(buf))
+	{
+		fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);
+
+		if (!exclusive_lock_held)
+			BufferFinishSetHintBits(buf, false, true);
+	}
 
 	return slot;
 }
-- 
2.39.5

Attachment: 0010-bufmgr-Don-t-copy-pages-while-writing-out.patch (text/x-patch)
From 52fc5c97663ffdb5556e12e81fdffb929e479ae4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 17 Oct 2024 14:14:35 -0400
Subject: [PATCH 10/12] bufmgr: Don't copy pages while writing out

After the series of preceding commits introducing and using
BufferPrepareToSetHintBits(), hint bits can no longer be set while IO is in
progress. Therefore we no longer need to copy pages while they are being
written out.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/hash/hashpage.c      |  2 +-
 src/backend/access/transam/xloginsert.c | 28 +++++----------
 src/backend/storage/buffer/bufmgr.c     | 21 ++++-------
 src/backend/storage/buffer/localbuf.c   |  2 +-
 src/backend/storage/page/bufpage.c      | 48 ++++---------------------
 src/backend/storage/smgr/bulk_write.c   |  2 +-
 src/include/storage/bufpage.h           |  3 +-
 7 files changed, 25 insertions(+), 81 deletions(-)

diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index b8e5bd005e5..dd17eff59d1 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -1029,7 +1029,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 					zerobuf.data,
 					true);
 
-	PageSetChecksumInplace(page, lastblock);
+	PageSetChecksum(page, lastblock);
 	smgrextend(RelationGetSmgr(rel), MAIN_FORKNUM, lastblock, zerobuf.data,
 			   false);
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 5ee9d0b028e..db5e0ec213b 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1052,6 +1052,8 @@ XLogCheckBufferNeedsBackup(Buffer buffer)
  * failures. So instead we copy the page and insert the copied data as normal
  * record data.
  *
+ * FIXME: outdated
+ *
  * We only need to do something if page has not yet been full page written in
  * this checkpoint round. The LSN of the inserted wal record is returned if we
  * had to write, InvalidXLogRecPtr otherwise.
@@ -1089,37 +1091,23 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
 	if (lsn <= RedoRecPtr)
 	{
 		int			flags = 0;
-		PGAlignedBlock copied_buffer;
 		char	   *origdata = (char *) BufferGetBlock(buffer);
 		RelFileLocator rlocator;
 		ForkNumber	forkno;
 		BlockNumber blkno;
 
-		/*
-		 * Copy buffer so we don't have to worry about concurrent hint bit or
-		 * lsn updates. We assume pd_lower/upper cannot be changed without an
-		 * exclusive lock, so the contents bkp are not racy.
-		 */
-		if (buffer_std)
-		{
-			/* Assume we can omit data between pd_lower and pd_upper */
-			Page		page = BufferGetPage(buffer);
-			uint16		lower = ((PageHeader) page)->pd_lower;
-			uint16		upper = ((PageHeader) page)->pd_upper;
-
-			memcpy(copied_buffer.data, origdata, lower);
-			memcpy(copied_buffer.data + upper, origdata + upper, BLCKSZ - upper);
-		}
-		else
-			memcpy(copied_buffer.data, origdata, BLCKSZ);
-
 		XLogBeginInsert();
 
 		if (buffer_std)
 			flags |= REGBUF_STANDARD;
 
 		BufferGetTag(buffer, &rlocator, &forkno, &blkno);
-		XLogRegisterBlock(0, &rlocator, forkno, blkno, copied_buffer.data, flags);
+
+		/*
+		 * XXX: Should probably be XLogRegisterBuffer(), but assertions make
+		 * that non-viable right now.
+		 */
+		XLogRegisterBlock(0, &rlocator, forkno, blkno, origdata, flags);
 
 		recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI_FOR_HINT);
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5d397a330b9..4080e8f9b5b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2393,9 +2393,8 @@ again:
 	/*
 	 * If the buffer was dirty, try to write it out.  There is a race
 	 * condition here, in that someone might dirty it after we released the
-	 * buffer header lock above, or even while we are writing it out (since
-	 * our share-lock won't prevent hint-bit updates).  We will recheck the
-	 * dirty bit after re-locking the buffer header.
+	 * buffer header lock above.  We will recheck the dirty bit after
+	 * re-locking the buffer header.
 	 */
 	if (buf_state & BM_DIRTY)
 	{
@@ -4227,9 +4226,8 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
  *
  * The caller must hold a pin on the buffer and have share-locked the
  * buffer contents.  (Note: a share-lock does not prevent updates of
- * hint bits in the buffer, so the page could change while the write
- * is in progress, but we assume that that will not invalidate the data
- * written.)
+ * hint bits in the buffer, but hint bits are not allowed to be set after
+ * StartBufferIO() succeeded.)
  *
  * If the caller has an smgr reference for the buffer's relation, pass it
  * as the second parameter.  If not, pass NULL.
@@ -4242,7 +4240,6 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	ErrorContextCallback errcallback;
 	instr_time	io_start;
 	Block		bufBlock;
-	char	   *bufToWrite;
 	uint32		buf_state;
 
 	/*
@@ -4309,12 +4306,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 */
 	bufBlock = BufHdrGetBlock(buf);
 
-	/*
-	 * Update page checksum if desired.  Since we have only shared lock on the
-	 * buffer, other processes might be updating hint bits in it, so we must
-	 * copy the page to private storage if we do checksumming.
-	 */
-	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+	/* Update page checksum if desired. */
+	PageSetChecksum((Page) bufBlock, buf->tag.blockNum);
 
 	io_start = pgstat_prepare_io_time(track_io_timing);
 
@@ -4324,7 +4317,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	smgrwrite(reln,
 			  BufTagGetForkNum(&buf->tag),
 			  buf->tag.blockNum,
-			  bufToWrite,
+			  bufBlock,
 			  false);
 
 	/*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index bf89076bb10..5d53bf12b03 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -196,7 +196,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
 		reln = smgropen(BufTagGetRelFileLocator(&bufHdr->tag),
 						MyProcNumber);
 
-	PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
+	PageSetChecksum(localpage, bufHdr->tag.blockNum);
 
 	io_start = pgstat_prepare_io_time(track_io_timing);
 
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 82457bacc62..b7ebec8a69d 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1494,51 +1494,15 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 /*
  * Set checksum for a page in shared buffers.
  *
- * If checksums are disabled, or if the page is not initialized, just return
- * the input.  Otherwise, we must make a copy of the page before calculating
- * the checksum, to prevent concurrent modifications (e.g. setting hint bits)
- * from making the final checksum invalid.  It doesn't matter if we include or
- * exclude hints during the copy, as long as we write a valid page and
- * associated checksum.
+ * If checksums are disabled, or if the page is not initialized, just
+ * return. Otherwise compute and set the checksum.
  *
- * Returns a pointer to the block-sized data that needs to be written. Uses
- * statically-allocated memory, so the caller must immediately write the
- * returned page and not refer to it again.
- */
-char *
-PageSetChecksumCopy(Page page, BlockNumber blkno)
-{
-	static char *pageCopy = NULL;
-
-	/* If we don't need a checksum, just return the passed-in data */
-	if (PageIsNew(page) || !DataChecksumsEnabled())
-		return page;
-
-	/*
-	 * We allocate the copy space once and use it over on each subsequent
-	 * call.  The point of palloc'ing here, rather than having a static char
-	 * array, is first to ensure adequate alignment for the checksumming code
-	 * and second to avoid wasting space in processes that never call this.
-	 */
-	if (pageCopy == NULL)
-		pageCopy = MemoryContextAllocAligned(TopMemoryContext,
-											 BLCKSZ,
-											 PG_IO_ALIGN_SIZE,
-											 0);
-
-	memcpy(pageCopy, page, BLCKSZ);
-	((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
-	return pageCopy;
-}
-
-/*
- * Set checksum for a page in private memory.
- *
- * This must only be used when we know that no other process can be modifying
- * the page buffer.
+ * In the past this needed to be done on a copy of the page, due to the
+ * possibility of e.g. hint bits being set concurrently. However, this is not
+ * necessary anymore as hint bits won't be set while IO is going on.
  */
 void
-PageSetChecksumInplace(Page page, BlockNumber blkno)
+PageSetChecksum(Page page, BlockNumber blkno)
 {
 	/* If we don't need a checksum, just return */
 	if (PageIsNew(page) || !DataChecksumsEnabled())
diff --git a/src/backend/storage/smgr/bulk_write.c b/src/backend/storage/smgr/bulk_write.c
index b958be15716..f4d07543365 100644
--- a/src/backend/storage/smgr/bulk_write.c
+++ b/src/backend/storage/smgr/bulk_write.c
@@ -279,7 +279,7 @@ smgr_bulk_flush(BulkWriteState *bulkstate)
 		BlockNumber blkno = pending_writes[i].blkno;
 		Page		page = pending_writes[i].buf->data;
 
-		PageSetChecksumInplace(page, blkno);
+		PageSetChecksum(page, blkno);
 
 		if (blkno >= bulkstate->relsize)
 		{
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index aeb67c498c5..c5126dda7e1 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -505,7 +505,6 @@ extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
 extern void PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum);
 extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 									Item newtup, Size newsize);
-extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
-extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern void PageSetChecksum(Page page, BlockNumber blkno);
 
 #endif							/* BUFPAGE_H */
-- 
2.39.5

Attachment: 0011-bufmgr-Detect-some-missing-BufferPrepareToSetHintBit.patch (text/x-patch)
From ff2960acfb0e20efc4a0484f5f4617123a0903e2 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 Oct 2024 18:11:23 -0400
Subject: [PATCH 11/12] bufmgr: Detect some missing
 BufferPrepareToSetHintBits() calls

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/buffer/bufmgr.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4080e8f9b5b..e619ebd38ff 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5363,6 +5363,24 @@ MarkBufferDirtyHintImpl(Buffer buffer, bool buffer_std, bool single_hint_write)
 	/* here, either share or exclusive lock is OK */
 	Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));
 
+	/*
+	 * A share-locked buffer may only be marked dirty for hints after having
+	 * gotten permission to do so with BufferPrepareToSetHintBits(). It'd
+	 * perhaps be cheap enough to test this even outside of assert-enabled
+	 * builds, but LWLockHeldByMeInMode() says "debug support only".
+	 */
+#ifdef USE_ASSERT_CHECKING
+	if (!single_hint_write)
+	{
+		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+		if (!LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE))
+		{
+			Assert(buf_state & BM_SETTING_HINTS);
+		}
+	}
+#endif
+
 	/*
 	 * This routine might get called many times on the same page, if we are
 	 * making the first scan after commit of an xact that added/deleted many
-- 
2.39.5

Attachment: 0012-WIP-bufmgr-Detect-some-bad-buffer-accesses.patch (text/x-patch)
From 8e3c7cbf49c3710226c177d3259c4ef9b70bc048 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 28 Oct 2024 18:03:53 -0400
Subject: [PATCH 12/12] WIP: bufmgr: Detect some bad buffer accesses

I wrote this mainly to ensure that I did not miss converting any hint bit sets
to BufferPrepareToSetHintBits(), but it seems like it might be more generally
useful.

If we do want to include it, it needs a bit more polishing.

On my workstation, the performance effect of this test infrastructure is as
follows:

base:
real	1m4.613s
user	4m31.409s
sys	3m20.445s

ENFORCE_BUFFER_PROT

real	1m11.912s
user	4m27.332s
sys	3m28.063s

ENFORCE_BUFFER_PROT + ENFORCE_BUFFER_PROT_READ
real	1m33.455s
user	4m32.188s
sys	3m41.275s

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/aio/method_worker.c |  11 ++
 src/backend/storage/buffer/buf_init.c   |  10 ++
 src/backend/storage/buffer/bufmgr.c     | 183 ++++++++++++++++++++++--
 src/backend/utils/misc/guc_tables.c     |   2 +-
 src/include/pg_config_manual.h          |  24 ++++
 src/include/storage/buf_internals.h     |   9 ++
 src/include/storage/pg_shmem.h          |   2 +
 7 files changed, 229 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/aio/method_worker.c b/src/backend/storage/aio/method_worker.c
index 4a7853d13fa..7442ade2608 100644
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -388,6 +388,17 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 	sprintf(cmd, "%d", MyIoWorkerId);
 	set_ps_display(cmd);
 
+	/*
+	 * Allow access to all buffers.  We assume that backends will not request
+	 * I/Os on shared buffers without holding proper locks.
+	 */
+#if defined(ENFORCE_BUFFER_PROT) && defined(ENFORCE_BUFFER_PROT_READ)
+	for (int i = 0; i < NBuffers; i++)
+	{
+		SetBufferProtection(i + 1, true, true);
+	}
+#endif
+
 	/* see PostgresMain() */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1dc488a42..e5e22cfc59d 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -138,6 +138,16 @@ BufferManagerShmemInit(void)
 							 LWTRANCHE_BUFFER_CONTENT);
 
 			ConditionVariableInit(BufferDescriptorGetIOCV(buf));
+
+			/*
+			 * Unused buffers are made inaccessible. But if we're not
+			 * enforcing that unpinned buffers are inaccessible, we won't
+			 * mark buffers accessible when pinning them either, so we had
+			 * better not make them inaccessible here in the first place.
+			 */
+#if defined(ENFORCE_BUFFER_PROT) && defined(ENFORCE_BUFFER_PROT_READ)
+			SetBufferProtection(i + 1, false, false);
+#endif
 		}
 
 		/* Correct last entry of linked list */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e619ebd38ff..d680b57bcb7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,10 @@
 #include <sys/file.h>
 #include <unistd.h>
 
+#ifdef ENFORCE_BUFFER_PROT
+#include <sys/mman.h>			/* for mprotect() */
+#endif							/* ENFORCE_BUFFER_PROT */
+
 #include "access/tableam.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
@@ -54,6 +58,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/read_stream.h"
 #include "storage/smgr.h"
@@ -1074,8 +1079,6 @@ ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
 
 	if (need_to_zero)
 	{
-		memset(BufferGetPage(buffer), 0, BLCKSZ);
-
 		/*
 		 * Grab the buffer content lock before marking the page as valid, to
 		 * make sure that no other backend sees the zeroed page before the
@@ -1088,7 +1091,12 @@ ZeroAndLockBuffer(Buffer buffer, ReadBufferMode mode, bool already_valid)
 		 * already valid.)
 		 */
 		if (!isLocalBuf)
+		{
 			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
+			SetBufferProtection(buffer, true, true);
+		}
+
+		memset(BufferGetPage(buffer), 0, BLCKSZ);
 
 		/* Set BM_VALID, terminate IO, and wake up any waiters */
 		if (isLocalBuf)
@@ -1922,6 +1930,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 		io_pages[0] = BufferGetBlock(buffers[nblocks_done]);
 		io_buffers_len = 1;
 
+		if (persistence != RELPERSISTENCE_TEMP)
+			SetBufferProtection(io_buffers[0], true, true);
+
 		/*
 		 * How many neighboring-on-disk blocks can we scatter-read into other
 		 * buffers at the same time?  In this case we don't wait if we see an
@@ -1937,7 +1948,16 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 				   BufferGetBlockNumber(buffers[i]) - 1);
 			Assert(io_buffers[io_buffers_len] == buffers[i]);
 
-			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+			io_pages[io_buffers_len] = BufferGetBlock(buffers[i]);
+
+			/*
+			 * If the I/O is performed synchronously, allow the synchronous
+			 * read to modify the buffer.
+			 */
+			if (persistence != RELPERSISTENCE_TEMP)
+				SetBufferProtection(io_buffers[io_buffers_len], true, true);
+
+			io_buffers_len++;
 		}
 
 		/* get a reference to wait for in WaitReadBuffers() */
@@ -1975,6 +1995,16 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 		else
 			pgBufferUsage.shared_blks_read += io_buffers_len;
 
+		/*
+		 * Disallow writes for this process again.  If the I/O is performed
+		 * asynchronously, the I/O worker will allow the write for itself.
+		 */
+		if (persistence != RELPERSISTENCE_TEMP)
+		{
+			for (int i = 0; i < io_buffers_len; i++)
+				SetBufferProtection(io_buffers[i], true, false);
+		}
+
 		/*
 		 * Track vacuum cost when issuing IO, not after waiting for it.
 		 * Otherwise we could end up issuing a lot of IO in a short timespan,
@@ -2649,7 +2679,9 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 		buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
 
 		/* new buffers are zero-filled */
+		SetBufferProtection(buffers[i], true, true);
 		MemSet(buf_block, 0, BLCKSZ);
+		SetBufferProtection(buffers[i], true, false);
 	}
 
 	/*
@@ -2877,7 +2909,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 		}
 
 		if (lock)
-			LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
+			LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 		TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
 	}
@@ -3136,6 +3168,10 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 				 * non-accessible in any case.
 				 */
 				VALGRIND_MAKE_MEM_DEFINED(BufHdrGetBlock(buf), BLCKSZ);
+
+#if defined(ENFORCE_BUFFER_PROT) && defined(ENFORCE_BUFFER_PROT_READ)
+				SetBufferProtection(BufferDescriptorGetBuffer(buf), true, false);
+#endif
 				break;
 			}
 		}
@@ -3208,6 +3244,10 @@ PinBuffer_Locked(BufferDesc *buf)
 	 */
 	VALGRIND_MAKE_MEM_DEFINED(BufHdrGetBlock(buf), BLCKSZ);
 
+#if defined(ENFORCE_BUFFER_PROT) && defined(ENFORCE_BUFFER_PROT_READ)
+	SetBufferProtection(BufferDescriptorGetBuffer(buf), true, false);
+#endif
+
 	/*
 	 * Since we hold the buffer spinlock, we can update the buffer state and
 	 * release the lock in one operation.
@@ -3305,6 +3345,10 @@ UnpinBufferNoOwner(BufferDesc *buf)
 		 */
 		VALGRIND_MAKE_MEM_NOACCESS(BufHdrGetBlock(buf), BLCKSZ);
 
+#if defined(ENFORCE_BUFFER_PROT) && defined(ENFORCE_BUFFER_PROT_READ)
+		SetBufferProtection(BufferDescriptorGetBuffer(buf), false, false);
+#endif
+
 		/* I'd better not still hold the buffer content lock */
 		Assert(!LWLockHeldByMe(BufferDescriptorGetContentLock(buf)));
 
@@ -4113,6 +4157,69 @@ CheckForBufferLeaks(void)
 #endif
 }
 
+/*
+ * To verify that we are following buffer locking rules, we can make pages
+ * inaccessible or read-only when we don't have sufficient locks etc.
+ *
+ * XXX: It might be possible to fold the VALGRIND_MAKE_MEM_NOACCESS() /
+ * VALGRIND_MAKE_MEM_DEFINED() calls into this.
+ */
+#ifdef ENFORCE_BUFFER_PROT
+void
+SetBufferProtection(Buffer buf, bool allow_reads, bool allow_writes)
+{
+	static long	pagesz = 0;
+	int			prot = PROT_NONE;
+	int			rc;
+
+	Assert(huge_pages_status != HUGE_PAGES_UNKNOWN &&
+		   huge_pages_status != HUGE_PAGES_TRY);
+	StaticAssertStmt(PROT_NONE == 0, "that can't be right");
+
+	if (unlikely(pagesz == 0))
+	{
+		pagesz = sysconf(_SC_PAGESIZE);
+
+		elog(DEBUG1, "sysconf(_SC_PAGESIZE) = %ld", pagesz);
+		if (pagesz == -1)
+		{
+			elog(ERROR, "sysconf(_SC_PAGESIZE) failed: %m");
+		}
+		else if (pagesz > BLCKSZ)
+		{
+			elog(DEBUG1, "pagesz > BLCKSZ, disabling buffer protection mode");
+			pagesz = -1;
+		}
+		else if (BLCKSZ % pagesz != 0)
+		{
+			elog(DEBUG1, "BLCKSZ %% pagesz != 0, disabling buffer protection mode");
+			pagesz = -1;
+		}
+		else if (huge_pages_status == HUGE_PAGES_ON)
+		{
+			/* can't set status in a granular enough way */
+			elog(DEBUG1, "huge pages enabled, disabling buffer protection mode");
+			pagesz = -1;
+		}
+	}
+
+	/* disabled */
+	if (pagesz == -1)
+		return;
+
+	if (allow_reads)
+		prot |= PROT_READ;
+
+	if (allow_writes)
+		prot |= PROT_WRITE;
+
+	rc = mprotect(BufferGetBlock(buf), BLCKSZ, prot);
+
+	if (rc != 0)
+		elog(ERROR, "mprotect(%d, %d) failed: %m", buf, prot);
+}
+#endif							/* ENFORCE_BUFFER_PROT */
+
 /*
  * Helper routine to issue warnings when a buffer is unexpectedly pinned
  */
@@ -4307,7 +4414,10 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	bufBlock = BufHdrGetBlock(buf);
 
 	/* Update page checksum if desired. */
+	SetBufferProtection(BufferDescriptorGetBuffer(buf), true, true);
 	PageSetChecksum((Page) bufBlock, buf->tag.blockNum);
+	/* FIXME: could theoretically be exclusively locked */
+	SetBufferProtection(BufferDescriptorGetBuffer(buf), true, false);
 
 	io_start = pgstat_prepare_io_time(track_io_timing);
 
@@ -5563,19 +5673,38 @@ void
 LockBuffer(Buffer buffer, int mode)
 {
 	BufferDesc *buf;
+	LWLock	   *content_lock;
 
 	Assert(BufferIsPinned(buffer));
 	if (BufferIsLocal(buffer))
 		return;					/* local buffers need no lock */
 
 	buf = GetBufferDescriptor(buffer - 1);
+	content_lock = BufferDescriptorGetContentLock(buf);
 
 	if (mode == BUFFER_LOCK_UNLOCK)
-		LWLockRelease(BufferDescriptorGetContentLock(buf));
+	{
+#ifdef ENFORCE_BUFFER_PROT
+		bool		was_exclusive;
+
+		was_exclusive = LWLockHeldByMeInMode(content_lock, LW_EXCLUSIVE);
+#endif							/* ENFORCE_BUFFER_PROT */
+
+		LWLockRelease(content_lock);
+
+#ifdef ENFORCE_BUFFER_PROT
+		if (was_exclusive)
+			SetBufferProtection(buffer, true, false);
+#endif							/* ENFORCE_BUFFER_PROT */
+	}
 	else if (mode == BUFFER_LOCK_SHARE)
-		LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_SHARED);
+		LWLockAcquire(content_lock, LW_SHARED);
 	else if (mode == BUFFER_LOCK_EXCLUSIVE)
-		LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
+	{
+		LWLockAcquire(content_lock, LW_EXCLUSIVE);
+
+		SetBufferProtection(buffer, true, true);
+	}
 	else
 		elog(ERROR, "unrecognized buffer lock mode: %d", mode);
 }
@@ -5589,6 +5718,7 @@ bool
 ConditionalLockBuffer(Buffer buffer)
 {
 	BufferDesc *buf;
+	bool		ret;
 
 	Assert(BufferIsPinned(buffer));
 	if (BufferIsLocal(buffer))
@@ -5596,8 +5726,13 @@ ConditionalLockBuffer(Buffer buffer)
 
 	buf = GetBufferDescriptor(buffer - 1);
 
-	return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
-									LW_EXCLUSIVE);
+	ret = LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+								   LW_EXCLUSIVE);
+
+	if (ret)
+		SetBufferProtection(buffer, true, true);
+
+	return ret;
 }
 
 /*
@@ -6732,6 +6867,11 @@ BufferPrepareToSetHintBits(Buffer buffer)
 		{
 			ResourceOwnerRememberBufferSettingHints(CurrentResourceOwner, buffer);
 
+#ifdef ENFORCE_BUFFER_PROT
+			if (!LWLockHeldByMeInMode(BufferDescriptorGetContentLock(desc), LW_EXCLUSIVE))
+				SetBufferProtection(buffer, true, true);
+#endif							/* ENFORCE_BUFFER_PROT */
+
 			return true;
 		}
 	}
@@ -6807,6 +6947,11 @@ BufferFinishSetHintBits(Buffer buffer, bool mark_dirty, bool buffer_std)
 	 * know if there is somebody waiting.
 	 */
 	ConditionVariableBroadcast(BufferDescriptorGetIOCV(desc));
+
+#ifdef ENFORCE_BUFFER_PROT
+	if (!LWLockHeldByMeInMode(BufferDescriptorGetContentLock(desc), LW_EXCLUSIVE))
+		SetBufferProtection(buffer, true, false);
+#endif							/* ENFORCE_BUFFER_PROT */
 }
 
 /*
@@ -6853,14 +6998,25 @@ BufferSetHintBits16(Buffer buffer, uint16 *ptr, uint16 val)
 	}
 	else
 	{
+#ifdef ENFORCE_BUFFER_PROT
+		bool exclusive;
+#endif
+
 		desc = GetBufferDescriptor(buffer - 1);
+
+#ifdef ENFORCE_BUFFER_PROT
+		exclusive = LWLockHeldByMeInMode(BufferDescriptorGetContentLock(desc),
+										 LW_EXCLUSIVE);
+
+		if (!exclusive)
+			SetBufferProtection(buffer, true, true);
+#endif
+
 		buf_state = LockBufHdr(desc);
 
 		if (likely((buf_state & BM_IO_IN_PROGRESS) == 0))
 		{
-
 			*ptr = val;
-
 			did_set = true;
 		}
 
@@ -6869,6 +7025,11 @@ BufferSetHintBits16(Buffer buffer, uint16 *ptr, uint16 val)
 		is_dirty = (buf_state & (BM_DIRTY | BM_JUST_DIRTIED)) == (BM_DIRTY | BM_JUST_DIRTIED);
 		if (did_set && !is_dirty)
 			MarkBufferDirtyHintImpl(buffer, true, true);
+
+#ifdef ENFORCE_BUFFER_PROT
+		if (!exclusive)
+			SetBufferProtection(buffer, true, false);
+#endif
 	}
 }
 
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int	ssl_renegotiation_limit;
  */
 int			huge_pages = HUGE_PAGES_TRY;
 int			huge_page_size;
-static int	huge_pages_status = HUGE_PAGES_UNKNOWN;
+int			huge_pages_status = HUGE_PAGES_UNKNOWN;
 
 /*
  * These variables are all dummies that don't do anything, except in some
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 125d3eb5fff..04f1e49b0e7 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -361,3 +361,27 @@
  * Enable tracing of syncscan operations (see also the trace_syncscan GUC var).
  */
 /* #define TRACE_SYNCSCAN */
+
+
+/*
+ * On some operating systems we know how to change whether pages are
+ * readable/writeable. We can use that to verify that we are following
+ * buffer locking rules, by making pages inaccessible or read-only when we
+ * don't hold sufficient locks etc. This obviously is fairly expensive, so
+ * by default we only do so in assert-enabled builds.
+ */
+#if defined(USE_ASSERT_CHECKING) && !defined(WIN32)
+#define ENFORCE_BUFFER_PROT
+#endif
+
+/*
+ * Protecting pages against being modified without an exclusive lock /
+ * BufferPrepareToSetHintBits() is reasonably cheap, as neither happens
+ * *that* often. Pinning/unpinning buffers is a lot more common, so calling
+ * mprotect() that often would be considerably more expensive.
+ *
+ * Therefore this is disabled by default, even in assert-enabled builds.
+ */
+#ifdef ENFORCE_BUFFER_PROT
+/* #define ENFORCE_BUFFER_PROT_READ */
+#endif
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 0bf6a45b500..50f7c7c29a0 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -494,6 +494,15 @@ extern Size StrategyShmemSize(void);
 extern void StrategyInitialize(bool init);
 extern bool have_free_buffer(void);
 
+#ifdef ENFORCE_BUFFER_PROT
+extern void SetBufferProtection(Buffer buf, bool allow_reads, bool allow_writes);
+#else
+static inline void
+SetBufferProtection(Buffer buf, bool allow_reads, bool allow_writes)
+{
+}
+#endif
+
 /* buf_table.c */
 extern Size BufTableShmemSize(int size);
 extern void InitBufTable(int size);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5d0c56841d4 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -63,6 +63,8 @@ typedef enum
 	SHMEM_TYPE_MMAP,
 }			PGShmemType;
 
+extern PGDLLIMPORT int huge_pages_status;
+
 #ifndef WIN32
 extern PGDLLIMPORT unsigned long UsedShmemSegID;
 #else
-- 
2.39.5

#19Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#18)
Re: AIO writes vs hint bits vs checksums

Hi,

On 2025-04-01 13:34:53 +0300, Heikki Linnakangas wrote:

Here's a rebase of these patches.

Thanks!

Curious what made you do this? Do you need any parts of this soon?

I've been hacking on this a bit more, btw, just haven't gotten around to the
necessary README rephrasings... :(

I went ahead and committed the "heapam: Only set tuple's block once per page
in pagemode" patch, because it was trivial and independent of the rest.

Thanks!

I think I should just commit the kill_prior_tuples test, it's better than
nothing.

Greetings,

Andres Freund

#20Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andres Freund (#19)
Re: AIO writes vs hint bits vs checksums

On 01/04/2025 18:08, Andres Freund wrote:

Hi,

On 2025-04-01 13:34:53 +0300, Heikki Linnakangas wrote:

Here's a rebase of these patches.

Thanks!

Curious what made you do this? Do you need any parts of this soon?

No, I was just browsing through the commitfest.

I went ahead and committed the "heapam: Only set tuple's block once per page
in pagemode" patch, because it was trivial and independent of the rest.

Thanks!

I think I should just commit the kill_prior_tuples test, it's better than
nothing.

+1. I didn't look at it closely, but more tests is better and it's
independent of the rest of the patches.

--
Heikki Linnakangas
Neon (https://neon.tech)

#21Nico Williams
nico@cryptonector.com
In reply to: Heikki Linnakangas (#11)
Re: AIO writes vs hint bits vs checksums

On Wed, Oct 30, 2024 at 02:16:51PM +0200, Heikki Linnakangas wrote:

Acquiring the exclusive lock in step 4 is just a way to wait for all the
existing share-lockers to release the lock. You wouldn't need to block new
share-lockers. We have LW_WAIT_UNTIL_FREE, which is almost what we need, but
it currently ignores share-lockers. So doing this "properly" would require
more changes to LWLocks. Briefly acquiring an exclusive lock seems
acceptable though.

The problem is starvation. For this you really want something more like
rwlocks that do not have the writer starvation problem. But rwlocks
have other problems too, like in this case forcing readers to wait.

What you want here is something more like a barrier: readers that did not
see BM_IO_IN_PROGRESS set on the page get to act as though it's not set,
readers that did see it set have to honor it, and the process that set the
bit waits only for the first group of readers, without ever blocking the
second group. That's something akin to an rwlock, but better -- in fact,
it resembles RCU.
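
To illustrate, here is a toy standalone sketch of that barrier-like scheme
using C11 atomics. All names in it (HintGate, hint_enter, io_begin and so
on) are invented for this example and are not PostgreSQL APIs. Hinters
that registered before the IO was announced are waited for, while hinters
that arrive afterwards simply back off, never blocking the writer.

#include <stdatomic.h>
#include <stdbool.h>
#include <sched.h>

typedef struct HintGate
{
	atomic_uint hint_setters;	/* hint-bit setters currently registered */
	atomic_bool io_pending;		/* someone wants to write the page out */
} HintGate;

/* Returns true if the caller may set hint bits; pair with hint_exit(). */
static bool
hint_enter(HintGate *g)
{
	atomic_fetch_add(&g->hint_setters, 1);
	if (atomic_load(&g->io_pending))
	{
		/* IO already announced: back off, act as if the hint can't be set */
		atomic_fetch_sub(&g->hint_setters, 1);
		return false;
	}
	return true;
}

static void
hint_exit(HintGate *g)
{
	atomic_fetch_sub(&g->hint_setters, 1);
}

/* Writer: announce the IO, then wait only for earlier-registered hinters. */
static void
io_begin(HintGate *g)
{
	atomic_store(&g->io_pending, true);
	while (atomic_load(&g->hint_setters) != 0)
		sched_yield();
}

static void
io_end(HintGate *g)
{
	atomic_store(&g->io_pending, false);
}

This only illustrates the waiting structure; it says nothing about where
such state would live or what it would cost per buffer.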

Nico
--

#22Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#13)
Re: AIO writes vs hint bits vs checksums

Hi,

On 2024-10-30 12:45:27 -0400, Andres Freund wrote:

On 2024-10-30 13:29:01 +0200, Heikki Linnakangas wrote:

On 30/10/2024 04:21, Andres Freund wrote:

Attached is a, unfortunately long, series of patches implementing what I
described upthread.

Review of the preparatory patches:

0001 Add very basic test for kill_prior_tuples

We currently don't exercise this patch for gist and hash, which seems
somewhat criminal.

Interesting to use the isolationtester for this. There's just one session,
so you're just using it to define reusable steps with handy names.

Yea. I had started out writing it as a pg_regress style test and it quickly got very
hard to understand.

I'm fine with that, but please add a comment to explain it.

Makes sense.

I wonder if it'd be more straightforward to make it a regular pg_regress
test though. There would be some repetition, but would it be so bad?

I found it to be quite bad. If testing just one AM it's ok-ish, but once you
test 2-3 it gets very long and repetitive. I guess we could use functions or
such to make it a bit less painful - but at that point, is it actually simpler?

You forgot to add the new test to 'isolation_schedule'.

Oops.

typos:
"inex" -> "index"
"does something approximately reasonble" -> "do something approximately
reasonable"

Oops^2.

Pushed the test with these changes.

Greetings,

Andres Freund

#23Maxim Orlov
orlovmg@gmail.com
In reply to: Andres Freund (#22)
1 attachment(s)
Re: AIO writes vs hint bits vs checksums

On Wed, 13 Aug 2025 at 22:38, Andres Freund <andres@anarazel.de> wrote:

Pushed the test with these changes.

I think there is one thing that could be improved here. The isolation
test index-killtuples (377b7ab1452) adds new dependencies on
contrib/btree_gin and contrib/btree_gist. These must be taken into
account in the Makefile.

Prior to this patch, the isolation tests could be run on their own with
"make check -C src/test/isolation"; now they only pass as part of
make check-world. Run standalone, this test fails with:
====
 step create_ext_btree_gist: CREATE EXTENSION btree_gist;
+ERROR:  extension "btree_gist" is not available
 step create_gist: CREATE INDEX kill_prior_tuple_gist ON kill_prior_tuple
USING gist (key);
+ERROR:  data type integer has no default operator class for access method
"gist"
 step flush: SELECT FROM pg_stat_force_next_flush();
====

--
Best regards,
Maxim Orlov.

Attachments:

0001-Fix-isolation-tests-run-after-adding-kill_prior_tupl.patch (application/octet-stream)
From d161c5d493b8851da9141282350ce6d77bb235a5 Mon Sep 17 00:00:00 2001
From: Maxim Orlov <orlovmg@gmail.com>
Date: Mon, 27 Oct 2025 14:28:31 +0300
Subject: [PATCH] Fix isolation tests run after adding kill_prior_tuples test

---
 src/test/isolation/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/test/isolation/Makefile b/src/test/isolation/Makefile
index ade2256ed3a..63594a18030 100644
--- a/src/test/isolation/Makefile
+++ b/src/test/isolation/Makefile
@@ -5,6 +5,8 @@
 PGFILEDESC = "pg_isolation_regress/isolationtester - multi-client test driver"
 PGAPPICON = win32
 
+EXTRA_INSTALL=contrib/btree_gin contrib/btree_gist
+
 subdir = src/test/isolation
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
-- 
2.43.0

#24Michael Paquier
michael@paquier.xyz
In reply to: Maxim Orlov (#23)
Re: AIO writes vs hint bits vs checksums

On Mon, Oct 27, 2025 at 02:48:43PM +0300, Maxim Orlov wrote:

I think there is one thing that could be improved here. The isolation
test index-killtuples (377b7ab1452) adds a new dependency:
contrib/btree_gin. contrib/btree_gist. This must be taken into
account in the makefile.

Also mentioned here:
/messages/by-id/aKJsWedftW7UX1WM@paquier.xyz

It looks like the conclusion would be to move this test into its own
test module to avoid the external contrib/ dependency in
src/test/isolation/, as duplicated Makefile rules in
src/test/isolation/Makefile are unpleasant.
--
Michael