AIO writes vs hint bits vs checksums
Hi,
Currently we modify pages while just holding a share lock, for hint bit
writes. Writing a buffer out only requires a share lock. Because of that we
can't compute checksums and write out pages in-place, as a concurrent hint bit
write can easily corrupt the checksum.
That's not great, but not awful for our current synchronous buffered IO, as
we only ever have a single page being written out at a time.
However, it becomes a problem even if we just want to write out in chunks
larger than a single page - we'd need to reserve not just one BLCKSZ sized
buffer for this, but make it PG_IOV_MAX * BLCKSZ sized. Perhaps still
tolerable.
With AIO this becomes a considerably bigger issue:
a) We can't just have one write in progress at a time, but many
b) To be able to implement AIO using workers the "source" or "target" memory
of the IO needs to be in shared memory
Besides that, the need to copy the buffers makes checkpoints with AIO
noticeably slower when checksums are enabled - it's not the checksum but the
copy that's the biggest source of the slowdown.
So far the AIO patchset has solved this by introducing a set of "bounce
buffers", which can be acquired and used as the source/target of IO when doing
it in-place into shared buffers isn't viable.
I am worried about that solution however, as either acquisition of bounce
buffers becomes a performance issue (that's how I did it at first, it was hard
to avoid regressions) or we reserve bounce buffers for each backend, in which
case the memory overhead for instances with relatively small amount of
shared_buffers and/or many connections can be significant.
Which lead me down the path of trying to avoid the need for the copy in the
first place: What if we don't modify pages while it's undergoing IO?
The naive approach would be to not set hint bits with just a shared lock - but
that doesn't seem viable at all. For performance we rely on hint bits being
set and in many cases we'll only encounter the page in shared mode. We could
implement a conditional lock upgrade to an exclusive lock and do so while
setting hint bits, but that'd obviously be concerning from a concurrency point
of view.
What I suspect we might want instead is something in between a share and an
exclusive lock, which is taken while setting a hint bit and which conflicts
with having an IO in progress.
At first blush it might sound attractive to introduce this at the level of
lwlocks. However, I doubt that is a good idea - it'd make lwlock.c more
complicated which would imply overhead for other users, while the precise
semantics would be fairly specific to buffer locking. A variant of this would
be to generalize lwlock.c to allow implementing different kinds of locks more
easily. But that's a significant project on its own and doesn't really seem
necessary for this specific project.
What I'd instead like to propose is to implement the right to set hint bits as
a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
BM_SETTING_HINTS to be unset to start IO.
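Very roughly, the interlock against BufferDesc->state could look like the
sketch below (the helper name is invented here, and a real version also has
to cope with BM_LOCKED, the buffer header spinlock bit, which is omitted):

static bool
TryAcquireHintBitRight(BufferDesc *buf)
{
    uint32      old_state = pg_atomic_read_u32(&buf->state);

    for (;;)
    {
        if (old_state & BM_IO_IN_PROGRESS)
            return false;       /* IO underway, caller just skips the hint */

        if (pg_atomic_compare_exchange_u32(&buf->state, &old_state,
                                           old_state | BM_SETTING_HINTS))
            return true;
        /* CAS failed, old_state was refreshed, retry */
    }
}

StartBufferIO() would do the mirror image: refuse to set BM_IO_IN_PROGRESS
(or wait) while BM_SETTING_HINTS is set.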
Naively implementing this, by acquiring and releasing the permission to set
hint bits in SetHintBits() unfortunately leads to a significant performance
regression. While the performance is unaffected for OLTPish workloads like
pgbench (both read and write), sequential scans of unhinted tables regress
significantly, due to the per-tuple lock acquisition this would imply.
But: We can address this and improve performance over the status quo! Today we
determine tuple visibility one-by-one, even when checking the
visibility of an entire page worth of tuples. That's not exactly free. I've
prototyped checking visibility of an entire page of tuples at once and it
indeed speeds up visibility checks substantially (in some cases seqscans are
over 20% faster!).
Once we have page-level visibility checks we can get the right to set hint
bits once for an entire page instead of doing it for every tuple - with that
in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
wins.
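As a sketch of what I mean by page-level checks (names invented here, not the
actual patch's API): collect the visible offsets for the whole page in one
pass, so the permission to set hint bits only needs to be obtained once per
page rather than once per tuple:

static int
heap_page_collect_visible(Relation rel, Buffer buffer, Snapshot snapshot,
                          OffsetNumber *visible)
{
    Page        page = BufferGetPage(buffer);
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
    OffsetNumber off;
    int         nvis = 0;

    for (off = FirstOffsetNumber; off <= maxoff; off++)
    {
        ItemId      lpp = PageGetItemId(page, off);
        HeapTupleData tup;

        if (!ItemIdIsNormal(lpp))
            continue;

        tup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
        tup.t_len = ItemIdGetLength(lpp);
        tup.t_tableOid = RelationGetRelid(rel);
        ItemPointerSet(&tup.t_self, BufferGetBlockNumber(buffer), off);

        if (HeapTupleSatisfiesVisibility(&tup, snapshot, buffer))
            visible[nvis++] = off;
    }

    return nvis;
}

The real version would additionally hand the "may set hint bits" decision
down into the visibility code once per page, instead of re-deciding it for
every tuple.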
Having a page level approach to setting hint bits has other advantages:
E.g. today, with wal_log_hints, we'll log hint bits on the first hint bit set
on the page and we don't mark a page dirty on hot standby. Which often will
result in hint bits not being persistently set on replicas until the page is frozen.
Another issue is that we'll often WAL log hint bits for a page (due to hint
bits being set), just to then immediately log another WAL record for the same
page (e.g. for pruning), which is obviously wasteful. With a different
interface we could combine the WAL records for both.
I've not prototyped either, but I'm fairly confident they'd be helpful.
Does this sound like a reasonable idea? Counterpoints? If it does sound
reasonable, I'll clean up my pile of hacks into something
semi-understandable...
Greetings,
Andres Freund
On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:
So far the AIO patchset has solved this by introducing a set of "bounce
buffers", which can be acquired and used as the source/target of IO when doing
it in-place into shared buffers isn't viable.
I am worried about that solution however, as either acquisition of bounce
buffers becomes a performance issue (that's how I did it at first, it was hard
to avoid regressions) or we reserve bounce buffers for each backend, in which
case the memory overhead for instances with relatively small amount of
shared_buffers and/or many connections can be significant.
But: We can address this and improve performance over the status quo! Today we
determine tuple visibility one-by-one, even when checking the
visibility of an entire page worth of tuples. That's not exactly free. I've
prototyped checking visibility of an entire page of tuples at once and it
indeed speeds up visibility checks substantially (in some cases seqscans are
over 20% faster!).
Nice! It sounds like you refactored the relationship between
heap_prepare_pagescan() and HeapTupleSatisfiesVisibility() to move the hint
bit setting upward or the iterate-over-tuples downward. Is that about right?
Once we have page-level visibility checks we can get the right to set hint
bits once for an entire page instead of doing it for every tuple - with that
in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
wins.
How did page-level+BM_SETTING_HINTS performance compare to performance of the
page-level change w/o the BM_SETTING_HINTS change?
Having a page level approach to setting hint bits has other advantages:
E.g. today, with wal_log_hints, we'll log hint bits on the first hint bit set
on the page and we don't mark a page dirty on hot standby. Which often will
result in hint bits not being persistently set on replicas until the page is frozen.
Nice way to improve that.
Does this sound like a reasonable idea? Counterpoints?
I guess the main part left to discuss is index scans or other scan types where
we'd either not do page-level visibility or we'd do page-level visibility
including tuples we wouldn't otherwise use. BM_SETTING_HINTS likely won't
show up so readily in index scan profiles, but the cost is still there. How
should we think about comparing the distributed cost of the buffer header
manipulations during index scans vs. the costs of bounce buffers?
Thanks,
nm
Hi,
On 2024-09-24 12:43:40 -0700, Noah Misch wrote:
On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:
So far the AIO patchset has solved this by introducing a set of "bounce
buffers", which can be acquired and used as the source/target of IO when doing
it in-place into shared buffers isn't viable.
I am worried about that solution however, as either acquisition of bounce
buffers becomes a performance issue (that's how I did it at first, it was hard
to avoid regressions) or we reserve bounce buffers for each backend, in which
case the memory overhead for instances with relatively small amount of
shared_buffers and/or many connections can be significant.
But: We can address this and improve performance over the status quo! Today we
determine tuple visibility one-by-one, even when checking the
visibility of an entire page worth of tuples. That's not exactly free. I've
prototyped checking visibility of an entire page of tuples at once and it
indeed speeds up visibility checks substantially (in some cases seqscans are
over 20% faster!).
Nice! It sounds like you refactored the relationship between
heap_prepare_pagescan() and HeapTupleSatisfiesVisibility() to move the hint
bit setting upward or the iterate-over-tuples downward. Is that about right?
I've tried about five variations, so I don't have one answer to this yet :).
One problem is that having repeated loops doing PageGetItemId(),
PageGetItem(), ItemIdGetLength() isn't exactly free. To some degree it can be
hidden by allowing for better superscalar execution, but not entirely.
I've been somewhat confused by the compiler-generated code around ItemId
handling for a while; it looks way more expensive than it should - it
regularly is a bottleneck due to the sheer number of instructions being
executed, leading to being frontend bound. But I never quite looked into it
deeply enough to figure out what's causing it / how to fix it.
Once we have page-level visibility checks we can get the right to set hint
bits once for an entire page instead of doing it for every tuple - with that
in place the "new approach" of setting hint bits only with BM_SETTING_HINTS
wins.
How did page-level+BM_SETTING_HINTS performance compare to performance of the
page-level change w/o the BM_SETTING_HINTS change?
Just ran that. There probably is a performance difference, but it's small
(<0.5%) making it somewhat hard to be certain. It looks like the main reason
for that is ConditionVariableBroadcast() on the iocv shows up even though
nobody is waiting.
I've been fighting that with AIO as well, so maybe it's time to figure out the
memory ordering rules that'd allow checking that without a full spinlock
acquisition.
If we figure it out, we perhaps should use the chance to get rid of
BM_PIN_COUNT_WAITER...
Does this sound like a reasonable idea? Counterpoints?
I guess the main part left to discuss is index scans or other scan types where
we'd either not do page-level visibility or we'd do page-level visibility
including tuples we wouldn't otherwise use. BM_SETTING_HINTS likely won't
show up so readily in index scan profiles, but the cost is still there.
I could indeed not make it show up in some simple index lookup heavy
workloads. I need to try some more extreme cases though (e.g. fetching all
tuples in a table via an index or having very long HOT chains).
If it's not visible cost-wise compared to all the other costs of index scans -
does it matter? If it's not visible it's either because it proportionally is
very small or because it's completely hidden by superscalar execution.
Another thing I forgot to mention that probably fits into the "tradeoffs"
bucket: Because BM_SETTING_HINTS would be just a single bit, one backend
setting hint bits would block out another backend setting hint bits. In most
situations that'll be fine, or even faster than not doing so due to reducing
cache line ping-pong, but in cases of multiple backends doing index lookups to
different unhinted tuples on the same page it could be a bit worse.
But I suspect that's fine because it's hard to believe that you could have
enough index lookups to unhinted tuples for that to be a bottleneck -
something has to produce all those unhinted tuples after all, and that's
rather more expensive. And for single-tuple visibility checks the window in
which hint bits are set is very small.
How should we think about comparing the distributed cost of the buffer
header manipulations during index scans vs. the costs of bounce buffers?
Well, the cost of bounce buffers would be borne as long as postgres is up,
whereas a not-measurable (if it indeed isn't) cost during index scans wouldn't
really show up. If there are cases where the cost of the more expensive hint
bit logic does show up, it'll get a lot harder to weigh.
So far my prototype uses the path that avoids hint bits being set while IO is
going on all the time, not just when checksums are enabled. We could change
that, I guess. However, our habit of modifying buffers while IO is going on is
causing issues with filesystem level checksums as well, as evidenced by the
fact that debug_io_direct = data on btrfs causes filesystem corruption. So I
tend to think it'd be better to just stop doing that altogether (we also do
that for WAL, when writing out a partial page, but a potential fix there would
be different, I think).
A thing I forgot to mention: Bounce buffers are kind of an architectural dead
end, in that we wouldn't need them in that form if we get to a threaded
postgres.
Zooming out (a lot) more: I like the idea of having a way to get the
permission to perform some kinds of modifications on a page without an
exclusive lock. While obviously a lot more work, I do think there's some
potential to have some fast-paths that perform work on a page level without
blocking out readers. E.g. some simple cases of insert could correctly be done
without blocking out readers (by ordering the update of the max page offset
Greetings,
Andres Freund
On Wed, Sep 25, 2024 at 8:30 AM Andres Freund <andres@anarazel.de> wrote:
Just ran that. There probably is a performance difference, but it's small
(<0.5%) making it somewhat hard to be certain. It looks like the main reason
for that is ConditionVariableBroadcast() on the iocv shows up even though
nobody is waiting.
. o O { Gotta fix that. Memory barriers might be enough to check for
empty wait list?, and even in the slow path, atomic wait lists or
something better than spinlocks... }
However, our habit of modifying buffers while IO is going on is
causing issues with filesystem level checksums as well, as evidenced by the
fact that debug_io_direct = data on btrfs causes filesystem corruption. So I
tend to think it'd be better to just stop doing that altogether (we also do
that for WAL, when writing out a partial page, but a potential fix there would
be different, I think).
+many. Interesting point re the WAL variant. For the record, here's
some discussion and a repro for that problem, which Andrew currently
works around in a build farm animal with mount options:
/messages/by-id/CA+hUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw@mail.gmail.com
On Tue, Sep 24, 2024 at 04:30:25PM -0400, Andres Freund wrote:
On 2024-09-24 12:43:40 -0700, Noah Misch wrote:
On Tue, Sep 24, 2024 at 11:55:08AM -0400, Andres Freund wrote:
Besides that, the need to copy the buffers makes checkpoints with AIO
noticeably slower when checksums are enabled - it's not the checksum but the
copy that's the biggest source of the slowdown.
How big is that copy's contribution to the slowdown there? A measurable CPU
overhead on writes likely does outweigh the unmeasurable overhead on index
scans, but ...
Does this sound like a reasonable idea? Counterpoints?
How should we think about comparing the distributed cost of the buffer
header manipulations during index scans vs. the costs of bounce buffers?
Well, the cost of bounce buffers would be borne as long as postgres is up,
whereas a not-measurable (if it indeed isn't) cost during index scans wouldn't
really show up.
... neither BM_SETTING_HINTS nor keeping bounce buffers looks like a bad
decision. From what I've heard so far of the performance effects, if it were
me, I would keep the bounce buffers. I'd pursue BM_SETTING_HINTS and bounce
buffer removal as a distinct project after the main AIO capability. Bounce
buffers have an implementation. They aren't harming other design decisions.
The AIO project is big, so I'd want to err on the side of not designating
other projects as its prerequisites.
Zooming out (a lot) more: I like the idea of having a way to get the
permission to perform some kinds of modifications on a page without an
exclusive lock. While obviously a lot more work, I do think there's some
potential to have some fast-paths that perform work on a page level without
blocking out readers. E.g. some simple cases of insert could correctly be done
without blocking out readers (by ordering the update of the max page offset
True.
Andres Freund <andres@anarazel.de> wrote:
What I'd instead like to propose is to implement the right to set hint bits as
a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
BM_SETTING_HINTS to be unset to start IO.
Naively implementing this, by acquiring and releasing the permission to set
hint bits in SetHintBits() unfortunately leads to a significant performance
regression. While the performance is unaffected for OLTPish workloads like
pgbench (both read and write), sequential scans of unhinted tables regress
significantly, due to the per-tuple lock acquisition this would imply.
An alternative approach: introduce a flag that tells that the checksum is
being computed, and disallow setting hint bits when that flag is set. As long
as the checksum computation takes much less time than the IO, fewer hint
bit updates should be rejected.
Of course, SetHintBits() would have to update the checksum too. But if it
could determine where the hint bit is located in the buffer and if "some
intermediate state" of the computation was maintained for each page in shared
buffers, then the checksum update might be cheaper than the initial
computation. But I'm not sure I understand the algorithm enough.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Antonin Houska <ah@cybertec.at> wrote:
Andres Freund <andres@anarazel.de> wrote:
What I'd instead like to propose is to implement the right to set hint bits as
a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
BM_SETTING_HINTS to be unset to start IO.
Naively implementing this, by acquiring and releasing the permission to set
hint bits in SetHintBits() unfortunately leads to a significant performance
regression. While the performance is unaffected for OLTPish workloads like
pgbench (both read and write), sequential scans of unhinted tables regress
significantly, due to the per-tuple lock acquisition this would imply.
An alternative approach: introduce a flag that tells that the checksum is
being computed, and disallow setting hint bits when that flag is set. As long
as the checksum computation takes much less time than the IO, fewer hint
bit updates should be rejected.
Well, the checksum actually should not be computed during the IO, so the IO
would still disallow hint bit updates :-(
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On Wed, Sep 25, 2024 at 12:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Sep 25, 2024 at 8:30 AM Andres Freund <andres@anarazel.de> wrote:
However, our habit of modifying buffers while IO is going on is
causing issues with filesystem level checksums as well, as evidenced by the
fact that debug_io_direct = data on btrfs causes filesystem corruption. So I
tend to think it'd be better to just stop doing that altogether (we also do
that for WAL, when writing out a partial page, but a potential fix there would
be different, I think).
+many. Interesting point re the WAL variant. For the record, here's
some discussion and a repro for that problem, which Andrew currently
works around in a build farm animal with mount options:
/messages/by-id/CA+hUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw@mail.gmail.com
Here's an interesting new development in that area, this time from
OpenZFS, which committed its long awaited O_DIRECT support a couple of
weeks ago[1] and seems to have taken a different direction since that
last discussion. Clearly it has the same checksum stability problem
as BTRFS and PostgreSQL itself, so an O_DIRECT mode with the goal of
avoiding copying and caching must confront that and break *something*,
or accept something like bounce buffers and give up the zero-copy
goal. Curiously, they seem to have landed on two different solutions
with three different possible behaviours: (1) On FreeBSD, temporarily
make the memory non-writeable, (2) On Linux, they couldn't do that so
they have an extra checksum verification on write. I haven't fully
grokked all this yet, or even tried it, and it's not released or
anything, but it looks a bit like all three behaviours are bad for our
current hint bit design: on FreeBSD, setting a hint bit might crash
(?) if a write is in progress in another process, and on Linux,
depending on zfs_vdev_direct_write_verify, either the concurrent write
might fail (= checkpointer failing on EIO because someone concurrently
set a hint bit) or a later read might fail (= file is permanently
corrupted and you don't find out until later, like btrfs). I plan to
look more closely soon and see if I understood that right...
[1]: https://github.com/openzfs/zfs/pull/10018/commits/d7b861e7cfaea867ae28ab46ab11fba89a5a1fda
Hi,
Attached is a, unfortunately long, series of patches implementing what I
described upthread.
0001 Add very basic test for kill_prior_tuples
We currently don't exercise this path for gist and hash, which seems
somewhat criminal.
0002 heapam: Move logic to handle HEAP_MOVED into a helper function
A prep patch, which consolidates HEAP_MOVED handling into a helper. This
isn't strictly necessary, but I got bothered by it while writing this patch
series.
0003 bufmgr: Add BufferLockHeldByMe()
Useful for writing assertions in later patches.
0004 heapam: Use exclusive lock on old page in CLUSTER
When I started looking into this I wanted to make sure that it's actually
safe to skip setting hint bits in all places. One place that's *not* safe
is cluster - we assume that all hint bits are set in rewrite_heap_tuple().
This might not strictly be required, but it seemed very fragile as
is. While using an exclusive lock does seem like it makes it a bit safer,
it's not a very good fix, as we still dirty the old buffer, which seems
like a shame when one uses VACUUM FULL.
0005 heapam: Only set tuple's block once per page in pagemode
Minor, but measurable, optimization.
0006 heapam: Add batch mode mvcc check and use it in page mode
This is a good bit faster on its own, but is required to avoid a
performance regression later, when setting hint bits only when no IO is
going on at the same time.
0007 bufmgr: Make it easier to change number of buffer state bits
This just makes it easier to change the division of bits in
BufferDesc->state.
0008 bufmgr: Add interface to acquire right to set hint bits
The main change. Adds BufferPrepareToSetHintBits() which, if true is
returned, gives the right to modify a buffer without an exclusive
lock. Once the modification is done BufferFinishSetHintBits() needs to be
called.
I dithered a bit on the name, we don't really call them hint bits in all
the places that modify buffers without an exclusive lock.
The new infra is unused as of this commit.
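For illustration, the intended calling pattern, loosely modeled on
SetHintBits() in heapam_visibility.c (everything apart from the two new
bufmgr calls is simplified/assumed here):

static inline void
SetHintBits(HeapTupleHeader tuple, Buffer buffer,
            uint16 infomask, TransactionId xid)
{
    /* commit-LSN / async-commit interlock elided, see the real SetHintBits() */

    /* new: only touch the page if no write IO is in progress on the buffer */
    if (!BufferPrepareToSetHintBits(buffer))
        return;                 /* hint bits are optional, skipping is fine */

    tuple->t_infomask |= infomask;
    MarkBufferDirtyHint(buffer, true);

    BufferFinishSetHintBits(buffer);
}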
0009 heapam: Acquire right to set hint bits
Moves over heapam to the new interface.
0010 Acquire right to set hint bits in the remaining places
Moves indexams and freespace/ over to the new interface.
0011 bufmgr: Detect some missing BufferPrepareToSetHintBits() calls
Now that we don't set hint bits without calling
BufferPrepareToSetHintBits() first, we can detect many missing calls by
adding an assertion to MarkBufferDirtyHint().
0012 bufmgr: Don't copy pages while writing out
Finally remove the copy of the buffer while writing it out.
0013 WIP: bufmgr: Detect some bad buffer accesses
This is a lot more thorough infrastructure for detecting modifications of
buffers without appropriate locks, by using mprotect() to detect such
cases. Only works when not using huge pages, and not on windows (not sure
how mprotect() is spelled there) and not arm macos (due to 16kB pages).
Has two modes, one just disallows modifications, and is reasonably
cheap. The more expensive mode disallows any access when the buffer is
unpinned. The commit message has some performance numbers.
This has proven to be a good investment as far as this thread goes - I
wasn't initially aware that freespace modified buffers without a lock and
this quickly found that.
Greetings,
Andres Freund
Attachments:
On 30/10/2024 04:21, Andres Freund wrote:
Attached is a, unfortunately long, series of patches implementing what I
described upthread.
Review of the preparatory patches:
0001 Add very basic test for kill_prior_tuples
We currently don't exercise this path for gist and hash, which seems
somewhat criminal.
Interesting to use the isolationtester for this. There's just one
session, so you're just using it to define reusable steps with handy
names. I'm fine with that, but please add a comment to explain it. I
wonder if it'd be more straightforward to make it a regular pg_regress
test though. There would be some repetition, but would it be so bad?
You forgot to add the new test to 'isolation_schedule'.
typos:
"inex" -> "index"
"does something approximately reasonble" -> "do something
approximately reasonable"
0002 heapam: Move logic to handle HEAP_MOVED into a helper function
A prep patch, which consolidates HEAP_MOVED handling into a helper. This
isn't strictly necessary, but I got bothered by it while writing this patch
series.
+1
+/*
+ * If HEAP_MOVED_OFF or HEAP_MOVED_IN are set on the tuple, remove them and
+ * adjust hint bits. See the comment for SetHintBits() for more background.
+ *
+ * This helper returns false if the row ought to be invisible, true otherwise.
+ */
+static inline bool
+HeapTupleCleanMoved(HeapTupleHeader tuple, Buffer buffer)
+{
+    TransactionId xvac;
+
+    /* only used by pre-9.0 binary upgrades */
+    if (likely(!(tuple->t_infomask & (HEAP_MOVED_OFF | HEAP_MOVED_IN))))
+        return true;
+
+    ...
+}
This is so unlikely these days that this function probably should not be
inlined anymore.
0003 bufmgr: Add BufferLockHeldByMe()
Useful for writing assertions in later patches.
+    if (mode == BUFFER_LOCK_EXCLUSIVE)
+        lwmode = LW_EXCLUSIVE;
+    else if (mode == BUFFER_LOCK_SHARE)
+        lwmode = LW_SHARED;
+    else
+    {
+        Assert(false);
+        pg_unreachable();
+        lwmode = LW_EXCLUSIVE; /* assuage compiler */
+    }
I just learned a new word :-).
This is fine, but to avoid the assuaging, you could also do this:
if (mode == BUFFER_LOCK_EXCLUSIVE)
lwmode = LW_EXCLUSIVE;
else
{
Assert(mode == BUFFER_LOCK_SHARE);
lwmode = LW_SHARED;
}
0004 heapam: Use exclusive lock on old page in CLUSTER
When I started looking into this I wanted to make sure that it's actually
safe to skip setting hint bits in all places. One place that's *not* safe
is cluster - we assume that all hint bits are set in rewrite_heap_tuple().
This might not strictly be required, but it seemed very fragile as
is. While using an exclusive lock does seem like it makes it a bit safer,
it's not a very good fix, as we still dirty the old buffer, which seems
like a shame when one uses VACUUM FULL.
+    /*
+     * To be able to guarantee that we can set the hint bit, acquire an
+     * exclusive lock on the old buffer. We need the hint bits to be set
+     * as otherwise reform_and_rewrite_tuple() -> rewrite_heap_tuple() ->
+     * heap_freeze_tuple() will get confused.
+     *
Hmm, I don't see any comments in the reform_and_rewrite_tuple() ->
rewrite_heap_tuple() -> heap_freeze_tuple() path about requiring hint
bits to be set. How does it get confused?
0005 heapam: Only set tuple's block once per page in pagemode
Minor, but measurable, optimization.
+1, this makes sense independently of all the other patches.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1748eafa100..d0cb4b1e29b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -981,6 +981,9 @@ heapgettup_pagemode(HeapScanDesc scan,
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;

+		/* set block once per page, instead of doing so for every tuple */
+		BlockIdSet(&tuple->t_self.ip_blkid, scan->rs_cblock);
+
 		/* lineindex now references the next or previous visible tid */

continue_page:
@@ -995,7 +998,7 @@ continue_page:
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
+			tuple->t_self.ip_posid = lineoff;

 			/* skip any tuples that don't match the scan key */
 			if (key != NULL &&
Should probably use ItemPointerSetBlockNumber and
ItemPointerSetOffsetNumber.
0006 heapam: Add batch mode mvcc check and use it in page mode
This is a good bit faster on its own, but is required to avoid a
performance regression later, when setting hint bits only when no IO is
going on at the same time.
The BATCHMVCC_FEWER_ARGS stuff is pretty weird. Hard to imagine that
it'd make any difference how you pass those arguments.
+            if (likely(valid))
+            {
+                vistuples_dense[nvis] = tup->t_self.ip_posid;
+                nvis++;
+            }
Is the likely(valid) hint important for performance here? It would be a
bad hint for a table with lots of dead tuples. I feel we'd better rely
on the branch predictor for this one.
0007 bufmgr: Make it easier to change number of buffer state bits
This just makes it easier to change the division of bits in
BufferDesc->state.
Looks good to me.
--
Heikki Linnakangas
Neon (https://neon.tech)
On 24/09/2024 18:55, Andres Freund wrote:
What I'd instead like to propose is to implement the right to set hint bits as
a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
BM_SETTING_HINTS to be unset to start IO.
Naively implementing this, by acquiring and releasing the permission to set
hint bits in SetHintBits() unfortunately leads to a significant performance
regression. While the performance is unaffected for OLTPish workloads like
pgbench (both read and write), sequential scans of unhinted tables regress
significantly, due to the per-tuple lock acquisition this would imply.
It'd be nice to implement this without having to set and clear
BM_SETTING_HINTS every time. Could we put the overhead on the
FlushBuffer() instead?
How about something like this:
To set hint bits:
1. Lock page in SHARED mode.
2. Read BM_IO_IN_PROGRESS
3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
4. Unlock page
To flush a buffer:
1. Lock page in SHARED mode
2. Set BM_IO_IN_PROGRESS
3. Read the lock count on the buffer lock, to see if we're the only locker.
4. If anyone else is holding the lock, upgrade it to exclusive mode, and
immediately downgrade back to share mode.
5. calculate CRC, flush the buffer
6. Clear BM_IO_IN_PROGRESS and unlock page.
This goes back to the idea of adding LWLock support for this, but the
amount of changes could be pretty small. The missing operation we don't
have today is reading the share-lock count on the lock in step 3. That
seems simple to add.
Acquiring the exclusive lock in step 4 is just a way to wait for all the
existing share-lockers to release the lock. You wouldn't need to block
new share-lockers. We have LW_WAIT_UNTIL_FREE, which is almost what we
need, but it currently ignores share-lockers. So doing this "properly"
would require more changes to LWLocks. Briefly acquiring an exclusive
lock seems acceptable though.
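For step 3, the missing primitive would be something like this (hypothetical
name; LW_SHARED_MASK is currently private to lwlock.c, so it would have to
live there):

static bool
LWLockHasOtherSharers(LWLock *lock)
{
    uint32      state = pg_atomic_read_u32(&lock->state);

    /* FlushBuffer() itself holds one of the share locks at this point */
    return (state & LW_SHARED_MASK) > 1;
}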
But: We can address this and improve performance over the status quo! Today we
determine tuple visibility one-by-one, even when checking the
visibility of an entire page worth of tuples.
Yes, batching the visibility checks makes sense in any case.
--
Heikki Linnakangas
Neon (https://neon.tech)
Hi,
Thanks for the quick review!
On 2024-10-30 14:16:51 +0200, Heikki Linnakangas wrote:
On 24/09/2024 18:55, Andres Freund wrote:
What I'd instead like to propose is to implement the right to set hint bits as
a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
BM_SETTING_HINTS to be unset to start IO.
Naively implementing this, by acquiring and releasing the permission to set
hint bits in SetHintBits() unfortunately leads to a significant performance
regression. While the performance is unaffected for OLTPish workloads like
pgbench (both read and write), sequential scans of unhinted tables regress
significantly, due to the per-tuple lock acquisition this would imply.
It'd be nice to implement this without having to set and clear
BM_SETTING_HINTS every time.
True - but it's worth noting that we only need to set BM_SETTING_HINTS if
there are any hint bits that need to be set. So in the most common paths we
won't need to deal with it at all.
Could we put the overhead on the FlushBuffer()
instead?
How about something like this:
To set hint bits:
1. Lock page in SHARED mode.
2. Read BM_IO_IN_PROGRESS
3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
4. Unlock page
To flush a buffer:
1. Lock page in SHARED mode
2. Set BM_IO_IN_PROGRESS
3. Read the lock count on the buffer lock, to see if we're the only locker.
4. If anyone else is holding the lock, upgrade it to exclusive mode, and
immediately downgrade back to share mode.
5. calculate CRC, flush the buffer
6. Clear BM_IO_IN_PROGRESS and unlock page.
This goes back to the idea of adding LWLock support for this, but the amount
of changes could be pretty small. The missing operation we don't have today
is reading the share-lock count on the lock in step 3. That seems simple to
add.
I've played around with a bunch of ideas like this. There are two main
reasons I didn't like them that much in the end:
1) The worst case latency impacts seemed to make them not that
interesting. A buffer that is heavily contended with share locks might not
get down to zero share lockers for quite a while. That's not a problem for
individual FlushBuffer() calls, but it could very well add up to a decent
sized delay for something like a checkpoint that has to flush a lot of
buffers.
Waiting for all pre-existing share lockers is easier said than done. We
don't record the acquisition order anywhere and a share-lock release won't
wake anybody if the lockcount doesn't reach zero. Changing that wouldn't
exactly be free and the cost would be borne by all lwlock users.
2) They are based on holding an lwlock. But it's actually quite conceivable
that we'd want to set something hint-bit-like without any lock, as we
e.g. currently do for freespace/. That would work with something like the
approach I chose here, but not if we rely on lwlocks.
E.g. with a bit of work we could actually do sequential scans without
blocking concurrent buffer modifications by carefully ordering when
pd_lower is modified and making sure that tuple header fields are written
in a sensible order.
Acquiring the exclusive lock in step 4 is just a way to wait for all the
existing share-lockers to release the lock. You wouldn't need to block new
share-lockers. We have LW_WAIT_UNTIL_FREE, which is almost what we need, but
it currently ignores share-lockers. So doing this "properly" would require
more changes to LWLocks. Briefly acquiring an exclusive lock seems
acceptable though.
I don't think LW_WAIT_UNTIL_FREE is that easily extended to cover something
like this, because share lockers can acquire/release the lock without waking
anybody. And changing that would be rather expensive for everyone...
But: We can address this and improve performance over the status quo! Today we
determine tuple visibility one-by-one, even when checking the visibility
of an entire page worth of tuples.
Yes, batching the visibility checks makes sense in any case.
Cool. The wins by doing that are noticeable...
Greetings,
Andres Freund
Hi,
On 2024-10-30 13:29:01 +0200, Heikki Linnakangas wrote:
On 30/10/2024 04:21, Andres Freund wrote:
Attached is a, unfortunately long, series of patches implementing what I
described upthread.Review of the preparatory patches:
0001 Add very basic test for kill_prior_tuples
We currently don't exercise this path for gist and hash, which seems
somewhat criminal.
Interesting to use the isolationtester for this. There's just one session,
so you're just using it to define reusable steps with handy names.
Yea. I had started out writing it as a pg_regress style test and it quickly got very
hard to understand.
I'm fine with that, but please add a comment to explain it.
Makes sense.
I wonder if it'd be more straightforward to make it a regular pg_regress
test though. There would be some repetition, but would it be so bad?
I found it to be quite bad. If testing just one AM it's ok-ish, but once you
test 2-3 it gets very long and repetitive. I guess we could use functions or
such to make it a bit less painful - but at that point, is it actually simpler?
You forgot to add the new test to 'isolation_schedule'.
Oops.
typos:
"inex" -> "index"
"does something approximately reasonble" -> "do something approximately
reasonable"
Oops^2.
+/*
+ * If HEAP_MOVED_OFF or HEAP_MOVED_IN are set on the tuple, remove them and
+ * adjust hint bits. See the comment for SetHintBits() for more background.
+ *
+ * This helper returns false if the row ought to be invisible, true otherwise.
+ */
+static inline bool
+HeapTupleCleanMoved(HeapTupleHeader tuple, Buffer buffer)
+{
+    TransactionId xvac;
+
+    /* only used by pre-9.0 binary upgrades */
+    if (likely(!(tuple->t_infomask & (HEAP_MOVED_OFF | HEAP_MOVED_IN))))
+        return true;
+
+    ...
+}
This is so unlikely these days that this function probably should not be
inlined anymore.
Because I put the check for HEAP_MOVED in the function, it is important that
it's inlined. The compiler can understand that the rest of the function is
unlikely to be executed and group it with other rarely-executed code.
0003 bufmgr: Add BufferLockHeldByMe()
Useful for writing assertions in later patches.
+    if (mode == BUFFER_LOCK_EXCLUSIVE)
+        lwmode = LW_EXCLUSIVE;
+    else if (mode == BUFFER_LOCK_SHARE)
+        lwmode = LW_SHARED;
+    else
+    {
+        Assert(false);
+        pg_unreachable();
+        lwmode = LW_EXCLUSIVE; /* assuage compiler */
+    }
I just learned a new word :-).
Heh.
This is fine, but to avoid the assuaging, you could also do this:
if (mode == BUFFER_LOCK_EXCLUSIVE)
lwmode = LW_EXCLUSIVE;
else
{
Assert(mode == BUFFER_LOCK_SHARE);
lwmode = LW_SHARED;
}
I have no opinion about which version is better...
0004 heapam: Use exclusive lock on old page in CLUSTER
When I started looking into this I wanted to make sure that it's actually
safe to skip setting hint bits in all places. One place that's *not* safe
is cluster - we assume that all hint bits are set in rewrite_heap_tuple().
This might not strictly be required, but it seemed very fragile as
is. While using an exclusive lock does seem like it makes it a bit safer,
it's not a very good fix, as we still dirty the old buffer, which seems
like a shame when one uses VACUUM FULL.
+    /*
+     * To be able to guarantee that we can set the hint bit, acquire an
+     * exclusive lock on the old buffer. We need the hint bits to be set
+     * as otherwise reform_and_rewrite_tuple() -> rewrite_heap_tuple() ->
+     * heap_freeze_tuple() will get confused.
+     *
Hmm, I don't see any comments in the reform_and_rewrite_tuple() ->
rewrite_heap_tuple() -> heap_freeze_tuple() path about requiring hint bits
to be set.
There's this comment is in heap_prepare_freeze_tuple():
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
* (else we should be removing the tuple, not freezing it).
I don't remember the precise details of what all goes wrong, I think there was
more than one avenue.
One example is stuff like this:
/*
* If the tuple has been updated, check the old-to-new mapping hash table.
*/
if (!((old_tuple->t_data->t_infomask & HEAP_XMAX_INVALID) ||
HeapTupleHeaderIsOnlyLocked(old_tuple->t_data)) &&
!HeapTupleHeaderIndicatesMovedPartitions(old_tuple->t_data) &&
!(ItemPointerEquals(&(old_tuple->t_self),
&(old_tuple->t_data->t_ctid))))
if the original tuple had an xmax but we didn't unset it in
HeapTupleSatisfiesVacuum(), due to not being allowed to write a hint bit,
we'll behave differently here. But at the same time
heapam_relation_copy_for_cluster() will treat the tuple as dead and call
rewrite_heap_dead_tuple().
How does it get confused?
Corrupted visibility information after a CLUSTER.
0006 heapam: Add batch mode mvcc check and use it in page mode
This is a good bit faster on its own, but is required to avoid a
performance regression later, when setting hint bits only when no IO is
going on at the same time.
The BATCHMVCC_FEWER_ARGS stuff is pretty weird. Hard to imagine that it'd
make any difference how you pass those arguments.
I unfortunately found it to make a difference :(, even after trying to account
for binary layout difference by randomizing section order via linker flags.
+            if (likely(valid))
+            {
+                vistuples_dense[nvis] = tup->t_self.ip_posid;
+                nvis++;
+            }
Is the likely(valid) hint important for performance here? It would be a bad
hint for a table with lots of dead tuples. I feel we'd better rely on the
branch predictor for this one.
It did :(
Greetings,
Andres Freund
On Tue, 2024-09-24 at 11:55 -0400, Andres Freund wrote:
What I suspect we might want instead is something in between a share and an
exclusive lock, which is taken while setting a hint bit and which conflicts
with having an IO in progress.
I am starting to wonder if shared content locks are really the right
concept at all. It makes sense for simple mutexes, but we are already
more complex than that, and you are suggesting adding on to that
complexity. Which I agree is a good idea, I'm just wondering if we
could go even further.
The README states that a shared lock is necessary for visibility
checking, but can we just be more careful with the ordering and
atomicity of visibility changes in the page?
* carefully order reads and writes of xmin/xmax/hints (would
that create too many read barriers in the tqual.c code?)
* write line pointer after tuple is written
We would still have pins and cleanup locks to prevent data removal.
We'd have the logic you suggest that would prevent modification during
IO. And there would still need to be an exclusive content locks so that
two inserts don't try to allocate the same line pointer, or lock the
same tuple.
If PD_ALL_VISIBLE is set it's even simpler.
Am I missing some major hazards?
Regards,
Jeff Davis
Hi,
On 2024-10-30 10:47:35 -0700, Jeff Davis wrote:
On Tue, 2024-09-24 at 11:55 -0400, Andres Freund wrote:
What I suspect we might want instead is something in between a share and an
exclusive lock, which is taken while setting a hint bit and which conflicts
with having an IO in progress.
I am starting to wonder if shared content locks are really the right
concept at all. It makes sense for simple mutexes, but we are already
more complex than that, and you are suggesting adding on to that
complexity.
What I am proposing isn't making the content lock more complicated, it's
orthogonal to the content lock.
Which I agree is a good idea, I'm just wondering if we could go even
further.
The README states that a shared lock is necessary for visibility
checking, but can we just be more careful with the ordering and
atomicity of visibility changes in the page?
* carefully order reads and writes of xmin/xmax/hints (would
that create too many read barriers in the tqual.c code?)
* write line pointer after tuple is written
It's possible, but it'd be a lot of work. And you wouldn't need to just do
this for heap, but all the indexes too, to make progress on the
don't-set-hint-bits-during-io front. So I don't think it makes sense to tie
these things together.
I do think that it's an argument for not importing all the complexity into
lwlock.c though.
We would still have pins and cleanup locks to prevent data removal.
As-is cleanup locks only work in coordination with content locks. While
cleanup is ongoing we need to prevent anybody from starting to look at the
page - without acquiring something like a shared lock that's not easy.
We'd have the logic you suggest that would prevent modification during
IO. And there would still need to be an exclusive content lock so that
two inserts don't try to allocate the same line pointer, or lock the
same tuple.
If PD_ALL_VISIBLE is set it's even simpler.
Am I missing some major hazards?
I don't think anything fundamental, but it's a decidedly nontrivial amount of
work.
Greetings,
Andres Freund
Hi,
On 2024-10-30 09:58:30 -0400, Andres Freund wrote:
On 2024-10-30 14:16:51 +0200, Heikki Linnakangas wrote:
Could we put the overhead on the FlushBuffer()
instead?
How about something like this:
To set hint bits:
1. Lock page in SHARED mode.
2. Read BM_IO_IN_PROGRESS
3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
4. Unlock page
To flush a buffer:
1. Lock page in SHARED mode
2. Set BM_IO_IN_PROGRESS
3. Read the lock count on the buffer lock, to see if we're the only locker.
4. If anyone else is holding the lock, upgrade it to exclusive mode, and
immediately downgrade back to share mode.
5. calculate CRC, flush the buffer
6. Clear BM_IO_IN_PROGRESS and unlock page.
This goes back to the idea of adding LWLock support for this, but the amount
of changes could be pretty small. The missing operation we don't have today
is reading the share-lock count on the lock in step 3. That seems simple to
add.I've played around with a bunch of ideas like this. There are two main
reasons I didn't like them that much in the end:
1) The worst case latency impacts seemed to make them not that
interesting. A buffer that is heavily contended with share locks might not
get down to zero share lockers for quite a while. That's not a problem for
individual FlushBuffer() calls, but it could very well add up to a decent
sized delay for something like a checkpoint that has to flush a lot of
buffers.
Waiting for all pre-existing share lockers is easier said than done. We
don't record the acquisition order anywhere and a share-lock release won't
wake anybody if the lockcount doesn't reach zero. Changing that wouldn't
exactly be free and the cost would be borne by all lwlock users.
2) They are based on holding an lwlock. But it's actually quite conceivable
that we'd want to set something hint-bit-like without any lock, as we
e.g. currently do for freespace/. That would work with something like the
approach I chose here, but not if we rely on lwlocks.
E.g. with a bit of work we could actually do sequential scans without
blocking concurrent buffer modifications by carefully ordering when
pd_lower is modified and making sure that tuple header fields are written
in a sensible order.
I still don't like this idea a whole lot - but perhaps we could reduce the
overhead of my proposal some, to get closer to yours. When setting hint bits
for many tuples on a page the overhead of my approach is negligible, but when
doing it for individual tuples it's a bit less negligible.
We can reduce the efficiency difference substantially by adding a bufmgr.c API
that sets hint bits on a page. That function can set the hint bit while holding the
buffer header lock, and therefore doesn't need to set BM_SETTING_HINTS and
thus also doesn't need to do resowner.c accounting.
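As a rough sketch (the function name and details are invented here, and the
WAL logging needed for wal_log_hints/checksums as well as stats accounting
are elided), such a single-call path could look like:

void
BufferSetHintBits16(Buffer buffer, uint16 *ptr, uint16 val)
{
    BufferDesc *buf_hdr = GetBufferDescriptor(buffer - 1);
    uint32      buf_state;

    buf_state = LockBufHdr(buf_hdr);

    /*
     * StartBufferIO() also takes the header lock before setting
     * BM_IO_IN_PROGRESS, so no write can start while we hold it.  If IO is
     * already in progress we simply skip - hint bits are optional.
     */
    if (!(buf_state & BM_IO_IN_PROGRESS))
    {
        *ptr = val;             /* e.g. a tuple's t_infomask */
        buf_state |= BM_DIRTY | BM_JUST_DIRTIED;
    }

    UnlockBufHdr(buf_hdr, buf_state);
}

That keeps the window in which the buffer header spinlock is held tiny, which
is why it's only suitable for setting a single hint, not for the batch path.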
To see the worst case overhead, I
a) disabled the "batch" optimization
b) disabled checksums, as that otherwise would hide small efficiency
differences
c) used an unlogged table
and measured the performance difference for a previously-unhinted sequential
scan of a narrow table that immediately discards all tuples due to OFFSET -
afaict the worst case for the proposed new behaviour.
Previously this was 30.8% slower than master. Now it's only 1.9% slower.
With the batch optimization enabled, the patchset is 7.5% faster.
I also looked at the performance impact on scans that cannot use the batched
approach. The worst case I could think of was a large ordered indexscan of a
previously unhinted table.
For an IOS, the performance difference is a slowdown of 0.65%.
But the difference being so small is partially just due to IOS being
considerably slower than a plain index scan when all tuples need to be fetched
from the table (we need to address that...). Forcing a non-IOS index scan using
enable_indexonlyscan, I get a slowdown of 5.0%.
Given that this is an intentionally pathological workload - there's no IO /
the workload is CPU bound, yet the data is only ever accessed once, with a
query that doesn't ever actually look at the data - I'm quite happy with that.
As soon as I make it a query that actually uses the data, the difference
vanishes in the noise.
Greetings,
Andres Freund
Hi,
On 2024-11-01 14:10:54 -0400, Andres Freund wrote:
I still don't like this idea a whole lot - but perhaps we could reduce the
overhead of my proposal some, to get closer to yours. When setting hint bits
for many tuples on a page the overhead of my approach is negligible, but when
doing it for individual tuples it's a bit less negligible.
We can reduce the efficiency difference substantially by adding a bufmgr.c API
that sets hint bits on a page. That function can set the hint bit while holding the
buffer header lock, and therefore doesn't need to set BM_SETTING_HINTS and
thus also doesn't need to do resowner.c accounting.
To see the worst case overhead, I
a) disabled the "batch" optimization
b) disabled checksums, as that otherwise would hide small efficiency
differences
c) used an unlogged table
and measured the performance difference for a previously-unhinted sequential
scan of a narrow table that immediately discards all tuples due to OFFSET -
afaict the worst case for the proposed new behaviour.
Previously this was 30.8% slower than master. Now it's only 1.9% slower.
With the batch optimization enabled, the patchset is 7.5% faster.
I also looked at the performance impact on scans that cannot use the batched
approach. The worst case I could think of was a large ordered indexscan of a
previously unhinted table.
For an IOS, the performance difference is a slowdown of 0.65%.
But the difference being so small is partially just due to IOS being
considerably slower than a plain index scan when all tuples need to be fetched
from the table (we need to address that...). Forcing a non-IOS index scan using
enable_indexonlyscan, I get a slowdown of 5.0%.
The attached patchset implements this approach and a few additional
optimizations:
- The hint bit functions now take care of marking the buffer dirty, which saves
one trip through bufmgr.c and eventually could be used to remove one atomic
operation. It also looks cleaner imo.
- I had not removed the page copying from XLogSaveBufferForHint(); changing
  that does make the patchset faster even in the index-scan case, if a
  hint-bit WAL record needs to be emitted.
XLogSaveBufferForHint() should probably use XLogRegisterBuffer() instead of
XLogRegisterBlock(), but we'd need to adjust assertions...
- I added a patch to optimize LockBufHdr() by avoiding init_local_spin_delay()
in the common path. With this applied on *both sides*, the regression
vanishes (but see more below).
I've not fully polished the later patches, I would like to get some agreement
on the approach before doing that.
I spent an unhealthily large amount of time trying to benchmark this. Largely
because there's some very odd bimodal performance distribution that I can't
figure out. I benchmarked a whole-table indexscan on an unhinted relation. IFF
testing a WAL logged relation, there are two different "centers" on a graph of
latencies of individual benchmark times.
This happens even on master, on a quiesced system, with the database on tmpfs,
turbo mode disabled etc. On two different systems. Unfortunately there are
fairly long stretches of time where I see one duration, which makes it hard to
just tackle this by running the benchmark for long.
I settled for just comparing both the slow and the fast times separately.
Greetings,
Andres Freund
Attachments:
Here's a rebase of these patches.
The last patch, to use mprotect() to detect bad buffer accesses, needed
some changes because of AIO. I disabled the protection completely for
I/O worker processes, it was difficult to be more fine-grained than
that. I didn't test it with io_uring.
I went ahead and committed the "heapam: Only set tuple's block once per
page in pagemode" patch, because it was trivial and independent of the rest.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
Hi,
On 2025-04-01 13:34:53 +0300, Heikki Linnakangas wrote:
Here's a rebase of these patches.
Thanks!
Curious what made you do this? Do you need any parts of this soon?
I've been hacking on this a bit more, btw, just haven't gotten around to the
necessary README rephrasings... :(
I went ahead and committed the "heapam: Only set tuple's block once per page
in pagemode" patch, because it was trivial and independent of the rest.
Thanks!
I think I should just commit the kill_prior_tuples test, it's better than
nothing.
Greetings,
Andres Freund
On 01/04/2025 18:08, Andres Freund wrote:
Hi,
On 2025-04-01 13:34:53 +0300, Heikki Linnakangas wrote:
Here's a rebase of these patches.
Thanks!
Curious what made you do this? Do you need any parts of this soon?
No, I was just browsing through the commitfest.
I went ahead and committed the "heapam: Only set tuple's block once per page
in pagemode" patch, because it was trivial and independent of the rest.Thanks!
I think I should just commit the kill_prior_tuples test, it's better than
nothing.
+1. I didn't look at it closely, but more tests is better and it's
independent of the rest of the patches.
--
Heikki Linnakangas
Neon (https://neon.tech)
On Wed, Oct 30, 2024 at 02:16:51PM +0200, Heikki Linnakangas wrote:
Acquiring the exclusive lock in step 4 is just a way to wait for all the
existing share-lockers to release the lock. You wouldn't need to block new
share-lockers. We have LW_WAIT_UNTIL_FREE, which is almost what we need, but
it currently ignores share-lockers. So doing this "properly" would require
more changes to LWLocks. Briefly acquiring an exclusive lock seems
acceptable though.
The problem is starvation. For this you really want something more like
rwlocks that do not have the writer starvation problem. But rwlocks
have other problems too, like in this case forcing readers to wait.
What you want here is something more like a barrier: readers that did not
see BM_IO_IN_PROGRESS set on the page get to act as though it's not set,
while readers that did see it set don't, and the process that set the bit
gets to wait for the first set of readers, all without blocking the second
set of readers. That's something akin to an rwlock, but better -- in fact,
it resembles RCU.
Nico
--
Hi,
On 2024-10-30 12:45:27 -0400, Andres Freund wrote:
On 2024-10-30 13:29:01 +0200, Heikki Linnakangas wrote:
On 30/10/2024 04:21, Andres Freund wrote:
Attached is a, unfortunately long, series of patches implementing what I
described upthread.Review of the preparatory patches:
0001 Add very basic test for kill_prior_tuples
We currently don't exercise this path for gist and hash, which seems
somewhat criminal.
Interesting to use the isolationtester for this. There's just one session,
so you're just using it to define reusable steps with handy names.
Yea. I had started out writing it as a pg_regress style test and it quickly got very
hard to understand.
I'm fine with that, but please add a comment to explain it.
Makes sense.
I wonder if it'd be more straightforward to make it a regular pg_regress
test though. There would be some repetition, but would it be so bad?
I found it to be quite bad. If testing just one AM it's ok-ish, but once you
test 2-3 it gets very long and repetitive. I guess we could use functions or
such to make it a bit less painful - but at that point, is it actually simpler?
You forgot to add the new test to 'isolation_schedule'.
Oops.
typos:
"inex" -> "index"
"does something approximately reasonble" -> "do something approximately
reasonable"Oops^2.
Pushed the test with these changes.
Greetings,
Andres Freund
On Wed, 13 Aug 2025 at 22:38, Andres Freund <andres@anarazel.de> wrote:
Pushed the test with these changes.
I think there is one thing that could be improved here. The isolation
test index-killtuples (377b7ab1452) adds new dependencies:
contrib/btree_gin and contrib/btree_gist. This must be taken into
account in the Makefile.
Prior to this patch, isolation tests could be carried out independently
using "make check -C src/test/isolation", but they now require make
check-world to execute successfully, otherwise this test fails with:
====
step create_ext_btree_gist: CREATE EXTENSION btree_gist;
+ERROR: extension "btree_gist" is not available
step create_gist: CREATE INDEX kill_prior_tuple_gist ON kill_prior_tuple
USING gist (key);
+ERROR: data type integer has no default operator class for access method
"gist"
step flush: SELECT FROM pg_stat_force_next_flush();
====
--
Best regards,
Maxim Orlov.
Attachments:
On Mon, Oct 27, 2025 at 02:48:43PM +0300, Maxim Orlov wrote:
I think there is one thing that could be improved here. The isolation
test index-killtuples (377b7ab1452) adds new dependencies:
contrib/btree_gin and contrib/btree_gist. This must be taken into
account in the Makefile.
Also mentioned here:
/messages/by-id/aKJsWedftW7UX1WM@paquier.xyz
It looks like the conclusion would be to move this test into its own
test module to avoid the external contrib/ dependency in
src/test/isolation/, as duplicated Makefile rules in
src/test/isolation/Makefile are unpleasant.
--
Michael