Buffer locking is special (hints, checksums, AIO writes)
Hi,
I'm working on making bufmgr.c ready for AIO writes. That requires some
infrastructure changes that I wanted to discuss... In this email I'm trying to
outline the problems and what I currently think we should do.
== Problem 1 - hint bits ==
Currently pages that are being written out are copied if checksums are
enabled. The copy is needed because hint bits can be set with just a share
lock. Writing out a buffer only requires a share lock. If we didn't copy the
buffer, a hint bit set concurrently could falsify the already-computed checksum.
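To make that concrete, here is roughly what the write-out path does today when
checksums are enabled, paraphrased from bufpage.c's PageSetChecksumCopy()
(the wrapper name below is invented, allocation and error handling elided):

    /* Simplified sketch along the lines of PageSetChecksumCopy(); not the
     * actual code, the function name is made up for illustration. */
    static char pageCopy[BLCKSZ];

    static char *
    checksummed_image_for_write(Page page, BlockNumber blkno)
    {
        /* copy the whole page, so a concurrent hint bit write can't tear it */
        memcpy(pageCopy, (char *) page, BLCKSZ);
        /* checksum the copy, not the shared buffer */
        ((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
        /* this copy, not the shared buffer, is handed to smgrwrite()/AIO */
        return pageCopy;
    }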
Even without checksums hint bits being set while IO is ongoing is an issue,
e.g. btrfs assumes that pages do not change while being written out with
direct-io, and corrupts itself if they do [1].
The copies we make to avoid checksum failures are not at all cheap [2], but are
particularly problematic for AIO, where a lot of buffers can undergo AIO at
the same time. For AIO the issue is that the longer-lived bounce buffers that
I had initially prototyped actually end up using a lot of memory, making it
hard to figure out what defaults to set.
In [2] I had worked on an approach to avoid copying pages during writes. It
avoided the need to copy buffers by not allowing hint bits to be set while IO
is ongoing. It did so by skipping hint bit writes while IO is ongoing (plus a
fair bit of infrastructure to make that cheap enough).
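The gist of that approach, as a hand-waved illustration only (not the actual
patch; SetHintBits()-like code with the cheap-check infrastructure omitted,
and the function name invented here):

    /* Illustration only: skip the hint bit while write IO is in progress.
     * Hint bits are optional, so skipping them is always correct. */
    static void
    set_hint_bit_sketch(HeapTupleHeader tuple, uint16 infomask,
                        Buffer buffer, BufferDesc *bufHdr)
    {
        if (pg_atomic_read_u32(&bufHdr->state) & BM_IO_IN_PROGRESS)
            return;
        tuple->t_infomask |= infomask;
        MarkBufferDirtyHint(buffer, true);
    }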
In that thread Heikki was wondering whether we should instead not go for a
more fundamental solution to the problem, like introducing a separate lock
level that prevents concurrent modifications but allows share lockers. At the
time I was against that, because it seemed like a large project.
However, since then I realized there's an architecturally related second issue:
== Problem 2 - AIO writes vs exclusive locks ==
Separate from the hint bit issue, there is a second issue that I didn't have a
good answer for: Making acquiring an exclusive lock concurrency safe in the
presence of asynchronous writes:
The problem is that while a buffer is being written out, it obviously has to
be share locked. That's true even with AIO. With AIO the share lock is held
until the IO is completed. The problem is that if a backend wants to
exclusively lock a buffer undergoing AIO, it can't just wait for the content
lock as today, it might have to actually reap the IO completion from the
operating system. If one just were to wait for the content lock, there's no
forward progress guarantee.
The buffer's state "knows" that it's undergoing write IO (BM_VALID and
BM_IO_IN_PROGRESS are set). To guarantee forward progress, an exclusive
locker needs to wait for the IO (pgaio_wref_wait(BufferDesc->io_wref)). The
problem is that it's surprisingly hard to do so race free:
If a backend A were to just check if a buffer is undergoing IO before locking
it, a backend B could start IO on the buffer between A checking for
BM_IO_IN_PROGRESS and acquiring the content lock. We obviously can't just
hold the buffer header spinlock across a blocking lwlock acquisition.
There potentially are ways to synchronize the buffer state and the content
lock, but it requires deep integration between bufmgr.c and lwlock.c.
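To spell the race out (illustrative pseudo-code only, using today's function
names; this is exactly the kind of code we can't write):

    /* Backend A wants an exclusive content lock -- RACY, illustration only */
    uint32      buf_state;
    bool        io_in_progress;

    buf_state = LockBufHdr(bufHdr);
    io_in_progress = (buf_state & BM_IO_IN_PROGRESS) != 0;
    UnlockBufHdr(bufHdr, buf_state);

    if (io_in_progress)
        pgaio_wref_wait(&bufHdr->io_wref);  /* reap the write completion */

    /* <-- backend B can start a new write IO on the buffer right here --> */

    LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
    /* A may now sleep behind an AIO share lock whose completion nobody reaps */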
== Problem 3 - Cacheline contention ==
This is unrelated to AIO, but might influence the architecture for potential
solutions. It might make sense to skip over this section on a quicker
read-through.
The problem is that in some workloads the cacheline containing the BufferDesc
becomes very hot. The most common case is workloads with lots of index nested
loops, where the root page and some of the other inner pages in the index can
become very contended.
I've seen cases where running the same workload in four separate copies in the
same postgres instance yields ~8x the throughput - that was on a machine with
forty cores, and there are machines with many more these days [4].
There are three main issues:
a) We manipulate the same cache line multiple times. E.g. btree always first
pins the page and then locks the page, which performs two atomic operations
on the same cacheline.
I've experimented with putting the content lock on a separate cacheline,
but that causes regressions in other cases.
Leaving nontrivial implementation issues aside, we could release the lock
and pin at the same time. With a bit more difficulty we could do the
same for pin and lock acquisition. nbtree almost always does the two
together, so it'd not be hard to make it benefit from such an optimization.
b) Many operations, like unpinning a buffer, need to use CAS, instead of an
atomic subtraction.
Atomic add/sub scales a *lot* better than compare and swap, as there is no
need to retry. With increased contention the number of retries increases
further and further.
The reason we need to use CAS in so many places is the following:
Historically postgres' spinlocks don't use an atomic operation to release
the spinlock on x86. When making buffer header locks their own thing, I
carried that forward, to avoid performance regressions. Because of that
BufferDesc.state may not be modified while the buffer header spinlock is
held - which is incompatible with using atomic-sub.
However, since then we converted most of the "hot" uses of buffer header
spinlocks into CAS loops (see e.g. PinBuffer()). That makes it feasible to
use an atomic operation for the buffer header lock release, which in turn
allows e.g. unpinning with an atomic sub.
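As an illustration of the difference, a sketch along the lines of today's
UnpinBufferNoOwner() versus what becomes possible once concurrent modification
of a header-locked buffer is allowed (simplified, not the actual code):

    uint32      old_buf_state;
    uint32      buf_state;

    /* Today: a CAS loop, because the state must not change while another
     * backend holds the buffer header spinlock. */
    old_buf_state = pg_atomic_read_u32(&buf->state);
    for (;;)
    {
        if (old_buf_state & BM_LOCKED)
            old_buf_state = WaitBufHdrUnlocked(buf);
        buf_state = old_buf_state - BUF_REFCOUNT_ONE;
        if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state,
                                           buf_state))
            break;              /* under contention, retries pile up */
    }

    /* If unpinning while someone holds the header spinlock is allowed, a
     * single wait-free subtraction suffices instead: */
    pg_atomic_fetch_sub_u32(&buf->state, BUF_REFCOUNT_ONE);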
c) Read accesses to the BufferDesc cause contention
Some code, like nbtree, relies on functions like
BufferGetBlockNumber(). Unfortunately that contends with concurrent
modifications of the buffer descriptor (like pinning). Potential solutions
are to rely less on functions like BufferGetBlockNumber() or to split out
the memory for that into a separate (denser?) array.
d) Even after addressing all of the above, there's still a lot of contention
I think the solution here would be something roughly akin to fastpath locks. If
a buffer is very contended, we can mark it as super-pinned & share locked,
avoiding any atomic operation on the buffer descriptor itself. Instead the
current lock and pincount would be stored in each backend's PGPROC.
Obviously evicting or exclusively-locking such a buffer would be a lot more
expensive.
I've prototyped it and it helps a *lot*. The reason I mention this here is
that this seems impossible to do while using the generic lwlocks for the
content lock.
== Solution ? ==
My conclusion from the above is that we ought to:
A) Make Buffer Locks something separate from lwlocks
As part of that introduce a new lock level in-between share and exclusive
locks (provisionally called share-exclusive, but I hate it). The new lock
level allows other share lockers, but can only be held by one backend.
This allows us to change the rules so that:
1) Share lockers are not allowed to modify buffers anymore
2) Hint bits need the new lock mode (conditionally upgrading the lock in SetHintBits())
3) Write IO needs to acquire the new lock level
This addresses Problem 1 from above
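Pictured as a compatibility matrix, the intent is roughly:

                        share   share-exclusive   exclusive
    share                yes         yes             no
    share-exclusive      yes         no              no
    exclusive            no          no              no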
B) Merge BufferDesc.state and the content lock
This allows us to address Problem 2 from above, as we can now atomically check if IO
was concurrently initiated.
Obviously BufferDesc.state is not currently wide enough, therefore the
buffer state has to be widened to a 64-bit variable.
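Purely for illustration, one way such a 64-bit state could be carved up (the
widths and positions here are invented, not what my prototype actually uses):

    /* Hypothetical layout, invented for illustration:
     *
     *   bits  0..17   pin (reference) count  -- MAX_BACKENDS needs 18 bits
     *   bits 18..21   usage count
     *   bits 22..31   BM_* flags (BM_DIRTY, BM_VALID, BM_IO_IN_PROGRESS, ...)
     *   bits 32..49   share lock holder count
     *   bit  50       share-exclusive lock held
     *   bit  51       exclusive lock held
     *   bits 52..63   waiter / maintenance bits for the lock implementation
     */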
C) Allow some modifications of BufferDesc.state while holding spinlock
Just naively doing the above two things reduces scalability, as the
likelihood of CAS failures increases due to the increased number of
modifications of the same atomic variable.
However, by allowing unpinning while the buffer header spinlock is held,
scalability considerably improves in my tests.
Doing so requires changing all uses of LockBufHdr(), but by introducing a
static inline helper the complexity can largely be encapsulated.
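Roughly the shape of that helper (the later patches call it UnlockBufHdrExt();
the signature and body below are just a sketch of the idea, not the real
thing):

    /* Sketch: release the buffer header spinlock and apply state changes in
     * one step.  A CAS loop is needed because, under the new rules, other
     * backends may modify the state even while we hold the header lock. */
    static inline void
    UnlockBufHdrExt(BufferDesc *desc, uint32 old_buf_state,
                    uint32 set_bits, uint32 clear_bits)
    {
        for (;;)
        {
            uint32      new_buf_state = old_buf_state;

            new_buf_state |= set_bits;
            new_buf_state &= ~clear_bits;
            new_buf_state &= ~BM_LOCKED;        /* the actual unlock */

            if (pg_atomic_compare_exchange_u32(&desc->state, &old_buf_state,
                                               new_buf_state))
                break;
            /* old_buf_state was refreshed by the failed CAS; retry */
        }
    }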
I've prototyped the above. The current state is pretty rough, but before I
spend the non-trivial time to make it into an understandable sequence of
changes, I wanted to get feedback.
Does this plan sound reasonable?
The hardest part about this change is that everything kind of depends on each
other. The changes are large enough that they clearly can't just be committed
at once, but doing them over time risks [temporary] performance regressions.
The order of changes I think makes the most sense is the following:
1) Allow some modifications while holding the buffer header spinlock
2) Reduce buffer pin with just an atomic-sub
This needs to happen first, otherwise there are performance regressions
during the later steps.
3) Widen BufferDesc.state to 64 bits
4) Implement buffer locking inside BufferDesc.state
5) Do IO while holding share-exclusive lock and require all buffer
modifications to at least hold share exclusive lock
6) Wait for AIO when acquiring an exclusive content lock
(some of these will likely have parts of their own, but that's details)
Sane?
DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
Greetings,
Andres Freund
[1]: /messages/by-id/CA+hUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw@mail.gmail.com
[2]: /messages/by-id/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m
[3]: In some cases it causes slowdowns for checkpointer close to 50%!
[4]: https://anarazel.de/talks/2024-10-23-pgconf-eu-numa-vs-postgresql/numa-vs-postgresql.pdf - slide 19
Hello, Andres!
Andres Freund <andres@anarazel.de>:
As part of that introduce a new lock level in-between share and exclusive
locks (provisionally called share-exclusive, but I hate it). The new lock
level allows other share lockers, but can only be held by one backend.
This allows us to change the rules so that:
1) Share lockers are not allowed to modify buffers anymore
2) Hint bits need the new lock mode (conditionally upgrading the lock in SetHintBits())
3) Write IO needs to acquire the new lock level
IIUC, it may be mapped to the existing locking system:
BUFFER_LOCK_SHARE ----> AccessShareLock
new lock mode ----> ExclusiveLock
BUFFER_LOCK_EXCLUSIVE ----> AccessExclusiveLock
So, it all may be named:
BUFFER_LOCK_ACCESS_SHARE
BUFFER_LOCK_EXCLUSIVE
BUFFER_LOCK_ACCESS_EXCLUSIVE
being more consistent with the table-level locking system.
Greetings,
Mikhail.
On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
My conclusion from the above is that we ought to:
A) Make Buffer Locks something separate from lwlocks
B) Merge BufferDesc.state and the content lock
C) Allow some modifications of BufferDesc.state while holding spinlock
+1 to (A) and (B). No particular opinion on (C) but if it works well, great.
The order of changes I think makes the most sense is the following:
1) Allow some modifications while holding the buffer header spinlock
2) Reduce buffer pin with just an atomic-sub
3) Widen BufferDesc.state to 64 bits
4) Implement buffer locking inside BufferDesc.state
5) Do IO while holding share-exclusive lock and require all buffer
modifications to at least hold share exclusive lock
6) Wait for AIO when acquiring an exclusive content lock
No strong objections. I certainly like getting to (5) and (6) and I
think those are in the right order. I'm not sure about the rest. I
thought (1) and (2) were the same change after reading your email; and
it surprises me a little bit that (2) is separate from (4). But I'm
sure you have a much better sense of this than I do.
DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
AFAIK "share exclusive" or "SX" is standard terminology. While I'm not
wholly hostile to the idea of coming up with something else, I don't
think our tendency to invent our own way to do everything is one of
our better tendencies as a project.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
My conclusion from the above is that we ought to:
A) Make Buffer Locks something separate from lwlocks
B) Merge BufferDesc.state and the content lock
C) Allow some modifications of BufferDesc.state while holding spinlock
+1 to (A) and (B). No particular opinion on (C) but if it works well, great.
Without it I see performance regressions, due to the increased rate of CAS
failures caused by having more changes to one atomic variable :/
The order of changes I think makes the most sense is the following:
1) Allow some modifications while holding the buffer header spinlock
2) Reduce buffer pin with just an atomic-sub
3) Widen BufferDesc.state to 64 bits
4) Implement buffer locking inside BufferDesc.state
5) Do IO while holding share-exclusive lock and require all buffer
modifications to at least hold share exclusive lock
6) Wait for AIO when acquiring an exclusive content lock
No strong objections. I certainly like getting to (5) and (6) and I
think those are in the right order. I'm not sure about the rest.
I thought (1) and (2) were the same change after reading your email
They are certainly related. I thought it'd make sense to split them as
outlined above, as (1) is relatively verbose on its own, but far more
mechanical.
and it surprises me a little bit that (2) is separate from (4).
Without doing 2) first, I see performance/scalability regressions doing
(4). Doing (3) without (2) also hurts...
DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
AFAIK "share exclusive" or "SX" is standard terminology. While I'm not
wholly hostile to the idea of coming up with something else, I don't
think our tendency to invent our own way to do everything is one of
our better tendencies as a project.
I guess it bothers me that we'd use share-exclusive to mean the buffer can't
be modified, but for real (vs share, which does allow some modifications). But
it's entirely plausible that there's no meaningfully better name, in which
case we certainly shouldn't differ from what's somewhat commonly used.
Greetings,
Andres Freund
On Fri, Aug 22, 2025 at 03:44:48PM -0400, Andres Freund wrote:
I'm working on making bufmgr.c ready for AIO writes.
Nice!
== Problem 2 - AIO writes vs exclusive locks ==
Separate from the hint bit issue, there is a second issue that I didn't have a
good answer for: Making acquiring an exclusive lock concurrency safe in the
presence of asynchronous writes:
The problem is that while a buffer is being written out, it obviously has to
be share locked. That's true even with AIO. With AIO the share lock is held
once the IO is completed. The problem is that if a backend wants to
exclusively lock a buffer undergoing AIO, it can't just wait for the content
lock as today, it might have to actually reap the IO completion from the
operating system. If one just were to wait for the content lock, there's no
forward progress guarantee.
The buffer's state "knows" that it's undergoing write IO (BM_VALID and
BM_IO_IN_PROGRESS are set). To ensure forward progress guarantee, an exclusive
locker needs to wait for the IO (pgaio_wref_wait(BufferDesc->io_wref)). The
problem is that it's surprisingly hard to do so race free:
If a backend A were to just check if a buffer is undergoing IO before locking
it, a backend B could start IO on the buffer between A checking for
BM_IO_IN_PROGRESS and acquiring the content lock. We obviously can't just
hold the buffer header spinlock across a blocking lwlock acquisition.
There potentially are ways to synchronize the buffer state and the content
lock, but it requires deep integration between bufmgr.c and lwlock.c.
You may have considered and rejected simpler alternatives for (2) before
picking the approach you go on to outline. Anything interesting? For
example, I imagine these might work with varying degrees of inefficiency:
- Use LWLockConditionalAcquire() with some nonstandard waiting protocol when
there's a non-I/O lock conflict.
- Take BM_IO_IN_PROGRESS before exclusive-locking, then release it.
== Problem 3 - Cacheline contention ==
c) Read accesses to the BufferDesc cause contention
Some code, like nbtree, relies on functions like
BufferGetBlockNumber(). Unfortunately that contends with concurrent
modifications of the buffer descriptor (like pinning). Potential solutions
are to rely less on functions like BufferGetBlockNumber() or to split out
the memory for that into a separate (denser?) array.
Agreed. BufferGetBlockNumber() could even use a new local (non-shmem) data
structure, since the buffer's mapping can't change until we unpin.
d) Even after addressing all of the above, there's still a lot of contention
I think the solution here would be something roughly to fastpath locks. If
a buffer is very contended, we can mark it as super-pinned & share locked,
avoiding any atomic operation on the buffer descriptor itself. Instead the
current lock and pincount would be stored in each backends PGPROC.
Obviously evicting or exclusively-locking such a buffer would be a lot more
expensive.
I've prototyped it and it helps a *lot*. The reason I mention this here is
that this seems impossible to do while using the generic lwlocks for the
content lock.
Nice.
On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
The order of changes I think makes the most sense is the following:
No concerns so far. I won't claim I can picture all the implications and be
sure this is the right thing, but it sounds promising. I like your principle
of ordering changes to avoid performance regressions.
DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
I would consider {AccessShare, Exclusive, AccessExclusive}. What the $SUBJECT
proposal calls SHARE-EXCLUSIVE would become Exclusive. That has the same
conflict matrix as the corresponding heavyweight locks, which seems good. I
don't love our mode names, particularly ShareRowExclusive being unsharable.
However, learning one special taxonomy is better than learning two.
AFAIK "share exclusive" or "SX" is standard terminology.
Can you say more about that? I looked around at
https://google.com/search?q=share+exclusive+%22sx%22+lock but didn't find
anything well-aligned with the proposal:
https://dev.mysql.com/doc/dev/mysql-server/latest//PAGE_LOCK_ORDER.html looked
most relevant, but it doesn't give the big idea.
https://mysqlonarm.github.io/Understanding-InnoDB-rwlock-stats/ is less
authoritative but does articulate the big idea, as "Shared-Exclusive (SX):
offer write access to the resource with inconsistent read. (relaxed
exclusive)." That differs from $SUBJECT semantics, in which SHARE-EXCLUSIVE
can't see inconsistent reads.
https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_LOCK.html
has term SX = "sub exclusive". I gather an SX lock on a table lets one do
SELECT FOR UPDATE on that table (each row is the "sub" component being locked).
https://man.freebsd.org/cgi/man.cgi?query=sx_slock&sektion=9&format=html uses
the term "SX", but it's more like our lwlocks. One acquires S or X, not
blends of them.
On Tue, Aug 26, 2025 at 8:14 PM Noah Misch <noah@leadboat.com> wrote:
AFAIK "share exclusive" or "SX" is standard terminology.
Can you say more about that?
Looks like I was misremembering. I was thinking of Gray & Reuter,
Transaction Processing: Concepts and Techniques, 1993. However,
opening it up, I find that his vocabulary is slightly different. He
offers the following six lock modes: IS, IX, S, SIX, Update, X. "I"
means "intent" and acts as a modifier to the letter that follows.
Hence, SIX means "a course-granularity shared lock with intent to set
finer-granularity exclusive locks" (p. 408). His lock manager is
hierarchical, so taking a SIX lock on a table means that you are
allowed to read all the rows in the table and you are allowed to
exclusive-lock individual rows as desired and nobody is allowed to
exclusive-lock any rows in the table. It is compatible only with IS;
that is, it does not preclude other people from share-locking
individual rows (which might delay your exclusive locks on those
rows). Since we don't have intent-locking in PostgreSQL, I think my
brain mentally flattened this hierarchy down to S, X, SX, but that's
not what he actually wrote.
His "Update" locks are also somewhat interesting: an update lock is
exactly like an exclusive lock except that it permits PAST
share-locks. You take an update lock when you currently need a
share-lock but anticipate the possibility of needing an
exclusive-lock. This is a deadlock avoidance strategy: updaters will
take turns, and some of them will ultimately want exclusive locks and
others won't, but they can't deadlock against each other as long as
they all take "Update" locks initially and don't try to upgrade to
that level later. An updater's attempt to upgrade to an exclusive lock
can still be delayed by, or deadlock against, share lockers, but those
typically won't try to escalate to higher lock levels later.
If we were to use the existing PostgreSQL naming convention, I think
I'd probably argue that the nearest parallel to this level is
ShareUpdateExclusive: a self-exclusive lock level that permits
ordinary table access to continue while blocking exclusive locks, used
for an in-flight maintenance operation. But that's arguable, of
course.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2025-08-26 17:14:49 -0700, Noah Misch wrote:
On Fri, Aug 22, 2025 at 03:44:48PM -0400, Andres Freund wrote:
== Problem 2 - AIO writes vs exclusive locks ==
Separate from the hint bit issue, there is a second issue that I didn't have a
good answer for: Making acquiring an exclusive lock concurrency safe in the
presence of asynchronous writes:
The problem is that while a buffer is being written out, it obviously has to
be share locked. That's true even with AIO. With AIO the share lock is held
once the IO is completed. The problem is that if a backend wants to
exclusively lock a buffer undergoing AIO, it can't just wait for the content
lock as today, it might have to actually reap the IO completion from the
operating system. If one just were to wait for the content lock, there's no
forward progress guarantee.
The buffer's state "knows" that it's undergoing write IO (BM_VALID and
BM_IO_IN_PROGRESS are set). To ensure forward progress guarantee, an exclusive
locker needs to wait for the IO (pgaio_wref_wait(BufferDesc->io_wref)). The
problem is that it's surprisingly hard to do so race free:
If a backend A were to just check if a buffer is undergoing IO before locking
it, a backend B could start IO on the buffer between A checking for
BM_IO_IN_PROGRESS and acquiring the content lock. We obviously can't just
hold the buffer header spinlock across a blocking lwlock acquisition.
There potentially are ways to synchronize the buffer state and the content
lock, but it requires deep integration between bufmgr.c and lwlock.c.
You may have considered and rejected simpler alternatives for (2) before
picking the approach you go on to outline.
I tried a few things...
Anything interesting?
Not really.
The first one you propose is what I looked at for a while:
For example, I imagine these might work with varying degrees of
inefficiency:
- Use LWLockConditionalAcquire() with some nonstandard waiting protocol when
there's a non-I/O lock conflict.
It's nontrivial to make this race free - the problem is the case where we *do*
have to wait for an exclusive content lock. It's possible for the lwlock to be
released by the owning backend and for IO to be started, after checking
whether IO is in progress (after LWLockConditionalAcquire() had failed).
I came up with a complicated scheme, where setting IO in progress would
afterwards wake up all lwlock waiters and all exclusive content lock waits
were done with LWLockAcquireOrWait(). I think that was working - but it's
also slower and seems really fragile and ugly.
- Take BM_IO_IN_PROGRESS before exclusive-locking, then release it.
That just seems expensive. We could make it cheaper by doing it only if a
LWLockConditionalAcquire() doesn't succeed. But it still seems not great. And
it doesn't really help with addressing the 'setting hint bits while IO is in
progress" part...
== Problem 3 - Cacheline contention ==
c) Read accesses to the BufferDesc cause contention
Some code, like nbtree, relies on functions like
BufferGetBlockNumber(). Unfortunately that contends with concurrent
modifications of the buffer descriptor (like pinning). Potential solutions
are to rely less on functions like BufferGetBlockNumber() or to split out
the memory for that into a separate (denser?) array.
Agreed. BufferGetBlockNumber() could even use a new local (non-shmem) data
structure, since the buffer's mapping can't change until we unpin.
Hm. I didn't think about a backend-local data structure for that, perhaps
because it seems not cheap to maintain (both from a runtime and a space
perspective).
If we store the read-only data for buffers separately from the read-write
data, we could access that from backends without a lock, since it can't change
with the buffer pinned.
One way to do that would be to maintain a back-pointer from the BufferDesc to
the BufferLookupEnt, since the latter *already* contains the BufferTag. We
probably don't want to add another indirection to the buffer mapping hash
table, otherwise we could deduplicate the other way round and just put padding
between the modified and read-only part of a buffer desc.
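To illustrate the second variant (padding between a read-only and a read-write
part), roughly -- an invented layout, not what any patch implements:

    /* Invented for illustration: split the descriptor so reads of the
     * immutable-while-pinned part don't contend with pins/locks. */
    typedef struct BufferDescRO
    {
        BufferTag   tag;        /* can't change while the buffer is pinned */
        int         buf_id;
    } BufferDescRO;

    typedef struct BufferDescRW
    {
        pg_atomic_uint32 state; /* refcount, usage count, flags, ... */
        int         wait_backend_pgprocno;
        PgAioWaitRef io_wref;
        LWLock      content_lock;
    } BufferDescRW;

    /* padding or separate arrays keep the two parts on different cache lines */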
On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
The order of changes I think makes the most sense is the following:
No concerns so far. I won't claim I can picture all the implications and be
sure this is the right thing, but it sounds promising. I like your principle
of ordering changes to avoid performance regressions.
I suspect we'll have to merge this incrementally to stay sane. I don't want to
end up with a period of worse performance due to this; that could make it
harder to evaluate other work.
DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
I would consider {AccessShare, Exclusive, AccessExclusive}.
One thing I forgot to mention is that with the proposed re-architecture in
place, we could subsequently go further and make pinning just be a very
lightweight lock level, instead of that being a separate dedicated
infrastructure. One nice outgrowth of that would be that acquiring a
cleanup lock would just be a real lock acquisition, instead of the dedicated
limited machinery we have right now.
Which would leave us with:
- reference (pins today)
- share
- share-exclusive
- exclusive
- cleanup
This doesn't quite seem to map onto the heavyweight lock levels in a sensible
way...
What the $SUBJECT proposal calls SHARE-EXCLUSIVE would become Exclusive.
There are a few hundred references to the lock levels though, seems painful to
rename them :(
That has the same conflict matrix as the corresponding heavyweight locks,
which seems good.
I don't love our mode names, particularly ShareRowExclusive being
unsharable.
I hate them with a passion :). Except for the most basic ones they just don't
stay in my head for more than a few hours.
However, learning one special taxonomy is better than learning two.
But that's fair.
Greetings,
Andres Freund
On Wed, Aug 27, 2025 at 12:18:27PM -0400, Andres Freund wrote:
On 2025-08-26 17:14:49 -0700, Noah Misch wrote:
On Fri, Aug 22, 2025 at 03:44:48PM -0400, Andres Freund wrote:
== Problem 3 - Cacheline contention ==
c) Read accesses to the BufferDesc cause contention
Some code, like nbtree, relies on functions like
BufferGetBlockNumber(). Unfortunately that contends with concurrent
modifications of the buffer descriptor (like pinning). Potential solutions
are to rely less on functions like BufferGetBlockNumber() or to split out
the memory for that into a separate (denser?) array.
Agreed. BufferGetBlockNumber() could even use a new local (non-shmem) data
structure, since the buffer's mapping can't change until we unpin.
Hm. I didn't think about a backend-local data structure for that, perhaps
because it seems not cheap to maintain (both from a runtime and a space
perspective).
Yes, paying off the cost of maintaining it could be tricky. It could be the
kind of thing where the overhead loses at 10 cores and wins at 40 cores. It
could also depend heavily on the workload's concurrent pins per backend.
If we store the read-only data for buffers separately from the read-write
data, we could access that from backends without a lock, since it can't change
with the buffer pinned.
Good point. That alone may be enough of a win.
One way to do that would be to maintain a back-pointer from the BufferDesc to
the BufferLookupEnt, since the latter *already* contains the BufferTag. We
probably don't want to add another indirection to the buffer mapping hash
table, otherwise we could deduplicate the other way round and just put padding
between the modified and read-only part of a buffer desc.
I think you're saying clients would save the back-pointer once and dereference
it many times, with each dereference of a saved back-pointer avoiding a shmem
read of BufferDesc.tag. Is that right?
On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
I would consider {AccessShare, Exclusive, AccessExclusive}.
One thing I forgot to mention is that with the proposed re-architecture in
place, we could subsequently go further and make pinning just be a very
lightweight lock level, instead of that being a separate dedicated
infrastructure. One nice outgrowth of that would be that acquiring a
cleanup lock would just be a real lock acquisition, instead of the dedicated
limited machinery we have right now.
Which would leave us with:
- reference (pins today)
- share
- share-exclusive
- exclusive
- cleanup
This doesn't quite seem to map onto the heavyweight lock levels in a sensible
way...
Could map it like this:
AccessShare - pins today
RowShare - check tuple visibility (BUFFER_LOCK_SHARE today)
Share - set hint bits
ShareUpdateExclusive - clean/write out (borrowing Robert's idea)
Exclusive - add tuples, change xmax, etc. (BUFFER_LOCK_EXCLUSIVE today)
AccessExclusive - cleanup lock or evict the buffer
That has a separate level for hint bits vs. I/O, so multiple backends could
set hint bits. I don't know whether the benchmarks would favor maintaining
that distinction.
What the $SUBJECT proposal calls SHARE-EXCLUSIVE would become Exclusive.
There are a few hundred references to the lock levels though, seems painful to
rename them :(
Yes, especially in comments and extensions. Likely more important than that
for the long-term, your latest proposal has the advantage of keeping short
names for the most-commonly-referenced lock types. (We could keep
BUFFER_LOCK_SHARE with the lower layers translating that into RowShare, but
that weakens or eliminates the benefit of reducing what readers need to
learn.) For what it's worth, 6 PGXN modules reference BUFFER_LOCK_SHARE
and/or BUFFER_LOCK_EXCLUSIVE.
Compared to share-exclusive, I think I'd prefer a name that describes the use
cases, "set-hints-or-write" (or separate "write" and "set-hints" levels).
What do you think of that? I don't know whether that should win vs. names
like ShareUpdateExclusive, though.
On Wed, Aug 27, 2025 at 12:18 PM Andres Freund <andres@anarazel.de> wrote:
Which would leave us with:
- reference (pins today)
- share
- share-exclusive
- exclusive
- cleanup
This doesn't quite seem to map onto the heavyweight lock levels in a sensible
way...
Could do: ACCESS SHARE, SHARE, SHARE UPDATE EXCLUSIVE, EXCLUSIVE,
ACCESS EXCLUSIVE.
I've always thought that a pin was a lot like an access share lock and
a cleanup lock was a lot like an access exclusive lock.
But then again, using the same terminology for two different things
might be confusing.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2025-08-27 12:14:41 -0700, Noah Misch wrote:
On Wed, Aug 27, 2025 at 12:18:27PM -0400, Andres Freund wrote:
One way to do that would be to maintain a back-pointer from the BufferDesc to
the BufferLookupEnt, since the latter *already* contains the BufferTag. We
probably don't want to add another indirection to the buffer mapping hash
table, otherwise we could deduplicate the other way round and just put padding
between the modified and read-only part of a buffer desc.
I think you're saying clients would save the back-pointer once and dereference
it many times, with each dereference of a saved back-pointer avoiding a shmem
read of BufferDesc.tag. Is that right?
I was thinking that we'd not have BufferDesc.tag, instead just storing it
solely in BufferLookupEnt. To get the tag of a BufferDesc, you'd have to
follow the back-reference every time. But that's actually why it doesn't work -
reading the back-reference pointer would have the same issue as just reading
BufferDesc.tag...
On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
I would consider {AccessShare, Exclusive, AccessExclusive}.
One thing I forgot to mention is that with the proposed re-architecture in
place, we could subsequently go further and make pinning just be a very
lightweight lock level, instead of that being a separate dedicated
infrastructure. One nice outgrowth of that would be that acquiring a
cleanup lock would just be a real lock acquisition, instead of the dedicated
limited machinery we have right now.
Which would leave us with:
- reference (pins today)
- share
- share-exclusive
- exclusive
- cleanup
This doesn't quite seem to map onto the heavyweight lock levels in a sensible
way...
Could map it like this:
AccessShare - pins today
RowShare - check tuple visibility (BUFFER_LOCK_SHARE today)
Share - set hint bits
ShareUpdateExclusive - clean/write out (borrowing Robert's idea)
Exclusive - add tuples, change xmax, etc. (BUFFER_LOCK_EXCLUSIVE today)
AccessExclusive - cleanup lock or evict the buffer
I tend to think that having things like RowShare for buffer locking is
confusing enough that the similarity to the heavyweight locks wouldn't
actually be a win...
That has a separate level for hint bits vs. I/O, so multiple backends could
set hint bits. I don't know whether the benchmarks would favor maintaining
that distinction.
I don't think it would - I actually found multiple backends setting the same
hint bits to *hurt* performance a bit. But what's more important, we don't
have the space for it, I think. Every lock that can be acquired multiple
times needs a lock count of 18 bits. And we need to store the buffer state
flags (10 bits). There's just not enough space in 64 bits to have three 18-bit
counters as well as flag bits etc.
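(Back of the envelope: three 18-bit counters are 54 bits, plus 10 flag bits is
already 64, leaving nothing for the usage count or the waiter/maintenance bits
a lock implementation needs.)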
Compared to share-exclusive, I think I'd prefer a name that describes the use
cases, "set-hints-or-write" (or separate "write" and "set-hints" levels).
I would too, I just couldn't come up with something that conveys the meanings
in a sufficiently concise way :)
What do you think of that? I don't know whether that should win vs. names
like ShareUpdateExclusive, though.
I think it'd be a win compared to the heavyweight lock names...
Greetings,
Andres Freund
On Wed, Aug 27, 2025 at 03:29:02PM -0400, Andres Freund wrote:
On 2025-08-27 12:14:41 -0700, Noah Misch wrote:
On Wed, Aug 27, 2025 at 12:18:27PM -0400, Andres Freund wrote:
On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
Which would leave us with:
- reference (pins today)
- share
- share-exclusive
- exclusive
- cleanup
Compared to share-exclusive, I think I'd prefer a name that describes the use
cases, "set-hints-or-write" (or separate "write" and "set-hints" levels).
Another name idea is "self-exclusive", to contrast with "exclusive" excluding
all of (exclusive, self-exclusive, share).
Fortunately, not much code will acquire this lock type. Hence, there's
relatively little damage if the name is less obvious than older lock types or
if the name changes later.
On Wed, Aug 27, 2025 at 10:03:08AM -0400, Robert Haas wrote:
If we were to use the existing PostgreSQL naming convention, I think
I'd probably argue that the nearest parallel to this level is
ShareUpdateExclusive: a self-exclusive lock level that permits
ordinary table access to continue while blocking exclusive locks, used
for an in-flight maintenance operation. But that's arguable, of
course.
ShareUpdateExclusive is a term that's been used for some time now and
relates to knowledge that's quite spread in the tree, so it feels like
a natural fit for the use-case described on this thread as we'd want a
self-conflicting lock. share-exclusive did not sound that bad to me,
TBH, quite the contrary, when applied to buffer locking for aio.
"intent" is also a word I've bumped quite a lot into while looking at
some naming convention, but this is more related to the fact that a
lock is going to be taken, which we don't really have. So that feels
off.
--
Michael
Hi,
On 2025-08-22 15:44:48 -0400, Andres Freund wrote:
The hardest part about this change is that everything kind of depends on each
other. The changes are large enough that they clearly can't just be committed
at once, but doing them over time risks [temporary] performance regressions.
The order of changes I think makes the most sense is the following:
1) Allow some modifications while holding the buffer header spinlock
2) Reduce buffer pin with just an atomic-sub
This needs to happen first, otherwise there are performance regressions
during the later steps.
Here are the first few cleaned up patches implementing the above steps, as
well as some cleanups. I included a commit from another thread, as it
conflicts with these changes, and we really should apply it - and it's
arguably required to make the changes viable, as it removes one more use of
PinBuffer_Locked().
Another change included is to not return the buffer with the spinlock held
from StrategyGetBuffer(), and instead pin the buffer in freelist.c. The reason
for that is to avoid the most common PinBuffer_Locked() call. By definition
PinBuffer_Locked() will become a bit slower due to 0003. But even without 0003,
0002 is faster than master. And the previous approach also just seems
pretty unclean. I don't love that it requires the new TrackNewBufferPin(),
but I don't really have a better idea.
I invite particular attention to the commit message for 0003 as well as the
comment changes in buf_internals.h within.
I'm not sure I like the TODOs added in 0003 and removed in 0004, but squashing
the changes doesn't really seem better either.
Greetings,
Andres Freund
Attachments:
v3-0001-Improve-ReadRecentBuffer-scalability.patch
v3-0002-bufmgr-Don-t-lock-buffer-header-in-StrategyGetBuf.patch
v3-0003-bufmgr-Allow-some-buffer-state-modifications-whil.patch
v3-0004-bufmgr-Use-atomic-sub-or-for-unpinning-and-markin.patch
v3-0005-bufmgr-fewer-calls-to-BufferDescriptorGetContentL.patch
v3-0006-bufmgr-Introduce-FlushUnlockedBuffer.patch
Hi,
On 2025-09-15 19:05:37 -0400, Andres Freund wrote:
On 2025-08-22 15:44:48 -0400, Andres Freund wrote:
The hardest part about this change is that everything kind of depends on each
other. The changes are large enough that they clearly can't just be committed
at once, but doing them over time risks [temporary] performance regressions.
The order of changes I think makes the most sense is the following:
1) Allow some modifications while holding the buffer header spinlock
2) Reduce buffer pin with just an atomic-sub
This needs to happen first, otherwise there are performance regressions
during the later steps.
Here are the first few cleaned up patches implementing the above steps, as
well as some cleanups. I included a commit from another thread, as it
conflicts with these changes, and we really should apply it - and it's
arguably required to make the changes viable, as it removes one more use of
PinBuffer_Locked().
Another change included is to not return the buffer with the spinlock held
from StrategyGetBuffer(), and instead pin the buffer in freelist.c. The reason
for that is to reduce the most common PinBuffer_locked() call. By definition
PinBuffer_locked() will become a bit slower due to 0003. But even without 0003
it 0002 is faster than master. And the previous approach also just seems
pretty unclean. I don't love that it requires the new TrackNewBufferPin(),
but I don't really have a better idea.
I invite particular attention to the commit message for 0003 as well as the
comment changes in buf_internals.h within.
Robert looked at the patches while we were chatting, and I addressed his
feedback in this new version.
Changes:
- Updated patch description for 0002, giving a lot more background
- Improved BufferDesc comments a fair bit more in 0003
- Reduced the size of 0003 a bit, by using UnlockBufHdrExt() instead of a CAS
loop in buffer_stage_common() and reordering some things in
TerminateBufferIO()
- Made 0004 only use the non-looping atomic op in UnpinBufferNoOwner(), not
MarkBufferDirty(). I realized the latter would take additional complexity to
make safe (a CAS loop in TerminateBufferIO()). I am somewhat doubtful that
there are workloads where it matters...
Greetings,
Andres Freund
Attachments:
v4-0001-Improve-ReadRecentBuffer-scalability.patch
v4-0002-bufmgr-Don-t-lock-buffer-header-in-StrategyGetBuf.patch
v4-0003-bufmgr-Allow-some-buffer-state-modifications-whil.patch
v4-0004-bufmgr-Use-atomic-sub-for-unpinning-buffers.patch
v4-0005-bufmgr-fewer-calls-to-BufferDescriptorGetContentL.patch
v4-0006-bufmgr-Introduce-FlushUnlockedBuffer.patch
On Tue, 23 Sept 2025 at 00:14, Andres Freund <andres@anarazel.de> wrote:
On 2025-09-15 19:05:37 -0400, Andres Freund wrote:
Here are the first few cleaned up patches implementing the above steps, as
well as some cleanups. I included a commit from another thread, as it
conflicts with these changes, and we really should apply it - and it's
arguably required to make the changes viable, as it removes one more use of
PinBuffer_Locked().
Another change included is to not return the buffer with the spinlock held
from StrategyGetBuffer(), and instead pin the buffer in freelist.c. The reason
for that is to reduce the most common PinBuffer_locked() call. By definition
PinBuffer_locked() will become a bit slower due to 0003. But even without 0003
it 0002 is faster than master. And the previous approach also just seems
pretty unclean. I don't love that it requires the new TrackNewBufferPin(),
but I don't really have a better idea.
I invite particular attention to the commit message for 0003 as well as the
comment changes in buf_internals.h within.
Robert looked at the patches while we were chatting, and I addressed his
feedback in this new version.
I like these changes, and have some minor comments:
0001 ensures that ReadRecentBuffer increments the usage counter, which
someone who uses an access strategy may want to prevent. I know this
isn't exactly new behaviour, but something I noticed anyway. Apart
from that observation, LGTM
0002 has a FIXME in a comment in GetVictimBuffer. Assuming it's about
the comment itself needing updates, how about:
+ * Ensure, before we pin a victim buffer, that there's a free refcount
+ * entry, and a resource owner slot for the pin.
Again, LGTM.
0003's UnlockBufHdrExt:
This is implemented with CAS, even when we only want to change bits we
know the state of (or could know, if we spent the effort).
Given its inline nature, wouldn't it be better to use atomic_sub
instructions? Or is this to handle cases where the bits we want to
(un)set might be (un)set by a concurrent process?
If the latter, could we specialize this to do a single atomic_sub
whenever we want to change state bits that we know can be only changed
whilst holding the spinlock?
0004: LGTM
0005: LGTM
0006: LGTM
Kind regards,
Matthias van de Meent
Hi,
On 2025-10-04 09:05:45 +0200, Matthias van de Meent wrote:
On Tue, 23 Sept 2025 at 00:14, Andres Freund <andres@anarazel.de> wrote:
On 2025-09-15 19:05:37 -0400, Andres Freund wrote:
Here are the first few cleaned up patches implementing the above steps, as
well as some cleanups. I included a commit from another thread, as it
conflicts with these changes, and we really should apply it - and it's
arguably required to make the changes viable, as it removes one more use of
PinBuffer_Locked().
Another change included is to not return the buffer with the spinlock held
from StrategyGetBuffer(), and instead pin the buffer in freelist.c. The reason
for that is to reduce the most common PinBuffer_locked() call. By definition
PinBuffer_locked() will become a bit slower due to 0003. But even without 0003
it 0002 is faster than master. And the previous approach also just seems
pretty unclean. I don't love that it requires the new TrackNewBufferPin(),
but I don't really have a better idea.
I invite particular attention to the commit message for 0003 as well as the
comment changes in buf_internals.h within.
Robert looked at the patches while we were chatting, and I addressed his
feedback in this new version.
I like these changes, and have some minor comments:
Thank for reviewing!
0001 ensures that ReadRecentBuffer increments the usage counter, which
someone who uses an access strategy may want to prevent. I know this
isn't exactly new behaviour, but something I noticed anyway. Apart
from that observation, LGTM
Are you proposing to change behaviour? Right now ReadRecentBuffer doesn't even
accept a strategy, so I don't really see this as something that needs to be
tackled at this point.
I'm not sure I see any real use cases for ReadRecentBuffer() that would
benefit from a strategy, but I very well might just not be thinking wide
enough.
0002 has a FIXME in a comment in GetVictimBuffer. Assuming it's about
the comment itself needing updates
Indeed.
, how about:
+ * Ensure, before we pin a victim buffer, that there's a free refcount
+ * entry, and a resource owner slot for the pin.
Again, LGTM.
WFM.
0003's UnlockBufHdrExt:
This is implemented with CAS, even when we only want to change bits we
know the state of (or could know, if we spent the effort).
Given its inline nature, wouldn't it be better to use atomic_sub
instructions? Or is this to handle cases where the bits we want to
(un)set might be (un)set by a concurrent process?
Yes, it's to handle concurrent changes to the buffer state.
If the latter, could we specialize this to do a single atomic_sub
whenever we want to change state bits that we know can be only changed
whilst holding the spinlock?
We probably could optimize some cases as an atomic-sub, some others as an
atomic-and and others again as an atomic-or. The latter two, however, are
implemented as a CAS on x86 anyway...
After 0004 I don't think any of the paths using this are actually particularly
hot, so I'm somewhat doubtful it's worth trying to optimize this too much. If
there are hot paths, we really should try to work towards not even needing the
buffer header spinlock; that has a bigger impact than improving the code for
unlocking the buffer header...
Greetings,
Andres Freund
Hi,
On Tue, 7 Oct 2025 at 00:55, Andres Freund <andres@anarazel.de> wrote:
On 2025-10-04 09:05:45 +0200, Matthias van de Meent wrote:
0001 ensures that ReadRecentBuffer increments the usage counter, which
someone who uses an access strategy may want to prevent. I know this
isn't exactly new behaviour, but something I noticed anyway. Apart
from that observation, LGTM
Are you proposing to change behaviour?
Eventually, yes, but not necessarily now or with this patchset.
Right now ReadRecentBuffer doesn't even
accept a strategy, so I don't really see this as something that needs to be
tackled at this point.
I'm not sure I see any real use cases for ReadRecentBuffer() that would
benefit from a strategy, but I very well might just not be thinking wide
enough.
I think it's rather strange that there is no Extended variant of
ReadRecentBuffer, like how there is a ReadBufferExtended for
ReadBuffer. Yes, ReadRecentBuffer has more arguments to fill and so
has a smaller difference versus ReadBufferExtended, but it's not a
complete replacement for ReadBuffer[Ext] when you're aware of a recent
buffer of the page.
I'm not saying it will definitely happen, but I could see that e.g.
amcheck might want to keep track of buffer IDs of recent heap pages it
accessed to verify index's results, without holding a pin on all the
pages; and instead using ReadRecentBuffer[Extended] with a
BufferAccessStrategy to allow re-acquiring the buffer pin without
blowing out shared buffers or making parts of the pool take forever to
evict again.
0003's UnlockBufHdrExt:
This is implemented with CAS, even when we only want to change bits we
know the state of (or could know, if we spent the effort).
Given its inline nature, wouldn't it be better to use atomic_sub
instructions? Or is this to handle cases where the bits we want to
(un)set might be (un)set by a concurrent process?
Yes, it's to handle concurrent changes to the buffer state.
If the latter, could we specialize this to do a single atomic_sub
whenever we want to change state bits that we know can be only changed
whilst holding the spinlock?
We probably could optimize some cases as an atomic-sub, some others as an
atomic-and and others again as an atomic-or. The latter two, however, are
implemented as a CAS on x86 anyway...
After 0004 I don't think any of the paths using this are actually particularly
hot, so I'm somewhat doubtful it's worth trying to optimize this too much. If
there are hot paths, we really should try to work towards not even needing the
buffer header spinlock; that has a bigger impact than improving the code for
unlocking the buffer header...
Fair enough; I guess we'll see if further optimization would have much
impact once this all has been committed.
Kind regards,
Matthias van de Meent
Databricks
Hi,
I pushed a few commits from this patchset after Matthias' review
(thanks!). Unfortunately in 5e899859287 I missed that the valgrind annotations
would not be done anymore for the buffers returned by
StrategyGetBuffer(). Which turned skink red.
The attached 0001 patch centralizes the valgrind initialization in
TrackNewBufferPin(), which 5e899859287 had added. The nice side effect of that
is that there are fewer VALGRIND_MAKE_MEM_DEFINED() calls than before. The
naming isn't the perfect match, but it seems fine to me.
0002 is a new version of "Allow some buffer state modifications while holding
header lock", mainly with a fair bit of comment polishing around BufferDesc
and one small oversight fixed (didn't update a buffer_state variable in one
place).
Greetings,
Andres Freund
Attachments:
v5-0001-bufmgr-Fix-valgrind-checking-for-buffers-pinned-i.patch
v5-0002-bufmgr-Allow-some-buffer-state-modifications-whil.patch
v5-0003-bufmgr-Use-atomic-sub-for-unpinning-buffers.patch
On 2025-10-09 16:35:44 -0400, Andres Freund wrote:
I pushed a few commits from this patchset after Matthias' review
(thanks!). Unfortunately in 5e899859287 I missed that the valgrind annotations
would not be done anymore for the buffers returned by
StrategyGetBuffer(). Which turned skink red.
The attached 0001 patch centralizes the valgrind initialization in
TrackNewBufferPin(), which 5e899859287 had added. The nice side effect of that
is that there are fewer VALGRIND_MAKE_MEM_DEFINED() calls than before. The
naming isn't the perfect match, but it seems fine to me.
Forgot to say: I'll push this patch soon, to get skink back to green. Unless
somebody says something. We can adjust this later, if the comment and/or
placement of VALGRIND_MAKE_MEM_DEFINED() isn't to everyone's liking.
Hi,
On 2025-10-09 17:16:49 -0400, Andres Freund wrote:
On 2025-10-09 16:35:44 -0400, Andres Freund wrote:
I pushed a few commits from this patchset after Matthias' review
(thanks!). Unfortunately in 5e899859287 I missed that the valgrind annotations
would not be done anymore for the buffers returned by
StrategyGetBuffer(). Which turned skink red.
The attached 0001 patch centralizes the valgrind initialization in
TrackNewBufferPin(), which 5e899859287 had added. The nice side effect of that
is that there are fewer VALGRIND_MAKE_MEM_DEFINED() calls than before. The
naming isn't the perfect match, but it seems fine to me.
Forgot to say: I'll push this patch soon, to get skink back to green. Unless
somebody says something. We can adjust this later, if the comment and/or
placement of VALGRIND_MAKE_MEM_DEFINED() isn't to everyone's liking.
I have pushed that fix as well as the subsequent buffer header locking changes
a while ago.
Attached is a patchset that actually implements the buffer content locks in
bufmgr.c. This isn't that close to a committable shape yet, but it seemed
useful to get it out there. The first few patches seem closer, so it'll also
be useful to narrow this down.
0001: A straight-up bugfix in lwlock.c - albeit for a bug that seems currently
effectively harmless.
0002: Not really required, but seems like an improvement to me
0003: A prerequisite to 0004, pretty boring itself
0004: Use 64bit atomics for BufferDesc.state - at this point nothing uses the
additional bits yet, though. Some annoying reformatting required to avoid
long lines.
0005: There already was a wait event class for BUFFERPIN. It seems better to
make that more general than to implement them separately.
0006+0007: This is preparatory work for 0008, but also worthwhile on its
own. The private refcount stuff does show up in profiles. The reason it's
related is that without these changes the added information in 0008 makes that
worse.
0008: The main change. Implements buffer content locking independently from
lwlock.c. There's obviously a lot of similarity between lwlock.c code and
this, but I've not found a good way to reduce the duplication without giving
up too much. This patch does immediately introduce share-exclusive as a new
lock level, mostly because it was too painful to do separately.
0009+0010+0011: Preparatory work for 0012.
0012: One of the main goals of this patchset - use the new share-exclusive
lock level to only allow hint bits to be set while no IO is going on.
0013: Prototype of making UnlockReleaseBuffer() faster and of using it more
widely in nbtree.c
0014: Now that hint bits can't be done while IO is going on, we don't need to
copy pages anymore. This needs a fair bit more work, as denoted by the FIXMEs
in the code.
I've tried to add detail to the more important commit messages, at least until
0012.
I want to again emphasize that the important commits (i.e. 0008, 0012, 0014)
aren't close to being mergeable. But I think they're in a stage that they
could benefit from "lenient" high-level review.
Greetings,
Andres Freund