ReadRecentBuffer() doesn't scale well

Started by Andres Freund, over 2 years ago. 29 messages
#1Andres Freund
andres@anarazel.de

Hi,

As mentioned nearby [1], Thomas brought up [2] the idea of using
ReadRecentBuffer() in _bt_getroot(). I couldn't resist and prototyped it.

Unfortunately it scaled way worse at first. This is not an inherent issue, but
due to an implementation choice in ReadRecentBuffer(). Whereas the normal
BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
LockBufHdr(), checks if the buffer ID is the same and then uses
PinBuffer_Locked().

The problem with that is that PinBuffer() takes care to not hold the buffer
header spinlock: it uses compare_exchange to atomically acquire the pin while
guaranteeing that nobody holds the lock. When holding the buffer header spinlock,
there obviously is the risk of being scheduled out (or even just not having
exclusive access to the cacheline).
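
To make the difference concrete, here's a toy model of the two pin paths
(C11 atomics and an invented state layout; the real bufmgr.c code is more
involved, this just shows where the spinlock hold time comes from):

#include <stdatomic.h>
#include <stdint.h>

#define BM_LOCKED ((uint32_t) 1 << 31)        /* toy header "spinlock" bit */

typedef struct { _Atomic uint32_t state; } ToyBufferDesc;  /* refcount in low bits */

/* PinBuffer()-style: lock-free compare-exchange loop, never owns the spinlock. */
static void
toy_pin_cas(ToyBufferDesc *buf)
{
    uint32_t old = atomic_load(&buf->state);

    for (;;)
    {
        if (old & BM_LOCKED)
        {
            old = atomic_load(&buf->state);   /* lock held elsewhere, just re-read */
            continue;
        }
        if (atomic_compare_exchange_weak(&buf->state, &old, old + 1))
            break;                            /* pinned, without ever holding the lock */
    }
}

/* ReadRecentBuffer()-style: acquire the header spinlock, inspect, then pin.
 * If this backend loses the CPU between acquiring and releasing BM_LOCKED,
 * every other backend trying to pin this (very hot) buffer spins until it
 * runs again. */
static void
toy_pin_locked(ToyBufferDesc *buf)
{
    uint32_t old;

    for (;;)
    {
        old = atomic_fetch_or(&buf->state, BM_LOCKED);
        if (!(old & BM_LOCKED))
            break;                            /* we own the header lock now */
    }
    /* ... tag/validity checks would happen here, with the lock held ... */
    atomic_store(&buf->state, old + 1);       /* pin and release in one store */
}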

ReadRecentBuffer() scales worse even if LockBufHdr() is immediately followed
by PinBuffer_Locked(), so it's really just holding the lock that is the issue.

The fairly obvious solution to this is to just use PinBuffer() and just unpin
the buffer if its identity was changed concurrently. There could be an
unlocked pre-check as well. However, there's the following comment in
ReadRecentBuffer():
* It's now safe to pin the buffer. We can't pin first and ask
* questions later, because it might confuse code paths like
* InvalidateBuffer() if we pinned a random non-matching buffer.
*/

But I'm not sure I buy that - there are plenty of other things that can briefly
acquire a buffer pin (e.g. the checkpointer, reclaiming the buffer for other
contents, etc.).
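
Concretely, I'm thinking of a fast path along these lines (just a sketch
against the existing bufmgr.c routines, not tested; it glosses over what
should happen if we pin a buffer that isn't BM_VALID):

    bufHdr = GetBufferDescriptor(recent_buffer - 1);

    /* Optional unlocked pre-check, so we rarely pin an unrelated buffer. */
    if (!BufferTagsEqual(&tag, &bufHdr->tag))
        return false;

    valid = PinBuffer(bufHdr, NULL);    /* lock-free pin, as in BufferAlloc() */

    /* Re-check the identity now that we hold a pin. */
    if (valid && BufferTagsEqual(&tag, &bufHdr->tag))
    {
        pgBufferUsage.shared_blks_hit++;
        return true;
    }

    UnpinBuffer(bufHdr);                /* invalid or changed under us; fall back */
    return false;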

Another difference between using PinBuffer() and PinBuffer_Locked() is that
the latter does not adjust a buffer's usagecount.

Leaving the scalability issue aside, isn't it somewhat odd that optimizing a
codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
increasing usagecount anymore?

FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching
the root page's buffer id in RelationData, seems a noticeable win. About 7% in
a concurrent, read-only pgbench that utilizes batches of 10. And it should be
easy to get much bigger wins, e.g. with an index nested loop with a relatively
small index on the inner side.
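
To give an idea of the shape of that workload (the exact script and settings
behind the 7% number aren't reproduced here), think of a custom pgbench script
issuing ten primary-key lookups per transaction, run read-only with many
clients, e.g.:

$ cat select10.sql    # the following two lines repeated ten times
\set aid random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

$ pgbench -n -M prepared -f select10.sql -c 64 -j 64 -T 60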

Greetings,

Andres Freund

[1]: /messages/by-id/20230627013458.axge7iylw7llyvww@awork3.anarazel.de
[2]: https://twitter.com/MengTangmu/status/1673439083518115840

#2Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#1)
Re: ReadRecentBuffer() doesn't scale well

On Tue, Jun 27, 2023 at 2:05 PM Andres Freund <andres@anarazel.de> wrote:

As mentioned nearby [1], Thomas brought up [2] the idea of using
ReadRecentBuffer() in _bt_getroot(). I couldn't resist and prototyped it.

Thanks!

Unfortunately it scaled way worse at first. This is not an inherent issue, but
due to an implementation choice in ReadRecentBuffer(). Whereas the normal
BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
LockBufHdr(), checks if the buffer ID is the same and then uses
PinBuffer_Locked().

The problem with that is that PinBuffer() takes care to not hold the buffer
header spinlock: it uses compare_exchange to atomically acquire the pin while
guaranteeing that nobody holds the lock. When holding the buffer header spinlock,
there obviously is the risk of being scheduled out (or even just not having
exclusive access to the cacheline).

Yeah. Aside from the inherent nastiness of user-space spinlocks, this new
use case is also enormously more likely to contend, and then get into
trouble by being preempted, than the use case I was focusing on at the
time, because btree root pages are about the hottest pages in the universe.

The fairly obvious solution to this is to just use PinBuffer() and just unpin
the buffer if its identity was changed concurrently. There could be an
unlocked pre-check as well. However, there's the following comment in
ReadRecentBuffer():
* It's now safe to pin the buffer. We can't pin first and ask
* questions later, because it might confuse code paths like
* InvalidateBuffer() if we pinned a random non-matching buffer.
*/

But I'm not sure I buy that - there are plenty of other things that can briefly
acquire a buffer pin (e.g. the checkpointer, reclaiming the buffer for other
contents, etc.).

I may well have been too cautious with that. The worst thing I can
think of right now is that InvalidateBuffer() would busy loop (as it
already does in other rare cases) when it sees a pin.

Another difference between using PinBuffer() and PinBuffer_Locked() is that
the latter does not adjust a buffer's usagecount.

Leaving the scalability issue aside, isn't it somewhat odd that optimizing a
codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
increasing usagecount anymore?

Yeah, that is not great. The simplification you suggest would fix
that too, though I guess it would also bump the usage count of buffers
that don't have the tag we expected; that's obviously rare, and it errs
on the better side anyway.

FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching
the root page's buffer id in RelationData, seems a noticeable win. About 7% in
a concurrent, read-only pgbench that utilizes batches of 10. And it should be
easy to get much bigger wins, e.g. with an index nested loop with a relatively
small index on the inner side.

Wooo, that's better than I was hoping. Thanks for trying it out! I
think, for the complexity involved (ie very little), it's a nice
result, and worth considering even though it's also a solid clue that
we could do much better than this with a (yet to be designed)
longer-lived pin scheme. smgr_targblock could be another
easy-to-cache candidate, ie a place where there is a single
interesting hot page that we're already keeping track of with no
requirement for new backend-local mapping machinery.

#3Peter Geoghegan
pg@bowt.ie
In reply to: Thomas Munro (#2)
Re: ReadRecentBuffer() doesn't scale well

On Mon, Jun 26, 2023 at 8:34 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Yeah. Aside from the inherent nastiness of user-space spinlocks, this new
use case is also enormously more likely to contend, and then get into
trouble by being preempted, than the use case I was focusing on at the
time, because btree root pages are about the hottest pages in the universe.

They're not just the hottest. They're also among the least likely to
change from one moment to the next. (If that ever failed to hold then
it wouldn't take long for the index to become grotesquely tall.)

--
Peter Geoghegan

#4Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#2)
Re: ReadRecentBuffer() doesn't scale well

Hi,

On 2023-06-27 15:33:57 +1200, Thomas Munro wrote:

On Tue, Jun 27, 2023 at 2:05 PM Andres Freund <andres@anarazel.de> wrote:

Unfortunately it scaled way worse at first. This is not an inherent issue, but
due to an implementation choice in ReadRecentBuffer(). Whereas the normal
BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
LockBufHdr(), checks if the buffer ID is the same and then uses
PinBuffer_Locked().

The problem with that is that PinBuffer() takes care to not hold the buffer
header spinlock: it uses compare_exchange to atomically acquire the pin while
guaranteeing that nobody holds the lock. When holding the buffer header spinlock,
there obviously is the risk of being scheduled out (or even just not having
exclusive access to the cacheline).

Yeah. Aside from the inherent nastiness of user-space spinlocks

I've been wondering about making our backoff path use futexes, after some
adaptive spinning.

The fairly obvious solution to this is to just use PinBuffer() and just unpin
the buffer if its identity was changed concurrently. There could be an
unlocked pre-check as well. However, there's the following comment in
ReadRecentBuffer():
* It's now safe to pin the buffer. We can't pin first and ask
* questions later, because it might confuse code paths like
* InvalidateBuffer() if we pinned a random non-matching buffer.
*/

But I'm not sure I buy that - there are plenty of other things that can briefly
acquire a buffer pin (e.g. the checkpointer, reclaiming the buffer for other
contents, etc.).

I may well have been too cautious with that. The worst thing I can
think of right now is that InvalidateBuffer() would busy loop (as it
already does in other rare cases) when it sees a pin.

Right. Particularly if we were to add a pre-check for the tag to match, that
should be extremely rare.

Another difference between using PinBuffer() and PinBuffer_Locked() is that
the latter does not adjust a buffer's usagecount.

Leaving the scalability issue aside, isn't it somewhat odd that optimizing a
codepath to use ReadRecentBuffer() instead of ReadBuffer() leads to not
increasing usagecount anymore?

Yeah, that is not great. The simplification you suggest would fix
that too, though I guess it would also bump the usage count of buffers
that don't have the tag we expected; that's obviously rare, and it errs
on the better side anyway.

Yea, I'm not worried about that. If somebody is, we could just add code to
decrement the usagecount again.

FWIW, once that's fixed, using ReadRecentBuffer() for _bt_getroot(), caching
the root page's buffer id in RelationData, seems a noticeable win. About 7% in
a concurrent, read-only pgbench that utilizes batches of 10. And it should be
easy to get much bigger wins, e.g. with an index nested loop with a relatively
small index on the inner side.

Wooo, that's better than I was hoping. Thanks for trying it out! I
think, for the complexity involved (ie very little)

I don't really have a concrete thought for where to store the id of the recent
buffer. I just added a new field into some padding in RelationData, but we
might go for something fancier.

smgr_targblock could be another easy-to-cache candidate, ie a place where
there is a single interesting hot page that we're already keeping track of
with no requirement for new backend-local mapping machinery.

I wonder if we should simply add a generic field for such a Buffer to
RelationData that the AM can use as it desires. For btree that would be the
root page, for heap the target block ...

it's a nice result, and worth considering even though it's also a solid clue
that we could do much better than this with a (yet to be designed)
longer-lived pin scheme.

Indeed. PinBuffer() is pretty hot after the change. As is the buffer content
lock.

Particularly for the root page, it'd be really interesting to come up with a
scheme that keeps an offline copy of the root page while also pinning the real
root page. I think we should be able to have a post-check that can figure out
if the copied root page is out of date after searching it, without needing the
content lock.

Greetings,

Andres Freund

#5Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#4)
Re: ReadRecentBuffer() doesn't scale well

On Mon, Jun 26, 2023 at 9:09 PM Andres Freund <andres@anarazel.de> wrote:

I think we should be able to have a post-check that can figure out
if the copied root page is out of date after searching it, without needing the
content lock.

I'm guessing that you're thinking of doing something with the page LSN?

--
Peter Geoghegan

#6Thomas Munro
thomas.munro@gmail.com
In reply to: Peter Geoghegan (#5)
Re: ReadRecentBuffer() doesn't scale well

On Tue, Jun 27, 2023 at 4:32 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Jun 26, 2023 at 9:09 PM Andres Freund <andres@anarazel.de> wrote:

I think we should be able to have a post-check that can figure out
if the copied root page is out of date after searching it, without needing the
content lock.

I'm guessing that you're thinking of doing something with the page LSN?

If the goal is to get rid of both pins and content locks, LSN isn't
enough. A page might be evicted and replaced by another page that has
the same LSN because they were modified by the same record. Maybe
that's vanishingly rare, but the correct thing would be a counter that
goes up on modification AND eviction. (FWIW I toyed with variants of
this concept in the context of SLRU -> buffer pool migration, where I
was trying to do zero-lock CLOG lookups; in that case I didn't need
the copy of the page being discussed here due to the data being
atomically readable, but I had the same requirement for a
"did-it-change-under-my-feet?" check).

#7Peter Geoghegan
pg@bowt.ie
In reply to: Thomas Munro (#6)
Re: ReadRecentBuffer() doesn't scale well

On Mon, Jun 26, 2023 at 9:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:

If the goal is to get rid of both pins and content locks, LSN isn't
enough. A page might be evicted and replaced by another page that has
the same LSN because they were modified by the same record. Maybe
that's vanishingly rare, but the correct thing would be a counter that
goes up on modification AND eviction.

It should be safe to allow searchers to see a version of the root page
that is out of date. The Lehman & Yao design is very permissive about
these things. There aren't any special cases where the general rules
are weakened in some way that might complicate this approach.
Searchers need to check the high key to determine if they need to move
right -- same as always.

More concretely: A root page can be concurrently split when there is
an in-flight index scan that is about to land on it (which becomes the
left half of the split). It doesn't matter if it's a searcher that is
"between" the meta page and the root page. It doesn't matter if a
level was added. This is true even though nothing that you'd usually
think of as an interlock is held "between levels". The root page isn't
really special, except in the obvious way. We can even have two roots
at the same time (the true root, and the fast root).

--
Peter Geoghegan

#8Thomas Munro
thomas.munro@gmail.com
In reply to: Peter Geoghegan (#7)
Re: ReadRecentBuffer() doesn't scale well

On Tue, Jun 27, 2023 at 4:53 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Jun 26, 2023 at 9:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:

If the goal is to get rid of both pins and content locks, LSN isn't
enough. A page might be evicted and replaced by another page that has
the same LSN because they were modified by the same record. Maybe
that's vanishingly rare, but the correct thing would be a counter that
goes up on modification AND eviction.

It should be safe to allow searchers to see a version of the root page
that is out of date. The Lehman & Yao design is very permissive about
these things. There aren't any special cases where the general rules
are weakened in some way that might complicate this approach.
Searchers need to check the high key to determine if they need to move
right -- same as always.

OK. I guess I'm talking about a slightly more general version of the
problem, inspired by the stuff I mentioned in parentheses, which would
simply get the wrong answer if the mapping changed; here, by contrast,
you'd use the cached copy in the race case, which should still work for
searches.

So I guess the question for this thread is: do we want to work on
ReadRecentBuffer(), or just take this experiment as evidence of even
more speed-up available and aim for that directly?

#9Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#6)
Re: ReadRecentBuffer() doesn't scale well

Hi,

On 2023-06-27 16:40:08 +1200, Thomas Munro wrote:

On Tue, Jun 27, 2023 at 4:32 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Jun 26, 2023 at 9:09 PM Andres Freund <andres@anarazel.de> wrote:

I think we should be able to have a post-check that can figure out
if the copied root page is out of date after searching it, without needing the
content lock.

I'm guessing that you're thinking of doing something with the page LSN?

Yes, that seems to be the most obvious.

If the goal is to get rid of both pins and content locks, LSN isn't
enough.

I was imagining you'd have a long-lived pin. I don't think trying to make it
work without that is particularly promising in this context, where it seems
quite feasible to keep pins around for a while.

A page might be evicted and replaced by another page that has the same LSN
because they were modified by the same record. Maybe that's vanishingly
rare, but the correct thing would be a counter that goes up on modification
AND eviction.

I don't think it would need to be a single counter. If we wanted to do
something like this, I think you'd have to have a counter in the buffer desc
that's incremented whenever the page is replaced. Plus the LSN for the page
content change "counter".

Greetings,

Andres Freund

#10Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#7)
Re: ReadRecentBuffer() doesn't scale well

Hi,

On 2023-06-26 21:53:12 -0700, Peter Geoghegan wrote:

It should be safe to allow searchers to see a version of the root page
that is out of date. The Lehman & Yao design is very permissive about
these things. There aren't any special cases where the general rules
are weakened in some way that might complicate this approach.
Searchers need to check the high key to determine if they need to move
right -- same as always.

Wouldn't we at least need a pin on the root page, or hold a snapshot, to
defend against page deletions?

Greetings,

Andres Freund

#11Peter Geoghegan
pg@bowt.ie
In reply to: Andres Freund (#10)
Re: ReadRecentBuffer() doesn't scale well

On Mon, Jun 26, 2023 at 11:27 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-06-26 21:53:12 -0700, Peter Geoghegan wrote:

It should be safe to allow searchers to see a version of the root page
that is out of date. The Lehman & Yao design is very permissive about
these things. There aren't any special cases where the general rules
are weakened in some way that might complicate this approach.
Searchers need to check the high key to determine if they need to move
right -- same as always.

Wouldn't we at least need a pin on the root page, or hold a snapshot, to
defend against page deletions?

You need to hold a snapshot to prevent concurrent page recycling --
though not page deletion itself (I did say "anything that you'd
usually think of as an interlock"). I'm pretty sure that a concurrent
page deletion is possible, even when you hold a pin on the page.
(Perhaps not, but if not then it's just an accident -- a side-effect
of the interlock that protects against concurrent heap TID recycling.)

You can't delete a rightmost page (on any level). Every root page is a
rightmost page. So the root would have to be split, and then once
again emptied before it could be deleted -- only then would there be a
danger of some backend with a locally cached root page having an
irredeemably bad picture of what's going on with the index. That's
another angle that you could approach the problem from, I suppose.

--
Peter Geoghegan

#12Ants Aasma
ants@cybertec.at
In reply to: Andres Freund (#4)
1 attachment(s)
Re: ReadRecentBuffer() doesn't scale well

On Tue, 27 Jun 2023 at 07:09, Andres Freund <andres@anarazel.de> wrote:

On 2023-06-27 15:33:57 +1200, Thomas Munro wrote:

On Tue, Jun 27, 2023 at 2:05 PM Andres Freund <andres@anarazel.de> wrote:

Unfortunately it scaled way worse at first. This is not an inherent issue, but
due to an implementation choice in ReadRecentBuffer(). Whereas the normal
BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
LockBufHdr(), checks if the buffer ID is the same and then uses
PinBuffer_Locked().

The problem with that is that PinBuffer() takes care to not hold the buffer
header spinlock: it uses compare_exchange to atomically acquire the pin while
guaranteeing that nobody holds the lock. When holding the buffer header spinlock,
there obviously is the risk of being scheduled out (or even just not having
exclusive access to the cacheline).

Yeah. Aside from the inherent nastiness of user-space spinlocks

I've been wondering about making our backoff path use futexes, after some
adaptive spinning.

If you want to experiment, here is a rebased version of something I
hacked up a couple of years back on the way to Fosdem Pgday. I didn't
pursue it further because I didn't have a use case where it showed a
significant difference.

--
Ants

Attachments:

futex-prototype.patch (text/x-patch)
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 327ac64f7c2..67a5e8a0246 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -92,6 +92,7 @@ s_lock_stuck(const char *file, int line, const char *func)
 int
 s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
 {
+#ifndef HAS_FUTEX
 	SpinDelayStatus delayStatus;
 
 	init_spin_delay(&delayStatus, file, line, func);
@@ -104,6 +105,8 @@ s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
 	finish_spin_delay(&delayStatus);
 
 	return delayStatus.delays;
+#endif
+	elog(FATAL, "Should not be called");
 }
 
 #ifdef USE_DEFAULT_S_UNLOCK
@@ -230,6 +233,71 @@ update_spins_per_delay(int shared_spins_per_delay)
 	return (shared_spins_per_delay * 15 + spins_per_delay) / 16;
 }
 
+#ifdef HAS_FUTEX
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <linux/futex.h>
+
+static int
+futex(volatile uint32 *uaddr, int futex_op, int val,
+	  const struct timespec *timeout, int *uaddr2, int val3)
+{
+	return syscall(SYS_futex, uaddr, futex_op, val,
+				   timeout, uaddr, val3);
+}
+
+int
+futex_lock(volatile slock_t *lock, uint32 current, const char *file, int line, const char *func)
+{
+	int i, s;
+	/*
+	 * First lets wait for a bit without involving the kernel, it is quite likely
+	 * the lock holder is still running.
+	 **/
+	if (likely(current < 2))
+	{
+		uint32 expected;
+		for (i = 0; i < DEFAULT_SPINS_PER_DELAY; i++)
+		{
+			SPIN_DELAY();
+			expected = lock->value;
+			if (expected == 0 && pg_atomic_compare_exchange_u32(lock, &expected, 1))
+				return i;
+		}
+
+		while (expected != 2 && !pg_atomic_compare_exchange_u32(lock, &expected, 2)) {
+			if (expected == 0 && pg_atomic_compare_exchange_u32(lock, &expected, 2))
+				return i;
+		}
+	}
+
+	/* At this point lock value is 2 and we will get waken up */
+	while (true)
+	{
+		uint32 expected = 0;
+		s = futex(&(lock->value), FUTEX_WAIT, 2, NULL, NULL, 0);
+		if (s == -1 && errno != EAGAIN)
+			elog(FATAL, "Futex wait failed with error: %m");
+
+		/* Maybe someone else was waiting too, we will try to wake them up. */
+		if (pg_atomic_compare_exchange_u32(lock, &expected, 2))
+			break;
+
+	}
+
+	return i;
+}
+
+int futex_unlock(volatile slock_t *lock, uint32 current)
+{
+	lock->value = 0;
+	if (futex(&(lock->value), FUTEX_WAKE, 1, NULL, NULL, 0) == -1)
+		elog(FATAL, "Futex wake failed with error: %m");
+
+	return 0;
+}
+
+#endif /* HAS_FUTEX */
 
 /*****************************************************************************/
 #if defined(S_LOCK_TEST)
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index c9fa84cc43c..6351ec0804e 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -205,6 +205,52 @@ spin_delay(void)
 #ifdef __x86_64__		/* AMD Opteron, Intel EM64T */
 #define HAS_TEST_AND_SET
 
+#if defined(__linux__)
+#define HAS_FUTEX 1 	/* TODO: move to configure to check for old kernels */
+#endif
+
+#ifdef HAS_FUTEX
+
+#include "port/atomics.h"
+
+typedef pg_atomic_uint32 slock_t;
+
+#define S_LOCK(lock) \
+	do { \
+		uint32 expected = 0; \
+		if (unlikely(!pg_atomic_compare_exchange_u32((lock), &expected, 1))) \
+			futex_lock((lock), expected, __FILE__, __LINE__, __func__); \
+	} while (0)
+
+
+#define S_UNLOCK(lock) \
+	do { \
+		uint32 actual = pg_atomic_exchange_u32((lock), 0); \
+		if (unlikely(actual == 2)) \
+			futex_unlock((lock), actual); \
+	} while (0)
+extern int futex_lock(volatile slock_t *lock, uint32 current, const char *file, int line, const char *func);
+extern int futex_unlock(volatile slock_t *lock, uint32 current);
+
+/* TAS only needed for regress */
+#define TAS(lock) tas(lock)
+
+static __inline__ int
+tas(volatile slock_t *lock)
+{
+	register uint32 _res = 1;
+
+	__asm__ __volatile__(
+		"	lock			\n"
+		"	xchgl	%0,%1	\n"
+:		"+q"(_res), "+m"(*lock)
+:		/* no inputs */
+:		"memory", "cc");
+	return (int) _res;
+}
+
+
+#else
 typedef unsigned char slock_t;
 
 #define TAS(lock) tas(lock)
@@ -247,6 +293,7 @@ spin_delay(void)
 		" rep; nop			\n");
 }
 
+#endif   /* HAS_FUTEX */
 #endif	 /* __x86_64__ */
 
 
diff --git a/src/test/regress/regress.c b/src/test/regress/regress.c
index bcbc6d910f1..b92f8542ae9 100644
--- a/src/test/regress/regress.c
+++ b/src/test/regress/regress.c
@@ -857,7 +857,11 @@ test_spinlock(void)
 		S_UNLOCK(&struct_w_lock.lock);
 
 		/* and that "contended" acquisition works */
+#ifdef HAS_FUTEX
+		futex_lock(&struct_w_lock.lock, 1, "testfile", 17, "testfunc");
+#else
 		s_lock(&struct_w_lock.lock, "testfile", 17, "testfunc");
+#endif
 		S_UNLOCK(&struct_w_lock.lock);
 
 		/*
#13Andres Freund
andres@anarazel.de
In reply to: Ants Aasma (#12)
Re: ReadRecentBuffer() doesn't scale well

Hi,

On 2023-06-27 14:49:48 +0300, Ants Aasma wrote:

If you want to experiment, here is a rebased version of something I
hacked up a couple of years back on the way to Fosdem Pgday. I didn't
pursue it further because I didn't have a use case where it showed a
significant difference.

Thanks for posting!

Based on past experiments, anything that requires an atomic op during spinlock
release on x86 will be painful :/. I'm not sure there's a realistic way to
avoid that with futexes though :(.

Greetings,

Andres Freund

#14Ants Aasma
ants@cybertec.at
In reply to: Andres Freund (#13)
Re: ReadRecentBuffer() doesn't scale well

On Tue, 27 Jun 2023 at 18:40, Andres Freund <andres@anarazel.de> wrote:

On 2023-06-27 14:49:48 +0300, Ants Aasma wrote:

If you want to experiment, here is a rebased version of something I
hacked up a couple of years back on the way to Fosdem Pgday. I didn't
pursue it further because I didn't have a use case where it showed a
significant difference.

Thanks for posting!

Based on past experiments, anything that requires an atomic op during spinlock
release on x86 will be painful :/. I'm not sure there's a realistic way to
avoid that with futexes though :(.

Do you happen to know if a plain xchg instruction counts as an atomic
for this? I haven't done atomics stuff in a while, so I might be
missing something, but at first glance I think using a plain xchg
would be enough for the releasing side.

--
Ants

#15Andres Freund
andres@anarazel.de
In reply to: Ants Aasma (#14)
Re: ReadRecentBuffer() doesn't scale well

Hi,

On 2023-06-27 19:04:31 +0300, Ants Aasma wrote:

On Tue, 27 Jun 2023 at 18:40, Andres Freund <andres@anarazel.de> wrote:

On 2023-06-27 14:49:48 +0300, Ants Aasma wrote:

If you want to experiment, here is a rebased version of something I
hacked up a couple of years back on the way to Fosdem Pgday. I didn't
pursue it further because I didn't have a use case where it showed a
significant difference.

Thanks for posting!

Based on past experiments, anything that requires an atomic op during spinlock
release on x86 will be painful :/. I'm not sure there's a realistic way to
avoid that with futexes though :(.

Do you happen to know if a plain xchg instruction counts as an atomic
for this? I haven't done atomics stuff in a while, so I might be
missing something, but at first glance I think using a plain xchg
would be enough for the releasing side.

It is automatically an atomic op when referencing memory:

Intel SDM 9.1.2.1 Automatic Locking:
"The operations on which the processor automatically follows the LOCK semantics are as follows:
• When executing an XCHG instruction that references memory.
...
"

Theoretically cmpxchg can be used in a non-atomic fashion. I'm not sure that
can be done correctly though, if you want to also store a separate value for
the futex. This stuff is hard to think through :)
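
To make the release side concrete, here's a toy version of the patch's
S_UNLOCK path (same 0 = free, 1 = locked, 2 = locked-with-possible-waiters
convention as above; Linux-only, error handling omitted):

#define _GNU_SOURCE
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

typedef _Atomic uint32_t toy_futex_lock;

static void
toy_unlock(toy_futex_lock *lock)
{
    /*
     * xchg referencing memory is implicitly locked, so this one instruction
     * both releases the lock and tells us whether a waiter may be sleeping.
     * A plain store would release the lock but lose that information, which
     * is why the release path can't avoid an atomic op here.
     */
    if (atomic_exchange(lock, 0) == 2)
        syscall(SYS_futex, lock, FUTEX_WAKE, 1, NULL, NULL, 0);
}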

Greetings,

Andres Freund

#16Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#11)
Re: ReadRecentBuffer() doesn't scale well

Hi,

On 2023-06-27 01:10:25 -0700, Peter Geoghegan wrote:

On Mon, Jun 26, 2023 at 11:27 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-06-26 21:53:12 -0700, Peter Geoghegan wrote:

It should be safe to allow searchers to see a version of the root page
that is out of date. The Lehman & Yao design is very permissive about
these things. There aren't any special cases where the general rules
are weakened in some way that might complicate this approach.
Searchers need to check the high key to determine if they need to move
right -- same as always.

Wouldn't we at least need a pin on the root page, or hold a snapshot, to
defend against page deletions?

You need to hold a snapshot to prevent concurrent page recycling --
though not page deletion itself (I did say "anything that you'd
usually think of as an interlock").

I don't think we'd want to have a snapshot for this; that would make it much less
beneficial.

I'm pretty sure that a concurrent page deletion is possible, even when you
hold a pin on the page. (Perhaps not, but if not then it's just an accident
-- a side-effect of the interlock that protects against concurrent heap TID
recycling.)

Looks like the pin should prevent the danger, but it wouldn't be very attractive,
due to blocking vacuum...

I've wondered before about a type of pin that just prevents buffer
replacement, but not cleaning up page contents. I think that'd be beneficial
in quite a few places.

You can't delete a rightmost page (on any level). Every root page is a
rightmost page. So the root would have to be split, and then once
again emptied before it could be deleted -- only then would there be a
danger of some backend with a locally cached root page having an
irredeemably bad picture of what's going on with the index. That's
another angle that you could approach the problem from, I suppose.

If we had a way of just preventing the page from being replaced, or reliably
detecting that happening, without blocking btree vacuum, the easiest path
seems to be to use the cached version of the root page, and re-do the work
whenever a) the LSN of the page has changed or b) the buffer has been
replaced. To me that seems like it'd likely be simpler and more general than
relying on being able to step right from any outdated, but not deleted,
version of the page (due to the page deletion issues).

Obviously that'd lead to retries more often - but realistically it's still
going to be vanishingly rare, root pages don't get modified that much once the
index is beyond toy size.

I think a replacement counter in the buffer desc is the easiest way to achieve
that? We'd have to store the buffer ID, buffer replacement counter and page
LSN in RelationData. I think the protocol would have to be something like

1) do search on the copy of the root page

2) get page LSN from the relevant buffer contents - this could be from a
different relation / block or even an empty page, but will never be an
invalid memory access, as we don't free shared buffers before shutdown. If
the LSN changed since taking the copy, take a new copy of the root page and
start at 1)

3) check if buffer replacement counter is the same as at the time of the copy,
if not take a new copy of the root page and start at 1)

4) happiness
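
Roughly, in code (every field and function name below that isn't in the tree
today is hypothetical, and this ignores memory-ordering details between the
two reads):

/* Hypothetical cached state, e.g. hanging off RelationData. */
typedef struct CachedRootPage
{
    Buffer      buffer;         /* buffer the root page lived in when copied */
    uint64      replace_count;  /* hypothetical per-buffer replacement counter, at copy time */
    XLogRecPtr  lsn;            /* root page LSN at copy time */
    char        page[BLCKSZ];   /* private copy of the root page, searched without locks */
} CachedRootPage;

/* After searching cached->page (step 1), decide whether to trust the result. */
static bool
cached_root_still_valid(CachedRootPage *cached)
{
    BufferDesc *hdr = GetBufferDescriptor(cached->buffer - 1);

    /*
     * Step 2: this LSN may belong to a different relation/block or an empty
     * page if the buffer was recycled, but it's never an invalid memory
     * access, since shared buffers aren't freed before shutdown.
     */
    if (PageGetLSN(BufferGetPage(cached->buffer)) != cached->lsn)
        return false;           /* take a new copy and search again */

    /* Step 3: has the buffer been given different contents since the copy? */
    if (pg_atomic_read_u64(&hdr->replace_count) != cached->replace_count)
        return false;           /* take a new copy and search again */

    return true;                /* step 4: happiness */
}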

For optimization reasons it might make sense to store the buffer replacement
counter on a separate cacheline from BufferDesc.{state,content_lock}, so
reading the buffer replacement counter doesn't cause cacheline contention with
backends working with state/lock. But that's an implementation detail, and it
might not matter much, because the pressure on state,content_lock would be
reduced drastically.

Greetings,

Andres Freund

#17Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#16)
2 attachment(s)
Re: ReadRecentBuffer() doesn't scale well

I (re)discovered why I used the lock-then-pin approach. In the
comments I mentioned InvalidateBuffer(), but the main problem is in its
caller GetVictimBuffer(), which has various sanity checks about
reference counts that can occasionally fail if you have code randomly
pinning any old buffer.

New idea: use the standard PinBuffer() function, but add a mode that
doesn't pin invalid buffers (with the caveat that you can perhaps get a
false negative due to unlocked read, but never a false positive; see
commit message). Otherwise we'd have to duplicate all the same logic
to use cmpxchg for ReadRecentBuffer(), or rethink the assumptions in
that other code.

As for the lack of usage bump in the back-branches, I think the
options are: teach PinBuffer_Locked() to increment it optionally, or
back-patch whatever we come up with for this.

For the root buffer optimisation, the obvious place for storage seems
to be under rd_amcache. It was originally invented for the cached
metapage (commit d2896a9ed14) but could accommodate a new struct
holding whatever we want. Here is a patch to try that out.
BTAMCacheData would also be a natural place to put future things
including a copy of the root page itself, in later work on lock-free
tricks.

Experimental patches attached.

Attachments:

0001-Improve-ReadRecentBuffer-scalability.patch (text/x-patch)
From b91d996c691077a7fa48718c933159136d650e77 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 29 Jun 2023 10:52:56 +1200
Subject: [PATCH 1/2] Improve ReadRecentBuffer() scalability.

While testing a new potential use for ReadRecentBuffer(), Andres
reported that it scales badly when called concurrently for the same
buffer by many backends.  Instead of a naive (but wrong) coding with
PinBuffer(), it used the spinlock, so that it could be careful to pin
only if the buffer was valid and holding the expected block, to avoid
breaking invariants in eg GetVictimBuffer().  Unfortunately that made it
less scalable than PinBuffer(), which uses compare-exchange instead.

We can fix that by giving PinBuffer() a new skip_if_not_valid mode that
doesn't pin invalid buffers.  It might occasionally skip when it
shouldn't due to the unlocked read of the header flags, but that's
unlikely and perfectly acceptable for an opportunistic optimisation
routine, and it can only succeed when it really should due to the
compare-exchange loop.

XXX This also fixes ReadRecentBuffer()'s failure to bump the usage
count.  Fix separately or back-patch this?

Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20230627020546.t6z4tntmj7wmjrfh%40awork3.anarazel.de
---
 src/backend/storage/buffer/bufmgr.c | 60 ++++++++++++-----------------
 1 file changed, 24 insertions(+), 36 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3c59bbd04e..0df9a1ec30 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -467,7 +467,8 @@ static BlockNumber ExtendBufferedRelShared(ExtendBufferedWhat eb,
 										   BlockNumber extend_upto,
 										   Buffer *buffers,
 										   uint32 *extended_by);
-static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
+static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
+					  bool skip_if_not_valid);
 static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf);
 static void BufferSync(int flags);
@@ -635,7 +636,6 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
 	BufferDesc *bufHdr;
 	BufferTag	tag;
 	uint32		buf_state;
-	bool		have_private_ref;
 
 	Assert(BufferIsValid(recent_buffer));
 
@@ -663,38 +663,17 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
 	else
 	{
 		bufHdr = GetBufferDescriptor(recent_buffer - 1);
-		have_private_ref = GetPrivateRefCount(recent_buffer) > 0;
 
-		/*
-		 * Do we already have this buffer pinned with a private reference?  If
-		 * so, it must be valid and it is safe to check the tag without
-		 * locking.  If not, we have to lock the header first and then check.
-		 */
-		if (have_private_ref)
-			buf_state = pg_atomic_read_u32(&bufHdr->state);
-		else
-			buf_state = LockBufHdr(bufHdr);
-
-		if ((buf_state & BM_VALID) && BufferTagsEqual(&tag, &bufHdr->tag))
+		/* Is it still valid and holding the right tag? */
+		if (PinBuffer(bufHdr, NULL, true))
 		{
-			/*
-			 * It's now safe to pin the buffer.  We can't pin first and ask
-			 * questions later, because it might confuse code paths like
-			 * InvalidateBuffer() if we pinned a random non-matching buffer.
-			 */
-			if (have_private_ref)
-				PinBuffer(bufHdr, NULL);	/* bump pin count */
-			else
-				PinBuffer_Locked(bufHdr);	/* pin for first time */
-
-			pgBufferUsage.shared_blks_hit++;
-
-			return true;
+			if (BufferTagsEqual(&tag, &bufHdr->tag))
+			{
+				pgBufferUsage.shared_blks_hit++;
+				return true;
+			}
+			UnpinBuffer(bufHdr);
 		}
-
-		/* If we locked the header above, now unlock. */
-		if (!have_private_ref)
-			UnlockBufHdr(bufHdr, buf_state);
 	}
 
 	return false;
@@ -1252,7 +1231,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 */
 		buf = GetBufferDescriptor(existing_buf_id);
 
-		valid = PinBuffer(buf, strategy);
+		valid = PinBuffer(buf, strategy, false);
 
 		/* Can release the mapping lock as soon as we've pinned it */
 		LWLockRelease(newPartitionLock);
@@ -1330,7 +1309,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
 
-		valid = PinBuffer(existing_buf_hdr, strategy);
+		valid = PinBuffer(existing_buf_hdr, strategy, false);
 
 		/* Can release the mapping lock as soon as we've pinned it */
 		LWLockRelease(newPartitionLock);
@@ -1979,7 +1958,7 @@ ExtendBufferedRelShared(ExtendBufferedWhat eb,
 			 * Pin the existing buffer before releasing the partition lock,
 			 * preventing it from being evicted.
 			 */
-			valid = PinBuffer(existing_hdr, strategy);
+			valid = PinBuffer(existing_hdr, strategy, false);
 
 			LWLockRelease(partition_lock);
 
@@ -2225,10 +2204,13 @@ ReleaseAndReadBuffer(Buffer buffer,
  * Note that ResourceOwnerEnlargeBuffers must have been done already.
  *
  * Returns true if buffer is BM_VALID, else false.  This provision allows
- * some callers to avoid an extra spinlock cycle.
+ * some callers to avoid an extra spinlock cycle.  If skip_if_not_valid is
+ * true, then a false return value also indicates that the buffer was
+ * (recently) invalid and has not been pinned.
  */
 static bool
-PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
+PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
+		  bool skip_if_not_valid)
 {
 	Buffer		b = BufferDescriptorGetBuffer(buf);
 	bool		result;
@@ -2252,6 +2234,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 			if (old_buf_state & BM_LOCKED)
 				old_buf_state = WaitBufHdrUnlocked(buf);
 
+			if (unlikely(skip_if_not_valid && !(old_buf_state & BM_VALID)))
+			{
+				ForgetPrivateRefCountEntry(ref);
+				return false;
+			}
+
 			buf_state = old_buf_state;
 
 			/* increase refcount */
-- 
2.40.1

0002-Use-ReadRecentBuffer-for-btree-root-page.patch (text/x-patch)
From fdaa0873a37545cf42a90b9ede562bcbc2d72947 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 29 Jun 2023 12:27:52 +1200
Subject: [PATCH 2/2] Use ReadRecentBuffer() for btree root page.

The root page of a btree is accessed on every index scan, so it gets
very hot.  We can measure a speed-up on many workloads by pinning it
with ReadRecentBuffer() instead of ReadBuffer(), after remembering where
it was last time in the AM-private cache space in rel->rd_amcache.

Rearrange the existing use of rd_amcache into a new struct
BTAMCacheData.  It's likely that we'll find more things to put in there
in future work.

Discussion: https://postgr.es/m/20230627020546.t6z4tntmj7wmjrfh%40awork3.anarazel.de
---
 src/backend/access/nbtree/nbtpage.c | 93 +++++++++++++++++++++--------
 src/include/access/nbtree.h         | 10 ++++
 src/tools/pgindent/typedefs.list    |  1 +
 3 files changed, 80 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index d78971bfe8..bf270874d2 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -311,6 +311,29 @@ _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
 	_bt_relbuf(rel, metabuf);
 }
 
+/*
+ * Get our private per-relation cache area.
+ */
+static inline BTAMCacheData *
+_bt_getcache(Relation rel)
+{
+	BTAMCacheData *amcache;
+
+	if (unlikely(rel->rd_amcache == NULL))
+	{
+		/* Set up cache on first time through. */
+		amcache = (BTAMCacheData *)
+			MemoryContextAlloc(rel->rd_indexcxt, sizeof(*amcache));
+		amcache->meta_page_is_valid = false;
+		amcache->recent_root_buffer = InvalidBuffer;
+		rel->rd_amcache = amcache;
+	}
+	else
+		amcache = (BTAMCacheData *) rel->rd_amcache;
+
+	return amcache;
+}
+
 /*
  *	_bt_getroot() -- Get the root page of the btree.
  *
@@ -350,17 +373,21 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
 	BlockNumber rootblkno;
 	uint32		rootlevel;
 	BTMetaPageData *metad;
+	BTAMCacheData *amcache;
 
 	Assert(access == BT_READ || heaprel != NULL);
 
+	amcache = _bt_getcache(rel);
+
 	/*
 	 * Try to use previously-cached metapage data to find the root.  This
 	 * normally saves one buffer access per index search, which is a very
 	 * helpful savings in bufmgr traffic and hence contention.
 	 */
-	if (rel->rd_amcache != NULL)
+	if (amcache->meta_page_is_valid)
 	{
-		metad = (BTMetaPageData *) rel->rd_amcache;
+		metad = &amcache->meta_page;
+
 		/* We shouldn't have cached it if any of these fail */
 		Assert(metad->btm_magic == BTREE_MAGIC);
 		Assert(metad->btm_version >= BTREE_MIN_VERSION);
@@ -373,7 +400,25 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
 		Assert(rootblkno != P_NONE);
 		rootlevel = metad->btm_fastlevel;
 
-		rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
+		/* Try to find the root page in the buffer it was last seen in. */
+		if (BufferIsValid(amcache->recent_root_buffer) &&
+			ReadRecentBuffer(rel->rd_locator, MAIN_FORKNUM, rootblkno,
+							 amcache->recent_root_buffer))
+		{
+			/*
+			 * It's in the same buffer as last time, and we avoided a trip
+			 * through the buffer map.
+			 */
+			rootbuf = amcache->recent_root_buffer;
+			_bt_lockbuf(rel, rootbuf, BT_READ);
+			_bt_checkpage(rel, rootbuf);
+		}
+		else
+		{
+			/* Slow path.  Remember where it is for next time. */
+			rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
+			amcache->recent_root_buffer = rootbuf;
+		}
 		rootpage = BufferGetPage(rootbuf);
 		rootopaque = BTPageGetOpaque(rootpage);
 
@@ -393,10 +438,8 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
 			return rootbuf;
 		}
 		_bt_relbuf(rel, rootbuf);
-		/* Cache is stale, throw it away */
-		if (rel->rd_amcache)
-			pfree(rel->rd_amcache);
-		rel->rd_amcache = NULL;
+		/* Cache is stale, mark it invalid. */
+		amcache->meta_page_is_valid = false;
 	}
 
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
@@ -523,9 +566,8 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
 		/*
 		 * Cache the metapage data for next time
 		 */
-		rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
-											 sizeof(BTMetaPageData));
-		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		amcache->meta_page = *metad;
+		amcache->meta_page_is_valid = true;
 
 		/*
 		 * We are done with the metapage; arrange to release it via first
@@ -588,16 +630,16 @@ _bt_gettrueroot(Relation rel)
 	BlockNumber rootblkno;
 	uint32		rootlevel;
 	BTMetaPageData *metad;
+	BTAMCacheData *amcache;
 
 	/*
 	 * We don't try to use cached metapage data here, since (a) this path is
 	 * not performance-critical, and (b) if we are here it suggests our cache
 	 * is out-of-date anyway.  In light of point (b), it's probably safest to
-	 * actively flush any cached metapage info.
+	 * actively invalidate any cached metapage info.
 	 */
-	if (rel->rd_amcache)
-		pfree(rel->rd_amcache);
-	rel->rd_amcache = NULL;
+	amcache = _bt_getcache(rel);
+	amcache->meta_page_is_valid = false;
 
 	metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
 	metapg = BufferGetPage(metabuf);
@@ -674,9 +716,12 @@ _bt_gettrueroot(Relation rel)
 int
 _bt_getrootheight(Relation rel)
 {
+	BTAMCacheData *amcache;
 	BTMetaPageData *metad;
 
-	if (rel->rd_amcache == NULL)
+	amcache = _bt_getcache(rel);
+
+	if (!amcache->meta_page_is_valid)
 	{
 		Buffer		metabuf;
 
@@ -697,14 +742,13 @@ _bt_getrootheight(Relation rel)
 		/*
 		 * Cache the metapage data for next time
 		 */
-		rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
-											 sizeof(BTMetaPageData));
-		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		amcache->meta_page = *metad;
+		amcache->meta_page_is_valid = true;
 		_bt_relbuf(rel, metabuf);
 	}
 
 	/* Get cached page */
-	metad = (BTMetaPageData *) rel->rd_amcache;
+	metad = &amcache->meta_page;
 	/* We shouldn't have cached it if any of these fail */
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
@@ -738,9 +782,11 @@ _bt_getrootheight(Relation rel)
 void
 _bt_metaversion(Relation rel, bool *heapkeyspace, bool *allequalimage)
 {
+	BTAMCacheData *amcache;
 	BTMetaPageData *metad;
 
-	if (rel->rd_amcache == NULL)
+	amcache = _bt_getcache(rel);
+	if (!amcache->meta_page_is_valid)
 	{
 		Buffer		metabuf;
 
@@ -770,14 +816,13 @@ _bt_metaversion(Relation rel, bool *heapkeyspace, bool *allequalimage)
 		 * from version 2 to version 3, both of which are !heapkeyspace
 		 * versions.
 		 */
-		rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
-											 sizeof(BTMetaPageData));
-		memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
+		amcache->meta_page = *metad;
+		amcache->meta_page_is_valid = true;
 		_bt_relbuf(rel, metabuf);
 	}
 
 	/* Get cached page */
-	metad = (BTMetaPageData *) rel->rd_amcache;
+	metad = &amcache->meta_page;
 	/* We shouldn't have cached it if any of these fail */
 	Assert(metad->btm_magic == BTREE_MAGIC);
 	Assert(metad->btm_version >= BTREE_MIN_VERSION);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 8891fa7973..85cab606a3 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -151,6 +151,16 @@ typedef struct BTMetaPageData
 #define BTREE_MIN_VERSION	2	/* minimum supported version */
 #define BTREE_NOVAC_VERSION	3	/* version with all meta fields set */
 
+/*
+ * Cache space, stored in rel->rd_amcache.
+ */
+typedef struct BTAMCacheData
+{
+	BTMetaPageData meta_page;
+	bool		meta_page_is_valid;
+	Buffer		recent_root_buffer;
+} BTAMCacheData;
+
 /*
  * Maximum size of a btree index entry, including its tuple header.
  *
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 260854747b..b75d9a5cb2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -187,6 +187,7 @@ BOOL
 BOOLEAN
 BOX
 BTArrayKeyInfo
+BTAMCacheData
 BTBuildState
 BTCycleId
 BTDedupInterval
-- 
2.40.1

#18Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#17)
Re: ReadRecentBuffer() doesn't scale well

Hi,

On 2023-06-29 19:35:30 +1200, Thomas Munro wrote:

I (re)discovered why I used the lock-then-pin approach. In the
comments I mentioned InvalidateBuffer(), but the main problem is in its
caller GetVictimBuffer(), which has various sanity checks about
reference counts that can occasionally fail if you have code randomly
pinning any old buffer.

You're right. Specifically non-valid buffers are the issue.

New idea: use the standard PinBuffer() function, but add a mode that
doesn't pin invalid buffers (with caveat that you can perhaps get a
false negative due to unlocked read, but never a false positive; see
commit message). Otherwise we'd have to duplicate all the same logic
to use cmpxchg for ReadRecentBuffer(), or rethink the assumptions in
that other code.

It might be worth using lock-free code in more places before long, but I agree
with the solution here.

As for the lack of usage bump in the back-branches, I think the
options are: teach PinBuffer_Locked() to increment it optionally, or
back-patch whatever we come up with for this.

Hm, or just leave it as is.

For the root buffer optimisation, the obvious place for storage seems
to be under rd_amcache. It was originally invented for the cached
metapage (commit d2896a9ed14) but could accommodate a new struct
holding whatever we want. Here is a patch to try that out.
BTAMCacheData would also be a natural place to put future things
including a copy of the root page itself, in later work on lock-free
tricks.

I am wondering if we don't want something more generic than stashing this in
rd_amcache. Don't want to end up duplicating relevant code across the uses of
rd_amcache in every AM.

@@ -663,38 +663,17 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
else
{
bufHdr = GetBufferDescriptor(recent_buffer - 1);
-		have_private_ref = GetPrivateRefCount(recent_buffer) > 0;
-		/*
-		 * Do we already have this buffer pinned with a private reference?  If
-		 * so, it must be valid and it is safe to check the tag without
-		 * locking.  If not, we have to lock the header first and then check.
-		 */
-		if (have_private_ref)
-			buf_state = pg_atomic_read_u32(&bufHdr->state);
-		else
-			buf_state = LockBufHdr(bufHdr);
-
-		if ((buf_state & BM_VALID) && BufferTagsEqual(&tag, &bufHdr->tag))
+		/* Is it still valid and holding the right tag? */
+		if (PinBuffer(bufHdr, NULL, true))

I do wonder if we should have an unlocked pre-check for a) the buffer being
valid and b) BufferTagsEqual() matching. With such a pre-check the race for
increasing the usage count of the wrong buffer is quite small; without the
pre-check it doesn't seem that small anymore.

Greetings,

Andres Freund

#19Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#18)
2 attachment(s)
Re: ReadRecentBuffer() doesn't scale well

On Fri, Jun 30, 2023 at 3:39 AM Andres Freund <andres@anarazel.de> wrote:

I am wondering if we don't want something more generic than stashing this in
rd_amcache. Don't want to end up duplicating relevant code across the uses of
rd_amcache in every AM.

I suppose we could try to track hot pages automatically. Ants Aasma
mentioned that he was working on a tiny SIMD-based LRU that could be
useful for something like that, without being too slow. Just for
fun/experimentation, here's a simple attempt to use a very stupid
stand-in LRU to cache the most recent 8 lookups for each fork of each
relation. Obviously that will miss every time for many workloads so
it'd have to be pretty close to free and this code probably isn't good
enough, but perhaps Ants knows how to sprinkle the right magic fairy
dust on it. It should automatically discover things like root pages,
the heap target block during repeat inserts etc, and visibility map
pages would stick around, and perhaps a few more pages of btrees that
have a very hot key range (but not pgbench).

I do wonder if we should have an unlocked pre-check for a) the buffer being
valid and b) BufferTagsEqual() matching. With such a pre-check the race for
increasing the usage count of the wrong buffer is quite small; without the
pre-check it doesn't seem that small anymore.

Yeah, that makes sense. Done in this version.

Attachments:

v2-0001-Improve-ReadRecentBuffer-scalability.patch (text/x-patch)
From d5b9f345961e2adb31213c645e40037f15ba6a83 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 29 Jun 2023 10:52:56 +1200
Subject: [PATCH v2 1/2] Improve ReadRecentBuffer() scalability.

While testing a new potential use for ReadRecentBuffer(), Andres
reported that it scales badly when called concurrently for the same
buffer by many backends.  Instead of a naive (but wrong) coding with
PinBuffer(), it used the spinlock, so that it could be careful to pin
only if the buffer was valid and holding the expected block, to avoid
breaking invariants in eg GetVictimBuffer().  Unfortunately that made it
less scalable than PinBuffer(), which uses compare-exchange instead.

We can fix that by giving PinBuffer() a new skip_if_not_valid mode that
doesn't pin invalid buffers.  It might occasionally skip when it
shouldn't due to the unlocked read of the header flags, but that's
unlikely and perfectly acceptable for an opportunistic optimisation
routine, and it can only succeed when it really should due to the
compare-exchange loop.

XXX This also fixes ReadRecentBuffer()'s failure to bump the usage
count.  Fix separately or back-patch this?

Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20230627020546.t6z4tntmj7wmjrfh%40awork3.anarazel.de
---
 src/backend/storage/buffer/bufmgr.c | 63 +++++++++++++----------------
 1 file changed, 29 insertions(+), 34 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3c59bbd04e..f7a8e0d576 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -467,7 +467,8 @@ static BlockNumber ExtendBufferedRelShared(ExtendBufferedWhat eb,
 										   BlockNumber extend_upto,
 										   Buffer *buffers,
 										   uint32 *extended_by);
-static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
+static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
+					  bool skip_if_not_valid);
 static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf);
 static void BufferSync(int flags);
@@ -635,7 +636,6 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
 	BufferDesc *bufHdr;
 	BufferTag	tag;
 	uint32		buf_state;
-	bool		have_private_ref;
 
 	Assert(BufferIsValid(recent_buffer));
 
@@ -663,38 +663,24 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
 	else
 	{
 		bufHdr = GetBufferDescriptor(recent_buffer - 1);
-		have_private_ref = GetPrivateRefCount(recent_buffer) > 0;
 
 		/*
-		 * Do we already have this buffer pinned with a private reference?  If
-		 * so, it must be valid and it is safe to check the tag without
-		 * locking.  If not, we have to lock the header first and then check.
+		 * Is it still valid and holding the right tag?  We do an unlocked
+		 * tag comparison first, to make it unlikely that we'll increment the
+		 * usage counter of the wrong buffer, if someone calls us with a very
+		 * out of date recent_buffer.  Then we'll check it again if we get the
+		 * pin.
 		 */
-		if (have_private_ref)
-			buf_state = pg_atomic_read_u32(&bufHdr->state);
-		else
-			buf_state = LockBufHdr(bufHdr);
-
-		if ((buf_state & BM_VALID) && BufferTagsEqual(&tag, &bufHdr->tag))
+		if (BufferTagsEqual(&tag, &bufHdr->tag) &&
+			PinBuffer(bufHdr, NULL, true))
 		{
-			/*
-			 * It's now safe to pin the buffer.  We can't pin first and ask
-			 * questions later, because it might confuse code paths like
-			 * InvalidateBuffer() if we pinned a random non-matching buffer.
-			 */
-			if (have_private_ref)
-				PinBuffer(bufHdr, NULL);	/* bump pin count */
-			else
-				PinBuffer_Locked(bufHdr);	/* pin for first time */
-
-			pgBufferUsage.shared_blks_hit++;
-
-			return true;
+			if (BufferTagsEqual(&tag, &bufHdr->tag))
+			{
+				pgBufferUsage.shared_blks_hit++;
+				return true;
+			}
+			UnpinBuffer(bufHdr);
 		}
-
-		/* If we locked the header above, now unlock. */
-		if (!have_private_ref)
-			UnlockBufHdr(bufHdr, buf_state);
 	}
 
 	return false;
@@ -1252,7 +1238,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 */
 		buf = GetBufferDescriptor(existing_buf_id);
 
-		valid = PinBuffer(buf, strategy);
+		valid = PinBuffer(buf, strategy, false);
 
 		/* Can release the mapping lock as soon as we've pinned it */
 		LWLockRelease(newPartitionLock);
@@ -1330,7 +1316,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
 
-		valid = PinBuffer(existing_buf_hdr, strategy);
+		valid = PinBuffer(existing_buf_hdr, strategy, false);
 
 		/* Can release the mapping lock as soon as we've pinned it */
 		LWLockRelease(newPartitionLock);
@@ -1979,7 +1965,7 @@ ExtendBufferedRelShared(ExtendBufferedWhat eb,
 			 * Pin the existing buffer before releasing the partition lock,
 			 * preventing it from being evicted.
 			 */
-			valid = PinBuffer(existing_hdr, strategy);
+			valid = PinBuffer(existing_hdr, strategy, false);
 
 			LWLockRelease(partition_lock);
 
@@ -2225,10 +2211,13 @@ ReleaseAndReadBuffer(Buffer buffer,
  * Note that ResourceOwnerEnlargeBuffers must have been done already.
  *
  * Returns true if buffer is BM_VALID, else false.  This provision allows
- * some callers to avoid an extra spinlock cycle.
+ * some callers to avoid an extra spinlock cycle.  If skip_if_not_valid is
+ * true, then a false return value also indicates that the buffer was
+ * (recently) invalid and has not been pinned.
  */
 static bool
-PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
+PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
+		  bool skip_if_not_valid)
 {
 	Buffer		b = BufferDescriptorGetBuffer(buf);
 	bool		result;
@@ -2252,6 +2241,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 			if (old_buf_state & BM_LOCKED)
 				old_buf_state = WaitBufHdrUnlocked(buf);
 
+			if (unlikely(skip_if_not_valid && !(old_buf_state & BM_VALID)))
+			{
+				ForgetPrivateRefCountEntry(ref);
+				return false;
+			}
+
 			buf_state = old_buf_state;
 
 			/* increase refcount */
-- 
2.40.1

Attachment: v2-0002-Introduce-a-tiny-LRU-cache-for-buffer-mapping.patch (text/x-patch)
From 3cf3729c181aeb5b2770940200ec43656977ac97 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 30 Jun 2023 12:25:26 +1200
Subject: [PATCH v2 2/2] Introduce a tiny LRU cache for buffer mapping.

Remember which buffer held the last N blocks we accessed for each fork
of each relation, so we can try to skip the shared memory buffer mapping
table (and more importantly its partition locks).

XXX This is a very dumb and slow way of coding an LRU mapping table to
show the concept, with the theory that a clever and fast coding might be
possible?
---
 src/backend/storage/buffer/bufmgr.c | 102 ++++++++++++++++++++++++++++
 src/backend/storage/smgr/smgr.c     |   4 ++
 src/include/storage/smgr.h          |  13 ++++
 3 files changed, 119 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f7a8e0d576..34114f6188 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -969,6 +969,87 @@ ExtendBufferedRelTo(ExtendBufferedWhat eb,
 	return buffer;
 }
 
+/*
+ * Try to find a buffer using our tiny LRU cache, to avoid a trip through the
+ * buffer mapping table.  Only for non-local RBM_NORMAL reads.
+ */
+static Buffer
+ReadBuffer_try_recent(SMgrRelation smgr, ForkNumber forkNum,
+					  BlockNumber blockNum)
+{
+	SMgrBufferLruEntry *buffer_lru = &smgr->recent_buffer_lru[forkNum][0];
+
+	Assert(BlockNumberIsValid(blockNum));
+	for (int i = 0; i < SMGR_BUFFER_LRU_SIZE; ++i)
+	{
+		if (buffer_lru[i].block == blockNum)
+		{
+			SMgrBufferLruEntry found = buffer_lru[i];
+
+			if (ReadRecentBuffer(smgr->smgr_rlocator.locator,
+								 forkNum,
+								 blockNum,
+								 found.buffer))
+			{
+				/* Move to front. */
+				if (i > 0)
+				{
+					memmove(&buffer_lru[1],
+							&buffer_lru[0],
+							sizeof(buffer_lru[0]) * i);
+					buffer_lru[0] = found;
+				}
+				return found.buffer;
+			}
+
+			/* Kill this entry and give up. */
+			if (i < SMGR_BUFFER_LRU_SIZE - 1)
+				memmove(&buffer_lru[i],
+						&buffer_lru[i + 1],
+						sizeof(buffer_lru[0]) * (SMGR_BUFFER_LRU_SIZE - i - 1));
+			buffer_lru[SMGR_BUFFER_LRU_SIZE - 1].block = InvalidBlockNumber;
+			break;
+		}
+	}
+
+	return InvalidBuffer;
+}
+
+/*
+ * Remember which buffer a block is in, for later lookups by
+ * ReadBuffer_try_recent().
+ */
+static void
+RememberRecentBuffer(SMgrRelation smgr, ForkNumber forkNum,
+					 BlockNumber blockNum, Buffer buffer)
+{
+	SMgrBufferLruEntry *buffer_lru = &smgr->recent_buffer_lru[forkNum][0];
+
+	/* If it's already there, move to front and update. */
+	for (int i = 0; i < SMGR_BUFFER_LRU_SIZE; ++i)
+	{
+		if (buffer_lru[i].block == blockNum)
+		{
+			if (i > 0)
+			{
+				memmove(&buffer_lru[1],
+						&buffer_lru[0],
+						sizeof(buffer_lru[0]) * i);
+				buffer_lru[0].block = blockNum;
+			}
+			buffer_lru[0].buffer = buffer;
+			return;
+		}
+	}
+
+	/* Otherwise insert at front. */
+	memmove(&buffer_lru[1],
+			&buffer_lru[0],
+			sizeof(buffer_lru[0]) * (SMGR_BUFFER_LRU_SIZE - 1));
+	buffer_lru[0].block = blockNum;
+	buffer_lru[0].buffer = buffer;
+}
+
 /*
  * ReadBuffer_common -- common logic for all ReadBuffer variants
  *
@@ -985,6 +1066,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	IOContext	io_context;
 	IOObject	io_object;
 	bool		isLocalBuf = SmgrIsTemp(smgr);
+	Buffer		recent_buffer;
 
 	*hit = false;
 
@@ -1035,6 +1117,20 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 				 mode == RBM_ZERO_ON_ERROR)
 			pgBufferUsage.local_blks_read++;
 	}
+	else if (mode == RBM_NORMAL &&
+			 BufferIsValid((recent_buffer = ReadBuffer_try_recent(smgr,
+																  forkNum,
+																  blockNum))))
+	{
+		/*
+		 * Pinned buffer without having to look it up in the shared buffer
+		 * mapping table.
+		 */
+		found = true;
+		bufHdr = GetBufferDescriptor(recent_buffer - 1);
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_RELATION;
+	}
 	else
 	{
 		/*
@@ -1050,6 +1146,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
 				 mode == RBM_ZERO_ON_ERROR)
 			pgBufferUsage.shared_blks_read++;
+
+		RememberRecentBuffer(smgr, forkNum, blockNum,
+							 BufferDescriptorGetBuffer(bufHdr));
 	}
 
 	/* At this point we do NOT hold any locks. */
@@ -1163,6 +1262,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	{
 		/* Set BM_VALID, terminate IO, and wake up any waiters */
 		TerminateBufferIO(bufHdr, false, BM_VALID);
+
+		RememberRecentBuffer(smgr, forkNum, blockNum,
+							 BufferDescriptorGetBuffer(bufHdr));
 	}
 
 	VacuumPageMiss++;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f76c4605db..cd04647a54 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -179,7 +179,11 @@ smgropen(RelFileLocator rlocator, BackendId backend)
 		reln->smgr_owner = NULL;
 		reln->smgr_targblock = InvalidBlockNumber;
 		for (int i = 0; i <= MAX_FORKNUM; ++i)
+		{
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
+			for (int j = 0; j < SMGR_BUFFER_LRU_SIZE; ++j)
+				reln->recent_buffer_lru[i][j].block = InvalidBlockNumber;
+		}
 		reln->smgr_which = 0;	/* we only have md.c at present */
 
 		/* implementation-specific initialization */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..6d4b0fc0b4 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -16,8 +16,18 @@
 
 #include "lib/ilist.h"
 #include "storage/block.h"
+#include "storage/buf.h"
 #include "storage/relfilelocator.h"
 
+/* For each fork we'll use one (typical) cache line of recent memory. */
+#define SMGR_BUFFER_LRU_SIZE 8
+
+typedef struct SMgrBufferLruEntry
+{
+	BlockNumber	block;
+	Buffer		buffer;
+} SMgrBufferLruEntry;
+
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
  * cached file handles.  An SMgrRelation is created (if not already present)
@@ -61,6 +71,9 @@ typedef struct SMgrRelationData
 	 */
 	int			smgr_which;		/* storage manager selector */
 
+	/* for bufmgr.c; cached recently accessed buffers */
+	SMgrBufferLruEntry recent_buffer_lru[MAX_FORKNUM + 1][SMGR_BUFFER_LRU_SIZE];
+
 	/*
 	 * for md.c; per-fork arrays of the number of open segments
 	 * (md_num_open_segs) and the segments themselves (md_seg_fds).
-- 
2.40.1

#20Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#19)
Re: ReadRecentBuffer() doesn't scale well

Hi,

On 2023-06-30 14:13:11 +1200, Thomas Munro wrote:

On Fri, Jun 30, 2023 at 3:39 AM Andres Freund <andres@anarazel.de> wrote:

I am wondering if we don't want something more generic than stashing this in
rd_amcache. Don't want to end up duplicating relevant code across the uses of
rd_amcache in every AM.

I suppose we could try to track hot pages automatically.

I think that could be useful - but as a separate facility. The benefit of
stashing the root page buffer in the relcache is that it's practically free of
overhead and doesn't have complications from how many other intervening
accesses there are, etc.

I was more thinking of just making the relevant fields part of RelationData
and delegating the precise use to the individual AMs.
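
For illustration, this is roughly what an AM-side caller could look like if it kept a recent buffer id in a place of its own choosing (rd_amcache, a new RelationData field, or similar). It is a hedged sketch using the existing ReadRecentBuffer()/ReadBuffer() APIs; the caller-supplied "recent" slot stands in for whatever the AM actually stashes, and invalidation and error handling are omitted.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Read a block, trying the remembered buffer first to skip the shared
 * buffer mapping table.  *recent points at wherever the AM keeps its
 * cached buffer id; the returned buffer is pinned and must be released
 * by the caller as usual.
 */
static Buffer
example_read_block(Relation rel, BlockNumber blkno, Buffer *recent)
{
	if (BufferIsValid(*recent) &&
		ReadRecentBuffer(rel->rd_locator, MAIN_FORKNUM, blkno, *recent))
		return *recent;			/* pinned without a mapping table lookup */

	/* Fall back to a normal read and remember the buffer for next time. */
	*recent = ReadBuffer(rel, blkno);
	return *recent;
}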

I do wonder if we should have an unlocked pre-check for a) the buffer being
valid and b) BufferTagsEqual() matching. With such a pre-check the race for
increasing the usage count of the wrong buffer is quite small; without the
pre-check it doesn't seem that small anymore.

Yeah, that makes sense. Done in this version.

Cool.

Greetings,

Andres Freund

#21vignesh C
vignesh21@gmail.com
In reply to: Thomas Munro (#19)
Re: ReadRecentBuffer() doesn't scale well

On Fri, 30 Jun 2023 at 07:43, Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Jun 30, 2023 at 3:39 AM Andres Freund <andres@anarazel.de> wrote:

I am wondering if we don't want something more generic than stashing this in
rd_amcache. Don't want to end up duplicating relevant code across the uses of
rd_amcache in every AM.

I suppose we could try to track hot pages automatically. Ants Aasma
mentioned that he was working on a tiny SIMD-based LRU that could be
useful for something like that, without being too slow. Just for
fun/experimentation, here's a simple attempt to use a very stupid
stand-in LRU to cache the most recent 8 lookups for each fork of each
relation. Obviously that will miss every time for many workloads, so
it'd have to be pretty close to free and this code probably isn't good
enough, but perhaps Ants knows how to sprinkle the right magic fairy
dust on it. It should automatically discover things like root pages,
the heap target block during repeated inserts, etc., and visibility map
pages would stick around, and perhaps a few more pages of btrees that
have a very hot key range (but not pgbench).
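
As an aside, here is a rough, purely hypothetical sketch of what such a SIMD probe over eight cached block numbers could look like, assuming SSE2 intrinsics and a GCC/Clang builtin. The TinyLru type and function are made up for illustration; this is neither Ants' code nor part of the attached patch, and the insert/move-to-front side is omitted.

/*
 * Hypothetical sketch of a SIMD lookup over 8 cached block numbers.
 * Two SSE2 compares test all eight 32-bit entries against the probe key,
 * and the resulting byte mask is turned into a slot index.
 */
#include <emmintrin.h>
#include <stdint.h>

typedef struct TinyLru
{
	uint32_t	blocks[8];		/* recently seen block numbers (one cache line) */
	int32_t		buffers[8];		/* corresponding buffer ids */
} TinyLru;

/* Return the slot holding blockNum, or -1 if it is not cached. */
static int
tiny_lru_probe(const TinyLru *lru, uint32_t blockNum)
{
	__m128i		key = _mm_set1_epi32((int) blockNum);
	__m128i		lo = _mm_loadu_si128((const __m128i *) &lru->blocks[0]);
	__m128i		hi = _mm_loadu_si128((const __m128i *) &lru->blocks[4]);
	unsigned	mask;

	/* Each matching 32-bit lane contributes four set bits to the mask. */
	mask = (unsigned) _mm_movemask_epi8(_mm_cmpeq_epi32(lo, key)) |
		((unsigned) _mm_movemask_epi8(_mm_cmpeq_epi32(hi, key)) << 16);

	if (mask == 0)
		return -1;

	/* Four mask bits per lane, so the lowest set bit / 4 is the slot. */
	return (int) (__builtin_ctz(mask) / 4);
}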

I do wonder if we should have an unlocked pre-check for a) the buffer being
valid and b) BufferTagsEqual() matching. With such a pre-check the race for
increasing the usage count of the wrong buffer is quite small; without the
pre-check it doesn't seem that small anymore.

Yeah, that makes sense. Done in this version.

I have changed the status of the commitfest entry to "Waiting on Author",
as Andres's comments were not discussed further. Feel free to address the
comments and update the status of the commitfest entry accordingly.

Regards,
Vignesh

#22Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#19)
Re: ReadRecentBuffer() doesn't scale well

Hi,

On 2023-06-30 14:13:11 +1200, Thomas Munro wrote:

I do wonder if we should have an unlocked pre-check for a) the buffer being
valid and b) BufferTagsEqual() matching. With such a pre-check the race for
increasing the usage count of the wrong buffer is quite small; without the
pre-check it doesn't seem that small anymore.

Yeah, that makes sense. Done in this version.

I'm planning to commit 0001 soon, unless you'd like to do the honors - I would
break it with some upcoming patches, and it's a good improvement. Those
patches also will PinBuffer_Locked() a bit slower, i.e. it'd be good to avoid
using it in ReadRecentBuffer() for that reason alone.

From d5b9f345961e2adb31213c645e40037f15ba6a83 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 29 Jun 2023 10:52:56 +1200
Subject: [PATCH v2 1/2] Improve ReadRecentBuffer() scalability.

While testing a new potential use for ReadRecentBuffer(), Andres
reported that it scales badly when called concurrently for the same
buffer by many backends. Instead of a naive (but wrong) coding with
PinBuffer(), it used the spinlock, so that it could be careful to pin
only if the buffer was valid and holding the expected block, to avoid
breaking invariants in eg GetVictimBuffer(). Unfortunately that made it
less scalable than PinBuffer(), which uses compare-exchange instead.

We can fix that by giving PinBuffer() a new skip_if_not_valid mode that
doesn't pin invalid buffers. It might occasionally skip when it
shouldn't due to the unlocked read of the header flags, but that's
unlikely and perfectly acceptable for an opportunistic optimisation
routine, and it can only succeed when it really should due to the
compare-exchange loop.
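
To make the skip_if_not_valid behaviour concrete, here is a minimal stand-alone sketch of the pattern: a compare-exchange pin loop that bumps a refcount only while the validity bit is set. It uses C11 atomics and a simplified one-word buffer state with stand-in flags, not the real bufmgr.c definitions, and where the real code waits for the header lock bit to clear this sketch simply gives up; treat it as a model of the idea, not the committed change.

/*
 * Simplified model of an opportunistic "pin only if valid" operation.  The
 * whole buffer state is a single 32-bit word: the low bits hold a refcount,
 * two high bits stand in for BM_VALID and BM_LOCKED.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_VALID		(1u << 31)	/* stand-in for BM_VALID */
#define MODEL_LOCKED	(1u << 30)	/* stand-in for BM_LOCKED */

typedef struct ModelBufferDesc
{
	_Atomic uint32_t state;
} ModelBufferDesc;

/*
 * Increment the refcount only if the buffer is valid.  Returns false, with
 * no side effects, if the buffer is invalid or its header is locked (the
 * real code would instead wait for the lock bit to clear and retry).
 */
static bool
model_pin_if_valid(ModelBufferDesc *buf)
{
	uint32_t	old_state = atomic_load(&buf->state);

	for (;;)
	{
		if (old_state & MODEL_LOCKED)
			return false;
		if (!(old_state & MODEL_VALID))
			return false;		/* the skip_if_not_valid case */

		/* Refcount lives in the low bits; +1 takes one pin. */
		if (atomic_compare_exchange_weak(&buf->state, &old_state,
										 old_state + 1))
			return true;		/* buffer was valid at the instant we pinned */

		/* CAS failed: old_state now holds the current value; re-check. */
	}
}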

XXX This also fixes ReadRecentBuffer()'s failure to bump the usage
count. Fix separately or back-patch this?

FWIW, I'm inclined to not backpatch the usagecount change at this
point. Unless we have a clear case where it really hurts, I'm more worried
about disturbing working workloads...

Greetings,

Andres Freund

#23Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#22)
Re: ReadRecentBuffer() doesn't scale well

On Fri, Sep 19, 2025 at 12:36 AM Andres Freund <andres@anarazel.de> wrote:

I'm planning to commit 0001 soon, unless you'd like to do the honors - I would
break it with some upcoming patches, and it's a good improvement. Those
patches also will PinBuffer_Locked() a bit slower, i.e. it'd be good to avoid
using it in ReadRecentBuffer() for that reason alone.

Oh, thanks for thinking about that interaction. I'll go ahead and
push it later today after I re-convince myself that it's correct.

XXX This also fixes ReadRecentBuffer()'s failure to bump the usage
count. Fix separately or back-patch this?

FWIW, I'm inclined to not backpatch the usagecount change at this
point. Unless we have a clear case where it really hurts, I'm more worried
about disturbing working workloads...

+1

#24Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#23)
Re: ReadRecentBuffer() doesn't scale well

On Fri, Sep 19, 2025 at 11:44 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Sep 19, 2025 at 12:36 AM Andres Freund <andres@anarazel.de> wrote:

I'm planning to commit 0001 soon, unless you'd like to do the honors - I would
break it with some upcoming patches, and it's a good improvement. Those
patches also will PinBuffer_Locked() a bit slower, i.e. it'd be good to avoid
using it in ReadRecentBuffer() for that reason alone.

Oh, thanks for thinking about that interaction. I'll go ahead and
push it later today after I re-convince myself that it's correct.

Sorry I haven't got to this yet. Please feel free to go ahead and
push it if it's blocking you...

#25Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#24)
Re: ReadRecentBuffer() doesn't scale well

Hi,

On 2025-10-08 13:39:14 +1300, Thomas Munro wrote:

On Fri, Sep 19, 2025 at 11:44 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Sep 19, 2025 at 12:36 AM Andres Freund <andres@anarazel.de> wrote:

I'm planning to commit 0001 soon, unless you'd like to do the honors - I would
break it with some upcoming patches, and it's a good improvement. Those
patches also will PinBuffer_Locked() a bit slower, i.e. it'd be good to avoid
using it in ReadRecentBuffer() for that reason alone.

Oh, thanks for thinking about that interaction. I'll go ahead and
push it later today after I re-convince myself that it's correct.

Sorry I haven't got to this yet. Please feel free to go ahead and
push it if it's blocking you...

Done after two small changes:

- NewPrivateRefCountEntry() has to be deferred until after the added return
false in PinBuffer(); otherwise a failed ReadRecentBuffer() would leave a
refcount entry around, triggering a CheckBufferLeaks() failure (see the
sketch below).

Found via test_aio. Interestingly I had to do the same change for unrelated
reasons in one of the other patches that I am working on, which hid this
issue for a while...

- Moved the return false in PinBuffer() to before the WaitBufHdrUnlocked()
call; it seems a bit silly to wait for the lock and then return.
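
Here is the toy illustration of the first point referenced above. The ToyBufferDesc/ToyLocalRef machinery is a made-up stand-in for the backend-local pin bookkeeping, not the real NewPrivateRefCountEntry()/CheckBufferLeaks() code; the only thing it demonstrates is the ordering, namely that the skip path returns before any local refcount entry exists, so a failed opportunistic pin leaves nothing behind for the leak check to find.

/*
 * Toy illustration of the ordering constraint: create the backend-local
 * refcount entry only once we are committed to taking the pin, so the
 * "not valid, skip" path leaves no bookkeeping behind.  All types and
 * helpers here are invented; they are not the bufmgr.c routines.
 */
#include <stdbool.h>
#include <stdint.h>

#define TOY_VALID	(1u << 31)	/* stand-in for BM_VALID */

typedef struct ToyBufferDesc
{
	uint32_t	state;			/* refcount in low bits, flags in high bits */
} ToyBufferDesc;

typedef struct ToyLocalRef
{
	ToyBufferDesc *buf;
	int			refcount;
} ToyLocalRef;

static ToyLocalRef toy_ref;		/* stand-in for the private refcount table */

/* Stand-in for NewPrivateRefCountEntry(). */
static ToyLocalRef *
toy_new_local_ref(ToyBufferDesc *buf)
{
	toy_ref.buf = buf;
	toy_ref.refcount = 0;
	return &toy_ref;
}

static bool
toy_pin_if_valid(ToyBufferDesc *buf)
{
	ToyLocalRef *ref;

	/*
	 * Bail out before creating any local bookkeeping: on this path the
	 * caller never gets a pin, so there must be nothing for a leak check
	 * at backend exit to find.
	 */
	if (!(buf->state & TOY_VALID))
		return false;

	/* Only now, committed to pinning, create the local entry. */
	ref = toy_new_local_ref(buf);
	buf->state++;				/* take the shared pin (single-threaded toy) */
	ref->refcount++;			/* and remember it locally */
	return true;
}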

We really should now take another look at your patch to make btree searches
cache the root page buffer; the wins for that were rather large...

Greetings,

Andres Freund

#26Amit Langote
amitlangote09@gmail.com
In reply to: Andres Freund (#25)
Re: ReadRecentBuffer() doesn't scale well

Hi Andres,

On Thu, Oct 9, 2025 at 2:15 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2025-10-08 13:39:14 +1300, Thomas Munro wrote:

On Fri, Sep 19, 2025 at 11:44 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Sep 19, 2025 at 12:36 AM Andres Freund <andres@anarazel.de> wrote:

I'm planning to commit 0001 soon, unless you'd like to do the honors - I would
break it with some upcoming patches, and it's a good improvement. Those
patches also will PinBuffer_Locked() a bit slower, i.e. it'd be good to avoid
using it in ReadRecentBuffer() for that reason alone.

Oh, thanks for thinking about that interaction. I'll go ahead and
push it later today after I re-convince myself that it's correct.

Sorry I haven't got to this yet. Please feel free to go ahead and
push it if it's blocking you...

Done after two small changes:

Perhaps you are on it, but it seems skink is red due to either this or
one of the other bufmgr changes. I noticed it after I pushed a patch
of mine and couldn't see the connection.

postmaster.log suggests a valgrind error.

--
Thanks, Amit Langote

#27Andres Freund
andres@anarazel.de
In reply to: Amit Langote (#26)
Re: ReadRecentBuffer() doesn't scale well

Hi,

On October 8, 2025 10:05:59 PM EDT, Amit Langote <amitlangote09@gmail.com> wrote:

Hi Andres,

On Thu, Oct 9, 2025 at 2:15 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2025-10-08 13:39:14 +1300, Thomas Munro wrote:

On Fri, Sep 19, 2025 at 11:44 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Sep 19, 2025 at 12:36 AM Andres Freund <andres@anarazel.de> wrote:

I'm planning to commit 0001 soon, unless you'd like to do the honors - I would
break it with some upcoming patches, and it's a good improvement. Those
patches also will PinBuffer_Locked() a bit slower, i.e. it'd be good to avoid
using it in ReadRecentBuffer() for that reason alone.

Oh, thanks for thinking about that interaction. I'll go ahead and
push it later today after I re-convince myself that it's correct.

Sorry I haven't got to this yet. Please feel free to go ahead and
push it if it's blocking you...

Done after two small changes:

Perhaps you are on it, but it seems skink is red due to either this or
one of the other bufmgr changes. I noticed it after I pushed a patch
of mine and couldn't see the connection.

postmaster.log suggests a valgrind error.

Yes, I saw it. I'm too tired to figure it out tonight, will look more tomorrow. I think it'll just need to mark the buffer data accessible in one more spot.

Greetings,

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#28Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#27)
Re: ReadRecentBuffer() doesn't scale well

Hi,

On 2025-10-08 22:09:54 -0400, Andres Freund wrote:

On October 8, 2025 10:05:59 PM EDT, Amit Langote <amitlangote09@gmail.com> wrote:

On Thu, Oct 9, 2025 at 2:15 AM Andres Freund <andres@anarazel.de> wrote:

On 2025-10-08 13:39:14 +1300, Thomas Munro wrote:

On Fri, Sep 19, 2025 at 11:44 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Sep 19, 2025 at 12:36 AM Andres Freund <andres@anarazel.de> wrote:

I'm planning to commit 0001 soon, unless you'd like to do the honors - I would
break it with some upcoming patches, and it's a good improvement. Those
patches also will PinBuffer_Locked() a bit slower, i.e. it'd be good to avoid
using it in ReadRecentBuffer() for that reason alone.

Oh, thanks for thinking about that interaction. I'll go ahead and
push it later today after I re-convince myself that it's correct.

Sorry I haven't got to this yet. Please feel free to go ahead and
push it if it's blocking you...

Done after two small changes:

Perhaps you are on it, but it seems skink is red due to either this or
one of the other bufmgr changes. I noticed it after I pushed a patch
of mine and couldn't see the connection.

postmaster.log suggests a valgrind error.

Yes, I saw it. I'm too tired to figure it out tonight, will look more
tomorrow. I think it'll just need to mark the buffer data accessible in one
more spot.

To update this thread as well:
This is fixed as of
https://git.postgresql.org/pg/commitdiff/c819d1017ddb349d92ab323d2631bc1f10bb4e86
The change from this thread wasn't to blame; it was 5e899859287.

Greetings,

Andres Freund

#29Amit Langote
amitlangote09@gmail.com
In reply to: Andres Freund (#28)
Re: ReadRecentBuffer() doesn't scale well

On Sat, Oct 11, 2025 at 1:08 AM Andres Freund <andres@anarazel.de> wrote:

On 2025-10-08 22:09:54 -0400, Andres Freund wrote:

On October 8, 2025 10:05:59 PM EDT, Amit Langote <amitlangote09@gmail.com> wrote:

On Thu, Oct 9, 2025 at 2:15 AM Andres Freund <andres@anarazel.de> wrote:

On 2025-10-08 13:39:14 +1300, Thomas Munro wrote:

On Fri, Sep 19, 2025 at 11:44 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Sep 19, 2025 at 12:36 AM Andres Freund <andres@anarazel.de> wrote:

I'm planning to commit 0001 soon, unless you'd like to do the honors - I would
break it with some upcoming patches, and it's a good improvement. Those
patches also will PinBuffer_Locked() a bit slower, i.e. it'd be good to avoid
using it in ReadRecentBuffer() for that reason alone.

Oh, thanks for thinking about that interaction. I'll go ahead and
push it later today after I re-convince myself that it's correct.

Sorry I haven't got to this yet. Please feel free to go ahead and
push it if it's blocking you...

Done after two small changes:

Perhaps you are on it, but it seems skink is red due to either this or
one of the other bufmgr changes. I noticed it after I pushed a patch
of mine and couldn't see the connection.

postmaster.log suggests a valgrind error.

Yes, I saw it. I'm too tired to figure it out tonight, will look more
tomorrow. I think it'll just need to mark the buffer data accessible in one
more spot.

To update this thread as well:
This is fixed as of
https://git.postgresql.org/pg/commitdiff/c819d1017ddb349d92ab323d2631bc1f10bb4e86
The change from this thread wasn't to blame; it was 5e899859287.

Thanks a lot for taking care of it.

--
Thanks, Amit Langote