Checkpointer write combining

Started by Melanie Plageman, 4 months ago, 24 messages

melanieplageman@gmail.com

4 months ago

7 attachment(s)

Hi,

The attached patchset implements checkpointer write combining -- which
makes immediate checkpoints at least 20% faster in my tests.
Checkpointer achieves higher write throughput and higher write IOPs
with the patch.
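
For readers new to the idea, here is a minimal standalone sketch of what
"write combining" means. It is not code from the patch set: WriteRun and
combine_writes() are hypothetical names, and it assumes the dirty block
numbers are already sorted and belong to one fork of one relation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one combined IO: a run of adjacent blocks. */
typedef struct WriteRun
{
	uint32_t	start;		/* first block number in the run */
	uint32_t	nblocks;	/* number of contiguous blocks in the run */
} WriteRun;

/*
 * Split a sorted array of dirty block numbers into runs of consecutive
 * blocks, each at most combine_limit blocks long. Returns the number of
 * runs stored in out, which must have room for nblocks entries.
 */
static size_t
combine_writes(const uint32_t *blocks, size_t nblocks,
			   uint32_t combine_limit, WriteRun *out)
{
	size_t		nruns = 0;

	assert(combine_limit > 0);

	for (size_t i = 0; i < nblocks; i++)
	{
		if (nruns > 0 &&
			out[nruns - 1].nblocks < combine_limit &&
			blocks[i] == out[nruns - 1].start + out[nruns - 1].nblocks)
			out[nruns - 1].nblocks++;	/* block extends the current run */
		else
			out[nruns++] = (WriteRun) {.start = blocks[i], .nblocks = 1};
	}

	return nruns;
}
```

With dirty blocks 10..14, 20, 21 and 30 and a limit of 4, this yields four
writes (10..13, 14, 20..21 and 30) instead of eight single-block writes.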

Besides the immediate performance gain with the patchset, we will
eventually need all writers to do write combining if we want to use
direct IO. Additionally, I think the general shape I refactored
BufferSync() into will be useful for AIO-ifying checkpointer.

The patch set has preliminary patches (0001-0004) that implement eager
flushing and write combining for bulkwrites (like COPY FROM). The
functions used to flush a batch of writes for bulkwrites (see 0004)
are reused for the checkpointer. The eager flushing component of this
patch set has been discussed elsewhere [1].

0005 implements a fix for XLogNeedsFlush() when called by checkpointer
during an end-of-crash-recovery checkpoint. I've already started
another thread about this [2], but the patch is required for the patch
set to pass tests.
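
To illustrate the condition 0005 changes, here is a simplified standalone
model. WalState, Lsn and needs_flush() are hypothetical stand-ins, not
PostgreSQL code, and the model leaves out the invalid-minRecoveryPoint
handling and the refresh of the local min recovery point copy that the real
XLogNeedsFlush() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* hypothetical stand-in for XLogRecPtr */

/* Hypothetical snapshot of the state XLogNeedsFlush() consults. */
typedef struct WalState
{
	bool		recovery_in_progress;	/* models RecoveryInProgress() */
	bool		insert_allowed;			/* models XLogInsertAllowed() */
	Lsn			flush_ptr;				/* WAL flushed through here */
	Lsn			min_recovery_point;
} WalState;

/*
 * Decide which pointer the record LSN is compared against. With
 * patched == false the choice depends on recovery state alone; with
 * patched == true a process that is allowed to insert WAL always
 * compares against the flush pointer.
 */
static bool
needs_flush(const WalState *st, Lsn record, bool patched)
{
	bool		use_min_recovery_point;

	assert(st);

	use_min_recovery_point = st->recovery_in_progress;
	if (patched)
		use_min_recovery_point = use_min_recovery_point && !st->insert_allowed;

	if (use_min_recovery_point)
		return record > st->min_recovery_point;
	return record > st->flush_ptr;
}
```

In this model, a checkpointer in an end-of-recovery checkpoint (recovery in
progress, inserts allowed) that has flushed WAL through 200 but sees a min
recovery point of 100 gets "needs flush" for LSN 150 without the patch and
"does not need flush" with it.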

One outstanding action item is to test whether spread checkpoints
benefit as well.

More on how I measured the performance benefit to immediate checkpoints:

I tuned checkpoint_completion_target, checkpoint_timeout, and min and
max_wal_size to ensure no other checkpoints were initiated.

With 16 GB shared buffers and io_combine_limit 128, I created a 15 GB
table. To get consistent results, I used pg_prewarm to read the table
into shared buffers, issued a checkpoint, then used Bilal's patch [3]
to mark all the buffers as dirty again and issue another checkpoint.
On a fast local SSD, this proved to be a consistent 20%+ speed up
(~6.5 seconds to ~5 seconds).
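
As a sanity check on the arithmetic, a small sketch using only the rounded
timings above (the helper names are hypothetical, and the inputs are
approximate, so the outputs are too):

```c
#include <assert.h>

/* Percentage by which elapsed time shrank. */
static double
time_reduction_pct(double before_s, double after_s)
{
	assert(before_s > 0.0 && after_s > 0.0);
	return (before_s - after_s) / before_s * 100.0;
}

/* Percentage by which throughput rose for the same amount of work. */
static double
throughput_gain_pct(double before_s, double after_s)
{
	assert(before_s > 0.0 && after_s > 0.0);
	return (before_s / after_s - 1.0) * 100.0;
}
```

For 6.5 seconds and 5 seconds, the first works out to roughly 23% less
elapsed time and the second to roughly 30% more data written per second, so
the "20%+" figure refers to elapsed time.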

- Melanie

[1]: /messages/by-id/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig@mail.gmail.com
[2]: /messages/by-id/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g@mail.gmail.com
[3]: /messages/by-id/CAN55FZ0h_YoSqqutxV6DES1RW8ig6wcA8CR9rJk358YRMxZFmw@mail.gmail.com

Attachments:

v1-0005-Fix-XLogNeedsFlush-for-checkpointer.patch (text/x-patch; charset=US-ASCII)

From 3b57dbff6412f3864633eecd0d153d862e1737af Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 10:01:17 -0400
Subject: [PATCH v1 5/9] Fix XLogNeedsFlush() for checkpointer

XLogNeedsFlush() takes an LSN and compares it to either the flush pointer or the
min recovery point, depending on whether it is in normal operation or recovery.

Even though it is technically recovery, the checkpointer must flush WAL during
an end-of-recovery checkpoint, so in this case, it should compare the provided
LSN to the flush pointer and not the min recovery point.

If it compares the LSN to the min recovery point when the control file's min
recovery point has been updated to an incorrect value, XLogNeedsFlush() can
return an incorrect result of true -- even after just having flushed WAL.

Change this to compare the LSN to the min recovery point -- and potentially
update the local copy of the min recovery point -- only when xlog inserts are
not allowed. Inserts are allowed for the checkpointer during an
end-of-recovery checkpoint, but not otherwise during crash recovery.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g%40mail.gmail.com
---
 src/backend/access/transam/xlog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7ffb2179151..16ef6d2cd64 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3115,7 +3115,7 @@ XLogNeedsFlush(XLogRecPtr record)
 	 * instead. So "needs flush" is taken to mean whether minRecoveryPoint
 	 * would need to be updated.
 	 */
-	if (RecoveryInProgress())
+	if (RecoveryInProgress() && !XLogInsertAllowed())
 	{
 		/*
 		 * An invalid minRecoveryPoint means that we need to recover all the
-- 
2.43.0

v1-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch (text/x-patch; charset=US-ASCII)

From 8d874b737771dbb9b2cb6968d79376a1b1276491 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:00:44 -0400
Subject: [PATCH v1 1/9] Refactor goto into for loop in GetVictimBuffer()

GetVictimBuffer() implemented a loop to optimistically lock a clean
victim buffer using a goto. Future commits will add batch flushing
functionality to GetVictimBuffer. The new logic works better with a
regular for loop flow control.

This commit is only a refactor and does not introduce any new
functionality.
---
 src/backend/storage/buffer/bufmgr.c   | 200 ++++++++++++--------------
 src/backend/storage/buffer/freelist.c |  17 +++
 src/include/storage/buf_internals.h   |   5 +
 3 files changed, 116 insertions(+), 106 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 350cc0402aa..c0f0e052135 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -68,10 +68,6 @@
 #include "utils/timestamp.h"
 
 
-/* Note: these two macros only work on shared buffers, not local ones! */
-#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
-#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
-
 /* Note: this macro only works on local buffers, not shared ones! */
 #define LocalBufHdrGetBlock(bufHdr) \
 	LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]
@@ -2356,130 +2352,122 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
-	/* we return here if a prospective victim buffer gets used concurrently */
-again:
-
-	/*
-	 * Select a victim buffer.  The buffer is returned with its header
-	 * spinlock still held!
-	 */
-	buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
-	buf = BufferDescriptorGetBuffer(buf_hdr);
-
-	Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
-
-	/* Pin the buffer and then release the buffer spinlock */
-	PinBuffer_Locked(buf_hdr);
-
-	/*
-	 * We shouldn't have any other pins for this buffer.
-	 */
-	CheckBufferIsPinnedOnce(buf);
-
-	/*
-	 * If the buffer was dirty, try to write it out.  There is a race
-	 * condition here, in that someone might dirty it after we released the
-	 * buffer header lock above, or even while we are writing it out (since
-	 * our share-lock won't prevent hint-bit updates).  We will recheck the
-	 * dirty bit after re-locking the buffer header.
-	 */
-	if (buf_state & BM_DIRTY)
+	/* Select a victim buffer using an optimistic locking scheme. */
+	for (;;)
 	{
-		LWLock	   *content_lock;
+		/*
+		 * Attempt to claim a victim buffer.  The buffer is returned with its
+		 * header spinlock still held!
+		 */
+		buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+		buf = BufferDescriptorGetBuffer(buf_hdr);
 
-		Assert(buf_state & BM_TAG_VALID);
-		Assert(buf_state & BM_VALID);
+		Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+
+		/* Pin the buffer and then release the buffer spinlock */
+		PinBuffer_Locked(buf_hdr);
 
 		/*
-		 * We need a share-lock on the buffer contents to write it out (else
-		 * we might write invalid data, eg because someone else is compacting
-		 * the page contents while we write).  We must use a conditional lock
-		 * acquisition here to avoid deadlock.  Even though the buffer was not
-		 * pinned (and therefore surely not locked) when StrategyGetBuffer
-		 * returned it, someone else could have pinned and exclusive-locked it
-		 * by the time we get here. If we try to get the lock unconditionally,
-		 * we'd block waiting for them; if they later block waiting for us,
-		 * deadlock ensues. (This has been observed to happen when two
-		 * backends are both trying to split btree index pages, and the second
-		 * one just happens to be trying to split the page the first one got
-		 * from StrategyGetBuffer.)
+		 * We shouldn't have any other pins for this buffer.
 		 */
-		content_lock = BufferDescriptorGetContentLock(buf_hdr);
-		if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
-		{
-			/*
-			 * Someone else has locked the buffer, so give it up and loop back
-			 * to get another one.
-			 */
-			UnpinBuffer(buf_hdr);
-			goto again;
-		}
+		CheckBufferIsPinnedOnce(buf);
 
 		/*
-		 * If using a nondefault strategy, and writing the buffer would
-		 * require a WAL flush, let the strategy decide whether to go ahead
-		 * and write/reuse the buffer or to choose another victim.  We need a
-		 * lock to inspect the page LSN, so this can't be done inside
-		 * StrategyGetBuffer.
+		 * If the buffer was dirty, try to write it out.  There is a race
+		 * condition here, in that someone might dirty it after we released
+		 * the buffer header lock above, or even while we are writing it out
+		 * (since our share-lock won't prevent hint-bit updates).  We will
+		 * recheck the dirty bit after re-locking the buffer header.
 		 */
-		if (strategy != NULL)
+		if (buf_state & BM_DIRTY)
 		{
-			XLogRecPtr	lsn;
+			LWLock	   *content_lock;
 
-			/* Read the LSN while holding buffer header lock */
-			buf_state = LockBufHdr(buf_hdr);
-			lsn = BufferGetLSN(buf_hdr);
-			UnlockBufHdr(buf_hdr, buf_state);
+			Assert(buf_state & BM_TAG_VALID);
+			Assert(buf_state & BM_VALID);
 
-			if (XLogNeedsFlush(lsn)
-				&& StrategyRejectBuffer(strategy, buf_hdr, from_ring))
+			/*
+			 * We need a share-lock on the buffer contents to write it out
+			 * (else we might write invalid data, eg because someone else is
+			 * compacting the page contents while we write).  We must use a
+			 * conditional lock acquisition here to avoid deadlock.  Even
+			 * though the buffer was not pinned (and therefore surely not
+			 * locked) when StrategyGetBuffer returned it, someone else could
+			 * have pinned and exclusive-locked it by the time we get here. If
+			 * we try to get the lock unconditionally, we'd block waiting for
+			 * them; if they later block waiting for us, deadlock ensues.
+			 * (This has been observed to happen when two backends are both
+			 * trying to split btree index pages, and the second one just
+			 * happens to be trying to split the page the first one got from
+			 * StrategyGetBuffer.)
+			 */
+			content_lock = BufferDescriptorGetContentLock(buf_hdr);
+			if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				/*
+				 * Someone else has locked the buffer, so give it up and loop
+				 * back to get another one.
+				 */
+				UnpinBuffer(buf_hdr);
+				continue;
+			}
+
+			/*
+			 * If using a nondefault strategy, and writing the buffer would
+			 * require a WAL flush, let the strategy decide whether to go
+			 * ahead and write/reuse the buffer or to choose another victim.
+			 * We need the content lock to inspect the page LSN, so this can't
+			 * be done inside StrategyGetBuffer.
+			 */
+			if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
 			{
 				LWLockRelease(content_lock);
 				UnpinBuffer(buf_hdr);
-				goto again;
+				continue;
 			}
-		}
 
-		/* OK, do the I/O */
-		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-		LWLockRelease(content_lock);
+			/* OK, do the I/O */
+			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
+			LWLockRelease(content_lock);
 
-		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-									  &buf_hdr->tag);
-	}
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &buf_hdr->tag);
+		}
 
+		if (buf_state & BM_VALID)
+		{
+			/*
+			 * When a BufferAccessStrategy is in use, blocks evicted from
+			 * shared buffers are counted as IOOP_EVICT in the corresponding
+			 * context (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted
+			 * by a strategy in two cases: 1) while initially claiming buffers
+			 * for the strategy ring 2) to replace an existing strategy ring
+			 * buffer because it is pinned or in use and cannot be reused.
+			 *
+			 * Blocks evicted from buffers already in the strategy ring are
+			 * counted as IOOP_REUSE in the corresponding strategy context.
+			 *
+			 * At this point, we can accurately count evictions and reuses,
+			 * because we have successfully claimed the valid buffer.
+			 * Previously, we may have been forced to release the buffer due
+			 * to concurrent pinners or erroring out.
+			 */
+			pgstat_count_io_op(IOOBJECT_RELATION, io_context,
+							   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
+		}
 
-	if (buf_state & BM_VALID)
-	{
 		/*
-		 * When a BufferAccessStrategy is in use, blocks evicted from shared
-		 * buffers are counted as IOOP_EVICT in the corresponding context
-		 * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
-		 * strategy in two cases: 1) while initially claiming buffers for the
-		 * strategy ring 2) to replace an existing strategy ring buffer
-		 * because it is pinned or in use and cannot be reused.
-		 *
-		 * Blocks evicted from buffers already in the strategy ring are
-		 * counted as IOOP_REUSE in the corresponding strategy context.
-		 *
-		 * At this point, we can accurately count evictions and reuses,
-		 * because we have successfully claimed the valid buffer. Previously,
-		 * we may have been forced to release the buffer due to concurrent
-		 * pinners or erroring out.
+		 * If the buffer has an entry in the buffer mapping table, delete it.
+		 * This can fail because another backend could have pinned or dirtied
+		 * the buffer. Then loop around and try again.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, io_context,
-						   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
-	}
+		if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
+		{
+			UnpinBuffer(buf_hdr);
+			continue;
+		}
 
-	/*
-	 * If the buffer has an entry in the buffer mapping table, delete it. This
-	 * can fail because another backend could have pinned or dirtied the
-	 * buffer.
-	 */
-	if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
-	{
-		UnpinBuffer(buf_hdr);
-		goto again;
+		break;
 	}
 
 	/* a final set of sanity checks */
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 01909be0272..f695ce43224 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "pgstat.h"
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
@@ -833,12 +834,21 @@ IOContextForStrategy(BufferAccessStrategy strategy)
  * be written out and doing so would require flushing WAL too.  This gives us
  * a chance to choose a different victim.
  *
+ * The buffer must be pinned and content locked, and the buffer header
+ * spinlock must not be held. We must have the content lock to examine the LSN.
+ *
  * Returns true if buffer manager should ask for a new victim, and false
  * if this buffer should be written and re-used.
  */
 bool
 StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+
+	if (!strategy)
+		return false;
+
 	/* We only do this in bulkread mode */
 	if (strategy->btype != BAS_BULKREAD)
 		return false;
@@ -848,6 +858,13 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
 		return false;
 
+	buf_state = LockBufHdr(buf);
+	lsn = BufferGetLSN(buf);
+	UnlockBufHdr(buf, buf_state);
+
+	if (!XLogNeedsFlush(lsn))
+		return false;
+
 	/*
 	 * Remove the dirty buffer from the ring; necessary to prevent infinite
 	 * loop if all ring members are dirty.
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 52a71b138f7..ed65ed84034 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -428,6 +428,11 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 /*
  * Internal buffer management routines
  */
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
+
 /* bufmgr.c */
 extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
-- 
2.43.0

v1-0004-Write-combining-for-BAS_BULKWRITE.patch (text/x-patch; charset=US-ASCII)

From 8cd1a72128e25a9fccc9ed4551498f13e650fc97 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 12:56:38 -0400
Subject: [PATCH v1 4/9] Write combining for BAS_BULKWRITE

Implement write combining for users of the bulkwrite buffer access
strategy (e.g. COPY FROM). When the buffer access strategy needs to
clean a buffer for reuse, it already opportunistically flushes some
other buffers. Now, combine any contiguous blocks from the same relation
into larger writes and issue them with smgrwritev().

The performance benefit for COPY FROM is mostly noticeable for multiple
concurrent COPY FROMs because a single COPY FROM is either CPU bound or
bound by WAL writes.

The infrastructure for flushing larger batches of IOs will be reused by
checkpointer and other processes doing writes of dirty data.
---
 src/backend/storage/buffer/bufmgr.c   | 198 ++++++++++++++++++++++++--
 src/backend/storage/buffer/freelist.c |  26 ++++
 src/backend/storage/page/bufpage.c    |  20 +++
 src/backend/utils/probes.d            |   2 +
 src/include/storage/buf_internals.h   |  32 +++++
 src/include/storage/bufpage.h         |   1 +
 src/tools/pgindent/typedefs.list      |   1 +
 7 files changed, 269 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a38f1247135..80122abd9aa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -539,6 +539,8 @@ static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber re
 												   RelFileLocator *rlocator,
 												   bool skip_pinned,
 												   XLogRecPtr *max_lsn);
+static void FindFlushAdjacents(BufferAccessStrategy strategy, BufferDesc *start,
+							   uint32 max_batch_size, BufWriteBatch *batch);
 static void CleanVictimBuffer(BufferAccessStrategy strategy,
 							  BufferDesc *bufdesc, uint32 *buf_state, bool from_ring);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
@@ -4281,10 +4283,73 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+
+/*
+ * Given a buffer descriptor, start, from a strategy ring, strategy, that
+ * supports eager flushing, find additional buffers from the ring that can be
+ * combined into a single write batch with this buffer.
+ *
+ * max_batch_size is the maximum number of blocks that can be combined into a
+ * single write in general. This function, based on the block number of start,
+ * will determine the maximum IO size for this particular write given how much
+ * of the file remains. max_batch_size is provided by the caller so it doesn't
+ * have to be recalculated for each write.
+ *
+ * batch is an output parameter that this function will fill with the needed
+ * information to write this IO.
+ *
+ * This function will pin and content lock all of the buffers that it
+ * assembles for the IO batch. The caller is responsible for issuing the IO.
+ */
+static void
+FindFlushAdjacents(BufferAccessStrategy strategy, BufferDesc *start,
+				   uint32 max_batch_size, BufWriteBatch *batch)
+{
+	BlockNumber limit;
+	uint32		buf_state;
+
+	Assert(start);
+	batch->bufdescs[0] = start;
+
+	buf_state = LockBufHdr(start);
+	batch->max_lsn = BufferGetLSN(start);
+	UnlockBufHdr(start, buf_state);
+
+	batch->start = batch->bufdescs[0]->tag.blockNum;
+	batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+	batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
+	batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+	Assert(BlockNumberIsValid(batch->start));
+
+	limit = smgrmaxcombine(batch->reln, batch->forkno, batch->start);
+	limit = Max(limit, 1);
+	limit = Min(max_batch_size, limit);
+
+	/* Now assemble a run of blocks to write out. */
+	for (batch->n = 1; batch->n < limit; batch->n++)
+	{
+		Buffer		bufnum;
+
+		if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
+			break;
+
+		/* Stop when we encounter a buffer that will break the run */
+		if ((batch->bufdescs[batch->n] =
+			 PrepareOrRejectEagerFlushBuffer(bufnum,
+											 batch->start + batch->n,
+											 &batch->rlocator,
+											 true,
+											 &batch->max_lsn)) == NULL)
+			break;
+	}
+}
+
 /*
  * Returns the buffer descriptor of the buffer containing the next block we
  * should eagerly flush or or NULL when there are no further buffers to
- * consider writing out.
+ * consider writing out. This will be the start of a new batch of buffers to
+ * write out.
  */
 static BufferDesc *
 next_strat_buf_to_flush(BufferAccessStrategy strategy,
@@ -4316,7 +4381,6 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
-	bool		first_buffer = true;
 	IOContext	io_context = IOContextForStrategy(strategy);
 
 	Assert(*buf_state & BM_DIRTY);
@@ -4327,19 +4391,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
 	if (from_ring && strategy_supports_eager_flush(strategy))
 	{
+		uint32		max_batch_size = max_write_batch_size_for_strategy(strategy);
+
+		/* Pin our victim again so it stays ours even after batch released */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+		IncrBufferRefCount(BufferDescriptorGetBuffer(bufdesc));
+
 		/* Clean victim buffer and find more to flush opportunistically */
 		StartStrategySweep(strategy);
 		do
 		{
-			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
-			content_lock = BufferDescriptorGetContentLock(bufdesc);
-			LWLockRelease(content_lock);
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &bufdesc->tag);
-			/* We leave the first buffer pinned for the caller */
-			if (!first_buffer)
-				UnpinBuffer(bufdesc);
-			first_buffer = false;
+			BufWriteBatch batch;
+
+			FindFlushAdjacents(strategy, bufdesc, max_batch_size, &batch);
+			FlushBufferBatch(&batch, io_context);
+			CompleteWriteBatchIO(&batch, &BackendWritebackContext, io_context);
 		} while ((bufdesc = next_strat_buf_to_flush(strategy, &max_lsn)) != NULL);
 	}
 	else
@@ -4461,6 +4528,73 @@ except_unlock_header:
 	return NULL;
 }
 
+/*
+ * Given a prepared batch of buffers, write them out as a vector.
+ */
+void
+FlushBufferBatch(BufWriteBatch *batch,
+				 IOContext io_context)
+{
+	BlockNumber blknums[MAX_IO_COMBINE_LIMIT];
+	Block		blocks[MAX_IO_COMBINE_LIMIT];
+	instr_time	io_start;
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+
+	if (!XLogRecPtrIsInvalid(batch->max_lsn))
+		XLogFlush(batch->max_lsn);
+
+	if (batch->reln == NULL)
+		batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+#ifdef USE_ASSERT_CHECKING
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		BufferDesc *bufdesc = batch->bufdescs[i];
+		uint32		buf_state = LockBufHdr(bufdesc);
+		XLogRecPtr	lsn = BufferGetLSN(bufdesc);
+
+		UnlockBufHdr(bufdesc, buf_state);
+		Assert(!(buf_state & BM_PERMANENT) || !XLogNeedsFlush(lsn));
+	}
+#endif
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_START(batch->forkno,
+											  batch->reln->smgr_rlocator.locator.spcOid,
+											  batch->reln->smgr_rlocator.locator.dbOid,
+											  batch->reln->smgr_rlocator.locator.relNumber,
+											  batch->reln->smgr_rlocator.backend,
+											  batch->n);
+
+	/*
+	 * XXX: All blocks should be copied and then checksummed but doing so
+	 * takes a lot of extra memory and a future patch will eliminate this
+	 * requirement.
+	 */
+	for (BlockNumber i = 0; i < batch->n; i++)
+	{
+		blknums[i] = batch->start + i;
+		blocks[i] = BufHdrGetBlock(batch->bufdescs[i]);
+	}
+
+	PageSetBatchChecksumInplace((Page *) blocks, blknums, batch->n);
+
+	io_start = pgstat_prepare_io_time(track_io_timing);
+
+	smgrwritev(batch->reln, batch->forkno,
+			   batch->start, (const void **) blocks, batch->n, false);
+
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context, IOOP_WRITE,
+							io_start, batch->n, BLCKSZ);
+
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Prepare the buffer with budesc for writing. buf_state and lsn are output
  * parameters. Returns true if the buffer acutally needs writing and false
@@ -4606,6 +4740,48 @@ DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	error_context_stack = errcallback.previous;
 }
 
+/*
+ * Given a previously initialized batch with buffers that have already been
+ * flushed, terminate the IO on each buffer and then unlock and unpin them.
+ * This assumes all the buffers were locked and pinned. wb_context will be
+ * modified.
+ */
+void
+CompleteWriteBatchIO(BufWriteBatch *batch,
+					 WritebackContext *wb_context, IOContext io_context)
+{
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+	pgBufferUsage.shared_blks_written += batch->n;
+
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		Buffer		buffer = BufferDescriptorGetBuffer(batch->bufdescs[i]);
+
+		errcallback.arg = batch->bufdescs[i];
+
+		/* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state. */
+		TerminateBufferIO(batch->bufdescs[i], true, 0, true, false);
+		LWLockRelease(BufferDescriptorGetContentLock(batch->bufdescs[i]));
+		ReleaseBuffer(buffer);
+		ScheduleBufferTagForWriteback(wb_context, io_context,
+									  &batch->bufdescs[i]->tag);
+	}
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_DONE(batch->forkno,
+											 batch->reln->smgr_rlocator.locator.spcOid,
+											 batch->reln->smgr_rlocator.locator.dbOid,
+											 batch->reln->smgr_rlocator.locator.relNumber,
+											 batch->reln->smgr_rlocator.backend,
+											 batch->n, batch->start);
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * RelationGetNumberOfBlocksInFork
  *		Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index af75c02723d..4ce70de11c9 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -843,6 +843,32 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	return NULL;
 }
 
+
+/*
+ * Determine the largest IO we can assemble from the given strategy ring given
+ * strategy-specific as well as global constraints on the number of pinned
+ * buffers and max IO size.
+ */
+uint32
+max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
+{
+	uint32		max_possible_buffer_limit;
+	uint32		max_write_batch_size;
+	int			strategy_pin_limit;
+
+	max_write_batch_size = io_combine_limit;
+
+	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+	max_possible_buffer_limit = GetPinLimit();
+
+	max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+	max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+	max_write_batch_size = Max(1, max_write_batch_size);
+	max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
+	Assert(max_write_batch_size <= MAX_IO_COMBINE_LIMIT);
+	return max_write_batch_size;
+}
+
 /*
  * AddBufferToRing -- add a buffer to the buffer ring
  *
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index dbb49ed9197..fc749dd5a50 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
 	((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
 }
+
+/*
+ * A helper to set multiple blocks' checksums.
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, BlockNumber *blknos, uint32 length)
+{
+	/* If we don't need a checksum, just return */
+	if (!DataChecksumsEnabled())
+		return;
+
+	for (uint32 i = 0; i < length; i++)
+	{
+		Page		page = pages[i];
+
+		if (PageIsNew(page))
+			continue;
+		((PageHeader) page)->pd_checksum = pg_checksum_page(page, blknos[i]);
+	}
+}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index e9e413477ba..36dd4f8375b 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -61,6 +61,8 @@ provider postgresql {
 	probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
 	probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
 	probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
+	probe buffer__batch__flush__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
+	probe buffer__batch__flush__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
 
 	probe buffer__checkpoint__start(int);
 	probe buffer__checkpoint__sync__start();
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 49914f8b46f..586e52cd01b 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -425,6 +425,34 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 	ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
 }
 
+/*
+ * Used to write out multiple blocks at a time in a combined IO. bufdescs
+ * contains buffer descriptors for buffers containing adjacent blocks of the
+ * same fork of the same relation.
+ */
+typedef struct BufWriteBatch
+{
+	RelFileLocator rlocator;
+	ForkNumber	forkno;
+	SMgrRelation reln;
+
+	/*
+	 * The BlockNumber of the first block in the run of contiguous blocks to
+	 * be written out as a single IO.
+	 */
+	BlockNumber start;
+
+	/*
+	 * While assembling the buffers, we keep track of the maximum LSN so that
+	 * we can flush WAL through this LSN before flushing the buffers.
+	 */
+	XLogRecPtr	max_lsn;
+
+	/* The number of valid buffers in bufdescs */
+	uint32		n;
+	BufferDesc *bufdescs[MAX_IO_COMBINE_LIMIT];
+} BufWriteBatch;
+
 /*
  * Internal buffer management routines
  */
@@ -438,6 +466,7 @@ extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
 										  IOContext io_context, BufferTag *tag);
+extern void FlushBufferBatch(BufWriteBatch *batch, IOContext io_context);
 
 /* solely to make it easier to write tests */
 extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
@@ -447,8 +476,11 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 /* freelist.c */
 extern bool strategy_supports_eager_flush(BufferAccessStrategy strategy);
+extern uint32 max_write_batch_size_for_strategy(BufferAccessStrategy strategy);
 extern Buffer StrategySweepNextBuffer(BufferAccessStrategy strategy);
 extern void StartStrategySweep(BufferAccessStrategy strategy);
+extern void CompleteWriteBatchIO(BufWriteBatch *batch, WritebackContext *wb_context,
+								 IOContext io_context);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index aeb67c498c5..1020cb3ac78 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -507,5 +507,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 									Item newtup, Size newsize);
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern void PageSetBatchChecksumInplace(Page *pages, BlockNumber *blknos, uint32 length);
 
 #endif							/* BUFPAGE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..9492adeee58 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -349,6 +349,7 @@ BufferManagerRelation
 BufferStrategyControl
 BufferTag
 BufferUsage
+BufWriteBatch
 BuildAccumulator
 BuiltinScript
 BulkInsertState
-- 
2.43.0

v1-0003-Eagerly-flush-bulkwrite-strategy-ring.patch (text/x-patch; charset=US-ASCII)

From 62b718b0d3adbb95151ebbe8ef6d621f103458e9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 12:43:24 -0400
Subject: [PATCH v1 3/9] Eagerly flush bulkwrite strategy ring

Operations using BAS_BULKWRITE (COPY FROM and createdb) will inevitably
need to flush buffers in the strategy ring in order to reuse
them. By eagerly flushing the buffers in a larger batch, we encourage
larger writes at the kernel level and less interleaving of WAL flushes
and data file writes. The effect is mainly noticeable with multiple
parallel COPY FROMs. In this case, client backends achieve higher write
throughput and end up spending less time waiting on acquiring the lock
to flush WAL. Larger flush operations also mean less time waiting for
flush operations at the kernel level.

The heuristic for eager eviction is to only flush buffers in the
strategy ring whose flushing does not require flushing WAL.

This patch is also a stepping stone toward AIO writes.

Earlier version
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 166 +++++++++++++++++++++++++-
 src/backend/storage/buffer/freelist.c |  62 ++++++++++
 src/include/storage/buf_internals.h   |   3 +
 3 files changed, 228 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a0077a3f662..a38f1247135 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -534,6 +534,11 @@ static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object
 						  IOContext io_context, XLogRecPtr buffer_lsn);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+static BufferDesc *next_strat_buf_to_flush(BufferAccessStrategy strategy, XLogRecPtr *lsn);
+static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+												   RelFileLocator *rlocator,
+												   bool skip_pinned,
+												   XLogRecPtr *max_lsn);
 static void CleanVictimBuffer(BufferAccessStrategy strategy,
 							  BufferDesc *bufdesc, uint32 *buf_state, bool from_ring);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
@@ -4276,6 +4281,31 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * Returns the buffer descriptor of the buffer containing the next block we
+ * should eagerly flush, or NULL when there are no further buffers to
+ * consider writing out.
+ */
+static BufferDesc *
+next_strat_buf_to_flush(BufferAccessStrategy strategy,
+						XLogRecPtr *lsn)
+{
+	Buffer		bufnum;
+	BufferDesc *bufdesc;
+
+	while ((bufnum = StrategySweepNextBuffer(strategy)) != InvalidBuffer)
+	{
+		if ((bufdesc = PrepareOrRejectEagerFlushBuffer(bufnum,
+													   InvalidBlockNumber,
+													   NULL,
+													   true,
+													   lsn)) != NULL)
+			return bufdesc;
+	}
+
+	return NULL;
+}
+
 /*
  * Prepare to write and write a dirty victim buffer.
  */
@@ -4286,6 +4316,7 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
+	bool		first_buffer = true;
 	IOContext	io_context = IOContextForStrategy(strategy);
 
 	Assert(*buf_state & BM_DIRTY);
@@ -4294,11 +4325,140 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 	if (!PrepareFlushBuffer(bufdesc, buf_state, &max_lsn))
 		return;
 
-	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	if (from_ring && strategy_supports_eager_flush(strategy))
+	{
+		/* Clean victim buffer and find more to flush opportunistically */
+		StartStrategySweep(strategy);
+		do
+		{
+			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+			content_lock = BufferDescriptorGetContentLock(bufdesc);
+			LWLockRelease(content_lock);
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &bufdesc->tag);
+			/* We leave the first buffer pinned for the caller */
+			if (!first_buffer)
+				UnpinBuffer(bufdesc);
+			first_buffer = false;
+		} while ((bufdesc = next_strat_buf_to_flush(strategy, &max_lsn)) != NULL);
+	}
+	else
+	{
+		DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+		content_lock = BufferDescriptorGetContentLock(bufdesc);
+		LWLockRelease(content_lock);
+		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+									  &bufdesc->tag);
+	}
+}
+
+/*
+ * Prepare bufdesc for eager flushing.
+ *
+ * Given bufnum, returns the buffer descriptor of the buffer which we will
+ * opportunistically flush, or NULL if this buffer does not contain a block
+ * that should be flushed.
+ *
+ * require is the BlockNumber required by the caller. Some callers may require
+ * a specific BlockNumber to be in bufnum because they are assembling a
+ * contiguous run of blocks.
+ *
+ * If the caller needs the block to be from a specific relation, rlocator will
+ * be provided.
+ */
+static BufferDesc *
+PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+								RelFileLocator *rlocator, bool skip_pinned,
+								XLogRecPtr *max_lsn)
+{
+	BufferDesc *bufdesc;
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+	BlockNumber blknum;
+	LWLock	   *content_lock;
+
+	if (!BufferIsValid(bufnum))
+		return NULL;
+
+	Assert(!BufferIsLocal(bufnum));
+
+	bufdesc = GetBufferDescriptor(bufnum - 1);
+
+	/* Block may need to be in a specific relation */
+	if (rlocator &&
+		!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufdesc->tag),
+							  *rlocator))
+		return NULL;
+
+	/* Must do this before taking the buffer header spinlock. */
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	ReservePrivateRefCountEntry();
+
+	buf_state = LockBufHdr(bufdesc);
+
+	if (!(buf_state & BM_DIRTY) || !(buf_state & BM_VALID))
+		goto except_unlock_header;
+
+	/* We don't include used buffers in batches */
+	if (skip_pinned &&
+		(BUF_STATE_GET_REFCOUNT(buf_state) > 0 ||
+		 BUF_STATE_GET_USAGECOUNT(buf_state) > 1))
+		goto except_unlock_header;
+
+	/* Get page LSN while holding header lock */
+	lsn = BufferGetLSN(bufdesc);
+
+	PinBuffer_Locked(bufdesc);
+	CheckBufferIsPinnedOnce(bufnum);
+
+	blknum = BufferGetBlockNumber(bufnum);
+	Assert(BlockNumberIsValid(blknum));
+
+	/* If we'll have to flush WAL to flush the block, we're done */
+	if (buf_state & BM_PERMANENT && XLogNeedsFlush(lsn))
+		goto except_unpin_buffer;
+
+	/* We only include contiguous blocks in the run */
+	if (BlockNumberIsValid(require) && blknum != require)
+		goto except_unpin_buffer;
+
 	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+		goto except_unpin_buffer;
+
+	/*
+	 * Now that we have the content lock, we need to recheck if we need to
+	 * flush WAL.
+	 */
+	buf_state = LockBufHdr(bufdesc);
+	lsn = BufferGetLSN(bufdesc);
+	UnlockBufHdr(bufdesc, buf_state);
+
+	if (buf_state & BM_PERMANENT && XLogNeedsFlush(lsn))
+		goto except_unlock_content;
+
+	/* Try to start an I/O operation. */
+	if (!StartBufferIO(bufdesc, false, true))
+		goto except_unlock_content;
+
+	if (lsn > *max_lsn)
+		*max_lsn = lsn;
+	buf_state = LockBufHdr(bufdesc);
+	buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, buf_state);
+
+	return bufdesc;
+
+except_unlock_content:
 	LWLockRelease(content_lock);
-	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-								  &bufdesc->tag);
+
+except_unpin_buffer:
+	UnpinBuffer(bufdesc);
+	return NULL;
+
+except_unlock_header:
+	UnlockBufHdr(bufdesc, buf_state);
+	return NULL;
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index f695ce43224..af75c02723d 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -83,6 +83,15 @@ typedef struct BufferAccessStrategyData
 	 */
 	int			current;
 
+	/*
+	 * If the strategy supports eager flushing, we may initiate a sweep of the
+	 * strategy ring, flushing all the dirty buffers we can cheaply flush.
+	 * sweep_start and sweep_current keep track of a given sweep so we don't
+	 * loop around the ring infinitely.
+	 */
+	int			sweep_start;
+	int			sweep_current;
+
 	/*
 	 * Array of buffer numbers.  InvalidBuffer (that is, zero) indicates we
 	 * have not yet selected a buffer for this ring slot.  For allocation
@@ -181,6 +190,31 @@ have_free_buffer(void)
 		return false;
 }
 
+/*
+ * Some BufferAccessStrategies support eager flushing -- which is flushing
+ * buffers in the ring before they are needed. This can lead to better I/O
+ * patterns than lazily flushing buffers directly before reusing them.
+ */
+bool
+strategy_supports_eager_flush(BufferAccessStrategy strategy)
+{
+	Assert(strategy);
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKWRITE:
+			return true;
+		case BAS_VACUUM:
+		case BAS_NORMAL:
+		case BAS_BULKREAD:
+			return false;
+		default:
+			elog(ERROR, "unrecognized buffer access strategy: %d",
+				 (int) strategy->btype);
+			return false;
+	}
+}
+
 /*
  * StrategyGetBuffer
  *
@@ -357,6 +391,34 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	}
 }
 
+/*
+ * Return the next buffer in the ring or InvalidBuffer if the current sweep is
+ * over.
+ */
+Buffer
+StrategySweepNextBuffer(BufferAccessStrategy strategy)
+{
+	strategy->sweep_current++;
+	if (strategy->sweep_current >= strategy->nbuffers)
+		strategy->sweep_current = 0;
+
+	if (strategy->sweep_current == strategy->sweep_start)
+		return InvalidBuffer;
+
+	return strategy->buffers[strategy->sweep_current];
+}
+
+/*
+ * Start a sweep of the strategy ring.
+ */
+void
+StartStrategySweep(BufferAccessStrategy strategy)
+{
+	if (!strategy)
+		return;
+	strategy->sweep_start = strategy->sweep_current = strategy->current;
+}
+
 /*
  * StrategyFreeBuffer: put a buffer on the freelist
  */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index ed65ed84034..49914f8b46f 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -446,6 +446,9 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 
 /* freelist.c */
+extern bool strategy_supports_eager_flush(BufferAccessStrategy strategy);
+extern Buffer StrategySweepNextBuffer(BufferAccessStrategy strategy);
+extern void StartStrategySweep(BufferAccessStrategy strategy);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
-- 
2.43.0
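
A standalone sketch of the sweep bookkeeping in 0003 may help when reading the freelist.c hunks. It is a simplified model, not the patch's code: ToyRing, toy_start_sweep() and toy_sweep_next() are hypothetical stand-ins for BufferAccessStrategyData, StartStrategySweep() and StrategySweepNextBuffer(), and TOY_INVALID_BUFFER plays the role of InvalidBuffer. As in the patch, the sweep advances before it tests for completion, so the slot it started on (the victim that is already being flushed) is never returned.

```c
#include <assert.h>

#define TOY_INVALID_BUFFER 0	/* hypothetical stand-in for InvalidBuffer */
#define TOY_RING_SIZE 8

/* Hypothetical, stripped-down model of a strategy ring. */
typedef struct ToyRing
{
	int			nbuffers;		/* number of slots in use */
	int			current;		/* slot of the buffer being reused */
	int			sweep_start;	/* slot where the current sweep began */
	int			sweep_current;	/* slot the sweep last visited */
	int			buffers[TOY_RING_SIZE];
} ToyRing;

/* Begin a sweep at the ring's current slot. */
void
toy_start_sweep(ToyRing *ring)
{
	assert(ring->nbuffers > 0 && ring->nbuffers <= TOY_RING_SIZE);
	ring->sweep_start = ring->sweep_current = ring->current;
}

/*
 * Advance to the next slot, wrapping around the ring. Returns
 * TOY_INVALID_BUFFER once the sweep is back at its starting slot.
 */
int
toy_sweep_next(ToyRing *ring)
{
	ring->sweep_current++;
	if (ring->sweep_current >= ring->nbuffers)
		ring->sweep_current = 0;

	if (ring->sweep_current == ring->sweep_start)
		return TOY_INVALID_BUFFER;

	return ring->buffers[ring->sweep_current];
}
```

With a four-slot ring whose current slot is 2, a sweep visits slots 3, 0 and 1 and then reports completion, so each sweep is bounded by the ring size.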

v1-0002-Split-FlushBuffer-into-two-parts.patch (text/x-patch; charset=US-ASCII)

From 66804599c04512cf572921cea1af0e4b42a2e6c2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:32:24 -0400
Subject: [PATCH v1 2/9] Split FlushBuffer() into two parts

Before adding write combining to write a batch of blocks when flushing
dirty buffers, refactor FlushBuffer() into a preparatory step and an
actual buffer flushing step. This provides better symmetry with the
batch flushing code.
---
 src/backend/storage/buffer/bufmgr.c | 103 ++++++++++++++++++++--------
 1 file changed, 76 insertions(+), 27 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c0f0e052135..a0077a3f662 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -529,8 +529,13 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
 static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
 static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc, uint32 *buf_state, XLogRecPtr *lsn);
+static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+						  IOContext io_context, XLogRecPtr buffer_lsn);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+static void CleanVictimBuffer(BufferAccessStrategy strategy,
+							  BufferDesc *bufdesc, uint32 *buf_state, bool from_ring);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2426,12 +2431,7 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 				continue;
 			}
 
-			/* OK, do the I/O */
-			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-			LWLockRelease(content_lock);
-
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &buf_hdr->tag);
+			CleanVictimBuffer(strategy, buf_hdr, &buf_state, from_ring);
 		}
 
 		if (buf_state & BM_VALID)
@@ -4269,20 +4269,81 @@ static void
 FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 			IOContext io_context)
 {
-	XLogRecPtr	recptr;
-	ErrorContextCallback errcallback;
-	instr_time	io_start;
-	Block		bufBlock;
-	char	   *bufToWrite;
 	uint32		buf_state;
+	XLogRecPtr	lsn;
+
+	if (PrepareFlushBuffer(buf, &buf_state, &lsn))
+		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
+}
+
+/*
+ * Prepare to write and write a dirty victim buffer.
+ */
+static void
+CleanVictimBuffer(BufferAccessStrategy strategy,
+				  BufferDesc *bufdesc, uint32 *buf_state, bool from_ring)
+{
+
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	LWLock	   *content_lock;
+	IOContext	io_context = IOContextForStrategy(strategy);
+
+	Assert(*buf_state & BM_DIRTY);
+
+	/* Set up this victim buffer to be flushed */
+	if (!PrepareFlushBuffer(bufdesc, buf_state, &max_lsn))
+		return;
 
+	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	LWLockRelease(content_lock);
+	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+								  &bufdesc->tag);
+}
+
+/*
+ * Prepare the buffer with bufdesc for writing. buf_state and lsn are output
+ * parameters. Returns true if the buffer actually needs writing and false
+ * otherwise.
+ */
+static bool
+PrepareFlushBuffer(BufferDesc *bufdesc, uint32 *buf_state, XLogRecPtr *lsn)
+{
 	/*
 	 * Try to start an I/O operation.  If StartBufferIO returns false, then
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false, false))
-		return;
+	if (!StartBufferIO(bufdesc, false, false))
+		return false;
+
+	*lsn = InvalidXLogRecPtr;
+	*buf_state = LockBufHdr(bufdesc);
+
+	/*
+	 * Run PageGetLSN while holding header lock, since we don't have the
+	 * buffer locked exclusively in all cases.
+	 */
+	if (*buf_state & BM_PERMANENT)
+		*lsn = BufferGetLSN(bufdesc);
+
+	/* To check if block content changes while flushing. - vadim 01/17/97 */
+	*buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, *buf_state);
+	return true;
+}
+
+/*
+ * Actually do the write I/O to clean a buffer.
+ */
+static void
+DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+			  IOContext io_context, XLogRecPtr buffer_lsn)
+{
+	ErrorContextCallback errcallback;
+	instr_time	io_start;
+	Block		bufBlock;
+	char	   *bufToWrite;
 
 	/* Setup error traceback support for ereport() */
 	errcallback.callback = shared_buffer_write_error_callback;
@@ -4300,18 +4361,6 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 										reln->smgr_rlocator.locator.dbOid,
 										reln->smgr_rlocator.locator.relNumber);
 
-	buf_state = LockBufHdr(buf);
-
-	/*
-	 * Run PageGetLSN while holding header lock, since we don't have the
-	 * buffer locked exclusively in all cases.
-	 */
-	recptr = BufferGetLSN(buf);
-
-	/* To check if block content changes while flushing. - vadim 01/17/97 */
-	buf_state &= ~BM_JUST_DIRTIED;
-	UnlockBufHdr(buf, buf_state);
-
 	/*
 	 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
 	 * rule that log updates must hit disk before any of the data-file changes
@@ -4329,8 +4378,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * disastrous system-wide consequences.  To make sure that can't happen,
 	 * skip the flush if the buffer isn't permanent.
 	 */
-	if (buf_state & BM_PERMANENT)
-		XLogFlush(recptr);
+	if (!XLogRecPtrIsInvalid(buffer_lsn))
+		XLogFlush(buffer_lsn);
 
 	/*
 	 * Now it's safe to write the buffer to disk. Note that no one else should
-- 
2.43.0

v1-0006-Add-database-Oid-to-CkptSortItem.patch (text/x-patch; charset=US-ASCII)

From eda89d4b1491922315222773c739b5b04f44fa4a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:22:11 -0400
Subject: [PATCH v1 6/9] Add database Oid to CkptSortItem

This is useful for checkpointer write combining -- which will be added
in a future commit.
---
 src/backend/storage/buffer/bufmgr.c | 8 ++++++++
 src/include/storage/buf_internals.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 80122abd9aa..ab0b9246759 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3393,6 +3393,7 @@ BufferSync(int flags)
 			item = &CkptBufferIds[num_to_scan++];
 			item->buf_id = buf_id;
 			item->tsId = bufHdr->tag.spcOid;
+			item->db_id = bufHdr->tag.dbOid;
 			item->relNumber = BufTagGetRelNumber(&bufHdr->tag);
 			item->forkNum = BufTagGetForkNum(&bufHdr->tag);
 			item->blockNum = bufHdr->tag.blockNum;
@@ -6712,6 +6713,13 @@ ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b)
 		return -1;
 	else if (a->tsId > b->tsId)
 		return 1;
+
+	/* compare database */
+	if (a->db_id < b->db_id)
+		return -1;
+	else if (a->db_id > b->db_id)
+		return 1;
+
 	/* compare relation */
 	if (a->relNumber < b->relNumber)
 		return -1;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 586e52cd01b..3383a674c0c 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -391,6 +391,7 @@ UnlockBufHdr(BufferDesc *desc, uint32 buf_state)
 typedef struct CkptSortItem
 {
 	Oid			tsId;
+	Oid			db_id;
 	RelFileNumber relNumber;
 	ForkNumber	forkNum;
 	BlockNumber blockNum;
-- 
2.43.0
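
The commit message of 0006 does not spell out why the database Oid matters. A plausible reading is that a relation file is identified by tablespace, database and relfilenumber together, so two databases in the same tablespace can share a relfilenumber; without the database in the sort key their blocks could interleave and look like one contiguous run. The sketch below illustrates the resulting sort order. ToySortItem and toy_ckpt_cmp() are hypothetical names that only mirror the fields compared in ckpt_buforder_comparator().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical mirror of the sort key fields in CkptSortItem. */
typedef struct ToySortItem
{
	uint32_t	ts_id;
	uint32_t	db_id;
	uint32_t	rel_number;
	int			fork_num;
	uint32_t	block_num;
} ToySortItem;

/* Three-way comparison helper for unsigned keys. */
static int
toy_cmp_u32(uint32_t a, uint32_t b)
{
	return (a > b) - (a < b);
}

/* qsort() comparator: tablespace, database, relation, fork, block. */
int
toy_ckpt_cmp(const void *pa, const void *pb)
{
	const ToySortItem *a = pa;
	const ToySortItem *b = pb;
	int			c;

	if ((c = toy_cmp_u32(a->ts_id, b->ts_id)) != 0)
		return c;
	if ((c = toy_cmp_u32(a->db_id, b->db_id)) != 0)
		return c;
	if ((c = toy_cmp_u32(a->rel_number, b->rel_number)) != 0)
		return c;
	if (a->fork_num != b->fork_num)
		return (a->fork_num > b->fork_num) - (a->fork_num < b->fork_num);
	return toy_cmp_u32(a->block_num, b->block_num);
}
```

Sorting items from two databases that share relfilenumber 100 now keeps each database's blocks together instead of ordering them by block number alone.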

v1-0007-Implement-checkpointer-data-write-combining.patch (text/x-patch; charset=US-ASCII)

From cdb40b2f12663bd687bae416962fdb95ff9252cc Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:42:29 -0400
Subject: [PATCH v1 7/9] Implement checkpointer data write combining

When the checkpointer writes out dirty buffers, writing multiple
contiguous blocks as a single IO is a substantial performance
improvement. The checkpointer is usually bottlenecked on IO, so issuing
larger IOs leads to increased write throughput and faster checkpoints.
---
 src/backend/storage/buffer/bufmgr.c | 232 ++++++++++++++++++++++++----
 src/backend/utils/probes.d          |   2 +-
 2 files changed, 207 insertions(+), 27 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ab0b9246759..a1d347b5966 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -512,6 +512,7 @@ static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
+static uint32 checkpointer_max_batch_size(void);
 static void BufferSync(int flags);
 static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
@@ -3335,7 +3336,6 @@ UnpinBufferNoOwner(BufferDesc *buf)
 static void
 BufferSync(int flags)
 {
-	uint32		buf_state;
 	int			buf_id;
 	int			num_to_scan;
 	int			num_spaces;
@@ -3347,6 +3347,8 @@ BufferSync(int flags)
 	int			i;
 	int			mask = BM_DIRTY;
 	WritebackContext wb_context;
+	uint32		max_batch_size;
+	BufWriteBatch batch;
 
 	/*
 	 * Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3377,6 +3379,7 @@ BufferSync(int flags)
 	for (buf_id = 0; buf_id < NBuffers; buf_id++)
 	{
 		BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+		uint32		buf_state;
 
 		/*
 		 * Header spinlock is enough to examine BM_DIRTY, see comment in
@@ -3517,48 +3520,208 @@ BufferSync(int flags)
 	 */
 	num_processed = 0;
 	num_written = 0;
+	max_batch_size = checkpointer_max_batch_size();
 	while (!binaryheap_empty(ts_heap))
 	{
+		BlockNumber limit = max_batch_size;
 		BufferDesc *bufHdr = NULL;
 		CkptTsStatus *ts_stat = (CkptTsStatus *)
 			DatumGetPointer(binaryheap_first(ts_heap));
-
-		buf_id = CkptBufferIds[ts_stat->index].buf_id;
-		Assert(buf_id != -1);
-
-		bufHdr = GetBufferDescriptor(buf_id);
-
-		num_processed++;
+		int			ts_end = ts_stat->index - ts_stat->num_scanned + ts_stat->num_to_scan;
+		int			processed = 0;
 
 		/*
-		 * We don't need to acquire the lock here, because we're only looking
-		 * at a single bit. It's possible that someone else writes the buffer
-		 * and clears the flag right after we check, but that doesn't matter
-		 * since SyncOneBuffer will then do nothing.  However, there is a
-		 * further race condition: it's conceivable that between the time we
-		 * examine the bit here and the time SyncOneBuffer acquires the lock,
-		 * someone else not only wrote the buffer but replaced it with another
-		 * page and dirtied it.  In that improbable case, SyncOneBuffer will
-		 * write the buffer though we didn't need to.  It doesn't seem worth
-		 * guarding against this, though.
+		 * Each batch will have exactly one start and one max lsn and one
+		 * length.
 		 */
-		if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
+		batch.start = InvalidBlockNumber;
+		batch.max_lsn = InvalidXLogRecPtr;
+		batch.n = 0;
+
+		while (batch.n < limit)
 		{
-			if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+			uint32		buf_state;
+			XLogRecPtr	lsn = InvalidXLogRecPtr;
+			LWLock	   *content_lock;
+			CkptSortItem item;
+
+			if (ProcSignalBarrierPending)
+				ProcessProcSignalBarrier();
+
+			/* Check if we are done with this tablespace */
+			if (ts_stat->index + processed >= ts_end)
+				break;
+
+			item = CkptBufferIds[ts_stat->index + processed];
+
+			buf_id = item.buf_id;
+			Assert(buf_id != -1);
+
+			bufHdr = GetBufferDescriptor(buf_id);
+
+			/*
+			 * If this is the first block of the batch, then check if we need
+			 * to open a new relation. Open the relation now because we have
+			 * to determine the maximum IO size based on how many blocks
+			 * remain in the file.
+			 */
+			if (!BlockNumberIsValid(batch.start))
+			{
+				Assert(batch.max_lsn == InvalidXLogRecPtr && batch.n == 0);
+				batch.rlocator.spcOid = item.tsId;
+				batch.rlocator.dbOid = item.db_id;
+				batch.rlocator.relNumber = item.relNumber;
+				batch.forkno = item.forkNum;
+				batch.start = item.blockNum;
+				batch.reln = smgropen(batch.rlocator, INVALID_PROC_NUMBER);
+				limit = smgrmaxcombine(batch.reln, batch.forkno, batch.start);
+				limit = Max(1, limit);
+				limit = Min(limit, max_batch_size);
+			}
+
+			/*
+			 * Once we hit blocks from the next relation or fork of the
+			 * relation, break out of the loop and issue the IO we've built up
+			 * so far. It is important that we don't increment processed
+			 * because we want to start the next IO with this item.
+			 */
+			if (item.db_id != batch.rlocator.dbOid)
+				break;
+
+			if (item.relNumber != batch.rlocator.relNumber)
+				break;
+
+			if (item.forkNum != batch.forkno)
+				break;
+
+			/*
+			 * If the next block is not contiguous, we can't include it in the
+			 * IO we will issue. Break out of the loop and issue what we have
+			 * so far. Do not count this item as processed -- otherwise we
+			 * will end up skipping it.
+			 */
+			if (item.blockNum != batch.start + batch.n)
+				break;
+
+			/*
+			 * We don't need to acquire the lock here, because we're only
+			 * looking at a single bit. It's possible that someone else writes
+			 * the buffer and clears the flag right after we check, but that
+			 * doesn't matter since StartBufferIO will then return false. If
+			 * the buffer doesn't need checkpointing, don't include it in the
+			 * batch we are building. We're done with the item, so count it as
+			 * processed and break out of the loop to issue the IO we have
+			 * built so far.
+			 */
+			if (!(pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED))
+			{
+				processed++;
+				break;
+			}
+
+			ReservePrivateRefCountEntry();
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+
+			buf_state = LockBufHdr(bufHdr);
+
+			/*
+			 * If the buffer doesn't need flushing, we're done with the item,
+			 * so count it as processed and break out of the loop to issue the
+			 * IO so far.
+			 */
+			if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
 			{
-				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-				PendingCheckpointerStats.buffers_written++;
-				num_written++;
+				processed++;
+				UnlockBufHdr(bufHdr, buf_state);
+				break;
+			}
+
+			PinBuffer_Locked(bufHdr);
+
+			/*
+			 * There is a race condition here: it's conceivable that between
+			 * the time we examine the buffer header for BM_CHECKPOINT_NEEDED
+			 * above and the time we acquire the content lock here, someone else
+			 * not only wrote the buffer but replaced it with another page and
+			 * dirtied it.  In that improbable case, we will write the buffer
+			 * though we didn't need to.  It doesn't seem worth guarding
+			 * against this, though.
+			 */
+			content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+			/*
+			 * We are willing to wait for the content lock on the first IO in
+			 * the batch. However, for subsequent IOs, waiting could lead to
+			 * deadlock. We have to eventually flush all eligible buffers,
+			 * though. So, if we fail to acquire the lock on a subsequent
+			 * buffer, we break out and issue the IO we've built up so far.
+			 * Then we come back and start a new IO with that buffer as the
+			 * starting buffer. As such, we must not count the item as
+			 * processed if we end up failing to acquire the content lock.
+			 */
+			if (batch.n == 0)
+				LWLockAcquire(content_lock, LW_SHARED);
+			else if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				UnpinBuffer(bufHdr);
+				break;
+			}
+
+			/*
+			 * If the buffer doesn't need IO, count the item as processed,
+			 * release the buffer, and break out of the loop to issue the IO
+			 * we have built up so far.
+			 */
+			if (!StartBufferIO(bufHdr, false, true))
+			{
+				processed++;
+				LWLockRelease(content_lock);
+				UnpinBuffer(bufHdr);
+				break;
 			}
+
+			buf_state = LockBufHdr(bufHdr);
+			lsn = BufferGetLSN(bufHdr);
+			buf_state &= ~BM_JUST_DIRTIED;
+			UnlockBufHdr(bufHdr, buf_state);
+
+			/*
+			 * Keep track of the max LSN so that we can be sure to flush
+			 * enough WAL before flushing data from the buffers. See comment
+			 * in DoFlushBuffer() for more on why we don't consider the LSNs
+			 * of unlogged relations.
+			 */
+			if (buf_state & BM_PERMANENT && lsn > batch.max_lsn)
+				batch.max_lsn = lsn;
+
+			batch.bufdescs[batch.n++] = bufHdr;
+			processed++;
 		}
 
 		/*
 		 * Measure progress independent of actually having to flush the buffer
 		 * - otherwise writing become unbalanced.
 		 */
-		ts_stat->progress += ts_stat->progress_slice;
-		ts_stat->num_scanned++;
-		ts_stat->index++;
+		num_processed += processed;
+		ts_stat->progress += ts_stat->progress_slice * processed;
+		ts_stat->num_scanned += processed;
+		ts_stat->index += processed;
+
+		/*
+		 * If we built up an IO, issue it. There's a chance we didn't find any
+		 * items referencing buffers that needed flushing this time, but we
+		 * still want to check if we should update the heap if we examined and
+		 * processed the items.
+		 */
+		if (batch.n > 0)
+		{
+			FlushBufferBatch(&batch, IOCONTEXT_NORMAL);
+			CompleteWriteBatchIO(&batch, &wb_context, IOCONTEXT_NORMAL);
+
+			TRACE_POSTGRESQL_BUFFER_BATCH_SYNC_WRITTEN(batch.n);
+			PendingCheckpointerStats.buffers_written += batch.n;
+			num_written += batch.n;
+		}
 
 		/* Have all the buffers from the tablespace been processed? */
 		if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -4284,6 +4447,23 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * The maximum number of blocks that can be written out in a single batch by
+ * the checkpointer.
+ */
+static uint32
+checkpointer_max_batch_size(void)
+{
+	uint32		result;
+	uint32		pin_limit = GetPinLimit();
+
+	result = Max(pin_limit, 1);
+	result = Min(result, io_combine_limit);
+	result = Max(result, 1);
+	Assert(result <= MAX_IO_COMBINE_LIMIT);
+	return result;
+}
+
 
 /*
  * Given a buffer descriptor, start, from a strategy ring, strategy, that
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 36dd4f8375b..d6970731ba9 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -68,7 +68,7 @@ provider postgresql {
 	probe buffer__checkpoint__sync__start();
 	probe buffer__checkpoint__done();
 	probe buffer__sync__start(int, int);
-	probe buffer__sync__written(int);
+	probe buffer__batch__sync__written(BlockNumber);
 	probe buffer__sync__done(int, int, int);
 
 	probe deadlock__found();
-- 
2.43.0
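
The core of the batching loop in 0007 is deciding how many consecutive sorted items can go into one write. The sketch below isolates just that decision, assuming the items are already sorted as in CkptBufferIds. ToyItem and toy_batch_length() are hypothetical names; the real loop in BufferSync() additionally stops at the tablespace boundary, honors smgrmaxcombine(), and ends a batch when a buffer no longer needs checkpointing, cannot be locked without waiting, or fails StartBufferIO().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the fields the batching loop compares. */
typedef struct ToyItem
{
	uint32_t	db_id;
	uint32_t	rel_number;
	int			fork_num;
	uint32_t	block_num;
} ToyItem;

/*
 * Return how many items, starting at items[first], can be combined into one
 * write: same database, relation and fork, strictly consecutive block
 * numbers, and at most limit blocks.
 */
size_t
toy_batch_length(const ToyItem *items, size_t nitems, size_t first,
				 size_t limit)
{
	const ToyItem *head;
	size_t		n = 0;

	assert(limit >= 1);
	if (first >= nitems)
		return 0;

	head = &items[first];
	while (n < limit && first + n < nitems)
	{
		const ToyItem *item = &items[first + n];

		/* A different relation or fork ends the batch. */
		if (item->db_id != head->db_id ||
			item->rel_number != head->rel_number ||
			item->fork_num != head->fork_num)
			break;

		/* So does a gap in the block numbers. */
		if (item->block_num != head->block_num + (uint32_t) n)
			break;

		n++;
	}

	return n;
}
```

An item that ends a batch is not consumed, so the caller starts the next batch with it. That matches the patch's rule of not counting such an item as processed.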

Melanie Plageman

melanieplageman@gmail.com

4 months ago

In reply to: Melanie Plageman (#1)

7 attachment(s)

Re: Checkpointer write combining

On Tue, Sep 2, 2025 at 5:10 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

> The attached patchset implements checkpointer write combining -- which
> makes immediate checkpoints at least 20% faster in my tests.
> Checkpointer achieves higher write throughput and higher write IOPs
> with the patch.

These needed a rebase. Attached v2.

- Melanie

Attachments:

v2-0002-Split-FlushBuffer-into-two-parts.patch (text/x-patch; charset=US-ASCII)

From 7c8e7111f321f3e5f4dd32e865f3612162754981 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:32:24 -0400
Subject: [PATCH v2 2/9] Split FlushBuffer() into two parts

Before adding write combining to write a batch of blocks when flushing
dirty buffers, refactor FlushBuffer() into the preparatory step and
actual buffer flushing step. This provides better symmetry with the
batch flushing code.
---
 src/backend/storage/buffer/bufmgr.c | 103 ++++++++++++++++++++--------
 1 file changed, 76 insertions(+), 27 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f3668051574..84ff5e0f1bf 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -529,8 +529,13 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
 static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
 static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc, uint32 *buf_state, XLogRecPtr *lsn);
+static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+						  IOContext io_context, XLogRecPtr buffer_lsn);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+static void CleanVictimBuffer(BufferAccessStrategy strategy,
+							  BufferDesc *bufdesc, uint32 *buf_state, bool from_ring);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2414,12 +2419,7 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 				continue;
 			}
 
-			/* OK, do the I/O */
-			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-			LWLockRelease(content_lock);
-
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &buf_hdr->tag);
+			CleanVictimBuffer(strategy, buf_hdr, &buf_state, from_ring);
 		}
 
 		if (buf_state & BM_VALID)
@@ -4246,20 +4246,81 @@ static void
 FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 			IOContext io_context)
 {
-	XLogRecPtr	recptr;
-	ErrorContextCallback errcallback;
-	instr_time	io_start;
-	Block		bufBlock;
-	char	   *bufToWrite;
 	uint32		buf_state;
+	XLogRecPtr	lsn;
+
+	if (PrepareFlushBuffer(buf, &buf_state, &lsn))
+		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
+}
+
+/*
+ * Prepare to write and write a dirty victim buffer.
+ */
+static void
+CleanVictimBuffer(BufferAccessStrategy strategy,
+				  BufferDesc *bufdesc, uint32 *buf_state, bool from_ring)
+{
+
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	LWLock	   *content_lock;
+	IOContext	io_context = IOContextForStrategy(strategy);
+
+	Assert(*buf_state & BM_DIRTY);
+
+	/* Set up this victim buffer to be flushed */
+	if (!PrepareFlushBuffer(bufdesc, buf_state, &max_lsn))
+		return;
 
+	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	LWLockRelease(content_lock);
+	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+								  &bufdesc->tag);
+}
+
+/*
+ * Prepare the buffer with bufdesc for writing. buf_state and lsn are output
+ * parameters. Returns true if the buffer actually needs writing and false
+ * otherwise.
+ */
+static bool
+PrepareFlushBuffer(BufferDesc *bufdesc, uint32 *buf_state, XLogRecPtr *lsn)
+{
 	/*
 	 * Try to start an I/O operation.  If StartBufferIO returns false, then
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false, false))
-		return;
+	if (!StartBufferIO(bufdesc, false, false))
+		return false;
+
+	*lsn = InvalidXLogRecPtr;
+	*buf_state = LockBufHdr(bufdesc);
+
+	/*
+	 * Run PageGetLSN while holding header lock, since we don't have the
+	 * buffer locked exclusively in all cases.
+	 */
+	if (*buf_state & BM_PERMANENT)
+		*lsn = BufferGetLSN(bufdesc);
+
+	/* To check if block content changes while flushing. - vadim 01/17/97 */
+	*buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, *buf_state);
+	return true;
+}
+
+/*
+ * Actually do the write I/O to clean a buffer.
+ */
+static void
+DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+			  IOContext io_context, XLogRecPtr buffer_lsn)
+{
+	ErrorContextCallback errcallback;
+	instr_time	io_start;
+	Block		bufBlock;
+	char	   *bufToWrite;
 
 	/* Setup error traceback support for ereport() */
 	errcallback.callback = shared_buffer_write_error_callback;
@@ -4277,18 +4338,6 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 										reln->smgr_rlocator.locator.dbOid,
 										reln->smgr_rlocator.locator.relNumber);
 
-	buf_state = LockBufHdr(buf);
-
-	/*
-	 * Run PageGetLSN while holding header lock, since we don't have the
-	 * buffer locked exclusively in all cases.
-	 */
-	recptr = BufferGetLSN(buf);
-
-	/* To check if block content changes while flushing. - vadim 01/17/97 */
-	buf_state &= ~BM_JUST_DIRTIED;
-	UnlockBufHdr(buf, buf_state);
-
 	/*
 	 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
 	 * rule that log updates must hit disk before any of the data-file changes
@@ -4306,8 +4355,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * disastrous system-wide consequences.  To make sure that can't happen,
 	 * skip the flush if the buffer isn't permanent.
 	 */
-	if (buf_state & BM_PERMANENT)
-		XLogFlush(recptr);
+	if (!XLogRecPtrIsInvalid(buffer_lsn))
+		XLogFlush(buffer_lsn);
 
 	/*
 	 * Now it's safe to write the buffer to disk. Note that no one else should
-- 
2.43.0
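
The split into a preparatory step and a write step in 0002 can be modeled outside the buffer manager. The following is a sketch under stated assumptions: SketchBuffer, sketch_prepare_flush, sketch_do_flush and the two counters are hypothetical, a plain dirty flag stands in for the StartBufferIO() check, and "flushing WAL" is reduced to advancing an integer. It only shows the control flow: the LSN is captured during preparation, and an invalid LSN (non-permanent buffer) means the write step skips the WAL flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_INVALID_LSN 0	/* stand-in for InvalidXLogRecPtr */

/* Hypothetical, heavily simplified buffer header. */
typedef struct SketchBuffer
{
	bool		dirty;			/* still needs writing */
	bool		permanent;		/* models BM_PERMANENT */
	bool		just_dirtied;	/* models BM_JUST_DIRTIED */
	uint64_t	page_lsn;
} SketchBuffer;

static uint64_t sketch_wal_flushed_upto = 0;	/* models the WAL flush point */
static int	sketch_writes = 0;	/* number of data-file writes issued */

/* Step 1: decide whether a write is needed and capture the LSN. */
static bool
sketch_prepare_flush(SketchBuffer *buf, uint64_t *lsn)
{
	if (!buf->dirty)
		return false;			/* someone else already flushed it */

	/* Only permanent buffers require WAL to be flushed first. */
	*lsn = buf->permanent ? buf->page_lsn : SKETCH_INVALID_LSN;
	buf->just_dirtied = false;
	return true;
}

/* Step 2: flush WAL up to the captured LSN if needed, then write. */
static void
sketch_do_flush(SketchBuffer *buf, uint64_t lsn)
{
	if (lsn != SKETCH_INVALID_LSN && lsn > sketch_wal_flushed_upto)
		sketch_wal_flushed_upto = lsn;
	sketch_writes++;
	buf->dirty = false;
}

/* The single-buffer path is just the two steps back to back. */
static void
sketch_flush(SketchBuffer *buf)
{
	uint64_t	lsn;

	if (sketch_prepare_flush(buf, &lsn))
		sketch_do_flush(buf, lsn);
}
```

Keeping the steps separate is what lets a batch path prepare several buffers, track the maximum LSN across them, and then perform one WAL flush and one vectored write.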

v2-0003-Eagerly-flush-bulkwrite-strategy-ring.patch (text/x-patch; charset=US-ASCII)

From 640c733261dec34686095ffb3ec64b717aacf830 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 12:43:24 -0400
Subject: [PATCH v2 3/9] Eagerly flush bulkwrite strategy ring

Operations using BAS_BULKWRITE (COPY FROM and createdb) will inevitably
need to flush buffers in the strategy ring in order to reuse
them. By eagerly flushing the buffers in a larger batch, we encourage
larger writes at the kernel level and less interleaving of WAL flushes
and data file writes. The effect is mainly noticeable with multiple
parallel COPY FROMs. In this case, client backends achieve higher write
throughput and end up spending less time waiting on acquiring the lock
to flush WAL. Larger flush operations also mean less time waiting for
flush operations at the kernel level as well.

The heuristic for eager eviction is to only flush buffers in the
strategy ring that can be flushed without first flushing WAL.

This patch also is a stepping stone toward AIO writes.

Earlier version
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 166 +++++++++++++++++++++++++-
 src/backend/storage/buffer/freelist.c |  63 ++++++++++
 src/include/storage/buf_internals.h   |   3 +
 3 files changed, 229 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 84ff5e0f1bf..90f36a04c19 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -534,6 +534,11 @@ static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object
 						  IOContext io_context, XLogRecPtr buffer_lsn);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+static BufferDesc *next_strat_buf_to_flush(BufferAccessStrategy strategy, XLogRecPtr *lsn);
+static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+												   RelFileLocator *rlocator,
+												   bool skip_pinned,
+												   XLogRecPtr *max_lsn);
 static void CleanVictimBuffer(BufferAccessStrategy strategy,
 							  BufferDesc *bufdesc, uint32 *buf_state, bool from_ring);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
@@ -4253,6 +4258,31 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * Returns the buffer descriptor of the buffer containing the next block we
+ * should eagerly flush or NULL when there are no further buffers to
+ * consider writing out.
+ */
+static BufferDesc *
+next_strat_buf_to_flush(BufferAccessStrategy strategy,
+						XLogRecPtr *lsn)
+{
+	Buffer		bufnum;
+	BufferDesc *bufdesc;
+
+	while ((bufnum = StrategySweepNextBuffer(strategy)) != InvalidBuffer)
+	{
+		if ((bufdesc = PrepareOrRejectEagerFlushBuffer(bufnum,
+													   InvalidBlockNumber,
+													   NULL,
+													   true,
+													   lsn)) != NULL)
+			return bufdesc;
+	}
+
+	return NULL;
+}
+
 /*
  * Prepare to write and write a dirty victim buffer.
  */
@@ -4263,6 +4293,7 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
+	bool		first_buffer = true;
 	IOContext	io_context = IOContextForStrategy(strategy);
 
 	Assert(*buf_state & BM_DIRTY);
@@ -4271,11 +4302,140 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 	if (!PrepareFlushBuffer(bufdesc, buf_state, &max_lsn))
 		return;
 
-	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	if (from_ring && strategy_supports_eager_flush(strategy))
+	{
+		/* Clean victim buffer and find more to flush opportunistically */
+		StartStrategySweep(strategy);
+		do
+		{
+			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+			content_lock = BufferDescriptorGetContentLock(bufdesc);
+			LWLockRelease(content_lock);
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &bufdesc->tag);
+			/* We leave the first buffer pinned for the caller */
+			if (!first_buffer)
+				UnpinBuffer(bufdesc);
+			first_buffer = false;
+		} while ((bufdesc = next_strat_buf_to_flush(strategy, &max_lsn)) != NULL);
+	}
+	else
+	{
+		DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+		content_lock = BufferDescriptorGetContentLock(bufdesc);
+		LWLockRelease(content_lock);
+		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+									  &bufdesc->tag);
+	}
+}
+
+/*
+ * Prepare bufdesc for eager flushing.
+ *
+ * Given bufnum, returns the buffer descriptor of the buffer which we will
+ * opportunistically flush or NULL if this buffer does not contain a block
+ * that should be flushed.
+ *
+ * require is the BlockNumber required by the caller. Some callers may require
+ * a specific BlockNumber to be in bufnum because they are assembling a
+ * contiguous run of blocks.
+ *
+ * If the caller needs the block to be from a specific relation, rlocator will
+ * be provided.
+ */
+BufferDesc *
+PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+								RelFileLocator *rlocator, bool skip_pinned,
+								XLogRecPtr *max_lsn)
+{
+	BufferDesc *bufdesc;
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+	BlockNumber blknum;
+	LWLock	   *content_lock;
+
+	if (!BufferIsValid(bufnum))
+		return NULL;
+
+	Assert(!BufferIsLocal(bufnum));
+
+	bufdesc = GetBufferDescriptor(bufnum - 1);
+
+	/* Block may need to be in a specific relation */
+	if (rlocator &&
+		!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufdesc->tag),
+							  *rlocator))
+		return NULL;
+
+	/* Must do this before taking the buffer header spinlock. */
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	ReservePrivateRefCountEntry();
+
+	buf_state = LockBufHdr(bufdesc);
+
+	if (!(buf_state & BM_DIRTY) || !(buf_state & BM_VALID))
+		goto except_unlock_header;
+
+	/* We don't include used buffers in batches */
+	if (skip_pinned &&
+		(BUF_STATE_GET_REFCOUNT(buf_state) > 0 ||
+		 BUF_STATE_GET_USAGECOUNT(buf_state) > 1))
+		goto except_unlock_header;
+
+	/* Get page LSN while holding header lock */
+	lsn = BufferGetLSN(bufdesc);
+
+	PinBuffer_Locked(bufdesc);
+	CheckBufferIsPinnedOnce(bufnum);
+
+	blknum = BufferGetBlockNumber(bufnum);
+	Assert(BlockNumberIsValid(blknum));
+
+	/* If we'll have to flush WAL to flush the block, we're done */
+	if (buf_state & BM_PERMANENT && XLogNeedsFlush(lsn))
+		goto except_unpin_buffer;
+
+	/* We only include contiguous blocks in the run */
+	if (BlockNumberIsValid(require) && blknum != require)
+		goto except_unpin_buffer;
+
 	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+		goto except_unpin_buffer;
+
+	/*
+	 * Now that we have the content lock, we need to recheck if we need to
+	 * flush WAL.
+	 */
+	buf_state = LockBufHdr(bufdesc);
+	lsn = BufferGetLSN(bufdesc);
+	UnlockBufHdr(bufdesc, buf_state);
+
+	if (buf_state & BM_PERMANENT && XLogNeedsFlush(lsn))
+		goto except_unlock_content;
+
+	/* Try to start an I/O operation. */
+	if (!StartBufferIO(bufdesc, false, true))
+		goto except_unlock_content;
+
+	if (lsn > *max_lsn)
+		*max_lsn = lsn;
+	buf_state = LockBufHdr(bufdesc);
+	buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, buf_state);
+
+	return bufdesc;
+
+except_unlock_content:
 	LWLockRelease(content_lock);
-	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-								  &bufdesc->tag);
+
+except_unpin_buffer:
+	UnpinBuffer(bufdesc);
+	return NULL;
+
+except_unlock_header:
+	UnlockBufHdr(bufdesc, buf_state);
+	return NULL;
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index ec6e0f86816..dd1d48a88fb 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -75,6 +75,15 @@ typedef struct BufferAccessStrategyData
 	 */
 	int			current;
 
+	/*
+	 * If the strategy supports eager flushing, we may initiate a sweep of the
+	 * strategy ring, flushing all the dirty buffers we can cheaply flush.
+	 * sweep_start and sweep_current keep track of a given sweep so we don't
+	 * loop around the ring infinitely.
+	 */
+	int			sweep_start;
+	int			sweep_current;
+
 	/*
 	 * Array of buffer numbers.  InvalidBuffer (that is, zero) indicates we
 	 * have not yet selected a buffer for this ring slot.  For allocation
@@ -156,6 +165,31 @@ ClockSweepTick(void)
 	return victim;
 }
 
+/*
+ * Some BufferAccessStrategies support eager flushing -- which is flushing
+ * buffers in the ring before they are needed. This can lead to better I/O
+ * patterns than lazily flushing buffers directly before reusing them.
+ */
+bool
+strategy_supports_eager_flush(BufferAccessStrategy strategy)
+{
+	Assert(strategy);
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKWRITE:
+			return true;
+		case BAS_VACUUM:
+		case BAS_NORMAL:
+		case BAS_BULKREAD:
+			return false;
+		default:
+			elog(ERROR, "unrecognized buffer access strategy: %d",
+				 (int) strategy->btype);
+			return false;
+	}
+}
+
 /*
  * StrategyGetBuffer
  *
@@ -270,6 +304,35 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	}
 }
 
+/*
+ * Return the next buffer in the ring or InvalidBuffer if the current sweep is
+ * over.
+ */
+Buffer
+StrategySweepNextBuffer(BufferAccessStrategy strategy)
+{
+	strategy->sweep_current++;
+	if (strategy->sweep_current >= strategy->nbuffers)
+		strategy->sweep_current = 0;
+
+	if (strategy->sweep_current == strategy->sweep_start)
+		return InvalidBuffer;
+
+	return strategy->buffers[strategy->sweep_current];
+}
+
+/*
+ * Start a sweep of the strategy ring.
+ */
+void
+StartStrategySweep(BufferAccessStrategy strategy)
+{
+	if (!strategy)
+		return;
+	strategy->sweep_start = strategy->sweep_current = strategy->current;
+}
+
+
 /*
  * StrategySyncStart -- tell BgBufferSync where to start syncing
  *
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b1b81f31419..7963d1189a6 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -437,6 +437,9 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 
 /* freelist.c */
+extern bool strategy_supports_eager_flush(BufferAccessStrategy strategy);
+extern Buffer StrategySweepNextBuffer(BufferAccessStrategy strategy);
+extern void StartStrategySweep(BufferAccessStrategy strategy);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
-- 
2.43.0
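
The sweep bookkeeping that 0003 adds to freelist.c can be exercised on its own. This is a minimal sketch, assuming a fixed-size ring of integer buffer numbers: SketchRing, sketch_start_sweep and sketch_sweep_next are hypothetical names that mirror StartStrategySweep() and StrategySweepNextBuffer(), and 0 stands in for InvalidBuffer.

```c
#include <assert.h>

#define SKETCH_INVALID_BUFFER 0 /* stand-in for InvalidBuffer */

/* Hypothetical, simplified strategy ring. */
typedef struct SketchRing
{
	int			nbuffers;		/* number of slots in the ring */
	int			current;		/* slot of the buffer being reused */
	int			sweep_start;	/* slot where the sweep began */
	int			sweep_current;	/* slot most recently visited */
	int			buffers[8];		/* buffer numbers, 0 means empty slot */
} SketchRing;

/* Begin a sweep at the slot currently being reused. */
static void
sketch_start_sweep(SketchRing *ring)
{
	ring->sweep_start = ring->sweep_current = ring->current;
}

/*
 * Advance to the next slot, wrapping at the end of the ring. Returns
 * SKETCH_INVALID_BUFFER once the sweep is back where it started.
 */
static int
sketch_sweep_next(SketchRing *ring)
{
	ring->sweep_current++;
	if (ring->sweep_current >= ring->nbuffers)
		ring->sweep_current = 0;

	if (ring->sweep_current == ring->sweep_start)
		return SKETCH_INVALID_BUFFER;

	return ring->buffers[ring->sweep_current];
}
```

Starting at slot 2 of a four-slot ring, the sweep visits slots 3, 0 and 1 and then stops; the starting slot is never returned because it holds the victim the caller is already flushing. In this model an empty slot also reads as 0, so a caller looping until SKETCH_INVALID_BUFFER cannot tell an empty slot from the end of the sweep.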

v2-0004-Write-combining-for-BAS_BULKWRITE.patch (text/x-patch; charset=US-ASCII)

From 11a2b7206d483b04d3be1f111469d0384ae75264 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 12:56:38 -0400
Subject: [PATCH v2 4/9] Write combining for BAS_BULKWRITE

Implement write combining for users of the bulkwrite buffer access
strategy (e.g. COPY FROM). When the buffer access strategy needs to
clean a buffer for reuse, it already opportunistically flushes some
other buffers. Now, combine any contiguous blocks from the same relation
into larger writes and issue them with smgrwritev().

The performance benefit for COPY FROM is mostly noticeable for multiple
concurrent COPY FROMs because a single COPY FROM is either CPU bound or
bound by WAL writes.

The infrastructure for flushing larger batches of IOs will be reused by
checkpointer and other processes doing writes of dirty data.
---
 src/backend/storage/buffer/bufmgr.c   | 198 ++++++++++++++++++++++++--
 src/backend/storage/buffer/freelist.c |  26 ++++
 src/backend/storage/page/bufpage.c    |  20 +++
 src/backend/utils/probes.d            |   2 +
 src/include/storage/buf_internals.h   |  32 +++++
 src/include/storage/bufpage.h         |   1 +
 src/tools/pgindent/typedefs.list      |   1 +
 7 files changed, 269 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 90f36a04c19..ade83adca59 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -539,6 +539,8 @@ static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber re
 												   RelFileLocator *rlocator,
 												   bool skip_pinned,
 												   XLogRecPtr *max_lsn);
+static void FindFlushAdjacents(BufferAccessStrategy strategy, BufferDesc *start,
+							   uint32 max_batch_size, BufWriteBatch *batch);
 static void CleanVictimBuffer(BufferAccessStrategy strategy,
 							  BufferDesc *bufdesc, uint32 *buf_state, bool from_ring);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
@@ -4258,10 +4260,73 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+
+/*
+ * Given a buffer descriptor, start, from a strategy ring, strategy, that
+ * supports eager flushing, find additional buffers from the ring that can be
+ * combined into a single write batch with this buffer.
+ *
+ * max_batch_size is the maximum number of blocks that can be combined into a
+ * single write in general. This function, based on the block number of start,
+ * will determine the maximum IO size for this particular write given how much
+ * of the file remains. max_batch_size is provided by the caller so it doesn't
+ * have to be recalculated for each write.
+ *
+ * batch is an output parameter that this function will fill with the needed
+ * information to write this IO.
+ *
+ * This function will pin and content lock all of the buffers that it
+ * assembles for the IO batch. The caller is responsible for issuing the IO.
+ */
+static void
+FindFlushAdjacents(BufferAccessStrategy strategy, BufferDesc *start,
+				   uint32 max_batch_size, BufWriteBatch *batch)
+{
+	BlockNumber limit;
+	uint32		buf_state;
+
+	Assert(start);
+	batch->bufdescs[0] = start;
+
+	buf_state = LockBufHdr(start);
+	batch->max_lsn = BufferGetLSN(start);
+	UnlockBufHdr(start, buf_state);
+
+	batch->start = batch->bufdescs[0]->tag.blockNum;
+	batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+	batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
+	batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+	Assert(BlockNumberIsValid(batch->start));
+
+	limit = smgrmaxcombine(batch->reln, batch->forkno, batch->start);
+	limit = Max(limit, 1);
+	limit = Min(max_batch_size, limit);
+
+	/* Now assemble a run of blocks to write out. */
+	for (batch->n = 1; batch->n < limit; batch->n++)
+	{
+		Buffer		bufnum;
+
+		if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
+			break;
+
+		/* Stop when we encounter a buffer that will break the run */
+		if ((batch->bufdescs[batch->n] =
+			 PrepareOrRejectEagerFlushBuffer(bufnum,
+											 batch->start + batch->n,
+											 &batch->rlocator,
+											 true,
+											 &batch->max_lsn)) == NULL)
+			break;
+	}
+}
+
 /*
  * Returns the buffer descriptor of the buffer containing the next block we
  * should eagerly flush or NULL when there are no further buffers to
- * consider writing out.
+ * consider writing out. This will be the start of a new batch of buffers to
+ * write out.
  */
 static BufferDesc *
 next_strat_buf_to_flush(BufferAccessStrategy strategy,
@@ -4293,7 +4358,6 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
-	bool		first_buffer = true;
 	IOContext	io_context = IOContextForStrategy(strategy);
 
 	Assert(*buf_state & BM_DIRTY);
@@ -4304,19 +4368,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
 	if (from_ring && strategy_supports_eager_flush(strategy))
 	{
+		uint32		max_batch_size = max_write_batch_size_for_strategy(strategy);
+
+		/* Pin our victim again so it stays ours even after batch released */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+		IncrBufferRefCount(BufferDescriptorGetBuffer(bufdesc));
+
 		/* Clean victim buffer and find more to flush opportunistically */
 		StartStrategySweep(strategy);
 		do
 		{
-			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
-			content_lock = BufferDescriptorGetContentLock(bufdesc);
-			LWLockRelease(content_lock);
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &bufdesc->tag);
-			/* We leave the first buffer pinned for the caller */
-			if (!first_buffer)
-				UnpinBuffer(bufdesc);
-			first_buffer = false;
+			BufWriteBatch batch;
+
+			FindFlushAdjacents(strategy, bufdesc, max_batch_size, &batch);
+			FlushBufferBatch(&batch, io_context);
+			CompleteWriteBatchIO(&batch, &BackendWritebackContext, io_context);
 		} while ((bufdesc = next_strat_buf_to_flush(strategy, &max_lsn)) != NULL);
 	}
 	else
@@ -4438,6 +4505,73 @@ except_unlock_header:
 	return NULL;
 }
 
+/*
+ * Given a prepared batch of buffers, write them out as a vector.
+ */
+void
+FlushBufferBatch(BufWriteBatch *batch,
+				 IOContext io_context)
+{
+	BlockNumber blknums[MAX_IO_COMBINE_LIMIT];
+	Block		blocks[MAX_IO_COMBINE_LIMIT];
+	instr_time	io_start;
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+
+	if (!XLogRecPtrIsInvalid(batch->max_lsn))
+		XLogFlush(batch->max_lsn);
+
+	if (batch->reln == NULL)
+		batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+#ifdef USE_ASSERT_CHECKING
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		BufferDesc *bufdesc = batch->bufdescs[i];
+		uint32		buf_state = LockBufHdr(bufdesc);
+		XLogRecPtr	lsn = BufferGetLSN(bufdesc);
+
+		UnlockBufHdr(bufdesc, buf_state);
+		Assert(!(buf_state & BM_PERMANENT) || !XLogNeedsFlush(lsn));
+	}
+#endif
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_START(batch->forkno,
+											  batch->reln->smgr_rlocator.locator.spcOid,
+											  batch->reln->smgr_rlocator.locator.dbOid,
+											  batch->reln->smgr_rlocator.locator.relNumber,
+											  batch->reln->smgr_rlocator.backend,
+											  batch->n);
+
+	/*
+	 * XXX: All blocks should be copied and then checksummed but doing so
+	 * takes a lot of extra memory and a future patch will eliminate this
+	 * requirement.
+	 */
+	for (BlockNumber i = 0; i < batch->n; i++)
+	{
+		blknums[i] = batch->start + i;
+		blocks[i] = BufHdrGetBlock(batch->bufdescs[i]);
+	}
+
+	PageSetBatchChecksumInplace((Page *) blocks, blknums, batch->n);
+
+	io_start = pgstat_prepare_io_time(track_io_timing);
+
+	smgrwritev(batch->reln, batch->forkno,
+			   batch->start, (const void **) blocks, batch->n, false);
+
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context, IOOP_WRITE,
+							io_start, batch->n, BLCKSZ);
+
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Prepare the buffer with bufdesc for writing. buf_state and lsn are output
  * parameters. Returns true if the buffer actually needs writing and false
@@ -4583,6 +4717,48 @@ DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	error_context_stack = errcallback.previous;
 }
 
+/*
+ * Given a previously initialized batch with buffers that have already been
+ * flushed, terminate the IO on each buffer and then unlock and unpin them.
+ * This assumes all the buffers were locked and pinned. wb_context will be
+ * modified.
+ */
+void
+CompleteWriteBatchIO(BufWriteBatch *batch,
+					 WritebackContext *wb_context, IOContext io_context)
+{
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+	pgBufferUsage.shared_blks_written += batch->n;
+
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		Buffer		buffer = BufferDescriptorGetBuffer(batch->bufdescs[i]);
+
+		errcallback.arg = batch->bufdescs[i];
+
+		/* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state. */
+		TerminateBufferIO(batch->bufdescs[i], true, 0, true, false);
+		LWLockRelease(BufferDescriptorGetContentLock(batch->bufdescs[i]));
+		ReleaseBuffer(buffer);
+		ScheduleBufferTagForWriteback(wb_context, io_context,
+									  &batch->bufdescs[i]->tag);
+	}
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_DONE(batch->forkno,
+											 batch->reln->smgr_rlocator.locator.spcOid,
+											 batch->reln->smgr_rlocator.locator.dbOid,
+											 batch->reln->smgr_rlocator.locator.relNumber,
+											 batch->reln->smgr_rlocator.backend,
+											 batch->n, batch->start);
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * RelationGetNumberOfBlocksInFork
  *		Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index dd1d48a88fb..c123e3913ca 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -727,6 +727,32 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	return NULL;
 }
 
+
+/*
+ * Determine the largest IO we can assemble from the given strategy ring given
+ * strategy-specific as well as global constraints on the number of pinned
+ * buffers and max IO size.
+ */
+uint32
+max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
+{
+	uint32		max_possible_buffer_limit;
+	uint32		max_write_batch_size;
+	int			strategy_pin_limit;
+
+	max_write_batch_size = io_combine_limit;
+
+	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+	max_possible_buffer_limit = GetPinLimit();
+
+	max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+	max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+	max_write_batch_size = Max(1, max_write_batch_size);
+	max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
+	Assert(max_write_batch_size <= MAX_IO_COMBINE_LIMIT);
+	return max_write_batch_size;
+}
+
 /*
  * AddBufferToRing -- add a buffer to the buffer ring
  *
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index dbb49ed9197..fc749dd5a50 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
 	((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
 }
+
+/*
+ * A helper to set multiple blocks' checksums.
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, BlockNumber *blknos, uint32 length)
+{
+	/* If we don't need a checksum, just return */
+	if (!DataChecksumsEnabled())
+		return;
+
+	for (uint32 i = 0; i < length; i++)
+	{
+		Page		page = pages[i];
+
+		if (PageIsNew(page))
+			continue;
+		((PageHeader) page)->pd_checksum = pg_checksum_page(page, blknos[i]);
+	}
+}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index e9e413477ba..36dd4f8375b 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -61,6 +61,8 @@ provider postgresql {
 	probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
 	probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
 	probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
+	probe buffer__batch__flush__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
+	probe buffer__batch__flush__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
 
 	probe buffer__checkpoint__start(int);
 	probe buffer__checkpoint__sync__start();
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 7963d1189a6..d1f0ecb7ca4 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -416,6 +416,34 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 	ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
 }
 
+/*
+ * Used to write out multiple blocks at a time in a combined IO. bufdescs
+ * contains buffer descriptors for buffers containing adjacent blocks of the
+ * same fork of the same relation.
+ */
+typedef struct BufWriteBatch
+{
+	RelFileLocator rlocator;
+	ForkNumber	forkno;
+	SMgrRelation reln;
+
+	/*
+	 * The BlockNumber of the first block in the run of contiguous blocks to
+	 * be written out as a single IO.
+	 */
+	BlockNumber start;
+
+	/*
+	 * While assembling the buffers, we keep track of the maximum LSN so that
+	 * we can flush WAL through this LSN before flushing the buffers.
+	 */
+	XLogRecPtr	max_lsn;
+
+	/* The number of valid buffers in bufdescs */
+	uint32		n;
+	BufferDesc *bufdescs[MAX_IO_COMBINE_LIMIT];
+} BufWriteBatch;
+
 /*
  * Internal buffer management routines
  */
@@ -429,6 +457,7 @@ extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
 										  IOContext io_context, BufferTag *tag);
+extern void FlushBufferBatch(BufWriteBatch *batch, IOContext io_context);
 
 /* solely to make it easier to write tests */
 extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
@@ -438,8 +467,11 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 /* freelist.c */
 extern bool strategy_supports_eager_flush(BufferAccessStrategy strategy);
+extern uint32 max_write_batch_size_for_strategy(BufferAccessStrategy strategy);
 extern Buffer StrategySweepNextBuffer(BufferAccessStrategy strategy);
 extern void StartStrategySweep(BufferAccessStrategy strategy);
+extern void CompleteWriteBatchIO(BufWriteBatch *batch, WritebackContext *wb_context,
+								 IOContext io_context);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index aeb67c498c5..1020cb3ac78 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -507,5 +507,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 									Item newtup, Size newsize);
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern void PageSetBatchChecksumInplace(Page *pages, BlockNumber *blknos, uint32 length);
 
 #endif							/* BUFPAGE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..9492adeee58 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -349,6 +349,7 @@ BufferManagerRelation
 BufferStrategyControl
 BufferTag
 BufferUsage
+BufWriteBatch
 BuildAccumulator
 BuiltinScript
 BulkInsertState
-- 
2.43.0
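
As a rough illustration of how a structure like BufWriteBatch can be filled, the following standalone sketch collects buffers for one contiguous run and tracks the highest LSN seen. The names ToyWriteBatch, toy_batch_init, toy_batch_add, and TOY_MAX_BATCH are hypothetical stand-ins and are not part of the patch; the sketch also assumes, as the later BufferSync() changes suggest, that only permanent (WAL-logged) buffers contribute to max_lsn.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_BATCH 8     /* hypothetical stand-in for MAX_IO_COMBINE_LIMIT */

/* Hypothetical, simplified stand-in for BufWriteBatch. */
typedef struct ToyWriteBatch
{
    uint32_t    start;      /* first block number of the contiguous run */
    uint64_t    max_lsn;    /* highest LSN among logged buffers added */
    uint32_t    n;          /* number of valid entries in buf_ids */
    int         buf_ids[TOY_MAX_BATCH];
} ToyWriteBatch;

/* Reset the batch so that a new run starting at block "start" can be built. */
static void
toy_batch_init(ToyWriteBatch *batch, uint32_t start)
{
    batch->start = start;
    batch->max_lsn = 0;
    batch->n = 0;
}

/*
 * Append one buffer to the batch. Returns false if the batch is full.
 * Only permanent buffers raise max_lsn, since WAL only needs to be
 * flushed on their behalf.
 */
static bool
toy_batch_add(ToyWriteBatch *batch, int buf_id, uint64_t lsn, bool permanent)
{
    if (batch->n >= TOY_MAX_BATCH)
        return false;
    if (permanent && lsn > batch->max_lsn)
        batch->max_lsn = lsn;
    batch->buf_ids[batch->n++] = buf_id;
    return true;
}
```

With this shape, the caller would flush WAL once through max_lsn and then issue a single write covering n blocks beginning at start.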

v2-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch (text/x-patch; charset=US-ASCII)

From 053dd9d15416d76ce4b95044d848f51ba13a2d20 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:00:44 -0400
Subject: [PATCH v2 1/9] Refactor goto into for loop in GetVictimBuffer()

GetVictimBuffer() implemented a loop to optimistically lock a clean
victim buffer using a goto. Future commits will add batch flushing
functionality to GetVictimBuffer. The new logic works better with a
regular for loop flow control.

This commit is only a refactor and does not introduce any new
functionality.
---
 src/backend/storage/buffer/bufmgr.c   | 200 ++++++++++++--------------
 src/backend/storage/buffer/freelist.c |  17 +++
 src/include/storage/buf_internals.h   |   5 +
 3 files changed, 116 insertions(+), 106 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fe470de63f2..f3668051574 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -68,10 +68,6 @@
 #include "utils/timestamp.h"
 
 
-/* Note: these two macros only work on shared buffers, not local ones! */
-#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
-#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
-
 /* Note: this macro only works on local buffers, not shared ones! */
 #define LocalBufHdrGetBlock(bufHdr) \
 	LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]
@@ -2344,130 +2340,122 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
-	/* we return here if a prospective victim buffer gets used concurrently */
-again:
-
-	/*
-	 * Select a victim buffer.  The buffer is returned with its header
-	 * spinlock still held!
-	 */
-	buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
-	buf = BufferDescriptorGetBuffer(buf_hdr);
-
-	Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
-
-	/* Pin the buffer and then release the buffer spinlock */
-	PinBuffer_Locked(buf_hdr);
-
-	/*
-	 * We shouldn't have any other pins for this buffer.
-	 */
-	CheckBufferIsPinnedOnce(buf);
-
-	/*
-	 * If the buffer was dirty, try to write it out.  There is a race
-	 * condition here, in that someone might dirty it after we released the
-	 * buffer header lock above, or even while we are writing it out (since
-	 * our share-lock won't prevent hint-bit updates).  We will recheck the
-	 * dirty bit after re-locking the buffer header.
-	 */
-	if (buf_state & BM_DIRTY)
+	/* Select a victim buffer using an optimistic locking scheme. */
+	for (;;)
 	{
-		LWLock	   *content_lock;
+		/*
+		 * Attempt to claim a victim buffer.  The buffer is returned with its
+		 * header spinlock still held!
+		 */
+		buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+		buf = BufferDescriptorGetBuffer(buf_hdr);
 
-		Assert(buf_state & BM_TAG_VALID);
-		Assert(buf_state & BM_VALID);
+		Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+
+		/* Pin the buffer and then release the buffer spinlock */
+		PinBuffer_Locked(buf_hdr);
 
 		/*
-		 * We need a share-lock on the buffer contents to write it out (else
-		 * we might write invalid data, eg because someone else is compacting
-		 * the page contents while we write).  We must use a conditional lock
-		 * acquisition here to avoid deadlock.  Even though the buffer was not
-		 * pinned (and therefore surely not locked) when StrategyGetBuffer
-		 * returned it, someone else could have pinned and exclusive-locked it
-		 * by the time we get here. If we try to get the lock unconditionally,
-		 * we'd block waiting for them; if they later block waiting for us,
-		 * deadlock ensues. (This has been observed to happen when two
-		 * backends are both trying to split btree index pages, and the second
-		 * one just happens to be trying to split the page the first one got
-		 * from StrategyGetBuffer.)
+		 * We shouldn't have any other pins for this buffer.
 		 */
-		content_lock = BufferDescriptorGetContentLock(buf_hdr);
-		if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
-		{
-			/*
-			 * Someone else has locked the buffer, so give it up and loop back
-			 * to get another one.
-			 */
-			UnpinBuffer(buf_hdr);
-			goto again;
-		}
+		CheckBufferIsPinnedOnce(buf);
 
 		/*
-		 * If using a nondefault strategy, and writing the buffer would
-		 * require a WAL flush, let the strategy decide whether to go ahead
-		 * and write/reuse the buffer or to choose another victim.  We need a
-		 * lock to inspect the page LSN, so this can't be done inside
-		 * StrategyGetBuffer.
+		 * If the buffer was dirty, try to write it out.  There is a race
+		 * condition here, in that someone might dirty it after we released
+		 * the buffer header lock above, or even while we are writing it out
+		 * (since our share-lock won't prevent hint-bit updates).  We will
+		 * recheck the dirty bit after re-locking the buffer header.
 		 */
-		if (strategy != NULL)
+		if (buf_state & BM_DIRTY)
 		{
-			XLogRecPtr	lsn;
+			LWLock	   *content_lock;
 
-			/* Read the LSN while holding buffer header lock */
-			buf_state = LockBufHdr(buf_hdr);
-			lsn = BufferGetLSN(buf_hdr);
-			UnlockBufHdr(buf_hdr, buf_state);
+			Assert(buf_state & BM_TAG_VALID);
+			Assert(buf_state & BM_VALID);
 
-			if (XLogNeedsFlush(lsn)
-				&& StrategyRejectBuffer(strategy, buf_hdr, from_ring))
+			/*
+			 * We need a share-lock on the buffer contents to write it out
+			 * (else we might write invalid data, eg because someone else is
+			 * compacting the page contents while we write).  We must use a
+			 * conditional lock acquisition here to avoid deadlock.  Even
+			 * though the buffer was not pinned (and therefore surely not
+			 * locked) when StrategyGetBuffer returned it, someone else could
+			 * have pinned and exclusive-locked it by the time we get here. If
+			 * we try to get the lock unconditionally, we'd block waiting for
+			 * them; if they later block waiting for us, deadlock ensues.
+			 * (This has been observed to happen when two backends are both
+			 * trying to split btree index pages, and the second one just
+			 * happens to be trying to split the page the first one got from
+			 * StrategyGetBuffer.)
+			 */
+			content_lock = BufferDescriptorGetContentLock(buf_hdr);
+			if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				/*
+				 * Someone else has locked the buffer, so give it up and loop
+				 * back to get another one.
+				 */
+				UnpinBuffer(buf_hdr);
+				continue;
+			}
+
+			/*
+			 * If using a nondefault strategy, and writing the buffer would
+			 * require a WAL flush, let the strategy decide whether to go
+			 * ahead and write/reuse the buffer or to choose another victim.
+			 * We need the content lock to inspect the page LSN, so this can't
+			 * be done inside StrategyGetBuffer.
+			 */
+			if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
 			{
 				LWLockRelease(content_lock);
 				UnpinBuffer(buf_hdr);
-				goto again;
+				continue;
 			}
-		}
 
-		/* OK, do the I/O */
-		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-		LWLockRelease(content_lock);
+			/* OK, do the I/O */
+			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
+			LWLockRelease(content_lock);
 
-		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-									  &buf_hdr->tag);
-	}
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &buf_hdr->tag);
+		}
 
+		if (buf_state & BM_VALID)
+		{
+			/*
+			 * When a BufferAccessStrategy is in use, blocks evicted from
+			 * shared buffers are counted as IOOP_EVICT in the corresponding
+			 * context (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted
+			 * by a strategy in two cases: 1) while initially claiming buffers
+			 * for the strategy ring 2) to replace an existing strategy ring
+			 * buffer because it is pinned or in use and cannot be reused.
+			 *
+			 * Blocks evicted from buffers already in the strategy ring are
+			 * counted as IOOP_REUSE in the corresponding strategy context.
+			 *
+			 * At this point, we can accurately count evictions and reuses,
+			 * because we have successfully claimed the valid buffer.
+			 * Previously, we may have been forced to release the buffer due
+			 * to concurrent pinners or erroring out.
+			 */
+			pgstat_count_io_op(IOOBJECT_RELATION, io_context,
+							   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
+		}
 
-	if (buf_state & BM_VALID)
-	{
 		/*
-		 * When a BufferAccessStrategy is in use, blocks evicted from shared
-		 * buffers are counted as IOOP_EVICT in the corresponding context
-		 * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
-		 * strategy in two cases: 1) while initially claiming buffers for the
-		 * strategy ring 2) to replace an existing strategy ring buffer
-		 * because it is pinned or in use and cannot be reused.
-		 *
-		 * Blocks evicted from buffers already in the strategy ring are
-		 * counted as IOOP_REUSE in the corresponding strategy context.
-		 *
-		 * At this point, we can accurately count evictions and reuses,
-		 * because we have successfully claimed the valid buffer. Previously,
-		 * we may have been forced to release the buffer due to concurrent
-		 * pinners or erroring out.
+		 * If the buffer has an entry in the buffer mapping table, delete it.
+		 * This can fail because another backend could have pinned or dirtied
+		 * the buffer. Then loop around and try again.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, io_context,
-						   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
-	}
+		if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
+		{
+			UnpinBuffer(buf_hdr);
+			continue;
+		}
 
-	/*
-	 * If the buffer has an entry in the buffer mapping table, delete it. This
-	 * can fail because another backend could have pinned or dirtied the
-	 * buffer.
-	 */
-	if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
-	{
-		UnpinBuffer(buf_hdr);
-		goto again;
+		break;
 	}
 
 	/* a final set of sanity checks */
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7d59a92bd1a..ec6e0f86816 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "pgstat.h"
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
@@ -716,12 +717,21 @@ IOContextForStrategy(BufferAccessStrategy strategy)
  * be written out and doing so would require flushing WAL too.  This gives us
  * a chance to choose a different victim.
  *
+ * The buffer must be pinned and content locked, and the buffer header spinlock
+ * must not be held. We must have the content lock to examine the LSN.
+ *
  * Returns true if buffer manager should ask for a new victim, and false
  * if this buffer should be written and re-used.
  */
 bool
 StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+
+	if (!strategy)
+		return false;
+
 	/* We only do this in bulkread mode */
 	if (strategy->btype != BAS_BULKREAD)
 		return false;
@@ -731,6 +741,13 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
 		return false;
 
+	buf_state = LockBufHdr(buf);
+	lsn = BufferGetLSN(buf);
+	UnlockBufHdr(buf, buf_state);
+
+	if (!XLogNeedsFlush(lsn))
+		return true;
+
 	/*
 	 * Remove the dirty buffer from the ring; necessary to prevent infinite
 	 * loop if all ring members are dirty.
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index dfd614f7ca4..b1b81f31419 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -419,6 +419,11 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 /*
  * Internal buffer management routines
  */
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
+
 /* bufmgr.c */
 extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
-- 
2.43.0
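
The control-flow change in this patch (a goto-based retry replaced by a for loop that uses continue and break) can be sketched in isolation. This is a toy model and not the real GetVictimBuffer(): ToySlot, toy_next_candidate, and toy_get_victim are hypothetical names, the "locked" flag stands in for a failed conditional content-lock acquisition, and the sketch assumes at least one slot is eventually claimable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a buffer: may be dirty and/or locked by others. */
typedef struct ToySlot
{
    bool        locked;     /* conditional lock acquisition would fail */
    bool        dirty;      /* must be written out before reuse */
} ToySlot;

/* Hypothetical clock-sweep stand-in: hand out slots in order, wrapping. */
static int
toy_next_candidate(int *cursor, int nslots)
{
    int         id = *cursor;

    *cursor = (*cursor + 1) % nslots;
    return id;
}

/*
 * Pick a victim slot. Each failure path uses "continue" to fetch another
 * candidate, and the single success path ends the loop with "break".
 */
static int
toy_get_victim(ToySlot *slots, int nslots, int *cursor)
{
    int         id;

    for (;;)
    {
        id = toy_next_candidate(cursor, nslots);

        if (slots[id].dirty)
        {
            /* Could not get the lock: give this one up and try another. */
            if (slots[id].locked)
                continue;

            /* "Write out" the slot so it can be reused. */
            slots[id].dirty = false;
        }

        break;
    }

    return id;
}
```

Keeping every retry as a continue inside one loop body makes it easier to add further steps (such as batch flushing) before the final break.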

v2-0005-Fix-XLogNeedsFlush-for-checkpointer.patch (text/x-patch; charset=US-ASCII)

From f11b7be35c56e95efb798242ac0029e6b35d34fb Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 10:01:17 -0400
Subject: [PATCH v2 5/9] Fix XLogNeedsFlush() for checkpointer

XLogNeedsFlush() takes an LSN and compares it to either the flush pointer or the
min recovery point, depending on whether it is in normal operation or recovery.

Even though it is technically recovery, the checkpointer must flush WAL during
an end-of-recovery checkpoint, so in this case, it should compare the provided
LSN to the flush pointer and not the min recovery point.

If it compares the LSN to the min recovery point when the control file's min
recovery point has been updated to an incorrect value, XLogNeedsFlush() can
return an incorrect result of true -- even after just having flushed WAL.

Change this to compare the LSN to the min recovery point (and potentially
update the local copy of the min recovery point) only when xlog inserts are
not allowed. Inserts are allowed for the checkpointer during an
end-of-recovery checkpoint, but not otherwise during crash recovery, so in
that case the LSN is now compared to the flush pointer.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g%40mail.gmail.com
---
 src/backend/access/transam/xlog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7ffb2179151..16ef6d2cd64 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3115,7 +3115,7 @@ XLogNeedsFlush(XLogRecPtr record)
 	 * instead. So "needs flush" is taken to mean whether minRecoveryPoint
 	 * would need to be updated.
 	 */
-	if (RecoveryInProgress())
+	if (RecoveryInProgress() && !XLogInsertAllowed())
 	{
 		/*
 		 * An invalid minRecoveryPoint means that we need to recover all the
-- 
2.43.0
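
The effect of the one-line change can be illustrated with a simplified model of the decision. This is only a sketch: ToyWalState and toy_needs_flush are hypothetical, the real XLogNeedsFlush() also refreshes its local copy of the min recovery point and handles an invalid min recovery point, and the "patched" flag exists here only to compare the old and new conditions side by side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the state the decision depends on. */
typedef struct ToyWalState
{
    bool        recovery_in_progress;
    bool        insert_allowed;     /* true during end-of-recovery checkpoint */
    uint64_t    min_recovery_point;
    uint64_t    flush_ptr;          /* WAL is flushed through this LSN */
} ToyWalState;

/*
 * Simplified "needs flush" test. With patched == false, the min recovery
 * point is consulted whenever recovery is in progress. With patched == true,
 * it is consulted only if WAL inserts are also not allowed.
 */
static bool
toy_needs_flush(const ToyWalState *st, uint64_t lsn, bool patched)
{
    bool        use_min_recovery_point = st->recovery_in_progress;

    if (patched)
        use_min_recovery_point = use_min_recovery_point && !st->insert_allowed;

    if (use_min_recovery_point)
        return lsn > st->min_recovery_point;

    return lsn > st->flush_ptr;
}
```

In the end-of-recovery case, an LSN that is already below the flush pointer but above a stale min recovery point is reported as needing a flush by the old condition and as already flushed by the new one.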

v2-0006-Add-database-Oid-to-CkptSortItem.patch (text/x-patch; charset=US-ASCII)

From 125c0ed4f690ad02ef9421bca45165d853b1de88 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:22:11 -0400
Subject: [PATCH v2 6/9] Add database Oid to CkptSortItem

This is useful for checkpointer write combining -- which will be added
in a future commit.
---
 src/backend/storage/buffer/bufmgr.c | 8 ++++++++
 src/include/storage/buf_internals.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ade83adca59..5ab40a09960 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3370,6 +3370,7 @@ BufferSync(int flags)
 			item = &CkptBufferIds[num_to_scan++];
 			item->buf_id = buf_id;
 			item->tsId = bufHdr->tag.spcOid;
+			item->db_id = bufHdr->tag.dbOid;
 			item->relNumber = BufTagGetRelNumber(&bufHdr->tag);
 			item->forkNum = BufTagGetForkNum(&bufHdr->tag);
 			item->blockNum = bufHdr->tag.blockNum;
@@ -6689,6 +6690,13 @@ ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b)
 		return -1;
 	else if (a->tsId > b->tsId)
 		return 1;
+
+	/* compare database */
+	if (a->db_id < b->db_id)
+		return -1;
+	else if (a->db_id > b->db_id)
+		return 1;
+
 	/* compare relation */
 	if (a->relNumber < b->relNumber)
 		return -1;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index d1f0ecb7ca4..291cc31da06 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -382,6 +382,7 @@ UnlockBufHdr(BufferDesc *desc, uint32 buf_state)
 typedef struct CkptSortItem
 {
 	Oid			tsId;
+	Oid			db_id;
 	RelFileNumber relNumber;
 	ForkNumber	forkNum;
 	BlockNumber blockNum;
-- 
2.43.0
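
One way to see why the database Oid helps: two databases in the same tablespace can contain relations with the same relation number, and without a database key their blocks would interleave in the sorted array, so adjacent entries could look contiguous while belonging to different files. The sketch below uses hypothetical names (ToySortItem, toy_item_cmp) and the standard qsort() rather than PostgreSQL's own sort code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for CkptSortItem. */
typedef struct ToySortItem
{
    uint32_t    ts_id;
    uint32_t    db_id;
    uint32_t    rel_number;
    int         fork_num;
    uint32_t    block_num;
} ToySortItem;

/* Three-way comparison of unsigned values without overflow. */
static int
toy_cmp_u32(uint32_t a, uint32_t b)
{
    return (a > b) - (a < b);
}

/* Order by tablespace, database, relation, fork, then block number. */
static int
toy_item_cmp(const void *pa, const void *pb)
{
    const ToySortItem *a = pa;
    const ToySortItem *b = pb;
    int         c;

    if ((c = toy_cmp_u32(a->ts_id, b->ts_id)) != 0)
        return c;
    if ((c = toy_cmp_u32(a->db_id, b->db_id)) != 0)
        return c;
    if ((c = toy_cmp_u32(a->rel_number, b->rel_number)) != 0)
        return c;
    if ((c = (a->fork_num > b->fork_num) - (a->fork_num < b->fork_num)) != 0)
        return c;
    return toy_cmp_u32(a->block_num, b->block_num);
}
```

After sorting with this comparator, every run of adjacent items with consecutive block numbers in the same relation and fork is also guaranteed to come from the same database.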

v2-0007-Implement-checkpointer-data-write-combining.patch (text/x-patch; charset=US-ASCII)

From bb5b345c997fea0bc5838e78af91b85a603f279b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:42:29 -0400
Subject: [PATCH v2 7/9] Implement checkpointer data write combining

When the checkpointer writes out dirty buffers, writing multiple
contiguous blocks as a single IO is a substantial performance
improvement. The checkpointer is usually bottlenecked on IO, so issuing
larger IOs leads to increased write throughput and faster checkpoints.
---
 src/backend/storage/buffer/bufmgr.c | 232 ++++++++++++++++++++++++----
 src/backend/utils/probes.d          |   2 +-
 2 files changed, 207 insertions(+), 27 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5ab40a09960..8de669a39f3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -512,6 +512,7 @@ static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
+static uint32 checkpointer_max_batch_size(void);
 static void BufferSync(int flags);
 static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
@@ -3312,7 +3313,6 @@ UnpinBufferNoOwner(BufferDesc *buf)
 static void
 BufferSync(int flags)
 {
-	uint32		buf_state;
 	int			buf_id;
 	int			num_to_scan;
 	int			num_spaces;
@@ -3324,6 +3324,8 @@ BufferSync(int flags)
 	int			i;
 	int			mask = BM_DIRTY;
 	WritebackContext wb_context;
+	uint32		max_batch_size;
+	BufWriteBatch batch;
 
 	/*
 	 * Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3354,6 +3356,7 @@ BufferSync(int flags)
 	for (buf_id = 0; buf_id < NBuffers; buf_id++)
 	{
 		BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+		uint32		buf_state;
 
 		/*
 		 * Header spinlock is enough to examine BM_DIRTY, see comment in
@@ -3494,48 +3497,208 @@ BufferSync(int flags)
 	 */
 	num_processed = 0;
 	num_written = 0;
+	max_batch_size = checkpointer_max_batch_size();
 	while (!binaryheap_empty(ts_heap))
 	{
+		BlockNumber limit = max_batch_size;
 		BufferDesc *bufHdr = NULL;
 		CkptTsStatus *ts_stat = (CkptTsStatus *)
 			DatumGetPointer(binaryheap_first(ts_heap));
-
-		buf_id = CkptBufferIds[ts_stat->index].buf_id;
-		Assert(buf_id != -1);
-
-		bufHdr = GetBufferDescriptor(buf_id);
-
-		num_processed++;
+		int			ts_end = ts_stat->index - ts_stat->num_scanned + ts_stat->num_to_scan;
+		int			processed = 0;
 
 		/*
-		 * We don't need to acquire the lock here, because we're only looking
-		 * at a single bit. It's possible that someone else writes the buffer
-		 * and clears the flag right after we check, but that doesn't matter
-		 * since SyncOneBuffer will then do nothing.  However, there is a
-		 * further race condition: it's conceivable that between the time we
-		 * examine the bit here and the time SyncOneBuffer acquires the lock,
-		 * someone else not only wrote the buffer but replaced it with another
-		 * page and dirtied it.  In that improbable case, SyncOneBuffer will
-		 * write the buffer though we didn't need to.  It doesn't seem worth
-		 * guarding against this, though.
+		 * Each batch will have exactly one start and one max lsn and one
+		 * length.
 		 */
-		if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
+		batch.start = InvalidBlockNumber;
+		batch.max_lsn = InvalidXLogRecPtr;
+		batch.n = 0;
+
+		while (batch.n < limit)
 		{
-			if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+			uint32		buf_state;
+			XLogRecPtr	lsn = InvalidXLogRecPtr;
+			LWLock	   *content_lock;
+			CkptSortItem item;
+
+			if (ProcSignalBarrierPending)
+				ProcessProcSignalBarrier();
+
+			/* Check if we are done with this tablespace */
+			if (ts_stat->index + processed >= ts_end)
+				break;
+
+			item = CkptBufferIds[ts_stat->index + processed];
+
+			buf_id = item.buf_id;
+			Assert(buf_id != -1);
+
+			bufHdr = GetBufferDescriptor(buf_id);
+
+			/*
+			 * If this is the first block of the batch, then check if we need
+			 * to open a new relation. Open the relation now because we have
+			 * to determine the maximum IO size based on how many blocks
+			 * remain in the file.
+			 */
+			if (!BlockNumberIsValid(batch.start))
+			{
+				Assert(batch.max_lsn == InvalidXLogRecPtr && batch.n == 0);
+				batch.rlocator.spcOid = item.tsId;
+				batch.rlocator.dbOid = item.db_id;
+				batch.rlocator.relNumber = item.relNumber;
+				batch.forkno = item.forkNum;
+				batch.start = item.blockNum;
+				batch.reln = smgropen(batch.rlocator, INVALID_PROC_NUMBER);
+				limit = smgrmaxcombine(batch.reln, batch.forkno, batch.start);
+				limit = Max(1, limit);
+				limit = Min(limit, max_batch_size);
+			}
+
+			/*
+			 * Once we hit blocks from the next relation or fork of the
+			 * relation, break out of the loop and issue the IO we've built up
+			 * so far. It is important that we don't increment processed
+			 * because we want to start the next IO with this item.
+			 */
+			if (item.db_id != batch.rlocator.dbOid)
+				break;
+
+			if (item.relNumber != batch.rlocator.relNumber)
+				break;
+
+			if (item.forkNum != batch.forkno)
+				break;
+
+			/*
+			 * If the next block is not contiguous, we can't include it in the
+			 * IO we will issue. Break out of the loop and issue what we have
+			 * so far. Do not count this item as processed -- otherwise we
+			 * will end up skipping it.
+			 */
+			if (item.blockNum != batch.start + batch.n)
+				break;
+
+			/*
+			 * We don't need to acquire the lock here, because we're only
+			 * looking at a single bit. It's possible that someone else writes
+			 * the buffer and clears the flag right after we check, but that
+			 * doesn't matter since StartBufferIO will then return false. If
+			 * the buffer doesn't need checkpointing, don't include it in the
+			 * batch we are building. We're done with the item, so count it as
+			 * processed and break out of the loop to issue the IO we have
+			 * built so far.
+			 */
+			if (!(pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED))
+			{
+				processed++;
+				break;
+			}
+
+			ReservePrivateRefCountEntry();
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+
+			buf_state = LockBufHdr(bufHdr);
+
+			/*
+			 * If the buffer doesn't need eviction, we're done with the item,
+			 * so count it as processed and break out of the loop to issue the
+			 * IO so far.
+			 */
+			if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
 			{
-				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-				PendingCheckpointerStats.buffers_written++;
-				num_written++;
+				processed++;
+				UnlockBufHdr(bufHdr, buf_state);
+				break;
+			}
+
+			PinBuffer_Locked(bufHdr);
+
+			/*
+			 * There is a race condition here: it's conceivable that between
+			 * the time we examine the buffer header for BM_CHECKPOINT_NEEDED
+			 * above and the time we acquire the lock here, someone else
+			 * not only wrote the buffer but replaced it with another page and
+			 * dirtied it.  In that improbable case, we will write the buffer
+			 * though we didn't need to.  It doesn't seem worth guarding
+			 * against this, though.
+			 */
+			content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+			/*
+			 * We are willing to wait for the content lock on the first IO in
+			 * the batch. However, for subsequent IOs, waiting could lead to
+			 * deadlock. We have to eventually flush all eligible buffers,
+			 * though. So, if we fail to acquire the lock on a subsequent
+			 * buffer, we break out and issue the IO we've built up so far.
+			 * Then we come back and start a new IO with that buffer as the
+			 * starting buffer. As such, we must not count the item as
+			 * processed if we end up failing to acquire the content lock.
+			 */
+			if (batch.n == 0)
+				LWLockAcquire(content_lock, LW_SHARED);
+			else if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				UnpinBuffer(bufHdr);
+				break;
+			}
+
+			/*
+			 * If the buffer doesn't need IO, count the item as processed,
+			 * release the buffer, and break out of the loop to issue the IO
+			 * we have built up so far.
+			 */
+			if (!StartBufferIO(bufHdr, false, true))
+			{
+				processed++;
+				LWLockRelease(content_lock);
+				UnpinBuffer(bufHdr);
+				break;
 			}
+
+			buf_state = LockBufHdr(bufHdr);
+			lsn = BufferGetLSN(bufHdr);
+			buf_state &= ~BM_JUST_DIRTIED;
+			UnlockBufHdr(bufHdr, buf_state);
+
+			/*
+			 * Keep track of the max LSN so that we can be sure to flush
+			 * enough WAL before flushing data from the buffers. See comment
+			 * in DoFlushBuffer() for more on why we don't consider the LSNs
+			 * of unlogged relations.
+			 */
+			if (buf_state & BM_PERMANENT && lsn > batch.max_lsn)
+				batch.max_lsn = lsn;
+
+			batch.bufdescs[batch.n++] = bufHdr;
+			processed++;
 		}
 
 		/*
 		 * Measure progress independent of actually having to flush the buffer
 		 * - otherwise writing become unbalanced.
 		 */
-		ts_stat->progress += ts_stat->progress_slice;
-		ts_stat->num_scanned++;
-		ts_stat->index++;
+		num_processed += processed;
+		ts_stat->progress += ts_stat->progress_slice * processed;
+		ts_stat->num_scanned += processed;
+		ts_stat->index += processed;
+
+		/*
+		 * If we built up an IO, issue it. There's a chance we didn't find any
+		 * items referencing buffers that needed flushing this time, but we
+		 * still want to check if we should update the heap if we examined and
+		 * processed the items.
+		 */
+		if (batch.n > 0)
+		{
+			FlushBufferBatch(&batch, IOCONTEXT_NORMAL);
+			CompleteWriteBatchIO(&batch, &wb_context, IOCONTEXT_NORMAL);
+
+			TRACE_POSTGRESQL_BUFFER_BATCH_SYNC_WRITTEN(batch.n);
+			PendingCheckpointerStats.buffers_written += batch.n;
+			num_written += batch.n;
+		}
 
 		/* Have all the buffers from the tablespace been processed? */
 		if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -4261,6 +4424,23 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * The maximum number of blocks that can be written out in a single batch by
+ * the checkpointer.
+ */
+static uint32
+checkpointer_max_batch_size(void)
+{
+	uint32		result;
+	uint32		pin_limit = GetPinLimit();
+
+	result = Max(pin_limit, 1);
+	result = Min(pin_limit, io_combine_limit);
+	result = Max(result, 1);
+	Assert(result <= MAX_IO_COMBINE_LIMIT);
+	return result;
+}
+
 
 /*
  * Given a buffer descriptor, start, from a strategy ring, strategy, that
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 36dd4f8375b..d6970731ba9 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -68,7 +68,7 @@ provider postgresql {
 	probe buffer__checkpoint__sync__start();
 	probe buffer__checkpoint__done();
 	probe buffer__sync__start(int, int);
-	probe buffer__sync__written(int);
+	probe buffer__batch__sync__written(BlockNumber);
 	probe buffer__sync__done(int, int, int);
 
 	probe deadlock__found();
-- 
2.43.0
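
The core of the batching loop is deciding how many consecutive sorted items can go into one combined write. The sketch below isolates just that rule and leaves out everything else the patch handles (pins, content locks, BM_CHECKPOINT_NEEDED, StartBufferIO, and the smgrmaxcombine() limit). ToyItem and toy_run_length are hypothetical names, and the input is assumed to be sorted as the checkpointer's array is.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a sorted checkpoint item. */
typedef struct ToyItem
{
    uint32_t    db_id;
    uint32_t    rel_number;
    int         fork_num;
    uint32_t    block_num;
} ToyItem;

/*
 * Return how many items, starting at items[first], can be written as one
 * combined IO: same database, relation, and fork, with consecutive block
 * numbers, and no more than "limit" blocks (clamped to at least 1).
 */
static uint32_t
toy_run_length(const ToyItem *items, uint32_t nitems,
               uint32_t first, uint32_t limit)
{
    uint32_t    n = 0;

    if (limit < 1)
        limit = 1;

    while (n < limit && first + n < nitems)
    {
        const ToyItem *head = &items[first];
        const ToyItem *it = &items[first + n];

        /* A different file ends the run; the item starts the next one. */
        if (it->db_id != head->db_id ||
            it->rel_number != head->rel_number ||
            it->fork_num != head->fork_num)
            break;

        /* A gap in block numbers also ends the run. */
        if (it->block_num != head->block_num + n)
            break;

        n++;
    }

    return n;
}
```

A caller would repeatedly take a run, issue one write for it, and advance by the returned length; an item that ends a run is not consumed, so it becomes the first item of the next run.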

Nazir Bilal Yavuz

byavuz81@gmail.com

4 months ago

In reply to: Melanie Plageman (#2)

Re: Checkpointer write combining

Hi,

Thank you for working on this!

On Tue, 9 Sept 2025 at 02:44, Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Tue, Sep 2, 2025 at 5:10 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

The attached patchset implements checkpointer write combining -- which
makes immediate checkpoints at least 20% faster in my tests.
Checkpointer achieves higher write throughput and higher write IOPs
with the patch.

I ran the same benchmark you did and found it to be 50% faster (16
seconds down to 8 seconds).

From 053dd9d15416d76ce4b95044d848f51ba13a2d20 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:00:44 -0400
Subject: [PATCH v2 1/9] Refactor goto into for loop in GetVictimBuffer()

@@ -731,6 +741,13 @@ StrategyRejectBuffer(BufferAccessStrategy
strategy, BufferDesc *buf, bool from_r
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;

+    buf_state = LockBufHdr(buf);
+    lsn = BufferGetLSN(buf);
+    UnlockBufHdr(buf, buf_state);
+
+    if (!XLogNeedsFlush(lsn))
+        return true;

I think this should return false.
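
The reasoning can be sketched as a small truth function. This is a simplified model under stated assumptions, not the real StrategyRejectBuffer(): toy_reject_buffer and its boolean parameters are hypothetical, and the real function also removes the rejected buffer from the ring. Per the function's header comment, rejection is meant for buffers whose write would require flushing WAL, so a buffer that does not need a WAL flush should be kept.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true if the caller should ask for a different victim, false if
 * this buffer should be written and reused.
 */
static bool
toy_reject_buffer(bool have_strategy, bool is_bulkread,
                  bool from_ring, bool needs_wal_flush)
{
    /* No strategy in use: never reject. */
    if (!have_strategy)
        return false;

    /* Only bulkread strategies reject buffers. */
    if (!is_bulkread)
        return false;

    /* Leave the normal replacement path alone. */
    if (!from_ring)
        return false;

    /* The case under review: no WAL flush needed, so keep the buffer. */
    if (!needs_wal_flush)
        return false;

    return true;
}
```

Returning true in the no-flush case would reject exactly the buffers that are cheapest to write, which is the opposite of the intent.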

I am planning to review the other patches later and this is for the
first patch only.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Melanie Plageman

melanieplageman@gmail.com

4 months ago

In reply to: Nazir Bilal Yavuz (#3)

7 attachment(s)

Re: Checkpointer write combining

On Tue, Sep 9, 2025 at 9:27 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

Thanks for the review!

From 053dd9d15416d76ce4b95044d848f51ba13a2d20 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:00:44 -0400
Subject: [PATCH v2 1/9] Refactor goto into for loop in GetVictimBuffer()

@@ -731,6 +741,13 @@ StrategyRejectBuffer(BufferAccessStrategy
strategy, BufferDesc *buf, bool from_r
strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
return false;
+    buf_state = LockBufHdr(buf);
+    lsn = BufferGetLSN(buf);
+    UnlockBufHdr(buf, buf_state);
+
+    if (!XLogNeedsFlush(lsn))
+        return true;
I think this should return false.

Oops, you're right. v3 attached with that mistake fixed.

- Melanie

Attachments:

v3-0003-Eagerly-flush-bulkwrite-strategy-ring.patch (text/x-patch; charset=US-ASCII)

From d954d754656b9bc05da4c2edc0a4bad8b3118091 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 12:43:24 -0400
Subject: [PATCH v3 3/9] Eagerly flush bulkwrite strategy ring

Operations using BAS_BULKWRITE (COPY FROM and createdb) will inevitably
need to flush buffers in the strategy ring in order to reuse
them. By eagerly flushing the buffers in a larger batch, we encourage
larger writes at the kernel level and less interleaving of WAL flushes
and data file writes. The effect is mainly noticeable with multiple
parallel COPY FROMs. In this case, client backends achieve higher write
throughput and end up spending less time waiting on acquiring the lock
to flush WAL. Larger flush operations also mean less time waiting for
flush operations at the kernel level.

The heuristic for eager eviction is to flush only those buffers in the
strategy ring that can be flushed without first flushing WAL.

This patch is also a stepping stone toward AIO writes.

Earlier version
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 166 +++++++++++++++++++++++++-
 src/backend/storage/buffer/freelist.c |  63 ++++++++++
 src/include/storage/buf_internals.h   |   3 +
 3 files changed, 229 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 84ff5e0f1bf..90f36a04c19 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -534,6 +534,11 @@ static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object
 						  IOContext io_context, XLogRecPtr buffer_lsn);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+static BufferDesc *next_strat_buf_to_flush(BufferAccessStrategy strategy, XLogRecPtr *lsn);
+static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+												   RelFileLocator *rlocator,
+												   bool skip_pinned,
+												   XLogRecPtr *max_lsn);
 static void CleanVictimBuffer(BufferAccessStrategy strategy,
 							  BufferDesc *bufdesc, uint32 *buf_state, bool from_ring);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
@@ -4253,6 +4258,31 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * Returns the buffer descriptor of the buffer containing the next block we
+ * should eagerly flush, or NULL when there are no further buffers to
+ * consider writing out.
+ */
+static BufferDesc *
+next_strat_buf_to_flush(BufferAccessStrategy strategy,
+						XLogRecPtr *lsn)
+{
+	Buffer		bufnum;
+	BufferDesc *bufdesc;
+
+	while ((bufnum = StrategySweepNextBuffer(strategy)) != InvalidBuffer)
+	{
+		if ((bufdesc = PrepareOrRejectEagerFlushBuffer(bufnum,
+													   InvalidBlockNumber,
+													   NULL,
+													   true,
+													   lsn)) != NULL)
+			return bufdesc;
+	}
+
+	return NULL;
+}
+
 /*
  * Prepare to write and write a dirty victim buffer.
  */
@@ -4263,6 +4293,7 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
+	bool		first_buffer = true;
 	IOContext	io_context = IOContextForStrategy(strategy);
 
 	Assert(*buf_state & BM_DIRTY);
@@ -4271,11 +4302,140 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 	if (!PrepareFlushBuffer(bufdesc, buf_state, &max_lsn))
 		return;
 
-	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	if (from_ring && strategy_supports_eager_flush(strategy))
+	{
+		/* Clean victim buffer and find more to flush opportunistically */
+		StartStrategySweep(strategy);
+		do
+		{
+			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+			content_lock = BufferDescriptorGetContentLock(bufdesc);
+			LWLockRelease(content_lock);
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &bufdesc->tag);
+			/* We leave the first buffer pinned for the caller */
+			if (!first_buffer)
+				UnpinBuffer(bufdesc);
+			first_buffer = false;
+		} while ((bufdesc = next_strat_buf_to_flush(strategy, &max_lsn)) != NULL);
+	}
+	else
+	{
+		DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+		content_lock = BufferDescriptorGetContentLock(bufdesc);
+		LWLockRelease(content_lock);
+		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+									  &bufdesc->tag);
+	}
+}
+
+/*
+ * Prepare bufdesc for eager flushing.
+ *
+ * Given bufnum, returns the descriptor of the buffer which we will
+ * opportunistically flush, or NULL if this buffer does not contain a block
+ * that should be flushed.
+ *
+ * require is the BlockNumber required by the caller. Some callers may require
+ * a specific BlockNumber to be in bufnum because they are assembling a
+ * contiguous run of blocks.
+ *
+ * If the caller needs the block to be from a specific relation, rlocator will
+ * be provided.
+ */
+BufferDesc *
+PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+								RelFileLocator *rlocator, bool skip_pinned,
+								XLogRecPtr *max_lsn)
+{
+	BufferDesc *bufdesc;
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+	BlockNumber blknum;
+	LWLock	   *content_lock;
+
+	if (!BufferIsValid(bufnum))
+		return NULL;
+
+	Assert(!BufferIsLocal(bufnum));
+
+	bufdesc = GetBufferDescriptor(bufnum - 1);
+
+	/* Block may need to be in a specific relation */
+	if (rlocator &&
+		!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufdesc->tag),
+							  *rlocator))
+		return NULL;
+
+	/* Must do this before taking the buffer header spinlock. */
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	ReservePrivateRefCountEntry();
+
+	buf_state = LockBufHdr(bufdesc);
+
+	if (!(buf_state & BM_DIRTY) || !(buf_state & BM_VALID))
+		goto except_unlock_header;
+
+	/* We don't include used buffers in batches */
+	if (skip_pinned &&
+		(BUF_STATE_GET_REFCOUNT(buf_state) > 0 ||
+		 BUF_STATE_GET_USAGECOUNT(buf_state) > 1))
+		goto except_unlock_header;
+
+	/* Get page LSN while holding header lock */
+	lsn = BufferGetLSN(bufdesc);
+
+	PinBuffer_Locked(bufdesc);
+	CheckBufferIsPinnedOnce(bufnum);
+
+	blknum = BufferGetBlockNumber(bufnum);
+	Assert(BlockNumberIsValid(blknum));
+
+	/* If we'll have to flush WAL to flush the block, we're done */
+	if (buf_state & BM_PERMANENT && XLogNeedsFlush(lsn))
+		goto except_unpin_buffer;
+
+	/* We only include contiguous blocks in the run */
+	if (BlockNumberIsValid(require) && blknum != require)
+		goto except_unpin_buffer;
+
 	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+		goto except_unpin_buffer;
+
+	/*
+	 * Now that we have the content lock, we need to recheck if we need to
+	 * flush WAL.
+	 */
+	buf_state = LockBufHdr(bufdesc);
+	lsn = BufferGetLSN(bufdesc);
+	UnlockBufHdr(bufdesc, buf_state);
+
+	if (buf_state & BM_PERMANENT && XLogNeedsFlush(lsn))
+		goto except_unlock_content;
+
+	/* Try to start an I/O operation. */
+	if (!StartBufferIO(bufdesc, false, true))
+		goto except_unlock_content;
+
+	if (lsn > *max_lsn)
+		*max_lsn = lsn;
+	buf_state = LockBufHdr(bufdesc);
+	buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, buf_state);
+
+	return bufdesc;
+
+except_unlock_content:
 	LWLockRelease(content_lock);
-	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-								  &bufdesc->tag);
+
+except_unpin_buffer:
+	UnpinBuffer(bufdesc);
+	return NULL;
+
+except_unlock_header:
+	UnlockBufHdr(bufdesc, buf_state);
+	return NULL;
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index a90a7ed4e16..e26a546bc99 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -75,6 +75,15 @@ typedef struct BufferAccessStrategyData
 	 */
 	int			current;
 
+	/*
+	 * If the strategy supports eager flushing, we may initiate a sweep of the
+	 * strategy ring, flushing all the dirty buffers we can cheaply flush.
+	 * sweep_start and sweep_current keep track of a given sweep so we don't
+	 * loop around the ring infinitely.
+	 */
+	int			sweep_start;
+	int			sweep_current;
+
 	/*
 	 * Array of buffer numbers.  InvalidBuffer (that is, zero) indicates we
 	 * have not yet selected a buffer for this ring slot.  For allocation
@@ -156,6 +165,31 @@ ClockSweepTick(void)
 	return victim;
 }
 
+/*
+ * Some BufferAccessStrategies support eager flushing -- which is flushing
+ * buffers in the ring before they are needed. This can lead to better I/O
+ * patterns than lazily flushing buffers directly before reusing them.
+ */
+bool
+strategy_supports_eager_flush(BufferAccessStrategy strategy)
+{
+	Assert(strategy);
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKWRITE:
+			return true;
+		case BAS_VACUUM:
+		case BAS_NORMAL:
+		case BAS_BULKREAD:
+			return false;
+		default:
+			elog(ERROR, "unrecognized buffer access strategy: %d",
+				 (int) strategy->btype);
+			return false;
+	}
+}
+
 /*
  * StrategyGetBuffer
  *
@@ -270,6 +304,35 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	}
 }
 
+/*
+ * Return the next buffer in the ring or InvalidBuffer if the current sweep is
+ * over.
+ */
+Buffer
+StrategySweepNextBuffer(BufferAccessStrategy strategy)
+{
+	strategy->sweep_current++;
+	if (strategy->sweep_current >= strategy->nbuffers)
+		strategy->sweep_current = 0;
+
+	if (strategy->sweep_current == strategy->sweep_start)
+		return InvalidBuffer;
+
+	return strategy->buffers[strategy->sweep_current];
+}
+
+/*
+ * Start a sweep of the strategy ring.
+ */
+void
+StartStrategySweep(BufferAccessStrategy strategy)
+{
+	if (!strategy)
+		return;
+	strategy->sweep_start = strategy->sweep_current = strategy->current;
+}
+
+
 /*
  * StrategySyncStart -- tell BgBufferSync where to start syncing
  *
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b1b81f31419..7963d1189a6 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -437,6 +437,9 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 
 /* freelist.c */
+extern bool strategy_supports_eager_flush(BufferAccessStrategy strategy);
+extern Buffer StrategySweepNextBuffer(BufferAccessStrategy strategy);
+extern void StartStrategySweep(BufferAccessStrategy strategy);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
-- 
2.43.0
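
The sweep bookkeeping in the patch above can be tried out in isolation. The following is a rough standalone model of StartStrategySweep() and StrategySweepNextBuffer(), assuming a ring identified only by slot indexes; `RingSweep` and the `ring_sweep_*` functions are hypothetical names, and the model returns slot numbers (or -1) rather than Buffer values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the sweep fields added to the strategy struct. */
typedef struct RingSweep
{
	int			nbuffers;		/* number of slots in the ring */
	int			current;		/* slot of the victim being replaced */
	int			sweep_start;	/* slot where this sweep began */
	int			sweep_current;	/* slot most recently handed out */
} RingSweep;

/* Begin a sweep at the current victim slot. */
static void
ring_sweep_start(RingSweep *ring)
{
	assert(ring != NULL && ring->nbuffers > 0);
	assert(ring->current >= 0 && ring->current < ring->nbuffers);
	ring->sweep_start = ring->sweep_current = ring->current;
}

/* Return the next slot to consider, or -1 once the sweep is back at its start. */
static int
ring_sweep_next(RingSweep *ring)
{
	ring->sweep_current++;
	if (ring->sweep_current >= ring->nbuffers)
		ring->sweep_current = 0;	/* wrap around the ring */

	if (ring->sweep_current == ring->sweep_start)
		return -1;

	return ring->sweep_current;
}

/* Count the slots one full sweep visits (the start slot is not revisited). */
static int
ring_sweep_count(RingSweep *ring)
{
	int			visited = 0;

	ring_sweep_start(ring);
	while (ring_sweep_next(ring) != -1)
		visited++;
	return visited;
}
```

Because the sweep stops when it reaches the slot it started from, a ring of N slots yields at most N - 1 additional candidates per sweep, and a one-slot ring yields none.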

v3-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch (text/x-patch; charset=US-ASCII)

From 8fd1a6a68d80e3473ce34f86b0ebb38c15641bab Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:00:44 -0400
Subject: [PATCH v3 1/9] Refactor goto into for loop in GetVictimBuffer()

GetVictimBuffer() implemented a loop to optimistically lock a clean
victim buffer using a goto. Future commits will add batch flushing
functionality to GetVictimBuffer. The new logic works better with a
regular for loop flow control.

This commit is only a refactor and does not introduce any new
functionality.
---
 src/backend/storage/buffer/bufmgr.c   | 200 ++++++++++++--------------
 src/backend/storage/buffer/freelist.c |  17 +++
 src/include/storage/buf_internals.h   |   5 +
 3 files changed, 116 insertions(+), 106 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fe470de63f2..f3668051574 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -68,10 +68,6 @@
 #include "utils/timestamp.h"
 
 
-/* Note: these two macros only work on shared buffers, not local ones! */
-#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
-#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
-
 /* Note: this macro only works on local buffers, not shared ones! */
 #define LocalBufHdrGetBlock(bufHdr) \
 	LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]
@@ -2344,130 +2340,122 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
-	/* we return here if a prospective victim buffer gets used concurrently */
-again:
-
-	/*
-	 * Select a victim buffer.  The buffer is returned with its header
-	 * spinlock still held!
-	 */
-	buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
-	buf = BufferDescriptorGetBuffer(buf_hdr);
-
-	Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
-
-	/* Pin the buffer and then release the buffer spinlock */
-	PinBuffer_Locked(buf_hdr);
-
-	/*
-	 * We shouldn't have any other pins for this buffer.
-	 */
-	CheckBufferIsPinnedOnce(buf);
-
-	/*
-	 * If the buffer was dirty, try to write it out.  There is a race
-	 * condition here, in that someone might dirty it after we released the
-	 * buffer header lock above, or even while we are writing it out (since
-	 * our share-lock won't prevent hint-bit updates).  We will recheck the
-	 * dirty bit after re-locking the buffer header.
-	 */
-	if (buf_state & BM_DIRTY)
+	/* Select a victim buffer using an optimistic locking scheme. */
+	for (;;)
 	{
-		LWLock	   *content_lock;
+		/*
+		 * Attempt to claim a victim buffer.  The buffer is returned with its
+		 * header spinlock still held!
+		 */
+		buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+		buf = BufferDescriptorGetBuffer(buf_hdr);
 
-		Assert(buf_state & BM_TAG_VALID);
-		Assert(buf_state & BM_VALID);
+		Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+
+		/* Pin the buffer and then release the buffer spinlock */
+		PinBuffer_Locked(buf_hdr);
 
 		/*
-		 * We need a share-lock on the buffer contents to write it out (else
-		 * we might write invalid data, eg because someone else is compacting
-		 * the page contents while we write).  We must use a conditional lock
-		 * acquisition here to avoid deadlock.  Even though the buffer was not
-		 * pinned (and therefore surely not locked) when StrategyGetBuffer
-		 * returned it, someone else could have pinned and exclusive-locked it
-		 * by the time we get here. If we try to get the lock unconditionally,
-		 * we'd block waiting for them; if they later block waiting for us,
-		 * deadlock ensues. (This has been observed to happen when two
-		 * backends are both trying to split btree index pages, and the second
-		 * one just happens to be trying to split the page the first one got
-		 * from StrategyGetBuffer.)
+		 * We shouldn't have any other pins for this buffer.
 		 */
-		content_lock = BufferDescriptorGetContentLock(buf_hdr);
-		if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
-		{
-			/*
-			 * Someone else has locked the buffer, so give it up and loop back
-			 * to get another one.
-			 */
-			UnpinBuffer(buf_hdr);
-			goto again;
-		}
+		CheckBufferIsPinnedOnce(buf);
 
 		/*
-		 * If using a nondefault strategy, and writing the buffer would
-		 * require a WAL flush, let the strategy decide whether to go ahead
-		 * and write/reuse the buffer or to choose another victim.  We need a
-		 * lock to inspect the page LSN, so this can't be done inside
-		 * StrategyGetBuffer.
+		 * If the buffer was dirty, try to write it out.  There is a race
+		 * condition here, in that someone might dirty it after we released
+		 * the buffer header lock above, or even while we are writing it out
+		 * (since our share-lock won't prevent hint-bit updates).  We will
+		 * recheck the dirty bit after re-locking the buffer header.
 		 */
-		if (strategy != NULL)
+		if (buf_state & BM_DIRTY)
 		{
-			XLogRecPtr	lsn;
+			LWLock	   *content_lock;
 
-			/* Read the LSN while holding buffer header lock */
-			buf_state = LockBufHdr(buf_hdr);
-			lsn = BufferGetLSN(buf_hdr);
-			UnlockBufHdr(buf_hdr, buf_state);
+			Assert(buf_state & BM_TAG_VALID);
+			Assert(buf_state & BM_VALID);
 
-			if (XLogNeedsFlush(lsn)
-				&& StrategyRejectBuffer(strategy, buf_hdr, from_ring))
+			/*
+			 * We need a share-lock on the buffer contents to write it out
+			 * (else we might write invalid data, eg because someone else is
+			 * compacting the page contents while we write).  We must use a
+			 * conditional lock acquisition here to avoid deadlock.  Even
+			 * though the buffer was not pinned (and therefore surely not
+			 * locked) when StrategyGetBuffer returned it, someone else could
+			 * have pinned and exclusive-locked it by the time we get here. If
+			 * we try to get the lock unconditionally, we'd block waiting for
+			 * them; if they later block waiting for us, deadlock ensues.
+			 * (This has been observed to happen when two backends are both
+			 * trying to split btree index pages, and the second one just
+			 * happens to be trying to split the page the first one got from
+			 * StrategyGetBuffer.)
+			 */
+			content_lock = BufferDescriptorGetContentLock(buf_hdr);
+			if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				/*
+				 * Someone else has locked the buffer, so give it up and loop
+				 * back to get another one.
+				 */
+				UnpinBuffer(buf_hdr);
+				continue;
+			}
+
+			/*
+			 * If using a nondefault strategy, and writing the buffer would
+			 * require a WAL flush, let the strategy decide whether to go
+			 * ahead and write/reuse the buffer or to choose another victim.
+			 * We need the content lock to inspect the page LSN, so this can't
+			 * be done inside StrategyGetBuffer.
+			 */
+			if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
 			{
 				LWLockRelease(content_lock);
 				UnpinBuffer(buf_hdr);
-				goto again;
+				continue;
 			}
-		}
 
-		/* OK, do the I/O */
-		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-		LWLockRelease(content_lock);
+			/* OK, do the I/O */
+			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
+			LWLockRelease(content_lock);
 
-		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-									  &buf_hdr->tag);
-	}
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &buf_hdr->tag);
+		}
 
+		if (buf_state & BM_VALID)
+		{
+			/*
+			 * When a BufferAccessStrategy is in use, blocks evicted from
+			 * shared buffers are counted as IOOP_EVICT in the corresponding
+			 * context (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted
+			 * by a strategy in two cases: 1) while initially claiming buffers
+			 * for the strategy ring 2) to replace an existing strategy ring
+			 * buffer because it is pinned or in use and cannot be reused.
+			 *
+			 * Blocks evicted from buffers already in the strategy ring are
+			 * counted as IOOP_REUSE in the corresponding strategy context.
+			 *
+			 * At this point, we can accurately count evictions and reuses,
+			 * because we have successfully claimed the valid buffer.
+			 * Previously, we may have been forced to release the buffer due
+			 * to concurrent pinners or erroring out.
+			 */
+			pgstat_count_io_op(IOOBJECT_RELATION, io_context,
+							   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
+		}
 
-	if (buf_state & BM_VALID)
-	{
 		/*
-		 * When a BufferAccessStrategy is in use, blocks evicted from shared
-		 * buffers are counted as IOOP_EVICT in the corresponding context
-		 * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
-		 * strategy in two cases: 1) while initially claiming buffers for the
-		 * strategy ring 2) to replace an existing strategy ring buffer
-		 * because it is pinned or in use and cannot be reused.
-		 *
-		 * Blocks evicted from buffers already in the strategy ring are
-		 * counted as IOOP_REUSE in the corresponding strategy context.
-		 *
-		 * At this point, we can accurately count evictions and reuses,
-		 * because we have successfully claimed the valid buffer. Previously,
-		 * we may have been forced to release the buffer due to concurrent
-		 * pinners or erroring out.
+		 * If the buffer has an entry in the buffer mapping table, delete it.
+		 * This can fail because another backend could have pinned or dirtied
+		 * the buffer. Then loop around and try again.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, io_context,
-						   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
-	}
+		if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
+		{
+			UnpinBuffer(buf_hdr);
+			continue;
+		}
 
-	/*
-	 * If the buffer has an entry in the buffer mapping table, delete it. This
-	 * can fail because another backend could have pinned or dirtied the
-	 * buffer.
-	 */
-	if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
-	{
-		UnpinBuffer(buf_hdr);
-		goto again;
+		break;
 	}
 
 	/* a final set of sanity checks */
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7d59a92bd1a..a90a7ed4e16 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "pgstat.h"
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
@@ -716,12 +717,21 @@ IOContextForStrategy(BufferAccessStrategy strategy)
  * be written out and doing so would require flushing WAL too.  This gives us
  * a chance to choose a different victim.
  *
+ * The buffer must be pinned and content locked and the buffer header spinlock
+ * must not be held. We must have the content lock to examine the LSN.
+ *
  * Returns true if buffer manager should ask for a new victim, and false
  * if this buffer should be written and re-used.
  */
 bool
 StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+
+	if (!strategy)
+		return false;
+
 	/* We only do this in bulkread mode */
 	if (strategy->btype != BAS_BULKREAD)
 		return false;
@@ -731,6 +741,13 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
 		return false;
 
+	buf_state = LockBufHdr(buf);
+	lsn = BufferGetLSN(buf);
+	UnlockBufHdr(buf, buf_state);
+
+	if (XLogNeedsFlush(lsn))
+		return true;
+
 	/*
 	 * Remove the dirty buffer from the ring; necessary to prevent infinite
 	 * loop if all ring members are dirty.
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index dfd614f7ca4..b1b81f31419 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -419,6 +419,11 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 /*
  * Internal buffer management routines
  */
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
+
 /* bufmgr.c */
 extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
-- 
2.43.0
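
The control-flow change in the patch above, replacing `goto again` with `continue` inside `for (;;)`, can be illustrated with a small standalone model. This is only a sketch: `Candidate` and `claim_victim()` are hypothetical, the candidate list is finite (the real clock sweep is not), and the model returns from inside the loop where the patch uses `break`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical candidate: the ways a claimed victim can still be lost. */
typedef struct Candidate
{
	bool		dirty;				/* must be written before reuse */
	bool		content_lock_busy;	/* conditional lock acquisition would fail */
	bool		invalidate_fails;	/* someone pinned or dirtied it meanwhile */
} Candidate;

/*
 * Return the index of the first candidate that can be claimed, or -1 if the
 * list is exhausted. *nwrites counts the buffers that had to be written.
 */
static int
claim_victim(const Candidate *cands, size_t ncands, int *nwrites)
{
	size_t		next = 0;

	for (;;)
	{
		const Candidate *cand;

		if (next == ncands)
			return -1;			/* model only: no more candidates */
		cand = &cands[next++];

		if (cand->dirty)
		{
			if (cand->content_lock_busy)
				continue;		/* previously "goto again" */
			(*nwrites)++;		/* stands in for flushing the buffer */
		}

		if (cand->invalidate_fails)
			continue;			/* previously "goto again" */

		return (int) (next - 1);
	}
}
```

Each `continue` abandons the current candidate and starts the next iteration with a fresh one, which is the behavior the old label provided.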

v3-0002-Split-FlushBuffer-into-two-parts.patch (text/x-patch; charset=US-ASCII)

From 971486560f20be7ef484bd8b63ebb762a59812a7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:32:24 -0400
Subject: [PATCH v3 2/9] Split FlushBuffer() into two parts

Before adding write combining to write a batch of blocks when flushing
dirty buffers, refactor FlushBuffer() into the preparatory step and
actual buffer flushing step. This provides better symmetry with the
batch flushing code.
---
 src/backend/storage/buffer/bufmgr.c | 103 ++++++++++++++++++++--------
 1 file changed, 76 insertions(+), 27 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f3668051574..84ff5e0f1bf 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -529,8 +529,13 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
 static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
 static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc, uint32 *buf_state, XLogRecPtr *lsn);
+static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+						  IOContext io_context, XLogRecPtr buffer_lsn);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+static void CleanVictimBuffer(BufferAccessStrategy strategy,
+							  BufferDesc *bufdesc, uint32 *buf_state, bool from_ring);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2414,12 +2419,7 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 				continue;
 			}
 
-			/* OK, do the I/O */
-			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-			LWLockRelease(content_lock);
-
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &buf_hdr->tag);
+			CleanVictimBuffer(strategy, buf_hdr, &buf_state, from_ring);
 		}
 
 		if (buf_state & BM_VALID)
@@ -4246,20 +4246,81 @@ static void
 FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 			IOContext io_context)
 {
-	XLogRecPtr	recptr;
-	ErrorContextCallback errcallback;
-	instr_time	io_start;
-	Block		bufBlock;
-	char	   *bufToWrite;
 	uint32		buf_state;
+	XLogRecPtr	lsn;
+
+	if (PrepareFlushBuffer(buf, &buf_state, &lsn))
+		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
+}
+
+/*
+ * Prepare to write and write a dirty victim buffer.
+ */
+static void
+CleanVictimBuffer(BufferAccessStrategy strategy,
+				  BufferDesc *bufdesc, uint32 *buf_state, bool from_ring)
+{
+
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	LWLock	   *content_lock;
+	IOContext	io_context = IOContextForStrategy(strategy);
+
+	Assert(*buf_state & BM_DIRTY);
+
+	/* Set up this victim buffer to be flushed */
+	if (!PrepareFlushBuffer(bufdesc, buf_state, &max_lsn))
+		return;
 
+	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	LWLockRelease(content_lock);
+	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+								  &bufdesc->tag);
+}
+
+/*
+ * Prepare the buffer with bufdesc for writing. buf_state and lsn are output
+ * parameters. Returns true if the buffer actually needs writing and false
+ * otherwise.
+ */
+static bool
+PrepareFlushBuffer(BufferDesc *bufdesc, uint32 *buf_state, XLogRecPtr *lsn)
+{
 	/*
 	 * Try to start an I/O operation.  If StartBufferIO returns false, then
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false, false))
-		return;
+	if (!StartBufferIO(bufdesc, false, false))
+		return false;
+
+	*lsn = InvalidXLogRecPtr;
+	*buf_state = LockBufHdr(bufdesc);
+
+	/*
+	 * Run PageGetLSN while holding header lock, since we don't have the
+	 * buffer locked exclusively in all cases.
+	 */
+	if (*buf_state & BM_PERMANENT)
+		*lsn = BufferGetLSN(bufdesc);
+
+	/* To check if block content changes while flushing. - vadim 01/17/97 */
+	*buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, *buf_state);
+	return true;
+}
+
+/*
+ * Actually do the write I/O to clean a buffer.
+ */
+static void
+DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+			  IOContext io_context, XLogRecPtr buffer_lsn)
+{
+	ErrorContextCallback errcallback;
+	instr_time	io_start;
+	Block		bufBlock;
+	char	   *bufToWrite;
 
 	/* Setup error traceback support for ereport() */
 	errcallback.callback = shared_buffer_write_error_callback;
@@ -4277,18 +4338,6 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 										reln->smgr_rlocator.locator.dbOid,
 										reln->smgr_rlocator.locator.relNumber);
 
-	buf_state = LockBufHdr(buf);
-
-	/*
-	 * Run PageGetLSN while holding header lock, since we don't have the
-	 * buffer locked exclusively in all cases.
-	 */
-	recptr = BufferGetLSN(buf);
-
-	/* To check if block content changes while flushing. - vadim 01/17/97 */
-	buf_state &= ~BM_JUST_DIRTIED;
-	UnlockBufHdr(buf, buf_state);
-
 	/*
 	 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
 	 * rule that log updates must hit disk before any of the data-file changes
@@ -4306,8 +4355,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * disastrous system-wide consequences.  To make sure that can't happen,
 	 * skip the flush if the buffer isn't permanent.
 	 */
-	if (buf_state & BM_PERMANENT)
-		XLogFlush(recptr);
+	if (!XLogRecPtrIsInvalid(buffer_lsn))
+		XLogFlush(buffer_lsn);
 
 	/*
 	 * Now it's safe to write the buffer to disk. Note that no one else should
-- 
2.43.0
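
The prepare/do split in the patch above can be summarized with a toy model. It assumes a buffer reduced to three fields, an LSN type where 0 means invalid, and a WAL flush pointer passed in by the caller; `ModelBuffer`, `prepare_flush()`, `do_flush()` and `flush_buffer()` are hypothetical names, and locking, checksums and the smgr write are all omitted.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* hypothetical stand-in for XLogRecPtr */
#define INVALID_LSN ((Lsn) 0)

typedef struct ModelBuffer
{
	bool		dirty;			/* still needs writing */
	bool		permanent;		/* WAL-logged relation */
	Lsn			page_lsn;		/* LSN of the last change to the page */
} ModelBuffer;

/*
 * Preparatory step: decide whether a write is needed and capture the LSN.
 * The LSN stays invalid for non-permanent buffers so that the second step
 * never flushes WAL on their behalf.
 */
static bool
prepare_flush(const ModelBuffer *buf, Lsn *lsn)
{
	if (!buf->dirty)
		return false;			/* someone else already wrote it */

	*lsn = buf->permanent ? buf->page_lsn : INVALID_LSN;
	return true;
}

/* Write step: WAL must be durable up to lsn before the data page is written. */
static void
do_flush(ModelBuffer *buf, Lsn lsn, Lsn *wal_flushed_upto, int *nwrites)
{
	if (lsn != INVALID_LSN && lsn > *wal_flushed_upto)
		*wal_flushed_upto = lsn;	/* stands in for XLogFlush(lsn) */

	(*nwrites)++;				/* stands in for the data file write */
	buf->dirty = false;
}

/* The single-buffer path is the two steps back to back. */
static bool
flush_buffer(ModelBuffer *buf, Lsn *wal_flushed_upto, int *nwrites)
{
	Lsn			lsn;

	if (!prepare_flush(buf, &lsn))
		return false;
	do_flush(buf, lsn, wal_flushed_upto, nwrites);
	return true;
}
```

Splitting the steps this way lets a batching caller run the preparatory step on several buffers, track the maximum LSN, and pay for a single WAL flush before issuing the writes.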

v3-0005-Fix-XLogNeedsFlush-for-checkpointer.patch (text/x-patch; charset=US-ASCII)

From b9a2335879f5bd3066bcf6df73ac5e14f631f390 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 10:01:17 -0400
Subject: [PATCH v3 5/9] Fix XLogNeedsFlush() for checkpointer

XLogNeedsFlush() takes an LSN and compares it to either the flush pointer or the
min recovery point, depending on whether it is in normal operation or recovery.

Even though it is technically recovery, the checkpointer must flush WAL during
an end-of-recovery checkpoint, so in this case, it should compare the provided
LSN to the flush pointer and not the min recovery point.

If it compares the LSN to the min recovery point when the control file's min
recovery point has been updated to an incorrect value, XLogNeedsFlush() can
return an incorrect result of true -- even after just having flushed WAL.

Change this to compare the LSN to the min recovery point (and, potentially,
update the local copy of the min recovery point) only when xlog inserts are
not allowed. Inserts are allowed for the checkpointer during an
end-of-recovery checkpoint, but not otherwise during crash recovery.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g%40mail.gmail.com
---
 src/backend/access/transam/xlog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7ffb2179151..16ef6d2cd64 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3115,7 +3115,7 @@ XLogNeedsFlush(XLogRecPtr record)
 	 * instead. So "needs flush" is taken to mean whether minRecoveryPoint
 	 * would need to be updated.
 	 */
-	if (RecoveryInProgress())
+	if (RecoveryInProgress() && !XLogInsertAllowed())
 	{
 		/*
 		 * An invalid minRecoveryPoint means that we need to recover all the
-- 
2.43.0
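
The one-line change above amounts to choosing which pointer the LSN is compared against. A simplified sketch of that choice follows, assuming plain integer LSNs and ignoring the refresh of the local min recovery point and the invalid-minRecoveryPoint case handled by the real function; `needs_flush()` and its parameters are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* hypothetical stand-in for XLogRecPtr */

/*
 * Model of the patched decision: compare against the min recovery point only
 * while in recovery with WAL inserts disallowed; otherwise compare against
 * the WAL flush pointer.
 */
static bool
needs_flush(Lsn record, bool recovery_in_progress, bool insert_allowed,
			Lsn flush_ptr, Lsn min_recovery_point)
{
	if (recovery_in_progress && !insert_allowed)
		return record > min_recovery_point;

	/* Normal running, or checkpointer in an end-of-recovery checkpoint. */
	return record > flush_ptr;
}
```

With the extra condition, a checkpointer that has just flushed WAL up to `record` during an end-of-recovery checkpoint gets false, even if the min recovery point is stale.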

v3-0004-Write-combining-for-BAS_BULKWRITE.patch (text/x-patch; charset=US-ASCII)

From d484f180d0193c77cf6034751fa4e8c81833a605 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 12:56:38 -0400
Subject: [PATCH v3 4/9] Write combining for BAS_BULKWRITE

Implement write combining for users of the bulkwrite buffer access
strategy (e.g. COPY FROM). When the buffer access strategy needs to
clean a buffer for reuse, it already opportunistically flushes some
other buffers. Now, combine any contiguous blocks from the same relation
into larger writes and issue them with smgrwritev().

The performance benefit for COPY FROM is mostly noticeable for multiple
concurrent COPY FROMs because a single COPY FROM is either CPU bound or
bound by WAL writes.

The infrastructure for flushing larger batches of IOs will be reused by
checkpointer and other processes doing writes of dirty data.
---
 src/backend/storage/buffer/bufmgr.c   | 198 ++++++++++++++++++++++++--
 src/backend/storage/buffer/freelist.c |  26 ++++
 src/backend/storage/page/bufpage.c    |  20 +++
 src/backend/utils/probes.d            |   2 +
 src/include/storage/buf_internals.h   |  32 +++++
 src/include/storage/bufpage.h         |   1 +
 src/tools/pgindent/typedefs.list      |   1 +
 7 files changed, 269 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 90f36a04c19..ade83adca59 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -539,6 +539,8 @@ static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber re
 												   RelFileLocator *rlocator,
 												   bool skip_pinned,
 												   XLogRecPtr *max_lsn);
+static void FindFlushAdjacents(BufferAccessStrategy strategy, BufferDesc *start,
+							   uint32 max_batch_size, BufWriteBatch *batch);
 static void CleanVictimBuffer(BufferAccessStrategy strategy,
 							  BufferDesc *bufdesc, uint32 *buf_state, bool from_ring);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
@@ -4258,10 +4260,73 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+
+/*
+ * Given a buffer descriptor, start, from a strategy ring, strategy, that
+ * supports eager flushing, find additional buffers from the ring that can be
+ * combined into a single write batch with this buffer.
+ *
+ * max_batch_size is the maximum number of blocks that can be combined into a
+ * single write in general. This function, based on the block number of start,
+ * will determine the maximum IO size for this particular write given how much
+ * of the file remains. max_batch_size is provided by the caller so it doesn't
+ * have to be recalculated for each write.
+ *
+ * batch is an output parameter that this function will fill with the needed
+ * information to write this IO.
+ *
+ * This function will pin and content lock all of the buffers that it
+ * assembles for the IO batch. The caller is responsible for issuing the IO.
+ */
+static void
+FindFlushAdjacents(BufferAccessStrategy strategy, BufferDesc *start,
+				   uint32 max_batch_size, BufWriteBatch *batch)
+{
+	BlockNumber limit;
+	uint32		buf_state;
+
+	Assert(start);
+	batch->bufdescs[0] = start;
+
+	buf_state = LockBufHdr(start);
+	batch->max_lsn = BufferGetLSN(start);
+	UnlockBufHdr(start, buf_state);
+
+	batch->start = batch->bufdescs[0]->tag.blockNum;
+	batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+	batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
+	batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+	Assert(BlockNumberIsValid(batch->start));
+
+	limit = smgrmaxcombine(batch->reln, batch->forkno, batch->start);
+	limit = Max(limit, 1);
+	limit = Min(max_batch_size, limit);
+
+	/* Now assemble a run of blocks to write out. */
+	for (batch->n = 1; batch->n < limit; batch->n++)
+	{
+		Buffer		bufnum;
+
+		if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
+			break;
+
+		/* Stop when we encounter a buffer that will break the run */
+		if ((batch->bufdescs[batch->n] =
+			 PrepareOrRejectEagerFlushBuffer(bufnum,
+											 batch->start + batch->n,
+											 &batch->rlocator,
+											 true,
+											 &batch->max_lsn)) == NULL)
+			break;
+	}
+}
+
 /*
  * Returns the buffer descriptor of the buffer containing the next block we
  * should eagerly flush or or NULL when there are no further buffers to
- * consider writing out.
+ * consider writing out. This will be the start of a new batch of buffers to
+ * write out.
  */
 static BufferDesc *
 next_strat_buf_to_flush(BufferAccessStrategy strategy,
@@ -4293,7 +4358,6 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
-	bool		first_buffer = true;
 	IOContext	io_context = IOContextForStrategy(strategy);
 
 	Assert(*buf_state & BM_DIRTY);
@@ -4304,19 +4368,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
 	if (from_ring && strategy_supports_eager_flush(strategy))
 	{
+		uint32		max_batch_size = max_write_batch_size_for_strategy(strategy);
+
+		/* Pin our victim again so it stays ours even after batch released */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+		IncrBufferRefCount(BufferDescriptorGetBuffer(bufdesc));
+
 		/* Clean victim buffer and find more to flush opportunistically */
 		StartStrategySweep(strategy);
 		do
 		{
-			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
-			content_lock = BufferDescriptorGetContentLock(bufdesc);
-			LWLockRelease(content_lock);
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &bufdesc->tag);
-			/* We leave the first buffer pinned for the caller */
-			if (!first_buffer)
-				UnpinBuffer(bufdesc);
-			first_buffer = false;
+			BufWriteBatch batch;
+
+			FindFlushAdjacents(strategy, bufdesc, max_batch_size, &batch);
+			FlushBufferBatch(&batch, io_context);
+			CompleteWriteBatchIO(&batch, &BackendWritebackContext, io_context);
 		} while ((bufdesc = next_strat_buf_to_flush(strategy, &max_lsn)) != NULL);
 	}
 	else
@@ -4438,6 +4505,73 @@ except_unlock_header:
 	return NULL;
 }
 
+/*
+ * Given a prepared batch of buffers, write them out as a vector.
+ */
+void
+FlushBufferBatch(BufWriteBatch *batch,
+				 IOContext io_context)
+{
+	BlockNumber blknums[MAX_IO_COMBINE_LIMIT];
+	Block		blocks[MAX_IO_COMBINE_LIMIT];
+	instr_time	io_start;
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+
+	if (!XLogRecPtrIsInvalid(batch->max_lsn))
+		XLogFlush(batch->max_lsn);
+
+	if (batch->reln == NULL)
+		batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+#ifdef USE_ASSERT_CHECKING
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		BufferDesc *bufdesc = batch->bufdescs[i];
+		uint32		buf_state = LockBufHdr(bufdesc);
+		XLogRecPtr	lsn = BufferGetLSN(bufdesc);
+
+		UnlockBufHdr(bufdesc, buf_state);
+		Assert(!(buf_state & BM_PERMANENT) || !XLogNeedsFlush(lsn));
+	}
+#endif
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_START(batch->forkno,
+											  batch->reln->smgr_rlocator.locator.spcOid,
+											  batch->reln->smgr_rlocator.locator.dbOid,
+											  batch->reln->smgr_rlocator.locator.relNumber,
+											  batch->reln->smgr_rlocator.backend,
+											  batch->n);
+
+	/*
+	 * XXX: All blocks should be copied and then checksummed but doing so
+	 * takes a lot of extra memory and a future patch will eliminate this
+	 * requirement.
+	 */
+	for (BlockNumber i = 0; i < batch->n; i++)
+	{
+		blknums[i] = batch->start + i;
+		blocks[i] = BufHdrGetBlock(batch->bufdescs[i]);
+	}
+
+	PageSetBatchChecksumInplace((Page *) blocks, blknums, batch->n);
+
+	io_start = pgstat_prepare_io_time(track_io_timing);
+
+	smgrwritev(batch->reln, batch->forkno,
+			   batch->start, (const void **) blocks, batch->n, false);
+
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context, IOOP_WRITE,
+							io_start, batch->n, BLCKSZ);
+
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Prepare the buffer with budesc for writing. buf_state and lsn are output
  * parameters. Returns true if the buffer acutally needs writing and false
@@ -4583,6 +4717,48 @@ DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	error_context_stack = errcallback.previous;
 }
 
+/*
+ * Given a previously initialized batch with buffers that have already been
+ * flushed, terminate the IO on each buffer and then unlock and unpin them.
+ * This assumes all the buffers were locked and pinned. wb_context will be
+ * modified.
+ */
+void
+CompleteWriteBatchIO(BufWriteBatch *batch,
+					 WritebackContext *wb_context, IOContext io_context)
+{
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+	pgBufferUsage.shared_blks_written += batch->n;
+
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		Buffer		buffer = BufferDescriptorGetBuffer(batch->bufdescs[i]);
+
+		errcallback.arg = batch->bufdescs[i];
+
+		/* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state. */
+		TerminateBufferIO(batch->bufdescs[i], true, 0, true, false);
+		LWLockRelease(BufferDescriptorGetContentLock(batch->bufdescs[i]));
+		ReleaseBuffer(buffer);
+		ScheduleBufferTagForWriteback(wb_context, io_context,
+									  &batch->bufdescs[i]->tag);
+	}
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_DONE(batch->forkno,
+											 batch->reln->smgr_rlocator.locator.spcOid,
+											 batch->reln->smgr_rlocator.locator.dbOid,
+											 batch->reln->smgr_rlocator.locator.relNumber,
+											 batch->reln->smgr_rlocator.backend,
+											 batch->n, batch->start);
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * RelationGetNumberOfBlocksInFork
  *		Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index e26a546bc99..1c94e95bf66 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -727,6 +727,32 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	return NULL;
 }
 
+
+/*
+ * Determine the largest IO we can assemble from the given strategy ring given
+ * strategy-specific as well as global constraints on the number of pinned
+ * buffers and max IO size.
+ */
+uint32
+max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
+{
+	uint32		max_possible_buffer_limit;
+	uint32		max_write_batch_size;
+	int			strategy_pin_limit;
+
+	max_write_batch_size = io_combine_limit;
+
+	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+	max_possible_buffer_limit = GetPinLimit();
+
+	max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+	max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+	max_write_batch_size = Max(1, max_write_batch_size);
+	max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
+	Assert(max_write_batch_size <= MAX_IO_COMBINE_LIMIT);
+	return max_write_batch_size;
+}
+
 /*
  * AddBufferToRing -- add a buffer to the buffer ring
  *
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index dbb49ed9197..fc749dd5a50 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
 	((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
 }
+
+/*
+ * A helper to set checksums for multiple blocks.
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, BlockNumber *blknos, uint32 length)
+{
+	/* If we don't need a checksum, just return */
+	if (!DataChecksumsEnabled())
+		return;
+
+	for (uint32 i = 0; i < length; i++)
+	{
+		Page		page = pages[i];
+
+		if (PageIsNew(page))
+			continue;
+		((PageHeader) page)->pd_checksum = pg_checksum_page(page, blknos[i]);
+	}
+}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index e9e413477ba..36dd4f8375b 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -61,6 +61,8 @@ provider postgresql {
 	probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
 	probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
 	probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
+	probe buffer__batch__flush__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
+	probe buffer__batch__flush__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
 
 	probe buffer__checkpoint__start(int);
 	probe buffer__checkpoint__sync__start();
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 7963d1189a6..d1f0ecb7ca4 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -416,6 +416,34 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 	ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
 }
 
+/*
+ * Used to write out multiple blocks at a time in a combined IO. bufdescs
+ * contains buffer descriptors for buffers containing adjacent blocks of the
+ * same fork of the same relation.
+ */
+typedef struct BufWriteBatch
+{
+	RelFileLocator rlocator;
+	ForkNumber	forkno;
+	SMgrRelation reln;
+
+	/*
+	 * The BlockNumber of the first block in the run of contiguous blocks to
+	 * be written out as a single IO.
+	 */
+	BlockNumber start;
+
+	/*
+	 * While assembling the buffers, we keep track of the maximum LSN so that
+	 * we can flush WAL through this LSN before flushing the buffers.
+	 */
+	XLogRecPtr	max_lsn;
+
+	/* The number of valid buffers in bufdescs */
+	uint32		n;
+	BufferDesc *bufdescs[MAX_IO_COMBINE_LIMIT];
+} BufWriteBatch;
+
 /*
  * Internal buffer management routines
  */
@@ -429,6 +457,7 @@ extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
 										  IOContext io_context, BufferTag *tag);
+extern void FlushBufferBatch(BufWriteBatch *batch, IOContext io_context);
 
 /* solely to make it easier to write tests */
 extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
@@ -438,8 +467,11 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 /* freelist.c */
 extern bool strategy_supports_eager_flush(BufferAccessStrategy strategy);
+extern uint32 max_write_batch_size_for_strategy(BufferAccessStrategy strategy);
 extern Buffer StrategySweepNextBuffer(BufferAccessStrategy strategy);
 extern void StartStrategySweep(BufferAccessStrategy strategy);
+extern void CompleteWriteBatchIO(BufWriteBatch *batch, WritebackContext *wb_context,
+								 IOContext io_context);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index aeb67c498c5..1020cb3ac78 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -507,5 +507,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 									Item newtup, Size newsize);
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern void PageSetBatchChecksumInplace(Page *pages, BlockNumber *blknos, uint32 length);
 
 #endif							/* BUFPAGE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..9492adeee58 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -349,6 +349,7 @@ BufferManagerRelation
 BufferStrategyControl
 BufferTag
 BufferUsage
+BufWriteBatch
 BuildAccumulator
 BuiltinScript
 BulkInsertState
-- 
2.43.0

v3-0007-Implement-checkpointer-data-write-combining.patch (text/x-patch; charset=US-ASCII)

From ad6c3c0c01870fc83739daa14197a0359359b079 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:42:29 -0400
Subject: [PATCH v3 7/9] Implement checkpointer data write combining

When the checkpointer writes out dirty buffers, writing multiple
contiguous blocks as a single IO is a substantial performance
improvement. The checkpointer is usually bottlenecked on IO, so issuing
larger IOs leads to increased write throughput and faster checkpoints.
---
 src/backend/storage/buffer/bufmgr.c | 232 ++++++++++++++++++++++++----
 src/backend/utils/probes.d          |   2 +-
 2 files changed, 207 insertions(+), 27 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5ab40a09960..8de669a39f3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -512,6 +512,7 @@ static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
+static uint32 checkpointer_max_batch_size(void);
 static void BufferSync(int flags);
 static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
@@ -3312,7 +3313,6 @@ UnpinBufferNoOwner(BufferDesc *buf)
 static void
 BufferSync(int flags)
 {
-	uint32		buf_state;
 	int			buf_id;
 	int			num_to_scan;
 	int			num_spaces;
@@ -3324,6 +3324,8 @@ BufferSync(int flags)
 	int			i;
 	int			mask = BM_DIRTY;
 	WritebackContext wb_context;
+	uint32		max_batch_size;
+	BufWriteBatch batch;
 
 	/*
 	 * Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3354,6 +3356,7 @@ BufferSync(int flags)
 	for (buf_id = 0; buf_id < NBuffers; buf_id++)
 	{
 		BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+		uint32		buf_state;
 
 		/*
 		 * Header spinlock is enough to examine BM_DIRTY, see comment in
@@ -3494,48 +3497,208 @@ BufferSync(int flags)
 	 */
 	num_processed = 0;
 	num_written = 0;
+	max_batch_size = checkpointer_max_batch_size();
 	while (!binaryheap_empty(ts_heap))
 	{
+		BlockNumber limit = max_batch_size;
 		BufferDesc *bufHdr = NULL;
 		CkptTsStatus *ts_stat = (CkptTsStatus *)
 			DatumGetPointer(binaryheap_first(ts_heap));
-
-		buf_id = CkptBufferIds[ts_stat->index].buf_id;
-		Assert(buf_id != -1);
-
-		bufHdr = GetBufferDescriptor(buf_id);
-
-		num_processed++;
+		int			ts_end = ts_stat->index - ts_stat->num_scanned + ts_stat->num_to_scan;
+		int			processed = 0;
 
 		/*
-		 * We don't need to acquire the lock here, because we're only looking
-		 * at a single bit. It's possible that someone else writes the buffer
-		 * and clears the flag right after we check, but that doesn't matter
-		 * since SyncOneBuffer will then do nothing.  However, there is a
-		 * further race condition: it's conceivable that between the time we
-		 * examine the bit here and the time SyncOneBuffer acquires the lock,
-		 * someone else not only wrote the buffer but replaced it with another
-		 * page and dirtied it.  In that improbable case, SyncOneBuffer will
-		 * write the buffer though we didn't need to.  It doesn't seem worth
-		 * guarding against this, though.
+		 * Each batch will have exactly one start and one max lsn and one
+		 * length.
 		 */
-		if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
+		batch.start = InvalidBlockNumber;
+		batch.max_lsn = InvalidXLogRecPtr;
+		batch.n = 0;
+
+		while (batch.n < limit)
 		{
-			if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+			uint32		buf_state;
+			XLogRecPtr	lsn = InvalidXLogRecPtr;
+			LWLock	   *content_lock;
+			CkptSortItem item;
+
+			if (ProcSignalBarrierPending)
+				ProcessProcSignalBarrier();
+
+			/* Check if we are done with this tablespace */
+			if (ts_stat->index + processed >= ts_end)
+				break;
+
+			item = CkptBufferIds[ts_stat->index + processed];
+
+			buf_id = item.buf_id;
+			Assert(buf_id != -1);
+
+			bufHdr = GetBufferDescriptor(buf_id);
+
+			/*
+			 * If this is the first block of the batch, then check if we need
+			 * to open a new relation. Open the relation now because we have
+			 * to determine the maximum IO size based on how many blocks
+			 * remain in the file.
+			 */
+			if (!BlockNumberIsValid(batch.start))
+			{
+				Assert(batch.max_lsn == InvalidXLogRecPtr && batch.n == 0);
+				batch.rlocator.spcOid = item.tsId;
+				batch.rlocator.dbOid = item.db_id;
+				batch.rlocator.relNumber = item.relNumber;
+				batch.forkno = item.forkNum;
+				batch.start = item.blockNum;
+				batch.reln = smgropen(batch.rlocator, INVALID_PROC_NUMBER);
+				limit = smgrmaxcombine(batch.reln, batch.forkno, batch.start);
+				limit = Max(1, limit);
+				limit = Min(limit, max_batch_size);
+			}
+
+			/*
+			 * Once we hit blocks from the next relation or fork of the
+			 * relation, break out of the loop and issue the IO we've built up
+			 * so far. It is important that we don't increment processed
+			 * because we want to start the next IO with this item.
+			 */
+			if (item.db_id != batch.rlocator.dbOid)
+				break;
+
+			if (item.relNumber != batch.rlocator.relNumber)
+				break;
+
+			if (item.forkNum != batch.forkno)
+				break;
+
+			/*
+			 * If the next block is not contiguous, we can't include it in the
+			 * IO we will issue. Break out of the loop and issue what we have
+			 * so far. Do not count this item as processed -- otherwise we
+			 * will end up skipping it.
+			 */
+			if (item.blockNum != batch.start + batch.n)
+				break;
+
+			/*
+			 * We don't need to acquire the lock here, because we're only
+			 * looking at a single bit. It's possible that someone else writes
+			 * the buffer and clears the flag right after we check, but that
+			 * doesn't matter since StartBufferIO will then return false. If
+			 * the buffer doesn't need checkpointing, don't include it in the
+			 * batch we are building. We're done with the item, so count it as
+			 * processed and break out of the loop to issue the IO we have
+			 * built so far.
+			 */
+			if (!(pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED))
+			{
+				processed++;
+				break;
+			}
+
+			ReservePrivateRefCountEntry();
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+
+			buf_state = LockBufHdr(bufHdr);
+
+			/*
+			 * If the buffer doesn't need eviction, we're done with the item,
+			 * so count it as processed and break out of the loop to issue the
+			 * IO so far.
+			 */
+			if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
 			{
-				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-				PendingCheckpointerStats.buffers_written++;
-				num_written++;
+				processed++;
+				UnlockBufHdr(bufHdr, buf_state);
+				break;
+			}
+
+			PinBuffer_Locked(bufHdr);
+
+			/*
+			 * There is a race condition here: it's conceivable that between
+			 * the time we examine the buffer header for BM_CHECKPOINT_NEEDED
+			 * above and the time we acquire the content lock below, someone
+			 * else not only wrote the buffer but replaced it with another page
+			 * and dirtied it.  In that improbable case, we will write the
+			 * buffer though we didn't need to.  It doesn't seem worth guarding
+			 * against this, though.
+			 */
+			content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+			/*
+			 * We are willing to wait for the content lock on the first IO in
+			 * the batch. However, for subsequent IOs, waiting could lead to
+			 * deadlock. We have to eventually flush all eligible buffers,
+			 * though. So, if we fail to acquire the lock on a subsequent
+			 * buffer, we break out and issue the IO we've built up so far.
+			 * Then we come back and start a new IO with that buffer as the
+			 * starting buffer. As such, we must not count the item as
+			 * processed if we end up failing to acquire the content lock.
+			 */
+			if (batch.n == 0)
+				LWLockAcquire(content_lock, LW_SHARED);
+			else if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				UnpinBuffer(bufHdr);
+				break;
+			}
+
+			/*
+			 * If the buffer doesn't need IO, count the item as processed,
+			 * release the buffer, and break out of the loop to issue the IO
+			 * we have built up so far.
+			 */
+			if (!StartBufferIO(bufHdr, false, true))
+			{
+				processed++;
+				LWLockRelease(content_lock);
+				UnpinBuffer(bufHdr);
+				break;
 			}
+
+			buf_state = LockBufHdr(bufHdr);
+			lsn = BufferGetLSN(bufHdr);
+			buf_state &= ~BM_JUST_DIRTIED;
+			UnlockBufHdr(bufHdr, buf_state);
+
+			/*
+			 * Keep track of the max LSN so that we can be sure to flush
+			 * enough WAL before flushing data from the buffers. See comment
+			 * in DoFlushBuffer() for more on why we don't consider the LSNs
+			 * of unlogged relations.
+			 */
+			if (buf_state & BM_PERMANENT && lsn > batch.max_lsn)
+				batch.max_lsn = lsn;
+
+			batch.bufdescs[batch.n++] = bufHdr;
+			processed++;
 		}
 
 		/*
 		 * Measure progress independent of actually having to flush the buffer
 		 * - otherwise writing become unbalanced.
 		 */
-		ts_stat->progress += ts_stat->progress_slice;
-		ts_stat->num_scanned++;
-		ts_stat->index++;
+		num_processed += processed;
+		ts_stat->progress += ts_stat->progress_slice * processed;
+		ts_stat->num_scanned += processed;
+		ts_stat->index += processed;
+
+		/*
+		 * If we built up an IO, issue it. There's a chance we didn't find any
+		 * items referencing buffers that needed flushing this time, but we
+		 * still want to check if we should update the heap if we examined and
+		 * processed the items.
+		 */
+		if (batch.n > 0)
+		{
+			FlushBufferBatch(&batch, IOCONTEXT_NORMAL);
+			CompleteWriteBatchIO(&batch, &wb_context, IOCONTEXT_NORMAL);
+
+			TRACE_POSTGRESQL_BUFFER_BATCH_SYNC_WRITTEN(batch.n);
+			PendingCheckpointerStats.buffers_written += batch.n;
+			num_written += batch.n;
+		}
 
 		/* Have all the buffers from the tablespace been processed? */
 		if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -4261,6 +4424,23 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * The maximum number of blocks that can be written out in a single batch by
+ * the checkpointer.
+ */
+static uint32
+checkpointer_max_batch_size(void)
+{
+	uint32		result;
+	uint32		pin_limit = GetPinLimit();
+
+	result = Max(pin_limit, 1);
+	result = Min(result, io_combine_limit);
+	result = Max(result, 1);
+	Assert(result <= MAX_IO_COMBINE_LIMIT);
+	return result;
+}
+
 
 /*
  * Given a buffer descriptor, start, from a strategy ring, strategy, that
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 36dd4f8375b..d6970731ba9 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -68,7 +68,7 @@ provider postgresql {
 	probe buffer__checkpoint__sync__start();
 	probe buffer__checkpoint__done();
 	probe buffer__sync__start(int, int);
-	probe buffer__sync__written(int);
+	probe buffer__batch__sync__written(BlockNumber);
 	probe buffer__sync__done(int, int, int);
 
 	probe deadlock__found();
-- 
2.43.0

v3-0006-Add-database-Oid-to-CkptSortItem.patch (text/x-patch; charset=US-ASCII)

From b1ba9f17e7f2746fa5a4c2ae86db20384b068632 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:22:11 -0400
Subject: [PATCH v3 6/9] Add database Oid to CkptSortItem

This is useful for checkpointer write combining -- which will be added
in a future commit.
---
 src/backend/storage/buffer/bufmgr.c | 8 ++++++++
 src/include/storage/buf_internals.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ade83adca59..5ab40a09960 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3370,6 +3370,7 @@ BufferSync(int flags)
 			item = &CkptBufferIds[num_to_scan++];
 			item->buf_id = buf_id;
 			item->tsId = bufHdr->tag.spcOid;
+			item->db_id = bufHdr->tag.dbOid;
 			item->relNumber = BufTagGetRelNumber(&bufHdr->tag);
 			item->forkNum = BufTagGetForkNum(&bufHdr->tag);
 			item->blockNum = bufHdr->tag.blockNum;
@@ -6689,6 +6690,13 @@ ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b)
 		return -1;
 	else if (a->tsId > b->tsId)
 		return 1;
+
+	/* compare database */
+	if (a->db_id < b->db_id)
+		return -1;
+	else if (a->db_id > b->db_id)
+		return 1;
+
 	/* compare relation */
 	if (a->relNumber < b->relNumber)
 		return -1;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index d1f0ecb7ca4..291cc31da06 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -382,6 +382,7 @@ UnlockBufHdr(BufferDesc *desc, uint32 buf_state)
 typedef struct CkptSortItem
 {
 	Oid			tsId;
+	Oid			db_id;
 	RelFileNumber relNumber;
 	ForkNumber	forkNum;
 	BlockNumber blockNum;
-- 
2.43.0

Melanie Plageman

melanieplageman@gmail.com

4 months ago

In reply to: Melanie Plageman (#4)

7 attachment(s)

Re: Checkpointer write combining

On Tue, Sep 9, 2025 at 9:39 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

> Oops, you're right. v3 attached with that mistake fixed.

One more fix and a bit more cleanup in attached v4.

- Melanie

Attachments:

v4-0003-Eagerly-flush-bulkwrite-strategy-ring.patch (text/x-patch; charset=US-ASCII)

From fef00f5bd61dc0e3cac95fa39a75b3d73ba7d0a6 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 12:43:24 -0400
Subject: [PATCH v4 3/9] Eagerly flush bulkwrite strategy ring

Operations using BAS_BULKWRITE (COPY FROM and createdb) will inevitably
need to flush buffers in the strategy ring in order to reuse
them. By eagerly flushing the buffers in a larger batch, we encourage
larger writes at the kernel level and less interleaving of WAL flushes
and data file writes. The effect is mainly noticeable with multiple
parallel COPY FROMs. In this case, client backends achieve higher write
throughput and end up spending less time waiting on acquiring the lock
to flush WAL. Larger flush operations also mean less time waiting for
flush operations at the kernel level.

The heuristic for eager eviction is to flush only those buffers in the
strategy ring that can be written out without first flushing WAL.

This patch is also a stepping stone toward AIO writes.

Earlier version
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 172 +++++++++++++++++++++++++-
 src/backend/storage/buffer/freelist.c |  63 ++++++++++
 src/include/storage/buf_internals.h   |   3 +
 3 files changed, 233 insertions(+), 5 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 27cc418ef61..d6aafddf883 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -534,7 +534,13 @@ static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object
 						  IOContext io_context, XLogRecPtr buffer_lsn);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
-static void CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
+static BufferDesc *next_strat_buf_to_flush(BufferAccessStrategy strategy, XLogRecPtr *lsn);
+static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+												   RelFileLocator *rlocator,
+												   bool skip_pinned,
+												   XLogRecPtr *max_lsn);
+static void CleanVictimBuffer(BufferAccessStrategy strategy,
+							  BufferDesc *bufdesc, uint32 *buf_state,
 							  bool from_ring, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
@@ -4253,17 +4259,44 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * Returns the buffer descriptor of the buffer containing the next block we
+ * should eagerly flush or NULL when there are no further buffers to
+ * consider writing out.
+ */
+static BufferDesc *
+next_strat_buf_to_flush(BufferAccessStrategy strategy,
+						XLogRecPtr *lsn)
+{
+	Buffer		bufnum;
+	BufferDesc *bufdesc;
+
+	while ((bufnum = StrategySweepNextBuffer(strategy)) != InvalidBuffer)
+	{
+		if ((bufdesc = PrepareOrRejectEagerFlushBuffer(bufnum,
+													   InvalidBlockNumber,
+													   NULL,
+													   true,
+													   lsn)) != NULL)
+			return bufdesc;
+	}
+
+	return NULL;
+}
+
 /*
  * Prepare to write and write a dirty victim buffer.
  * bufdesc and buf_state may be modified.
  */
 static void
-CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
+CleanVictimBuffer(BufferAccessStrategy strategy,
+				  BufferDesc *bufdesc, uint32 *buf_state,
 				  bool from_ring, IOContext io_context)
 {
 
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
+	bool		first_buffer = true;
 
 	Assert(*buf_state & BM_DIRTY);
 
@@ -4271,11 +4304,140 @@ CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
 	if (!PrepareFlushBuffer(bufdesc, buf_state, &max_lsn))
 		return;
 
-	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	if (from_ring && strategy_supports_eager_flush(strategy))
+	{
+		/* Clean victim buffer and find more to flush opportunistically */
+		StartStrategySweep(strategy);
+		do
+		{
+			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+			content_lock = BufferDescriptorGetContentLock(bufdesc);
+			LWLockRelease(content_lock);
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &bufdesc->tag);
+			/* We leave the first buffer pinned for the caller */
+			if (!first_buffer)
+				UnpinBuffer(bufdesc);
+			first_buffer = false;
+		} while ((bufdesc = next_strat_buf_to_flush(strategy, &max_lsn)) != NULL);
+	}
+	else
+	{
+		DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+		content_lock = BufferDescriptorGetContentLock(bufdesc);
+		LWLockRelease(content_lock);
+		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+									  &bufdesc->tag);
+	}
+}
+
+/*
+ * Prepare bufdesc for eager flushing.
+ *
+ * Given bufnum, returns the buffer descriptor of the buffer that we will
+ * opportunistically flush, or NULL if this buffer does not contain a block
+ * that should be flushed.
+ *
+ * require is the BlockNumber required by the caller. Some callers may require
+ * a specific BlockNumber to be in bufnum because they are assembling a
+ * contiguous run of blocks.
+ *
+ * If the caller needs the block to be from a specific relation, rlocator will
+ * be provided.
+ */
+BufferDesc *
+PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+								RelFileLocator *rlocator, bool skip_pinned,
+								XLogRecPtr *max_lsn)
+{
+	BufferDesc *bufdesc;
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+	BlockNumber blknum;
+	LWLock	   *content_lock;
+
+	if (!BufferIsValid(bufnum))
+		return NULL;
+
+	Assert(!BufferIsLocal(bufnum));
+
+	bufdesc = GetBufferDescriptor(bufnum - 1);
+
+	/* Block may need to be in a specific relation */
+	if (rlocator &&
+		!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufdesc->tag),
+							  *rlocator))
+		return NULL;
+
+	/* Must do this before taking the buffer header spinlock. */
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	ReservePrivateRefCountEntry();
+
+	buf_state = LockBufHdr(bufdesc);
+
+	if (!(buf_state & BM_DIRTY) || !(buf_state & BM_VALID))
+		goto except_unlock_header;
+
+	/* We don't include used buffers in batches */
+	if (skip_pinned &&
+		(BUF_STATE_GET_REFCOUNT(buf_state) > 0 ||
+		 BUF_STATE_GET_USAGECOUNT(buf_state) > 1))
+		goto except_unlock_header;
+
+	/* Get page LSN while holding header lock */
+	lsn = BufferGetLSN(bufdesc);
+
+	PinBuffer_Locked(bufdesc);
+	CheckBufferIsPinnedOnce(bufnum);
+
+	blknum = BufferGetBlockNumber(bufnum);
+	Assert(BlockNumberIsValid(blknum));
+
+	/* If we'll have to flush WAL to flush the block, we're done */
+	if (buf_state & BM_PERMANENT && XLogNeedsFlush(lsn))
+		goto except_unpin_buffer;
+
+	/* We only include contiguous blocks in the run */
+	if (BlockNumberIsValid(require) && blknum != require)
+		goto except_unpin_buffer;
+
 	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+		goto except_unpin_buffer;
+
+	/*
+	 * Now that we have the content lock, we need to recheck if we need to
+	 * flush WAL.
+	 */
+	buf_state = LockBufHdr(bufdesc);
+	lsn = BufferGetLSN(bufdesc);
+	UnlockBufHdr(bufdesc, buf_state);
+
+	if (buf_state & BM_PERMANENT && XLogNeedsFlush(lsn))
+		goto except_unlock_content;
+
+	/* Try to start an I/O operation. */
+	if (!StartBufferIO(bufdesc, false, true))
+		goto except_unlock_content;
+
+	if (lsn > *max_lsn)
+		*max_lsn = lsn;
+	buf_state = LockBufHdr(bufdesc);
+	buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, buf_state);
+
+	return bufdesc;
+
+except_unlock_content:
 	LWLockRelease(content_lock);
-	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-								  &bufdesc->tag);
+
+except_unpin_buffer:
+	UnpinBuffer(bufdesc);
+	return NULL;
+
+except_unlock_header:
+	UnlockBufHdr(bufdesc, buf_state);
+	return NULL;
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index ce95afe2e94..025592778f7 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -75,6 +75,15 @@ typedef struct BufferAccessStrategyData
 	 */
 	int			current;
 
+	/*
+	 * If the strategy supports eager flushing, we may initiate a sweep of the
+	 * strategy ring, flushing all the dirty buffers we can cheaply flush.
+	 * sweep_start and sweep_current keep track of a given sweep so we don't
+	 * loop around the ring infinitely.
+	 */
+	int			sweep_start;
+	int			sweep_current;
+
 	/*
 	 * Array of buffer numbers.  InvalidBuffer (that is, zero) indicates we
 	 * have not yet selected a buffer for this ring slot.  For allocation
@@ -156,6 +165,31 @@ ClockSweepTick(void)
 	return victim;
 }
 
+/*
+ * Some BufferAccessStrategies support eager flushing -- which is flushing
+ * buffers in the ring before they are needed. This can lead to better I/O
+ * patterns than lazily flushing buffers directly before reusing them.
+ */
+bool
+strategy_supports_eager_flush(BufferAccessStrategy strategy)
+{
+	Assert(strategy);
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKWRITE:
+			return true;
+		case BAS_VACUUM:
+		case BAS_NORMAL:
+		case BAS_BULKREAD:
+			return false;
+		default:
+			elog(ERROR, "unrecognized buffer access strategy: %d",
+				 (int) strategy->btype);
+			return false;
+	}
+}
+
 /*
  * StrategyGetBuffer
  *
@@ -270,6 +304,35 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	}
 }
 
+/*
+ * Return the next buffer in the ring or InvalidBuffer if the current sweep is
+ * over.
+ */
+Buffer
+StrategySweepNextBuffer(BufferAccessStrategy strategy)
+{
+	strategy->sweep_current++;
+	if (strategy->sweep_current >= strategy->nbuffers)
+		strategy->sweep_current = 0;
+
+	if (strategy->sweep_current == strategy->sweep_start)
+		return InvalidBuffer;
+
+	return strategy->buffers[strategy->sweep_current];
+}
+
+/*
+ * Start a sweep of the strategy ring.
+ */
+void
+StartStrategySweep(BufferAccessStrategy strategy)
+{
+	if (!strategy)
+		return;
+	strategy->sweep_start = strategy->sweep_current = strategy->current;
+}
+
+
 /*
  * StrategySyncStart -- tell BgBufferSync where to start syncing
  *
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b1b81f31419..7963d1189a6 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -437,6 +437,9 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 
 /* freelist.c */
+extern bool strategy_supports_eager_flush(BufferAccessStrategy strategy);
+extern Buffer StrategySweepNextBuffer(BufferAccessStrategy strategy);
+extern void StartStrategySweep(BufferAccessStrategy strategy);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
-- 
2.43.0
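
The sweep bookkeeping added to freelist.c above can be modeled in isolation. The following is only a minimal sketch, assuming a ring identified by slot indexes rather than Buffer numbers; RingSweep, ring_sweep_start(), ring_sweep_next() and INVALID_SLOT are hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>

#define INVALID_SLOT (-1)

/* Hypothetical stand-in for the sweep fields of BufferAccessStrategyData. */
typedef struct RingSweep
{
	int			nbuffers;		/* number of slots in the ring */
	int			current;		/* slot of the victim being reused */
	int			sweep_start;	/* slot where this sweep began */
	int			sweep_current;	/* slot most recently handed out */
} RingSweep;

/* Mirrors StartStrategySweep(): begin a sweep at the current victim slot. */
static void
ring_sweep_start(RingSweep *ring)
{
	assert(ring != NULL && ring->nbuffers > 0);
	ring->sweep_start = ring->sweep_current = ring->current;
}

/*
 * Mirrors StrategySweepNextBuffer(): advance one slot with wraparound and
 * report INVALID_SLOT once we are back at the slot where the sweep began.
 */
static int
ring_sweep_next(RingSweep *ring)
{
	ring->sweep_current++;
	if (ring->sweep_current >= ring->nbuffers)
		ring->sweep_current = 0;

	if (ring->sweep_current == ring->sweep_start)
		return INVALID_SLOT;

	return ring->sweep_current;
}
```

In this model the starting slot itself is never returned, which matches the patch: the victim buffer is flushed by the caller before the sweep hands out any other slot. A caller has to stop at the first INVALID_SLOT, because a further call would begin another lap around the ring.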

v4-0005-Fix-XLogNeedsFlush-for-checkpointer.patch (text/x-patch; charset=US-ASCII)

From cb5fd13213b1f66e897243828aae136f93481472 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 10:01:17 -0400
Subject: [PATCH v4 5/9] Fix XLogNeedsFlush() for checkpointer

XLogNeedsFlush() takes an LSN and compares it to either the flush pointer or the
min recovery point, depending on whether it is in normal operation or recovery.

Even though it is technically recovery, the checkpointer must flush WAL during
an end-of-recovery checkpoint, so in this case, it should compare the provided
LSN to the flush pointer and not the min recovery point.

If it compares the LSN to the min recovery point when the control file's min
recovery point has been updated to an incorrect value, XLogNeedsFlush() can
return an incorrect result of true -- even after just having flushed WAL.

Change this to only compare the LSN to min recovery point -- and potentially
update the local copy of min recovery point -- when xlog inserts are not
allowed. Inserts are allowed for the checkpointer during an end-of-recovery
checkpoint, but not otherwise during crash recovery.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g%40mail.gmail.com
---
 src/backend/access/transam/xlog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7ffb2179151..16ef6d2cd64 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3115,7 +3115,7 @@ XLogNeedsFlush(XLogRecPtr record)
 	 * instead. So "needs flush" is taken to mean whether minRecoveryPoint
 	 * would need to be updated.
 	 */
-	if (RecoveryInProgress())
+	if (RecoveryInProgress() && !XLogInsertAllowed())
 	{
 		/*
 		 * An invalid minRecoveryPoint means that we need to recover all the
-- 
2.43.0
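
The effect of the one-line change can be illustrated with a deliberately simplified model of the decision. This is a sketch only: WalState, Lsn and needs_flush() are hypothetical names, the two booleans stand in for RecoveryInProgress() and XLogInsertAllowed(), and the real XLogNeedsFlush() additionally refreshes its cached pointers and handles an invalid minRecoveryPoint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* hypothetical stand-in for XLogRecPtr */

/* Hypothetical snapshot of the state XLogNeedsFlush() consults. */
typedef struct WalState
{
	bool		recovery_in_progress;	/* models RecoveryInProgress() */
	bool		insert_allowed;			/* models XLogInsertAllowed() */
	Lsn			flush_ptr;				/* WAL flushed through here */
	Lsn			min_recovery_point;		/* control file's min recovery point */
} WalState;

/*
 * Simplified decision: only a process that is in recovery AND may not
 * insert WAL compares against the min recovery point. Everyone else,
 * including the checkpointer during an end-of-recovery checkpoint,
 * compares against the flush pointer.
 */
static bool
needs_flush(const WalState *ws, Lsn record)
{
	if (ws->recovery_in_progress && !ws->insert_allowed)
		return record > ws->min_recovery_point;

	return record > ws->flush_ptr;
}
```

With flush_ptr ahead of record but min_recovery_point behind it, the model answers false for the end-of-recovery checkpointer and true for a process that cannot insert WAL, which is the distinction the commit message describes.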

v4-0002-Split-FlushBuffer-into-two-parts.patch (text/x-patch; charset=US-ASCII)

From 2c8aafe30fb58516654e7d0cfdbfbb15a6a00498 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:32:24 -0400
Subject: [PATCH v4 2/9] Split FlushBuffer() into two parts

Before adding write combining to write a batch of blocks when flushing
dirty buffers, split FlushBuffer() into a preparatory step and the
actual buffer flushing step. This provides better symmetry with the
batch flushing code.
---
 src/backend/storage/buffer/bufmgr.c | 103 ++++++++++++++++++++--------
 1 file changed, 76 insertions(+), 27 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f3668051574..27cc418ef61 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -529,8 +529,13 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
 static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
 static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc, uint32 *buf_state, XLogRecPtr *lsn);
+static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+						  IOContext io_context, XLogRecPtr buffer_lsn);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+static void CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
+							  bool from_ring, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2414,12 +2419,7 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 				continue;
 			}
 
-			/* OK, do the I/O */
-			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-			LWLockRelease(content_lock);
-
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &buf_hdr->tag);
+			CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
 		}
 
 		if (buf_state & BM_VALID)
@@ -4246,20 +4246,81 @@ static void
 FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 			IOContext io_context)
 {
-	XLogRecPtr	recptr;
-	ErrorContextCallback errcallback;
-	instr_time	io_start;
-	Block		bufBlock;
-	char	   *bufToWrite;
 	uint32		buf_state;
+	XLogRecPtr	lsn;
+
+	if (PrepareFlushBuffer(buf, &buf_state, &lsn))
+		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
+}
+
+/*
+ * Prepare to write and write a dirty victim buffer.
+ * bufdesc and buf_state may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
+				  bool from_ring, IOContext io_context)
+{
+
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	LWLock	   *content_lock;
 
+	Assert(*buf_state & BM_DIRTY);
+
+	/* Set up this victim buffer to be flushed */
+	if (!PrepareFlushBuffer(bufdesc, buf_state, &max_lsn))
+		return;
+
+	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	LWLockRelease(content_lock);
+	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+								  &bufdesc->tag);
+}
+
+/*
+ * Prepare the buffer with bufdesc for writing. buf_state and lsn are output
+ * parameters. Returns true if the buffer actually needs writing and false
+ * otherwise. All three parameters may be modified.
+ */
+static bool
+PrepareFlushBuffer(BufferDesc *bufdesc, uint32 *buf_state, XLogRecPtr *lsn)
+{
 	/*
 	 * Try to start an I/O operation.  If StartBufferIO returns false, then
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false, false))
-		return;
+	if (!StartBufferIO(bufdesc, false, false))
+		return false;
+
+	*lsn = InvalidXLogRecPtr;
+	*buf_state = LockBufHdr(bufdesc);
+
+	/*
+	 * Run PageGetLSN while holding header lock, since we don't have the
+	 * buffer locked exclusively in all cases.
+	 */
+	if (*buf_state & BM_PERMANENT)
+		*lsn = BufferGetLSN(bufdesc);
+
+	/* To check if block content changes while flushing. - vadim 01/17/97 */
+	*buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, *buf_state);
+	return true;
+}
+
+/*
+ * Actually do the write I/O to clean a buffer. buf and reln may be modified.
+ */
+static void
+DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+			  IOContext io_context, XLogRecPtr buffer_lsn)
+{
+	ErrorContextCallback errcallback;
+	instr_time	io_start;
+	Block		bufBlock;
+	char	   *bufToWrite;
 
 	/* Setup error traceback support for ereport() */
 	errcallback.callback = shared_buffer_write_error_callback;
@@ -4277,18 +4338,6 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 										reln->smgr_rlocator.locator.dbOid,
 										reln->smgr_rlocator.locator.relNumber);
 
-	buf_state = LockBufHdr(buf);
-
-	/*
-	 * Run PageGetLSN while holding header lock, since we don't have the
-	 * buffer locked exclusively in all cases.
-	 */
-	recptr = BufferGetLSN(buf);
-
-	/* To check if block content changes while flushing. - vadim 01/17/97 */
-	buf_state &= ~BM_JUST_DIRTIED;
-	UnlockBufHdr(buf, buf_state);
-
 	/*
 	 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
 	 * rule that log updates must hit disk before any of the data-file changes
@@ -4306,8 +4355,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * disastrous system-wide consequences.  To make sure that can't happen,
 	 * skip the flush if the buffer isn't permanent.
 	 */
-	if (buf_state & BM_PERMANENT)
-		XLogFlush(recptr);
+	if (!XLogRecPtrIsInvalid(buffer_lsn))
+		XLogFlush(buffer_lsn);
 
 	/*
 	 * Now it's safe to write the buffer to disk. Note that no one else should
-- 
2.43.0
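
The prepare/do split can be sketched with a toy model. It is only an illustration under stated assumptions: ToyBuffer, ToyWal, prepare_flush() and do_flush() are hypothetical, the dirty flag stands in for the StartBufferIO() check, and raising flushed_upto stands in for XLogFlush(). What it is meant to show is that the LSN is captured once in the prepare step and that an invalid LSN (non-permanent buffer) skips the WAL flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* hypothetical stand-in for XLogRecPtr */
#define INVALID_LSN ((Lsn) 0)

typedef struct ToyBuffer
{
	bool		dirty;			/* buffer still needs writing */
	bool		permanent;		/* models BM_PERMANENT */
	Lsn			page_lsn;		/* LSN stamped on the page */
} ToyBuffer;

typedef struct ToyWal
{
	Lsn			flushed_upto;	/* WAL is durable through this LSN */
} ToyWal;

/* Prepare step: decide whether a write is needed and capture the LSN. */
static bool
prepare_flush(const ToyBuffer *buf, Lsn *lsn)
{
	if (!buf->dirty)
		return false;			/* someone else already flushed it */

	*lsn = buf->permanent ? buf->page_lsn : INVALID_LSN;
	return true;
}

/* Do step: honor WAL-before-data, then "write" the buffer. */
static void
do_flush(ToyBuffer *buf, ToyWal *wal, Lsn lsn)
{
	if (lsn != INVALID_LSN && wal->flushed_upto < lsn)
		wal->flushed_upto = lsn;	/* models XLogFlush(lsn) */

	/* The data-file write would happen here. */
	buf->dirty = false;
}
```

Separating the two steps is what lets a batch caller run the prepare step on several buffers, keep the maximum LSN, and flush WAL once before issuing a single combined write.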

v4-0004-Write-combining-for-BAS_BULKWRITE.patch (text/x-patch; charset=US-ASCII)

From dac227ef15ec46f5f8d9c918390147f9b00f3e29 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 12:56:38 -0400
Subject: [PATCH v4 4/9] Write combining for BAS_BULKWRITE

Implement write combining for users of the bulkwrite buffer access
strategy (e.g. COPY FROM). When the buffer access strategy needs to
clean a buffer for reuse, it already opportunistically flushes some
other buffers. Now, combine any contiguous blocks from the same relation
into larger writes and issue them with smgrwritev().

The performance benefit for COPY FROM is mostly noticeable for multiple
concurrent COPY FROMs because a single COPY FROM is either CPU bound or
bound by WAL writes.

The infrastructure for flushing larger batches of IOs will be reused by
checkpointer and other processes doing writes of dirty data.

XXX: Because this sets in-place checksums for batches, it is not
committable until additional infrastructure goes in place.
---
 src/backend/storage/buffer/bufmgr.c   | 200 ++++++++++++++++++++++++--
 src/backend/storage/buffer/freelist.c |  26 ++++
 src/backend/storage/page/bufpage.c    |  20 +++
 src/backend/utils/probes.d            |   2 +
 src/include/storage/buf_internals.h   |  32 +++++
 src/include/storage/bufpage.h         |   1 +
 src/tools/pgindent/typedefs.list      |   1 +
 7 files changed, 270 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d6aafddf883..98c03ef1b1a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -539,6 +539,8 @@ static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber re
 												   RelFileLocator *rlocator,
 												   bool skip_pinned,
 												   XLogRecPtr *max_lsn);
+static void FindFlushAdjacents(BufferAccessStrategy strategy, BufferDesc *start,
+							   uint32 max_batch_size, BufWriteBatch *batch);
 static void CleanVictimBuffer(BufferAccessStrategy strategy,
 							  BufferDesc *bufdesc, uint32 *buf_state,
 							  bool from_ring, IOContext io_context);
@@ -2425,7 +2427,7 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 				continue;
 			}
 
-			CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
+			CleanVictimBuffer(strategy, buf_hdr, &buf_state, from_ring, io_context);
 		}
 
 		if (buf_state & BM_VALID)
@@ -4259,10 +4261,73 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+
+/*
+ * Given start, a buffer descriptor from the ring of a strategy that supports
+ * eager flushing, find additional buffers from the ring that can be combined
+ * into a single write batch with this buffer.
+ *
+ * max_batch_size is the maximum number of blocks that can be combined into a
+ * single write in general. This function, based on the block number of start,
+ * will determine the maximum IO size for this particular write given how much
+ * of the file remains. max_batch_size is provided by the caller so it doesn't
+ * have to be recalculated for each write.
+ *
+ * batch is an output parameter that this function will fill with the needed
+ * information to write this IO.
+ *
+ * This function will pin and content lock all of the buffers that it
+ * assembles for the IO batch. The caller is responsible for issuing the IO.
+ */
+static void
+FindFlushAdjacents(BufferAccessStrategy strategy, BufferDesc *start,
+				   uint32 max_batch_size, BufWriteBatch *batch)
+{
+	BlockNumber limit;
+	uint32		buf_state;
+
+	Assert(start);
+	batch->bufdescs[0] = start;
+
+	buf_state = LockBufHdr(start);
+	batch->max_lsn = BufferGetLSN(start);
+	UnlockBufHdr(start, buf_state);
+
+	batch->start = batch->bufdescs[0]->tag.blockNum;
+	batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+	batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
+	batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+	Assert(BlockNumberIsValid(batch->start));
+
+	limit = smgrmaxcombine(batch->reln, batch->forkno, batch->start);
+	limit = Max(limit, 1);
+	limit = Min(max_batch_size, limit);
+
+	/* Now assemble a run of blocks to write out. */
+	for (batch->n = 1; batch->n < limit; batch->n++)
+	{
+		Buffer		bufnum;
+
+		if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
+			break;
+
+		/* Stop when we encounter a buffer that will break the run */
+		if ((batch->bufdescs[batch->n] =
+			 PrepareOrRejectEagerFlushBuffer(bufnum,
+											 batch->start + batch->n,
+											 &batch->rlocator,
+											 true,
+											 &batch->max_lsn)) == NULL)
+			break;
+	}
+}
+
 /*
  * Returns the buffer descriptor of the buffer containing the next block we
  * should eagerly flush or NULL when there are no further buffers to
- * consider writing out.
+ * consider writing out. This will be the start of a new batch of buffers to
+ * write out.
  */
 static BufferDesc *
 next_strat_buf_to_flush(BufferAccessStrategy strategy,
@@ -4296,7 +4361,6 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
-	bool		first_buffer = true;
 
 	Assert(*buf_state & BM_DIRTY);
 
@@ -4306,19 +4370,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
 	if (from_ring && strategy_supports_eager_flush(strategy))
 	{
+		uint32		max_batch_size = max_write_batch_size_for_strategy(strategy);
+
+		/* Pin our victim again so it stays ours after the batch is released */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+		IncrBufferRefCount(BufferDescriptorGetBuffer(bufdesc));
+
 		/* Clean victim buffer and find more to flush opportunistically */
 		StartStrategySweep(strategy);
 		do
 		{
-			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
-			content_lock = BufferDescriptorGetContentLock(bufdesc);
-			LWLockRelease(content_lock);
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &bufdesc->tag);
-			/* We leave the first buffer pinned for the caller */
-			if (!first_buffer)
-				UnpinBuffer(bufdesc);
-			first_buffer = false;
+			BufWriteBatch batch;
+
+			FindFlushAdjacents(strategy, bufdesc, max_batch_size, &batch);
+			FlushBufferBatch(&batch, io_context);
+			CompleteWriteBatchIO(&batch, io_context, &BackendWritebackContext);
 		} while ((bufdesc = next_strat_buf_to_flush(strategy, &max_lsn)) != NULL);
 	}
 	else
@@ -4440,6 +4507,73 @@ except_unlock_header:
 	return NULL;
 }
 
+/*
+ * Given a prepared batch of buffers, write them out as a vector.
+ */
+void
+FlushBufferBatch(BufWriteBatch *batch,
+				 IOContext io_context)
+{
+	BlockNumber blknums[MAX_IO_COMBINE_LIMIT];
+	Block		blocks[MAX_IO_COMBINE_LIMIT];
+	instr_time	io_start;
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+
+	if (!XLogRecPtrIsInvalid(batch->max_lsn))
+		XLogFlush(batch->max_lsn);
+
+	if (batch->reln == NULL)
+		batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+#ifdef USE_ASSERT_CHECKING
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		BufferDesc *bufdesc = batch->bufdescs[i];
+		uint32		buf_state = LockBufHdr(bufdesc);
+		XLogRecPtr	lsn = BufferGetLSN(bufdesc);
+
+		UnlockBufHdr(bufdesc, buf_state);
+		Assert(!(buf_state & BM_PERMANENT) || !XLogNeedsFlush(lsn));
+	}
+#endif
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_START(batch->forkno,
+											  batch->reln->smgr_rlocator.locator.spcOid,
+											  batch->reln->smgr_rlocator.locator.dbOid,
+											  batch->reln->smgr_rlocator.locator.relNumber,
+											  batch->reln->smgr_rlocator.backend,
+											  batch->n);
+
+	/*
+	 * XXX: All blocks should be copied and then checksummed but doing so
+	 * takes a lot of extra memory and a future patch will eliminate this
+	 * requirement.
+	 */
+	for (BlockNumber i = 0; i < batch->n; i++)
+	{
+		blknums[i] = batch->start + i;
+		blocks[i] = BufHdrGetBlock(batch->bufdescs[i]);
+	}
+
+	PageSetBatchChecksumInplace((Page *) blocks, blknums, batch->n);
+
+	io_start = pgstat_prepare_io_time(track_io_timing);
+
+	smgrwritev(batch->reln, batch->forkno,
+			   batch->start, (const void **) blocks, batch->n, false);
+
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context, IOOP_WRITE,
+							io_start, batch->n, BLCKSZ);
+
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Prepare the buffer with bufdesc for writing. buf_state and lsn are output
  * parameters. Returns true if the buffer actually needs writing and false
@@ -4585,6 +4719,48 @@ DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	error_context_stack = errcallback.previous;
 }
 
+/*
+ * Given a previously initialized batch with buffers that have already been
+ * flushed, terminate the IO on each buffer and then unlock and unpin them.
+ * This assumes all the buffers were locked and pinned. wb_context will be
+ * modified.
+ */
+void
+CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+					 WritebackContext *wb_context)
+{
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+	pgBufferUsage.shared_blks_written += batch->n;
+
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		Buffer		buffer = BufferDescriptorGetBuffer(batch->bufdescs[i]);
+
+		errcallback.arg = batch->bufdescs[i];
+
+		/* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state. */
+		TerminateBufferIO(batch->bufdescs[i], true, 0, true, false);
+		LWLockRelease(BufferDescriptorGetContentLock(batch->bufdescs[i]));
+		ReleaseBuffer(buffer);
+		ScheduleBufferTagForWriteback(wb_context, io_context,
+									  &batch->bufdescs[i]->tag);
+	}
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_DONE(batch->forkno,
+											 batch->reln->smgr_rlocator.locator.spcOid,
+											 batch->reln->smgr_rlocator.locator.dbOid,
+											 batch->reln->smgr_rlocator.locator.relNumber,
+											 batch->reln->smgr_rlocator.backend,
+											 batch->n, batch->start);
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * RelationGetNumberOfBlocksInFork
  *		Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 025592778f7..eadf2899a01 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -727,6 +727,32 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	return NULL;
 }
 
+
+/*
+ * Determine the largest IO we can assemble from the given strategy ring given
+ * strategy-specific as well as global constraints on the number of pinned
+ * buffers and max IO size.
+ */
+uint32
+max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
+{
+	uint32		max_possible_buffer_limit;
+	uint32		max_write_batch_size;
+	int			strategy_pin_limit;
+
+	max_write_batch_size = io_combine_limit;
+
+	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+	max_possible_buffer_limit = GetPinLimit();
+
+	max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+	max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+	max_write_batch_size = Max(1, max_write_batch_size);
+	max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
+	Assert(max_write_batch_size <= MAX_IO_COMBINE_LIMIT);
+	return max_write_batch_size;
+}
+
 /*
  * AddBufferToRing -- add a buffer to the buffer ring
  *
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index dbb49ed9197..fc749dd5a50 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
 	((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
 }
+
+/*
+ * A helper to set multiple blocks' checksums.
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, BlockNumber *blknos, uint32 length)
+{
+	/* If we don't need a checksum, just return */
+	if (!DataChecksumsEnabled())
+		return;
+
+	for (uint32 i = 0; i < length; i++)
+	{
+		Page		page = pages[i];
+
+		if (PageIsNew(page))
+			continue;
+		((PageHeader) page)->pd_checksum = pg_checksum_page(page, blknos[i]);
+	}
+}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index e9e413477ba..36dd4f8375b 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -61,6 +61,8 @@ provider postgresql {
 	probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
 	probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
 	probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
+	probe buffer__batch__flush__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
+	probe buffer__batch__flush__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
 
 	probe buffer__checkpoint__start(int);
 	probe buffer__checkpoint__sync__start();
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 7963d1189a6..c082a50166f 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -416,6 +416,34 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 	ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
 }
 
+/*
+ * Used to write out multiple blocks at a time in a combined IO. bufdescs
+ * contains buffer descriptors for buffers containing adjacent blocks of the
+ * same fork of the same relation.
+ */
+typedef struct BufWriteBatch
+{
+	RelFileLocator rlocator;
+	ForkNumber	forkno;
+	SMgrRelation reln;
+
+	/*
+	 * The BlockNumber of the first block in the run of contiguous blocks to
+	 * be written out as a single IO.
+	 */
+	BlockNumber start;
+
+	/*
+	 * While assembling the buffers, we keep track of the maximum LSN so that
+	 * we can flush WAL through this LSN before flushing the buffers.
+	 */
+	XLogRecPtr	max_lsn;
+
+	/* The number of valid buffers in bufdescs */
+	uint32		n;
+	BufferDesc *bufdescs[MAX_IO_COMBINE_LIMIT];
+} BufWriteBatch;
+
 /*
  * Internal buffer management routines
  */
@@ -429,6 +457,7 @@ extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
 										  IOContext io_context, BufferTag *tag);
+extern void FlushBufferBatch(BufWriteBatch *batch, IOContext io_context);
 
 /* solely to make it easier to write tests */
 extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
@@ -438,8 +467,11 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 /* freelist.c */
 extern bool strategy_supports_eager_flush(BufferAccessStrategy strategy);
+extern uint32 max_write_batch_size_for_strategy(BufferAccessStrategy strategy);
 extern Buffer StrategySweepNextBuffer(BufferAccessStrategy strategy);
 extern void StartStrategySweep(BufferAccessStrategy strategy);
+extern void CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+								 WritebackContext *wb_context);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index aeb67c498c5..1020cb3ac78 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -507,5 +507,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 									Item newtup, Size newsize);
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern void PageSetBatchChecksumInplace(Page *pages, BlockNumber *blknos, uint32 length);
 
 #endif							/* BUFPAGE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..9492adeee58 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -349,6 +349,7 @@ BufferManagerRelation
 BufferStrategyControl
 BufferTag
 BufferUsage
+BufWriteBatch
 BuildAccumulator
 BuiltinScript
 BulkInsertState
-- 
2.43.0

v4-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch (text/x-patch; charset=US-ASCII)

From dae7c82146c2d73729fc12a742d84b660e6db2ad Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:00:44 -0400
Subject: [PATCH v4 1/9] Refactor goto into for loop in GetVictimBuffer()

GetVictimBuffer() implemented a loop to optimistically lock a clean
victim buffer using a goto. Future commits will add batch flushing
functionality to GetVictimBuffer. The new logic works better with a
regular for loop flow control.

This commit is only a refactor and does not introduce any new
functionality.

Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_bcWRvRwZUop_d9vzF9nHAiT%2B-uPzkJ%3DS3ShZ1GqeAYOw%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 200 ++++++++++++--------------
 src/backend/storage/buffer/freelist.c |  32 ++++-
 src/include/storage/buf_internals.h   |   5 +
 3 files changed, 124 insertions(+), 113 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fe470de63f2..f3668051574 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -68,10 +68,6 @@
 #include "utils/timestamp.h"
 
 
-/* Note: these two macros only work on shared buffers, not local ones! */
-#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
-#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
-
 /* Note: this macro only works on local buffers, not shared ones! */
 #define LocalBufHdrGetBlock(bufHdr) \
 	LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]
@@ -2344,130 +2340,122 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
-	/* we return here if a prospective victim buffer gets used concurrently */
-again:
-
-	/*
-	 * Select a victim buffer.  The buffer is returned with its header
-	 * spinlock still held!
-	 */
-	buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
-	buf = BufferDescriptorGetBuffer(buf_hdr);
-
-	Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
-
-	/* Pin the buffer and then release the buffer spinlock */
-	PinBuffer_Locked(buf_hdr);
-
-	/*
-	 * We shouldn't have any other pins for this buffer.
-	 */
-	CheckBufferIsPinnedOnce(buf);
-
-	/*
-	 * If the buffer was dirty, try to write it out.  There is a race
-	 * condition here, in that someone might dirty it after we released the
-	 * buffer header lock above, or even while we are writing it out (since
-	 * our share-lock won't prevent hint-bit updates).  We will recheck the
-	 * dirty bit after re-locking the buffer header.
-	 */
-	if (buf_state & BM_DIRTY)
+	/* Select a victim buffer using an optimistic locking scheme. */
+	for (;;)
 	{
-		LWLock	   *content_lock;
+		/*
+		 * Attempt to claim a victim buffer.  The buffer is returned with its
+		 * header spinlock still held!
+		 */
+		buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+		buf = BufferDescriptorGetBuffer(buf_hdr);
 
-		Assert(buf_state & BM_TAG_VALID);
-		Assert(buf_state & BM_VALID);
+		Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+
+		/* Pin the buffer and then release the buffer spinlock */
+		PinBuffer_Locked(buf_hdr);
 
 		/*
-		 * We need a share-lock on the buffer contents to write it out (else
-		 * we might write invalid data, eg because someone else is compacting
-		 * the page contents while we write).  We must use a conditional lock
-		 * acquisition here to avoid deadlock.  Even though the buffer was not
-		 * pinned (and therefore surely not locked) when StrategyGetBuffer
-		 * returned it, someone else could have pinned and exclusive-locked it
-		 * by the time we get here. If we try to get the lock unconditionally,
-		 * we'd block waiting for them; if they later block waiting for us,
-		 * deadlock ensues. (This has been observed to happen when two
-		 * backends are both trying to split btree index pages, and the second
-		 * one just happens to be trying to split the page the first one got
-		 * from StrategyGetBuffer.)
+		 * We shouldn't have any other pins for this buffer.
 		 */
-		content_lock = BufferDescriptorGetContentLock(buf_hdr);
-		if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
-		{
-			/*
-			 * Someone else has locked the buffer, so give it up and loop back
-			 * to get another one.
-			 */
-			UnpinBuffer(buf_hdr);
-			goto again;
-		}
+		CheckBufferIsPinnedOnce(buf);
 
 		/*
-		 * If using a nondefault strategy, and writing the buffer would
-		 * require a WAL flush, let the strategy decide whether to go ahead
-		 * and write/reuse the buffer or to choose another victim.  We need a
-		 * lock to inspect the page LSN, so this can't be done inside
-		 * StrategyGetBuffer.
+		 * If the buffer was dirty, try to write it out.  There is a race
+		 * condition here, in that someone might dirty it after we released
+		 * the buffer header lock above, or even while we are writing it out
+		 * (since our share-lock won't prevent hint-bit updates).  We will
+		 * recheck the dirty bit after re-locking the buffer header.
 		 */
-		if (strategy != NULL)
+		if (buf_state & BM_DIRTY)
 		{
-			XLogRecPtr	lsn;
+			LWLock	   *content_lock;
 
-			/* Read the LSN while holding buffer header lock */
-			buf_state = LockBufHdr(buf_hdr);
-			lsn = BufferGetLSN(buf_hdr);
-			UnlockBufHdr(buf_hdr, buf_state);
+			Assert(buf_state & BM_TAG_VALID);
+			Assert(buf_state & BM_VALID);
 
-			if (XLogNeedsFlush(lsn)
-				&& StrategyRejectBuffer(strategy, buf_hdr, from_ring))
+			/*
+			 * We need a share-lock on the buffer contents to write it out
+			 * (else we might write invalid data, eg because someone else is
+			 * compacting the page contents while we write).  We must use a
+			 * conditional lock acquisition here to avoid deadlock.  Even
+			 * though the buffer was not pinned (and therefore surely not
+			 * locked) when StrategyGetBuffer returned it, someone else could
+			 * have pinned and exclusive-locked it by the time we get here. If
+			 * we try to get the lock unconditionally, we'd block waiting for
+			 * them; if they later block waiting for us, deadlock ensues.
+			 * (This has been observed to happen when two backends are both
+			 * trying to split btree index pages, and the second one just
+			 * happens to be trying to split the page the first one got from
+			 * StrategyGetBuffer.)
+			 */
+			content_lock = BufferDescriptorGetContentLock(buf_hdr);
+			if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				/*
+				 * Someone else has locked the buffer, so give it up and loop
+				 * back to get another one.
+				 */
+				UnpinBuffer(buf_hdr);
+				continue;
+			}
+
+			/*
+			 * If using a nondefault strategy, and writing the buffer would
+			 * require a WAL flush, let the strategy decide whether to go
+			 * ahead and write/reuse the buffer or to choose another victim.
+			 * We need the content lock to inspect the page LSN, so this can't
+			 * be done inside StrategyGetBuffer.
+			 */
+			if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
 			{
 				LWLockRelease(content_lock);
 				UnpinBuffer(buf_hdr);
-				goto again;
+				continue;
 			}
-		}
 
-		/* OK, do the I/O */
-		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-		LWLockRelease(content_lock);
+			/* OK, do the I/O */
+			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
+			LWLockRelease(content_lock);
 
-		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-									  &buf_hdr->tag);
-	}
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &buf_hdr->tag);
+		}
 
+		if (buf_state & BM_VALID)
+		{
+			/*
+			 * When a BufferAccessStrategy is in use, blocks evicted from
+			 * shared buffers are counted as IOOP_EVICT in the corresponding
+			 * context (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted
+			 * by a strategy in two cases: 1) while initially claiming buffers
+			 * for the strategy ring 2) to replace an existing strategy ring
+			 * buffer because it is pinned or in use and cannot be reused.
+			 *
+			 * Blocks evicted from buffers already in the strategy ring are
+			 * counted as IOOP_REUSE in the corresponding strategy context.
+			 *
+			 * At this point, we can accurately count evictions and reuses,
+			 * because we have successfully claimed the valid buffer.
+			 * Previously, we may have been forced to release the buffer due
+			 * to concurrent pinners or erroring out.
+			 */
+			pgstat_count_io_op(IOOBJECT_RELATION, io_context,
+							   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
+		}
 
-	if (buf_state & BM_VALID)
-	{
 		/*
-		 * When a BufferAccessStrategy is in use, blocks evicted from shared
-		 * buffers are counted as IOOP_EVICT in the corresponding context
-		 * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
-		 * strategy in two cases: 1) while initially claiming buffers for the
-		 * strategy ring 2) to replace an existing strategy ring buffer
-		 * because it is pinned or in use and cannot be reused.
-		 *
-		 * Blocks evicted from buffers already in the strategy ring are
-		 * counted as IOOP_REUSE in the corresponding strategy context.
-		 *
-		 * At this point, we can accurately count evictions and reuses,
-		 * because we have successfully claimed the valid buffer. Previously,
-		 * we may have been forced to release the buffer due to concurrent
-		 * pinners or erroring out.
+		 * If the buffer has an entry in the buffer mapping table, delete it.
+		 * This can fail because another backend could have pinned or dirtied
+		 * the buffer. Then loop around and try again.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, io_context,
-						   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
-	}
+		if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
+		{
+			UnpinBuffer(buf_hdr);
+			continue;
+		}
 
-	/*
-	 * If the buffer has an entry in the buffer mapping table, delete it. This
-	 * can fail because another backend could have pinned or dirtied the
-	 * buffer.
-	 */
-	if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
-	{
-		UnpinBuffer(buf_hdr);
-		goto again;
+		break;
 	}
 
 	/* a final set of sanity checks */
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7d59a92bd1a..ce95afe2e94 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "pgstat.h"
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
@@ -716,12 +717,21 @@ IOContextForStrategy(BufferAccessStrategy strategy)
  * be written out and doing so would require flushing WAL too.  This gives us
  * a chance to choose a different victim.
  *
+ * The buffer must be pinned and content locked and the buffer header
+ * spinlock must not be held. We must have the content lock to examine the LSN.
+ *
  * Returns true if buffer manager should ask for a new victim, and false
- * if this buffer should be written and re-used.
+ * if this buffer should be written and reused.
  */
 bool
 StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+
+	if (!strategy)
+		return false;
+
 	/* We only do this in bulkread mode */
 	if (strategy->btype != BAS_BULKREAD)
 		return false;
@@ -731,11 +741,19 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
 		return false;
 
-	/*
-	 * Remove the dirty buffer from the ring; necessary to prevent infinite
-	 * loop if all ring members are dirty.
-	 */
-	strategy->buffers[strategy->current] = InvalidBuffer;
+	buf_state = LockBufHdr(buf);
+	lsn = BufferGetLSN(buf);
+	UnlockBufHdr(buf, buf_state);
+
+	if (XLogNeedsFlush(lsn))
+	{
+		/*
+		 * Remove the dirty buffer from the ring; necessary to prevent an
+		 * infinite loop if all ring members are dirty.
+		 */
+		strategy->buffers[strategy->current] = InvalidBuffer;
+		return true;
+	}
 
-	return true;
+	return false;
 }
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index dfd614f7ca4..b1b81f31419 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -419,6 +419,11 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 /*
  * Internal buffer management routines
  */
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
+
 /* bufmgr.c */
 extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
-- 
2.43.0
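
For readers skimming the 0001 diff, the control-flow change can be sketched in isolation. This is only an illustration under simplifying assumptions: SketchBuf, sketch_get_victim() and the flags are hypothetical stand-ins, not PostgreSQL code. Every "goto again" becomes a "continue", and the success path ends in a "break". Unlike the real GetVictimBuffer(), the sketch runs over a finite candidate array and returns -1 when it is exhausted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a candidate victim buffer. */
typedef struct SketchBuf
{
	bool		dirty;				/* does the buffer need writing out? */
	bool		lock_available;		/* would a conditional lock succeed? */
	bool		needs_wal_flush;	/* would writing it require a WAL flush? */
} SketchBuf;

/*
 * Return the index of the first candidate we manage to claim, or -1 if the
 * candidates run out.  Each failed attempt loops around with "continue",
 * which is the role "goto again" played before the refactor.
 */
static int
sketch_get_victim(const SketchBuf *bufs, size_t nbufs, bool reject_on_wal_flush)
{
	size_t		next = 0;
	int			victim = -1;

	for (;;)
	{
		const SketchBuf *buf;

		if (next >= nbufs)
			break;				/* nothing left to try */
		buf = &bufs[next++];

		if (buf->dirty)
		{
			/* Someone else holds the lock: give up and try another one. */
			if (!buf->lock_available)
				continue;

			/* The strategy prefers not to pay for a WAL flush. */
			if (reject_on_wal_flush && buf->needs_wal_flush)
				continue;

			/* The real code would write the buffer out here. */
		}

		victim = (int) (next - 1);
		break;
	}

	return victim;
}
```

With candidates {dirty and locked by someone else, dirty and needing a WAL flush, clean}, the sketch claims index 2 when the strategy rejects WAL flushes and index 1 when it does not.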

v4-0007-Implement-checkpointer-data-write-combining.patch (text/x-patch; charset=US-ASCII)

From 9f242a419d561b6830ea1795b6178d9e0e5ba4e2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:42:29 -0400
Subject: [PATCH v4 7/9] Implement checkpointer data write combining

When the checkpointer writes out dirty buffers, writing multiple
contiguous blocks as a single IO is a substantial performance
improvement. The checkpointer is usually bottlenecked on IO, so issuing
larger IOs leads to increased write throughput and faster checkpoints.
---
 src/backend/storage/buffer/bufmgr.c | 232 ++++++++++++++++++++++++----
 src/backend/utils/probes.d          |   2 +-
 2 files changed, 207 insertions(+), 27 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 246f675333e..e2a7111a0bb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -512,6 +512,7 @@ static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
+static uint32 checkpointer_max_batch_size(void);
 static void BufferSync(int flags);
 static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
@@ -3313,7 +3314,6 @@ UnpinBufferNoOwner(BufferDesc *buf)
 static void
 BufferSync(int flags)
 {
-	uint32		buf_state;
 	int			buf_id;
 	int			num_to_scan;
 	int			num_spaces;
@@ -3325,6 +3325,8 @@ BufferSync(int flags)
 	int			i;
 	int			mask = BM_DIRTY;
 	WritebackContext wb_context;
+	uint32		max_batch_size;
+	BufWriteBatch batch;
 
 	/*
 	 * Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3355,6 +3357,7 @@ BufferSync(int flags)
 	for (buf_id = 0; buf_id < NBuffers; buf_id++)
 	{
 		BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+		uint32		buf_state;
 
 		/*
 		 * Header spinlock is enough to examine BM_DIRTY, see comment in
@@ -3495,48 +3498,208 @@ BufferSync(int flags)
 	 */
 	num_processed = 0;
 	num_written = 0;
+	max_batch_size = checkpointer_max_batch_size();
 	while (!binaryheap_empty(ts_heap))
 	{
+		BlockNumber limit = max_batch_size;
 		BufferDesc *bufHdr = NULL;
 		CkptTsStatus *ts_stat = (CkptTsStatus *)
 			DatumGetPointer(binaryheap_first(ts_heap));
-
-		buf_id = CkptBufferIds[ts_stat->index].buf_id;
-		Assert(buf_id != -1);
-
-		bufHdr = GetBufferDescriptor(buf_id);
-
-		num_processed++;
+		int			ts_end = ts_stat->index - ts_stat->num_scanned + ts_stat->num_to_scan;
+		int			processed = 0;
 
 		/*
-		 * We don't need to acquire the lock here, because we're only looking
-		 * at a single bit. It's possible that someone else writes the buffer
-		 * and clears the flag right after we check, but that doesn't matter
-		 * since SyncOneBuffer will then do nothing.  However, there is a
-		 * further race condition: it's conceivable that between the time we
-		 * examine the bit here and the time SyncOneBuffer acquires the lock,
-		 * someone else not only wrote the buffer but replaced it with another
-		 * page and dirtied it.  In that improbable case, SyncOneBuffer will
-		 * write the buffer though we didn't need to.  It doesn't seem worth
-		 * guarding against this, though.
+		 * Each batch will have exactly one start and one max lsn and one
+		 * length.
 		 */
-		if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
+		batch.start = InvalidBlockNumber;
+		batch.max_lsn = InvalidXLogRecPtr;
+		batch.n = 0;
+
+		while (batch.n < limit)
 		{
-			if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+			uint32		buf_state;
+			XLogRecPtr	lsn = InvalidXLogRecPtr;
+			LWLock	   *content_lock;
+			CkptSortItem item;
+
+			if (ProcSignalBarrierPending)
+				ProcessProcSignalBarrier();
+
+			/* Check if we are done with this tablespace */
+			if (ts_stat->index + processed >= ts_end)
+				break;
+
+			item = CkptBufferIds[ts_stat->index + processed];
+
+			buf_id = item.buf_id;
+			Assert(buf_id != -1);
+
+			bufHdr = GetBufferDescriptor(buf_id);
+
+			/*
+			 * If this is the first block of the batch, then check if we need
+			 * to open a new relation. Open the relation now because we have
+			 * to determine the maximum IO size based on how many blocks
+			 * remain in the file.
+			 */
+			if (!BlockNumberIsValid(batch.start))
+			{
+				Assert(batch.max_lsn == InvalidXLogRecPtr && batch.n == 0);
+				batch.rlocator.spcOid = item.tsId;
+				batch.rlocator.dbOid = item.db_id;
+				batch.rlocator.relNumber = item.relNumber;
+				batch.forkno = item.forkNum;
+				batch.start = item.blockNum;
+				batch.reln = smgropen(batch.rlocator, INVALID_PROC_NUMBER);
+				limit = smgrmaxcombine(batch.reln, batch.forkno, batch.start);
+				limit = Max(1, limit);
+				limit = Min(limit, max_batch_size);
+			}
+
+			/*
+			 * Once we hit blocks from the next relation or fork of the
+			 * relation, break out of the loop and issue the IO we've built up
+			 * so far. It is important that we don't increment processed
+			 * because we want to start the next IO with this item.
+			 */
+			if (item.db_id != batch.rlocator.dbOid)
+				break;
+
+			if (item.relNumber != batch.rlocator.relNumber)
+				break;
+
+			if (item.forkNum != batch.forkno)
+				break;
+
+			/*
+			 * If the next block is not contiguous, we can't include it in the
+			 * IO we will issue. Break out of the loop and issue what we have
+			 * so far. Do not count this item as processed -- otherwise we
+			 * will end up skipping it.
+			 */
+			if (item.blockNum != batch.start + batch.n)
+				break;
+
+			/*
+			 * We don't need to acquire the lock here, because we're only
+			 * looking at a single bit. It's possible that someone else writes
+			 * the buffer and clears the flag right after we check, but that
+			 * doesn't matter since StartBufferIO will then return false. If
+			 * the buffer doesn't need checkpointing, don't include it in the
+			 * batch we are building. We're done with the item, so count it as
+			 * processed and break out of the loop to issue the IO we have
+			 * built so far.
+			 */
+			if (!(pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED))
+			{
+				processed++;
+				break;
+			}
+
+			ReservePrivateRefCountEntry();
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+
+			buf_state = LockBufHdr(bufHdr);
+
+			/*
+			 * If the buffer doesn't need eviction, we're done with the item,
+			 * so count it as processed and break out of the loop to issue the
+			 * IO so far.
+			 */
+			if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
 			{
-				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-				PendingCheckpointerStats.buffers_written++;
-				num_written++;
+				processed++;
+				UnlockBufHdr(bufHdr, buf_state);
+				break;
+			}
+
+			PinBuffer_Locked(bufHdr);
+
+			/*
+			 * There is a race condition here: it's conceivable that between
+			 * the time we examine the buffer header for BM_CHECKPOINT_NEEDED
+			 * above and the time we acquire the content lock, someone else
+			 * not only wrote the buffer but replaced it with another page and
+			 * dirtied it.  In that improbable case, we will write the buffer
+			 * though we didn't need to.  It doesn't seem worth guarding
+			 * against this, though.
+			 */
+			content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+			/*
+			 * We are willing to wait for the content lock on the first IO in
+			 * the batch. However, for subsequent IOs, waiting could lead to
+			 * deadlock. We have to eventually flush all eligible buffers,
+			 * though. So, if we fail to acquire the lock on a subsequent
+			 * buffer, we break out and issue the IO we've built up so far.
+			 * Then we come back and start a new IO with that buffer as the
+			 * starting buffer. As such, we must not count the item as
+			 * processed if we end up failing to acquire the content lock.
+			 */
+			if (batch.n == 0)
+				LWLockAcquire(content_lock, LW_SHARED);
+			else if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				UnpinBuffer(bufHdr);
+				break;
+			}
+
+			/*
+			 * If the buffer doesn't need IO, count the item as processed,
+			 * release the buffer, and break out of the loop to issue the IO
+			 * we have built up so far.
+			 */
+			if (!StartBufferIO(bufHdr, false, true))
+			{
+				processed++;
+				LWLockRelease(content_lock);
+				UnpinBuffer(bufHdr);
+				break;
 			}
+
+			buf_state = LockBufHdr(bufHdr);
+			lsn = BufferGetLSN(bufHdr);
+			buf_state &= ~BM_JUST_DIRTIED;
+			UnlockBufHdr(bufHdr, buf_state);
+
+			/*
+			 * Keep track of the max LSN so that we can be sure to flush
+			 * enough WAL before flushing data from the buffers. See comment
+			 * in DoFlushBuffer() for more on why we don't consider the LSNs
+			 * of unlogged relations.
+			 */
+			if (buf_state & BM_PERMANENT && lsn > batch.max_lsn)
+				batch.max_lsn = lsn;
+
+			batch.bufdescs[batch.n++] = bufHdr;
+			processed++;
 		}
 
 		/*
 		 * Measure progress independent of actually having to flush the buffer
 		 * - otherwise writing become unbalanced.
 		 */
-		ts_stat->progress += ts_stat->progress_slice;
-		ts_stat->num_scanned++;
-		ts_stat->index++;
+		num_processed += processed;
+		ts_stat->progress += ts_stat->progress_slice * processed;
+		ts_stat->num_scanned += processed;
+		ts_stat->index += processed;
+
+		/*
+		 * If we built up an IO, issue it. There's a chance we didn't find any
+		 * items referencing buffers that needed flushing this time, but we
+		 * still want to check if we should update the heap if we examined and
+		 * processed the items.
+		 */
+		if (batch.n > 0)
+		{
+			FlushBufferBatch(&batch, IOCONTEXT_NORMAL);
+			CompleteWriteBatchIO(&batch, IOCONTEXT_NORMAL, &wb_context);
+
+			TRACE_POSTGRESQL_BUFFER_BATCH_SYNC_WRITTEN(batch.n);
+			PendingCheckpointerStats.buffers_written += batch.n;
+			num_written += batch.n;
+		}
 
 		/* Have all the buffers from the tablespace been processed? */
 		if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -4262,6 +4425,23 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * The maximum number of blocks that can be written out in a single batch by
+ * the checkpointer.
+ */
+static uint32
+checkpointer_max_batch_size(void)
+{
+	uint32		result;
+	uint32		pin_limit = GetPinLimit();
+
+	result = Max(pin_limit, 1);
+	result = Min(result, io_combine_limit);
+	result = Max(result, 1);
+	Assert(result <= MAX_IO_COMBINE_LIMIT);
+	return result;
+}
+
 
 /*
  * Given a buffer descriptor, start, from a strategy ring, strategy, that
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 36dd4f8375b..d6970731ba9 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -68,7 +68,7 @@ provider postgresql {
 	probe buffer__checkpoint__sync__start();
 	probe buffer__checkpoint__done();
 	probe buffer__sync__start(int, int);
-	probe buffer__sync__written(int);
+	probe buffer__batch__sync__written(BlockNumber);
 	probe buffer__sync__done(int, int, int);
 
 	probe deadlock__found();
-- 
2.43.0
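
The heart of the inner loop in 0007 is deciding how many sorted items can go into one IO. A minimal sketch of just that decision follows, assuming the items are already sorted the way CkptBufferIds is. SketchItem and sketch_run_length() are hypothetical names, and the sketch leaves out everything the real loop also has to handle (pins, content locks, StartBufferIO(), LSN tracking, buffers that no longer need writing).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for CkptSortItem. */
typedef struct SketchItem
{
	uint32_t	rel;			/* relation identifier */
	uint32_t	fork;			/* fork number */
	uint32_t	block;			/* block number within the fork */
} SketchItem;

/*
 * Return how many items, beginning at items[start], form one run of
 * consecutive blocks in the same relation and fork, capped at limit.
 * The item that breaks the run is not consumed, so the caller can begin
 * the next batch with it.
 */
static size_t
sketch_run_length(const SketchItem *items, size_t nitems,
				  size_t start, size_t limit)
{
	size_t		n = 0;

	while (start + n < nitems && n < limit)
	{
		const SketchItem *first = &items[start];
		const SketchItem *cur = &items[start + n];

		/* A different relation or fork ends the run. */
		if (cur->rel != first->rel || cur->fork != first->fork)
			break;

		/* A gap in block numbers ends the run. */
		if (cur->block != first->block + n)
			break;

		n++;
	}

	return n;
}
```

For blocks 10, 11, 12, 14 of one relation followed by blocks 15, 16 of another, the runs come out as 3, 1 and 2 items, so six single-block writes would become three IOs.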

v4-0006-Add-database-Oid-to-CkptSortItem.patch (text/x-patch; charset=US-ASCII)

From fed3aa6eb1fcc0cb66b9ed56dbee7506d1ce563f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:22:11 -0400
Subject: [PATCH v4 6/9] Add database Oid to CkptSortItem

This is useful for checkpointer write combining -- which will be added
in a future commit.
---
 src/backend/storage/buffer/bufmgr.c | 8 ++++++++
 src/include/storage/buf_internals.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 98c03ef1b1a..246f675333e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3371,6 +3371,7 @@ BufferSync(int flags)
 			item = &CkptBufferIds[num_to_scan++];
 			item->buf_id = buf_id;
 			item->tsId = bufHdr->tag.spcOid;
+			item->db_id = bufHdr->tag.dbOid;
 			item->relNumber = BufTagGetRelNumber(&bufHdr->tag);
 			item->forkNum = BufTagGetForkNum(&bufHdr->tag);
 			item->blockNum = bufHdr->tag.blockNum;
@@ -6691,6 +6692,13 @@ ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b)
 		return -1;
 	else if (a->tsId > b->tsId)
 		return 1;
+
+	/* compare database */
+	if (a->db_id < b->db_id)
+		return -1;
+	else if (a->db_id > b->db_id)
+		return 1;
+
 	/* compare relation */
 	if (a->relNumber < b->relNumber)
 		return -1;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c082a50166f..99f17091a40 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -382,6 +382,7 @@ UnlockBufHdr(BufferDesc *desc, uint32 buf_state)
 typedef struct CkptSortItem
 {
 	Oid			tsId;
+	Oid			db_id;
 	RelFileNumber relNumber;
 	ForkNumber	forkNum;
 	BlockNumber blockNum;
-- 
2.43.0
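
The sort order that results from 0006 (tablespace, database, relation, fork, block) can be shown with a small standalone comparator. This is a sketch, not the patch's code: SketchSortItem, sketch_cmp_u32() and sketch_ckpt_cmp() are hypothetical names, and the qsort()-style signature is used only so the example stands alone. The assumption behind it is that two databases in the same tablespace may hold relations with the same relation number, so without the database in the key their blocks could interleave in the sorted array.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for CkptSortItem with the new field. */
typedef struct SketchSortItem
{
	uint32_t	ts_id;			/* tablespace */
	uint32_t	db_id;			/* database, the field 0006 adds */
	uint32_t	rel_number;		/* relation */
	int32_t		fork_num;		/* fork */
	uint32_t	block_num;		/* block within the fork */
} SketchSortItem;

/* Three-way comparison that cannot overflow. */
static int
sketch_cmp_u32(uint32_t a, uint32_t b)
{
	return (a > b) - (a < b);
}

/* Order by tablespace, database, relation, fork, then block. */
static int
sketch_ckpt_cmp(const void *pa, const void *pb)
{
	const SketchSortItem *a = pa;
	const SketchSortItem *b = pb;
	int			c;

	if ((c = sketch_cmp_u32(a->ts_id, b->ts_id)) != 0)
		return c;
	if ((c = sketch_cmp_u32(a->db_id, b->db_id)) != 0)
		return c;
	if ((c = sketch_cmp_u32(a->rel_number, b->rel_number)) != 0)
		return c;
	if (a->fork_num != b->fork_num)
		return (a->fork_num > b->fork_num) - (a->fork_num < b->fork_num);
	return sketch_cmp_u32(a->block_num, b->block_num);
}
```

Sorting with qsort(items, n, sizeof(items[0]), sketch_ckpt_cmp) then keeps each database's blocks for a given relation number adjacent, which is what the contiguity check in the write-combining loop relies on.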

Melanie Plageman

melanieplageman@gmail.com

4 months ago

In reply to: Melanie Plageman (#5)

7 attachment(s)

Re: Checkpointer write combining

On Tue, Sep 9, 2025 at 11:16 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

> One more fix and a bit more cleanup in attached v4.

Okay, one more version: I updated the thread on eager flushing the
bulkwrite ring [1], and some updates were needed here.

- Melanie

[1]: /messages/by-id/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig@mail.gmail.com

Attachments:

v5-0004-Write-combining-for-BAS_BULKWRITE.patch (text/x-patch; charset=US-ASCII)

From ab164bbeb8ea4d7718f7f3c2b33893be1e0907dd Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 12:56:38 -0400
Subject: [PATCH v5 4/9] Write combining for BAS_BULKWRITE

Implement write combining for users of the bulkwrite buffer access
strategy (e.g. COPY FROM). When the buffer access strategy needs to
clean a buffer for reuse, it already opportunistically flushes some
other buffers. Now, combine any contiguous blocks from the same relation
into larger writes and issue them with smgrwritev().

The performance benefit for COPY FROM is mostly noticeable for multiple
concurrent COPY FROMs because a single COPY FROM is either CPU bound or
bound by WAL writes.

The infrastructure for flushing larger batches of IOs will be reused by
checkpointer and other processes doing writes of dirty data.

XXX: Because this sets in-place checksums for batches, it is not
committable until additional infrastructure goes in place.
---
 src/backend/storage/buffer/bufmgr.c   | 198 ++++++++++++++++++++++++--
 src/backend/storage/buffer/freelist.c |  26 ++++
 src/backend/storage/page/bufpage.c    |  20 +++
 src/backend/utils/probes.d            |   2 +
 src/include/storage/buf_internals.h   |  32 +++++
 src/include/storage/bufpage.h         |   1 +
 src/tools/pgindent/typedefs.list      |   1 +
 7 files changed, 269 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d0f40b6a3ec..98c03ef1b1a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -539,6 +539,8 @@ static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber re
 												   RelFileLocator *rlocator,
 												   bool skip_pinned,
 												   XLogRecPtr *max_lsn);
+static void FindFlushAdjacents(BufferAccessStrategy strategy, BufferDesc *start,
+							   uint32 max_batch_size, BufWriteBatch *batch);
 static void CleanVictimBuffer(BufferAccessStrategy strategy,
 							  BufferDesc *bufdesc, uint32 *buf_state,
 							  bool from_ring, IOContext io_context);
@@ -4259,10 +4261,73 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+
+/*
+ * Given a buffer descriptor, start, from a strategy ring, strategy, that
+ * supports eager flushing, find additional buffers from the ring that can be
+ * combined into a single write batch with this buffer.
+ *
+ * max_batch_size is the maximum number of blocks that can be combined into a
+ * single write in general. This function, based on the block number of start,
+ * will determine the maximum IO size for this particular write given how much
+ * of the file remains. max_batch_size is provided by the caller so it doesn't
+ * have to be recalculated for each write.
+ *
+ * batch is an output parameter that this function will fill with the needed
+ * information to write this IO.
+ *
+ * This function will pin and content lock all of the buffers that it
+ * assembles for the IO batch. The caller is responsible for issuing the IO.
+ */
+static void
+FindFlushAdjacents(BufferAccessStrategy strategy, BufferDesc *start,
+				   uint32 max_batch_size, BufWriteBatch *batch)
+{
+	BlockNumber limit;
+	uint32		buf_state;
+
+	Assert(start);
+	batch->bufdescs[0] = start;
+
+	buf_state = LockBufHdr(start);
+	batch->max_lsn = BufferGetLSN(start);
+	UnlockBufHdr(start, buf_state);
+
+	batch->start = batch->bufdescs[0]->tag.blockNum;
+	batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+	batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
+	batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+	Assert(BlockNumberIsValid(batch->start));
+
+	limit = smgrmaxcombine(batch->reln, batch->forkno, batch->start);
+	limit = Max(limit, 1);
+	limit = Min(max_batch_size, limit);
+
+	/* Now assemble a run of blocks to write out. */
+	for (batch->n = 1; batch->n < limit; batch->n++)
+	{
+		Buffer		bufnum;
+
+		if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
+			break;
+
+		/* Stop when we encounter a buffer that will break the run */
+		if ((batch->bufdescs[batch->n] =
+			 PrepareOrRejectEagerFlushBuffer(bufnum,
+											 batch->start + batch->n,
+											 &batch->rlocator,
+											 true,
+											 &batch->max_lsn)) == NULL)
+			break;
+	}
+}
+
 /*
  * Returns the buffer descriptor of the buffer containing the next block we
  * should eagerly flush or NULL when there are no further buffers to
- * consider writing out.
+ * consider writing out. This will be the start of a new batch of buffers to
+ * write out.
  */
 static BufferDesc *
 next_strat_buf_to_flush(BufferAccessStrategy strategy,
@@ -4296,7 +4361,6 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
-	bool		first_buffer = true;
 
 	Assert(*buf_state & BM_DIRTY);
 
@@ -4306,19 +4370,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
 	if (from_ring && strategy_supports_eager_flush(strategy))
 	{
+		uint32		max_batch_size = max_write_batch_size_for_strategy(strategy);
+
+		/* Pin our victim again so it stays ours even after batch released */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+		IncrBufferRefCount(BufferDescriptorGetBuffer(bufdesc));
+
 		/* Clean victim buffer and find more to flush opportunistically */
 		StartStrategySweep(strategy);
 		do
 		{
-			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
-			content_lock = BufferDescriptorGetContentLock(bufdesc);
-			LWLockRelease(content_lock);
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &bufdesc->tag);
-			/* We leave the first buffer pinned for the caller */
-			if (!first_buffer)
-				UnpinBuffer(bufdesc);
-			first_buffer = false;
+			BufWriteBatch batch;
+
+			FindFlushAdjacents(strategy, bufdesc, max_batch_size, &batch);
+			FlushBufferBatch(&batch, io_context);
+			CompleteWriteBatchIO(&batch, io_context, &BackendWritebackContext);
 		} while ((bufdesc = next_strat_buf_to_flush(strategy, &max_lsn)) != NULL);
 	}
 	else
@@ -4440,6 +4507,73 @@ except_unlock_header:
 	return NULL;
 }
 
+/*
+ * Given a prepared batch of buffers, write them out as a vector.
+ */
+void
+FlushBufferBatch(BufWriteBatch *batch,
+				 IOContext io_context)
+{
+	BlockNumber blknums[MAX_IO_COMBINE_LIMIT];
+	Block		blocks[MAX_IO_COMBINE_LIMIT];
+	instr_time	io_start;
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+
+	if (!XLogRecPtrIsInvalid(batch->max_lsn))
+		XLogFlush(batch->max_lsn);
+
+	if (batch->reln == NULL)
+		batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+#ifdef USE_ASSERT_CHECKING
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		BufferDesc *bufdesc = batch->bufdescs[i];
+		uint32		buf_state = LockBufHdr(bufdesc);
+		XLogRecPtr	lsn = BufferGetLSN(bufdesc);
+
+		UnlockBufHdr(bufdesc, buf_state);
+		Assert(!(buf_state & BM_PERMANENT) || !XLogNeedsFlush(lsn));
+	}
+#endif
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_START(batch->forkno,
+											  batch->reln->smgr_rlocator.locator.spcOid,
+											  batch->reln->smgr_rlocator.locator.dbOid,
+											  batch->reln->smgr_rlocator.locator.relNumber,
+											  batch->reln->smgr_rlocator.backend,
+											  batch->n);
+
+	/*
+	 * XXX: All blocks should be copied and then checksummed but doing so
+	 * takes a lot of extra memory and a future patch will eliminate this
+	 * requirement.
+	 */
+	for (BlockNumber i = 0; i < batch->n; i++)
+	{
+		blknums[i] = batch->start + i;
+		blocks[i] = BufHdrGetBlock(batch->bufdescs[i]);
+	}
+
+	PageSetBatchChecksumInplace((Page *) blocks, blknums, batch->n);
+
+	io_start = pgstat_prepare_io_time(track_io_timing);
+
+	smgrwritev(batch->reln, batch->forkno,
+			   batch->start, (const void **) blocks, batch->n, false);
+
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context, IOOP_WRITE,
+							io_start, batch->n, BLCKSZ);
+
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Prepare the buffer with bufdesc for writing. buf_state and lsn are output
  * parameters. Returns true if the buffer actually needs writing and false
@@ -4585,6 +4719,48 @@ DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	error_context_stack = errcallback.previous;
 }
 
+/*
+ * Given a previously initialized batch with buffers that have already been
+ * flushed, terminate the IO on each buffer and then unlock and unpin them.
+ * This assumes all the buffers were locked and pinned. wb_context will be
+ * modified.
+ */
+void
+CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+					 WritebackContext *wb_context)
+{
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+	pgBufferUsage.shared_blks_written += batch->n;
+
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		Buffer		buffer = BufferDescriptorGetBuffer(batch->bufdescs[i]);
+
+		errcallback.arg = batch->bufdescs[i];
+
+		/* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state. */
+		TerminateBufferIO(batch->bufdescs[i], true, 0, true, false);
+		LWLockRelease(BufferDescriptorGetContentLock(batch->bufdescs[i]));
+		ReleaseBuffer(buffer);
+		ScheduleBufferTagForWriteback(wb_context, io_context,
+									  &batch->bufdescs[i]->tag);
+	}
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_DONE(batch->forkno,
+											 batch->reln->smgr_rlocator.locator.spcOid,
+											 batch->reln->smgr_rlocator.locator.dbOid,
+											 batch->reln->smgr_rlocator.locator.relNumber,
+											 batch->reln->smgr_rlocator.backend,
+											 batch->n, batch->start);
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * RelationGetNumberOfBlocksInFork
  *		Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 025592778f7..eadf2899a01 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -727,6 +727,32 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	return NULL;
 }
 
+
+/*
+ * Determine the largest IO we can assemble from the given strategy ring given
+ * strategy-specific as well as global constraints on the number of pinned
+ * buffers and max IO size.
+ */
+uint32
+max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
+{
+	uint32		max_possible_buffer_limit;
+	uint32		max_write_batch_size;
+	int			strategy_pin_limit;
+
+	max_write_batch_size = io_combine_limit;
+
+	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+	max_possible_buffer_limit = GetPinLimit();
+
+	max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+	max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+	max_write_batch_size = Max(1, max_write_batch_size);
+	max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
+	Assert(max_write_batch_size <= MAX_IO_COMBINE_LIMIT);
+	return max_write_batch_size;
+}
+
 /*
  * AddBufferToRing -- add a buffer to the buffer ring
  *
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index dbb49ed9197..fc749dd5a50 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
 	((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
 }
+
+/*
+ * A helper to set the checksums of multiple blocks.
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, BlockNumber *blknos, uint32 length)
+{
+	/* If we don't need a checksum, just return */
+	if (!DataChecksumsEnabled())
+		return;
+
+	for (uint32 i = 0; i < length; i++)
+	{
+		Page		page = pages[i];
+
+		if (PageIsNew(page))
+			continue;
+		((PageHeader) page)->pd_checksum = pg_checksum_page(page, blknos[i]);
+	}
+}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index e9e413477ba..36dd4f8375b 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -61,6 +61,8 @@ provider postgresql {
 	probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
 	probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
 	probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
+	probe buffer__batch__flush__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
+	probe buffer__batch__flush__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
 
 	probe buffer__checkpoint__start(int);
 	probe buffer__checkpoint__sync__start();
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 7963d1189a6..c082a50166f 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -416,6 +416,34 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 	ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
 }
 
+/*
+ * Used to write out multiple blocks at a time in a combined IO. bufdescs
+ * contains buffer descriptors for buffers containing adjacent blocks of the
+ * same fork of the same relation.
+ */
+typedef struct BufWriteBatch
+{
+	RelFileLocator rlocator;
+	ForkNumber	forkno;
+	SMgrRelation reln;
+
+	/*
+	 * The BlockNumber of the first block in the run of contiguous blocks to
+	 * be written out as a single IO.
+	 */
+	BlockNumber start;
+
+	/*
+	 * While assembling the buffers, we keep track of the maximum LSN so that
+	 * we can flush WAL through this LSN before flushing the buffers.
+	 */
+	XLogRecPtr	max_lsn;
+
+	/* The number of valid buffers in bufdescs */
+	uint32		n;
+	BufferDesc *bufdescs[MAX_IO_COMBINE_LIMIT];
+} BufWriteBatch;
+
 /*
  * Internal buffer management routines
  */
@@ -429,6 +457,7 @@ extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
 										  IOContext io_context, BufferTag *tag);
+extern void FlushBufferBatch(BufWriteBatch *batch, IOContext io_context);
 
 /* solely to make it easier to write tests */
 extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
@@ -438,8 +467,11 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 /* freelist.c */
 extern bool strategy_supports_eager_flush(BufferAccessStrategy strategy);
+extern uint32 max_write_batch_size_for_strategy(BufferAccessStrategy strategy);
 extern Buffer StrategySweepNextBuffer(BufferAccessStrategy strategy);
 extern void StartStrategySweep(BufferAccessStrategy strategy);
+extern void CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+								 WritebackContext *wb_context);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index aeb67c498c5..1020cb3ac78 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -507,5 +507,6 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 									Item newtup, Size newsize);
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern void PageSetBatchChecksumInplace(Page *pages, BlockNumber *blknos, uint32 length);
 
 #endif							/* BUFPAGE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..9492adeee58 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -349,6 +349,7 @@ BufferManagerRelation
 BufferStrategyControl
 BufferTag
 BufferUsage
+BufWriteBatch
 BuildAccumulator
 BuiltinScript
 BulkInsertState
-- 
2.43.0

v5-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch (text/x-patch; charset=US-ASCII)

From dae7c82146c2d73729fc12a742d84b660e6db2ad Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:00:44 -0400
Subject: [PATCH v5 1/9] Refactor goto into for loop in GetVictimBuffer()

GetVictimBuffer() implemented a loop to optimistically lock a clean
victim buffer using a goto. Future commits will add batch flushing
functionality to GetVictimBuffer(). The new logic works better with
regular for-loop control flow.

This commit is only a refactor and does not introduce any new
functionality.

Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_bcWRvRwZUop_d9vzF9nHAiT%2B-uPzkJ%3DS3ShZ1GqeAYOw%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 200 ++++++++++++--------------
 src/backend/storage/buffer/freelist.c |  32 ++++-
 src/include/storage/buf_internals.h   |   5 +
 3 files changed, 124 insertions(+), 113 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fe470de63f2..f3668051574 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -68,10 +68,6 @@
 #include "utils/timestamp.h"
 
 
-/* Note: these two macros only work on shared buffers, not local ones! */
-#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
-#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
-
 /* Note: this macro only works on local buffers, not shared ones! */
 #define LocalBufHdrGetBlock(bufHdr) \
 	LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]
@@ -2344,130 +2340,122 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
-	/* we return here if a prospective victim buffer gets used concurrently */
-again:
-
-	/*
-	 * Select a victim buffer.  The buffer is returned with its header
-	 * spinlock still held!
-	 */
-	buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
-	buf = BufferDescriptorGetBuffer(buf_hdr);
-
-	Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
-
-	/* Pin the buffer and then release the buffer spinlock */
-	PinBuffer_Locked(buf_hdr);
-
-	/*
-	 * We shouldn't have any other pins for this buffer.
-	 */
-	CheckBufferIsPinnedOnce(buf);
-
-	/*
-	 * If the buffer was dirty, try to write it out.  There is a race
-	 * condition here, in that someone might dirty it after we released the
-	 * buffer header lock above, or even while we are writing it out (since
-	 * our share-lock won't prevent hint-bit updates).  We will recheck the
-	 * dirty bit after re-locking the buffer header.
-	 */
-	if (buf_state & BM_DIRTY)
+	/* Select a victim buffer using an optimistic locking scheme. */
+	for (;;)
 	{
-		LWLock	   *content_lock;
+		/*
+		 * Attempt to claim a victim buffer.  The buffer is returned with its
+		 * header spinlock still held!
+		 */
+		buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+		buf = BufferDescriptorGetBuffer(buf_hdr);
 
-		Assert(buf_state & BM_TAG_VALID);
-		Assert(buf_state & BM_VALID);
+		Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+
+		/* Pin the buffer and then release the buffer spinlock */
+		PinBuffer_Locked(buf_hdr);
 
 		/*
-		 * We need a share-lock on the buffer contents to write it out (else
-		 * we might write invalid data, eg because someone else is compacting
-		 * the page contents while we write).  We must use a conditional lock
-		 * acquisition here to avoid deadlock.  Even though the buffer was not
-		 * pinned (and therefore surely not locked) when StrategyGetBuffer
-		 * returned it, someone else could have pinned and exclusive-locked it
-		 * by the time we get here. If we try to get the lock unconditionally,
-		 * we'd block waiting for them; if they later block waiting for us,
-		 * deadlock ensues. (This has been observed to happen when two
-		 * backends are both trying to split btree index pages, and the second
-		 * one just happens to be trying to split the page the first one got
-		 * from StrategyGetBuffer.)
+		 * We shouldn't have any other pins for this buffer.
 		 */
-		content_lock = BufferDescriptorGetContentLock(buf_hdr);
-		if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
-		{
-			/*
-			 * Someone else has locked the buffer, so give it up and loop back
-			 * to get another one.
-			 */
-			UnpinBuffer(buf_hdr);
-			goto again;
-		}
+		CheckBufferIsPinnedOnce(buf);
 
 		/*
-		 * If using a nondefault strategy, and writing the buffer would
-		 * require a WAL flush, let the strategy decide whether to go ahead
-		 * and write/reuse the buffer or to choose another victim.  We need a
-		 * lock to inspect the page LSN, so this can't be done inside
-		 * StrategyGetBuffer.
+		 * If the buffer was dirty, try to write it out.  There is a race
+		 * condition here, in that someone might dirty it after we released
+		 * the buffer header lock above, or even while we are writing it out
+		 * (since our share-lock won't prevent hint-bit updates).  We will
+		 * recheck the dirty bit after re-locking the buffer header.
 		 */
-		if (strategy != NULL)
+		if (buf_state & BM_DIRTY)
 		{
-			XLogRecPtr	lsn;
+			LWLock	   *content_lock;
 
-			/* Read the LSN while holding buffer header lock */
-			buf_state = LockBufHdr(buf_hdr);
-			lsn = BufferGetLSN(buf_hdr);
-			UnlockBufHdr(buf_hdr, buf_state);
+			Assert(buf_state & BM_TAG_VALID);
+			Assert(buf_state & BM_VALID);
 
-			if (XLogNeedsFlush(lsn)
-				&& StrategyRejectBuffer(strategy, buf_hdr, from_ring))
+			/*
+			 * We need a share-lock on the buffer contents to write it out
+			 * (else we might write invalid data, eg because someone else is
+			 * compacting the page contents while we write).  We must use a
+			 * conditional lock acquisition here to avoid deadlock.  Even
+			 * though the buffer was not pinned (and therefore surely not
+			 * locked) when StrategyGetBuffer returned it, someone else could
+			 * have pinned and exclusive-locked it by the time we get here. If
+			 * we try to get the lock unconditionally, we'd block waiting for
+			 * them; if they later block waiting for us, deadlock ensues.
+			 * (This has been observed to happen when two backends are both
+			 * trying to split btree index pages, and the second one just
+			 * happens to be trying to split the page the first one got from
+			 * StrategyGetBuffer.)
+			 */
+			content_lock = BufferDescriptorGetContentLock(buf_hdr);
+			if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				/*
+				 * Someone else has locked the buffer, so give it up and loop
+				 * back to get another one.
+				 */
+				UnpinBuffer(buf_hdr);
+				continue;
+			}
+
+			/*
+			 * If using a nondefault strategy, and writing the buffer would
+			 * require a WAL flush, let the strategy decide whether to go
+			 * ahead and write/reuse the buffer or to choose another victim.
+			 * We need the content lock to inspect the page LSN, so this can't
+			 * be done inside StrategyGetBuffer.
+			 */
+			if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
 			{
 				LWLockRelease(content_lock);
 				UnpinBuffer(buf_hdr);
-				goto again;
+				continue;
 			}
-		}
 
-		/* OK, do the I/O */
-		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-		LWLockRelease(content_lock);
+			/* OK, do the I/O */
+			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
+			LWLockRelease(content_lock);
 
-		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-									  &buf_hdr->tag);
-	}
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &buf_hdr->tag);
+		}
 
+		if (buf_state & BM_VALID)
+		{
+			/*
+			 * When a BufferAccessStrategy is in use, blocks evicted from
+			 * shared buffers are counted as IOOP_EVICT in the corresponding
+			 * context (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted
+			 * by a strategy in two cases: 1) while initially claiming buffers
+			 * for the strategy ring 2) to replace an existing strategy ring
+			 * buffer because it is pinned or in use and cannot be reused.
+			 *
+			 * Blocks evicted from buffers already in the strategy ring are
+			 * counted as IOOP_REUSE in the corresponding strategy context.
+			 *
+			 * At this point, we can accurately count evictions and reuses,
+			 * because we have successfully claimed the valid buffer.
+			 * Previously, we may have been forced to release the buffer due
+			 * to concurrent pinners or erroring out.
+			 */
+			pgstat_count_io_op(IOOBJECT_RELATION, io_context,
+							   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
+		}
 
-	if (buf_state & BM_VALID)
-	{
 		/*
-		 * When a BufferAccessStrategy is in use, blocks evicted from shared
-		 * buffers are counted as IOOP_EVICT in the corresponding context
-		 * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
-		 * strategy in two cases: 1) while initially claiming buffers for the
-		 * strategy ring 2) to replace an existing strategy ring buffer
-		 * because it is pinned or in use and cannot be reused.
-		 *
-		 * Blocks evicted from buffers already in the strategy ring are
-		 * counted as IOOP_REUSE in the corresponding strategy context.
-		 *
-		 * At this point, we can accurately count evictions and reuses,
-		 * because we have successfully claimed the valid buffer. Previously,
-		 * we may have been forced to release the buffer due to concurrent
-		 * pinners or erroring out.
+		 * If the buffer has an entry in the buffer mapping table, delete it.
+		 * This can fail because another backend could have pinned or dirtied
+		 * the buffer. Then loop around and try again.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, io_context,
-						   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
-	}
+		if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
+		{
+			UnpinBuffer(buf_hdr);
+			continue;
+		}
 
-	/*
-	 * If the buffer has an entry in the buffer mapping table, delete it. This
-	 * can fail because another backend could have pinned or dirtied the
-	 * buffer.
-	 */
-	if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
-	{
-		UnpinBuffer(buf_hdr);
-		goto again;
+		break;
 	}
 
 	/* a final set of sanity checks */
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7d59a92bd1a..ce95afe2e94 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "pgstat.h"
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
@@ -716,12 +717,21 @@ IOContextForStrategy(BufferAccessStrategy strategy)
  * be written out and doing so would require flushing WAL too.  This gives us
  * a chance to choose a different victim.
  *
+ * The buffer must be pinned and content locked and the buffer header spinlock
+ * must not be held. We must have the content lock to examine the LSN.
+ *
  * Returns true if buffer manager should ask for a new victim, and false
- * if this buffer should be written and re-used.
+ * if this buffer should be written and reused.
  */
 bool
 StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+
+	if (!strategy)
+		return false;
+
 	/* We only do this in bulkread mode */
 	if (strategy->btype != BAS_BULKREAD)
 		return false;
@@ -731,11 +741,19 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
 		return false;
 
-	/*
-	 * Remove the dirty buffer from the ring; necessary to prevent infinite
-	 * loop if all ring members are dirty.
-	 */
-	strategy->buffers[strategy->current] = InvalidBuffer;
+	buf_state = LockBufHdr(buf);
+	lsn = BufferGetLSN(buf);
+	UnlockBufHdr(buf, buf_state);
+
+	if (XLogNeedsFlush(lsn))
+	{
+		/*
+		 * Remove the dirty buffer from the ring; necessary to prevent an
+		 * infinite loop if all ring members are dirty.
+		 */
+		strategy->buffers[strategy->current] = InvalidBuffer;
+		return true;
+	}
 
-	return true;
+	return false;
 }
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index dfd614f7ca4..b1b81f31419 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -419,6 +419,11 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 /*
  * Internal buffer management routines
  */
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
+
 /* bufmgr.c */
 extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
-- 
2.43.0

v5-0005-Fix-XLogNeedsFlush-for-checkpointer.patch (text/x-patch; charset=US-ASCII)

From bf4b108cd97f710e89f27be69709c7960764a932 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 10:01:17 -0400
Subject: [PATCH v5 5/9] Fix XLogNeedsFlush() for checkpointer

XLogNeedsFlush() takes an LSN and compares it to either the flush pointer or the
min recovery point, depending on whether it is in normal operation or recovery.

Even though it is technically recovery, the checkpointer must flush WAL during
an end-of-recovery checkpoint, so in this case, it should compare the provided
LSN to the flush pointer and not the min recovery point.

If it compares the LSN to the min recovery point when the control file's min
recovery point has been updated to an incorrect value, XLogNeedsFlush() can
return an incorrect result of true -- even after just having flushed WAL.

Change this to only compare the LSN to min recovery point -- and, potentially
update the local copy of min recovery point -- when xlog inserts are not
allowed. Inserts are allowed for the checkpointer during an end-of-recovery
checkpoint, but not otherwise during crash recovery.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reported-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g%40mail.gmail.com
---
 src/backend/access/transam/xlog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7ffb2179151..16ef6d2cd64 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3115,7 +3115,7 @@ XLogNeedsFlush(XLogRecPtr record)
 	 * instead. So "needs flush" is taken to mean whether minRecoveryPoint
 	 * would need to be updated.
 	 */
-	if (RecoveryInProgress())
+	if (RecoveryInProgress() && !XLogInsertAllowed())
 	{
 		/*
 		 * An invalid minRecoveryPoint means that we need to recover all the
-- 
2.43.0

v5-0002-Split-FlushBuffer-into-two-parts.patch (text/x-patch; charset=US-ASCII)

From 2c8aafe30fb58516654e7d0cfdbfbb15a6a00498 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:32:24 -0400
Subject: [PATCH v5 2/9] Split FlushBuffer() into two parts

Before adding write combining to write a batch of blocks when flushing
dirty buffers, refactor FlushBuffer() into a preparatory step and the
actual buffer flushing step. This provides better symmetry with the
batch flushing code.
---
 src/backend/storage/buffer/bufmgr.c | 103 ++++++++++++++++++++--------
 1 file changed, 76 insertions(+), 27 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f3668051574..27cc418ef61 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -529,8 +529,13 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
 static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
 static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc, uint32 *buf_state, XLogRecPtr *lsn);
+static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+						  IOContext io_context, XLogRecPtr buffer_lsn);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+static void CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
+							  bool from_ring, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2414,12 +2419,7 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 				continue;
 			}
 
-			/* OK, do the I/O */
-			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-			LWLockRelease(content_lock);
-
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &buf_hdr->tag);
+			CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
 		}
 
 		if (buf_state & BM_VALID)
@@ -4246,20 +4246,81 @@ static void
 FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 			IOContext io_context)
 {
-	XLogRecPtr	recptr;
-	ErrorContextCallback errcallback;
-	instr_time	io_start;
-	Block		bufBlock;
-	char	   *bufToWrite;
 	uint32		buf_state;
+	XLogRecPtr	lsn;
+
+	if (PrepareFlushBuffer(buf, &buf_state, &lsn))
+		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
+}
+
+/*
+ * Prepare a dirty victim buffer for writing and then write it out.
+ * bufdesc and buf_state may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
+				  bool from_ring, IOContext io_context)
+{
+
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	LWLock	   *content_lock;
 
+	Assert(*buf_state & BM_DIRTY);
+
+	/* Set up this victim buffer to be flushed */
+	if (!PrepareFlushBuffer(bufdesc, buf_state, &max_lsn))
+		return;
+
+	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	LWLockRelease(content_lock);
+	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+								  &bufdesc->tag);
+}
+
+/*
+ * Prepare the buffer with bufdesc for writing. buf_state and lsn are output
+ * parameters. Returns true if the buffer actually needs writing and false
+ * otherwise. All three parameters may be modified.
+ */
+static bool
+PrepareFlushBuffer(BufferDesc *bufdesc, uint32 *buf_state, XLogRecPtr *lsn)
+{
 	/*
 	 * Try to start an I/O operation.  If StartBufferIO returns false, then
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false, false))
-		return;
+	if (!StartBufferIO(bufdesc, false, false))
+		return false;
+
+	*lsn = InvalidXLogRecPtr;
+	*buf_state = LockBufHdr(bufdesc);
+
+	/*
+	 * Run PageGetLSN while holding header lock, since we don't have the
+	 * buffer locked exclusively in all cases.
+	 */
+	if (*buf_state & BM_PERMANENT)
+		*lsn = BufferGetLSN(bufdesc);
+
+	/* To check if block content changes while flushing. - vadim 01/17/97 */
+	*buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, *buf_state);
+	return true;
+}
+
+/*
+ * Actually do the write I/O to clean a buffer. buf and reln may be modified.
+ */
+static void
+DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+			  IOContext io_context, XLogRecPtr buffer_lsn)
+{
+	ErrorContextCallback errcallback;
+	instr_time	io_start;
+	Block		bufBlock;
+	char	   *bufToWrite;
 
 	/* Setup error traceback support for ereport() */
 	errcallback.callback = shared_buffer_write_error_callback;
@@ -4277,18 +4338,6 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 										reln->smgr_rlocator.locator.dbOid,
 										reln->smgr_rlocator.locator.relNumber);
 
-	buf_state = LockBufHdr(buf);
-
-	/*
-	 * Run PageGetLSN while holding header lock, since we don't have the
-	 * buffer locked exclusively in all cases.
-	 */
-	recptr = BufferGetLSN(buf);
-
-	/* To check if block content changes while flushing. - vadim 01/17/97 */
-	buf_state &= ~BM_JUST_DIRTIED;
-	UnlockBufHdr(buf, buf_state);
-
 	/*
 	 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
 	 * rule that log updates must hit disk before any of the data-file changes
@@ -4306,8 +4355,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * disastrous system-wide consequences.  To make sure that can't happen,
 	 * skip the flush if the buffer isn't permanent.
 	 */
-	if (buf_state & BM_PERMANENT)
-		XLogFlush(recptr);
+	if (!XLogRecPtrIsInvalid(buffer_lsn))
+		XLogFlush(buffer_lsn);
 
 	/*
 	 * Now it's safe to write the buffer to disk. Note that no one else should
-- 
2.43.0

v5-0003-Eagerly-flush-bulkwrite-strategy-ring.patch (text/x-patch; charset=US-ASCII)

From c060a306293fbcecaab4fd8dd9174860c94ce6be Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 12:43:24 -0400
Subject: [PATCH v5 3/9] Eagerly flush bulkwrite strategy ring

Operations using BAS_BULKWRITE (COPY FROM and createdb) will inevitably
need to flush buffers in the strategy ring in order to reuse
them. By eagerly flushing the buffers in a larger batch, we encourage
larger writes at the kernel level and less interleaving of WAL flushes
and data file writes. The effect is mainly noticeable with multiple
parallel COPY FROMs. In this case, client backends achieve higher write
throughput and end up spending less time waiting on acquiring the lock
to flush WAL. Larger flush operations also mean less time waiting for
flush operations at the kernel level.

The heuristic for eager eviction is to only flush buffers in the
strategy ring that can be flushed without first flushing WAL.

This patch is also a stepping stone toward AIO writes.

Earlier version
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 174 +++++++++++++++++++++++++-
 src/backend/storage/buffer/freelist.c |  63 ++++++++++
 src/include/storage/buf_internals.h   |   3 +
 3 files changed, 234 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 27cc418ef61..d0f40b6a3ec 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -534,7 +534,13 @@ static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object
 						  IOContext io_context, XLogRecPtr buffer_lsn);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
-static void CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
+static BufferDesc *next_strat_buf_to_flush(BufferAccessStrategy strategy, XLogRecPtr *lsn);
+static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+												   RelFileLocator *rlocator,
+												   bool skip_pinned,
+												   XLogRecPtr *max_lsn);
+static void CleanVictimBuffer(BufferAccessStrategy strategy,
+							  BufferDesc *bufdesc, uint32 *buf_state,
 							  bool from_ring, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
@@ -2419,7 +2425,7 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 				continue;
 			}
 
-			CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
+			CleanVictimBuffer(strategy, buf_hdr, &buf_state, from_ring, io_context);
 		}
 
 		if (buf_state & BM_VALID)
@@ -4253,17 +4259,44 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * Returns the buffer descriptor of the buffer containing the next block we
+ * should eagerly flush or NULL when there are no further buffers to
+ * consider writing out.
+ */
+static BufferDesc *
+next_strat_buf_to_flush(BufferAccessStrategy strategy,
+						XLogRecPtr *lsn)
+{
+	Buffer		bufnum;
+	BufferDesc *bufdesc;
+
+	while ((bufnum = StrategySweepNextBuffer(strategy)) != InvalidBuffer)
+	{
+		if ((bufdesc = PrepareOrRejectEagerFlushBuffer(bufnum,
+													   InvalidBlockNumber,
+													   NULL,
+													   true,
+													   lsn)) != NULL)
+			return bufdesc;
+	}
+
+	return NULL;
+}
+
 /*
  * Prepare to write and write a dirty victim buffer.
  * bufdesc and buf_state may be modified.
  */
 static void
-CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
+CleanVictimBuffer(BufferAccessStrategy strategy,
+				  BufferDesc *bufdesc, uint32 *buf_state,
 				  bool from_ring, IOContext io_context)
 {
 
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
+	bool		first_buffer = true;
 
 	Assert(*buf_state & BM_DIRTY);
 
@@ -4271,11 +4304,140 @@ CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
 	if (!PrepareFlushBuffer(bufdesc, buf_state, &max_lsn))
 		return;
 
-	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	if (from_ring && strategy_supports_eager_flush(strategy))
+	{
+		/* Clean victim buffer and find more to flush opportunistically */
+		StartStrategySweep(strategy);
+		do
+		{
+			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+			content_lock = BufferDescriptorGetContentLock(bufdesc);
+			LWLockRelease(content_lock);
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &bufdesc->tag);
+			/* We leave the first buffer pinned for the caller */
+			if (!first_buffer)
+				UnpinBuffer(bufdesc);
+			first_buffer = false;
+		} while ((bufdesc = next_strat_buf_to_flush(strategy, &max_lsn)) != NULL);
+	}
+	else
+	{
+		DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+		content_lock = BufferDescriptorGetContentLock(bufdesc);
+		LWLockRelease(content_lock);
+		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+									  &bufdesc->tag);
+	}
+}
+
+/*
+ * Prepare bufdesc for eager flushing.
+ *
+ * Given bufnum, returns the buffer descriptor of the buffer which we will
+ * opportunistically flush or NULL if this buffer does not contain a block
+ * that should be flushed.
+ *
+ * require is the BlockNumber required by the caller. Some callers may require
+ * a specific BlockNumber to be in bufnum because they are assembling a
+ * contiguous run of blocks.
+ *
+ * If the caller needs the block to be from a specific relation, rlocator will
+ * be provided.
+ */
+BufferDesc *
+PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+								RelFileLocator *rlocator, bool skip_pinned,
+								XLogRecPtr *max_lsn)
+{
+	BufferDesc *bufdesc;
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+	BlockNumber blknum;
+	LWLock	   *content_lock;
+
+	if (!BufferIsValid(bufnum))
+		return NULL;
+
+	Assert(!BufferIsLocal(bufnum));
+
+	bufdesc = GetBufferDescriptor(bufnum - 1);
+
+	/* Block may need to be in a specific relation */
+	if (rlocator &&
+		!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufdesc->tag),
+							  *rlocator))
+		return NULL;
+
+	/* Must do this before taking the buffer header spinlock. */
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	ReservePrivateRefCountEntry();
+
+	buf_state = LockBufHdr(bufdesc);
+
+	if (!(buf_state & BM_DIRTY) || !(buf_state & BM_VALID))
+		goto except_unlock_header;
+
+	/* We don't include used buffers in batches */
+	if (skip_pinned &&
+		(BUF_STATE_GET_REFCOUNT(buf_state) > 0 ||
+		 BUF_STATE_GET_USAGECOUNT(buf_state) > 1))
+		goto except_unlock_header;
+
+	/* Get page LSN while holding header lock */
+	lsn = BufferGetLSN(bufdesc);
+
+	PinBuffer_Locked(bufdesc);
+	CheckBufferIsPinnedOnce(bufnum);
+
+	blknum = BufferGetBlockNumber(bufnum);
+	Assert(BlockNumberIsValid(blknum));
+
+	/* If we'll have to flush WAL to flush the block, we're done */
+	if (buf_state & BM_PERMANENT && XLogNeedsFlush(lsn))
+		goto except_unpin_buffer;
+
+	/* We only include contiguous blocks in the run */
+	if (BlockNumberIsValid(require) && blknum != require)
+		goto except_unpin_buffer;
+
 	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+		goto except_unpin_buffer;
+
+	/*
+	 * Now that we have the content lock, we need to recheck if we need to
+	 * flush WAL.
+	 */
+	buf_state = LockBufHdr(bufdesc);
+	lsn = BufferGetLSN(bufdesc);
+	UnlockBufHdr(bufdesc, buf_state);
+
+	if (buf_state & BM_PERMANENT && XLogNeedsFlush(lsn))
+		goto except_unlock_content;
+
+	/* Try to start an I/O operation. */
+	if (!StartBufferIO(bufdesc, false, true))
+		goto except_unlock_content;
+
+	if (lsn > *max_lsn)
+		*max_lsn = lsn;
+	buf_state = LockBufHdr(bufdesc);
+	buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, buf_state);
+
+	return bufdesc;
+
+except_unlock_content:
 	LWLockRelease(content_lock);
-	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-								  &bufdesc->tag);
+
+except_unpin_buffer:
+	UnpinBuffer(bufdesc);
+	return NULL;
+
+except_unlock_header:
+	UnlockBufHdr(bufdesc, buf_state);
+	return NULL;
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index ce95afe2e94..025592778f7 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -75,6 +75,15 @@ typedef struct BufferAccessStrategyData
 	 */
 	int			current;
 
+	/*
+	 * If the strategy supports eager flushing, we may initiate a sweep of the
+	 * strategy ring, flushing all the dirty buffers we can cheaply flush.
+	 * sweep_start and sweep_current keep track of a given sweep so we don't
+	 * loop around the ring infinitely.
+	 */
+	int			sweep_start;
+	int			sweep_current;
+
 	/*
 	 * Array of buffer numbers.  InvalidBuffer (that is, zero) indicates we
 	 * have not yet selected a buffer for this ring slot.  For allocation
@@ -156,6 +165,31 @@ ClockSweepTick(void)
 	return victim;
 }
 
+/*
+ * Some BufferAccessStrategies support eager flushing -- which is flushing
+ * buffers in the ring before they are needed. This can lead to better I/O
+ * patterns than lazily flushing buffers directly before reusing them.
+ */
+bool
+strategy_supports_eager_flush(BufferAccessStrategy strategy)
+{
+	Assert(strategy);
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKWRITE:
+			return true;
+		case BAS_VACUUM:
+		case BAS_NORMAL:
+		case BAS_BULKREAD:
+			return false;
+		default:
+			elog(ERROR, "unrecognized buffer access strategy: %d",
+				 (int) strategy->btype);
+			return false;
+	}
+}
+
 /*
  * StrategyGetBuffer
  *
@@ -270,6 +304,35 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	}
 }
 
+/*
+ * Return the next buffer in the ring or InvalidBuffer if the current sweep is
+ * over.
+ */
+Buffer
+StrategySweepNextBuffer(BufferAccessStrategy strategy)
+{
+	strategy->sweep_current++;
+	if (strategy->sweep_current >= strategy->nbuffers)
+		strategy->sweep_current = 0;
+
+	if (strategy->sweep_current == strategy->sweep_start)
+		return InvalidBuffer;
+
+	return strategy->buffers[strategy->sweep_current];
+}
+
+/*
+ * Start a sweep of the strategy ring.
+ */
+void
+StartStrategySweep(BufferAccessStrategy strategy)
+{
+	if (!strategy)
+		return;
+	strategy->sweep_start = strategy->sweep_current = strategy->current;
+}
+
+
 /*
  * StrategySyncStart -- tell BgBufferSync where to start syncing
  *
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b1b81f31419..7963d1189a6 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -437,6 +437,9 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 
 /* freelist.c */
+extern bool strategy_supports_eager_flush(BufferAccessStrategy strategy);
+extern Buffer StrategySweepNextBuffer(BufferAccessStrategy strategy);
+extern void StartStrategySweep(BufferAccessStrategy strategy);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
-- 
2.43.0

v5-0006-Add-database-Oid-to-CkptSortItem.patch (text/x-patch; charset=US-ASCII)

From 7e54f2577dce04615b3082fe518490dbd2fa8ad7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:22:11 -0400
Subject: [PATCH v5 6/9] Add database Oid to CkptSortItem

This is useful for checkpointer write combining -- which will be added
in a future commit.
---
 src/backend/storage/buffer/bufmgr.c | 8 ++++++++
 src/include/storage/buf_internals.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 98c03ef1b1a..246f675333e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3371,6 +3371,7 @@ BufferSync(int flags)
 			item = &CkptBufferIds[num_to_scan++];
 			item->buf_id = buf_id;
 			item->tsId = bufHdr->tag.spcOid;
+			item->db_id = bufHdr->tag.dbOid;
 			item->relNumber = BufTagGetRelNumber(&bufHdr->tag);
 			item->forkNum = BufTagGetForkNum(&bufHdr->tag);
 			item->blockNum = bufHdr->tag.blockNum;
@@ -6691,6 +6692,13 @@ ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b)
 		return -1;
 	else if (a->tsId > b->tsId)
 		return 1;
+
+	/* compare database */
+	if (a->db_id < b->db_id)
+		return -1;
+	else if (a->db_id > b->db_id)
+		return 1;
+
 	/* compare relation */
 	if (a->relNumber < b->relNumber)
 		return -1;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c082a50166f..99f17091a40 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -382,6 +382,7 @@ UnlockBufHdr(BufferDesc *desc, uint32 buf_state)
 typedef struct CkptSortItem
 {
 	Oid			tsId;
+	Oid			db_id;
 	RelFileNumber relNumber;
 	ForkNumber	forkNum;
 	BlockNumber blockNum;
-- 
2.43.0

v5-0007-Implement-checkpointer-data-write-combining.patch (text/x-patch; charset=US-ASCII)

From 5e9ebf1c1cafdb5805671d3c57cfeadab5e8c434 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:42:29 -0400
Subject: [PATCH v5 7/9] Implement checkpointer data write combining

When the checkpointer writes out dirty buffers, writing multiple
contiguous blocks as a single IO is a substantial performance
improvement. The checkpointer is usually bottlenecked on IO, so issuing
larger IOs leads to increased write throughput and faster checkpoints.
---
 src/backend/storage/buffer/bufmgr.c | 232 ++++++++++++++++++++++++----
 src/backend/utils/probes.d          |   2 +-
 2 files changed, 207 insertions(+), 27 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 246f675333e..e7c789dffd7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -512,6 +512,7 @@ static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
+static uint32 checkpointer_max_batch_size(void);
 static void BufferSync(int flags);
 static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
@@ -3313,7 +3314,6 @@ UnpinBufferNoOwner(BufferDesc *buf)
 static void
 BufferSync(int flags)
 {
-	uint32		buf_state;
 	int			buf_id;
 	int			num_to_scan;
 	int			num_spaces;
@@ -3325,6 +3325,8 @@ BufferSync(int flags)
 	int			i;
 	int			mask = BM_DIRTY;
 	WritebackContext wb_context;
+	uint32		max_batch_size;
+	BufWriteBatch batch;
 
 	/*
 	 * Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3355,6 +3357,7 @@ BufferSync(int flags)
 	for (buf_id = 0; buf_id < NBuffers; buf_id++)
 	{
 		BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+		uint32		buf_state;
 
 		/*
 		 * Header spinlock is enough to examine BM_DIRTY, see comment in
@@ -3495,48 +3498,208 @@ BufferSync(int flags)
 	 */
 	num_processed = 0;
 	num_written = 0;
+	max_batch_size = checkpointer_max_batch_size();
 	while (!binaryheap_empty(ts_heap))
 	{
+		BlockNumber limit = max_batch_size;
 		BufferDesc *bufHdr = NULL;
 		CkptTsStatus *ts_stat = (CkptTsStatus *)
 			DatumGetPointer(binaryheap_first(ts_heap));
-
-		buf_id = CkptBufferIds[ts_stat->index].buf_id;
-		Assert(buf_id != -1);
-
-		bufHdr = GetBufferDescriptor(buf_id);
-
-		num_processed++;
+		int			ts_end = ts_stat->index - ts_stat->num_scanned + ts_stat->num_to_scan;
+		int			processed = 0;
 
 		/*
-		 * We don't need to acquire the lock here, because we're only looking
-		 * at a single bit. It's possible that someone else writes the buffer
-		 * and clears the flag right after we check, but that doesn't matter
-		 * since SyncOneBuffer will then do nothing.  However, there is a
-		 * further race condition: it's conceivable that between the time we
-		 * examine the bit here and the time SyncOneBuffer acquires the lock,
-		 * someone else not only wrote the buffer but replaced it with another
-		 * page and dirtied it.  In that improbable case, SyncOneBuffer will
-		 * write the buffer though we didn't need to.  It doesn't seem worth
-		 * guarding against this, though.
+		 * Each batch will have exactly one start and one max lsn and one
+		 * length.
 		 */
-		if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
+		batch.start = InvalidBlockNumber;
+		batch.max_lsn = InvalidXLogRecPtr;
+		batch.n = 0;
+
+		while (batch.n < limit)
 		{
-			if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+			uint32		buf_state;
+			XLogRecPtr	lsn = InvalidXLogRecPtr;
+			LWLock	   *content_lock;
+			CkptSortItem item;
+
+			if (ProcSignalBarrierPending)
+				ProcessProcSignalBarrier();
+
+			/* Check if we are done with this tablespace */
+			if (ts_stat->index + processed >= ts_end)
+				break;
+
+			item = CkptBufferIds[ts_stat->index + processed];
+
+			buf_id = item.buf_id;
+			Assert(buf_id != -1);
+
+			bufHdr = GetBufferDescriptor(buf_id);
+
+			/*
+			 * If this is the first block of the batch, then check if we need
+			 * to open a new relation. Open the relation now because we have
+			 * to determine the maximum IO size based on how many blocks
+			 * remain in the file.
+			 */
+			if (!BlockNumberIsValid(batch.start))
+			{
+				Assert(batch.max_lsn == InvalidXLogRecPtr && batch.n == 0);
+				batch.rlocator.spcOid = item.tsId;
+				batch.rlocator.dbOid = item.db_id;
+				batch.rlocator.relNumber = item.relNumber;
+				batch.forkno = item.forkNum;
+				batch.start = item.blockNum;
+				batch.reln = smgropen(batch.rlocator, INVALID_PROC_NUMBER);
+				limit = smgrmaxcombine(batch.reln, batch.forkno, batch.start);
+				limit = Max(1, limit);
+				limit = Min(limit, max_batch_size);
+			}
+
+			/*
+			 * Once we hit blocks from the next relation or fork of the
+			 * relation, break out of the loop and issue the IO we've built up
+			 * so far. It is important that we don't increment processed
+			 * because we want to start the next IO with this item.
+			 */
+			if (item.db_id != batch.rlocator.dbOid)
+				break;
+
+			if (item.relNumber != batch.rlocator.relNumber)
+				break;
+
+			if (item.forkNum != batch.forkno)
+				break;
+
+			/*
+			 * If the next block is not contiguous, we can't include it in the
+			 * IO we will issue. Break out of the loop and issue what we have
+			 * so far. Do not count this item as processed -- otherwise we
+			 * will end up skipping it.
+			 */
+			if (item.blockNum != batch.start + batch.n)
+				break;
+
+			/*
+			 * We don't need to acquire the lock here, because we're only
+			 * looking at a single bit. It's possible that someone else writes
+			 * the buffer and clears the flag right after we check, but that
+			 * doesn't matter since StartBufferIO will then return false. If
+			 * the buffer doesn't need checkpointing, don't include it in the
+			 * batch we are building. We're done with the item, so count it as
+			 * processed and break out of the loop to issue the IO we have
+			 * built so far.
+			 */
+			if (!(pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED))
+			{
+				processed++;
+				break;
+			}
+
+			ReservePrivateRefCountEntry();
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+
+			buf_state = LockBufHdr(bufHdr);
+
+			/*
+			 * If the buffer doesn't need eviction, we're done with the item,
+			 * so count it as processed and break out of the loop to issue the
+			 * IO so far.
+			 */
+			if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
 			{
-				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-				PendingCheckpointerStats.buffers_written++;
-				num_written++;
+				processed++;
+				UnlockBufHdr(bufHdr, buf_state);
+				break;
+			}
+
+			PinBuffer_Locked(bufHdr);
+
+			/*
+			 * There is a race condition here: it's conceivable that between
+			 * the time we examine the buffer header for BM_CHECKPOINT_NEEDED
+			 * above and the time we acquire the lock here, someone else
+			 * not only wrote the buffer but replaced it with another page and
+			 * dirtied it.  In that improbable case, we will write the buffer
+			 * though we didn't need to.  It doesn't seem worth guarding
+			 * against this, though.
+			 */
+			content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+			/*
+			 * We are willing to wait for the content lock on the first IO in
+			 * the batch. However, for subsequent IOs, waiting could lead to
+			 * deadlock. We have to eventually flush all eligible buffers,
+			 * though. So, if we fail to acquire the lock on a subsequent
+			 * buffer, we break out and issue the IO we've built up so far.
+			 * Then we come back and start a new IO with that buffer as the
+			 * starting buffer. As such, we must not count the item as
+			 * processed if we end up failing to acquire the content lock.
+			 */
+			if (batch.n == 0)
+				LWLockAcquire(content_lock, LW_SHARED);
+			else if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				UnpinBuffer(bufHdr);
+				break;
+			}
+
+			/*
+			 * If the buffer doesn't need IO, count the item as processed,
+			 * release the buffer, and break out of the loop to issue the IO
+			 * we have built up so far.
+			 */
+			if (!StartBufferIO(bufHdr, false, true))
+			{
+				processed++;
+				LWLockRelease(content_lock);
+				UnpinBuffer(bufHdr);
+				break;
 			}
+
+			buf_state = LockBufHdr(bufHdr);
+			lsn = BufferGetLSN(bufHdr);
+			buf_state &= ~BM_JUST_DIRTIED;
+			UnlockBufHdr(bufHdr, buf_state);
+
+			/*
+			 * Keep track of the max LSN so that we can be sure to flush
+			 * enough WAL before flushing data from the buffers. See comment
+			 * in DoFlushBuffer() for more on why we don't consider the LSNs
+			 * of unlogged relations.
+			 */
+			if (buf_state & BM_PERMANENT && lsn > batch.max_lsn)
+				batch.max_lsn = lsn;
+
+			batch.bufdescs[batch.n++] = bufHdr;
+			processed++;
 		}
 
 		/*
 		 * Measure progress independent of actually having to flush the buffer
 		 * - otherwise writing become unbalanced.
 		 */
-		ts_stat->progress += ts_stat->progress_slice;
-		ts_stat->num_scanned++;
-		ts_stat->index++;
+		num_processed += processed;
+		ts_stat->progress += ts_stat->progress_slice * processed;
+		ts_stat->num_scanned += processed;
+		ts_stat->index += processed;
+
+		/*
+		 * If we built up an IO, issue it. There's a chance we didn't find any
+		 * items referencing buffers that needed flushing this time, but we
+		 * still want to check if we should update the heap if we examined and
+		 * processed the items.
+		 */
+		if (batch.n > 0)
+		{
+			FlushBufferBatch(&batch, IOCONTEXT_NORMAL);
+			CompleteWriteBatchIO(&batch, IOCONTEXT_NORMAL, &wb_context);
+
+			TRACE_POSTGRESQL_BUFFER_BATCH_SYNC_WRITTEN(batch.n);
+			PendingCheckpointerStats.buffers_written += batch.n;
+			num_written += batch.n;
+		}
 
 		/* Have all the buffers from the tablespace been processed? */
 		if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -4262,6 +4425,23 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * The maximum number of blocks that can be written out in a single batch by
+ * the checkpointer.
+ */
+static uint32
+checkpointer_max_batch_size(void)
+{
+	uint32		result;
+	uint32		pin_limit = GetPinLimit();
+
+	result = Max(pin_limit, 1);
+	result = Min(pin_limit, io_combine_limit);
+	result = Max(result, 1);
+	Assert(result < MAX_IO_COMBINE_LIMIT);
+	return result;
+}
+
 
 /*
  * Given a buffer descriptor, start, from a strategy ring, strategy, that
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 36dd4f8375b..d6970731ba9 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -68,7 +68,7 @@ provider postgresql {
 	probe buffer__checkpoint__sync__start();
 	probe buffer__checkpoint__done();
 	probe buffer__sync__start(int, int);
-	probe buffer__sync__written(int);
+	probe buffer__batch__sync__written(BlockNumber);
 	probe buffer__sync__done(int, int, int);
 
 	probe deadlock__found();
-- 
2.43.0

Jeff Davis

pgsql@j-davis.com

4 months ago

In reply to: Melanie Plageman (#6)

Re: Checkpointer write combining

On Tue, 2025-09-09 at 13:55 -0400, Melanie Plageman wrote:

On Tue, Sep 9, 2025 at 11:16 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

One more fix and a bit more cleanup in attached v4.

Okay one more version: I updated the thread on eager flushing the
bulkwrite ring [1], and some updates were needed here.

v5-0005 comments:

* Please update the comment above the code change.
* The last paragraph in the commit message has a typo: "potentially
update the local copy of min recovery point, when xlog inserts are
*not* allowed", right?
* Shouldn't the code be consistent between XLogNeedsFlush() and
XLogFlush()? The latter only checks for !XLogInsertAllowed(), whereas
the former also checks for RecoveryInProgress().

I'm still not sure I understand the problem situation this is fixing,
but that's being discussed in another thread.

Regards,
Jeff Davis

Chao Li

li.evan.chao@gmail.com

4 months ago

In reply to: Melanie Plageman (#6)

Re: Checkpointer write combining

On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman@gmail.com> wrote:

[1] /messages/by-id/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig@mail.gmail.com
<v5-0004-Write-combining-for-BAS_BULKWRITE.patch><v5-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v5-0005-Fix-XLogNeedsFlush-for-checkpointer.patch><v5-0002-Split-FlushBuffer-into-two-parts.patch><v5-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v5-0006-Add-database-Oid-to-CkptSortItem.patch><v5-0007-Implement-checkpointer-data-write-combining.patch>

1 - 0001
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+ * The buffer must pinned and content locked and the buffer header spinlock
```

“must pinned” -> “must be pinned”

2 - 0001
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+	if (XLogNeedsFlush(lsn))
+	{
+		/*
+		 * Remove the dirty buffer from the ring; necessary to prevent an
+		 * infinite loop if all ring members are dirty.
+		 */
+		strategy->buffers[strategy->current] = InvalidBuffer;
+		return true;
+	}

-	return true;
+	return false;
 }
```

We can do:
```
	if (!XLogNeedsFlush(lsn))
		return false;

	/* Remove the dirty buffer from the ring ... */
	strategy->buffers[strategy->current] = InvalidBuffer;
	return true;
}
```

This makes the diff smaller.

3 - 0002
```
+ * Prepare to write and write a dirty victim buffer.
```

Prepare to write a dirty victim buffer.

4 - 0002
```
-			/* OK, do the I/O */
-			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-			LWLockRelease(content_lock);
-
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &buf_hdr->tag);
+			CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
```
I see that CleanVictimBuffer() gets content_lock from bufdesc and releases it, but that makes the code hard to follow. Without reading CleanVictimBuffer() itself, readers may wonder why content_lock is not released after the call.

I’d suggest passing content_lock to CleanVictimBuffer() as a parameter, which gives a clear hint that CleanVictimBuffer() will release the lock.

5 - 0002
```
* disastrous system-wide consequences. To make sure that can't happen,
* skip the flush if the buffer isn't permanent.
*/
- if (buf_state & BM_PERMANENT)
- XLogFlush(recptr);
+ if (!XLogRecPtrIsInvalid(buffer_lsn))
+ XLogFlush(buffer_lsn);
```

Why was this check changed? Should the comment be updated accordingly? It still says “if the buffer isn’t permanent”, which reflects the old code.

6 - 0003
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+ * patterns than lazily flushing buffers directly before reusing them.
+ */
```

Here “directly” is somewhat ambiguous. It could mean “immediately before” or “without going through something else”. My understanding is that it means “immediately”. If that is true, please change “directly” to “immediately” or just remove it.

7 - 0003
```
+void
+StartStrategySweep(BufferAccessStrategy strategy)
+{
+	if (!strategy)
+		return;
```

I doubt that this null check on “strategy” is needed, because the function is only called when strategy_supports_eager_flush() is true, and strategy_supports_eager_flush() already asserts “strategy”.

Also, its companion function, StrategySweepNextBuffer(), neither null-checks nor asserts strategy.

8 - 0003
```
bool
+strategy_supports_eager_flush(BufferAccessStrategy strategy)
```

This function is only used in bufmgr.c, can we move it there and make it static?

9 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+	batch->start = batch->bufdescs[0]->tag.blockNum;
+	batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+	batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
+	batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+	Assert(BlockNumberIsValid(batch->start));
```

Why not assert immediately after batch->start is assigned? Then, if the assertion fails, smgropen() will not be called.

10 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+ limit = Min(max_batch_size, limit);
```

Do we need to check that max_batch_size is less than (MAX_IO_COMBINE_LIMIT-1)? BufWriteBatch.bufdescs is defined with a length of MAX_IO_COMBINE_LIMIT, and the first slot is already used to store “start”.
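
One way to read this concern is as a bound on how many entries a batch may hold. The following is a minimal, self-contained sketch of that bound, not the patch's code. It assumes that the batch length counts the starting buffer held in bufdescs[0] (as the assignment quoted in item 9 suggests), so the requirement is simply that the length never exceeds the array capacity. SketchBatch, sketch_clamp_batch_size(), sketch_batch_add() and the capacity value 128 are hypothetical names and values chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed capacity of the descriptor array; the value is illustrative only. */
#define SKETCH_MAX_IO_COMBINE_LIMIT 128

/* Hypothetical stand-in for BufWriteBatch: only the fields needed here. */
typedef struct SketchBatch
{
	uint32_t	n;				/* entries in use, including the first one */
	int			bufdescs[SKETCH_MAX_IO_COMBINE_LIMIT];
} SketchBatch;

/*
 * Clamp a requested batch size to [1, SKETCH_MAX_IO_COMBINE_LIMIT], taking
 * the smaller of the pin limit and the configured combine limit first.
 */
static uint32_t
sketch_clamp_batch_size(uint32_t pin_limit, uint32_t io_combine_limit)
{
	uint32_t	result = pin_limit < io_combine_limit ? pin_limit : io_combine_limit;

	if (result < 1)
		result = 1;
	if (result > SKETCH_MAX_IO_COMBINE_LIMIT)
		result = SKETCH_MAX_IO_COMBINE_LIMIT;
	return result;
}

/* Append one entry; returns 0 without writing when the batch is full. */
static int
sketch_batch_add(SketchBatch *batch, uint32_t limit, int bufnum)
{
	assert(limit <= SKETCH_MAX_IO_COMBINE_LIMIT);
	if (batch->n >= limit)
		return 0;
	batch->bufdescs[batch->n++] = bufnum;
	return 1;
}
```

With a clamp of this shape, a batch can be filled up to the limit without writing past the array, whichever of the two inputs is larger. Whether the real bound should be MAX_IO_COMBINE_LIMIT or one less depends on how 0004 actually uses the first slot, which this sketch does not settle.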

11 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+	for (batch->n = 1; batch->n < limit; batch->n++)
+	{
+		Buffer		bufnum;
+
+		if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
+			break;
```

Is the next buffer in the sweep always the one right after start? If so, can we assert that? My guess is that it is not. In that case, is it possible for bufnum to wrap around and reach start? If so, we should check that the next buffer is not equal to start.
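
To make the wraparound question concrete, here is a small, self-contained model of the sweep as it appears in the 0003 version of StartStrategySweep() and StrategySweepNextBuffer() quoted earlier in this thread. It is only a sketch: RingSweep and the ring_sweep_* functions are hypothetical stand-ins, buffer number 0 plays the role of InvalidBuffer, and 0004 may have changed the real functions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the sweep fields of BufferAccessStrategyData. */
typedef struct RingSweep
{
	int			nbuffers;		/* number of slots in the ring */
	int			current;		/* slot of the victim buffer */
	int			sweep_start;	/* slot where the sweep began */
	int			sweep_current;	/* slot most recently visited */
	const int  *buffers;		/* buffer numbers; 0 means an empty slot */
} RingSweep;

/* Models StartStrategySweep(): begin the sweep at the victim's slot. */
static void
ring_sweep_start(RingSweep *ring)
{
	ring->sweep_start = ring->sweep_current = ring->current;
}

/*
 * Models StrategySweepNextBuffer(): advance with wraparound and return 0
 * once the sweep has come back around to its start slot.
 */
static int
ring_sweep_next(RingSweep *ring)
{
	ring->sweep_current++;
	if (ring->sweep_current >= ring->nbuffers)
		ring->sweep_current = 0;

	if (ring->sweep_current == ring->sweep_start)
		return 0;

	return ring->buffers[ring->sweep_current];
}

/* Count the buffers one full sweep returns; the start slot is never among them. */
static int
ring_sweep_count(RingSweep *ring)
{
	int			visited = 0;

	ring_sweep_start(ring);
	while (ring_sweep_next(ring) != 0)
	{
		assert(ring->sweep_current != ring->sweep_start);
		visited++;
	}
	return visited;
}
```

In this model, a ring of four filled slots with the victim in slot 2 yields slots 3, 0 and 1 and then stops, so the start slot is never handed back. Note that an empty slot also returns 0, so a caller that loops until it sees 0 cannot tell an unfilled slot from the end of the sweep.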

12 - 0004
```
@@ -4306,19 +4370,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,

if (from_ring && strategy_supports_eager_flush(strategy))
{
+ uint32 max_batch_size = max_write_batch_size_for_strategy(strategy);
```

I think max_batch_size could be an attribute of the strategy, set when the strategy is created, so that we don’t need to recalculate it in every round of cleaning.

13 - 0004
```
+void
+CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+					 WritebackContext *wb_context)
+{
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+	pgBufferUsage.shared_blks_written += batch->n;
```

Should we increase shared_blks_written only after the write-back loop is done?

14 - 0004
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+uint32
+max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
```

I think this function can be moved to bufmgr.c and made static.

15 - 0004
```
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)

 	((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
 }
+
+/*
+ * A helper to set multiple block's checksums.
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, BlockNumber *blknos, uint32 length)
```

We should mark blknos as const to indicate that it is read-only: const BlockNumber *blknos. This also prevents accidental changes to blknos within the function.
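
To illustrate the const suggestion, here is a minimal standalone sketch. Everything in it is a toy stand-in rather than PostgreSQL code: ToyPage, toy_checksum() and toy_set_batch_checksum() are hypothetical names, and the checksum only mimics the shape of pg_checksum_page().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNumber;	/* stand-in for PostgreSQL's BlockNumber */

/* Toy page: a checksum slot plus a small payload (hypothetical layout). */
typedef struct ToyPage
{
	uint16_t	checksum;
	uint8_t		data[8];
} ToyPage;

/* Hypothetical checksum over the payload and block number. */
uint16_t
toy_checksum(const ToyPage *page, BlockNumber blkno)
{
	uint32_t	sum = blkno;

	for (size_t i = 0; i < sizeof(page->data); i++)
		sum = sum * 31 + page->data[i];
	return (uint16_t) (sum & 0xFFFF);
}

/*
 * Set checksums for a batch of pages. Because blknos is const-qualified,
 * an accidental write such as blknos[i] = 0 is rejected by the compiler.
 */
void
toy_set_batch_checksum(ToyPage **pages, const BlockNumber *blknos,
					   uint32_t length)
{
	for (uint32_t i = 0; i < length; i++)
		pages[i]->checksum = toy_checksum(pages[i], blknos[i]);
}
```

The pages themselves are still written through, so only the block-number array gets the qualifier.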

16 - 0005
```
 	 * instead. So "needs flush" is taken to mean whether minRecoveryPoint
 	 * would need to be updated.
 	 */
-	if (RecoveryInProgress())
+	if (RecoveryInProgress() && !XLogInsertAllowed())
```

As a new check is added, the comment should be updated accordingly.
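
For reference, the decision that the new condition encodes could be sketched roughly as follows. This is a simplified model, not the real XLogNeedsFlush(): ToyXLogState and toy_xlog_needs_flush() are hypothetical, the state is passed in explicitly, and locking plus the crash-recovery special case (invalid minRecoveryPoint) are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's XLogRecPtr */

/* Hypothetical snapshot of the state the real function consults. */
typedef struct ToyXLogState
{
	bool		recovery_in_progress;
	bool		insert_allowed; /* e.g. checkpointer at end of recovery */
	XLogRecPtr	min_recovery_point;
	XLogRecPtr	flush_ptr;
} ToyXLogState;

bool
toy_xlog_needs_flush(const ToyXLogState *st, XLogRecPtr record)
{
	/* In recovery with WAL inserts forbidden: compare to minRecoveryPoint */
	if (st->recovery_in_progress && !st->insert_allowed)
		return record > st->min_recovery_point;

	/* Otherwise WAL is actually flushed, so compare to the flush pointer */
	return record > st->flush_ptr;
}
```

A comment on the real check would need to cover both branches, including the case where recovery is in progress but WAL inserts are allowed.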

17 - 0006
```
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -382,6 +382,7 @@ UnlockBufHdr(BufferDesc *desc, uint32 buf_state)
 typedef struct CkptSortItem
 {
 	Oid			tsId;
+	Oid			db_id;
```

I think “db_id” should be named “dbId” or “dbOid”. Let’s keep the naming convention consistent.

18 - 0007
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+ max_batch_size = checkpointer_max_batch_size();
```

Looks like we don’t need to calculate max_batch_size inside the for loop.

19 - 0007
```
+		 * Each batch will have exactly one start and one max lsn and one
+		 * length.
 		 */
```

I don’t understand what this comment is meant to explain. It seems unnecessary.

Best regards,
—
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Melanie Plageman

melanieplageman@gmail.com

4 months ago

In reply to: Chao Li (#8)

7 attachment(s)

Re: Checkpointer write combining

On Wed, Sep 10, 2025 at 4:24 AM Chao Li <li.evan.chao@gmail.com> wrote:

Thanks for the review!

For any of your feedback that I simply implemented, I omitted an
inline comment about it. Those changes are included in the attached
v6. My inline replies below are only for feedback requiring more
discussion.

On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman@gmail.com> wrote:
2 - 0001
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
+ if (XLogNeedsFlush(lsn))
+ {
+ /*
+ * Remove the dirty buffer from the ring; necessary to prevent an
+ * infinite loop if all ring members are dirty.
+ */
+ strategy->buffers[strategy->current] = InvalidBuffer;
+ return true;
+ }
- return true;
+ return false;
}
```
We can do:
```
if (!XLogNeedsFlush(lsn))
    return false;

/* Remove the dirty buffer ….
*/
return true;
}
```

This would make the order of evaluation the same as master, but I
actually prefer it this way because then we only take the buffer
header spinlock if there is a chance we will reject the buffer (e.g.
we don't need to examine it for strategies other than BAS_BULKREAD).

4 - 0002
```
- /* OK, do the I/O */
- FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
- LWLockRelease(content_lock);
-
- ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-  &buf_hdr->tag);
+ CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
```
I see that CleanVictimBuffer() gets content_lock from bufdesc and releases it, but that makes the code hard to understand. Without reading CleanVictimBuffer(), readers might wonder why content_lock is not released after the call.

I’d suggest passing content_lock to CleanVictimBuffer() as a parameter, which gives a clear hint that CleanVictimBuffer() will release the lock.

I think for this specific patch in the set your idea makes sense.
However, in the later patch to do write combining, I release the
content locks for the batch in CompleteWriteBatchIO() and having the
start buffer's lock separate as a parameter would force me to have a
special case handling this.

I've added a comment to both CleanVictimBuffer() and its caller
specifying that the lock must be held and that it will be released
inside CleanVictimBuffer.

5 - 0002
```
* disastrous system-wide consequences. To make sure that can't happen,
* skip the flush if the buffer isn't permanent.
*/
- if (buf_state & BM_PERMANENT)
- XLogFlush(recptr);
+ if (!XLogRecPtrIsInvalid(buffer_lsn))
+ XLogFlush(buffer_lsn);
```

Why is this check changed? Should the comment be updated accordingly? It says “if the buffer isn’t permanent”, which reflects the old code.

It's changed because I split the logic for flushing to that LSN and
determining the LSN across the Prepare and Do functions. This is
needed because when we do batches, we want to flush to the max LSN
across all buffers in the batch.

I check if the buffer is BM_PERMANENT in PrepareFlushBuffer(). You
make a good point about my comment, though. I've moved it to
PrepareFlushBuffer() and updated it.
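
As a rough illustration of why the split helps batching, the sketch below computes one flush target for a whole batch. It is a toy model under stated assumptions: ToyBuffer, toy_prepare() and toy_batch_max_lsn() are hypothetical names, the permanent field stands in for BM_PERMANENT, and locking and I/O are omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's XLogRecPtr */
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef struct ToyBuffer
{
	bool		permanent;		/* stand-in for the BM_PERMANENT flag */
	XLogRecPtr	page_lsn;
} ToyBuffer;

/* "Prepare" step: report the LSN that must be flushed for this buffer. */
XLogRecPtr
toy_prepare(const ToyBuffer *buf)
{
	/* Non-permanent buffers never force a WAL flush */
	return buf->permanent ? buf->page_lsn : InvalidXLogRecPtr;
}

/* Highest LSN over the batch; the "Do" step flushes WAL once up to it. */
XLogRecPtr
toy_batch_max_lsn(const ToyBuffer *bufs, size_t n)
{
	XLogRecPtr	max_lsn = InvalidXLogRecPtr;

	for (size_t i = 0; i < n; i++)
	{
		XLogRecPtr	lsn = toy_prepare(&bufs[i]);

		if (lsn > max_lsn)
			max_lsn = lsn;
	}
	return max_lsn;
}
```

With the LSN determined up front, the flush step only has to test whether the result is valid, which matches the shape of the new check.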

8 - 0003
```
bool
+strategy_supports_eager_flush(BufferAccessStrategy strategy)
```

This function is only used in bufmgr.c, can we move it there and make it static?

BufferAccessStrategyData is opaque to bufmgr.c. Only freelist.c can
access it. I agree it is gross that I have these helpers and functions
that would otherwise be static, though.

10 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
+ limit = Min(max_batch_size, limit);
```

Do we need to check that max_batch_size is less than (MAX_IO_COMBINE_LIMIT-1)? BufWriteBatch.bufdescs is defined with a length of MAX_IO_COMBINE_LIMIT, and the first slot is already used to store “start”.

I assert that in StrategyMaxWriteBatchSize(). io_combine_limit is not
allowed to exceed MAX_IO_COMBINE_LIMIT, so it shouldn't happen anyway,
since we are capping ourselves at io_combine_limit. Or is that your
point?
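
To make the bound concrete, here is a small standalone sketch of a fill loop with the same shape as the one quoted above. The names (ToyBatch, toy_fill_batch, TOY_MAX_IO_COMBINE_LIMIT) and the capacity of 128 are assumptions for illustration, and buffers are plain ints. Since n starts at 1 with the start buffer in slot 0 and the loop runs while n < limit, the highest index written is limit - 1.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed capacity, for illustration only; not taken from the patch. */
#define TOY_MAX_IO_COMBINE_LIMIT 128

typedef struct ToyBatch
{
	uint32_t	n;				/* buffers in the batch, start included */
	int			bufs[TOY_MAX_IO_COMBINE_LIMIT];
} ToyBatch;

/*
 * Fill a batch beginning with start. available is the number of further
 * candidate buffers; returns the total number of buffers in the batch.
 */
uint32_t
toy_fill_batch(ToyBatch *batch, int start, uint32_t io_combine_limit,
			   uint32_t available)
{
	uint32_t	limit = io_combine_limit;

	/* The cap can never exceed the array capacity */
	assert(limit <= TOY_MAX_IO_COMBINE_LIMIT);

	batch->bufs[0] = start;
	for (batch->n = 1; batch->n < limit; batch->n++)
	{
		if (batch->n > available)
			break;
		batch->bufs[batch->n] = start + (int) batch->n;
	}
	return batch->n;
}
```

In this model a limit equal to the array length is still safe, so no extra "minus one" is needed for the start slot.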

11 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
+ for (batch->n = 1; batch->n < limit; batch->n++)
+ {
+ Buffer bufnum;
+
+ if ((bufnum = StrategySweepNextBuffer(strategy)) == InvalidBuffer)
+ break;
```
Is the next buffer in the sweep always right after start? If so, can we assert that? My guess is no, and in that case, is it possible for bufnum to reach start again? If so, we should check that the next buffer is not equal to start.

Ah, great point. I didn't think about this. Our sweep will always
start right after the start buffer, but then if it goes all the way
around, it will "lap" the start buffer. Because of this and because I
think it is weird to have the sweep variables in the
BufferAccessStrategy object, I've changed my approach in attached v6.
I set sweep_end to be the start block in the batch and then pass
around a sweep cursor variable. Hitting sweep_end is the termination
condition.
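
A minimal sketch of that termination rule, using a toy ring of ints instead of the real BufferAccessStrategy: ToyRing, toy_sweep_next() and toy_sweep_count() are hypothetical, and the sketch assumes the start buffer sits at the ring's current position.

```c
#include <assert.h>

typedef int Buffer;				/* stand-in for PostgreSQL's Buffer */
#define InvalidBuffer 0

/* Toy strategy ring (hypothetical; the real struct is private to freelist.c) */
typedef struct ToyRing
{
	int			nbuffers;
	int			current;		/* slot holding the start (victim) buffer */
	const Buffer *buffers;
} ToyRing;

/* Advance the caller-owned cursor, wrapping at the end of the ring. */
Buffer
toy_sweep_next(const ToyRing *ring, int *cursor)
{
	if (++(*cursor) >= ring->nbuffers)
		*cursor = 0;
	return ring->buffers[*cursor];
}

/* Count the buffers a sweep visits before it laps the start buffer. */
int
toy_sweep_count(const ToyRing *ring)
{
	Buffer		sweep_end = ring->buffers[ring->current];
	int			cursor = ring->current;
	int			visited = 0;
	Buffer		buf;

	while ((buf = toy_sweep_next(ring, &cursor)) != sweep_end)
	{
		/* An invalid slot means the rest of the ring is unused */
		if (buf == InvalidBuffer)
			break;
		visited++;
	}
	return visited;
}
```

Because the cursor is a local variable, each sweep is independent and no sweep state has to live in the strategy object.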

12 - 0004
```
@@ -4306,19 +4370,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,

if (from_ring && strategy_supports_eager_flush(strategy))
{
+ uint32 max_batch_size = max_write_batch_size_for_strategy(strategy);
```

I think max_batch_size could be an attribute of the strategy, set when the strategy is created, so that we don’t need to calculate it in every round of cleaning.

Actually, the max pin limit can change quite frequently. See
GetAdditionalPinLimit()'s usage in read stream code. If the query is
pinning other buffers in another part of the query, it can change our
limit.

I'm not sure if I should call GetAdditionalPinLimit() for each batch
or for each run of batches (like in StrategyMaxWriteBatchSize()).
Currently, I call it for each batch (in FindFlushAdjacents()). The
read stream calls it pretty frequently (each
read_stream_start_pending_read()). But, in the batch flush case,
nothing could change between batches in a run of batches. So maybe I
should move it up and out and make it per run of batches...

13 - 0004
```
+void
+CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+ WritebackContext *wb_context)
+{
+ ErrorContextCallback errcallback =
+ {
+ .callback = shared_buffer_write_error_callback,
+ .previous = error_context_stack,
+ };
+
+ error_context_stack = &errcallback;
+ pgBufferUsage.shared_blks_written += batch->n;
```

Should we increase shared_blks_written only after the write-back loop is done?

On master, FlushBuffer() does it after smgrwrite() (before writeback).
I think pgBufferUsage is mainly used in EXPLAIN (also
pg_stat_statements) which won't be used until the end of the query and
won't be displayed if we error out.

14 - 0004
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c

+uint32
+max_write_batch_size_for_strategy(BufferAccessStrategy strategy)
```

I think this function can be moved to bufmgr.c and made static.

This technically could be moved, but it is a function giving you
information about a strategy which seemed to fit better in freelist.c.

18 - 0007
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
+ max_batch_size = checkpointer_max_batch_size();
```

Looks like we don’t need to calculate max_batch_size inside the for loop.

I don't think it's in the for loop.

- Melanie

Attachments:

v6-0002-Split-FlushBuffer-into-two-parts.patch (text/x-patch; charset=US-ASCII)

From c782753a430c1c967125509c6390d4e710fd2a63 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:32:24 -0400
Subject: [PATCH v6 2/9] Split FlushBuffer() into two parts

Before adding write combining to write a batch of blocks when flushing
dirty buffers, refactor FlushBuffer() into the preparatory step and
actual buffer flushing step. This separation provides symmetry with
future code for batch flushing which necessarily separates these steps,
as it must prepare multiple buffers before flushing them together.

These steps are moved into a new FlushBuffer() helper function,
CleanVictimBuffer() which will contain both the batch flushing and
single flush code in future commits.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 141 +++++++++++++++++++---------
 1 file changed, 98 insertions(+), 43 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f3668051574..f40f57e5582 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -529,8 +529,13 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
 static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
 static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc, uint32 *buf_state, XLogRecPtr *lsn);
+static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+						  IOContext io_context, XLogRecPtr buffer_lsn);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+static void CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
+							  bool from_ring, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2414,12 +2419,8 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 				continue;
 			}
 
-			/* OK, do the I/O */
-			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-			LWLockRelease(content_lock);
-
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &buf_hdr->tag);
+			/* Content lock is released inside CleanVictimBuffer */
+			CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
 		}
 
 		if (buf_state & BM_VALID)
@@ -4246,53 +4247,66 @@ static void
 FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 			IOContext io_context)
 {
-	XLogRecPtr	recptr;
-	ErrorContextCallback errcallback;
-	instr_time	io_start;
-	Block		bufBlock;
-	char	   *bufToWrite;
 	uint32		buf_state;
+	XLogRecPtr	lsn;
 
-	/*
-	 * Try to start an I/O operation.  If StartBufferIO returns false, then
-	 * someone else flushed the buffer before we could, so we need not do
-	 * anything.
-	 */
-	if (!StartBufferIO(buf, false, false))
-		return;
+	if (PrepareFlushBuffer(buf, &buf_state, &lsn))
+		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
+}
 
-	/* Setup error traceback support for ereport() */
-	errcallback.callback = shared_buffer_write_error_callback;
-	errcallback.arg = buf;
-	errcallback.previous = error_context_stack;
-	error_context_stack = &errcallback;
+/*
+ * Prepare and write out a dirty victim buffer.
+ *
+ * Buffer must be pinned, the content lock must be held exclusively, and the
+ * buffer header spinlock must not be held. The exclusive lock is released and
+ * the buffer is returned pinned but not locked.
+ *
+ * bufdesc and buf_state may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
+				  bool from_ring, IOContext io_context)
+{
 
-	/* Find smgr relation for buffer */
-	if (reln == NULL)
-		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	LWLock	   *content_lock;
 
-	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
-										buf->tag.blockNum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber);
+	Assert(*buf_state & BM_DIRTY);
 
-	buf_state = LockBufHdr(buf);
+	/* Set up this victim buffer to be flushed */
+	if (!PrepareFlushBuffer(bufdesc, buf_state, &max_lsn))
+		return;
 
+	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	LWLockRelease(content_lock);
+	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+								  &bufdesc->tag);
+}
+
+/*
+ * Prepare the buffer with bufdesc for writing. buf_state and lsn are output
+ * parameters. Returns true if the buffer actually needs writing and false
+ * otherwise. All three parameters may be modified.
+ */
+static bool
+PrepareFlushBuffer(BufferDesc *bufdesc, uint32 *buf_state, XLogRecPtr *lsn)
+{
 	/*
-	 * Run PageGetLSN while holding header lock, since we don't have the
-	 * buffer locked exclusively in all cases.
+	 * Try to start an I/O operation.  If StartBufferIO returns false, then
+	 * someone else flushed the buffer before we could, so we need not do
+	 * anything.
 	 */
-	recptr = BufferGetLSN(buf);
+	if (!StartBufferIO(bufdesc, false, false))
+		return false;
 
-	/* To check if block content changes while flushing. - vadim 01/17/97 */
-	buf_state &= ~BM_JUST_DIRTIED;
-	UnlockBufHdr(buf, buf_state);
+	*lsn = InvalidXLogRecPtr;
+	*buf_state = LockBufHdr(bufdesc);
 
 	/*
-	 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
-	 * rule that log updates must hit disk before any of the data-file changes
-	 * they describe do.
+	 * Record the buffer's LSN. We will force XLOG flush up to buffer's LSN.
+	 * This implements the basic WAL rule that log updates must hit disk
+	 * before any of the data-file changes they describe do.
 	 *
 	 * However, this rule does not apply to unlogged relations, which will be
 	 * lost after a crash anyway.  Most unlogged relation pages do not bear
@@ -4305,9 +4319,50 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * happen, attempting to flush WAL through that location would fail, with
 	 * disastrous system-wide consequences.  To make sure that can't happen,
 	 * skip the flush if the buffer isn't permanent.
+	 *
+	 * We must hold the buffer header lock when examining the page LSN since
+	 * we don't have the buffer exclusively locked in all cases.
 	 */
-	if (buf_state & BM_PERMANENT)
-		XLogFlush(recptr);
+	if (*buf_state & BM_PERMANENT)
+		*lsn = BufferGetLSN(bufdesc);
+
+	/* To check if block content changes while flushing. - vadim 01/17/97 */
+	*buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, *buf_state);
+	return true;
+}
+
+/*
+ * Actually do the write I/O to clean a buffer. buf and reln may be modified.
+ */
+static void
+DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+			  IOContext io_context, XLogRecPtr buffer_lsn)
+{
+	ErrorContextCallback errcallback;
+	instr_time	io_start;
+	Block		bufBlock;
+	char	   *bufToWrite;
+
+	/* Setup error traceback support for ereport() */
+	errcallback.callback = shared_buffer_write_error_callback;
+	errcallback.arg = buf;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* Find smgr relation for buffer */
+	if (reln == NULL)
+		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+
+	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
+										buf->tag.blockNum,
+										reln->smgr_rlocator.locator.spcOid,
+										reln->smgr_rlocator.locator.dbOid,
+										reln->smgr_rlocator.locator.relNumber);
+
+	/* Force XLOG flush up to buffer's LSN */
+	if (!XLogRecPtrIsInvalid(buffer_lsn))
+		XLogFlush(buffer_lsn);
 
 	/*
 	 * Now it's safe to write the buffer to disk. Note that no one else should
-- 
2.43.0

v6-0003-Eagerly-flush-bulkwrite-strategy-ring.patch (text/x-patch; charset=US-ASCII)

From dbcc430c4b92c2a69f84fe9ab3faa94f61eb3d99 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 12:43:24 -0400
Subject: [PATCH v6 3/9] Eagerly flush bulkwrite strategy ring

Operations using BAS_BULKWRITE (COPY FROM and createdb) will inevitably
need to flush buffers in the strategy ring in order to reuse them. By
eagerly flushing the buffers in a larger run, we encourage larger writes
at the kernel level and less interleaving of WAL flushes and data file
writes. The effect is mainly noticeable with multiple parallel COPY
FROMs. In this case, client backends achieve higher write throughput and
end up spending less time waiting on acquiring the lock to flush WAL.
Larger flush operations also mean less time waiting for flush operations
at the kernel level.

The heuristic for eager eviction is to only flush buffers in the
strategy ring which do not require a WAL flush.

This patch also is a step toward AIO writes.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Earlier version Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 189 +++++++++++++++++++++++++-
 src/backend/storage/buffer/freelist.c |  48 +++++++
 src/include/storage/buf_internals.h   |   4 +
 3 files changed, 235 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f40f57e5582..c64268f31ae 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -534,7 +534,16 @@ static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object
 						  IOContext io_context, XLogRecPtr buffer_lsn);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
-static void CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
+static BufferDesc *NextStratBufToFlush(BufferAccessStrategy strategy,
+									   Buffer sweep_end,
+									   XLogRecPtr *lsn,
+									   int *sweep_cursor);
+static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+												   RelFileLocator *rlocator,
+												   bool skip_pinned,
+												   XLogRecPtr *max_lsn);
+static void CleanVictimBuffer(BufferAccessStrategy strategy,
+							  BufferDesc *bufdesc, uint32 *buf_state,
 							  bool from_ring, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
@@ -2420,7 +2429,7 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 			}
 
 			/* Content lock is released inside CleanVictimBuffer */
-			CleanVictimBuffer(buf_hdr, &buf_state, from_ring, io_context);
+			CleanVictimBuffer(strategy, buf_hdr, &buf_state, from_ring, io_context);
 		}
 
 		if (buf_state & BM_VALID)
@@ -4254,6 +4263,40 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * Returns the buffer descriptor of the buffer containing the next block we
+ * should eagerly flush or NULL when there are no further buffers to consider
+ * writing out.
+ */
+static BufferDesc *
+NextStratBufToFlush(BufferAccessStrategy strategy,
+					Buffer sweep_end,
+					XLogRecPtr *lsn, int *sweep_cursor)
+{
+	Buffer		bufnum;
+	BufferDesc *bufdesc;
+
+	while ((bufnum =
+			StrategySweepNextBuffer(strategy, sweep_cursor)) != sweep_end)
+	{
+		/*
+		 * For BAS_BULKWRITE, once you hit an InvalidBuffer, the remaining
+		 * buffers in the ring will be invalid.
+		 */
+		if (!BufferIsValid(bufnum))
+			break;
+
+		if ((bufdesc = PrepareOrRejectEagerFlushBuffer(bufnum,
+													   InvalidBlockNumber,
+													   NULL,
+													   true,
+													   lsn)) != NULL)
+			return bufdesc;
+	}
+
+	return NULL;
+}
+
 /*
  * Prepare and write out a dirty victim buffer.
  *
@@ -4264,12 +4307,14 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
  * bufdesc and buf_state may be modified.
  */
 static void
-CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
+CleanVictimBuffer(BufferAccessStrategy strategy,
+				  BufferDesc *bufdesc, uint32 *buf_state,
 				  bool from_ring, IOContext io_context)
 {
 
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
+	bool		first_buffer = true;
 
 	Assert(*buf_state & BM_DIRTY);
 
@@ -4277,11 +4322,143 @@ CleanVictimBuffer(BufferDesc *bufdesc, uint32 *buf_state,
 	if (!PrepareFlushBuffer(bufdesc, buf_state, &max_lsn))
 		return;
 
-	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	if (from_ring && StrategySupportsEagerFlush(strategy))
+	{
+		Buffer		sweep_end = BufferDescriptorGetBuffer(bufdesc);
+		int			cursor = StrategySweepStart(strategy);
+
+		/* Clean victim buffer and find more to flush opportunistically */
+		do
+		{
+			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+			content_lock = BufferDescriptorGetContentLock(bufdesc);
+			LWLockRelease(content_lock);
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &bufdesc->tag);
+			/* We leave the first buffer pinned for the caller */
+			if (!first_buffer)
+				UnpinBuffer(bufdesc);
+			first_buffer = false;
+		} while ((bufdesc = NextStratBufToFlush(strategy, sweep_end,
+												&max_lsn, &cursor)) != NULL);
+	}
+	else
+	{
+		DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+		content_lock = BufferDescriptorGetContentLock(bufdesc);
+		LWLockRelease(content_lock);
+		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+									  &bufdesc->tag);
+	}
+}
+
+/*
+ * Prepare bufdesc for eager flushing.
+ *
+ * Given bufnum, return the block -- the pointer to the block data in memory
+ * -- which we will opportunistically flush or NULL if this buffer does not
+ *  contain a block that should be flushed.
+ *
+ * require is the BlockNumber required by the caller. Some callers may require
+ * a specific BlockNumber to be in bufnum because they are assembling a
+ * contiguous run of blocks.
+ *
+ * If the caller needs the block to be from a specific relation, rlocator will
+ * be provided.
+ */
+BufferDesc *
+PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+								RelFileLocator *rlocator, bool skip_pinned,
+								XLogRecPtr *max_lsn)
+{
+	BufferDesc *bufdesc;
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+	BlockNumber blknum;
+	LWLock	   *content_lock;
+
+	if (!BufferIsValid(bufnum))
+		return NULL;
+
+	Assert(!BufferIsLocal(bufnum));
+
+	bufdesc = GetBufferDescriptor(bufnum - 1);
+
+	/* Block may need to be in a specific relation */
+	if (rlocator &&
+		!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufdesc->tag),
+							  *rlocator))
+		return NULL;
+
+	/* Must do this before taking the buffer header spinlock */
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	ReservePrivateRefCountEntry();
+
+	buf_state = LockBufHdr(bufdesc);
+
+	if (!(buf_state & BM_DIRTY) || !(buf_state & BM_VALID))
+		goto except_unlock_header;
+
+	/* We don't eagerly flush buffers used by others */
+	if (skip_pinned &&
+		(BUF_STATE_GET_REFCOUNT(buf_state) > 0 ||
+		 BUF_STATE_GET_USAGECOUNT(buf_state) > 1))
+		goto except_unlock_header;
+
+	/* Get page LSN while holding header lock */
+	lsn = BufferGetLSN(bufdesc);
+
+	PinBuffer_Locked(bufdesc);
+	CheckBufferIsPinnedOnce(bufnum);
+
+	blknum = BufferGetBlockNumber(bufnum);
+	Assert(BlockNumberIsValid(blknum));
+
+	/* If we'll have to flush WAL to flush the block, we're done */
+	if (buf_state & BM_PERMANENT && XLogNeedsFlush(lsn))
+		goto except_unpin_buffer;
+
+	/* We only include contiguous blocks in the run */
+	if (BlockNumberIsValid(require) && blknum != require)
+		goto except_unpin_buffer;
+
 	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+		goto except_unpin_buffer;
+
+	/*
+	 * Now that we have the content lock, we need to recheck if we need to
+	 * flush WAL.
+	 */
+	buf_state = LockBufHdr(bufdesc);
+	lsn = BufferGetLSN(bufdesc);
+	UnlockBufHdr(bufdesc, buf_state);
+
+	if (buf_state & BM_PERMANENT && XLogNeedsFlush(lsn))
+		goto except_unlock_content;
+
+	/* Try to start an I/O operation */
+	if (!StartBufferIO(bufdesc, false, true))
+		goto except_unlock_content;
+
+	if (lsn > *max_lsn)
+		*max_lsn = lsn;
+	buf_state = LockBufHdr(bufdesc);
+	buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, buf_state);
+
+	return bufdesc;
+
+except_unlock_content:
 	LWLockRelease(content_lock);
-	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-								  &bufdesc->tag);
+
+except_unpin_buffer:
+	UnpinBuffer(bufdesc);
+	return NULL;
+
+except_unlock_header:
+	UnlockBufHdr(bufdesc, buf_state);
+	return NULL;
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 12bb7e2312e..8716109221b 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -156,6 +156,31 @@ ClockSweepTick(void)
 	return victim;
 }
 
+/*
+ * Some BufferAccessStrategies support eager flushing -- which is flushing
+ * buffers in the ring before they are needed. This can lead to better I/O
+ * patterns than lazily flushing buffers immediately before reusing them.
+ */
+bool
+StrategySupportsEagerFlush(BufferAccessStrategy strategy)
+{
+	Assert(strategy);
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKWRITE:
+			return true;
+		case BAS_VACUUM:
+		case BAS_NORMAL:
+		case BAS_BULKREAD:
+			return false;
+		default:
+			elog(ERROR, "unrecognized buffer access strategy: %d",
+				 (int) strategy->btype);
+			return false;
+	}
+}
+
 /*
  * StrategyGetBuffer
  *
@@ -270,6 +295,29 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	}
 }
 
+/*
+ * Return the next buffer in the ring or InvalidBuffer if the current sweep is
+ * over.
+ */
+Buffer
+StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
+{
+	if (++(*sweep_cursor) >= strategy->nbuffers)
+		*sweep_cursor = 0;
+
+	return strategy->buffers[*sweep_cursor];
+}
+
+/*
+ * Return the starting buffer of a sweep of the strategy ring
+ */
+int
+StrategySweepStart(BufferAccessStrategy strategy)
+{
+	return strategy->current;
+}
+
+
 /*
  * StrategySyncStart -- tell BgBufferSync where to start syncing
  *
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b1b81f31419..03faf80e441 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -437,6 +437,10 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 
 /* freelist.c */
+extern bool StrategySupportsEagerFlush(BufferAccessStrategy strategy);
+extern Buffer StrategySweepNextBuffer(BufferAccessStrategy strategy,
+									  int *sweep_cursor);
+extern int	StrategySweepStart(BufferAccessStrategy strategy);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
-- 
2.43.0

v6-0005-Fix-XLogNeedsFlush-for-checkpointer.patch (text/x-patch; charset=US-ASCII)

From 9a3f592e5dc7933decad22747d0a4335429d2170 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 10:01:17 -0400
Subject: [PATCH v6 5/9] Fix XLogNeedsFlush() for checkpointer

In normal operation, XLogNeedsFlush() returns true if the flush ptr has
not been advanced past the provided LSN. During normal recovery on a
standby (not crash recovery), it returns true if the minimum recovery
point hasn't been advanced past the provided LSN.

However, during an end-of-recovery checkpoint, the checkpointer flushes
WAL, so XLogNeedsFlush() should compare the provided location with the
flush pointer.

Correct the logic in XLogNeedsFlush() to compare the LSN to the flush
pointer when WAL inserts are allowed and the minimum recovery point
otherwise.

This is not an active bug because no current users of XLogNeedsFlush()
temporarily allowed WAL inserts during recovery.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/flat/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g%40mail.gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_bcWRvRwZUop_d9vzF9nHAiT%2B-uPzkJ%3DS3ShZ1GqeAYOw%40mail.gmail.com
---
 src/backend/access/transam/xlog.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0baf0ac6160..62923d33b79 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3102,21 +3102,26 @@ XLogBackgroundFlush(void)
 }
 
 /*
- * Test whether XLOG data has been flushed up to (at least) the given position.
+ * Test whether XLOG data has been flushed up to (at least) the given position
+ * or whether the minimum recovery point is updated past the given position.
  *
- * Returns true if a flush is still needed.  (It may be that someone else
- * is already in process of flushing that far, however.)
+ * Returns true if a flush is still needed or if the minimum recovery point
+ * must be updated.
+ *
+ * It is possible that someone else is already in the process of flushing that
+ * far or updating the minimum recovery point that far.
  */
 bool
 XLogNeedsFlush(XLogRecPtr record)
 {
 	/*
-	 * During recovery, we don't flush WAL but update minRecoveryPoint
-	 * instead. So "needs flush" is taken to mean whether minRecoveryPoint
-	 * would need to be updated.
+	 * During recovery, when WAL inserts are forbidden, "needs flush" is taken
+	 * to mean whether minRecoveryPoint would need to be updated.
 	 */
-	if (RecoveryInProgress())
+	if (!XLogInsertAllowed())
 	{
+		Assert(RecoveryInProgress());
+
 		/*
 		 * An invalid minRecoveryPoint means that we need to recover all the
 		 * WAL, i.e., we're doing crash recovery.  We never modify the control
-- 
2.43.0

Attachment: v6-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch (text/x-patch; charset=US-ASCII)

From 6c46b33c7a51990f1d2df0fab7dfea2f88e0861e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 11:00:44 -0400
Subject: [PATCH v6 1/9] Refactor goto into for loop in GetVictimBuffer()

GetVictimBuffer() implemented a loop to optimistically lock a clean
victim buffer using a goto. Future commits will add batch flushing
functionality to GetVictimBuffer. The new logic works better with
standard for loop flow control.

This commit is only a refactor and does not introduce any new
functionality.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 200 ++++++++++++--------------
 src/backend/storage/buffer/freelist.c |  32 ++++-
 src/include/storage/buf_internals.h   |   5 +
 3 files changed, 124 insertions(+), 113 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fe470de63f2..f3668051574 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -68,10 +68,6 @@
 #include "utils/timestamp.h"
 
 
-/* Note: these two macros only work on shared buffers, not local ones! */
-#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
-#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
-
 /* Note: this macro only works on local buffers, not shared ones! */
 #define LocalBufHdrGetBlock(bufHdr) \
 	LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]
@@ -2344,130 +2340,122 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
-	/* we return here if a prospective victim buffer gets used concurrently */
-again:
-
-	/*
-	 * Select a victim buffer.  The buffer is returned with its header
-	 * spinlock still held!
-	 */
-	buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
-	buf = BufferDescriptorGetBuffer(buf_hdr);
-
-	Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
-
-	/* Pin the buffer and then release the buffer spinlock */
-	PinBuffer_Locked(buf_hdr);
-
-	/*
-	 * We shouldn't have any other pins for this buffer.
-	 */
-	CheckBufferIsPinnedOnce(buf);
-
-	/*
-	 * If the buffer was dirty, try to write it out.  There is a race
-	 * condition here, in that someone might dirty it after we released the
-	 * buffer header lock above, or even while we are writing it out (since
-	 * our share-lock won't prevent hint-bit updates).  We will recheck the
-	 * dirty bit after re-locking the buffer header.
-	 */
-	if (buf_state & BM_DIRTY)
+	/* Select a victim buffer using an optimistic locking scheme. */
+	for (;;)
 	{
-		LWLock	   *content_lock;
+		/*
+		 * Attempt to claim a victim buffer.  The buffer is returned with its
+		 * header spinlock still held!
+		 */
+		buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+		buf = BufferDescriptorGetBuffer(buf_hdr);
 
-		Assert(buf_state & BM_TAG_VALID);
-		Assert(buf_state & BM_VALID);
+		Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+
+		/* Pin the buffer and then release the buffer spinlock */
+		PinBuffer_Locked(buf_hdr);
 
 		/*
-		 * We need a share-lock on the buffer contents to write it out (else
-		 * we might write invalid data, eg because someone else is compacting
-		 * the page contents while we write).  We must use a conditional lock
-		 * acquisition here to avoid deadlock.  Even though the buffer was not
-		 * pinned (and therefore surely not locked) when StrategyGetBuffer
-		 * returned it, someone else could have pinned and exclusive-locked it
-		 * by the time we get here. If we try to get the lock unconditionally,
-		 * we'd block waiting for them; if they later block waiting for us,
-		 * deadlock ensues. (This has been observed to happen when two
-		 * backends are both trying to split btree index pages, and the second
-		 * one just happens to be trying to split the page the first one got
-		 * from StrategyGetBuffer.)
+		 * We shouldn't have any other pins for this buffer.
 		 */
-		content_lock = BufferDescriptorGetContentLock(buf_hdr);
-		if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
-		{
-			/*
-			 * Someone else has locked the buffer, so give it up and loop back
-			 * to get another one.
-			 */
-			UnpinBuffer(buf_hdr);
-			goto again;
-		}
+		CheckBufferIsPinnedOnce(buf);
 
 		/*
-		 * If using a nondefault strategy, and writing the buffer would
-		 * require a WAL flush, let the strategy decide whether to go ahead
-		 * and write/reuse the buffer or to choose another victim.  We need a
-		 * lock to inspect the page LSN, so this can't be done inside
-		 * StrategyGetBuffer.
+		 * If the buffer was dirty, try to write it out.  There is a race
+		 * condition here, in that someone might dirty it after we released
+		 * the buffer header lock above, or even while we are writing it out
+		 * (since our share-lock won't prevent hint-bit updates).  We will
+		 * recheck the dirty bit after re-locking the buffer header.
 		 */
-		if (strategy != NULL)
+		if (buf_state & BM_DIRTY)
 		{
-			XLogRecPtr	lsn;
+			LWLock	   *content_lock;
 
-			/* Read the LSN while holding buffer header lock */
-			buf_state = LockBufHdr(buf_hdr);
-			lsn = BufferGetLSN(buf_hdr);
-			UnlockBufHdr(buf_hdr, buf_state);
+			Assert(buf_state & BM_TAG_VALID);
+			Assert(buf_state & BM_VALID);
 
-			if (XLogNeedsFlush(lsn)
-				&& StrategyRejectBuffer(strategy, buf_hdr, from_ring))
+			/*
+			 * We need a share-lock on the buffer contents to write it out
+			 * (else we might write invalid data, eg because someone else is
+			 * compacting the page contents while we write).  We must use a
+			 * conditional lock acquisition here to avoid deadlock.  Even
+			 * though the buffer was not pinned (and therefore surely not
+			 * locked) when StrategyGetBuffer returned it, someone else could
+			 * have pinned and exclusive-locked it by the time we get here. If
+			 * we try to get the lock unconditionally, we'd block waiting for
+			 * them; if they later block waiting for us, deadlock ensues.
+			 * (This has been observed to happen when two backends are both
+			 * trying to split btree index pages, and the second one just
+			 * happens to be trying to split the page the first one got from
+			 * StrategyGetBuffer.)
+			 */
+			content_lock = BufferDescriptorGetContentLock(buf_hdr);
+			if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				/*
+				 * Someone else has locked the buffer, so give it up and loop
+				 * back to get another one.
+				 */
+				UnpinBuffer(buf_hdr);
+				continue;
+			}
+
+			/*
+			 * If using a nondefault strategy, and writing the buffer would
+			 * require a WAL flush, let the strategy decide whether to go
+			 * ahead and write/reuse the buffer or to choose another victim.
+			 * We need the content lock to inspect the page LSN, so this can't
+			 * be done inside StrategyGetBuffer.
+			 */
+			if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
 			{
 				LWLockRelease(content_lock);
 				UnpinBuffer(buf_hdr);
-				goto again;
+				continue;
 			}
-		}
 
-		/* OK, do the I/O */
-		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-		LWLockRelease(content_lock);
+			/* OK, do the I/O */
+			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
+			LWLockRelease(content_lock);
 
-		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-									  &buf_hdr->tag);
-	}
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &buf_hdr->tag);
+		}
 
+		if (buf_state & BM_VALID)
+		{
+			/*
+			 * When a BufferAccessStrategy is in use, blocks evicted from
+			 * shared buffers are counted as IOOP_EVICT in the corresponding
+			 * context (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted
+			 * by a strategy in two cases: 1) while initially claiming buffers
+			 * for the strategy ring 2) to replace an existing strategy ring
+			 * buffer because it is pinned or in use and cannot be reused.
+			 *
+			 * Blocks evicted from buffers already in the strategy ring are
+			 * counted as IOOP_REUSE in the corresponding strategy context.
+			 *
+			 * At this point, we can accurately count evictions and reuses,
+			 * because we have successfully claimed the valid buffer.
+			 * Previously, we may have been forced to release the buffer due
+			 * to concurrent pinners or erroring out.
+			 */
+			pgstat_count_io_op(IOOBJECT_RELATION, io_context,
+							   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
+		}
 
-	if (buf_state & BM_VALID)
-	{
 		/*
-		 * When a BufferAccessStrategy is in use, blocks evicted from shared
-		 * buffers are counted as IOOP_EVICT in the corresponding context
-		 * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
-		 * strategy in two cases: 1) while initially claiming buffers for the
-		 * strategy ring 2) to replace an existing strategy ring buffer
-		 * because it is pinned or in use and cannot be reused.
-		 *
-		 * Blocks evicted from buffers already in the strategy ring are
-		 * counted as IOOP_REUSE in the corresponding strategy context.
-		 *
-		 * At this point, we can accurately count evictions and reuses,
-		 * because we have successfully claimed the valid buffer. Previously,
-		 * we may have been forced to release the buffer due to concurrent
-		 * pinners or erroring out.
+		 * If the buffer has an entry in the buffer mapping table, delete it.
+		 * This can fail because another backend could have pinned or dirtied
+		 * the buffer. Then loop around and try again.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, io_context,
-						   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
-	}
+		if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
+		{
+			UnpinBuffer(buf_hdr);
+			continue;
+		}
 
-	/*
-	 * If the buffer has an entry in the buffer mapping table, delete it. This
-	 * can fail because another backend could have pinned or dirtied the
-	 * buffer.
-	 */
-	if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
-	{
-		UnpinBuffer(buf_hdr);
-		goto again;
+		break;
 	}
 
 	/* a final set of sanity checks */
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7d59a92bd1a..12bb7e2312e 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "pgstat.h"
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
@@ -716,12 +717,21 @@ IOContextForStrategy(BufferAccessStrategy strategy)
  * be written out and doing so would require flushing WAL too.  This gives us
  * a chance to choose a different victim.
  *
+ * The buffer must be pinned and content locked and the buffer header spinlock
+ * must not be held. We must hold the content lock to examine the LSN.
+ *
  * Returns true if buffer manager should ask for a new victim, and false
- * if this buffer should be written and re-used.
+ * if this buffer should be written and reused.
  */
 bool
 StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+
+	if (!strategy)
+		return false;
+
 	/* We only do this in bulkread mode */
 	if (strategy->btype != BAS_BULKREAD)
 		return false;
@@ -731,11 +741,19 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
 		return false;
 
-	/*
-	 * Remove the dirty buffer from the ring; necessary to prevent infinite
-	 * loop if all ring members are dirty.
-	 */
-	strategy->buffers[strategy->current] = InvalidBuffer;
+	buf_state = LockBufHdr(buf);
+	lsn = BufferGetLSN(buf);
+	UnlockBufHdr(buf, buf_state);
+
+	if (XLogNeedsFlush(lsn))
+	{
+		/*
+		 * Remove the dirty buffer from the ring; necessary to prevent an
+		 * infinite loop if all ring members are dirty.
+		 */
+		strategy->buffers[strategy->current] = InvalidBuffer;
+		return true;
+	}
 
-	return true;
+	return false;
 }
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index dfd614f7ca4..b1b81f31419 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -419,6 +419,11 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 /*
  * Internal buffer management routines
  */
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
+
 /* bufmgr.c */
 extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
-- 
2.43.0
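
For readers skimming the diff, the control flow after this refactor boils down to "retry with continue, leave with break". Below is a minimal sketch of that shape, not the bufmgr.c code: CandidateSketch and claim_victim_sketch() are hypothetical, the three flags stand in for the dirty bit, a failed conditional content lock, and a failed InvalidateVictimBuffer(), and the "ran out of candidates" exit exists only so the sketch terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a candidate buffer; not a PostgreSQL type. */
typedef struct CandidateSketch
{
	bool		dirty;			/* needs writing before reuse */
	bool		content_locked; /* conditional lock acquisition would fail */
	bool		repinned;		/* invalidation would fail */
} CandidateSketch;

/* Returns the index of the claimed victim, or -1 if none could be claimed. */
static int
claim_victim_sketch(CandidateSketch *cands, size_t ncands, int *nwrites)
{
	size_t		next = 0;
	int			victim = -1;

	assert(nwrites != NULL);

	for (;;)
	{
		CandidateSketch *c;

		if (next >= ncands)
			break;				/* sketch-only exit */
		c = &cands[next++];

		if (c->dirty)
		{
			if (c->content_locked)
				continue;		/* was "goto again" */
			(*nwrites)++;		/* stands in for FlushBuffer() */
			c->dirty = false;
		}

		if (c->repinned)
			continue;			/* was "goto again" */

		victim = (int) (next - 1);
		break;					/* successfully claimed */
	}
	return victim;
}
```

Each failure path releases what it holds and restarts the loop body, which is what makes it straightforward to add more work (such as batch flushing) inside the loop later.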

Attachment: v6-0004-Write-combining-for-BAS_BULKWRITE.patch (text/x-patch; charset=US-ASCII)

From 78767ad0f17d2b74ad2aafa10ecd828d7e53bf0e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 12:56:38 -0400
Subject: [PATCH v6 4/9] Write combining for BAS_BULKWRITE

Implement write combining for users of the bulkwrite buffer access
strategy (e.g. COPY FROM). When the buffer access strategy needs to
clean a buffer for reuse, it already opportunistically flushes some
other buffers. Now, combine any contiguous blocks from the same relation
into larger writes and issue them with smgrwritev().

The performance benefit for COPY FROM is mostly noticeable for multiple
concurrent COPY FROMs because a single COPY FROM is either CPU bound or
bound by WAL writes.

The infrastructure for flushing larger batches of IOs will be reused by
checkpointer and other processes doing writes of dirty data.

XXX: Because this sets in-place checksums for batches, it is not
committable until additional infrastructure goes in place.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_bcWRvRwZUop_d9vzF9nHAiT%2B-uPzkJ%3DS3ShZ1GqeAYOw%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 218 ++++++++++++++++++++++++--
 src/backend/storage/buffer/freelist.c |  26 +++
 src/backend/storage/page/bufpage.c    |  20 +++
 src/backend/utils/probes.d            |   2 +
 src/include/storage/buf_internals.h   |  32 ++++
 src/include/storage/bufpage.h         |   2 +
 src/tools/pgindent/typedefs.list      |   1 +
 7 files changed, 290 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c64268f31ae..4cc73bf4363 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -542,6 +542,10 @@ static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber re
 												   RelFileLocator *rlocator,
 												   bool skip_pinned,
 												   XLogRecPtr *max_lsn);
+static void FindFlushAdjacents(BufferAccessStrategy strategy, Buffer sweep_end,
+							   BufferDesc *batch_start,
+							   uint32 max_batch_size, BufWriteBatch *batch,
+							   int *sweep_cursor);
 static void CleanVictimBuffer(BufferAccessStrategy strategy,
 							  BufferDesc *bufdesc, uint32 *buf_state,
 							  bool from_ring, IOContext io_context);
@@ -4263,10 +4267,91 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+
+/*
+ * Given a buffer descriptor, batch_start, from a strategy ring, strategy, that
+ * supports eager flushing, find additional buffers from the ring that can be
+ * combined into a single write batch with this buffer.
+ *
+ * max_batch_size is the maximum number of blocks that can be combined into a
+ * single write in general. This function, based on the block number of
+ * batch_start, will determine the maximum IO size for this particular write
+ * given how much of the file remains. max_batch_size is provided by the caller
+ * so it doesn't have to be recalculated for each write.
+ *
+ * batch is an output parameter that this function will fill with the needed
+ * information to write this IO.
+ *
+ * This function will pin and content lock all of the buffers that it
+ * assembles for the IO batch. The caller is responsible for issuing the IO.
+ */
+static void
+FindFlushAdjacents(BufferAccessStrategy strategy, Buffer sweep_end,
+				   BufferDesc *batch_start,
+				   uint32 max_batch_size,
+				   BufWriteBatch *batch,
+				   int *sweep_cursor)
+{
+	BlockNumber limit;
+	uint32		buf_state;
+
+	Assert(batch_start);
+	batch->bufdescs[0] = batch_start;
+
+	buf_state = LockBufHdr(batch_start);
+	batch->max_lsn = BufferGetLSN(batch_start);
+	UnlockBufHdr(batch_start, buf_state);
+
+	batch->start = batch->bufdescs[0]->tag.blockNum;
+	Assert(BlockNumberIsValid(batch->start));
+	batch->n = 1;
+	batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+	batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
+	batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+	limit = smgrmaxcombine(batch->reln, batch->forkno, batch->start);
+	limit = Min(max_batch_size, limit);
+	limit = Min(GetAdditionalPinLimit(), limit);
+
+	/*
+	 * It's possible we're not allowed any more pins or there aren't more
+	 * blocks in the target relation. In this case, just return. Our batch
+	 * will have only one buffer.
+	 */
+	if (limit <= 0)
+		return;
+
+	/* Now assemble a run of blocks to write out. */
+	for (; batch->n < limit; batch->n++)
+	{
+		Buffer		bufnum;
+
+		if ((bufnum =
+			 StrategySweepNextBuffer(strategy, sweep_cursor)) == sweep_end)
+			break;
+
+		/*
+		 * For BAS_BULKWRITE, once you hit an InvalidBuffer, the remaining
+		 * buffers in the ring will be invalid.
+		 */
+		if (!BufferIsValid(bufnum))
+			break;
+
+		/* Stop when we encounter a buffer that will break the run */
+		if ((batch->bufdescs[batch->n] =
+			 PrepareOrRejectEagerFlushBuffer(bufnum,
+											 batch->start + batch->n,
+											 &batch->rlocator,
+											 true,
+											 &batch->max_lsn)) == NULL)
+			break;
+	}
+}
+
 /*
  * Returns the buffer descriptor of the buffer containing the next block we
  * should eagerly flush or NULL when there are no further buffers to consider
- * writing out.
+ * writing out. This will be the start of a new batch of buffers to write out.
  */
 static BufferDesc *
 NextStratBufToFlush(BufferAccessStrategy strategy,
@@ -4314,7 +4399,6 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
-	bool		first_buffer = true;
 
 	Assert(*buf_state & BM_DIRTY);
 
@@ -4326,19 +4410,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 	{
 		Buffer		sweep_end = BufferDescriptorGetBuffer(bufdesc);
 		int			cursor = StrategySweepStart(strategy);
+		uint32		max_batch_size = StrategyMaxWriteBatchSize(strategy);
+
+		/* Pin our victim again so it stays ours even after batch released */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+		IncrBufferRefCount(BufferDescriptorGetBuffer(bufdesc));
 
 		/* Clean victim buffer and find more to flush opportunistically */
 		do
 		{
-			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
-			content_lock = BufferDescriptorGetContentLock(bufdesc);
-			LWLockRelease(content_lock);
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &bufdesc->tag);
-			/* We leave the first buffer pinned for the caller */
-			if (!first_buffer)
-				UnpinBuffer(bufdesc);
-			first_buffer = false;
+			BufWriteBatch batch;
+
+			FindFlushAdjacents(strategy, sweep_end, bufdesc, max_batch_size,
+							   &batch, &cursor);
+			FlushBufferBatch(&batch, io_context);
+			CompleteWriteBatchIO(&batch, io_context, &BackendWritebackContext);
 		} while ((bufdesc = NextStratBufToFlush(strategy, sweep_end,
 												&max_lsn, &cursor)) != NULL);
 	}
@@ -4461,6 +4548,73 @@ except_unlock_header:
 	return NULL;
 }
 
+/*
+ * Given a prepared batch of buffers, write them out as a vector.
+ */
+void
+FlushBufferBatch(BufWriteBatch *batch,
+				 IOContext io_context)
+{
+	BlockNumber blknums[MAX_IO_COMBINE_LIMIT];
+	Block		blocks[MAX_IO_COMBINE_LIMIT];
+	instr_time	io_start;
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+
+	if (!XLogRecPtrIsInvalid(batch->max_lsn))
+		XLogFlush(batch->max_lsn);
+
+	if (batch->reln == NULL)
+		batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+#ifdef USE_ASSERT_CHECKING
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		BufferDesc *bufdesc = batch->bufdescs[i];
+		uint32		buf_state = LockBufHdr(bufdesc);
+		XLogRecPtr	lsn = BufferGetLSN(bufdesc);
+
+		UnlockBufHdr(bufdesc, buf_state);
+		Assert(!(buf_state & BM_PERMANENT) || !XLogNeedsFlush(lsn));
+	}
+#endif
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_START(batch->forkno,
+											  batch->reln->smgr_rlocator.locator.spcOid,
+											  batch->reln->smgr_rlocator.locator.dbOid,
+											  batch->reln->smgr_rlocator.locator.relNumber,
+											  batch->reln->smgr_rlocator.backend,
+											  batch->n);
+
+	/*
+	 * XXX: All blocks should be copied and then checksummed but doing so
+	 * takes a lot of extra memory and a future patch will eliminate this
+	 * requirement.
+	 */
+	for (BlockNumber i = 0; i < batch->n; i++)
+	{
+		blknums[i] = batch->start + i;
+		blocks[i] = BufHdrGetBlock(batch->bufdescs[i]);
+	}
+
+	PageSetBatchChecksumInplace((Page *) blocks, blknums, batch->n);
+
+	io_start = pgstat_prepare_io_time(track_io_timing);
+
+	smgrwritev(batch->reln, batch->forkno,
+			   batch->start, (const void **) blocks, batch->n, false);
+
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context, IOOP_WRITE,
+							io_start, batch->n, BLCKSZ);
+
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Prepare the buffer with budesc for writing. buf_state and lsn are output
  * parameters. Returns true if the buffer acutally needs writing and false
@@ -4606,6 +4760,48 @@ DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	error_context_stack = errcallback.previous;
 }
 
+/*
+ * Given a previously initialized batch with buffers that have already been
+ * flushed, terminate the IO on each buffer and then unlock and unpin them.
+ * This assumes all the buffers were locked and pinned. wb_context will be
+ * modified.
+ */
+void
+CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+					 WritebackContext *wb_context)
+{
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+	pgBufferUsage.shared_blks_written += batch->n;
+
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		Buffer		buffer = BufferDescriptorGetBuffer(batch->bufdescs[i]);
+
+		errcallback.arg = batch->bufdescs[i];
+
+		/* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state. */
+		TerminateBufferIO(batch->bufdescs[i], true, 0, true, false);
+		LWLockRelease(BufferDescriptorGetContentLock(batch->bufdescs[i]));
+		ReleaseBuffer(buffer);
+		ScheduleBufferTagForWriteback(wb_context, io_context,
+									  &batch->bufdescs[i]->tag);
+	}
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_DONE(batch->forkno,
+											 batch->reln->smgr_rlocator.locator.spcOid,
+											 batch->reln->smgr_rlocator.locator.dbOid,
+											 batch->reln->smgr_rlocator.locator.relNumber,
+											 batch->reln->smgr_rlocator.backend,
+											 batch->n, batch->start);
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * RelationGetNumberOfBlocksInFork
  *		Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 8716109221b..317f41cdfa2 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -712,6 +712,32 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	return NULL;
 }
 
+
+/*
+ * Determine the largest IO we can assemble from the given strategy ring given
+ * strategy-specific as well as global constraints on the number of pinned
+ * buffers and max IO size.
+ */
+uint32
+StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
+{
+	uint32		max_possible_buffer_limit;
+	uint32		max_write_batch_size;
+	int			strategy_pin_limit;
+
+	max_write_batch_size = io_combine_limit;
+
+	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+	max_possible_buffer_limit = GetPinLimit();
+
+	max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+	max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+	max_write_batch_size = Max(1, max_write_batch_size);
+	max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
+	Assert(max_write_batch_size <= MAX_IO_COMBINE_LIMIT);
+	return max_write_batch_size;
+}
+
 /*
  * AddBufferToRing -- add a buffer to the buffer ring
  *
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index dbb49ed9197..12503934502 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
 	((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
 }
+
+/*
+ * A helper to set multiple blocks' checksums
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, const BlockNumber *blknos, uint32 length)
+{
+	/* If we don't need a checksum, just return */
+	if (!DataChecksumsEnabled())
+		return;
+
+	for (uint32 i = 0; i < length; i++)
+	{
+		Page		page = pages[i];
+
+		if (PageIsNew(page))
+			continue;
+		((PageHeader) page)->pd_checksum = pg_checksum_page(page, blknos[i]);
+	}
+}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index e9e413477ba..36dd4f8375b 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -61,6 +61,8 @@ provider postgresql {
 	probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
 	probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
 	probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
+	probe buffer__batch__flush__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
+	probe buffer__batch__flush__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
 
 	probe buffer__checkpoint__start(int);
 	probe buffer__checkpoint__sync__start();
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 03faf80e441..7c60e5e6f54 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -416,6 +416,34 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 	ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
 }
 
+/*
+ * Used to write out multiple blocks at a time in a combined IO. bufdescs
+ * contains buffer descriptors for buffers containing adjacent blocks of the
+ * same fork of the same relation.
+ */
+typedef struct BufWriteBatch
+{
+	RelFileLocator rlocator;
+	ForkNumber	forkno;
+	SMgrRelation reln;
+
+	/*
+	 * The BlockNumber of the first block in the run of contiguous blocks to
+	 * be written out as a single IO.
+	 */
+	BlockNumber start;
+
+	/*
+	 * While assembling the buffers, we keep track of the maximum LSN so that
+	 * we can flush WAL through this LSN before flushing the buffers.
+	 */
+	XLogRecPtr	max_lsn;
+
+	/* The number of valid buffers in bufdescs */
+	uint32		n;
+	BufferDesc *bufdescs[MAX_IO_COMBINE_LIMIT];
+} BufWriteBatch;
+
 /*
  * Internal buffer management routines
  */
@@ -429,6 +457,7 @@ extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
 										  IOContext io_context, BufferTag *tag);
+extern void FlushBufferBatch(BufWriteBatch *batch, IOContext io_context);
 
 /* solely to make it easier to write tests */
 extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
@@ -438,9 +467,12 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 /* freelist.c */
 extern bool StrategySupportsEagerFlush(BufferAccessStrategy strategy);
+extern uint32 StrategyMaxWriteBatchSize(BufferAccessStrategy strategy);
 extern Buffer StrategySweepNextBuffer(BufferAccessStrategy strategy,
 									  int *sweep_cursor);
 extern int	StrategySweepStart(BufferAccessStrategy strategy);
+extern void CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+								 WritebackContext *wb_context);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index aeb67c498c5..bb4e6af461a 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -507,5 +507,7 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 									Item newtup, Size newsize);
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern void PageSetBatchChecksumInplace(Page *pages, const BlockNumber *blknos,
+										uint32 length);
 
 #endif							/* BUFPAGE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a13e8162890..9492adeee58 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -349,6 +349,7 @@ BufferManagerRelation
 BufferStrategyControl
 BufferTag
 BufferUsage
+BufWriteBatch
 BuildAccumulator
 BuiltinScript
 BulkInsertState
-- 
2.43.0
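
The batching idea in this patch can be illustrated without any buffer manager machinery. The following is a rough sketch, not the patch's code: RunSketch, max_batch_size_sketch() and assemble_run_sketch() are hypothetical names, block numbers stand in for pinned and locked buffers, and the clamp mirrors only the Min/Max arithmetic visible in StrategyMaxWriteBatchSize().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one combined write. */
typedef struct RunSketch
{
	uint32_t	start;			/* first block number in the run */
	uint32_t	n;				/* number of contiguous blocks */
} RunSketch;

static uint32_t
min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Clamp the batch size by the IO size limit and both pin limits, minimum 1. */
static uint32_t
max_batch_size_sketch(uint32_t io_combine_limit,
					  uint32_t strategy_pin_limit,
					  uint32_t global_pin_limit)
{
	uint32_t	limit = min_u32(io_combine_limit,
								min_u32(strategy_pin_limit, global_pin_limit));

	return limit < 1 ? 1 : limit;
}

/*
 * Starting at blocks[first], extend the run while the next entry is the
 * adjacent block and the batch limit is not reached. Returns the index of
 * the first entry that was not included.
 */
static size_t
assemble_run_sketch(const uint32_t *blocks, size_t nblocks, size_t first,
					uint32_t max_batch, RunSketch *run)
{
	size_t		i = first + 1;

	assert(first < nblocks && max_batch >= 1);
	run->start = blocks[first];
	run->n = 1;

	while (i < nblocks && run->n < max_batch &&
		   blocks[i] == run->start + run->n)
	{
		run->n++;
		i++;
	}
	return i;
}
```

As in FindFlushAdjacents(), the run ends at the first entry that is not the expected next block, so a gap produces a batch of one rather than a search for later matches.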

Attachment: v6-0006-Add-database-Oid-to-CkptSortItem.patch (text/x-patch; charset=US-ASCII)

From 9fa36f0a1f489809d7d50798e07d491f4d806ecc Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:22:11 -0400
Subject: [PATCH v6 6/9] Add database Oid to CkptSortItem

This is useful for checkpointer write combining -- which will be added
in a future commit.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 8 ++++++++
 src/include/storage/buf_internals.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4cc73bf4363..2ea91e777e2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3377,6 +3377,7 @@ BufferSync(int flags)
 			item = &CkptBufferIds[num_to_scan++];
 			item->buf_id = buf_id;
 			item->tsId = bufHdr->tag.spcOid;
+			item->dbId = bufHdr->tag.dbOid;
 			item->relNumber = BufTagGetRelNumber(&bufHdr->tag);
 			item->forkNum = BufTagGetForkNum(&bufHdr->tag);
 			item->blockNum = bufHdr->tag.blockNum;
@@ -6732,6 +6733,13 @@ ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b)
 		return -1;
 	else if (a->tsId > b->tsId)
 		return 1;
+
+	/* compare database */
+	if (a->dbId < b->dbId)
+		return -1;
+	else if (a->dbId > b->dbId)
+		return 1;
+
 	/* compare relation */
 	if (a->relNumber < b->relNumber)
 		return -1;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 7c60e5e6f54..03c395903e5 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -382,6 +382,7 @@ UnlockBufHdr(BufferDesc *desc, uint32 buf_state)
 typedef struct CkptSortItem
 {
 	Oid			tsId;
+	Oid			dbId;
 	RelFileNumber relNumber;
 	ForkNumber	forkNum;
 	BlockNumber blockNum;
-- 
2.43.0
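
A rough sketch of why the database Oid belongs in the sort key: relation file numbers are only unique within a tablespace and database, so a comparator that skips the database could interleave blocks of two unrelated files that happen to share a relNumber. The struct and function names below (SortItem, sort_item_cmp, sort_items) are hypothetical stand-ins for illustration, not the real CkptSortItem or ckpt_buforder_comparator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for CkptSortItem (not the real struct). */
typedef struct SortItem
{
	uint32_t	ts_id;			/* tablespace */
	uint32_t	db_id;			/* database */
	uint32_t	rel_number;		/* relation file number */
	int			fork_num;
	uint32_t	block_num;
} SortItem;

/* Three-way comparison of two unsigned keys: -1, 0, or 1. */
static int
cmp_u32(uint32_t a, uint32_t b)
{
	return (a > b) - (a < b);
}

/* qsort comparator: tablespace, database, relation, fork, then block. */
static int
sort_item_cmp(const void *pa, const void *pb)
{
	const SortItem *a = pa;
	const SortItem *b = pb;
	int			c;

	if ((c = cmp_u32(a->ts_id, b->ts_id)) != 0)
		return c;
	if ((c = cmp_u32(a->db_id, b->db_id)) != 0)
		return c;
	if ((c = cmp_u32(a->rel_number, b->rel_number)) != 0)
		return c;
	if (a->fork_num != b->fork_num)
		return (a->fork_num > b->fork_num) - (a->fork_num < b->fork_num);
	return cmp_u32(a->block_num, b->block_num);
}

/* Sort items so that blocks of the same file end up adjacent and ascending. */
static void
sort_items(SortItem *items, size_t n)
{
	assert(items != NULL || n == 0);
	qsort(items, n, sizeof(SortItem), sort_item_cmp);
}
```

With two databases that both contain relNumber 100, sorting with this comparator keeps each database's blocks together, which is what a later pass looking for contiguous runs would rely on.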

v6-0007-Implement-checkpointer-data-write-combining.patch (text/x-patch; charset=US-ASCII)

From 898db59ca9a02f8eb5481caa66dca6cd02f30082 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:42:29 -0400
Subject: [PATCH v6 7/9] Implement checkpointer data write combining

When the checkpointer writes out dirty buffers, writing multiple
contiguous blocks as a single IO is a substantial performance
improvement. The checkpointer is usually bottlenecked on IO, so issuing
larger IOs leads to increased write throughput and faster checkpoints.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 236 +++++++++++++++++++++++++---
 src/backend/utils/probes.d          |   2 +-
 2 files changed, 211 insertions(+), 27 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 2ea91e777e2..2475b1c85be 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -512,6 +512,7 @@ static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
+static uint32 CheckpointerMaxBatchSize(void);
 static void BufferSync(int flags);
 static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
@@ -3319,7 +3320,6 @@ UnpinBufferNoOwner(BufferDesc *buf)
 static void
 BufferSync(int flags)
 {
-	uint32		buf_state;
 	int			buf_id;
 	int			num_to_scan;
 	int			num_spaces;
@@ -3331,6 +3331,8 @@ BufferSync(int flags)
 	int			i;
 	int			mask = BM_DIRTY;
 	WritebackContext wb_context;
+	uint32		max_batch_size;
+	BufWriteBatch batch;
 
 	/*
 	 * Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3361,6 +3363,7 @@ BufferSync(int flags)
 	for (buf_id = 0; buf_id < NBuffers; buf_id++)
 	{
 		BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+		uint32		buf_state;
 
 		/*
 		 * Header spinlock is enough to examine BM_DIRTY, see comment in
@@ -3501,48 +3504,212 @@ BufferSync(int flags)
 	 */
 	num_processed = 0;
 	num_written = 0;
+	max_batch_size = CheckpointerMaxBatchSize();
 	while (!binaryheap_empty(ts_heap))
 	{
+		BlockNumber limit = max_batch_size;
 		BufferDesc *bufHdr = NULL;
 		CkptTsStatus *ts_stat = (CkptTsStatus *)
 			DatumGetPointer(binaryheap_first(ts_heap));
+		int			ts_end = ts_stat->index - ts_stat->num_scanned + ts_stat->num_to_scan;
+		int			processed = 0;
 
-		buf_id = CkptBufferIds[ts_stat->index].buf_id;
-		Assert(buf_id != -1);
+		batch.start = InvalidBlockNumber;
+		batch.max_lsn = InvalidXLogRecPtr;
+		batch.n = 0;
 
-		bufHdr = GetBufferDescriptor(buf_id);
+		while (batch.n < limit)
+		{
+			uint32		buf_state;
+			XLogRecPtr	lsn = InvalidXLogRecPtr;
+			LWLock	   *content_lock;
+			CkptSortItem item;
 
-		num_processed++;
+			if (ProcSignalBarrierPending)
+				ProcessProcSignalBarrier();
 
-		/*
-		 * We don't need to acquire the lock here, because we're only looking
-		 * at a single bit. It's possible that someone else writes the buffer
-		 * and clears the flag right after we check, but that doesn't matter
-		 * since SyncOneBuffer will then do nothing.  However, there is a
-		 * further race condition: it's conceivable that between the time we
-		 * examine the bit here and the time SyncOneBuffer acquires the lock,
-		 * someone else not only wrote the buffer but replaced it with another
-		 * page and dirtied it.  In that improbable case, SyncOneBuffer will
-		 * write the buffer though we didn't need to.  It doesn't seem worth
-		 * guarding against this, though.
-		 */
-		if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
-		{
-			if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+			/* Check if we are done with this tablespace */
+			if (ts_stat->index + processed >= ts_end)
+				break;
+
+			item = CkptBufferIds[ts_stat->index + processed];
+
+			buf_id = item.buf_id;
+			Assert(buf_id != -1);
+
+			bufHdr = GetBufferDescriptor(buf_id);
+
+			/*
+			 * If this is the first block of the batch, then check if we need
+			 * to open a new relation. Open the relation now because we have
+			 * to determine the maximum IO size based on how many blocks
+			 * remain in the file.
+			 */
+			if (!BlockNumberIsValid(batch.start))
+			{
+				Assert(batch.max_lsn == InvalidXLogRecPtr && batch.n == 0);
+				batch.rlocator.spcOid = item.tsId;
+				batch.rlocator.dbOid = item.dbId;
+				batch.rlocator.relNumber = item.relNumber;
+				batch.forkno = item.forkNum;
+				batch.start = item.blockNum;
+				batch.reln = smgropen(batch.rlocator, INVALID_PROC_NUMBER);
+				limit = smgrmaxcombine(batch.reln, batch.forkno, batch.start);
+				limit = Min(max_batch_size, limit);
+				limit = Min(GetAdditionalPinLimit(), limit);
+
+				/*
+				 * If we aren't allowed any more pins or there are no more
+				 * blocks in the relation, break out of the loop and issue the
+				 * IO.
+				 */
+				if (limit <= 0)
+					break;
+			}
+
+			/*
+			 * Once we hit blocks from the next relation or fork of the
+			 * relation, break out of the loop and issue the IO we've built up
+			 * so far. It is important that we don't increment processed
+			 * because we want to start the next IO with this item.
+			 */
+			if (item.dbId != batch.rlocator.dbOid)
+				break;
+
+			if (item.relNumber != batch.rlocator.relNumber)
+				break;
+
+			if (item.forkNum != batch.forkno)
+				break;
+
+			/*
+			 * If the next block is not contiguous, we can't include it in the
+			 * IO we will issue. Break out of the loop and issue what we have
+			 * so far. Do not count this item as processed -- otherwise we
+			 * will end up skipping it.
+			 */
+			if (item.blockNum != batch.start + batch.n)
+				break;
+
+			/*
+			 * We don't need to acquire the lock here, because we're only
+			 * looking at a single bit. It's possible that someone else writes
+			 * the buffer and clears the flag right after we check, but that
+			 * doesn't matter since StartBufferIO will then return false. If
+			 * the buffer doesn't need checkpointing, don't include it in the
+			 * batch we are building. We're done with the item, so count it as
+			 * processed and break out of the loop to issue the IO we have
+			 * built so far.
+			 */
+			if (!(pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED))
+			{
+				processed++;
+				break;
+			}
+
+			ReservePrivateRefCountEntry();
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+
+			buf_state = LockBufHdr(bufHdr);
+
+			/*
+			 * If the buffer doesn't need eviction, we're done with the item,
+			 * so count it as processed and break out of the loop to issue the
+			 * IO so far.
+			 */
+			if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
+			{
+				processed++;
+				UnlockBufHdr(bufHdr, buf_state);
+				break;
+			}
+
+			PinBuffer_Locked(bufHdr);
+
+			/*
+			 * There is a race condition here: it's conceivable that between
+			 * the time we examine the buffer header for BM_CHECKPOINT_NEEDED
+			 * above and the time we acquire the lock here, someone else
+			 * not only wrote the buffer but replaced it with another page and
+			 * dirtied it.  In that improbable case, we will write the buffer
+			 * though we didn't need to.  It doesn't seem worth guarding
+			 * against this, though.
+			 */
+			content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+			/*
+			 * We are willing to wait for the content lock on the first IO in
+			 * the batch. However, for subsequent IOs, waiting could lead to
+			 * deadlock. We have to eventually flush all eligible buffers,
+			 * though. So, if we fail to acquire the lock on a subsequent
+			 * buffer, we break out and issue the IO we've built up so far.
+			 * Then we come back and start a new IO with that buffer as the
+			 * starting buffer. As such, we must not count the item as
+			 * processed if we end up failing to acquire the content lock.
+			 */
+			if (batch.n == 0)
+				LWLockAcquire(content_lock, LW_SHARED);
+			else if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
 			{
-				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-				PendingCheckpointerStats.buffers_written++;
-				num_written++;
+				UnpinBuffer(bufHdr);
+				break;
+			}
+
+			/*
+			 * If the buffer doesn't need IO, count the item as processed,
+			 * release the buffer, and break out of the loop to issue the IO
+			 * we have built up so far.
+			 */
+			if (!StartBufferIO(bufHdr, false, true))
+			{
+				processed++;
+				LWLockRelease(content_lock);
+				UnpinBuffer(bufHdr);
+				break;
 			}
+
+			buf_state = LockBufHdr(bufHdr);
+			lsn = BufferGetLSN(bufHdr);
+			buf_state &= ~BM_JUST_DIRTIED;
+			UnlockBufHdr(bufHdr, buf_state);
+
+			/*
+			 * Keep track of the max LSN so that we can be sure to flush
+			 * enough WAL before flushing data from the buffers. See comment
+			 * in DoFlushBuffer() for more on why we don't consider the LSNs
+			 * of unlogged relations.
+			 */
+			if (buf_state & BM_PERMANENT && lsn > batch.max_lsn)
+				batch.max_lsn = lsn;
+
+			batch.bufdescs[batch.n++] = bufHdr;
+			processed++;
 		}
 
 		/*
 		 * Measure progress independent of actually having to flush the buffer
 		 * - otherwise writing become unbalanced.
 		 */
-		ts_stat->progress += ts_stat->progress_slice;
-		ts_stat->num_scanned++;
-		ts_stat->index++;
+		num_processed += processed;
+		ts_stat->progress += ts_stat->progress_slice * processed;
+		ts_stat->num_scanned += processed;
+		ts_stat->index += processed;
+
+		/*
+		 * If we built up an IO, issue it. There's a chance we didn't find any
+		 * items referencing buffers that needed flushing this time, but we
+		 * still want to check if we should update the heap if we examined and
+		 * processed the items.
+		 */
+		if (batch.n > 0)
+		{
+			FlushBufferBatch(&batch, IOCONTEXT_NORMAL);
+			CompleteWriteBatchIO(&batch, IOCONTEXT_NORMAL, &wb_context);
+
+			TRACE_POSTGRESQL_BUFFER_BATCH_SYNC_WRITTEN(batch.n);
+			PendingCheckpointerStats.buffers_written += batch.n;
+			num_written += batch.n;
+		}
 
 		/* Have all the buffers from the tablespace been processed? */
 		if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -4268,6 +4435,23 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * The maximum number of blocks that can be written out in a single batch by
+ * the checkpointer.
+ */
+static uint32
+CheckpointerMaxBatchSize(void)
+{
+	uint32		result;
+	uint32		pin_limit = GetPinLimit();
+
+	result = Max(pin_limit, 1);
+	result = Min(pin_limit, io_combine_limit);
+	result = Max(result, 1);
+	Assert(result < MAX_IO_COMBINE_LIMIT);
+	return result;
+}
+
 
 /*
  * Given a buffer descriptor, start, from a strategy ring, strategy, that
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 36dd4f8375b..d6970731ba9 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -68,7 +68,7 @@ provider postgresql {
 	probe buffer__checkpoint__sync__start();
 	probe buffer__checkpoint__done();
 	probe buffer__sync__start(int, int);
-	probe buffer__sync__written(int);
+	probe buffer__batch__sync__written(BlockNumber);
 	probe buffer__sync__done(int, int, int);
 
 	probe deadlock__found();
-- 
2.43.0
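
To illustrate the core idea of the loop above in isolation, here is a minimal sketch of splitting an already-sorted list of block numbers into contiguous runs, each capped at a maximum batch size. It deliberately ignores pins, locks, LSNs, and segment boundaries; contiguous_run_length and count_batched_writes are hypothetical helpers, not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the length of the contiguous run starting at blocks[start], capped
 * at max_batch. The input is assumed sorted ascending with no duplicates.
 */
static size_t
contiguous_run_length(const uint32_t *blocks, size_t nblocks,
					  size_t start, size_t max_batch)
{
	size_t		n = 0;

	assert(start <= nblocks);

	while (start + n < nblocks && n < max_batch &&
		   blocks[start + n] == blocks[start] + (uint32_t) n)
		n++;

	return n;
}

/* Count how many write calls are needed to cover all blocks. */
static size_t
count_batched_writes(const uint32_t *blocks, size_t nblocks, size_t max_batch)
{
	size_t		pos = 0;
	size_t		writes = 0;

	assert(max_batch > 0);

	while (pos < nblocks)
	{
		/* Always at least 1 here, so the loop makes progress. */
		pos += contiguous_run_length(blocks, nblocks, pos, max_batch);
		writes++;
	}
	return writes;
}
```

For the blocks 10, 11, 12, 13, 20, 21, 30, a cap of 1 needs seven writes, while a cap of 8 needs only three. A non-contiguous block always ends the current run and starts the next one, mirroring the "do not count this item as processed" rule in the patch.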

#10

Chao Li

li.evan.chao@gmail.com

4 months ago

In reply to: Melanie Plageman (#9)

Re: Checkpointer write combining

On Sep 12, 2025, at 07:11, Melanie Plageman <melanieplageman@gmail.com> wrote:

On Wed, Sep 10, 2025 at 4:24 AM Chao Li <li.evan.chao@gmail.com> wrote:

Thanks for the review!

For any of your feedback that I simply implemented, I omitted an
inline comment about it. Those changes are included in the attached
v6. My inline replies below are only for feedback requiring more
discussion.
On Sep 10, 2025, at 01:55, Melanie Plageman <melanieplageman@gmail.com> wrote:
2 - 0001
```
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
+ if (XLogNeedsFlush(lsn))
+ {
+ /*
+ * Remove the dirty buffer from the ring; necessary to prevent an
+ * infinite loop if all ring members are dirty.
+ */
+ strategy->buffers[strategy->current] = InvalidBuffer;
+ return true;
+ }
- return true;
+ return false;
}
```
We can do:
```
if (!XLogNeedsFlush(lsn))
return false;

/* Remove the dirty buffer ….
*/
return true;
}
```
This would make the order of evaluation the same as master but I
actually prefer it this way because then we only take the buffer
header spinlock if there is a chance we will reject the buffer (e.g.
we don't need to examine it for strategies except BAS_BULKREAD)

I don’t understand why the two versions are different:

if (XLogNeedsFlush(lsn))
{
/*
* Remove the dirty buffer from the ring; necessary to prevent an
* infinite loop if all ring members are dirty.
*/
strategy->buffers[strategy->current] = InvalidBuffer;
return true;
}

return false;

if (XLogNeedsFlush(lsn))
return false;

/*
* Remove the dirty buffer from the ring; necessary to prevent an
* infinite loop if all ring members are dirty.
*/
strategy->buffers[strategy->current] = InvalidBuffer;
return true;

10 - 0004
```
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
+ limit = Min(max_batch_size, limit);
```

Do we need to check that max_batch_size is less than (MAX_IO_COMBINE_LIMIT-1)? BufWriteBatch.bufdescs is defined with a length of MAX_IO_COMBINE_LIMIT, and the first slot is used to store “start”.
I assert that in StrategyMaxWriteBatchSize(). io_combine_limit is not
allowed to exceed MAX_IO_COMBINE_LIMIT, so it shouldn't happen anyway,
since we are capping ourselves at io_combine_limit. Or is that your
point?

Please ignore comment 10. I thought I had struck it through in my original email. I added the comment, then later found that you already check MAX_IO_COMBINE_LIMIT in the other function, so I tried to delete it by striking through the comment.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#11

Melanie Plageman

melanieplageman@gmail.com

3 months ago

In reply to: Chao Li (#10)

7 attachment(s)

Re: Checkpointer write combining

On Thu, Sep 11, 2025 at 11:33 PM Chao Li <li.evan.chao@gmail.com> wrote:

I don’t understand why the two versions are different:

if (XLogNeedsFlush(lsn))
{
/*
* Remove the dirty buffer from the ring; necessary to prevent an
* infinite loop if all ring members are dirty.
*/
strategy->buffers[strategy->current] = InvalidBuffer;
return true;
}

return false;

VS

if (XLogNeedsFlush(lsn))
return false;

I think you mean
if (!XLogNeedsFlush(lsn))
{
return false;
}
// remove buffer
return true

is the same as

if (XLogNeedsFlush(lsn))
{
//remove dirty buffer
return true
}
return false;

Which is true. I've changed it to be like that.
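
For readers following along, the equivalence discussed here can be checked with a small standalone sketch. The function names and the removed_from_ring flag are hypothetical stand-ins for the XLogNeedsFlush() test and the ring-slot invalidation. The early-return form only matches the nested form when the condition is negated.

```c
#include <assert.h>
#include <stdbool.h>

/* Nested form: the "reject" work is done inside the if block. */
static bool
reject_nested(bool needs_wal_flush, bool *removed_from_ring)
{
	if (needs_wal_flush)
	{
		*removed_from_ring = true;
		return true;
	}
	return false;
}

/* Early-return form: note the negated condition. */
static bool
reject_early_return(bool needs_wal_flush, bool *removed_from_ring)
{
	if (!needs_wal_flush)
		return false;

	*removed_from_ring = true;
	return true;
}

/* Check both forms over every input; they must agree on result and effect. */
static void
check_forms_agree(void)
{
	for (int i = 0; i < 2; i++)
	{
		bool		needs_flush = (i == 1);
		bool		removed_a = false;
		bool		removed_b = false;

		assert(reject_nested(needs_flush, &removed_a) ==
			   reject_early_return(needs_flush, &removed_b));
		assert(removed_a == removed_b);
	}
}
```

Dropping the negation in the early-return form would invert the result for both inputs, which is the difference between the two snippets quoted above.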

Attached version 7 is rebased and has some bug fixes.

I also added a bonus patch on the end (0007) that refactors
SyncOneBuffer() to use the CAS loop pattern for pinning the buffer
that Andres introduced in 5e89985928795f243. bgwriter is now the only
user of SyncOneBuffer() and it rejects writing out buffers that are
used, so it seemed like a decent use case for this.
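
As a rough illustration of what a CAS loop for conditional pinning looks like in general, here is a sketch using C11 atomics. It is not the code from 5e89985928795f243 or from 0007: the state layout (pin count in the low 16 bits, one dirty flag) and the names pin_if_dirty_and_unpinned and unpin are assumptions made up for this example, and "unpinned" is only a simplification of "not in use".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state layout: low 16 bits = pin count, bit 16 = dirty. */
#define PIN_COUNT_MASK	0x0000FFFFu
#define FLAG_DIRTY		0x00010000u

/*
 * Pin the buffer only if it is dirty and nobody else has it pinned. The
 * decision and the pin happen in one atomic step, so no lock is held.
 */
static bool
pin_if_dirty_and_unpinned(_Atomic uint32_t *state)
{
	uint32_t	old = atomic_load(state);

	for (;;)
	{
		/* Reject without modifying anything if the buffer is unsuitable. */
		if ((old & PIN_COUNT_MASK) != 0 || (old & FLAG_DIRTY) == 0)
			return false;

		/* On failure, old is refreshed and the checks above are redone. */
		if (atomic_compare_exchange_weak(state, &old, old + 1))
			return true;
	}
}

/* Drop a pin taken by pin_if_dirty_and_unpinned(). */
static void
unpin(_Atomic uint32_t *state)
{
	uint32_t	old = atomic_fetch_sub(state, 1);

	assert((old & PIN_COUNT_MASK) > 0);
	(void) old;
}
```

The appeal for a caller that skips buffers is that a rejected buffer costs only an atomic read, with no lock/unlock pair on the state word.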

- Melanie

Attachments:

v7-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch (text/x-patch; charset=US-ASCII)

From ade120635e4d20b36829200f9e6806063ff4eb7a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 10:53:48 -0400
Subject: [PATCH v7 1/7] Refactor goto into for loop in GetVictimBuffer()

GetVictimBuffer() implemented a loop to optimistically lock a clean
victim buffer using a goto. Future commits will add batch flushing
functionality to GetVictimBuffer. The new logic works better with
standard for loop flow control.

This commit is only a refactor and does not introduce any new
functionality.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 189 ++++++++++++--------------
 src/backend/storage/buffer/freelist.c |  20 ++-
 src/include/storage/buf_internals.h   |   6 +
 3 files changed, 112 insertions(+), 103 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index edf17ce3ea1..453fa16de84 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -68,10 +68,6 @@
 #include "utils/timestamp.h"
 
 
-/* Note: these two macros only work on shared buffers, not local ones! */
-#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
-#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
-
 /* Note: this macro only works on local buffers, not shared ones! */
 #define LocalBufHdrGetBlock(bufHdr) \
 	LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]
@@ -2331,125 +2327,116 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
-	/* we return here if a prospective victim buffer gets used concurrently */
-again:
-
-	/*
-	 * Select a victim buffer.  The buffer is returned pinned and owned by
-	 * this backend.
-	 */
-	buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
-	buf = BufferDescriptorGetBuffer(buf_hdr);
-
-	/*
-	 * We shouldn't have any other pins for this buffer.
-	 */
-	CheckBufferIsPinnedOnce(buf);
-
-	/*
-	 * If the buffer was dirty, try to write it out.  There is a race
-	 * condition here, in that someone might dirty it after we released the
-	 * buffer header lock above, or even while we are writing it out (since
-	 * our share-lock won't prevent hint-bit updates).  We will recheck the
-	 * dirty bit after re-locking the buffer header.
-	 */
-	if (buf_state & BM_DIRTY)
+	/* Select a victim buffer using an optimistic locking scheme. */
+	for (;;)
 	{
-		LWLock	   *content_lock;
 
-		Assert(buf_state & BM_TAG_VALID);
-		Assert(buf_state & BM_VALID);
+		/* Attempt to claim a victim buffer. Buffer is returned pinned. */
+		buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+		buf = BufferDescriptorGetBuffer(buf_hdr);
 
 		/*
-		 * We need a share-lock on the buffer contents to write it out (else
-		 * we might write invalid data, eg because someone else is compacting
-		 * the page contents while we write).  We must use a conditional lock
-		 * acquisition here to avoid deadlock.  Even though the buffer was not
-		 * pinned (and therefore surely not locked) when StrategyGetBuffer
-		 * returned it, someone else could have pinned and exclusive-locked it
-		 * by the time we get here. If we try to get the lock unconditionally,
-		 * we'd block waiting for them; if they later block waiting for us,
-		 * deadlock ensues. (This has been observed to happen when two
-		 * backends are both trying to split btree index pages, and the second
-		 * one just happens to be trying to split the page the first one got
-		 * from StrategyGetBuffer.)
+		 * We shouldn't have any other pins for this buffer.
 		 */
-		content_lock = BufferDescriptorGetContentLock(buf_hdr);
-		if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
-		{
-			/*
-			 * Someone else has locked the buffer, so give it up and loop back
-			 * to get another one.
-			 */
-			UnpinBuffer(buf_hdr);
-			goto again;
-		}
+		CheckBufferIsPinnedOnce(buf);
 
 		/*
-		 * If using a nondefault strategy, and writing the buffer would
-		 * require a WAL flush, let the strategy decide whether to go ahead
-		 * and write/reuse the buffer or to choose another victim.  We need a
-		 * lock to inspect the page LSN, so this can't be done inside
-		 * StrategyGetBuffer.
+		 * If the buffer was dirty, try to write it out.  There is a race
+		 * condition here, in that someone might dirty it after we released
+		 * the buffer header lock above, or even while we are writing it out
+		 * (since our share-lock won't prevent hint-bit updates).  We will
+		 * recheck the dirty bit after re-locking the buffer header.
 		 */
-		if (strategy != NULL)
+		if (buf_state & BM_DIRTY)
 		{
-			XLogRecPtr	lsn;
+			LWLock	   *content_lock;
 
-			/* Read the LSN while holding buffer header lock */
-			buf_state = LockBufHdr(buf_hdr);
-			lsn = BufferGetLSN(buf_hdr);
-			UnlockBufHdr(buf_hdr, buf_state);
+			Assert(buf_state & BM_TAG_VALID);
+			Assert(buf_state & BM_VALID);
 
-			if (XLogNeedsFlush(lsn)
-				&& StrategyRejectBuffer(strategy, buf_hdr, from_ring))
+			/*
+			 * We need a share-lock on the buffer contents to write it out
+			 * (else we might write invalid data, eg because someone else is
+			 * compacting the page contents while we write).  We must use a
+			 * conditional lock acquisition here to avoid deadlock.  Even
+			 * though the buffer was not pinned (and therefore surely not
+			 * locked) when StrategyGetBuffer returned it, someone else could
+			 * have pinned and exclusive-locked it by the time we get here. If
+			 * we try to get the lock unconditionally, we'd block waiting for
+			 * them; if they later block waiting for us, deadlock ensues.
+			 * (This has been observed to happen when two backends are both
+			 * trying to split btree index pages, and the second one just
+			 * happens to be trying to split the page the first one got from
+			 * StrategyGetBuffer.)
+			 */
+			content_lock = BufferDescriptorGetContentLock(buf_hdr);
+			if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				/*
+				 * Someone else has locked the buffer, so give it up and loop
+				 * back to get another one.
+				 */
+				UnpinBuffer(buf_hdr);
+				continue;
+			}
+
+			/*
+			 * If using a nondefault strategy, and writing the buffer would
+			 * require a WAL flush, let the strategy decide whether to go
+			 * ahead and write/reuse the buffer or to choose another victim.
+			 * We need the content lock to inspect the page LSN, so this can't
+			 * be done inside StrategyGetBuffer.
+			 */
+			if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
 			{
 				LWLockRelease(content_lock);
 				UnpinBuffer(buf_hdr);
-				goto again;
+				continue;
 			}
-		}
 
-		/* OK, do the I/O */
-		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-		LWLockRelease(content_lock);
+			/* OK, do the I/O */
+			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
+			LWLockRelease(content_lock);
 
-		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-									  &buf_hdr->tag);
-	}
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &buf_hdr->tag);
+		}
 
 
-	if (buf_state & BM_VALID)
-	{
+		if (buf_state & BM_VALID)
+		{
+			/*
+			 * When a BufferAccessStrategy is in use, blocks evicted from
+			 * shared buffers are counted as IOOP_EVICT in the corresponding
+			 * context (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted
+			 * by a strategy in two cases: 1) while initially claiming buffers
+			 * for the strategy ring 2) to replace an existing strategy ring
+			 * buffer because it is pinned or in use and cannot be reused.
+			 *
+			 * Blocks evicted from buffers already in the strategy ring are
+			 * counted as IOOP_REUSE in the corresponding strategy context.
+			 *
+			 * At this point, we can accurately count evictions and reuses,
+			 * because we have successfully claimed the valid buffer.
+			 * Previously, we may have been forced to release the buffer due
+			 * to concurrent pinners or erroring out.
+			 */
+			pgstat_count_io_op(IOOBJECT_RELATION, io_context,
+							   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
+		}
+
 		/*
-		 * When a BufferAccessStrategy is in use, blocks evicted from shared
-		 * buffers are counted as IOOP_EVICT in the corresponding context
-		 * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
-		 * strategy in two cases: 1) while initially claiming buffers for the
-		 * strategy ring 2) to replace an existing strategy ring buffer
-		 * because it is pinned or in use and cannot be reused.
-		 *
-		 * Blocks evicted from buffers already in the strategy ring are
-		 * counted as IOOP_REUSE in the corresponding strategy context.
-		 *
-		 * At this point, we can accurately count evictions and reuses,
-		 * because we have successfully claimed the valid buffer. Previously,
-		 * we may have been forced to release the buffer due to concurrent
-		 * pinners or erroring out.
+		 * If the buffer has an entry in the buffer mapping table, delete it.
+		 * This can fail because another backend could have pinned or dirtied
+		 * the buffer.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, io_context,
-						   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
-	}
+		if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
+		{
+			UnpinBuffer(buf_hdr);
+			continue;
+		}
 
-	/*
-	 * If the buffer has an entry in the buffer mapping table, delete it. This
-	 * can fail because another backend could have pinned or dirtied the
-	 * buffer.
-	 */
-	if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
-	{
-		UnpinBuffer(buf_hdr);
-		goto again;
+		break;
 	}
 
 	/* a final set of sanity checks */
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7fe34d3ef4c..b76be264eb5 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "pgstat.h"
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
@@ -779,12 +780,21 @@ IOContextForStrategy(BufferAccessStrategy strategy)
  * be written out and doing so would require flushing WAL too.  This gives us
  * a chance to choose a different victim.
  *
+ * The buffer must be pinned and content locked and the buffer header spinlock
+ * must not be held. We must hold the content lock to examine the LSN.
+ *
  * Returns true if buffer manager should ask for a new victim, and false
  * if this buffer should be written and re-used.
  */
 bool
 StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+
+	if (!strategy)
+		return false;
+
 	/* We only do this in bulkread mode */
 	if (strategy->btype != BAS_BULKREAD)
 		return false;
@@ -794,11 +804,17 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
 		return false;
 
+	buf_state = LockBufHdr(buf);
+	lsn = BufferGetLSN(buf);
+	UnlockBufHdr(buf, buf_state);
+
+	if (!XLogNeedsFlush(lsn))
+		return false;
+
 	/*
-	 * Remove the dirty buffer from the ring; necessary to prevent infinite
+	 * Remove the dirty buffer from the ring; necessary to prevent an infinite
 	 * loop if all ring members are dirty.
 	 */
 	strategy->buffers[strategy->current] = InvalidBuffer;
-
 	return true;
 }
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c1206a46aba..7e258383048 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -421,6 +421,12 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 /*
  * Internal buffer management routines
  */
+
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
+
 /* bufmgr.c */
 extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
-- 
2.43.0

v7-0002-Split-FlushBuffer-into-two-parts.patch (text/x-patch; charset=US-ASCII)

From 6dfe3036229f25b9708109c460ce9c1425111650 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 10:54:19 -0400
Subject: [PATCH v7 2/7] Split FlushBuffer() into two parts

Before adding write combining to write a batch of blocks when flushing
dirty buffers, refactor FlushBuffer() into the preparatory step and
actual buffer flushing step. This separation provides symmetry with
future code for batch flushing which necessarily separates these steps,
as it must prepare multiple buffers before flushing them together.

These steps are moved into a new FlushBuffer() helper function,
CleanVictimBuffer() which will contain both the batch flushing and
single flush code in future commits.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 143 +++++++++++++++++++---------
 1 file changed, 100 insertions(+), 43 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 453fa16de84..769138a5373 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -533,6 +533,12 @@ static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn);
+static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln,
+						  IOObject io_object, IOContext io_context,
+						  XLogRecPtr buffer_lsn);
+static void CleanVictimBuffer(BufferDesc *bufdesc, bool from_ring,
+							  IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2394,12 +2400,8 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 				continue;
 			}
 
-			/* OK, do the I/O */
-			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-			LWLockRelease(content_lock);
-
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &buf_hdr->tag);
+			/* Content lock is released inside CleanVictimBuffer */
+			CleanVictimBuffer(buf_hdr, from_ring, io_context);
 		}
 
 
@@ -4276,53 +4278,67 @@ static void
 FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 			IOContext io_context)
 {
-	XLogRecPtr	recptr;
-	ErrorContextCallback errcallback;
-	instr_time	io_start;
-	Block		bufBlock;
-	char	   *bufToWrite;
-	uint32		buf_state;
+	XLogRecPtr	lsn;
 
-	/*
-	 * Try to start an I/O operation.  If StartBufferIO returns false, then
-	 * someone else flushed the buffer before we could, so we need not do
-	 * anything.
-	 */
-	if (!StartBufferIO(buf, false, false))
-		return;
+	if (PrepareFlushBuffer(buf, &lsn))
+		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
+}
 
-	/* Setup error traceback support for ereport() */
-	errcallback.callback = shared_buffer_write_error_callback;
-	errcallback.arg = buf;
-	errcallback.previous = error_context_stack;
-	error_context_stack = &errcallback;
+/*
+ * Prepare and write out a dirty victim buffer.
+ *
+ * Buffer must be pinned, the content lock must be held exclusively, and the
+ * buffer header spinlock must not be held. The exclusive lock is released and
+ * the buffer is returned pinned but not locked.
+ *
+ * bufdesc may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+				  bool from_ring, IOContext io_context)
+{
 
-	/* Find smgr relation for buffer */
-	if (reln == NULL)
-		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	LWLock	   *content_lock;
 
-	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
-										buf->tag.blockNum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber);
+	Assert(pg_atomic_read_u32(&bufdesc->state) & BM_DIRTY);
 
-	buf_state = LockBufHdr(buf);
+	/* Set up this victim buffer to be flushed */
+	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
+		return;
+
+	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	LWLockRelease(content_lock);
+	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+								  &bufdesc->tag);
+}
+
+/*
+ * Prepare the buffer with bufdesc for writing. Returns true if the buffer
+ * actually needs writing and false otherwise. lsn returns the buffer's LSN if
+ * the table is logged.
+ */
+static bool
+PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
+	uint32		buf_state;
 
 	/*
-	 * Run PageGetLSN while holding header lock, since we don't have the
-	 * buffer locked exclusively in all cases.
+	 * Try to start an I/O operation.  If StartBufferIO returns false, then
+	 * someone else flushed the buffer before we could, so we need not do
+	 * anything.
 	 */
-	recptr = BufferGetLSN(buf);
+	if (!StartBufferIO(bufdesc, false, false))
+		return false;
 
-	/* To check if block content changes while flushing. - vadim 01/17/97 */
-	buf_state &= ~BM_JUST_DIRTIED;
-	UnlockBufHdr(buf, buf_state);
+	*lsn = InvalidXLogRecPtr;
+	buf_state = LockBufHdr(bufdesc);
 
 	/*
-	 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
-	 * rule that log updates must hit disk before any of the data-file changes
-	 * they describe do.
+	 * Record the buffer's LSN. We will force XLOG flush up to buffer's LSN.
+	 * This implements the basic WAL rule that log updates must hit disk
+	 * before any of the data-file changes they describe do.
 	 *
 	 * However, this rule does not apply to unlogged relations, which will be
 	 * lost after a crash anyway.  Most unlogged relation pages do not bear
@@ -4335,9 +4351,50 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * happen, attempting to flush WAL through that location would fail, with
 	 * disastrous system-wide consequences.  To make sure that can't happen,
 	 * skip the flush if the buffer isn't permanent.
+	 *
+	 * We must hold the buffer header lock when examining the page LSN since
+	 * we don't have the buffer exclusively locked in all cases.
 	 */
 	if (buf_state & BM_PERMANENT)
-		XLogFlush(recptr);
+		*lsn = BufferGetLSN(bufdesc);
+
+	/* To check if block content changes while flushing. - vadim 01/17/97 */
+	buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, buf_state);
+	return true;
+}
+
+/*
+ * Actually do the write I/O to clean a buffer. buf and reln may be modified.
+ */
+static void
+DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+			  IOContext io_context, XLogRecPtr buffer_lsn)
+{
+	ErrorContextCallback errcallback;
+	instr_time	io_start;
+	Block		bufBlock;
+	char	   *bufToWrite;
+
+	/* Setup error traceback support for ereport() */
+	errcallback.callback = shared_buffer_write_error_callback;
+	errcallback.arg = buf;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* Find smgr relation for buffer */
+	if (reln == NULL)
+		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+
+	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
+										buf->tag.blockNum,
+										reln->smgr_rlocator.locator.spcOid,
+										reln->smgr_rlocator.locator.dbOid,
+										reln->smgr_rlocator.locator.relNumber);
+
+	/* Force XLOG flush up to buffer's LSN */
+	if (!XLogRecPtrIsInvalid(buffer_lsn))
+		XLogFlush(buffer_lsn);
 
 	/*
 	 * Now it's safe to write the buffer to disk. Note that no one else should
-- 
2.43.0

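The prepare/do split in the patch above can be illustrated with a small standalone model. This is only a sketch under stated assumptions: ToyBuffer, ToyLsn, toy_prepare_flush(), toy_do_flush() and the global counters are hypothetical stand-ins invented for illustration, not PostgreSQL types or functions, and the model ignores locking entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these are PostgreSQL definitions. */
typedef uint64_t ToyLsn;
#define TOY_INVALID_LSN ((ToyLsn) 0)

typedef struct ToyBuffer
{
	bool		io_in_progress; /* another process is already writing it */
	bool		permanent;		/* page belongs to a logged relation */
	bool		dirty;
	ToyLsn		page_lsn;
} ToyBuffer;

static ToyLsn toy_wal_flushed_upto = 0; /* models how far WAL is durable */
static int	toy_writes = 0;				/* models data-file writes issued */

/*
 * Step 1: try to claim the buffer for I/O. Returns false if someone else
 * is already writing it. On success, *lsn is the WAL position that must be
 * durable before the page is written, or TOY_INVALID_LSN if unlogged.
 */
static bool
toy_prepare_flush(ToyBuffer *buf, ToyLsn *lsn)
{
	if (buf->io_in_progress)
		return false;
	buf->io_in_progress = true;
	*lsn = buf->permanent ? buf->page_lsn : TOY_INVALID_LSN;
	return true;
}

/* Step 2: flush WAL up to lsn if required, then write the page. */
static void
toy_do_flush(ToyBuffer *buf, ToyLsn lsn)
{
	if (lsn != TOY_INVALID_LSN && lsn > toy_wal_flushed_upto)
		toy_wal_flushed_upto = lsn; /* stands in for a WAL flush */
	toy_writes++;					/* stands in for the data-file write */
	buf->dirty = false;
	buf->io_in_progress = false;
}

/* The single-buffer path is simply the two steps back to back. */
static void
toy_flush_buffer(ToyBuffer *buf)
{
	ToyLsn		lsn;

	if (toy_prepare_flush(buf, &lsn))
		toy_do_flush(buf, lsn);
}
```

A batch path can call the first step on several buffers, keep the highest LSN it saw, and then perform one WAL flush before writing them all, which is the symmetry the commit message refers to.
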
v7-0003-Eagerly-flush-bulkwrite-strategy-ring.patch (text/x-patch; charset=US-ASCII)

From 75209f0288a5d539168aad7b177e6629f4790569 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 13:15:43 -0400
Subject: [PATCH v7 3/7] Eagerly flush bulkwrite strategy ring

Operations using BAS_BULKWRITE (COPY FROM and createdb) will inevitably
need to flush buffers in the strategy ring in order to reuse them. By
eagerly flushing the buffers in a larger run, we encourage larger writes
at the kernel level and less interleaving of WAL flushes and data file
writes. The effect is mainly noticeable with multiple parallel COPY
FROMs. In this case, client backends achieve higher write throughput and
end up spending less time waiting on acquiring the lock to flush WAL.
Larger flush operations also mean less time waiting for flush operations
at the kernel level.

The heuristic for eager eviction is to only flush buffers in the
strategy ring which do not require a WAL flush.

This patch is also a step toward AIO writes.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Earlier version Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 238 +++++++++++++++++++++++++-
 src/backend/storage/buffer/freelist.c |  48 ++++++
 src/include/storage/buf_internals.h   |   4 +
 3 files changed, 282 insertions(+), 8 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 769138a5373..7a553b8cdd2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -531,14 +531,25 @@ static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_c
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
+
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
-static bool PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn);
+static BufferDesc *NextStratBufToFlush(BufferAccessStrategy strategy,
+									   Buffer sweep_end,
+									   XLogRecPtr *lsn, int *sweep_cursor);
+
+static bool BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn);
+static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+												   RelFileLocator *rlocator, bool skip_pinned,
+												   XLogRecPtr *max_lsn);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc,
+							   XLogRecPtr *lsn);
 static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						  IOObject io_object, IOContext io_context,
 						  XLogRecPtr buffer_lsn);
-static void CleanVictimBuffer(BufferDesc *bufdesc, bool from_ring,
-							  IOContext io_context);
+static void CleanVictimBuffer(BufferAccessStrategy strategy,
+							  BufferDesc *bufdesc,
+							  bool from_ring, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2401,7 +2412,7 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 			}
 
 			/* Content lock is released inside CleanVictimBuffer */
-			CleanVictimBuffer(buf_hdr, from_ring, io_context);
+			CleanVictimBuffer(strategy, buf_hdr, from_ring, io_context);
 		}
 
 
@@ -4284,6 +4295,61 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * Returns true if the buffer needs WAL flushed before it can be written out.
+ * Caller must not already hold the buffer header spinlock.
+ */
+static bool
+BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
+	uint32		buf_state = LockBufHdr(bufdesc);
+
+	*lsn = BufferGetLSN(bufdesc);
+
+	UnlockBufHdr(bufdesc, buf_state);
+
+	/*
+	 * See buffer flushing code for more details on why we condition this on
+	 * the relation being logged.
+	 */
+	return buf_state & BM_PERMANENT && XLogNeedsFlush(*lsn);
+}
+
+
+/*
+ * Returns the buffer descriptor of the buffer containing the next block we
+ * should eagerly flush or NULL when there are no further buffers to consider
+ * writing out.
+ */
+static BufferDesc *
+NextStratBufToFlush(BufferAccessStrategy strategy,
+					Buffer sweep_end,
+					XLogRecPtr *lsn, int *sweep_cursor)
+{
+	Buffer		bufnum;
+	BufferDesc *bufdesc;
+
+	while ((bufnum =
+			StrategySweepNextBuffer(strategy, sweep_cursor)) != sweep_end)
+	{
+		/*
+		 * For BAS_BULKWRITE, once you hit an InvalidBuffer, the remaining
+		 * buffers in the ring will be invalid.
+		 */
+		if (!BufferIsValid(bufnum))
+			break;
+
+		if ((bufdesc = PrepareOrRejectEagerFlushBuffer(bufnum,
+													   InvalidBlockNumber,
+													   NULL,
+													   true,
+													   lsn)) != NULL)
+			return bufdesc;
+	}
+
+	return NULL;
+}
+
 /*
  * Prepare and write out a dirty victim buffer.
  *
@@ -4294,12 +4360,14 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
  * bufdesc may be modified.
  */
 static void
-CleanVictimBuffer(BufferDesc *bufdesc,
+CleanVictimBuffer(BufferAccessStrategy strategy,
+				  BufferDesc *bufdesc,
 				  bool from_ring, IOContext io_context)
 {
 
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
+	bool		first_buffer = true;
 
 	Assert(pg_atomic_read_u32(&bufdesc->state) & BM_DIRTY);
 
@@ -4307,11 +4375,165 @@ CleanVictimBuffer(BufferDesc *bufdesc,
 	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
 		return;
 
-	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	if (from_ring && StrategySupportsEagerFlush(strategy))
+	{
+		Buffer		sweep_end = BufferDescriptorGetBuffer(bufdesc);
+		int			cursor = StrategySweepStart(strategy);
+
+		/* Clean victim buffer and find more to flush opportunistically */
+		do
+		{
+			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+			content_lock = BufferDescriptorGetContentLock(bufdesc);
+			LWLockRelease(content_lock);
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &bufdesc->tag);
+			/* We leave the first buffer pinned for the caller */
+			if (!first_buffer)
+				UnpinBuffer(bufdesc);
+			first_buffer = false;
+		} while ((bufdesc = NextStratBufToFlush(strategy, sweep_end,
+												&max_lsn, &cursor)) != NULL);
+	}
+	else
+	{
+		DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+		content_lock = BufferDescriptorGetContentLock(bufdesc);
+		LWLockRelease(content_lock);
+		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+									  &bufdesc->tag);
+	}
+}
+
+/*
+ * Prepare bufdesc for eager flushing.
+ *
+ * Given bufnum, return the buffer descriptor of the buffer to eagerly flush,
+ * pinned and locked, or NULL if this buffer does not contain a block that
+ * should be flushed.
+ *
+ * require is the BlockNumber required by the caller. Some callers may require
+ * a specific BlockNumber to be in bufnum because they are assembling a
+ * contiguous run of blocks.
+ *
+ * If the caller needs the block to be from a specific relation, rlocator will
+ * be provided.
+ */
+static BufferDesc *
+PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+								RelFileLocator *rlocator, bool skip_pinned,
+								XLogRecPtr *max_lsn)
+{
+	BufferDesc *bufdesc;
+	uint32		old_buf_state;
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+	BlockNumber blknum;
+	LWLock	   *content_lock;
+
+	if (!BufferIsValid(bufnum))
+		return NULL;
+
+	Assert(!BufferIsLocal(bufnum));
+
+	bufdesc = GetBufferDescriptor(bufnum - 1);
+
+	/* Block may need to be in a specific relation */
+	if (rlocator &&
+		!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufdesc->tag),
+							  *rlocator))
+		return NULL;
+
+	/*
+	 * Ensure that there's a free refcount entry and resource owner slot for
+	 * the pin before pinning the buffer. While this may leak a refcount and
+	 * slot if we return without a buffer, we should use that slot the next
+	 * time we try and reserve a spot.
+	 */
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	ReservePrivateRefCountEntry();
+
+	/*
+	 * Check whether the buffer can be used and pin it if so. Do this using a
+	 * CAS loop, to avoid having to lock the buffer header. We have to lock
+	 * the buffer header later if we succeed in pinning the buffer here, but
+	 * avoiding locking the buffer header if the buffer is in use is worth it.
+	 */
+	old_buf_state = pg_atomic_read_u32(&bufdesc->state);
+
+	for (;;)
+	{
+		buf_state = old_buf_state;
+
+		if (!(buf_state & BM_DIRTY) || !(buf_state & BM_VALID))
+			return NULL;
+
+		/* We don't eagerly flush buffers used by others */
+		if (skip_pinned &&
+			(BUF_STATE_GET_REFCOUNT(buf_state) > 0 ||
+			 BUF_STATE_GET_USAGECOUNT(buf_state) > 1))
+			return NULL;
+
+		if (unlikely(buf_state & BM_LOCKED))
+		{
+			old_buf_state = WaitBufHdrUnlocked(bufdesc);
+			continue;
+		}
+
+		/* pin the buffer if the CAS succeeds */
+		buf_state += BUF_REFCOUNT_ONE;
+
+		if (pg_atomic_compare_exchange_u32(&bufdesc->state, &old_buf_state,
+										   buf_state))
+		{
+			TrackNewBufferPin(BufferDescriptorGetBuffer(bufdesc));
+			break;
+		}
+	}
+
+	CheckBufferIsPinnedOnce(bufnum);
+
+	blknum = BufferGetBlockNumber(bufnum);
+	Assert(BlockNumberIsValid(blknum));
+
+	/* We only include contiguous blocks in the run */
+	if (BlockNumberIsValid(require) && blknum != require)
+		goto except_unpin_buffer;
+
+	/* Don't eagerly flush buffers requiring WAL flush */
+	if (BufferNeedsWALFlush(bufdesc, &lsn))
+		goto except_unpin_buffer;
+
 	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+		goto except_unpin_buffer;
+
+	/*
+	 * Now that we have the content lock, we need to recheck if we need to
+	 * flush WAL.
+	 */
+	if (BufferNeedsWALFlush(bufdesc, &lsn))
+		goto except_unlock_content;
+
+	/* Try to start an I/O operation */
+	if (!StartBufferIO(bufdesc, false, true))
+		goto except_unlock_content;
+
+	if (lsn > *max_lsn)
+		*max_lsn = lsn;
+
+	buf_state = LockBufHdr(bufdesc);
+	buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, buf_state);
+
+	return bufdesc;
+
+except_unlock_content:
 	LWLockRelease(content_lock);
-	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-								  &bufdesc->tag);
+
+except_unpin_buffer:
+	UnpinBuffer(bufdesc);
+	return NULL;
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index b76be264eb5..4baa0550bb1 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -156,6 +156,31 @@ ClockSweepTick(void)
 	return victim;
 }
 
+/*
+ * Some BufferAccessStrategies support eager flushing -- which is flushing
+ * buffers in the ring before they are needed. This can lead to better I/O
+ * patterns than lazily flushing buffers immediately before reusing them.
+ */
+bool
+StrategySupportsEagerFlush(BufferAccessStrategy strategy)
+{
+	Assert(strategy);
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKWRITE:
+			return true;
+		case BAS_VACUUM:
+		case BAS_NORMAL:
+		case BAS_BULKREAD:
+			return false;
+		default:
+			elog(ERROR, "unrecognized buffer access strategy: %d",
+				 (int) strategy->btype);
+			return false;
+	}
+}
+
 /*
  * StrategyGetBuffer
  *
@@ -307,6 +332,29 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	}
 }
 
+/*
+ * Return the next buffer in the ring or InvalidBuffer if the current sweep is
+ * over.
+ */
+Buffer
+StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
+{
+	if (++(*sweep_cursor) >= strategy->nbuffers)
+		*sweep_cursor = 0;
+
+	return strategy->buffers[*sweep_cursor];
+}
+
+/*
+ * Return the ring index at which a sweep of the strategy ring starts
+ */
+int
+StrategySweepStart(BufferAccessStrategy strategy)
+{
+	return strategy->current;
+}
+
+
 /*
  * StrategySyncStart -- tell BgBufferSync where to start syncing
  *
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 7e258383048..b48dece3e63 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -442,6 +442,10 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 
 /* freelist.c */
+extern bool StrategySupportsEagerFlush(BufferAccessStrategy strategy);
+extern Buffer StrategySweepNextBuffer(BufferAccessStrategy strategy,
+									  int *sweep_cursor);
+extern int	StrategySweepStart(BufferAccessStrategy strategy);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
-- 
2.43.0

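The ring sweep used for eager flushing in the patch above can be modeled in isolation. The sketch below is a simplified illustration, not the patch's code: ToyRing, toy_sweep_next() and toy_sweep() are hypothetical names, buffers are plain ints with 0 standing in for InvalidBuffer, and no pinning, locking or WAL checks are modeled. It shows only the cursor wraparound and the two stop conditions (reaching the victim again, or reaching an unfilled slot).

```c
#include <assert.h>

#define TOY_INVALID_BUFFER 0

/* Hypothetical model of a strategy ring; not a PostgreSQL structure. */
typedef struct ToyRing
{
	int			nbuffers;		/* number of slots in the ring */
	int			current;		/* slot holding the victim buffer */
	int			buffers[8];		/* buffer numbers; 0 means unfilled */
} ToyRing;

/* Advance the cursor with wraparound and return the buffer in that slot. */
static int
toy_sweep_next(const ToyRing *ring, int *cursor)
{
	if (++(*cursor) >= ring->nbuffers)
		*cursor = 0;
	return ring->buffers[*cursor];
}

/*
 * Visit every other slot once, starting just after the victim's slot.
 * Stops when the sweep returns to the victim or hits an unfilled slot.
 * Stores visited buffers in out[] and returns how many were stored.
 */
static int
toy_sweep(const ToyRing *ring, int *out, int out_len)
{
	int			cursor = ring->current;
	int			sweep_end = ring->buffers[ring->current];
	int			n = 0;
	int			buf;

	while (n < out_len &&
		   (buf = toy_sweep_next(ring, &cursor)) != sweep_end)
	{
		if (buf == TOY_INVALID_BUFFER)
			break;
		out[n++] = buf;
	}
	return n;
}
```

In the real patch each visited buffer is additionally filtered (dirty, unpinned, no WAL flush required) before it is flushed; this model leaves that filtering out.
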
v7-0004-Write-combining-for-BAS_BULKWRITE.patch (text/x-patch; charset=US-ASCII)

From f3b1be877ddbd5911f3813a8a0b3cd3877e0d5b9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 13:42:47 -0400
Subject: [PATCH v7 4/7] Write combining for BAS_BULKWRITE

Implement write combining for users of the bulkwrite buffer access
strategy (e.g. COPY FROM). When the buffer access strategy needs to
clean a buffer for reuse, it already opportunistically flushes some
other buffers. Now, combine any contiguous blocks from the same relation
into larger writes and issue them with smgrwritev().
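
As a rough illustration of the combining rule described above, the following sketch counts how many buffers, taken in ring order, extend a run of contiguous blocks of the same relation. ToyTag and toy_batch_length() are hypothetical names used only here; the real code also has to pin, lock and start I/O on each buffer, none of which is modeled.

```c
#include <assert.h>

/* Hypothetical buffer tag: which relation and which block a buffer holds. */
typedef struct ToyTag
{
	unsigned	rel;
	unsigned	block;
} ToyTag;

/*
 * tags[0] is the buffer that starts the batch. Return the number of leading
 * entries that form one contiguous run in the same relation, capped at
 * limit. The starting buffer is always part of the batch, so the result is
 * at least 1 whenever ntags > 0, even if limit is 0.
 */
static unsigned
toy_batch_length(const ToyTag *tags, unsigned ntags, unsigned limit)
{
	unsigned	n;

	if (ntags == 0)
		return 0;

	for (n = 1; n < ntags && n < limit; n++)
	{
		/* A buffer from another relation breaks the run */
		if (tags[n].rel != tags[0].rel)
			break;
		/* So does a block that is not the next one in sequence */
		if (tags[n].block != tags[0].block + n)
			break;
	}
	return n;
}
```

For blocks 5, 6, 7, 9 of one relation, the run stops before block 9, so three blocks would go out as one vectored write and block 9 would start a later batch.
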

The performance benefit for COPY FROM is mostly noticeable for multiple
concurrent COPY FROMs because a single COPY FROM is either CPU bound or
bound by WAL writes.

The infrastructure for flushing larger batches of IOs will be reused by
checkpointer and other processes doing writes of dirty data.

XXX: Because this sets in-place checksums for batches, it is not
committable until additional infrastructure goes in place.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_bcWRvRwZUop_d9vzF9nHAiT%2B-uPzkJ%3DS3ShZ1GqeAYOw%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 217 ++++++++++++++++++++++++--
 src/backend/storage/buffer/freelist.c |  26 +++
 src/backend/storage/page/bufpage.c    |  20 +++
 src/backend/utils/probes.d            |   2 +
 src/include/storage/buf_internals.h   |  32 ++++
 src/include/storage/bufpage.h         |   2 +
 src/tools/pgindent/typedefs.list      |   1 +
 7 files changed, 288 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7a553b8cdd2..3c49c8c2ef2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -537,7 +537,11 @@ static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 static BufferDesc *NextStratBufToFlush(BufferAccessStrategy strategy,
 									   Buffer sweep_end,
 									   XLogRecPtr *lsn, int *sweep_cursor);
-
+static void FindFlushAdjacents(BufferAccessStrategy strategy, Buffer sweep_end,
+							   BufferDesc *batch_start,
+							   uint32 max_batch_size,
+							   BufWriteBatch *batch,
+							   int *sweep_cursor);
 static bool BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn);
 static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
 												   RelFileLocator *rlocator, bool skip_pinned,
@@ -4316,10 +4320,91 @@ BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
 }
 
 
+
+/*
+ * Given a buffer descriptor, batch_start, from a strategy ring, strategy,
+ * that supports eager flushing, find additional buffers from the ring that
+ * can be combined into a single write batch with this buffer.
+ *
+ * max_batch_size is the maximum number of blocks that can be combined into a
+ * single write in general. This function, based on the block number of
+ * batch_start, will determine the maximum IO size for this particular write
+ * given how much of the file remains. max_batch_size is provided by the
+ * caller so it doesn't have to be recalculated for each write.
+ *
+ * batch is an output parameter that this function will fill with the needed
+ * information to write this IO.
+ *
+ * This function will pin and content lock all of the buffers that it
+ * assembles for the IO batch. The caller is responsible for issuing the IO.
+ */
+static void
+FindFlushAdjacents(BufferAccessStrategy strategy, Buffer sweep_end,
+				   BufferDesc *batch_start,
+				   uint32 max_batch_size,
+				   BufWriteBatch *batch,
+				   int *sweep_cursor)
+{
+	BlockNumber limit;
+	uint32		buf_state;
+
+	Assert(batch_start);
+	batch->bufdescs[0] = batch_start;
+
+	buf_state = LockBufHdr(batch_start);
+	batch->max_lsn = BufferGetLSN(batch_start);
+	UnlockBufHdr(batch_start, buf_state);
+
+	batch->start = batch->bufdescs[0]->tag.blockNum;
+	Assert(BlockNumberIsValid(batch->start));
+	batch->n = 1;
+	batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+	batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
+	batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+	limit = smgrmaxcombine(batch->reln, batch->forkno, batch->start);
+	limit = Min(max_batch_size, limit);
+	limit = Min(GetAdditionalPinLimit(), limit);
+
+	/*
+	 * It's possible we're not allowed any more pins or there aren't more
+	 * blocks in the target relation. In this case, just return. Our batch
+	 * will have only one buffer.
+	 */
+	if (limit <= 0)
+		return;
+
+	/* Now assemble a run of blocks to write out. */
+	for (; batch->n < limit; batch->n++)
+	{
+		Buffer		bufnum;
+
+		if ((bufnum =
+			 StrategySweepNextBuffer(strategy, sweep_cursor)) == sweep_end)
+			break;
+
+		/*
+		 * For BAS_BULKWRITE, once you hit an InvalidBuffer, the remaining
+		 * buffers in the ring will be invalid.
+		 */
+		if (!BufferIsValid(bufnum))
+			break;
+
+		/* Stop when we encounter a buffer that will break the run */
+		if ((batch->bufdescs[batch->n] =
+			 PrepareOrRejectEagerFlushBuffer(bufnum,
+											 batch->start + batch->n,
+											 &batch->rlocator,
+											 true,
+											 &batch->max_lsn)) == NULL)
+			break;
+	}
+}
+
 /*
  * Returns the buffer descriptor of the buffer containing the next block we
  * should eagerly flush or NULL when there are no further buffers to consider
- * writing out.
+ * writing out. This will be the start of a new batch of buffers to write out.
  */
 static BufferDesc *
 NextStratBufToFlush(BufferAccessStrategy strategy,
@@ -4367,7 +4452,6 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
-	bool		first_buffer = true;
 
 	Assert(pg_atomic_read_u32(&bufdesc->state) & BM_DIRTY);
 
@@ -4379,19 +4463,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 	{
 		Buffer		sweep_end = BufferDescriptorGetBuffer(bufdesc);
 		int			cursor = StrategySweepStart(strategy);
+		uint32		max_batch_size = StrategyMaxWriteBatchSize(strategy);
+
+		/* Pin our victim again so it stays ours after the batch is released */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+		IncrBufferRefCount(BufferDescriptorGetBuffer(bufdesc));
 
 		/* Clean victim buffer and find more to flush opportunistically */
 		do
 		{
-			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
-			content_lock = BufferDescriptorGetContentLock(bufdesc);
-			LWLockRelease(content_lock);
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &bufdesc->tag);
-			/* We leave the first buffer pinned for the caller */
-			if (!first_buffer)
-				UnpinBuffer(bufdesc);
-			first_buffer = false;
+			BufWriteBatch batch;
+
+			FindFlushAdjacents(strategy, sweep_end, bufdesc, max_batch_size,
+							   &batch, &cursor);
+			FlushBufferBatch(&batch, io_context);
+			CompleteWriteBatchIO(&batch, io_context, &BackendWritebackContext);
 		} while ((bufdesc = NextStratBufToFlush(strategy, sweep_end,
 												&max_lsn, &cursor)) != NULL);
 	}
@@ -4536,6 +4623,70 @@ except_unpin_buffer:
 	return NULL;
 }
 
+/*
+ * Given a prepared batch of buffers, write them out as a vector.
+ */
+void
+FlushBufferBatch(BufWriteBatch *batch,
+				 IOContext io_context)
+{
+	BlockNumber blknums[MAX_IO_COMBINE_LIMIT];
+	Block		blocks[MAX_IO_COMBINE_LIMIT];
+	instr_time	io_start;
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+
+	if (!XLogRecPtrIsInvalid(batch->max_lsn))
+		XLogFlush(batch->max_lsn);
+
+	if (batch->reln == NULL)
+		batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+#ifdef USE_ASSERT_CHECKING
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		XLogRecPtr	lsn;
+
+		Assert(!BufferNeedsWALFlush(batch->bufdescs[i], &lsn));
+	}
+#endif
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_START(batch->forkno,
+											  batch->reln->smgr_rlocator.locator.spcOid,
+											  batch->reln->smgr_rlocator.locator.dbOid,
+											  batch->reln->smgr_rlocator.locator.relNumber,
+											  batch->reln->smgr_rlocator.backend,
+											  batch->n);
+
+	/*
+	 * XXX: All blocks should be copied and then checksummed but doing so
+	 * takes a lot of extra memory and a future patch will eliminate this
+	 * requirement.
+	 */
+	for (BlockNumber i = 0; i < batch->n; i++)
+	{
+		blknums[i] = batch->start + i;
+		blocks[i] = BufHdrGetBlock(batch->bufdescs[i]);
+	}
+
+	PageSetBatchChecksumInplace((Page *) blocks, blknums, batch->n);
+
+	io_start = pgstat_prepare_io_time(track_io_timing);
+
+	smgrwritev(batch->reln, batch->forkno,
+			   batch->start, (const void **) blocks, batch->n, false);
+
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context, IOOP_WRITE,
+							io_start, batch->n, BLCKSZ);
+
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Prepare the buffer with bufdesc for writing. Returns true if the buffer
 * actually needs writing and false otherwise. lsn returns the buffer's LSN if
@@ -4696,6 +4847,48 @@ FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 	LWLockRelease(BufferDescriptorGetContentLock(buf));
 }
 
+/*
+ * Given a previously initialized batch with buffers that have already been
+ * flushed, terminate the IO on each buffer and then unlock and unpin them.
+ * This assumes all the buffers were locked and pinned. wb_context will be
+ * modified.
+ */
+void
+CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+					 WritebackContext *wb_context)
+{
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+	pgBufferUsage.shared_blks_written += batch->n;
+
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		Buffer		buffer = BufferDescriptorGetBuffer(batch->bufdescs[i]);
+
+		errcallback.arg = batch->bufdescs[i];
+
+		/* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state. */
+		TerminateBufferIO(batch->bufdescs[i], true, 0, true, false);
+		LWLockRelease(BufferDescriptorGetContentLock(batch->bufdescs[i]));
+		ReleaseBuffer(buffer);
+		ScheduleBufferTagForWriteback(wb_context, io_context,
+									  &batch->bufdescs[i]->tag);
+	}
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_DONE(batch->forkno,
+											 batch->reln->smgr_rlocator.locator.spcOid,
+											 batch->reln->smgr_rlocator.locator.dbOid,
+											 batch->reln->smgr_rlocator.locator.relNumber,
+											 batch->reln->smgr_rlocator.backend,
+											 batch->n, batch->start);
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * RelationGetNumberOfBlocksInFork
  *		Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4baa0550bb1..f73a52c7e56 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -775,6 +775,32 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	return NULL;
 }
 
+
+/*
+ * Determine the largest IO we can assemble from the given strategy ring given
+ * strategy-specific as well as global constraints on the number of pinned
+ * buffers and max IO size.
+ */
+uint32
+StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
+{
+	uint32		max_possible_buffer_limit;
+	uint32		max_write_batch_size;
+	int			strategy_pin_limit;
+
+	max_write_batch_size = io_combine_limit;
+
+	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+	max_possible_buffer_limit = GetPinLimit();
+
+	max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+	max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+	max_write_batch_size = Max(1, max_write_batch_size);
+	max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
+	Assert(max_write_batch_size <= MAX_IO_COMBINE_LIMIT);
+	return max_write_batch_size;
+}
+
 /*
  * AddBufferToRing -- add a buffer to the buffer ring
  *
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index dbb49ed9197..12503934502 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
 	((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
 }
+
+/*
+ * A helper to set the checksums of multiple blocks
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, const BlockNumber *blknos, uint32 length)
+{
+	/* If we don't need a checksum, just return */
+	if (!DataChecksumsEnabled())
+		return;
+
+	for (uint32 i = 0; i < length; i++)
+	{
+		Page		page = pages[i];
+
+		if (PageIsNew(page))
+			continue;
+		((PageHeader) page)->pd_checksum = pg_checksum_page(page, blknos[i]);
+	}
+}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index e9e413477ba..36dd4f8375b 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -61,6 +61,8 @@ provider postgresql {
 	probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
 	probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
 	probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
+	probe buffer__batch__flush__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
+	probe buffer__batch__flush__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
 
 	probe buffer__checkpoint__start(int);
 	probe buffer__checkpoint__sync__start();
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b48dece3e63..337f1427bbc 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -418,6 +418,34 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 	ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
 }
 
+/*
+ * Used to write out multiple blocks at a time in a combined IO. bufdescs
+ * contains buffer descriptors for buffers containing adjacent blocks of the
+ * same fork of the same relation.
+ */
+typedef struct BufWriteBatch
+{
+	RelFileLocator rlocator;
+	ForkNumber	forkno;
+	SMgrRelation reln;
+
+	/*
+	 * The BlockNumber of the first block in the run of contiguous blocks to
+	 * be written out as a single IO.
+	 */
+	BlockNumber start;
+
+	/*
+	 * While assembling the buffers, we keep track of the maximum LSN so that
+	 * we can flush WAL through this LSN before flushing the buffers.
+	 */
+	XLogRecPtr	max_lsn;
+
+	/* The number of valid buffers in bufdescs */
+	uint32		n;
+	BufferDesc *bufdescs[MAX_IO_COMBINE_LIMIT];
+} BufWriteBatch;
+
 /*
  * Internal buffer management routines
  */
@@ -432,6 +460,7 @@ extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
 										  IOContext io_context, BufferTag *tag);
+extern void FlushBufferBatch(BufWriteBatch *batch, IOContext io_context);
 
 extern void TrackNewBufferPin(Buffer buf);
 
@@ -443,9 +472,12 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 /* freelist.c */
 extern bool StrategySupportsEagerFlush(BufferAccessStrategy strategy);
+extern uint32 StrategyMaxWriteBatchSize(BufferAccessStrategy strategy);
 extern Buffer StrategySweepNextBuffer(BufferAccessStrategy strategy,
 									  int *sweep_cursor);
 extern int	StrategySweepStart(BufferAccessStrategy strategy);
+extern void CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+								 WritebackContext *wb_context);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index aeb67c498c5..bb4e6af461a 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -507,5 +507,7 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 									Item newtup, Size newsize);
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern void PageSetBatchChecksumInplace(Page *pages, const BlockNumber *blknos,
+										uint32 length);
 
 #endif							/* BUFPAGE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..c80e1ff4107 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -350,6 +350,7 @@ BufferManagerRelation
 BufferStrategyControl
 BufferTag
 BufferUsage
+BufWriteBatch
 BuildAccumulator
 BuiltinScript
 BulkInsertState
-- 
2.43.0

v7-0005-Add-database-Oid-to-CkptSortItem.patchtext/x-patch; charset=US-ASCII; name=v7-0005-Add-database-Oid-to-CkptSortItem.patchDownload

From 7a48e402ebabc76bcead5ffd8db1f62720cc1704 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:22:11 -0400
Subject: [PATCH v7 5/7] Add database Oid to CkptSortItem

This is useful for checkpointer write combining -- which will be added
in a future commit.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 8 ++++++++
 src/include/storage/buf_internals.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3c49c8c2ef2..0ee7feba7ba 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3413,6 +3413,7 @@ BufferSync(int flags)
 			item = &CkptBufferIds[num_to_scan++];
 			item->buf_id = buf_id;
 			item->tsId = bufHdr->tag.spcOid;
+			item->dbId = bufHdr->tag.dbOid;
 			item->relNumber = BufTagGetRelNumber(&bufHdr->tag);
 			item->forkNum = BufTagGetForkNum(&bufHdr->tag);
 			item->blockNum = bufHdr->tag.blockNum;
@@ -6812,6 +6813,13 @@ ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b)
 		return -1;
 	else if (a->tsId > b->tsId)
 		return 1;
+
+	/* compare database */
+	if (a->dbId < b->dbId)
+		return -1;
+	else if (a->dbId > b->dbId)
+		return 1;
+
 	/* compare relation */
 	if (a->relNumber < b->relNumber)
 		return -1;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 337f1427bbc..a0051780a13 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -384,6 +384,7 @@ extern uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 typedef struct CkptSortItem
 {
 	Oid			tsId;
+	Oid			dbId;
 	RelFileNumber relNumber;
 	ForkNumber	forkNum;
 	BlockNumber blockNum;
-- 
2.43.0

v7-0006-Implement-checkpointer-data-write-combining.patchtext/x-patch; charset=US-ASCII; name=v7-0006-Implement-checkpointer-data-write-combining.patchDownload

From 69f182fd94768efc2dac8fca3c823e8c2a8cd483 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 15:23:16 -0400
Subject: [PATCH v7 6/7] Implement checkpointer data write combining

When the checkpointer writes out dirty buffers, writing multiple
contiguous blocks as a single IO is a substantial performance
improvement. The checkpointer is usually bottlenecked on IO, so issuing
larger IOs leads to increased write throughput and faster checkpoints.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 222 ++++++++++++++++++++++++----
 src/backend/utils/probes.d          |   2 +-
 2 files changed, 196 insertions(+), 28 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0ee7feba7ba..7a0284973e0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -513,6 +513,7 @@ static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
 static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
+static uint32 CheckpointerMaxBatchSize(void);
 static void BufferSync(int flags);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
@@ -3355,7 +3356,6 @@ TrackNewBufferPin(Buffer buf)
 static void
 BufferSync(int flags)
 {
-	uint32		buf_state;
 	int			buf_id;
 	int			num_to_scan;
 	int			num_spaces;
@@ -3367,6 +3367,8 @@ BufferSync(int flags)
 	int			i;
 	uint32		mask = BM_DIRTY;
 	WritebackContext wb_context;
+	uint32		max_batch_size;
+	BufWriteBatch batch;
 
 	/*
 	 * Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3397,6 +3399,7 @@ BufferSync(int flags)
 	for (buf_id = 0; buf_id < NBuffers; buf_id++)
 	{
 		BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+		uint32		buf_state;
 
 		/*
 		 * Header spinlock is enough to examine BM_DIRTY, see comment in
@@ -3537,48 +3540,196 @@ BufferSync(int flags)
 	 */
 	num_processed = 0;
 	num_written = 0;
+	max_batch_size = CheckpointerMaxBatchSize();
 	while (!binaryheap_empty(ts_heap))
 	{
+		BlockNumber limit = max_batch_size;
 		BufferDesc *bufHdr = NULL;
 		CkptTsStatus *ts_stat = (CkptTsStatus *)
 			DatumGetPointer(binaryheap_first(ts_heap));
+		int			ts_end = ts_stat->index - ts_stat->num_scanned + ts_stat->num_to_scan;
+		int			processed = 0;
 
-		buf_id = CkptBufferIds[ts_stat->index].buf_id;
-		Assert(buf_id != -1);
+		batch.start = InvalidBlockNumber;
+		batch.max_lsn = InvalidXLogRecPtr;
+		batch.n = 0;
 
-		bufHdr = GetBufferDescriptor(buf_id);
+		while (batch.n < limit)
+		{
+			uint32		buf_state;
+			XLogRecPtr	lsn = InvalidXLogRecPtr;
+			LWLock	   *content_lock;
+			CkptSortItem item;
 
-		num_processed++;
+			if (ProcSignalBarrierPending)
+				ProcessProcSignalBarrier();
 
-		/*
-		 * We don't need to acquire the lock here, because we're only looking
-		 * at a single bit. It's possible that someone else writes the buffer
-		 * and clears the flag right after we check, but that doesn't matter
-		 * since SyncOneBuffer will then do nothing.  However, there is a
-		 * further race condition: it's conceivable that between the time we
-		 * examine the bit here and the time SyncOneBuffer acquires the lock,
-		 * someone else not only wrote the buffer but replaced it with another
-		 * page and dirtied it.  In that improbable case, SyncOneBuffer will
-		 * write the buffer though we didn't need to.  It doesn't seem worth
-		 * guarding against this, though.
-		 */
-		if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
-		{
-			if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+			/* Check if we are done with this tablespace */
+			if (ts_stat->index + processed >= ts_end)
+				break;
+
+			item = CkptBufferIds[ts_stat->index + processed];
+
+			buf_id = item.buf_id;
+			Assert(buf_id != -1);
+
+			bufHdr = GetBufferDescriptor(buf_id);
+
+			/*
+			 * If this is the first block of the batch, then check if we need
+			 * to open a new relation. Open the relation now because we have
+			 * to determine the maximum IO size based on how many blocks
+			 * remain in the file.
+			 */
+			if (!BlockNumberIsValid(batch.start))
+			{
+				Assert(batch.max_lsn == InvalidXLogRecPtr && batch.n == 0);
+				batch.rlocator.spcOid = item.tsId;
+				batch.rlocator.dbOid = item.dbId;
+				batch.rlocator.relNumber = item.relNumber;
+				batch.forkno = item.forkNum;
+				batch.start = item.blockNum;
+				batch.reln = smgropen(batch.rlocator, INVALID_PROC_NUMBER);
+				limit = smgrmaxcombine(batch.reln, batch.forkno, batch.start);
+				limit = Min(max_batch_size, limit);
+				limit = Min(GetAdditionalPinLimit(), limit);
+				/* Guarantee progress */
+				limit = Max(limit, 1);
+			}
+
+			/*
+			 * Once we hit blocks from the next relation or fork of the
+			 * relation, break out of the loop and issue the IO we've built up
+			 * so far. It is important that we don't increment processed
+			 * because we want to start the next IO with this item.
+			 */
+			if (item.dbId != batch.rlocator.dbOid)
+				break;
+
+			if (item.relNumber != batch.rlocator.relNumber)
+				break;
+
+			if (item.forkNum != batch.forkno)
+				break;
+
+			Assert(item.tsId == batch.rlocator.spcOid);
+
+			/*
+			 * If the next block is not contiguous, we can't include it in the
+			 * IO we will issue. Break out of the loop and issue what we have
+			 * so far. Do not count this item as processed -- otherwise we
+			 * will end up skipping it.
+			 */
+			if (item.blockNum != batch.start + batch.n)
+				break;
+
+			/*
+			 * We don't need to acquire the lock here, because we're only
+			 * looking at a few bits. It's possible that someone else writes
+			 * the buffer and clears the flag right after we check, but that
+			 * doesn't matter since StartBufferIO will then return false.
+			 *
+			 * If the buffer doesn't need checkpointing, don't include it in
+			 * the batch we are building. And if the buffer doesn't need
+			 * flushing, we're done with the item, so count it as processed
+			 * and break out of the loop to issue the IO so far.
+			 */
+			buf_state = pg_atomic_read_u32(&bufHdr->state);
+			if ((buf_state & (BM_CHECKPOINT_NEEDED | BM_VALID | BM_DIRTY)) !=
+				(BM_CHECKPOINT_NEEDED | BM_VALID | BM_DIRTY))
+			{
+				processed++;
+				break;
+			}
+
+			ReservePrivateRefCountEntry();
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+			PinBuffer(bufHdr, NULL, false);
+
+			/*
+			 * There is a race condition here: it's conceivable that between
+			 * the time we examine the buffer header for BM_CHECKPOINT_NEEDED
+			 * above and when we are now acquiring the lock that, someone else
+			 * not only wrote the buffer but replaced it with another page and
+			 * dirtied it.  In that improbable case, we will write the buffer
+			 * though we didn't need to.  It doesn't seem worth guarding
+			 * against this, though.
+			 */
+			content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+			/*
+			 * We are willing to wait for the content lock on the first IO in
+			 * the batch. However, for subsequent IOs, waiting could lead to
+			 * deadlock. We have to eventually flush all eligible buffers,
+			 * though. So, if we fail to acquire the lock on a subsequent
+			 * buffer, we break out and issue the IO we've built up so far.
+			 * Then we come back and start a new IO with that buffer as the
+			 * starting buffer. As such, we must not count the item as
+			 * processed if we end up failing to acquire the content lock.
+			 */
+			if (batch.n == 0)
+				LWLockAcquire(content_lock, LW_SHARED);
+			else if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
 			{
-				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-				PendingCheckpointerStats.buffers_written++;
-				num_written++;
+				UnpinBuffer(bufHdr);
+				break;
 			}
+
+			/*
+			 * If the buffer doesn't need IO, count the item as processed,
+			 * release the buffer, and break out of the loop to issue the IO
+			 * we have built up so far.
+			 */
+			if (!StartBufferIO(bufHdr, false, true))
+			{
+				processed++;
+				LWLockRelease(content_lock);
+				UnpinBuffer(bufHdr);
+				break;
+			}
+
+			buf_state = LockBufHdr(bufHdr);
+			lsn = BufferGetLSN(bufHdr);
+			buf_state &= ~BM_JUST_DIRTIED;
+			UnlockBufHdr(bufHdr, buf_state);
+
+			/*
+			 * Keep track of the max LSN so that we can be sure to flush
+			 * enough WAL before flushing data from the buffers. See comment
+			 * in DoFlushBuffer() for more on why we don't consider the LSNs
+			 * of unlogged relations.
+			 */
+			if (buf_state & BM_PERMANENT && lsn > batch.max_lsn)
+				batch.max_lsn = lsn;
+
+			batch.bufdescs[batch.n++] = bufHdr;
+			processed++;
 		}
 
 		/*
 		 * Measure progress independent of actually having to flush the buffer
-		 * - otherwise writing become unbalanced.
+		 * - otherwise writing becomes unbalanced.
 		 */
-		ts_stat->progress += ts_stat->progress_slice;
-		ts_stat->num_scanned++;
-		ts_stat->index++;
+		num_processed += processed;
+		ts_stat->progress += ts_stat->progress_slice * processed;
+		ts_stat->num_scanned += processed;
+		ts_stat->index += processed;
+
+		/*
+		 * If we built up an IO, issue it. There's a chance we didn't find any
+		 * items referencing buffers that needed flushing this time, but we
+		 * still want to check if we should update the heap if we examined and
+		 * processed the items.
+		 */
+		if (batch.n > 0)
+		{
+			FlushBufferBatch(&batch, IOCONTEXT_NORMAL);
+			CompleteWriteBatchIO(&batch, IOCONTEXT_NORMAL, &wb_context);
+
+			TRACE_POSTGRESQL_BUFFER_BATCH_SYNC_WRITTEN(batch.n);
+			PendingCheckpointerStats.buffers_written += batch.n;
+			num_written += batch.n;
+		}
 
 		/* Have all the buffers from the tablespace been processed? */
 		if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -6414,6 +6565,23 @@ IsBufferCleanupOK(Buffer buffer)
 	return false;
 }
 
+/*
+ * The maximum number of blocks that can be written out in a single batch by
+ * the checkpointer.
+ */
+static uint32
+CheckpointerMaxBatchSize(void)
+{
+	uint32		result;
+	uint32		pin_limit = GetPinLimit();
+
+	result = Max(pin_limit, 1);
+	result = Min(pin_limit, io_combine_limit);
+	result = Max(result, 1);
+	Assert(result < MAX_IO_COMBINE_LIMIT);
+	return result;
+}
+
 
 /*
  *	Functions for buffer I/O handling
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 36dd4f8375b..d6970731ba9 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -68,7 +68,7 @@ provider postgresql {
 	probe buffer__checkpoint__sync__start();
 	probe buffer__checkpoint__done();
 	probe buffer__sync__start(int, int);
-	probe buffer__sync__written(int);
+	probe buffer__batch__sync__written(BlockNumber);
 	probe buffer__sync__done(int, int, int);
 
 	probe deadlock__found();
-- 
2.43.0
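
As a rough illustration of the batching idea in 0006 (this is not the patch's code), the sketch below walks a sorted array and counts how many consecutive items can go into one combined write: same relation, contiguous block numbers, capped at a limit. `ToyItem` and `toy_run_length` are hypothetical names invented for this example, and the sketch ignores everything else the real loop handles (pins, content locks, LSN tracking, tablespace, database and fork checks, `smgrmaxcombine()`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a sorted checkpoint item. */
typedef struct ToyItem
{
    uint32_t    relNumber;
    uint32_t    blockNum;
} ToyItem;

/*
 * Starting at items[start], count how many consecutive items belong to the
 * same relation and have contiguous block numbers, up to limit.
 */
static size_t
toy_run_length(const ToyItem *items, size_t nitems, size_t start, size_t limit)
{
    size_t      n = 0;

    assert(start < nitems);
    assert(limit >= 1);

    while (n < limit && start + n < nitems)
    {
        const ToyItem *item = &items[start + n];

        /* A different relation ends the run. */
        if (item->relNumber != items[start].relNumber)
            break;
        /* A gap in block numbers ends the run. */
        if (item->blockNum != items[start].blockNum + (uint32_t) n)
            break;
        n++;
    }
    return n;
}
```

Because the first item always matches itself, the function returns at least 1, which mirrors the "guarantee progress" clamp in the patch.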

v7-0007-WIP-Refactor-SyncOneBuffer-for-bgwriter-only.patchtext/x-patch; charset=US-ASCII; name=v7-0007-WIP-Refactor-SyncOneBuffer-for-bgwriter-only.patchDownload

From ab750885bc1816b0e6f01173b62f3296b2ea21a1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 16:16:58 -0400
Subject: [PATCH v7 7/7] WIP: Refactor SyncOneBuffer for bgwriter only

Only bgwriter uses SyncOneBuffer now so we can remove the
skip_recently_used parameter and make it the default.

5e89985928795f243 introduced the pattern of using a CAS loop instead of
locking the buffer header and then calling PinBuffer_Locked(). Do that
in SyncOneBuffer() so we can avoid taking the buffer header spinlock in
the common case that the buffer is recently used.
---
 src/backend/storage/buffer/bufmgr.c | 96 +++++++++++++++++------------
 1 file changed, 56 insertions(+), 40 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7a0284973e0..cf515a4d07a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -515,8 +515,7 @@ static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
 static uint32 CheckpointerMaxBatchSize(void);
 static void BufferSync(int flags);
-static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
-						  WritebackContext *wb_context);
+static int	SyncOneBuffer(int buf_id, WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
 static void AbortBufferIO(Buffer buffer);
 static void shared_buffer_write_error_callback(void *arg);
@@ -4003,8 +4002,7 @@ BgBufferSync(WritebackContext *wb_context)
 	/* Execute the LRU scan */
 	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
 	{
-		int			sync_state = SyncOneBuffer(next_to_clean, true,
-											   wb_context);
+		int			sync_state = SyncOneBuffer(next_to_clean, wb_context);
 
 		if (++next_to_clean >= NBuffers)
 		{
@@ -4067,8 +4065,8 @@ BgBufferSync(WritebackContext *wb_context)
 /*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
- * If skip_recently_used is true, we don't write currently-pinned buffers, nor
- * buffers marked recently used, as these are not replacement candidates.
+ * We don't write currently-pinned buffers, nor buffers marked recently used,
+ * as these are not replacement candidates.
  *
  * Returns a bitmask containing the following flag bits:
  *	BUF_WRITTEN: we wrote the buffer.
@@ -4079,53 +4077,71 @@ BgBufferSync(WritebackContext *wb_context)
  * after locking it, but we don't care all that much.)
  */
 static int
-SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
+SyncOneBuffer(int buf_id, WritebackContext *wb_context)
 {
 	BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
 	int			result = 0;
+	uint32		old_buf_state;
 	uint32		buf_state;
 	BufferTag	tag;
 
-	/* Make sure we can handle the pin */
-	ReservePrivateRefCountEntry();
-	ResourceOwnerEnlarge(CurrentResourceOwner);
-
 	/*
-	 * Check whether buffer needs writing.
-	 *
-	 * We can make this check without taking the buffer content lock so long
-	 * as we mark pages dirty in access methods *before* logging changes with
-	 * XLogInsert(): if someone marks the buffer dirty just after our check we
-	 * don't worry because our checkpoint.redo points before log record for
-	 * upcoming changes and so we are not required to write such dirty buffer.
+	 * Check whether the buffer can be used and pin it if so. Do this using a
+	 * CAS loop, to avoid having to lock the buffer header.
 	 */
-	buf_state = LockBufHdr(bufHdr);
-
-	if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
-		BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
+	old_buf_state = pg_atomic_read_u32(&bufHdr->state);
+	for (;;)
 	{
+		buf_state = old_buf_state;
+
+		/*
+		 * We can make these checks without taking the buffer content lock so
+		 * long as we mark pages dirty in access methods *before* logging
+		 * changes with XLogInsert(): if someone marks the buffer dirty just
+		 * after our check we don't worry because our checkpoint.redo points
+		 * before log record for upcoming changes and so we are not required
+		 * to write such dirty buffer.
+		 */
+		if (BUF_STATE_GET_REFCOUNT(buf_state) != 0 ||
+			BUF_STATE_GET_USAGECOUNT(buf_state) != 0)
+		{
+			/* Don't write recently-used buffers */
+			return result;
+		}
+
 		result |= BUF_REUSABLE;
-	}
-	else if (skip_recently_used)
-	{
-		/* Caller told us not to write recently-used buffers */
-		UnlockBufHdr(bufHdr, buf_state);
-		return result;
-	}
 
-	if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
-	{
-		/* It's clean, so nothing to do */
-		UnlockBufHdr(bufHdr, buf_state);
-		return result;
+		if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
+		{
+			/* It's clean, so nothing to do */
+			return result;
+		}
+
+		if (unlikely(buf_state & BM_LOCKED))
+		{
+			old_buf_state = WaitBufHdrUnlocked(bufHdr);
+			continue;
+		}
+
+		/* Make sure we can handle the pin */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+
+		/* pin the buffer if the CAS succeeds */
+		buf_state += BUF_REFCOUNT_ONE;
+
+		if (pg_atomic_compare_exchange_u32(&bufHdr->state, &old_buf_state,
+										   buf_state))
+		{
+			TrackNewBufferPin(BufferDescriptorGetBuffer(bufHdr));
+			break;
+		}
 	}
 
 	/*
-	 * Pin it, share-lock it, write it.  (FlushBuffer will do nothing if the
-	 * buffer is clean by the time we've locked it.)
+	 * Share lock and write it out (FlushBuffer will do nothing if the buffer
+	 * is clean by the time we've locked it.)
 	 */
-	PinBuffer_Locked(bufHdr);
-
 	FlushUnlockedBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
 
 	tag = bufHdr->tag;
@@ -4133,8 +4149,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 	UnpinBuffer(bufHdr);
 
 	/*
-	 * SyncOneBuffer() is only called by checkpointer and bgwriter, so
-	 * IOContext will always be IOCONTEXT_NORMAL.
+	 * SyncOneBuffer() is only called by bgwriter, so IOContext will always be
+	 * IOCONTEXT_NORMAL.
 	 */
 	ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag);
 
-- 
2.43.0

#12

Chao Li

li.evan.chao@gmail.com

3 months ago

In reply to: Melanie Plageman (#11)

Re: Checkpointer write combining

Hi Melanie,

Thanks for updating the patch set. I reviewed 0001-0006 and have a few more small comments. I didn’t review 0007 as it’s marked as WIP.


1 - 0001
```
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -421,6 +421,12 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 /*
  * Internal buffer management routines
  */
+
+
+/* Note: these two macros only work on shared buffers, not local ones! */
```

Nit: two empty lines were added here; I think only one is needed.

2 - 0002
```
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+				  bool from_ring, IOContext io_context)
+{

-	/* Find smgr relation for buffer */
-	if (reln == NULL)
-		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	LWLock	   *content_lock;
```

Nit: the empty line after “{“ should be removed.

3 - 0003
```
+/*
+ * Return the next buffer in the ring or InvalidBuffer if the current sweep is
+ * over.
+ */
+Buffer
+StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
+{
+	if (++(*sweep_cursor) >= strategy->nbuffers)
+		*sweep_cursor = 0;
+
+	return strategy->buffers[*sweep_cursor];
+}
```

The function comment feels a bit confusing: the function doesn’t really perform a sweep, it is just a getter, and an InvalidBuffer return only implies that the current sweep is over.

Maybe rephrase to something like: “Return the next buffer in the ring. If InvalidBuffer is returned, that implies the current sweep is done.”
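
To make the "just a getter" point concrete, here is a minimal standalone model of the cursor logic. `ToyRing` and `toy_sweep_next` are hypothetical names for this sketch, not the PostgreSQL definitions, and 0 merely plays the role of InvalidBuffer.

```c
#include <assert.h>

/* Hypothetical stand-in for the strategy ring; not the PostgreSQL struct. */
typedef struct ToyRing
{
    int         nbuffers;
    int         buffers[4];     /* 0 plays the role of InvalidBuffer */
} ToyRing;

/* Advance the cursor with wrap-around and return the buffer in that slot. */
static int
toy_sweep_next(const ToyRing *ring, int *sweep_cursor)
{
    assert(ring->nbuffers > 0);

    if (++(*sweep_cursor) >= ring->nbuffers)
        *sweep_cursor = 0;

    return ring->buffers[*sweep_cursor];
}
```

The function only advances and reads; deciding that an empty slot means the sweep has ended is left to the caller.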

4 - 0003
```
static BufferDesc *
NextStratBufToFlush(BufferAccessStrategy strategy,
					Buffer sweep_end,
					XLogRecPtr *lsn, int *sweep_cursor)
```

“Strat” is confusing; I assume it is short for “Strategy”. Since this is a static function and the other function names all spell out “Strategy”, why not use the whole word in this function name as well?

5 - 0004
```
+uint32
+StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
+{
+	uint32		max_possible_buffer_limit;
+	uint32		max_write_batch_size;
+	int			strategy_pin_limit;
+
+	max_write_batch_size = io_combine_limit;
+
+	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+	max_possible_buffer_limit = GetPinLimit();
+
+	max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+	max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+	max_write_batch_size = Max(1, max_write_batch_size);
+	max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
+	Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
+	return max_write_batch_size;
+}
```

This implementation is hard to understand. I tried to simplify it:
```
uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
	int			strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
	uint32		max_write_batch_size = Min(GetPinLimit(), (uint32) strategy_pin_limit);

	/* Clamp to io_combine_limit and enforce minimum of 1 */
	if (max_write_batch_size > io_combine_limit)
		max_write_batch_size = io_combine_limit;
	if (max_write_batch_size == 0)
		max_write_batch_size = 1;

	Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
	return max_write_batch_size;
}
```
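
The clamp can also be tried outside the server. The sketch below is a toy model with the three inputs passed as plain parameters; `toy_max_write_batch_size` and its parameter names are made up for this example and stand in for `GetPinLimit()`, `GetAccessStrategyPinLimit()` and `io_combine_limit`.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the clamp: take the smaller of the two pin limits, cap it at
 * combine_limit, and never return less than 1.
 */
static uint32_t
toy_max_write_batch_size(uint32_t pin_limit, int strategy_pin_limit,
                         uint32_t combine_limit)
{
    uint32_t    result;

    assert(strategy_pin_limit >= 0);
    assert(combine_limit >= 1);

    result = (uint32_t) strategy_pin_limit;
    if (pin_limit < result)
        result = pin_limit;
    if (result > combine_limit)
        result = combine_limit;
    if (result == 0)
        result = 1;
    return result;
}
```

For example, limits of 100 and 32 with a combine limit of 16 give 16, and a pin limit of 0 still gives 1. The sketch leaves out the final Assert; if io_combine_limit is allowed to equal MAX_IO_COMBINE_LIMIT, a strict `<` there would presumably need to be `<=`, but I have not verified that.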

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#13

BharatDB

bharatdbpg@gmail.com

3 months ago

In reply to: Chao Li (#12)

4 attachment(s)

Re: Checkpointer write combining

On Thu, Oct 16, 2025 at 9:55 AM Chao Li <li.evan.chao@gmail.com> wrote:

[...]

Hello All,

Following up on the previous mails, I went through the changes and applied
the patches to the source tree; I did not find any significant bug. I also
ran some verification tests for the recent changes related to batched write
statistics during checkpoints. Below are my observations and results:

Test Setup

- PostgreSQL version: 19devel (custom build)
- OS: Ubuntu Linux
- Port: 55432
- Database: postgres
- Test tool: pgbench
- Duration: 120 seconds
- Command used: pgbench -c 4 -j 4 -T 120 -p 55432 -d postgres

Log Output

After running the workload, I triggered a manual checkpoint and checked the
latest log entry:
2025-10-28 16:53:05.696 IST [11422] LOG: checkpoint complete:
wrote 1383 buffers (8.4%), wrote 3 SLRU buffers;
write=0.023 s, sync=0.017 s, total=0.071 s;
sync files=8, longest=0.004 s, average=0.003 s;
distance=33437 kB, estimate=308790 kB;

Observations:

| Metric                    | Value   | Source   | Interpretation                          |
|---------------------------|---------|----------|-----------------------------------------|
| Buffers written           | 1383    | From log | Consistent with moderate workload       |
| Checkpoint write time     | 0.023 s | From log | Realistic for ~11 MB write              |
| Checkpoint sync time      | 0.017 s | From log | Reasonable                              |
| Total checkpoint time     | 0.071 s | From log | ≈ write + sync + small overhead         |
| CHECKPOINT runtime (psql) |         |          | Fast, confirms idle background activity |

The total time closely matches the sum of write and sync times, with only a
small overhead (expected for control file updates).
Checkpoint stats in pg_stat_checkpointer were also updated correctly, with no
missing or duplicated values.

Expected behavior observed:

- write + sync ≈ total (no zero, truncation, or aggregation bug)
- Buffer counts and timing scale realistically with workload
- No evidence of under- or over-counted times
- Checkpoint stats properly recorded in log and pg_stat_checkpointer

Math check:
0.023 s + 0.017 s = 0.040 s; total = 0.071 s, so the difference is about
0.03 s of overhead, which is normal for control file and metadata writes.
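
The arithmetic above can be reproduced with a small sketch. It is illustrative only: the function names are hypothetical, times are taken in whole milliseconds to avoid floating-point rounding, and the 8 kB block size is the PostgreSQL default, which is an assumption here because the test used a custom build.

```c
#include <assert.h>

/* Assumed default PostgreSQL block size (BLCKSZ) in kB; custom builds may differ. */
#define SKETCH_BLOCK_KB 8

/* Volume written by the checkpoint, in kB, for a given buffer count. */
long
checkpoint_write_kb(long buffers_written)
{
    assert(buffers_written >= 0);
    return buffers_written * SKETCH_BLOCK_KB;
}

/* Time not attributed to the write or sync phases, in milliseconds. */
long
checkpoint_overhead_ms(long total_ms, long write_ms, long sync_ms)
{
    return total_ms - (write_ms + sync_ms);
}
```

For the log entry above, checkpoint_write_kb(1383) gives 11064 kB (roughly 11 MB) and checkpoint_overhead_ms(71, 23, 17) gives 31 ms.
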
Comparison to Prior Reports:

| Test                | Pre-Patch | Post-Patch | Difference |
|---------------------|-----------|------------|------------|
| Checkpoint duration | 6.5 s     | 5.0 s      | −23%       |
| Heavy workload test | 16 s      | 8 s        | −50%       |
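
As a quick check of the Difference column, the following sketch computes the relative change between the pre-patch and post-patch durations. The function name is hypothetical, and reading the column as a percentage is an assumption inferred from the durations themselves.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent (negative means faster). */
double
percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

Under that reading, going from 6.5 s to 5.0 s is a change of about −23.1 percent, and going from 16 s to 8 s is exactly −50 percent.
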

Result: Checkpoint timing was consistent and stable under a moderate pgbench
load, which suggests the patch is integrated and functioning correctly.

Final Status:

Attachments:

Screenshot from 2025-10-28 16-54-20.png (image/png)
�HZ9a������:�H�������Z>"N
����f��,�h�������s�W���	W�a��V�d����/�^!�B!>������ou7��<H��<�>�Y����f;���k��?!�B!�m��������n���W�b�,.��{������S�1�����N�,�����b�`Q���`P���
!�B!��
���vW��\�D��<�~���v����n�H���0'z��I�B!�B,��Yf,�B!���!���B!�B|�H2+�B!����w2k��p��1<|���)_B��cc��.�ic�z~����B!����;�M�UBA��O����G���9�������������I}���l��n��=�M�z���1����~�<���]c��9�t����w�h�x�3�2_������I���)��16v��3��U �B!��)�e��==<��V�����{�������C�)����JM[>�J����?�������~�
F�U��Z����
�8C��yb�^�����P%��B!����|4������+����|�����������P��1�5�/u�3��]�V���
��~U^�Tz�h%0��}����:.�+:�h���>��{��Q��_�^�I��K�<M(�E�u��uc7���a�{O����\���c��V#z<�x�I�B���Q�����
<���f����4+n�,��q�:JMQFN�2����-g����"�}G��L_�a�����|���XH0�7����C�NG��(=�PF��G0�P��0�%��wbh����]F���������q��H/���u�B!���sf6�a_e%_�2��{������{�QZ/���a������4UVR�D/���#�TVf�f�k�l5�]���H�������7R�Y�VZ�h�:�^��=���M�Mq�e$n�{!�}��M'I)4?L������'�qg�)�U���m�i'Ff���Q��#�M�����6!>��C�r2l�[�!�}�^���W���ea�u-�o����?�f�5�qO2]����4����������{2����Q2����-��T���K��������&��o��w��vS!�B���93�=�t8�}D��h����mF��i���-�}�?��[M���09����� ���s3�	�+�{��N0�=9���~|%��O���g��Z	��6"�O2��4k�'�^�W�d�/�#�BmN��T2��#bM���Gb��8��V0��lf�<u����S�����wp�d�����[b�H��>�2K�=u>\�f��W����w�e�^��#���l7�.\�6���f�dx�� �q3N�B!����&�|����1b\-�g9�5�ua]h3��I�;��FV�x������E��Fc:i��e����\���q�������;������e��P������N����gnH��i0���\b����8�`#����*���q�406v�e:8N(n��g2��y�:W��g;Q�f��Q����'�����,���"�\x���U'���B!���)��������h?���4�|t
�mj���-�%Y��j����#���y������1~k���x�QY����O��~l��U��Z���w�^z��+\��������*���aL���8�m�d2wL�������c4��~c�}(����p01���c-�����Q��*JJK��w����xk`�u!�B���#�M���T��&�4<�:����������:���V�h���x�+g?S�)3��������������5�xM�R�� ���EY��������_.y�_6�.3N�N�t�������`
��=]�~��B"��tij&��[�������������bd�is��N��N�2Tv����M�7o6!�B!V���<[I�O�p����\JCKY)����crS_�e�;UM���Fq����v�<�xhl����a�9)���s������/q`s(T5���B����7�|@�Q�Fs�������Qu������7g�����+8���\b��U~bZ^kt�4m�O�Y@�O�%�gze8{�(�.OSI�>����j����th��%�������cn���3��QPj�)l7e�<�8j��Y�%�>?~_	N���M���9�I"+�B!�\����Sc4�o����i���8���0��AZ�{���G�d/���G�8�K�J�QG�Q]�qL�����O�����������o�����$��\b(�9�{���.ZNQk�d4�K���6��|l���T�u�8e7C2�����cUr�0���������qkf������Qj����!9�:�M}KN���c���h���.^����O��@���n����u�H�Qgv:����9��n�9�fM�����bf�x4Dog���-�B!D6�������:!�B!�,���LE��v�_�z�eq)����X!�B!�XIf�B!�B�q$�B!�Bq��dV!�B!�G�Y!�B!�wIf����w0�|���<-��pO~�]P�q�s-��|�-������%4�16�����6��@��g�me|���!�B"�sf���:[�t����lqD���������Q�{ko��s6m����+E��e:2���>B��
L
5�~�V�NrZ%8�K���0���v#��:�'��>���y���eG��e_O8���GsC9n��"���c��/y�k��9��T5S�+�i�b&IT��P��to�B]���{�3�q�P�@h������o����B����������;�o ���-�Mr���o�R� ������P!�Bd�_2{'Kj�DI�e
������Q�P������<�N������<�4���
.���N.%M(u-ni!~�Fs%|�'��b5.�dT���	N.~������B������0@\%��=���6r�����v��C���1:��7��a������w��[��8���
~�����X�;B�����Ev���nq|��YNU�����5"�Bq������Sgp����0F&]R�}�����.P�_d�TJ���!#�����j�u��uc7�l\#2�K����b��^�-��Q��V���l���s}L��@�����|rp:��L|r�@$��N�S[QC���h"�4G"ai�[^�D��e|f��)
5��xDF���2J?�q=�-��7�������A��G�0��7�W�uaH
v`���@/WB�`�����j���
��%:;�8{��������@���)%�R'=���p�u��0��'	�2��w\���������=��H���%�mm�����7��>�0�;���,�����\�[<~�yp9�X��GC���m���Q��_�^�I��K�<�0����d�\���c��V#z<�x��e++��������eU��7�4s����Yq;f	
�c��Qj�2r����?�\�����+���2��UB!������0M��T>����y�H%���TV.� o�Z��a������ZSn����8��`w��7��;B4��n_���z+~����Y��IE��9%���/�4]G7�)re���%�*e�����-��T��n`���
��&-{���9w�����c5(��W��<�����z���:�1��� �F����F���V����e�P#i���UV��� +���������g������hlX�7��?3w�a�	FR
��S����y��M����_Z�-������"x��_���j�l5�]���H�������7R�Y�VZ�h�:�^��=���M�Mq������d�\��5�8Vmf�����M�����F�3��6a|7�k���0�=���o��H�zU;>_i~�7OS�t�"�U!�����e��(��fb��yB�m���Gg��,�3���Kh���v&���Q���H@O����ia�����AN��N�Q"���|�,����p��
�w�������2��#���l7�5�Y��8���bv�����(�D
��KmCm��3��o�Fy�o�B}��d�_Rl IDAT�3�d-S�����L.�/�j�<n@�TRO�'����
�d4�2��!:�����jl(H������B���x����7zr��g��J�'n�
�+�����Od<�V]>����N�tA�x���R�7a�����R�tw��&H�W����5>k�J������P���FO0������@_,G|����m�d�I5F��@wG����q�{�`�=~�!)(q����B!���dV'_�X��hIPE��`���^v�F�D�	j����Z~��P7�O��h-�������Lr7���������,��qBq;���c����sE��5i�����[�?}���$|�j���������aV�F"DRV^l��w��	��s�o)_K�����W������wA"2E����`_���o(DTq@���:���lb�z��$)����A�IG�,�>g4��F�]v�X�:�-�'��7����������]����$�&��p;@Y}��<�O��Y�����j�u���^��l�
���W�M����I�A�u����O��`�k�������_!��&���Y��^�����(�����UQRZ�������[7Z�������{0=��M�b��bD�\����	�J�����0&���Le���d2	�r��\D{�1�d��>����V8��?2y������IE��)�f5�����%��
Ro
r�u3��$)�����S*u(.�c��!%�v.+Z�fmK����MK"n
m�����}0��Hgv����6�i��S0=��9v���S����*�����e(X������B!n����4��l�����,<�l:������N��N�2Tv����MV�t��Zf�Z��&R3�5�o
�j
&L���n���,$�O�z)����\Zkp���"�����U�7������AN���n��hz�h�0�&b�^W������a�
����G�hdk��^Y�4�Sf�Bs_"XYu������Y�m���l�����Y6��
�5���Q����rb3���.]�a��nZDC�.�c_9����)H<��[)Vl����5�y[x�m���>��vS
-&�?	!������<=N<e�]V��qW5U
����z<��3I�����+�i�`�����0'�����X$Bd�����_��1?Ue����q����Y���N��*�}E�S��q)��PfG��Q�((5��6��2s5�|����g�����M�|n����7~�PB�||n���z�7z1M�-1v8��]vL�����v���:��@Y��4k��n�n7n�����
'�++g�#�$n�}r�uM���(����%���&���4v_#�%l���:�T���}��Hcq9es������y�n�&�h|)
-e��3w�Mn��W�?�j:.o5�c}��i�3����<N6���*?>7�s~
�s>���i��\�as�T��:t��W�����������3���gRJp�TBS��B!D6y���������#F�?�D'y�"��fN���UF�O.l ��5��;L����Y�����[���B���w����Q���R{���qt�(5uT����F�����s��cc=u/�h����p��nN�c7c���+�?tc�s	��>�f#�l�h�%�{�<��Y��o�R4�{m/�����yj ��x��<�-�c4�i8^�$�$[������DA4��*�����)5
�8P�	H'k��kP��{���
K6<n����i�6P��/G\}Oq��EC�)j����x����S�~�!�e���@<�d�S���+)����Wh�,�e/���G�8�K�J�QG�Q]�qL�����O�����f&�nc�/4`%�t�Cs�`��S��@�����q�n�d-:����j��(`|�-�_!����;���)�PH]Z�X"!�B��������^H���������G�/yZ�Ov��f�\%��38jx��R��Z	�*c!�B����+S������W�^�bY\)��2c!�z�$���V�!���,V�Po�$�B!���6�9�B������
!�3�(�g/�B�[��e�B!�B!�z�2c!�B!�8�d����/?���[���}|o�����	!�B�N�Nf��6b��v����"�u3��L���\����e�j��?Zw���b(.���"�{gZ�����=>!6�����y�%���V#��Z:�nu7(�yp����O��t��z�\8E��V"������]lK_��q��m���1~�����W�w���7nuX�2v�������x���[����w=f�����/�~n���~���{1����H����B�����?���X�������������-��y�������&I��x�a���:�>���?�wm����t����?}
~R�/&��{���/>��s��w�������������%��~����C@���E53o������v�o��/w�R����U���'Y\kp�������3�����y}�����k?����o���_7�h<�����/�F�e�l������7^?�k���!L��JzPh��~��YEI� �+�?!y�Y����:��r�e��3��<3�yzI}
^7��>���>&�����f�|�8�V�$��� �A������l
��OrZ��(�P;w|��q���d�o����l�k��R����n?�_���]\������������Jv��������m�[5������s���������������2���;����+`\k����6�������������a�����k���������7&�,_��j�k�S���.v�qs���wH2�o�N�*z����Cd_�l|����^��M�+Ww����=���o��o���?mU�6�e���[����.��?���??�xx�������������������d����M'������>u���z�7�������f���o8���&��
.��9�3O���d�����F5f�N��
�u�|�oK�*��;���e��?q��VGq	����;\���>����a���c��MOsu�G�?I�>nx�]�3;���/_f�cg;)��������?XVG���6��p��w���q��7��������'>�v�e�~w��g3�����������]��Y�;�Fi�=�����c�{�Z�Gs�����0>�	��Ml7��_%������/��
v}�.����_�}�������/~��S�B�7O;�����|;?]�������?������]c��$�~*��'���x������>���C|�|z�^�O���\����?�����_������!R/�������'m��i�M�����;�%��g��	xpyy�?�f��H�l<S-���(�_k��~�g������������[,�C�\~k��NH'����o��/|����^�ot5S������������^���C�����������O�r���e��'ykW7��� �
b��3�}��}������?�dgq������������#�Z7�6�Z~�Wf��O~��<���5��������{����������c';>|���h���j�������������d�`G-��3L���|���>�y�f���Wx��of/_�
��������<�r�Lj���jz���������`��=����~�J?�/b���f���Y����oV;��������u������~���������'��	�>��_l�^�������~����)�����f�������x������>��/L����
w�(�O�1��+�y�����7S���N��*�M�R�he���co�s������4�R�0aH��>�@x����C]4x3��l\#2�K���Zy�)<.L�(��)lu{������w�O�9�4�H�)�\�M]�����:��y��h�s���
��b�vP5hm]�n�)����*�;``�����������"��)��W����N<f��$�Xz���[
��60��������D+��&���Y�����R�b��������L�+@w�E]�����Z3lSj���VpX����}'�[2����"���uC�c���Y�f��zq�?���-;����/A|�����nub�o�a��+��E�r���M��e�W����s�����{��?C��?s��+l/��r{���O@2F;�;��I���H��g\���}$�ag���k?Y�he�������kS��&��`x���;��e7���\��w���/����`��A����6���7/����N�k�������/�7�_�������A'LMq��orM�����_����4W~�3���6�������������_���)��,����7G0�M#��#v8��6��|����0���>������={�%�{�b(}�����>�^�-��f7�O?���t�G�}W����!v^��n��l��|�k���/�5���%��}��y��oG��G���/q�_Fx�
l������Sc������yv��
�c�c���]��S��}E�������S\�s�[��������b|���y��O=���
��{'����?[�\�����g�������>��|�{��'k��|��~�C~���������)�U��W����;j�����Zm!T�{�����]�����l�S����C������OI��(wy���r_�w}�������}�T�2����s���[�g���}�^���>��?��5b\��}�W�`Sb��?�����U�g�T��#���p9��"����A��
��1vF��
���=�8�����0�s��X|]�B������h�!��o���J����|v�Y����z{x����A�s��.�{�S�jP,�;^���<�Pt�T�2���rs2J4�F,F3�)��������.����4u�c�zY�OKV����?3w}a�	FR
��S2w{t��'1LSe%�O���q^=RIee%����lI� \C�i�`._=~���:e������Q�_���9��j3���4�o���,��6��y�'�:d_7��bv��5�����`�8�����\e>�~5���~�ym���������'��{�����R\y�;��]e�_�CO=��?r�O�`�}7�6W�;A���X������^f��#�zp;�vpE���K?E���.bg��k�������wr��c����y�����w������v�M���*����=\p7�^��?����l�����������o���ky����Odf_����p�"�?����g�<��������:������%���:���9w�r��!�B�~��������%�7#��_��l���{l+z���.e��]����W�m_���!��������W���n���3��x���=n������]_���N�S���]>�o�����?��|�����w~�_��Sy���~�x��%v��}>�<����y��D�kR?9J��x�So�v�l��>����������q���_�����Q �o�I�.����/�n�8�w�"u��|�a'r�a:G�s������'���S�H��+\���+�\m|��vb����#��=5�;���
/�����?��{������c����/M
�����$��7���Z>8�[�Qp�`v}�
�o�����������k�������������_�������uv�_�c�w)���3���[���+���M���_,������K������w<����X?�Z����=-X\��o�����Bos>��j}�K?���^���b�����j��T�:����6-p�Pm>�����6$����Nh�
*�{��FN0�j�F_���K7�Fe�?���/MF�%R^j�h�G�������� �N���������`�3����l�%�Z�dR��q�v.$s1�Q�1������C<��9�.���Q�-"8�p}��b�Y?���t��'w�>�	u��Qu.�a�����EM��6C	��6"�O2��FO0������@_����X"��jkL���ap�����z��SW0��\��G2S���b��
������8��sx���L����q��n�����\����?���\}�M�|���������$	��$�,�m7;���7V��|�����z,��}�w_������;P\�vc��0W5�fu�c2�y����b��4|��?�����S�����o��"�s�+�����������d*���=~�K,Od!����[j��5�����M��0|�7�0>X��/���w9{|s�����wHO]��Wk!?��2q�k�>�������#�U>\�u���^���!>d������c�����G�u�����"�D$Fy��H#%Y"�F�,��������F�k:S<Iu���Oln(,g���O�d��
M�=�%NM�k�S��r�p�y�LG4c�&�&Q���H9e7i�;���s���;��������������~$?=ol���}B?}{V���1z���o����G-����&k|8nd����p�	����b'�������q-�����Xv�K�����07<L��vn\
��b4��X_��C��~��2W��:�����Os���	gC1��@�����Q>h��]�p7w��������I��O8������{����?-0/�]Z�K���M���������	�Ib��D$�k��lM����|9<�ys7�����8;�6��>��}����X���"�g�xu��������_�����n����`��������)=O�@-�
�������b��Fc����
��&�
�D�a�F	y=\��:j����I|�`��o#�L������Mf;O��h"��y��F�������t2�@ @@5��^U�N�)�G]�U���K>>�k���������TU��o�?��U_O����,.����c�a�GV��AU��R��$�k��o�P3��*�D���0�$��r!�������E���[����^��n����]Z|B��}����<��b���q_�8�:t"����~�����S<�_�����&sm>��2����?�9j�N�_����������C�/^g|�0�E�/�9��%������a�����@�t����q�?�'c�DI�7y�I�7=Y�����?���p4�:��M��/�E_�k�%�����U��L�������9grA�w���&�{n��q�1�t3��w.)�B|������%��������=�^
��/f���"o]���~�z&��������{77�����
�K���(J���/M���u��=���!O]�������S�����x��>����vp��_���O�����������eb����W�y��S��Lg�6�uVr�zx=��i���{iI���}�md��~����.�KM�}�����m���%������le��*-7��7Z���C;�*��O��S����a��g�M�o�_�� �'�{�������9�$
iv����U����zhx�G�&
��pV4�*i��OZu/������qeN�f���H
	S�����+��:�3b7��x�
�f^���J�������
�B
���������7]��T��O��\����Al��S[�R��7E}W����sia�L�J�5������]_�x��@f�.�Be�3�#�~�$V2�*��?}�`$������������9��/��f>���g�s�&O
���Z�����<hA��S���3�f��>���B��|tO�oJcz�-d��O���/sm|q�j��B����<�d���9�d�v28-��J�����O_X�G����|@;���?�����N��
d��}j����?�l%o�����M�|�%6�$�1������f�_�s�����n'k���8�#i���a��H�=`���;V��i&Fr���@���q�'�D�����?������J���o�5�D�����|�B�������$c��	�L����*���>�`�.���k���������_���_�y`���-_2V���.��������kz����O5�9&>�a�c�T�0w IDAT��o�d��5���.\�"�e+.���N1���y_/�q.�>`�w�B�����'��������5���>B�Zt��3!����	0:���x��T�&�Zn���m����E���G��[J����;��]��g2N���O�����e'��O��M�Q��1?����tz���Zlpa�"��I#�7��B�@��:��p��I��C��0��R`1C��Rg�a�Cpvi4
t��UK)6��,����t6fT��'N�<������
�~a;O-���_��0�)J����eC��0��X�}�A�Z&�����-�8gv��UaD��a7�/�S�g!���3)x������G?7�D���<��`�`N7�pE3R�0A �^GI�"
�I����0�=Hf~!7��a�|��g�{o-9>��gbb�o&����/_Lq���"�������y�j�,?�0`K5�\��_�f����d���E&������'$S�E���_�����x���^�9�q��7���?r��������T���~�y�d��F��&�vt���k��g�|�^xq��{Y�����L��K�.C�����x����9���v�� �k�~���/�~Kz��|}�_�9���p���C�'?{��{F���7�D��o���N��Mp��>�/5�4�?i��Cd���2�����2nX7���������G���3&U�0�����^|�q*�pn��������?@��8���%����?�)n�bwQ`�*��q�����2:wd���Jk1���O?b���1�|����>�y��?o���~���9w�&�k���3�n�g��s�����>G��nJP��s��9���@����^������}rV?�m�F�~�[��L��?���I����s�����k��%�(ub�2��*�#&��5O����h�4����|~���U�;;�����������	����>_���b�o�n���_�Rt�l���3��~���|�T�W��������59�G�N�x���/�����j~����+�;{w|��tE^g4�	���u����d�m��0������i�������@��QN�UC�&K������)@�o��'~���"VW�o0H3�Kl�F}�?���{G�PVEy�
����J'�z4�����2C���L/N����RQ_D��O��-dse	�����8(��y�:c�.<�Q�����A��'4���r7c~��mT\����^4�;��-�����������@������\��X�}G�����
+��g���c_bRp��O�&����0�<���-��lcP��.w�O7��'=
/���6�F�?��l�Q\���B�P,���HC�AA1���d��'��;Y�
{g-�oL��7����������M����?u:��V5
;�����N3��������0��N��{/�q&��c}���]qY������y��'�����Y��^tO���K^}p�������ed��0��gd}3�W���p�����'�q�����7��?��X�q�}!�{	��J��;�`��wg�����01� ��V����?���"��[��/ ���������n���S�A{-��?�'����8�������%��N�'�?�����'#w5����c&����{n&�B��/�#���
�)��/]��������7H"��M�v&�9���>������?��P|��=;'�>�a
��o��O���g�c��n��a�C��1ca+���?@���?}�[o=���4#I����O�p�C�zS�O�a,"�>kZ�g=x��s�*+7�g�11���|��9�H^y��SI��g?��Zn`-7��������d��s��z��=?J���l�T{}����>����d��7��U�����T��s�w��Z�>Y���8�>g~���9���_����!�����v�g�b����w����J�����L���w�ZV�W@����[|�(�������l��1��i��i�1�O�4���P��rh�-������b������*#���h�w|�t�m��CT6��[
���g�R2�.mm���������l�ri���\r�\����G:BckO��SP}�)����1�o�#�m��:'���s�T�Z4+J_�.�����}t��r��rP#�-�p��
��������Cl��h���}�$�[J�� 6D�q/��{�}T���y���q�����cT���u{q�HW���F��
��s!F	r2K?>!��q�w��s�r���3���]W-��)s�����9��r:���E���d|��s��W�}/+i?qY���'Nr�	�r,��WK�#�������!����c���U�L�s����>���A����_7~���/�/����I�O��~����4�Ws�p#[�#s0�h>�f�J�3����Y���u�r����-��{*�����Tt���_OLL��7�O��kKu�����#e�{�����OG?���Lqu�|���z������,,�!n��GXu���)b����e����x�\m��>W��]��=���s�}�	�����&'>��J&�B\���l�D�	S@�����&����H2{�����C��N2������o�+���*����u���2Vq�X�(����8����>W���O��7kw����V�S>xa	�'_&�#b��c�]dr�X��z�o�rY�|�4^��
qY���wo��lDO��`'��_������3B!�5G���$+Ck���B\:f,�B�e����
!������B!�B�TW9�-��D/�����K��"�����/Sl��Tp������^�����<�B!�B\C�r2;��-�������m�>E��������>nuQ��<�Ot���9	��IU�a^�����U7W�4]Z(�'���o����=�����B!�B\���a���rJ��x����A���x��+�[����u�V��j�_+�~o��y9�x�(*+��a!�B!��r(Wsn��ACE:5���9��F��)��Le�s.����4�������i���ucr�'���z;K������]1Eu���y.oN���sm�b@���nJ��io��`����wQ>zuo�2����=T�9��h��mh�E�Z����v��=4}}|#�}l98���(�!�e(7�����B!�B����3��X�@�������np��Q�X9��L]	x[j������n�3J{�a�Q����g:+)�@���.���
B����%���L�5
Mo&������v�B{�����G_RB�������S��`k#5[k��=��~/V ��O1�,�N�sQbW��S�_
���Yp8d��B!�����0c/%zb�I|1��x6k4�A4H��B$����s�W�AiI>Jw�~�H�G�q��~
��Q#US_;S��@8�B�k�����LF�S�4��l�PPZ;�L�^����D��>�}����g?�A�
�l�P��,��?o}	vu�~,��i����`��VN!�B!V�4����������P���������t��	�h�$�,��+��c� a�hy��x�K��rK}%����1F����f,*`�a�GV��T��BXK$�fc.������]�X��P��(e�TX�i��;�`G|8�%u���4�s��B!��:�F2�y����sOC=4<���x�EE8+�q���x�gn��G��jz1�V
t���C�	u4�b���"M�k��$ks2f-��
���k�t�U���Xi�Pb�r�c8��is`�5���i!�B!�X���3�k�M�sb3i���YW�zh?�HM�)�;)5������b�.n�a����4��.�W16@0&��o����WM�+���_�p��f�d��Kg�`�8��)p$�����5.\_�=:���U��,�V��Y0T����$�B!����7g�VF�&����
�������Z]UT�
����3��Pb#7�������`)���a�duRU�d�Y�~E�VR���wQ��bv8<g���������n��O
���p9q887��lU1_G|E���!GY�T]e�E�^�&6�I���|/��,&VG1�v���p:�T�^����D�/�|:�cpbU���H!�B!�3���=�����C��ha?�-�$���cl�:����2F8��u�������Q��2n5����aK�����������7��O�
S�HG��Eq����C���Ik{^�L���77��f9�;�Ch�q��)3�Bt�s��������������axp��m��9�^C�ZO�{���
�9��Dr��RI�-�����#J�|�S���zj?��[	!�B!�u(��;��LvA|O�F�\�Ma
.�O�Y�}f��'���cG�%O@]Y,�<�|�5
xd��B!�b������<6����	���G��7�x��9�'����&���km�DV!�B����j�+��=�W;�+o���W���B!�_a)�!�B!��M�!�B!����dV!�B!��s�'��M]�Z_x���=�Koo�_�����-I��������Ko�������r�n�O!�Bq]I�TiS/uE��
r|k
���
��M]�+��q����6����SY���8�e#qP{�	s��->��
\�A�mmg(�^d$W����B!�B��4���>�.�����N�#�4��5��~U��FP�+(�B!�B��'i2�FBS	��T% K\`*����P�r��*���*��6������������"������C�S�;��PU��b�����������Wm%�5���>L�)�O��.����������>����x��l����Y���zOh����<J��C���{�����4�����m?!�B!������F:�(r9���)u�`�/q@�������-�?�������4�����{�rlm�fk
���p�����~�f4T��(�6�|���B�8�����o`Cu%N�r�?���_2����q#�N���z��7�q��S����0u%�m�����t�����p��T���c���c�����f0��v�	!�B!�\�������p�Trcpn���7c8����=m��$��WI�
�\f"p|��JD����������Y_�:�g��F<�:��f�!%@�����e�����V|��AiI>aO+�C!"����
N\���s�`��x�J�W���B!�b�������<��j)u�
��lpl�75�U#O�
�D�a��
�1[�k����w,bB��1����X=z=�t�$��e*k��o�P3�?�0�S`3���sk�5��:n?!�B!��u��,��P��
N�J�QN���D��[H��!-��
�x)p��\�e��������-al��n���O!�B�-�>�~�)T���J���s=����:'6�FX�pE3R�0-GK����h�t�����.7UaD��b���M��|�F4^��r���g2��c!�B!D��$�1�K��6�����_`+�i���JquN��{G!6�I���|/��,&VG1�v��j����`	��+��2������������������`S�G�������W����c�qI>+�B!�`9�'x|A����Q������CY_���f�����x�J���F��
��s!F	r2r��%������U4��=~Z�hH<C����m;8�k����s!�qd_+�+��Jo?!�B!���q�wL.GE��4YN���{fw5w�7�����+���|����B!�_M���S����~=11A^��P�eflrTP����,0�X!�B!�Xf�<�x���l�����t��*�B!��+m��!�B!���&�!�B!�W�$�B!�B!VIf�B!�B�8��
!�B!�Xq$�B!�B��H2+�B!�b��dV!�B!��#��B!�B�G�Y!�B!�+�$�B!�B!VIf�B!�B�8��
!�B!�Xq$�B!�B��H2+�B!�b��dV!�B!��#��B!�B�G�Y!�B!�+�$�B!�B!VIf�B!�B�8��
!�B!�Xq��v��h>V�H�6^
\�`.�c�Qj��v����_�����~5��:k)U��Yo�������-x�.�_�B{�1b�D��f����������E�yuy��{F�^��AUs�Z�?�9�~������8�����U����;f�F8����O(^�s���]�;':?��5��E�k�8�}<�� C�v���}�N���-�jA�o��3�V!�B!�3�F2{��(���+��������bq�ls5��I�4��`��Y����h�7v�L���;j�K��`��i�Z7���3�?I��~�0b)q�}�^�Wkx�S�Zq��
1�4��(�����j�������S�z,�zv;��6v��T�������i`h`v��Z�b�1�H�\��v*i� ��$�B!��++e2�j����1h(��b@����>G����5�{���l=0�37�|.����4��T��c���9^��Q��>��'q�M{�.�G�����p�����S����<J��C���{�����4�2G���X�z�p�����"��K���E��AA5Hc����]�C}�����#!B~_���4���
,�R�/oj������=�lN��R�P������B�4R�!�h_/
����Pt�W"9U#!B��AL%0+Rm��@{Y)���G���V��*J0@ ��L�<"���O
!�B!�+i���19������Q,��y����F�G�^u���Q�[v���������:�O�0�B�����n�����b����:�I������kO��6����iv�������QZ�q;:h{�����B����S��im�%�btVP[�mW
�������y���p��t�� nra��^���rn���]]%������,�r�t�>���e�(i��kcSs��~�7���SY�i�LvlF
�?<}2 ��Q`���>��u�*���C[(�"���Cf}"�B!��~�7����K��X�s����J������$��g?�C���@��U�*�14�h�����y"
i�+�b\N��Vz���Bkg	�:�BAr�`�T"�]!��L�o��?�������+�m����zx*hg��T��0(�4�x����6t���P���o�j�^B�	��J8�Ez�v�A���GGf~S�����3�ys�6^
�W������\�}�)YKq-�v=aO�%�Pf�����X��><>7ue��J}�B!��+/�dV#����P������B�AC	�8?�0�S`3��B��!��N�Q;�@���v�?���6��(����P
*�����	0073��s���Fo��Sc#J_h���D"
%6�'�(j	6��)o��2��>�����l����2���sf5�@�x�i�Q��TVWS����7�!g%�#����-����K��3hu{k.�8/���s4~���*)�B!�X�������N�3-y�P
���(�DaQ��f\%�<��Yb�)h����y3�*�I�K6�x���Q��!>�����B�^���(��c[�^�X�h,3�X/����(�g^�����4��w��?Cu�`ou5��|3/v���>��!e��h=�Y����F�������^�l����J
!�B!����3�k���L��f����
#j.�|�$'�B�Q#�=��?������=��sRj�0�sM,~���9$���o������s�]� �f��a"�����d������'9���yC��-���Tr������f���cl(*.��DK��������TW���F?�f(����H�`T��Q��.��
����J��~������N)�PB!�B|ud�u����M,&+��8�����{G0���(�`�8�T�����$6
����rb5��g����FnTa��Z��j��X�iJl�O�^^��n�d-�����0��'=
�����:��,X�l��h� IDAT���B]Z���)��F"��L��^�����#�������::�\m���a�j/������4��������Y����^���H��W�T��R�~<}#7T���������[��'1T�`�b�����3`���[M���z��8��N�Q{�c����.��B!�b�Io���>����nF��n90�bl�mt�T�bs.D�>��k���Ku���]Gs��\�}��;>��!:�{���K��zfn�3������T4�J������F�B�54��T��p���0Jp���X��%Qh�hB��S�����5�Tl/g{�f����7�.�[��P6�P��]l{)���G}-9u��Uu������u�}�TV7�h�Fx�G�3�$>��Q��������T�����X����t���as~���`��@%����B!�B�-��;��Lv���w��-�\��rs�r����-��{w�W������v��O�B!�IMN�OEg�����yy�#H�f|5Y��Tl���C�����D��J"+�Y�)0����DV!�B���������f���Si6�'Jp�����vT�Z�tR�x���B!�Bq�f,�B!�B,����B!�B1�$�B!�B!VIf�B!�B�8i-���R�������}J[[�D]4�R���c4\���A��f����������E�yuy��{F�^��AUs�Z�?�9�~������8�����U����;f�F8����O(^�s��3�A���>�.�_��������N�y���):m/RW�X�9����H:-,�B!�b)R'��6Wc	��Hc
f
��X�:��y�`7��z��MG#]z�;M{(��il��zm�$
��������{y^���Nk�a�6��;��'b�����g�	'�S[Ou���������IP�PC�~��2�������Z�b�1�H�\��v*i� ��$�B!�B��"e2�tW�P�dcc{�H����2o�b�ia��K����	������x�9UHB���^�;��H��|�}
�4��4t�B��r\��T��E�1�p ��H�1��e�0�3+5�JXU�
�(��B!�B!.�sf8��/���i��kcS���������_M~���_�����x�d$@0�����v���n�U���!�B!�������L�U�I.�S��
zw�>:2������n�q �����R0���M���L��Z�k)��	{`,���2���������O�����++���P���B!�B\Vi-����]��/|g�|o5��=����T!@�^�����z*�����������J�5FNa[[�%���mg���2�\Fq^�lZ!�B!�e�<��E�F�b1��\�����&�����-6g6������P'�[]�w������<s��h,FHQ��G���1���u`�5���4��'�*eL�VR!�B!�2K1g���?��h��K���}��x)���x�	-�P(4;����X=G����<����k������Q��\R!�B!�rJ����������
�V��BJ+��(Ls�0��L���}��K�n���>��za}5M��,����7�qC;��X�N*���E�xC�
&+V��Q:f����`m���������!�B!�X~���*=�kP��^����#�����gle<�B��C�7w��������p��	vW���72�����04SY�����a���2���Q��������T�����X�����t���a���
!�B!���q�wL^� �B!�B|�LN�OEg�����yy�#tS3B!�B!�5��
!�B!�Xq$�B!�B��H2+�B!�b��dV!�B!��#��B!�B���Hf�\4we��jr�;�r�����_������i�]G�u�.stB!�Bq}�6�����f@]1E�_M](��I��{��P!�B!�#��.p5w���4Qd1�SCxZ��}ht�����T��1�5��!:[�	�������B�������9t�[�:��<
���'�$n�i�a�E���a��n�55t*����Gi�yh��povb3B������x��{�*s`1����������/]�b EX� �����:���D��C[h�kJ�Q��aP����1��l��K��CB!�B��I�d��h��b��(�M�<__��d#��`�:L]�(�-��DrqV�S�l�'[R!��L]	t���Q�`�SbNT��fcg|���rF��������m�i���f�][M�^�����������(.�Z~�=e*��FZ�*Fg��{�v��J_���W�7�H7��&f=��z/w�F�^���U���I=���@i���E������N��S�����$�1�u���c��*�B!���4�Y���K��X�s����J������$��g?�C���@��U�*�14�h�����y"
�t@��+��4�oo��?
(�v���3.$'�pat�`t�l.38��v_�������.\������7�i����w��0(�4�x����6t���P���o�*�H�D6�����y3������8�%�(����+�B�q��-J�^!�B!��^���jD���ocC(QpX��`!���f�U��)��a@!��pm��� H�?@�'���ff}���tR����dV��;M�l�b��V��u�O�E�@(e|�@�����B����I/�Z���p�"��0c�X�%�]��;�<8�m�
c�������5If�B!�_mi����uM�g�Vj��zhx�G�&
��pV4�*i���CMA[$-��_N'�d��7~o���q�C��{�.����x��2�>�J;|�I�������
!�B!�XXZsfs-6 �����L��PaD��a7�/1>6�B�Q#���?������=��sRj���'5��~n������A ��l���]H8��)p� ���t���3�����$'�u4o���C���J��x<����Y 6�B_��M��x����0��R`1����l����Y!�B!�WZz[���h���b�R\]�?��Q�O�w����B&��M�n��P|5c���*W!VSy&;Jl�F��D@V
��6M�
������q�M���T�;L.��G�R���R����M;�PU�K+�P @`�!e��H�\�	u��_Q�Yt��%'�t��XZN�������UQ��`�:��t���a!�B!��*Io���>����nF��n9056����f*��9�AG���Xo	u���]Gs��\�}��;>��!:�{���K��zfn�3������T4�J������F�B�54��T��p���0Jp���X��%Qh�hB��S�p,u�E�����b���@b%�@�Z���lz����������!�B!����;��c2���.��F��F65u�r����-��{w�&W3/����z����b���nFd�Y!�B!�
699?�yl�����MKo���d-fS�K�]��r���{]&�O�}
zK,�B!��Jo���Iof�{;�f#z�;���jGuY��w^��B!����r��B!�B����a�B!�B!�_�d��N��%��s�#�>}���Z�k=>!�B!�h��lfC5���b���x9�Y2������?G�b�����S�t��w��_�+�����������_+����X~����=�n1��;�K�sR���jG�
w��r�YJ��x�������j��D-��q:g=�v�����7��R{������f��jt�\+?_�����������'���N��k��.����+�WS�����%��l�T&~�3��������V����Kp���+_6x�Z>[�p�c&���5�|A��J��7��d��3����7@H�������
��'3�2�p�����/�v�|��]����B�D���%��}�/��?��wz�bo ��/�����/v�/6����p�#����t��l�0~�C��N�����rJ���p��B|�7e�~7=���r����������>�2,�������������[?�r���� wo~��}�^����������\��\���=���o���?�����`���xBZ��b�I���������U�R��~|^���sL%�E��rF��R`�C���8�e#qP{�	��Vs�����K)_������j�E�}���>4u��YE�����^
�w���C@����j�F4���AK����������mT��K]�b�9�������t������5����q�?<�a�7�^��-)�";���o����?Kq��v���A}���7s�����9���gy���6�*�_8�z��t�\��MG�:Y��3�>\�{\��W7�Ki��u��}O���A����g9?���w��9���g���_z�_>9���p�m>�p����f8?6��gr����M��VH2�n�Vn�_���u��&�0�Y�s����_��Y7�s�Z�~��O������\����?'c����s���z���E���y5�~��+������9
O���f�J��)�!^� �����K�$
��������{y^���N���=����p�q���Z�a7A �~�����_��^�f����?������6�4����p��!�F}�O��O�M�zv;�4v�T������0y������);���L�3�����(($���d�a��S���Gw;�5�F��;����##L��9�/�����!��O����=G�cw�����~���M-���Ld�������9�^�������?Hf�9&�����x������M������E�H��������������?Y��?����%�l 3&a���G�o��
�|�F&���K��}����U��^G��R�7MY�L�}���o!��|t�]�u��O��R>��NN�F�	�R�z���[�~�;�I��^������
���\����'�������bbY�K��)�?����2~�����h�C�����v<�X
��bb���3�i�?�d�#��8�1>�5��[���op�{��K�S�����.�� �t7�;�l�E�f������S�!n�y����|���x�O���k�p��f��
�S
��sXu�M�{���}��i�������rZ���B���:O����C���2t�_����Gv�^n����Y}��`�t��~;�nz��c��s�Cd�Y�
|B��m,8Znn�\p�->~�G��K���i%�����GY�z������s�������6@���6s�����w��'�`x�a��0��+|�������!��ge������{�w�P{�7#��g���u-��'ZX��a���G�T��a���.z��_������-���>��O������	g�����'�I���f�wkYm���N�N���~����5�y�m��&�MLD~�����������#^���,7}w7y������O�g��>���V���g_��l�6�q��e�����w6SY�cu,��n��'~���.��}<���-������4l�-�l���]��^��7��T�z�pp���xB�w�m�sx���n��F���#���4�<�xm�7;�!��`KC|�AG���X�z�p����������pk}�(���!<���>t	?G��Y���r�����(��)������Xp�V�.�aP�x����������W�4Sq=��V�8�M����~��Z�=�#�v�}����;[��E��M'N�f1e��L�:V��^�z���?���N������Lb� �$����B�N[jO�i�ZO��N�x�������k�p������:��Xq��0Or�	��0v=U��ik��;`M��k�H��*��X����C�-��������VX��}��������?���5���j�'�_WX���l�ng�����"�����tJ�������Q����h��(!$�����}�tB�L���0���C���x������|���^	O�����T[���?y�,�~ZW9!���g��3��}^�>-��/���<�y�_|�+�U&@F~�
������
�/0]P���U��j�&9_���(���2m���o�����)<J>���g�Wo1�_C�����Y,~m99���7^Gy�wL�9Af��	��d�N�Ao �����3���D^z��������Lv�#�����z�9Z������,C��WG���S���=Q�~C�_�����q��/��������������M�a��o����������L�W�2���
sk�����~����~�FF���[L)�|��)�L�s�L��:���g���D^z��K�c20��T��JH�,k��\��=~T�u�_"���l��~������'�,:^���.�9vl/�<�g�+z��+�R�����r����0�\������~���?���	?�#s�����g����Y
�Y�"�n�z�c�:�e��X��!&���	���C�O������@��AV�z����0��W��)`ui+7��5��QFG�^�k
��|
%AUZd��<�-��������V����/����/�O����������fM�>I�������s���\�^A��Q��.Jn������k\����~Bx��P��0�� ���e���>���6���7qy�"��q������P/�����t�:x�M���^F�.����������f�Mp��d��x�m	d\*�n'��g19��y�.��������Yn���.`��	{���=8y���{h)�$��B}	���cG-��=���T�Y���6���]�6��/)!�����������	;�y��B���N��&�����f7�����Ao���1B�����Z��Zq�}���7Q����=�8�Se���������m[��d��)]�w40�aB�a%.0�����l���=;�U�L���q��\g9��q�S��v������{OU�����(�Y-���_�����2n`<�}r��0�����@��~�nj��({�I	��=e���QV���U�_�����u}�C=%�N��{h���\3�WS+I�O����W��i Y��|_��X�*�q��a��x��*��s�7�K�!��"���_k�{�s��>�{}%k�!������87��`���|�LG�V��s����|��g1���u����%�6y���b	��x��(�N���_p�����~�&�U�y&_$r��f��c'����>���;������<���WP�B�����_� ~Y��A��L����yKs�S��|����0d�������dmX����z��`^O��2~�S.�d�x������W5�]��������L������
�;��'��6�ci<�?������j�{�w����T>�K�/l�:�������<�IN�|*�O��l�T�������UsW�������w\ �_�T��4�4G��[���N��W���>���v&���P�<��+�����.�������s���c�P�>|�o���\�m�w��K��d7�\nns�'o�.��E���|/���Yw�[�w���^n2��p3�������s�=7�����Zo�\|=�0]x���o�i������q�+�Y�0�"����%���c��9D~�<�\,�R��a���3��\��\��(.s���x��X�X/7����V�����\��B����{u����s����\�mp�����=��U�x�=���U7-C�d�I-�����F�������z��������V�����~�s�{~�T���!nX�3�~��n?����d\"��6�g[
������BgX��g���-z�w��%���0TVS���_�FM�)B:=��A�(s��)u�:�q{���.'��O�AiI>����{u��X5�B�A�N�������N�K��$.�x$N��"�c
z@W��r����9
�{�Sr�&+�3S?R/Gb3m��C[��*���W�G5�������'���Y6m��Z��c3�A���3Cz��p0�O��5@�s�:� IDAT���C�����G17��"Gv�t���O-�b(^�l�������~�Zb�����Ia �gX��5Q�>��~LR���2�������?Y��B�����X�����	�����	�	��Ia'b���7�x�	�C��`�� ��(����!,3,���i�S��^���<�#_���������	p~���3+'��5�&�~��1E�brd�}1����S��a��;���Ug�q8��jt&h��.��#��x��W�L�aB��� �\����b��_1���K,���Q'a�pQgQ3���`��7�2O�~���L���{<�A�g0=z
���Q����SF;Y����~"�b���nG����'�.�C4����d2��k��(L����QgJ�������|G4���~W�4��f�T���}f� t7��~%�;
������O���[^<�?$22	�z�6�ap�ow�ay���G���X�m>����0��:�O���s�G�=;0�z
�K���F���3�����n�����:Gw�u?o~L��r���L��/�������mo�e�'`zB�����$;v�C�U&�z��
$�};��/3��������"��J�\�Z]�Y^���s�����o3��B�D� p�7;��pW=�����'������=���������efk��"Gvy�S �~s���>���6��7u�'�~&Y�n�UC��_|82���������q?�F��_>��.��v�5Z��1�'_%<��oZ�����a�=*������o�|�8�*���$��Q�i#v����.��4u.>De)�n����>jG���I�t�5$�m��!F��N}X�	(1G�`!�� {sq�dFCz
lfT������K��~J��`xq|�1[����?�
��� ��a�R_jl�[@�Ir0��x0��!��������>��%����+=�B����j���!\A��/
$�_M~	@!EQ@�~C�Ht3#})�M��
�O��/;����Ca��s<��E��mO��`s�S$<�i�"'������)�v����������w���?/cVv�M�sf�>�������H��/e�E��yu~�7nV	W_8u��Q��K���>��/������7��AM�2�����������2�?�!����|����$�0}���H6���5@����Y��o�t$�(	�&O?�}��uo�����$�v��|�@yU��W���w���|j�O+�w�|���������3�$�����&��J����2GV��JJ�d��b����J��!r���5w=�>������W���m(`���Lg��/�8�K.��
�c�}����'�T�]���O����������C����>|=&���1�~=�����3��{d��7�}d������l;��/�gr�E.������b�x.���D6X������?�Q�(K��e��F��7�	L]z�AE��}���3����L���e�������w����Y�����~����������gS��$����0�o�;���_j����F��I�~;g��^�@Kp�Z�j�{i����x3�EE8�Zp�����}���+"�)c��pke�{g~��`�[TU��+W!��4����`����}l���������D����_��q��T>4����T�/��s��H�<7�	Q�����k��!I�'#F���3Q
K��si��[��8�l��W��aM����x0����GZ��������d<z�i 33�P�w�
A��.�X���Q~����
F2�j�3��o&���s��> ��|2��ss{���zP�FG"�[�M���A9���q�g��3L�_K����v��8�����%`�w\�AT�����M���`"f�j�)�c�����d�f28/��F��y��,��4��(��|�\t?��[��7�����e�	��u���o�����~�"2�4����HU>��O�s�����n&k�j`��Q
��/0�f0Z����e��gr4����,<j���g��:�	����q�9���A�?c/�<? �n�.��l�E�BT��U�!�'d����*��D��0;�����67�����r��`�g?G��xm����5�����?����K�s>3w���������;���p�v�,A��5��<�"��_���l�}�����lp������G��o9�YD������r�:������Z����� ��`����Q��P�
&�zn��y�?Sk����L�����wp��K��.����� :���q�]e������?���!�g���6Qi��[��,�4��4�KW�>N�;)5���������q�����R���c����2�
����a(�����oT���&/M,�|[x��5���Y�'�!+F
&��6]��\�
��J��fR�I�`���Q��p���m�������Ix|�l<\���aj;gO��&���J$*�Jfk���j�#*�z��P������X��?2|_�����N��6:0��[d(tb���?]1%v=�s{h���]��T=U���<z��4����G��2���D��D"���`���{�.e'����&��Kf~!�5�0I>��q��A"�]O���p�������d�a�����*�=��*�v�V=<A�k�KL��[Q[k���L�q�i�D���Z���f�w�K�>����

Attachment: Screenshot from 2025-10-27 17-17-27.png (image/png)

Attachment: Screenshot from 2025-10-27 17-20-20.png (image/png)

� � 7U�c��/q�O�\A�kEL�AaV���6�jAA���� � � � �b>����i�[�lSY�8����wS�Sz������j[l^�$� � � ���5�>�F��<���g�s~Y��:TtM���C��j�Pz�����X���J
�������;0�������� � � B�n�|,����^���=t4�q��U&2�Q	�[��cxAAA!J7�^�������Z^k��8�{�{?�X���o�d��E�������;�"?Uz)_Z�@:N�3���
��zZ����I��B��4�zp,��
AAA���c�k�$�)��d�
��z��A
��R�_*����� {j��S��Z��2����FT)F�e[H�w��^�PAAA�[��1�W�c�G������e��a��y�/k*k�we�H�����#��� � � 1p��U�l�6�����V=���:R���l=L�w��?�n^y���I�`W
�?<�Z�Fv~>E�6��[�]�� � � �-����*f��5[�8|A���>�Mc�m,g��x|��0�v�����Raz3�2���ky3� � � �pK���A5���������s���/����N
s����$�J5�{���<�sI�Fs}o�AAAn��;�)$/��sf���<-iVc�o�����y�����0e�b1�D�8����E���b���Xml��O����9AAA!n�+�=4���pm�U3���iaGq5GlA~/��v����y{�(4Md���6\G�P�0����z��Q��
�'��.*�.��a���^��Z������z���[AAA��;��s4�jl����tIC,�AAA��-���7�zAAAAf$�� � � �p��=�R�1Z��QAAA�b~� � � � �%��AAA�����E��hkk���[_�>�����{�3(O���9���U��`�@���W�|��y��/���f����~Oe-�H�N�WU[�����������AAA�u��x��7��b��w�{����TeR�V����e�P�M�M��z|���M'�u����+x�#AA�f���	�6�V�����Nm��BAA��QL��[�)�2�O��gK�N 4�(P��5=�/������8����	h��/-%?#u���c���WKx]=4�������L[u3��1�5d4�n�u����
�Z��|+&�-^WMu����H��}����+��HE����tJ��yc#�;���7o=@��NM���
VLZ�;�X���~#��lc[��_;hk���������o>��Q�%o'E�Z5��I{���|����������`�VZH~�	M�E�U�v��W��r��uh��T=�;�Ux]f��F���v��0{�J��x��|��i�� � �pc��C�Rl�l����6m*��������]l6yh(�BIU;��,R���E���ogS�nZ�_�F�&V%P��� 5ng����b������?Y��s����%l*���c���b�n�+���8\`����d0a�xp8����6��o��������45�������rrr�:>�t�rrr��yv|�m|r��7��egn2�u�l*aw���]�-�k%�;��K��?.�N�C�����{��=�l�����d��2��=r�r�sLfe3�X��W� � �� b6���R�����������N��)�@�I�U�������i����4��Y�x�u4���y��i���b��C���_q�����`�J	�����.n>���Cv\3&e�������1����skf�Z�t~��������V������GA�F+������JgC���*��x|��iq��3�/m�����jp������I{c}j#�9>dO�Y�k3������� � � ��b6���a�i�L�3�����CYb�	��O�gb��x�J���	��;���C�_M�I�s�>3	�wR��<~�R�~T�l�6�����v���O��r��������g��R�m��.\�������I������7Z��'G.~�	�6���h�65��O+�7r�/��%|�����j�s����fQ�M�Zr���z�{����������T� � � 7����n���V,��I���ZP�-��g�����^=�� 8�K��)bw�x6�i�-�x����Y�:D�;_46��O|U����:����dAc��X���=*�O��P^�i�6��\�F+��d1~������o�IS��a��_�}cA��Q�W�N
K�h}Q���d�� � � 7��M���l�������`���:(1u���O\)���JZt����*�aby�C_ �Y?�>�@�V���t�4*j�
�����t^O�����s��5480��puu��G\����c�ea����������������r���n���.<��4�n��������p����uONC>{���`N�B�AA�����VD�-�.���uY&�����r�P�e�;d�e1>�vb�	`���/7/�����w���S�n@g���8K�{O_�m��r���HfqV�w�C��'�%�Pi�L�_A���;]~�YV��n��x,���������o��}���p�������l��%��[wR���_>f��G��b�ggO�]nEf�xP�?����	� � ��+fS�L���.���^WuU�����^��r���G��u�m?No��{���������G� �����ib���=�j
���!��.����+f(	��cx���w�����f�~��iaGq5GlA~/��v�,���[��p�w
8z|�����Z���`l�/W��Z������PBE�������'��������i����_�����m�p��]���:������p��+��������}�3��OAAAn~qw�y��\�o�l����>�/_�L*�*&P5�~r%l���{+�X���{�-E�� � � 72���s�B/�&m�A���o�i���#����u��� � � �
`�S�^��a!"����g����Q� � � B��y
� � � � ������ � � �p��� � � � ��^n"E��hkk����{���������P�����_#���)K_��Sl�4.e�����q�y'��U��Szxl�l�����	7���f��`�L�R���g���-��dWr��������q�����L��|��p��bgY_J�-�^K2~\�c4�5�5�mr��J�������:;���o|�\������N;e��c?fcm���6J�1�L�j�8�<Cy����G�/��klT�U�eZ*��=�hW�2t�����������s����(���J
�����%?}��4����	�y�3��������b�����QO��.����>��)�����&_����g������E����'������(�:��lc[�l9�8���&�oe��}c��8�����'�?4�J�����&K��t4�Q?����s��t�_�8W��x-�����������f���%��.�����'�H��Q�����E��r�P_��Sy7d��:Tt��E )6���W�<��1m�P�1�Z,��D?�ln��+�����������?d��D��+�
�4�Tmj��7}�&���SY��7����i�2�?J~������2dR\Z�����h���i�����
��fg����7`��������W s|�����m��I�9�h}��;z��;;P�=G�=|<�����]hm����������T�Svx��M��:�Lv��
u��+oo3�?�^���?*!W�3����`���k�x�y���_md}Q!�*���P��V���������b+.�l���K���z�RQ�����Q^�
��.��}w�b���*������'�c)�2_��[�F&�Mz)[��I������h���T��,,`G%�o��#�|���k_�e+�eY��5PS�"��a�*������q����&5[;�.�*�pI��r[Y%��������`��6��v���������������F�Rg,~d�Ov��������UTq����_���2�[^��}�F�X��m���S���!�������n��')�;ymG��C���BJ6`��������y�s�����A_;r�_r�������%V�Z-x���-��4Yyd���}�q�������H"�~emF�Q����K�S���
3�h�v�77:[���[M��������=���.	��@V�f�_�������K/��}�������\>0��Ux�=~z=������sp���%�������k|��x�P^>�/�*o��P`�d�,�UEm{����;��fc���}N��<8�[��Vj��*��m�E}�b���-���d�8,;���]G�������R\l���(��hg������M]qOk=s���-d�J���G��#�/W?��������}cQ8qDX����G�?�|n�>�*A0���d<�*�EM��=��O���Nz:'��
����r���U�����\����!ow+�pA:���.�4��V_0������u������3�V���Y5�t�/\U�X?6#����9cn3��XP2)*�g�%
~���4L9m�VZH~�	M�E�U�vN�U,y;)��`����N���L�!���2f�Q]l�Q[NmW?��f��ctk2�0hP���^��g���p���n��Uz)_Z�'I��zf�������I-4a%��G�L�O���������Z�C<u��cW�r��-~Uz)�Lx<Z�����Q�����h�]N�#��~���~
�/��c��-l=\���L3��)=\������b�z�j��������w4���)b���|3������A���������PPY�������W��
UNve3E�*��	G���W��,�z'W���A[�`��o��'(���_q��]!m���un�s�,8��O�����Qt|�����1^��#OG��u��Z�0�.��������RA��'7�Cid��M�YEW�T�&��|����?z<Z�pb-;�.��=+h������S�Y���tH�-�������������}�Z�R����l��w�..I��<���S��}f)�Hs3�G�wg�X��;�����8�zl� ]v�����,-����F��=#����wD�?Z�bd]��d��w�1(���V���.=�����#����t�u?������	��2��Dq���g��n?ZK&cUn��@�wp���}��3c�Jx&�������)��,���Y)��s��IY���N��Ml)���W�a�Y�dk.��C��d;u
�����2��egn2�u�l*aw���]��?���{�m��q;�}�[�V���p=����-:j���d7-�/m#=|;W����99<U��L�D���t�r6>�4�� k�|�&Jrr�y������9������|�.W~��IV�j�1{z�d�[q�l����f���\?r"�����������L��E&���3�jC��^��99OS��K�~cX�X���O�����1�}Q�]�����l��[��l6�����:�:>�t��p�>;~p/�\i����)��p���iS	�u-D8L��*�	���Ef��l�z#�~u���U��#�)������*����J*�fN����#�Xbu���o��.��W��t���*��c����Q��G��z=:��$����M�����kM���]8�����Wr��|���
�5
���;��6p��Zb` ��h?G�b���F;����_wo��awy9U�]�
�����5>����?�1�F������]b�eW�Y�AI�������q���w�|�"��=�u�0�	���m���\��Z��w��1��t7������
vq���^���~^�,�������,�7R����H�'�]o���z.4������o��Z4H�\�p��V�J�G��7\��R�(r`��9�bX��:��j�����q��PK����#b IDATIb5�Z����������>��P��!W��P
]N<>�����X����,��i��0G+��W�=��t-��R�Y�x�u4���y��i���b��Ct$�}<g{����@��+�����������w��t�pxQO�?�Z?2������tx��h�����t~Soup��Gh[rt�u�X����&��M�����c���j���������3��k��S���������EC����������i���~��������l��#�~=����<�>��*������b+����{����@7���y��l������e�[����T�M���4���������0md{����IvcLc�bR��i�1�S�G��rtW��@�r��ZJ��-"�E�r��*V"+�U�\���	�)�7��f\0��+
���q���=�I�#��s���}9�3��]���RBDD��P ����KtAI9Y)�"�+!��P�e_�p�ta_������	V6���s:�v��U��6J���9[�T(P��1"�K����4�������}~��{m
@[��l����$F�n���v
�W��q������}G��E��2���c��D��B�&����*�S��������%���7�1�#c� >�����N(�.���2��~Ib 2��H�����o�v�f��J?|���x�~Y@����N0�<N����>t+���t�^~���YF���|�����'���bY��q�n���F;v;��"
N���3}1�qY�&��$6m���l��e�����#�yT
$�8D�8t�B��#�C�N��
{��*�����:�i�z@���_��VA�u�k���U�����h��{�.cB~�J����ha�
B�H*���$����o�r����[���}L�����?���n��c����y��"�����i���(f3B�����\PP�=�3���l���z�Z�?���e|������`���H���D�9���#��lE/����	�Y�&,�Z�a����ow�Qg����h�~�>Q�����n�Tj1�c ��R`�V��\b<���=��#�����C3�	Yu
S��t;N��b:�Oi�6��*�f��E���`�����f��g��1�Y%0Y>3���!�P���q�6��p�����O>_��S���b��"�����@�����nL3����]pp{�+���p����n������Y_d��4��5�74�m���6�9H$�EI>d��[�)���<���R����vV-J8�P�*T�~�
j{��b�r���^��XU/]9Y������W(�3�O��~zQlW2���y�N�iB��wX��L[��&�w��[q �A�J8�>�D�nl^5{��z�v�R?��~u����M��_�r�u{����&������$��>`
["L�}��ARE��_���c�u��>�w���W���������J��5�"�P��~\��o�����%'TJ=�j�
����WUU�X�P,.�{���<\>�����M��Lo��o�~��m����	��@�C)8���+������>��������q����tN�������7D�Y)�z�T����G�?��W�k/����7�����BY;�<�G*�z���A����o���{��LO���[d��c����5��V����+�w����st���Y���C)� ��j�r@�t����o2}���S*{h;v���s7G;�wBM?lM:����|{N������[���P\G.9��������]��~�����K�����_���d�W�)�^^�`e�)������tJ�!TT+ND_�|�D�:Z��h��zR�Nf�[��'�o�>g��:�l�
aB~@���M_�m���!�n�� �r��lq`I�P�\�a�}{T�����k4M�����S�[�i��uC�6���\��$�/�z�=lyo���9&���`�����~���������a�k7����=�#�[���6��<���-l�#���'�vTP�������e|�Og�cD=@�r���^��v��T*f?�|;����v��������UV��dT�nU��6H�gG~��n���������&St4��~0��X�Jz-�.���|�<�`6#C`��!�'@4DP2g_Z�����22@�I���x�8��-_3������q6����A���*_��?)@��Og�cg���*���U-s��C�K�4<d� z���)bb6B��L���V�o�������F��!�>cc�e����6��(}��i������!
v�9p�|�<��'��<#��}<>?����%r�[�aD+��l#�8�FV@
�I".��`t�����?��Y�v�����V��?���d>"����06=�W��������'�fil����3����C�����'@l6L���������H\��>�"�g��]"i�����=�m�_���.������w��	��a06F��� ��=��1�q�#�R%G���h�%J����7������}��UopI�S �kY���\��/@���H�G��@��yb
0��=�_+��?#�f��?��_����������������\�����'������DC~<�>�\>��^��J;W}��n���V��������zl��W���pa����x*2��+���6x�v�����j�4�d|�4>h��fK���
E���x$��a$�����o�~;��n��Z�#�(J�ydU�-�)�Ulw��
����1Rq����/x�<�v����*>h��%��w�n�N7���w�j|L<dU�r&��k����uJ�,��is����� Bp���K(���%����c���s��K�x�����b������Po%�Z���S���J�aB��pN��1%%�����5w���D���OY��>Z����2[+�W*�L��~�@�.1{��k������L��	2�<�<�h��q�]G;��������������}��c}a{l�����:��>����Z��|M!�~��7e���i
�?��Xb<��=�N� G���C���I%y���K(�5�������=��E��G@����M�������U��Efx�v�VBU�y_>�`[��3s���ME�M�t�OQ�N����$���rvT!�+�%��/y�M��z!�����a�n�z�?;(�3i�'�j��Z���,;���u��A�[��\���}N�K���2���u�7��A�M���Y�{"r��2yu��6��3F�q��k��7�A�a����9�/���=r�����+���,���)�!���JE�3�8�|�G�t���c��4KQ''��\Xk��L5d��}��h�����Za��*o������;bh��W�[��X����R�
�#�m.L����(�5_5��4n��V���a��]R�s���<r�hJ����a���o��Q=��S<��SQJ�+��@�H8K���B�!`��99��E��-f�������?������������e������� ���,��_jq���>)���L��
b�����!�`L[��v�_��0��6��d�l�v��7;�	~�������������{�-,,>?:v��e��R��*�f�w���vk�naaaaq���2Z��a�\�	:�d���1L�Cf�s�����%���:�x�ok��5V�XXXXXXX�J��_koaaaaaaaaaaaaaq����[XXXXXXXXXXXXX|!Xx�/���}C�l��l�:B��/	�>�6����������ag���R��}��T>�@�wf�������������I_��oo������;���Yn�|�����������=���$"/rj���G�s|�l03�aq�����r��y&�%�8N�IL��"���v����3�!�������{j���/�<�������<��.�5R-���������Xz�0����c{�'���>��;�y��$v�|�C"<eT�"8k���'I�1pG���bHW���<dn��42E$4��-�DC�?��\o�?��7,�ss��_<�|y�ES���f�7�O;��1����������+@46��t'�vY_I�L���5*��)<�9�Y�a�|�j�5�?n���Om�|������_x���A�7_������N�n_~������N�o{��	��Y�������rt��;J���
-13�G.�`Q������n2}�E�:�T����F��ep|�g�p��r��s�/�b���x��%�LV!8}������^����4R$�t<N��S��:�J���{���q�#:����uS���oE���a$:�����?�.d��O1�_�P������g������|-:��~��qB|d��D����gu��\7����
�kP>s�T���������{�R9N\����7\���k������o7������>�@^h8��Xb<������Ja+����Oaj6Oit��k�t_/�Q�#����NI��N��k���cdj������/97����!:*!
v�R���������I�[`u���R�,�!&|i�_}oX�fX�z�<�@�7y�H����?���U�������B
[�	�k�>��]���I'W�����rO~b�x^�������d���S��.p`����s��?7C���M$�pW���R��+����?FQ�~`���3���S|7��:�I;��|\�>�e�3����Nf�/�gj#���C��-�������a��\<�Eq�y�v��_�p�$�� sd��F�Y-�A��a��X�����N������p�~������r�@}I�"��$����'uOel=�a�TT������{��g�{����
M�w������i�JnU���V�L��	w��M�o!S�5���g����&������~�Fxu�M�W�e��-�/CL��{"C��3��{��D���~��w�8��3vv��dy�r����0�Q����UT�B��������)����~?|�R�"^v''I��W��g�������Q|y��;���?��g3��[�j�u�G���7�����P�T�<�-1���I2*8D�FkqTU!I�m,s
6��)��J��@
O{���2�*b�<��L,"#�G�c?�|����KfF+d�s,+��S�Y�'����w�m������E�������]�6I���|TE��o�iC�����gy	��xB.��hh�u9�/�;��#����);	D���OSz�L��NN%��^�Yp_�����{�����>�CNm����Vt���E�o���3��5�O�(U�PQ
iv"!	�Z7�dH��d�������Q������
Lw� #+pW����;�E/^����y$��}��U�_�������f�~���V���_�����n��J�_�I���d�G�	��d�w���5BH��[�5�����B���o�SQ��+��x��4��;�e�e�8p-$x�x�o�%K�qr���@��}�
�w��;�l���#�F�m�����}��?L�7��������F�����:����
qL�{��5������&��f�5�����g��G�|��w�y���F�g��1��v)��1���*gH%���iTK���+���6x�=���"y�Gd*�V�m�_���a�~&�~�s���������g�n���n��0��n�P���oo�p�t4�j�%��@����5ZY�x�W@5G�~Q����+�������(^�+���$��������a����y0����#���2���l?���&�N�-���W�Eg>P`��.�
�����e��n#��V���os����E
B���.q�%�w@b(x�Rf�L��+��}%��������2rj�}w��*�l���o��@��� N@/eYI$/F���~������ ��B����Ly���~�����U�F��Y��g��u���|
��#�d�1�
����y$\�ri�y���C�$�Y�5���+�������y���;'�h2k�,��Q ���Z�����!1��e��p��<��Y.��~�n��w�Q�F��2���T���E�'B�h`�:���g��-�4���-���F���
��Q�K�-oS(=b(�����@$�P]&��v���k�Wm�7����U����������
61���q�g����c��9�m*?u�P����d[����������0l���������w#��U���/i���h���~��~t1�ks	�
@����l�B����
���zF��C@�9�����=CD����S�TC�n/n���z^���R�k4�\�x��bg��{���q����kGhZ�J����r�����~�wW���zM@�W�`s�C��CG-�;G*���~��T������^��A�������h�jn������E"Dc!v�������/"�DP��.���2��)�r{��������oI�o�;����5��Z�G�f��0�Jh3�"��|������-&�S�r��/����/�����	���)���^;A�Mb4�F��i��P<N������BL���.,t\�v��u�o���V���ut��k}��}��o�����M��)�+�5�u��%&�#Z��&��~7������<�9���+���E����aV���o3�_���(f���1�Ou/���
*��������;�I�m*���A���6�o�g����f�������oW����f����_7�u������W4*�G�[&��$��F�?0@`l�P0���������J��(||6�]e����r��d����t��X�/}e0|^��f�W��q�n�hh���A1���4`d�Mk/�f�jQ?�?�<�3c1���&�F
zX��	�4>m�-	�����X,B�HA���3B<�a�.*0znH�J�X��#���Wp����E���HS��n���D�����9���W@�^*�5�E��w��1D$�EI>d�����L!m<����^��VG`��Ca3�8���cB���������'�Az�F���O>_��S�L��1������>�,���?��v��H���n������-��~�J!���i}.*Ge�
2�W����c�������o�>��gEQ(��%����U��O=�����i�~���m���������&�]7����7�]����W�PJ��"�5^�!�m#�m�98��t�!G�[��D��
j���F�	�������\v(�J�(.�
j�emn��K����.�/����\F�/��!0�=
�3�R?dj{��^{��6��7��~FTT+N$�r��C��AG������D���������3����m�G�� F�Mb�����>:@���t�A�|�2?���
_)K��^z��u��.U���S�CV�q������4<�����6}P7�s�A��^�ks��7^hkk�`��/xo��lxr�����,����4�jn���b�K��^<��I��u�<�V�w��w�+�7��}��z���R�^-��O3�������}��?��7���k���um��U9:)@_p	�tA�+�Sz!_#�F��|t�����}:�O����
�jct.�&�3k����w���y^������������i���������&�o�����R�n\?�����O��k���_+<�(�����>������^Q���S��\��9�"���2�16<����G&_A=��
�4�<�����ejHBt�x�AF&f��m�����6x�v�1O��cL
�=�� ����}�F�3Df7{�;c�/�%Fb��W�S�l�=L,�C������M��#@t&��`I�!BL�F�?���U���b������'@l6L���������e#�"W�x0������
3�dU�E��2�y�/(�`}�#yUf���ek���_�`�3L��X(Ph��X��@���y�8����� ���hY)#$��D��q�r��L�����}F����aB>���W���!���^���>|>>��TU��W��K����S��Y�������J��0�#������g��H>F�5���4�/��G��F���j�e� IDAT�7��_��a,@�$#S<�P�m^�I�S���o������������i��6?c���O�?4��X�A�������Y���o3�{dr|�!��g�hX:0������������V�w���������@�n���t���5r�L5d��}a�r���f)*�����#��v%�X����+:�y�V�x[�]�JE+!o�"{��~��lNM��.�2����n������U��Efx�v�VBU�y_���(�)�
�#�m�|/g,����.��]�T�����et-��2N_H���m���K8ASr�^H��w�4Yu�qo��/�(~��[�Q"L��n��]?���s��5\F���������h,B��8��Ol^�er
�w {y	���{�w�Q��x��������Y�{"r��2yu��6�[~���������@y����{X�y�kY������C�7�3���p����x����O�E��������A��s|X�����j��/�[�����+�����;��?�{��1���l�3��;�������&��������@��-?��}��?��7��Oi��1�_7R8���	�!���L�L*�7#_��������:�?��Oc���� Bp���K(���%�n}��~[����� �Z`mj����D+|��r9o���_�c���w�����w���F�<����������w�������?����o0�-�S��g�a~�m�������ML��a@N�c/�f�+|��+Mm0/�7�o������s�������c����e�_6_��JSl��=��2Klq}|1��-��~�����/���^!��Bx���K&���;b�~����Wn7�7��.i���
9������7.�o�k�����/�~�m,�������dd���g��'�����vpu+����-��~u4oaa�y�m�G�c�l���N.����r���������o_���Q|n;�������h���
�W���k
�-,,,,,,,,,,,,,n;/������������������Xx�/��w/������������0C����_}�%>lLq���Ko'7�~�6v��9��r�H),�"����a�������E/1�+@46��t'�vY_I���N���drY[{����b�m�����j����?�����7h��j����<�A���]����L��V�b����\u�W2�[�:}=�����9x�/�n��"��7;�;�����\�5���w�~6L�WW����w�����L���<�w'�d�V^�����g"�v��	��yV�g���y��	��V���g��_�=���"��OE��N�P&��$u��'���>��;�y��$������
����^�6��Ga��� pLI��N.�)�7�_����0�(�^B���|u~Qu[��U��'��={O�Z��Q�eF?[����,,,,,n����8!>��d�"��i��:�M�_g�zO1���M�f��3�����]S�j*��JY�j�28>��y8z�
��a�g�.��K��Gt��������O1�_�P�}�K�_Y�,�M@#6=����}��������h&Wh���>r���@(6E|��w�������q�f��
_��5�P>�<F���G����2��>i�H��x����l��
J2��Bp�^�B([���r�Qt
��J�|)�Xx���eT������
��&M��}\g9�P������m����~�������~�_�i�~C�������U���������b�X��S[_�����@'��p;���R�J2�}5���G�����������;�+d?^�m�)�>F�?����+��M��
���8����;���2�%�A_ �t$�W�))9����I'?4����E���GU%�z������&�F�+���F�������Rx����(��KvS+�}!�������o�8�k,p`����s��B�X��MC�~������ ��[~������������ �&X���e��
��2��������Y~#�t�_������m���
��cr�]l�����r��~�x����B�����$�-��{�>��xb���U�X���7U�6�F�:�&h����V������\��O�����M�p����]�����.W���j�
20}��s��^"�D~��R�o���~p.R�/���Ho�_��a��������������C|oX�fX�z�<�@�7y�8���_�������f�W�{��6���|���WK�4����
�������Van_����~\���n�xP0�l�/��t��'���}��;Y�X�0p0C 6�#���"2�x�;�� ��_���Ul�8���[���f�~��c�`C��"-�������1J?,����s`�B���G��^�'�[>��
_zF��F�I�*�
!
@<����V�$�XV*�1����O&��_�����nq��v)��1���*gH%��
����axH�������!$U�-\]>�M���(bv��M�qmL��h�t	q8��^�S�����^�^����
e����Y$�Z�	��s����\k�4����M�o�'B1sp�7EF��57�?#�2kFt���?�2���������|'�����l��ni�~Q#$
�����'8�K�y/�b� ���>�� ��e��������
�A�,�]K��?j[��.��H��~^A>�"
���`����O_�w��Gl-?!Sv���=���x�|�����i$����;�������~���������E�17�/oS(=b(��6�H/����\}��Y�����J�v�
8�SI2�#@%� 0m~�����W������!e���`H/dxU���Y��!|"�� ?`�,����j��z-����Q��Y�������V��a��?+���Rz�dS�����B�\�&y0>��,|��|	�-gc6��KYV����xFC�kY�^z�Mz��<��Y.���.Ia0��&�'��vFk��5�_����/C��W��"�r��PF�)�E\r	��0�����{�k)��`�������3(��d�JDAj{�[���~��3����)�	�k��2�uM���!�A;����|��I��a�����h2k�,������;;���P��'?�Jq!����&�H����W��i�;"��J��L�?�<� ����� �~u��n�^��~�D��Sb(x�Rf�L�W6��6J�o#��� �V���@��.���@?M�_���X�zb�
�{�7���������.�����/f3B�����\PP�=�3W��7�{fu�����q��w�U�����-_3.@sIh�uQ���u��::v����Ou���dn/���;�;���H�`�����l>�P�P�c6D��&UKZ�-���=v���H�h,�nb�RE%F�n������BL���.,���v!Hl��]��R��� ��3Q|�+��(��|��/ct�U�u��.�jl''>����\��if���kGhZ�J;i�&���}�����������O-��w�CK�xF�H��|w������*�����9��`�y�������^���W(�764K_���S�_c�����M��
��9[�m�>���W]F�k�:D�8t�B�~�jv��n��������-������
3��q����=�?����|��f��4�>��2���+��i�2(n��~ip����%B�$�'�;B�j��1z���uY���w�5�g�����7"m��XJ/�w����B��0�SQJ�q	6��7/�P,RPlx��dX�f���L�����6��p�������|�����>4C/�]~�2:��T������{��2v��p�Wr����p�����Ru����6�@�	
M��t:(���l��8�r��4�����f�����?#��?C�N���#��aym�b�!"/J�!�����d
i�9�E��&����W�h������-�������kTt���rV�(W�!������i����O�d����SE����Yi-_������O��W3�l��^�������g��5r5*G'�����PI�9{)�m#�m�98��t�!G�K����SI����r��pl����S�{R�*P�����|����6���:^.;,�<C-���l�Wg�K
�.�/��p���#�%�
�y���P�Z
N�8.�^�
`��&O�5���[d�q���6?��pd,o)f^�^)�����u����rl�����u���z'q'��J>�t�_�bV?������Z0��BA}FX��L�����^b���������_���r]oH����u���O9����1�o���|\�0=�\�R�l���l�HG���c��@/�,��[u���a�
�Gu��P)iP-���=?���?��������\m0�'rG��r��/T?
�k���#���^�u��-���gaaaa�kL����2y���3D$2@5�@��!��P�a�������n��S�]����
���y������T������Y���N�96U>�D���7���~���Z�x�P���� n�WO�m���=����f���E�8��)2�<�3*/��Li)�����A��-��W[/!����e��_?��8�N�;D�$JY���/����2+����aB>���K��u�P�g7l"���t�Z�`y{9��z���8��$�ZP���#V���-��@��(����X>]�W����f�7Q���c�g32��SCiE 
"(��M�_��M��->���!�:��Q��r��h���F[$Z��&	yQ���r.��6�2�h8F������72N�Y�c�\n������ z�����T��Z�v7~���5p��s������f^g-�E���aDy���A����~�f�2��d���v���9��S�[�7��~��O���!��cr�LY 0A��Xht�I�H��X�]�/0�����	�������������`z��F
Gu;A;D����T�S�Q9v��L�prLI��\X��;����6W��lNM��.�2��������_�0�=���vP�?fr]5W�j��Z���,;���_��gme�g�%���T���.�7h�G{,' �{1���a���&.�O2W�3����	Z	U����+^���� Bp���K(���%kK�+9J�	=�-8��'�=��p��/�[����Uu��(g�H�����|�;��t�K*��2����N��H>��Wk�}b���jR?[��Z���L�E�<�%����
Y��kK�M��u��jy?�����Jq7e�����g��vJK�}�a����_|��������A��s|X�����:My������&x���p��g~:mC�(�_\X�L��'����6�
c��<�����Uh�:�m��X����,�D-qO��C������FF�sJ��o�������O�������4%���$���/U?��������e�e������,,,,,n������n��L��a@N��j/���0��7�����? �&�\�f�Z�����c��������B������kE����������?���k��>?�\��uVk�V��v,���������u3�)���7]c�����������������5������������������_��w���agg�P�M��=g>������v���U��]�m������1����[ ��3�)����8���B��.��-ch�S=���������5z~��g���M�m��r[t�e����8��m����1}
�42E$4��-�DC�?��\'Ww�46������=���Wg�����6x>,\x�������bj|�������O�/�mU������W�J*��������,�D���2���{n^�����.��?6�x���vb�kw�'S��W�y�$�� �����|3������W��?d*���H,�����	�%���{�^�14���@���~�$mRA���c�^����}�c���T�Q�����e��IRu�
��#��%�l��W���{���gd��������x���}�����o'��p�uJJ�t2I�x���@�X$�O�WJ(���6�6���hN�$����ur}t_���a?�����]#yV�g���y��	����$�X?�������{��<#��g��u����5z~��$"/rj������q����������.�+)re������	�v4y���t�>�hw�O�
f3,����5���K7�u��	������,���9,��~�"S��m���<����L-����ka�����5���o:���O|c��W��Af����<����5����f�C3����G�?4��M0=���uS���oE���a$:�����?�p����#�L���bS�g�|7�6�@?x�\*�~�?t��k]�6(�d�
��{W����W2��� ���#�3����4B�����O1�_�P7���Jv3�R�������l��B�S�N��i��,�#����#��<($��(�v������������K �>wOL9���'���Bq�T^�mR�+z����|�#05��W&9��'���3�Sz���2��	��A���,�*6��������qe�L�Ap<F�y������78��g����X\Q��"��|.(?[1�~�VWo���������{�p�r�����M��	��Y��n��_�g������9
������|7�N0�w6���*����
���z�Na;{���!��n��_3�����'��=a5�cD��G��;�/�k7V�F�t�mF����Q�����������Zk�����)��B��B��5�7M�cL�������$�z&1&��
���VwO5��xb���U�8=�S��Y-�A��a����|���\,P?�x�������6,?������{��g�~�\}��_��Nq>�]��.������������kT�n�S))����
�B�����Hv�WV��}�+������+�
PAU
:��l�A���H?$�Ii�L�"{��<����]�2��[����Y��6k���;h��;ykqEd�m�P�A~�Bd�.dx�:]P� �5���~��8H4��t��'�/|�	M�	zqT��X�;�Z)<CtTB������+ �����`���y5�j����"�G��l�"���������
�<����
�����OlN���?/|���@T�"M�Q����������q���|��I/��W��IQ��Y�m�)������O<&����0�P�ZL�.WM��M�g��F�?�=��������h&���F���,�U�M�a��a9�%� �WM��a"��}{]�F�?]�]�����!I��n�WR����?�,��r������Q
�I����;;��c��<d9_5|��kZ��3�[I�����A	���@��WK�Fxu�M�W�e��-�/CL��)�x
���l���0a�v��Z�u�_W��A�d�����P���	�"����U ���o'�O7�WUAp�w�
�<�c�-"P�X���@���rV��X�p,���7�o�`�����&t��nw�^�|���P����(����F��O�&Ey������L�W{|;(����k��T%��5x�9�SQ��
�Js�����g-�G��<I85�����3���l&�\�|*Tt;���E���V��H?���4��&3Op�&�I��-���s�����
���������i����W�Q���\@��x�E��!������Y�����$������|��'�<�#�}erF��k�2����`86~���~�������?L��uL >���|��E�8x?��0 �o>�����tB�i�j�����t��O�ar���D��4��v�j~������-di���2	~���<�_9������WY/�������������oT~�6��_X��	F������B��z��1���L��P�5��_Y����H��u��u_�b��t@��r��w,���o�g����e�k+ IDAT�:�?�w����<}�����P�!����s'm����������,�7zn�}���/�T��$?�0I"��Q7{��EYi3�U��om@^�/�������:�������{�AT���f[����<���ReT]D�zu��W`]�����C~*W{5K�m�oG�O��WKep�q�D|"�������R�t��]/�}���� 0&�QF���L|�
S���?v6�wH��}h��'�! �u�����������
8����@y��+�,&,�rT�������!���0�����3�_������kh�G�,��E:����ng����o�^��D��"�"�3>�|i��=�O5�Z*��|x�;��O1�������������CR������QH���^p��vFC�wWI$I�"�mtp]!8��\������!8�	n4�R�2�)6j����<]���%��&W,SV����>��le��t��Z`w3����Ol~��(�-��+��U��E�����d�'�q^\����yU�QV3��U%B�>@b(x�R&�z�HY��^�Dv�q��^���_�����������kE��"��2�V ]("�%���C��1����5�Y�!�p����
Zl	��pq�����m��[���t����A�|������si�\d/�F��#����m$_\}�)��U�e�b~��L�3`7nAG��\��|��JN������^N	Uu���?����i.�+2��X������3�9�p5�2�!_��u_�(~���:;Hl��*�i^�r��n���R�%��.��v^��������������+��]0�}�Ia7���n�7�x�~+Gwe��+GY� P"Gh-��+gW��X��b�
��MhL7��7�"�����J������3�C���Hx��9�y������?�x=.\����L��0!_�@���H ���ubV]�G�S�	'��[�pe2];B��T��m�����fB������2��v&Mn��]� �~��<�����g6���lc!7��l'k�+�������������j�Q-�B��[�� ��(�x���N��?�[�vM�����~�[A�/I�s���X|�F�3[���~[����F�����"��dpl���	����`�z��J�t
���'�������	YX#�0�h�)��4c�� �DP��L��9�4�=��b�n�^D��w�/vg.�:�@�D�`B�g��u�m��Nt�r�t�ZUI��[��p�H��fg���}s{Y�ZC�c��~}�m���/��j��:��]:j5��~ik#��k�.cB�J���V��hA�
�#$��VEAy�^�R���]��G���D���-&�����c�A�J��e.vk:�oS��\�b�>�^���
���F����
\���9��wke11���}����=�e7bC]��wp�6M��Z'�����>���?s�3�?�M������P��J?c�o����t�����9q�tDMf��>�8��c"G ��Zd�^��\f"���=n����?v<���d\���l\UC��8�J�$?g���j�q�&T�2~\�
��T��<8�;��&����\���oT~Gp��Ca+��m��~|�-�g�a����?yz����o�R*��T�����%��J���E����|�C?���;��;ll1�j���:��6�H?��cE�zQR������Ti�)wGD��5�:h:�}*�'�n�N�h'�]��Y�'T�*5����m���m�T�R�����4V��Wv^�8�Ln���,�v_����
y~F;��}��,C���c�	�]k�v��D�����y���+��.=Po%���O�3��\�������+��������o+��vH��s2�
�/��!�� ��Q�/����J�@��a�����J}��0��a�E��I��u���p���~\T(l���m�JN����V�Ko�#z��j�B�T�TZB��"���r��E�/7��0��������^���T��@�G):�"�UZ����C>��������ue���t��u�;�B?<���;�R2����j�U�H���y%��2J��8}l^/n�T���]�HEUO�:�,�Q�f��.�]����j�%�<�������:(�I�V����r�]�b�a�_��������I��C?M���U����G������������C�9lP�G)� 6�k�r@��Y��"�fGll��u���
�
��)hn�|��g�������G/+�����z�����j�\^fgu�����Xv�b�:L[�K��~����������[�����F��N���W����K����]��R!L�h]���tl��}d��7�[)��+"F�D
��f������s�vZ.s����q�
S>m~�n(�_z�t.�!�v���>�NXk�L�>�/�p'��z?��v7���~����Dq�$Td�J��a��<���F������mc��m����QE-�6�o��u�~L�c�T�d����<@�z2��1�����S��u�N���~{�mU���B�,�:�:���m������QC��������O���)z��;����C�W�U2�9t������9sY��L�=Ab���={��6����l��� �O�?<��x�a�����������D/>�q�f�|�����>����m�����/TT��1��?�^:���&� �S�J>�`���(��2��v�=�o$���n�����~.�Q��������9m�~b�-09:�������|N�]�0=_���N>m�cdG;�
H�8aI�%��Mt��'�������!�>����2����{��d�W/{��#2��FN���������O����FfXB�x��#���������-��Ez\�\��1�>�k�fU�����K�#
1:9K������+YV7d\c	b����;������0'��^����}��3�q���(R5O���|��)W��:_����}*K��#�:����`@&����f����
_(�~p��|�*�@��~:&������L��s|�c��h?��![����M)*����m8��<�F� _O8F,���`��c$����\�8��|A�-5��u��@+�w�G�$���4Ocj���;��]�Q�������!��Te��W�H���~�����_z��-�6?�����f���'����x�1F�><��X	���}�7�~���^�Wj�� �(U��U�P����n#�>���1~i����/t�<�N��������l��)�� ��n�N7OW�j�O�cM�J6��k���2�u��++s�k��s7Fp����(����:8k�3����r�����<��>���R������po5�z���U���j�2���qN��1e%���tg����3��z��[f����`7�g���?�d�dy���}BN���Q&����LymF;�l�
���>6W����.�A��s����_Wxq�h�1�����|'��G������tB�<]�MY��T�����:���/��S�|�'���&|G�K$���g���e��:��Ot���>+I�����M �����-�c�������Y�����Q��V�*�Q���<����1�S/��{���<��F/�l���mUL���m���s�:AS��\Lq��"�w�h�@=��]���2<��N�@f=Gbb����tt��A����\�^>�����N�X��.��������������w"r��
u��6��3F�q��3�?�!>�����r{����O�<_��l�?V2���^�����.���=��O���7:�rL��I��Z\���l5b��s�o���2~��)e���#���L��8���m���!���v�L����HT?�>B�a��~D���g_t�� �7�d�����\���O#�|{����q�������N��o��vk�����xJgGZ+J|e�2HQ	g�=�WR����;�'�k��{�l|�S�O������?��=�����������`b��8����M-n���;��y�����A,�XX��`b�qm��{���#F���{S����e&_����d�	�/�~o��������x�O�������}�������`�
��yB�n
�-,,,�E�/W�j�;��S1 ��S���w����O5�����X��Y�����[XXX|�X_��c���������s�������;�z�/koaaaaaaaaaaaaa��������E��ek�q~��H�3������EoX��#~�7w��=�[;z��3��-����Y�����������g�m2$��+���������s���?_��}�����21w�I�����dv(��~7g�~�H�L�H�r����^}A�=
�s�?��D���vb��g@����5S��l
�������o�8�H�,������;f�u����.��<a�'��y������$�W�q�0����$�9X��|�>Wp�xtI@/���L�0w�z��#w&������B���c��Y?Gw#E����+��)z��F��^7���@�����}�7����o��O���8a��?nP�\�_��������on�~}�G�t�����_���>���<�N���c�������h\�ef����,)��4��
�Oen�h_�Jn+�R����M��d��@���n�eP���v����,��)�f��x��e���`o�]�`!��;8	m�$����
V�
U�Q�,���2��������g����LxeR��|�H�3���������^��Qz��������l�&������-��M����?FX���r�������u�G+:�M.3��v�qY���b~���,5W�<6B��C���b c&�+
���<�T�l��-����i��[������30�H�Ybc�`G/�K����6��Ba	������+��|f��1�6�7��eo������ ^4y�{�L������NxQU�xL>��-%�P�^J�Q�g�������*�R
<���,H���l���sr����W���oa'M���"�v��� >7���u3t+�Sl��X���-h���X}A���1�:�#k
�88���=���{���lg����5@�rd��HLn.�^?�Ru�������X������?���#���t�����	��O|��I+��W�����Q�a����|����J�HcN�Q�A����G�#���<~qj�E:>���"�;ItR>�![<)���wG"zm�Tj��7���-�w����w��m2��8j*������h&L�W�m�����~{�o��l-��
��^���m�����;��&������.3�������8&��=V
����I���^�g?��^i�����Fg?�?�o�HQ/{SSd���Z~&�S;n���C�������?�>�6�?�j_��n�h���@x��l�L�U�!�5�K���zIr��`
�S���O��U�5Rd���8�W(� �����Tr	�h"�-{�(����sf��dS��(U��8��9�GSl�����m�����o�����7E���]���H?\E>�!
�
�����l���#s<�YO>"yJ,`C�����L����Gd+N����3��P(m�W��<l��-p L�W���|����m�����S;@)w~�����X~���N�Q�&�#
@1�^1J4,�|:%z�a	R�)�����1�����;���;>���-ya�RH+z�:T��#�V����wg
[0�f���J��;��S���{p�?�W'��q�;?��T��(a�F~��;[�G�����o�����z��Bm_���]����Y��r=�AI#,
��)�'�������y���:���h�3���������,�'��WN�6�g `���������W$��{�����D�k���Ho��Kh�-���i����_s�6�?���a��KR|���z���F�o;�tj���7���}�������������Kt��xG���}��/����������u��m�s�������^��0>�n�h���^p
�V`�P�tE����Z�J������4����,��/�!���t�l�PIm	���������;������2�;Kl�����.�O�+������~g!Xd���*��t����]�A��V�/�o��J�%�B���)�p�e��@b8t�rv�L�;�[�~#��Q���w�2q��n��U��k7/�:�`���N@/�XM�.wF�����7�%�����������P�8�Er�e�c#m�5pG�j����o/���OQVq}���?�tPa��>6��KS.Mn"�^1�O�����W+��y�3��x|j\���w,L~k��!��$�xBe*I�D�Y9����B�R�&��;1��|��d�Ga���W���{RMf�����i�M�����E}�T����
�!_2h?E�;B��o#�o��U������/F����Q��o�m��j��C��iL���c�~��t����R���y#��\�?2!������0��g���7�^w�6��=���%m�{����6�M�K�,��}���!y��l�Ib������y����p�w�a-��j�^�v�C��"UU���`��(8��������W��}CG����*�.�^)^����Wv����z�68Dn9t�bC0s��Q�3�u���Zw���
6Jp;��z�u��F�<�[���|��0�p4J,f/�c����	�'���r�]5�E�M3t��������o��-}��������k{*�C�������Jx+�"����poS���	��ZLM���X���>{�����:+u�YZ���������x�~�����"���sab��"�"��*�����9�d�i���%���x�N��u�o�>t�����������V�gO���o}�W�^���b�1�m|`���}�����n������}�(f�����~�7]a�}���i��4�J��	�����M������sO���W
��CS��^�������__�!����(�@���2�P���;2	�	�rgQ��\���:+����������t��
	t�b�����j�IQ2N�U
M��t:(e���lC�:��i����\m�g��Z�wf��U3��(�!� !�����)7)�J��J�?�$�YVzu���t!�S����<�B��H����D��l7�M����s(�E�J8�C��E�76����T�k��v���[Gp��Ca+�0(�jTu����j�T�O�����g�(���l��zQR����W,��H�O�;"����R�����7�>�h�~�L����������{��3��vm��Kj�M���������g�Q|�F>��o��M�\S�}�}��3�������V����
�s/��FZ����F��q�hC�����;��;l
��9d����R����Z�d�Q|�����i2���e��>��e��zY�[<�b���J.(6)�Q�Kt��Km�����4���.���^�����-1��U��U'��
�z00 rK����s*���� >�>�����[�-��h;���IL�_-��c����_	���#�����s$���]��R�nZ?�}du�@8��R��\���Q5�n���s���2g���SlC������j�(�'��n�d���P-_u�f�k�zqS�����q��/X�����1����~����}�����v����R#�ot����V��j&���������5�_R�n����
1�?^7]���94�O���<��^-�g�}�������uxE�f�����������-�<~���m�?w=~���=����G�_��
x�1ba?�.#!/NM��|���S��6��)U�*0��3�1>:?o��O�PE;K2	�4}������czXBt�x�!F'g��m����7y�z�qO�rY��L�=Ab�����y�F�3Df/�w8��_�%J���H�������{�x��v����M��$6ct(�$���a&��������^1!��[����p�<�Gc�}
��/��%��
���.<�LzY���������t�l� IDATo�����/D?8�H>@�F� +����L���W����)������>�`��|����%F�t�_=�y/{�#cr��(����������;^�~B�A}������O�?<��x�a�t8�AV@��	K".��Pl����\8�:gX�6����������#����������3���u���U��������d��2�Q|���f����}�o�oC����MW�/��ik��7=��V�3��[Q�'�����	�\�<C�"������������!~���m�?w=~���}�����V��_=v����prLY��Z\�r�D����#6j�W�y_�M�FU+#��!{��~R�lMO��&�
���n��KS�W�Ggy�v�VFUx[?��(�)�
�#*�|�d�,���-��]�\�����3����e�����U�2���u���y��"���+r��2_�Q�F�[�S&B��0n��]?������1<F�{�������X<J������e�
c� ����L��Ns;c�����I�i�?��ND.T�V���p�&cb�*"�w����?��]�������V ��#11��wv����x����]n��w�gw�����\������Y�_�V��<fQX v��8�(���j��8�����v��2J�%����o�+���D� n�9�X���+�8��M��f���a6���������o��F�
�k�u�h��L�/-�������h��0�}�c�6�T�C���Y�4�0�9�����~��w

Screenshot from 2025-10-27 17-48-42.png (image/png)

#14

Melanie Plageman

melanieplageman@gmail.com

2 months ago

In reply to: Chao Li (#12)

7 attachment(s)

Re: Checkpointer write combining

Thanks for continuing to review! I've revised the patches to
incorporate all of your feedback except for where I mention below.

There were failures in CI due to issues with max batch size, so
attached v8 also seeks to fix those.

- Melanie

On Thu, Oct 16, 2025 at 12:25 AM Chao Li <li.evan.chao@gmail.com> wrote:

3 - 0003
```
+/*
+ * Return the next buffer in the ring or InvalidBuffer if the current sweep is
+ * over.
+ */
+Buffer
+StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
+{
+       if (++(*sweep_cursor) >= strategy->nbuffers)
+               *sweep_cursor = 0;
+
+       return strategy->buffers[*sweep_cursor];
+}
```
Feels the function comment is a bit confusing, because the function code doesn’t really perform sweep, the function is just a getter. InvalidBuffer just implies the current sweep is over.

Maybe rephrase to something like: “Return the next buffer in the range. If InvalidBuffer is returned, that implies the current sweep is done."

Yes, actually I think having these helpers mention the sweep is more
confusing than anything else. I've revised them to be named more
generically and updated the comments accordingly.
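
As a rough illustration of the getter-versus-sweep distinction raised above, here is a minimal standalone sketch. The `Ring` type, `INVALID_BUFFER`, `ring_next()` and `ring_count_until()` are hypothetical stand-ins invented for this example, not the names used in the patch. The getter only advances a cursor with wraparound; the caller decides when the pass is over.

```c
#include <assert.h>

/* Hypothetical stand-ins; these are not the PostgreSQL definitions. */
typedef int Buffer;
#define INVALID_BUFFER 0

typedef struct Ring
{
    int         nbuffers;
    Buffer      buffers[8];
} Ring;

/* Advance the cursor by one slot, wrapping at the end, and return that slot. */
static Buffer
ring_next(const Ring *ring, int *cursor)
{
    assert(ring->nbuffers > 0);

    if (++(*cursor) >= ring->nbuffers)
        *cursor = 0;
    return ring->buffers[*cursor];
}

/*
 * Count the buffers seen after `cursor` until we come back around to
 * sweep_end or hit an unused slot. The stopping rule lives here, in the
 * caller, and not in the getter. The loop is bounded by the ring size.
 */
static int
ring_count_until(const Ring *ring, int cursor, Buffer sweep_end)
{
    int         visited = 0;

    for (int i = 0; i < ring->nbuffers; i++)
    {
        Buffer      buf = ring_next(ring, &cursor);

        if (buf == sweep_end || buf == INVALID_BUFFER)
            break;
        visited++;
    }
    return visited;
}
```

With a four-slot ring and the cursor sitting on the victim buffer, the caller visits the other three slots and stops when it arrives back at the victim.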

5 - 0004
```
+uint32
+StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
+{
+       uint32          max_possible_buffer_limit;
+       uint32          max_write_batch_size;
+       int                     strategy_pin_limit;
+
+       max_write_batch_size = io_combine_limit;
+
+       strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+       max_possible_buffer_limit = GetPinLimit();
+
+       max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+       max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+       max_write_batch_size = Max(1, max_write_batch_size);
+       max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
+       Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
+       return max_write_batch_size;
+}
```
This implementation is hard to understand. I tried to simplify it:
```
uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
    int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    uint32 max_write_batch_size = Min(GetPinLimit(), (uint32)strategy_pin_limit);

    /* Clamp to io_combine_limit and enforce minimum of 1 */
    if (max_write_batch_size > io_combine_limit)
        max_write_batch_size = io_combine_limit;
    if (max_write_batch_size == 0)
        max_write_batch_size = 1;

    Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
    return max_write_batch_size;
}
```

I agree that the implementation was hard to understand. I've not quite
gone with your version but I have rewritten it like this:

uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
    uint32 max_write_batch_size = Min(io_combine_limit,
                                      MAX_IO_COMBINE_LIMIT);
    int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    uint32 max_possible_buffer_limit = GetPinLimit();

    /* Identify the minimum of the above */
    max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
    max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);

    /* Must allow at least 1 IO for forward progress */
    max_write_batch_size = Max(1, max_write_batch_size);

    return max_write_batch_size;
}

Is this better?
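
For anyone who wants to try the clamping order outside the tree, here is a minimal standalone sketch of the same idea. It is not the patch code: `SKETCH_MAX_COMBINE`, `min_u32()` and the plain-integer parameters are hypothetical stand-ins for `MAX_IO_COMBINE_LIMIT`, `Min()` and the values returned by the pin-limit functions, and the explicit handling of a non-positive signed pin limit is an assumption added for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical compile-time cap, standing in for MAX_IO_COMBINE_LIMIT. */
#define SKETCH_MAX_COMBINE 128

static uint32_t
min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/*
 * Pick the smallest of the configured combine limit, the compile-time cap,
 * the strategy's pin limit and the backend's pin limit, but never below 1.
 */
static uint32_t
max_write_batch_size(uint32_t combine_limit, int strategy_pin_limit,
                     uint32_t backend_pin_limit)
{
    uint32_t    result = min_u32(combine_limit, SKETCH_MAX_COMBINE);

    /* Treat a non-positive pin limit as zero before mixing signedness. */
    uint32_t    strategy_limit =
        strategy_pin_limit > 0 ? (uint32_t) strategy_pin_limit : 0;

    result = min_u32(result, strategy_limit);
    result = min_u32(result, backend_pin_limit);

    /* Must allow at least one block so the writer makes forward progress. */
    if (result < 1)
        result = 1;

    assert(result >= 1 && result <= SKETCH_MAX_COMBINE);
    return result;
}
```

Taking all the minimums first and applying the floor of 1 last keeps the floor from being undone by a later clamp, which is the ordering the rewritten function above follows.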

- Melanie

Attachments:

v8-0004-Write-combining-for-BAS_BULKWRITE.patch (text/x-patch; charset=US-ASCII)

From 4640e59ddb4718771219b321f9c269e707f3f707 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 13:42:47 -0400
Subject: [PATCH v8 4/7] Write combining for BAS_BULKWRITE

Implement write combining for users of the bulkwrite buffer access
strategy (e.g. COPY FROM). When the buffer access strategy needs to
clean a buffer for reuse, it already opportunistically flushes some
other buffers. Now, combine any contiguous blocks from the same relation
into larger writes and issue them with smgrwritev().

The performance benefit for COPY FROM is mostly noticeable for multiple
concurrent COPY FROMs because a single COPY FROM is either CPU bound or
bound by WAL writes.

The infrastructure for flushing larger batches of IOs will be reused by
checkpointer and other processes doing writes of dirty data.

XXX: Because this sets in-place checksums for batches, it is not
committable until additional infrastructure goes in place.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_bcWRvRwZUop_d9vzF9nHAiT%2B-uPzkJ%3DS3ShZ1GqeAYOw%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 217 ++++++++++++++++++++++++--
 src/backend/storage/buffer/freelist.c |  23 +++
 src/backend/storage/page/bufpage.c    |  20 +++
 src/backend/utils/probes.d            |   2 +
 src/include/storage/buf_internals.h   |  34 +++-
 src/include/storage/bufpage.h         |   2 +
 src/tools/pgindent/typedefs.list      |   1 +
 7 files changed, 286 insertions(+), 13 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 2de68e78b4e..a5eff9fa2b6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -537,7 +537,11 @@ static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 static BufferDesc *NextStrategyBufToFlush(BufferAccessStrategy strategy,
 										  Buffer sweep_end,
 										  XLogRecPtr *lsn, int *sweep_cursor);
-
+static void FindFlushAdjacents(BufferAccessStrategy strategy, Buffer sweep_end,
+							   BufferDesc *batch_start,
+							   uint32 max_batch_size,
+							   BufWriteBatch *batch,
+							   int *sweep_cursor);
 static bool BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn);
 static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
 												   RelFileLocator *rlocator, bool skip_pinned,
@@ -4311,10 +4315,91 @@ BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
 }
 
 
+
+/*
+ * Given a buffer descriptor, start, from a strategy ring, strategy, that
+ * supports eager flushing, find additional buffers from the ring that can be
+ * combined into a single write batch with this buffer.
+ *
+ * max_batch_size is the maximum number of blocks that can be combined into a
+ * single write in general. This function, based on the block number of start,
+ * will determine the maximum IO size for this particular write given how much
+ * of the file remains. max_batch_size is provided by the caller so it doesn't
+ * have to be recalculated for each write.
+ *
+ * batch is an output parameter that this function will fill with the needed
+ * information to write this IO.
+ *
+ * This function will pin and content lock all of the buffers that it
+ * assembles for the IO batch. The caller is responsible for issuing the IO.
+ */
+static void
+FindFlushAdjacents(BufferAccessStrategy strategy, Buffer sweep_end,
+				   BufferDesc *batch_start,
+				   uint32 max_batch_size,
+				   BufWriteBatch *batch,
+				   int *sweep_cursor)
+{
+	BlockNumber limit;
+	uint32		buf_state;
+
+	Assert(batch_start);
+	batch->bufdescs[0] = batch_start;
+
+	buf_state = LockBufHdr(batch_start);
+	batch->max_lsn = BufferGetLSN(batch_start);
+	UnlockBufHdr(batch_start, buf_state);
+
+	batch->start = batch->bufdescs[0]->tag.blockNum;
+	Assert(BlockNumberIsValid(batch->start));
+	batch->n = 1;
+	batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+	batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
+	batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+	limit = smgrmaxcombine(batch->reln, batch->forkno, batch->start);
+	limit = Min(max_batch_size, limit);
+	limit = Min(GetAdditionalPinLimit(), limit);
+
+	/*
+	 * It's possible we're not allowed any more pins or there aren't more
+	 * blocks in the target relation. In this case, just return. Our batch
+	 * will have only one buffer.
+	 */
+	if (limit <= 0)
+		return;
+
+	/* Now assemble a run of blocks to write out. */
+	for (; batch->n < limit; batch->n++)
+	{
+		Buffer		bufnum;
+
+		if ((bufnum =
+			 StrategyNextBuffer(strategy, sweep_cursor)) == sweep_end)
+			break;
+
+		/*
+		 * For BAS_BULKWRITE, once you hit an InvalidBuffer, the remaining
+		 * buffers in the ring will be invalid.
+		 */
+		if (!BufferIsValid(bufnum))
+			break;
+
+		/* Stop when we encounter a buffer that will break the run */
+		if ((batch->bufdescs[batch->n] =
+			 PrepareOrRejectEagerFlushBuffer(bufnum,
+											 batch->start + batch->n,
+											 &batch->rlocator,
+											 true,
+											 &batch->max_lsn)) == NULL)
+			break;
+	}
+}
+
 /*
  * Returns the buffer descriptor of the buffer containing the next block we
  * should eagerly flush or NULL when there are no further buffers to consider
- * writing out.
+ * writing out. This will be the start of a new batch of buffers to write out.
  */
 static BufferDesc *
 NextStrategyBufToFlush(BufferAccessStrategy strategy,
@@ -4361,7 +4446,6 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 {
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
-	bool		first_buffer = true;
 
 	/* Set up this victim buffer to be flushed */
 	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
@@ -4371,19 +4455,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 	{
 		Buffer		sweep_end = BufferDescriptorGetBuffer(bufdesc);
 		int			cursor = StrategyGetCurrentIndex(strategy);
+		uint32		max_batch_size = StrategyMaxWriteBatchSize(strategy);
+
+		/* Pin our victim again so it stays ours even after batch released */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+		IncrBufferRefCount(BufferDescriptorGetBuffer(bufdesc));
 
 		/* Clean victim buffer and find more to flush opportunistically */
 		do
 		{
-			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
-			content_lock = BufferDescriptorGetContentLock(bufdesc);
-			LWLockRelease(content_lock);
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &bufdesc->tag);
-			/* We leave the first buffer pinned for the caller */
-			if (!first_buffer)
-				UnpinBuffer(bufdesc);
-			first_buffer = false;
+			BufWriteBatch batch;
+
+			FindFlushAdjacents(strategy, sweep_end, bufdesc, max_batch_size,
+							   &batch, &cursor);
+			FlushBufferBatch(&batch, io_context);
+			CompleteWriteBatchIO(&batch, io_context, &BackendWritebackContext);
 		} while ((bufdesc = NextStrategyBufToFlush(strategy, sweep_end,
 												   &max_lsn, &cursor)) != NULL);
 	}
@@ -4528,6 +4615,70 @@ except_unpin_buffer:
 	return NULL;
 }
 
+/*
+ * Given a prepared batch of buffers, write them out as a vector.
+ */
+void
+FlushBufferBatch(BufWriteBatch *batch,
+				 IOContext io_context)
+{
+	BlockNumber blknums[MAX_IO_COMBINE_LIMIT];
+	Block		blocks[MAX_IO_COMBINE_LIMIT];
+	instr_time	io_start;
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+
+	if (!XLogRecPtrIsInvalid(batch->max_lsn))
+		XLogFlush(batch->max_lsn);
+
+	if (batch->reln == NULL)
+		batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+#ifdef USE_ASSERT_CHECKING
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		XLogRecPtr	lsn;
+
+		Assert(!BufferNeedsWALFlush(batch->bufdescs[i], &lsn));
+	}
+#endif
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_START(batch->forkno,
+											  batch->reln->smgr_rlocator.locator.spcOid,
+											  batch->reln->smgr_rlocator.locator.dbOid,
+											  batch->reln->smgr_rlocator.locator.relNumber,
+											  batch->reln->smgr_rlocator.backend,
+											  batch->n);
+
+	/*
+	 * XXX: All blocks should be copied and then checksummed but doing so
+	 * takes a lot of extra memory and a future patch will eliminate this
+	 * requirement.
+	 */
+	for (BlockNumber i = 0; i < batch->n; i++)
+	{
+		blknums[i] = batch->start + i;
+		blocks[i] = BufHdrGetBlock(batch->bufdescs[i]);
+	}
+
+	PageSetBatchChecksumInplace((Page *) blocks, blknums, batch->n);
+
+	io_start = pgstat_prepare_io_time(track_io_timing);
+
+	smgrwritev(batch->reln, batch->forkno,
+			   batch->start, (const void **) blocks, batch->n, false);
+
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context, IOOP_WRITE,
+							io_start, batch->n, BLCKSZ);
+
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Prepare the buffer with bufdesc for writing. Returns true if the buffer
  * actually needs writing and false otherwise. lsn returns the buffer's LSN if
@@ -4688,6 +4839,48 @@ FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 	LWLockRelease(BufferDescriptorGetContentLock(buf));
 }
 
+/*
+ * Given a previously initialized batch with buffers that have already been
+ * flushed, terminate the IO on each buffer and then unlock and unpin them.
+ * This assumes all the buffers were locked and pinned. wb_context will be
+ * modified.
+ */
+void
+CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+					 WritebackContext *wb_context)
+{
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+	pgBufferUsage.shared_blks_written += batch->n;
+
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		Buffer		buffer = BufferDescriptorGetBuffer(batch->bufdescs[i]);
+
+		errcallback.arg = batch->bufdescs[i];
+
+		/* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state. */
+		TerminateBufferIO(batch->bufdescs[i], true, 0, true, false);
+		LWLockRelease(BufferDescriptorGetContentLock(batch->bufdescs[i]));
+		ReleaseBuffer(buffer);
+		ScheduleBufferTagForWriteback(wb_context, io_context,
+									  &batch->bufdescs[i]->tag);
+	}
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_DONE(batch->forkno,
+											 batch->reln->smgr_rlocator.locator.spcOid,
+											 batch->reln->smgr_rlocator.locator.dbOid,
+											 batch->reln->smgr_rlocator.locator.relNumber,
+											 batch->reln->smgr_rlocator.backend,
+											 batch->n, batch->start);
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * RelationGetNumberOfBlocksInFork
  *		Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index c9cf21bdae1..b70851e59d7 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -775,6 +775,29 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	return NULL;
 }
 
+
+/*
+ * Determine the largest IO we can assemble from the given strategy ring given
+ * strategy-specific as well as global constraints on the number of pinned
+ * buffers and max IO size.
+ */
+uint32
+StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
+{
+	uint32		max_write_batch_size = Min(io_combine_limit, MAX_IO_COMBINE_LIMIT);
+	int			strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+	uint32		max_possible_buffer_limit = GetPinLimit();
+
+	/* Identify the minimum of the above */
+	max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+	max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+
+	/* Must allow at least 1 IO for forward progress */
+	max_write_batch_size = Max(1, max_write_batch_size);
+
+	return max_write_batch_size;
+}
+
 /*
  * AddBufferToRing -- add a buffer to the buffer ring
  *
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index aac6e695954..7c2ec99f939 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
 	((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
 }
+
+/*
+ * A helper to set multiple blocks' checksums
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, const BlockNumber *blknos, uint32 length)
+{
+	/* If we don't need a checksum, just return */
+	if (!DataChecksumsEnabled())
+		return;
+
+	for (uint32 i = 0; i < length; i++)
+	{
+		Page		page = pages[i];
+
+		if (PageIsNew(page))
+			continue;
+		((PageHeader) page)->pd_checksum = pg_checksum_page(page, blknos[i]);
+	}
+}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index e9e413477ba..36dd4f8375b 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -61,6 +61,8 @@ provider postgresql {
 	probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
 	probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
 	probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
+	probe buffer__batch__flush__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
+	probe buffer__batch__flush__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
 
 	probe buffer__checkpoint__start(int);
 	probe buffer__checkpoint__sync__start();
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 409b52b3d48..799c447f88e 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -418,6 +418,34 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 	ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
 }
 
+/*
+ * Used to write out multiple blocks at a time in a combined IO. bufdescs
+ * contains buffer descriptors for buffers containing adjacent blocks of the
+ * same fork of the same relation.
+ */
+typedef struct BufWriteBatch
+{
+	RelFileLocator rlocator;
+	ForkNumber	forkno;
+	SMgrRelation reln;
+
+	/*
+	 * The BlockNumber of the first block in the run of contiguous blocks to
+	 * be written out as a single IO.
+	 */
+	BlockNumber start;
+
+	/*
+	 * While assembling the buffers, we keep track of the maximum LSN so that
+	 * we can flush WAL through this LSN before flushing the buffers.
+	 */
+	XLogRecPtr	max_lsn;
+
+	/* The number of valid buffers in bufdescs */
+	uint32		n;
+	BufferDesc *bufdescs[MAX_IO_COMBINE_LIMIT];
+} BufWriteBatch;
+
 /*
  * Internal buffer management routines
  */
@@ -431,6 +459,7 @@ extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
 										  IOContext io_context, BufferTag *tag);
+extern void FlushBufferBatch(BufWriteBatch *batch, IOContext io_context);
 
 extern void TrackNewBufferPin(Buffer buf);
 
@@ -442,9 +471,12 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 /* freelist.c */
 extern bool StrategySupportsEagerFlush(BufferAccessStrategy strategy);
+extern uint32 StrategyMaxWriteBatchSize(BufferAccessStrategy strategy);
 extern Buffer StrategyNextBuffer(BufferAccessStrategy strategy,
-								 int *cursor);
+								 int *sweep_cursor);
 extern int	StrategyGetCurrentIndex(BufferAccessStrategy strategy);
+extern void CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+								 WritebackContext *wb_context);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index abc2cf2a020..29a400a71eb 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -506,5 +506,7 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 									const void *newtup, Size newsize);
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern void PageSetBatchChecksumInplace(Page *pages, const BlockNumber *blknos,
+										uint32 length);
 
 #endif							/* BUFPAGE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 018b5919cf6..e1210029b3e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -350,6 +350,7 @@ BufferManagerRelation
 BufferStrategyControl
 BufferTag
 BufferUsage
+BufWriteBatch
 BuildAccumulator
 BuiltinScript
 BulkInsertState
-- 
2.43.0
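
The run assembly in FindFlushAdjacents() above can be sketched in isolation. This is a simplified model, not the patch's code: RingEntry and assemble_run() are hypothetical names, the ring is a flat array with no wraparound or sweep cursor, and pinning, locking and LSN tracking are left out. It only shows the stopping rule: extend the batch while the next buffer is dirty and holds the next consecutive block, up to the limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a ring slot: the block it holds and its state. */
typedef struct RingEntry
{
	uint32_t	blockno;
	bool		dirty;
} RingEntry;

/*
 * Count how many entries, starting at ring[start], form a run of dirty
 * buffers holding consecutive block numbers. The run is capped at limit.
 * ring[start] is assumed to be dirty already (it is the batch start).
 */
static uint32_t
assemble_run(const RingEntry *ring, size_t nentries, size_t start,
			 uint32_t limit)
{
	uint32_t	n;

	assert(start < nentries);
	assert(limit >= 1);

	/* The first buffer always belongs to the batch. */
	n = 1;

	while (n < limit && start + n < nentries)
	{
		const RingEntry *next = &ring[start + n];

		/* Stop at the first buffer that would break the run. */
		if (!next->dirty || next->blockno != ring[start].blockno + n)
			break;
		n++;
	}

	return n;
}
```

With entries holding blocks 10, 11, 12, 20, a call starting at index 0 with a limit of 8 yields a run of 3, and the same call with a limit of 2 yields 2.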

v8-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch (text/x-patch; charset=US-ASCII)

From 704280b5ce18697b24051f0413b97b48f0fcf1f3 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 10:53:48 -0400
Subject: [PATCH v8 1/7] Refactor goto into for loop in GetVictimBuffer()

GetVictimBuffer() implemented a loop to optimistically lock a clean
victim buffer using a goto. Future commits will add batch flushing
functionality to GetVictimBuffer. The new logic works better with
standard for loop flow control.

This commit is only a refactor and does not introduce any new
functionality.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
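
The control-flow change described in the commit message can be illustrated with a toy model. This is only a sketch under assumptions: Candidate and claim_victim() are hypothetical and stand in for the real buffer state checks. Each failed check uses continue to pick another candidate, where the old code used "goto again", and break leaves the loop once a victim has been claimed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical candidate state; not a PostgreSQL structure. */
typedef struct Candidate
{
	bool		lockable;		/* conditional lock acquisition succeeds */
	bool		still_unused;	/* nobody pinned or dirtied it meanwhile */
} Candidate;

/*
 * Return the index of the first candidate that passes every check, or -1
 * if the candidates run out.
 */
static int
claim_victim(const Candidate *cands, int ncands)
{
	int			next = 0;
	int			victim = -1;

	for (;;)
	{
		int			cur;

		if (next >= ncands)
			break;				/* nothing left to try */
		cur = next++;

		/* Someone else holds the lock: give it up and try another. */
		if (!cands[cur].lockable)
			continue;

		/* Lost a race with a concurrent user: try another. */
		if (!cands[cur].still_unused)
			continue;

		victim = cur;
		break;
	}

	return victim;
}
```

The real loop has no exhaustion case because StrategyGetBuffer() always returns a buffer. The bound here exists only to keep the toy finite.
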
 src/backend/storage/buffer/bufmgr.c   | 189 ++++++++++++--------------
 src/backend/storage/buffer/freelist.c |  20 ++-
 src/include/storage/buf_internals.h   |   5 +
 3 files changed, 111 insertions(+), 103 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e8544acb784..1fadeddf505 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -68,10 +68,6 @@
 #include "utils/timestamp.h"
 
 
-/* Note: these two macros only work on shared buffers, not local ones! */
-#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
-#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
-
 /* Note: this macro only works on local buffers, not shared ones! */
 #define LocalBufHdrGetBlock(bufHdr) \
 	LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]
@@ -2325,125 +2321,116 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
-	/* we return here if a prospective victim buffer gets used concurrently */
-again:
-
-	/*
-	 * Select a victim buffer.  The buffer is returned pinned and owned by
-	 * this backend.
-	 */
-	buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
-	buf = BufferDescriptorGetBuffer(buf_hdr);
-
-	/*
-	 * We shouldn't have any other pins for this buffer.
-	 */
-	CheckBufferIsPinnedOnce(buf);
-
-	/*
-	 * If the buffer was dirty, try to write it out.  There is a race
-	 * condition here, in that someone might dirty it after we released the
-	 * buffer header lock above, or even while we are writing it out (since
-	 * our share-lock won't prevent hint-bit updates).  We will recheck the
-	 * dirty bit after re-locking the buffer header.
-	 */
-	if (buf_state & BM_DIRTY)
+	/* Select a victim buffer using an optimistic locking scheme. */
+	for (;;)
 	{
-		LWLock	   *content_lock;
 
-		Assert(buf_state & BM_TAG_VALID);
-		Assert(buf_state & BM_VALID);
+		/* Attempt to claim a victim buffer. Buffer is returned pinned. */
+		buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+		buf = BufferDescriptorGetBuffer(buf_hdr);
 
 		/*
-		 * We need a share-lock on the buffer contents to write it out (else
-		 * we might write invalid data, eg because someone else is compacting
-		 * the page contents while we write).  We must use a conditional lock
-		 * acquisition here to avoid deadlock.  Even though the buffer was not
-		 * pinned (and therefore surely not locked) when StrategyGetBuffer
-		 * returned it, someone else could have pinned and exclusive-locked it
-		 * by the time we get here. If we try to get the lock unconditionally,
-		 * we'd block waiting for them; if they later block waiting for us,
-		 * deadlock ensues. (This has been observed to happen when two
-		 * backends are both trying to split btree index pages, and the second
-		 * one just happens to be trying to split the page the first one got
-		 * from StrategyGetBuffer.)
+		 * We shouldn't have any other pins for this buffer.
 		 */
-		content_lock = BufferDescriptorGetContentLock(buf_hdr);
-		if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
-		{
-			/*
-			 * Someone else has locked the buffer, so give it up and loop back
-			 * to get another one.
-			 */
-			UnpinBuffer(buf_hdr);
-			goto again;
-		}
+		CheckBufferIsPinnedOnce(buf);
 
 		/*
-		 * If using a nondefault strategy, and writing the buffer would
-		 * require a WAL flush, let the strategy decide whether to go ahead
-		 * and write/reuse the buffer or to choose another victim.  We need a
-		 * lock to inspect the page LSN, so this can't be done inside
-		 * StrategyGetBuffer.
+		 * If the buffer was dirty, try to write it out.  There is a race
+		 * condition here, in that someone might dirty it after we released
+		 * the buffer header lock above, or even while we are writing it out
+		 * (since our share-lock won't prevent hint-bit updates).  We will
+		 * recheck the dirty bit after re-locking the buffer header.
 		 */
-		if (strategy != NULL)
+		if (buf_state & BM_DIRTY)
 		{
-			XLogRecPtr	lsn;
+			LWLock	   *content_lock;
 
-			/* Read the LSN while holding buffer header lock */
-			buf_state = LockBufHdr(buf_hdr);
-			lsn = BufferGetLSN(buf_hdr);
-			UnlockBufHdr(buf_hdr, buf_state);
+			Assert(buf_state & BM_TAG_VALID);
+			Assert(buf_state & BM_VALID);
 
-			if (XLogNeedsFlush(lsn)
-				&& StrategyRejectBuffer(strategy, buf_hdr, from_ring))
+			/*
+			 * We need a share-lock on the buffer contents to write it out
+			 * (else we might write invalid data, eg because someone else is
+			 * compacting the page contents while we write).  We must use a
+			 * conditional lock acquisition here to avoid deadlock.  Even
+			 * though the buffer was not pinned (and therefore surely not
+			 * locked) when StrategyGetBuffer returned it, someone else could
+			 * have pinned and exclusive-locked it by the time we get here. If
+			 * we try to get the lock unconditionally, we'd block waiting for
+			 * them; if they later block waiting for us, deadlock ensues.
+			 * (This has been observed to happen when two backends are both
+			 * trying to split btree index pages, and the second one just
+			 * happens to be trying to split the page the first one got from
+			 * StrategyGetBuffer.)
+			 */
+			content_lock = BufferDescriptorGetContentLock(buf_hdr);
+			if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				/*
+				 * Someone else has locked the buffer, so give it up and loop
+				 * back to get another one.
+				 */
+				UnpinBuffer(buf_hdr);
+				continue;
+			}
+
+			/*
+			 * If using a nondefault strategy, and writing the buffer would
+			 * require a WAL flush, let the strategy decide whether to go
+			 * ahead and write/reuse the buffer or to choose another victim.
+			 * We need the content lock to inspect the page LSN, so this can't
+			 * be done inside StrategyGetBuffer.
+			 */
+			if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
 			{
 				LWLockRelease(content_lock);
 				UnpinBuffer(buf_hdr);
-				goto again;
+				continue;
 			}
-		}
 
-		/* OK, do the I/O */
-		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-		LWLockRelease(content_lock);
+			/* OK, do the I/O */
+			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
+			LWLockRelease(content_lock);
 
-		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-									  &buf_hdr->tag);
-	}
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &buf_hdr->tag);
+		}
 
 
-	if (buf_state & BM_VALID)
-	{
+		if (buf_state & BM_VALID)
+		{
+			/*
+			 * When a BufferAccessStrategy is in use, blocks evicted from
+			 * shared buffers are counted as IOOP_EVICT in the corresponding
+			 * context (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted
+			 * by a strategy in two cases: 1) while initially claiming buffers
+			 * for the strategy ring 2) to replace an existing strategy ring
+			 * buffer because it is pinned or in use and cannot be reused.
+			 *
+			 * Blocks evicted from buffers already in the strategy ring are
+			 * counted as IOOP_REUSE in the corresponding strategy context.
+			 *
+			 * At this point, we can accurately count evictions and reuses,
+			 * because we have successfully claimed the valid buffer.
+			 * Previously, we may have been forced to release the buffer due
+			 * to concurrent pinners or erroring out.
+			 */
+			pgstat_count_io_op(IOOBJECT_RELATION, io_context,
+							   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
+		}
+
 		/*
-		 * When a BufferAccessStrategy is in use, blocks evicted from shared
-		 * buffers are counted as IOOP_EVICT in the corresponding context
-		 * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
-		 * strategy in two cases: 1) while initially claiming buffers for the
-		 * strategy ring 2) to replace an existing strategy ring buffer
-		 * because it is pinned or in use and cannot be reused.
-		 *
-		 * Blocks evicted from buffers already in the strategy ring are
-		 * counted as IOOP_REUSE in the corresponding strategy context.
-		 *
-		 * At this point, we can accurately count evictions and reuses,
-		 * because we have successfully claimed the valid buffer. Previously,
-		 * we may have been forced to release the buffer due to concurrent
-		 * pinners or erroring out.
+		 * If the buffer has an entry in the buffer mapping table, delete it.
+		 * This can fail because another backend could have pinned or dirtied
+		 * the buffer.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, io_context,
-						   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
-	}
+		if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
+		{
+			UnpinBuffer(buf_hdr);
+			continue;
+		}
 
-	/*
-	 * If the buffer has an entry in the buffer mapping table, delete it. This
-	 * can fail because another backend could have pinned or dirtied the
-	 * buffer.
-	 */
-	if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
-	{
-		UnpinBuffer(buf_hdr);
-		goto again;
+		break;
 	}
 
 	/* a final set of sanity checks */
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 7fe34d3ef4c..b76be264eb5 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "pgstat.h"
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
@@ -779,12 +780,21 @@ IOContextForStrategy(BufferAccessStrategy strategy)
  * be written out and doing so would require flushing WAL too.  This gives us
  * a chance to choose a different victim.
  *
+ * The buffer must be pinned and content locked and the buffer header spinlock
+ * must not be held. We must hold the content lock to examine the LSN.
+ *
  * Returns true if buffer manager should ask for a new victim, and false
  * if this buffer should be written and re-used.
  */
 bool
 StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+
+	if (!strategy)
+		return false;
+
 	/* We only do this in bulkread mode */
 	if (strategy->btype != BAS_BULKREAD)
 		return false;
@@ -794,11 +804,17 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
 		return false;
 
+	buf_state = LockBufHdr(buf);
+	lsn = BufferGetLSN(buf);
+	UnlockBufHdr(buf, buf_state);
+
+	if (!XLogNeedsFlush(lsn))
+		return false;
+
 	/*
-	 * Remove the dirty buffer from the ring; necessary to prevent infinite
+	 * Remove the dirty buffer from the ring; necessary to prevent an infinite
 	 * loop if all ring members are dirty.
 	 */
 	strategy->buffers[strategy->current] = InvalidBuffer;
-
 	return true;
 }
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c1206a46aba..8cffa0f73fb 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -421,6 +421,11 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 /*
  * Internal buffer management routines
  */
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
+
 /* bufmgr.c */
 extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
-- 
2.43.0

v8-0002-Split-FlushBuffer-into-two-parts.patch (text/x-patch; charset=US-ASCII)

From e6c8f6161f70efc9598913096245bd1f95ebb1f1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 10:54:19 -0400
Subject: [PATCH v8 2/7] Split FlushBuffer() into two parts

Before adding write combining to write a batch of blocks when flushing
dirty buffers, refactor FlushBuffer() into a preparatory step and the
actual buffer flushing step. This separation provides symmetry with
future code for batch flushing, which necessarily separates these steps,
as it must prepare multiple buffers before flushing them together.

These steps are also used by a new helper function, CleanVictimBuffer(),
which will contain both the batch flushing and single flush code in
future commits.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
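
Why a batch needs the prepare and flush steps separated can be shown with a toy model. Everything here is hypothetical (ToyBuffer, toy_prepare(), toy_flush_batch(), and a counter standing in for XLogFlush()). The sketch prepares every buffer first, tracks the highest LSN, issues one WAL flush for the whole batch, and only then writes the buffers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_BATCH 16

/* Hypothetical buffer state; not a PostgreSQL structure. */
typedef struct ToyBuffer
{
	bool		dirty;
	bool		io_in_progress;
	bool		permanent;		/* belongs to a logged relation */
	uint64_t	lsn;
} ToyBuffer;

static uint64_t wal_flushed_upto = 0;
static int	wal_flush_calls = 0;

/* Stand-in for a WAL flush; only records what was asked of it. */
static void
toy_wal_flush(uint64_t upto)
{
	if (upto > wal_flushed_upto)
		wal_flushed_upto = upto;
	wal_flush_calls++;
}

/* Step 1: claim the I/O and report the LSN (0 if no WAL flush is needed). */
static bool
toy_prepare(ToyBuffer *buf, uint64_t *lsn)
{
	if (!buf->dirty || buf->io_in_progress)
		return false;
	buf->io_in_progress = true;
	*lsn = buf->permanent ? buf->lsn : 0;
	return true;
}

/* Prepare all buffers, flush WAL once, then write. Returns buffers written. */
static size_t
toy_flush_batch(ToyBuffer *bufs, size_t n)
{
	bool		prepared[TOY_MAX_BATCH] = {false};
	uint64_t	max_lsn = 0;
	size_t		written = 0;

	assert(n <= TOY_MAX_BATCH);

	for (size_t i = 0; i < n; i++)
	{
		uint64_t	lsn;

		prepared[i] = toy_prepare(&bufs[i], &lsn);
		if (prepared[i] && lsn > max_lsn)
			max_lsn = lsn;
	}

	if (max_lsn != 0)
		toy_wal_flush(max_lsn);

	/* Step 2: the write itself, modeled as clearing the flags. */
	for (size_t i = 0; i < n; i++)
	{
		if (!prepared[i])
			continue;
		bufs[i].dirty = false;
		bufs[i].io_in_progress = false;
		written++;
	}

	return written;
}
```

A single-buffer flush is the n = 1 case of the same two steps, which is the symmetry the commit message refers to.
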
 src/backend/storage/buffer/bufmgr.c | 140 +++++++++++++++++++---------
 1 file changed, 97 insertions(+), 43 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1fadeddf505..fc63bb50ca4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -533,6 +533,12 @@ static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn);
+static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln,
+						  IOObject io_object, IOContext io_context,
+						  XLogRecPtr buffer_lsn);
+static void CleanVictimBuffer(BufferDesc *bufdesc, bool from_ring,
+							  IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2388,12 +2394,8 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 				continue;
 			}
 
-			/* OK, do the I/O */
-			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-			LWLockRelease(content_lock);
-
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &buf_hdr->tag);
+			/* Content lock is released inside CleanVictimBuffer */
+			CleanVictimBuffer(buf_hdr, from_ring, io_context);
 		}
 
 
@@ -4271,53 +4273,64 @@ static void
 FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 			IOContext io_context)
 {
-	XLogRecPtr	recptr;
-	ErrorContextCallback errcallback;
-	instr_time	io_start;
-	Block		bufBlock;
-	char	   *bufToWrite;
-	uint32		buf_state;
+	XLogRecPtr	lsn;
 
-	/*
-	 * Try to start an I/O operation.  If StartBufferIO returns false, then
-	 * someone else flushed the buffer before we could, so we need not do
-	 * anything.
-	 */
-	if (!StartBufferIO(buf, false, false))
-		return;
+	if (PrepareFlushBuffer(buf, &lsn))
+		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
+}
 
-	/* Setup error traceback support for ereport() */
-	errcallback.callback = shared_buffer_write_error_callback;
-	errcallback.arg = buf;
-	errcallback.previous = error_context_stack;
-	error_context_stack = &errcallback;
+/*
+ * Prepare and write out a dirty victim buffer.
+ *
+ * Buffer must be pinned, the content lock must be held, and the buffer
+ * header spinlock must not be held. The content lock is released and the
+ * buffer is returned pinned but not locked.
+ *
+ * bufdesc may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+				  bool from_ring, IOContext io_context)
+{
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	LWLock	   *content_lock;
 
-	/* Find smgr relation for buffer */
-	if (reln == NULL)
-		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+	/* Set up this victim buffer to be flushed */
+	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
+		return;
 
-	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
-										buf->tag.blockNum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber);
+	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	LWLockRelease(content_lock);
+	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+								  &bufdesc->tag);
+}
 
-	buf_state = LockBufHdr(buf);
+/*
+ * Prepare the buffer with bufdesc for writing. Returns true if the buffer
+ * actually needs writing and false otherwise. lsn returns the buffer's LSN if
+ * the table is logged.
+ */
+static bool
+PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
+	uint32		buf_state;
 
 	/*
-	 * Run PageGetLSN while holding header lock, since we don't have the
-	 * buffer locked exclusively in all cases.
+	 * Try to start an I/O operation.  If StartBufferIO returns false, then
+	 * someone else flushed the buffer before we could, so we need not do
+	 * anything.
 	 */
-	recptr = BufferGetLSN(buf);
+	if (!StartBufferIO(bufdesc, false, false))
+		return false;
 
-	/* To check if block content changes while flushing. - vadim 01/17/97 */
-	buf_state &= ~BM_JUST_DIRTIED;
-	UnlockBufHdr(buf, buf_state);
+	*lsn = InvalidXLogRecPtr;
+	buf_state = LockBufHdr(bufdesc);
 
 	/*
-	 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
-	 * rule that log updates must hit disk before any of the data-file changes
-	 * they describe do.
+	 * Record the buffer's LSN. We will force XLOG flush up to buffer's LSN.
+	 * This implements the basic WAL rule that log updates must hit disk
+	 * before any of the data-file changes they describe do.
 	 *
 	 * However, this rule does not apply to unlogged relations, which will be
 	 * lost after a crash anyway.  Most unlogged relation pages do not bear
@@ -4330,9 +4343,50 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * happen, attempting to flush WAL through that location would fail, with
 	 * disastrous system-wide consequences.  To make sure that can't happen,
 	 * skip the flush if the buffer isn't permanent.
+	 *
+	 * We must hold the buffer header lock when examining the page LSN since
+	 * we don't have the buffer exclusively locked in all cases.
 	 */
 	if (buf_state & BM_PERMANENT)
-		XLogFlush(recptr);
+		*lsn = BufferGetLSN(bufdesc);
+
+	/* To check if block content changes while flushing. - vadim 01/17/97 */
+	buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, buf_state);
+	return true;
+}
+
+/*
+ * Actually do the write I/O to clean a buffer. buf and reln may be modified.
+ */
+static void
+DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+			  IOContext io_context, XLogRecPtr buffer_lsn)
+{
+	ErrorContextCallback errcallback;
+	instr_time	io_start;
+	Block		bufBlock;
+	char	   *bufToWrite;
+
+	/* Setup error traceback support for ereport() */
+	errcallback.callback = shared_buffer_write_error_callback;
+	errcallback.arg = buf;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* Find smgr relation for buffer */
+	if (reln == NULL)
+		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+
+	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
+										buf->tag.blockNum,
+										reln->smgr_rlocator.locator.spcOid,
+										reln->smgr_rlocator.locator.dbOid,
+										reln->smgr_rlocator.locator.relNumber);
+
+	/* Force XLOG flush up to buffer's LSN */
+	if (!XLogRecPtrIsInvalid(buffer_lsn))
+		XLogFlush(buffer_lsn);
 
 	/*
 	 * Now it's safe to write the buffer to disk. Note that no one else should
-- 
2.43.0

Attachment: v8-0003-Eagerly-flush-bulkwrite-strategy-ring.patch (text/x-patch; charset=US-ASCII)

From c17abbb030a4ef996508294168941ccaaaea2bfc Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 13:15:43 -0400
Subject: [PATCH v8 3/7] Eagerly flush bulkwrite strategy ring

Operations using BAS_BULKWRITE (COPY FROM and createdb) will inevitably
need to flush buffers in the strategy ring in order to reuse them. By
eagerly flushing the buffers in a larger run, we encourage larger writes
at the kernel level and less interleaving of WAL flushes and data file
writes. The effect is mainly noticeable with multiple parallel COPY
FROMs. In this case, client backends achieve higher write throughput and
end up spending less time waiting on acquiring the lock to flush WAL.
Larger flush operations also mean less time waiting for flush operations
at the kernel level.

The heuristic for eager eviction is to only flush buffers in the
strategy ring which do not require a WAL flush.

This patch also is a step toward AIO writes.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Earlier version Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 238 +++++++++++++++++++++++++-
 src/backend/storage/buffer/freelist.c |  48 ++++++
 src/include/storage/buf_internals.h   |   4 +
 3 files changed, 282 insertions(+), 8 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fc63bb50ca4..2de68e78b4e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -531,14 +531,25 @@ static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_c
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
+
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
-static bool PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn);
+static BufferDesc *NextStrategyBufToFlush(BufferAccessStrategy strategy,
+										  Buffer sweep_end,
+										  XLogRecPtr *lsn, int *sweep_cursor);
+
+static bool BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn);
+static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+												   RelFileLocator *rlocator, bool skip_pinned,
+												   XLogRecPtr *max_lsn);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc,
+							   XLogRecPtr *lsn);
 static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						  IOObject io_object, IOContext io_context,
 						  XLogRecPtr buffer_lsn);
-static void CleanVictimBuffer(BufferDesc *bufdesc, bool from_ring,
-							  IOContext io_context);
+static void CleanVictimBuffer(BufferAccessStrategy strategy,
+							  BufferDesc *bufdesc,
+							  bool from_ring, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2395,7 +2406,7 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 			}
 
 			/* Content lock is released inside CleanVictimBuffer */
-			CleanVictimBuffer(buf_hdr, from_ring, io_context);
+			CleanVictimBuffer(strategy, buf_hdr, from_ring, io_context);
 		}
 
 
@@ -4279,6 +4290,61 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * Returns true if the buffer needs WAL flushed before it can be written out.
+ * Caller must not already hold the buffer header spinlock.
+ */
+static bool
+BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
+	uint32		buf_state = LockBufHdr(bufdesc);
+
+	*lsn = BufferGetLSN(bufdesc);
+
+	UnlockBufHdr(bufdesc, buf_state);
+
+	/*
+	 * See buffer flushing code for more details on why we condition this on
+	 * the relation being logged.
+	 */
+	return buf_state & BM_PERMANENT && XLogNeedsFlush(*lsn);
+}
+
+
+/*
+ * Returns the buffer descriptor of the buffer containing the next block we
+ * should eagerly flush or NULL when there are no further buffers to consider
+ * writing out.
+ */
+static BufferDesc *
+NextStrategyBufToFlush(BufferAccessStrategy strategy,
+					   Buffer sweep_end,
+					   XLogRecPtr *lsn, int *sweep_cursor)
+{
+	Buffer		bufnum;
+	BufferDesc *bufdesc;
+
+	while ((bufnum =
+			StrategyNextBuffer(strategy, sweep_cursor)) != sweep_end)
+	{
+		/*
+		 * For BAS_BULKWRITE, once you hit an InvalidBuffer, the remaining
+		 * buffers in the ring will be invalid.
+		 */
+		if (!BufferIsValid(bufnum))
+			break;
+
+		if ((bufdesc = PrepareOrRejectEagerFlushBuffer(bufnum,
+													   InvalidBlockNumber,
+													   NULL,
+													   true,
+													   lsn)) != NULL)
+			return bufdesc;
+	}
+
+	return NULL;
+}
+
 /*
  * Prepare and write out a dirty victim buffer.
  *
@@ -4289,21 +4355,177 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
  * bufdesc may be modified.
  */
 static void
-CleanVictimBuffer(BufferDesc *bufdesc,
+CleanVictimBuffer(BufferAccessStrategy strategy,
+				  BufferDesc *bufdesc,
 				  bool from_ring, IOContext io_context)
 {
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
+	bool		first_buffer = true;
 
 	/* Set up this victim buffer to be flushed */
 	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
 		return;
 
-	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	if (from_ring && StrategySupportsEagerFlush(strategy))
+	{
+		Buffer		sweep_end = BufferDescriptorGetBuffer(bufdesc);
+		int			cursor = StrategyGetCurrentIndex(strategy);
+
+		/* Clean victim buffer and find more to flush opportunistically */
+		do
+		{
+			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+			content_lock = BufferDescriptorGetContentLock(bufdesc);
+			LWLockRelease(content_lock);
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &bufdesc->tag);
+			/* We leave the first buffer pinned for the caller */
+			if (!first_buffer)
+				UnpinBuffer(bufdesc);
+			first_buffer = false;
+		} while ((bufdesc = NextStrategyBufToFlush(strategy, sweep_end,
+												   &max_lsn, &cursor)) != NULL);
+	}
+	else
+	{
+		DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+		content_lock = BufferDescriptorGetContentLock(bufdesc);
+		LWLockRelease(content_lock);
+		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+									  &bufdesc->tag);
+	}
+}
+
+/*
+ * Prepare bufdesc for eager flushing.
+ *
+ * Given bufnum, return the buffer descriptor of the buffer to eagerly flush,
+ * pinned and locked, or NULL if this buffer does not contain a block that
+ * should be flushed.
+ *
+ * require is the BlockNumber required by the caller. Some callers may require
+ * a specific BlockNumber to be in bufnum because they are assembling a
+ * contiguous run of blocks.
+ *
+ * If the caller needs the block to be from a specific relation, rlocator will
+ * be provided.
+ */
+static BufferDesc *
+PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+								RelFileLocator *rlocator, bool skip_pinned,
+								XLogRecPtr *max_lsn)
+{
+	BufferDesc *bufdesc;
+	uint32		old_buf_state;
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+	BlockNumber blknum;
+	LWLock	   *content_lock;
+
+	if (!BufferIsValid(bufnum))
+		return NULL;
+
+	Assert(!BufferIsLocal(bufnum));
+
+	bufdesc = GetBufferDescriptor(bufnum - 1);
+
+	/* Block may need to be in a specific relation */
+	if (rlocator &&
+		!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufdesc->tag),
+							  *rlocator))
+		return NULL;
+
+	/*
+	 * Ensure that there's a free refcount entry and resource owner slot for
+	 * the pin before pinning the buffer. While this may leak a refcount and
+	 * slot if we return without a buffer, we should use that slot the next
+	 * time we try and reserve a spot.
+	 */
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	ReservePrivateRefCountEntry();
+
+	/*
+	 * Check whether the buffer can be used and pin it if so. Do this using a
+	 * CAS loop, to avoid having to lock the buffer header. We have to lock
+	 * the buffer header later if we succeed in pinning the buffer here, but
+	 * avoiding locking the buffer header if the buffer is in use is worth it.
+	 */
+	old_buf_state = pg_atomic_read_u32(&bufdesc->state);
+
+	for (;;)
+	{
+		buf_state = old_buf_state;
+
+		if (!(buf_state & BM_DIRTY) || !(buf_state & BM_VALID))
+			return NULL;
+
+		/* We don't eagerly flush buffers used by others */
+		if (skip_pinned &&
+			(BUF_STATE_GET_REFCOUNT(buf_state) > 0 ||
+			 BUF_STATE_GET_USAGECOUNT(buf_state) > 1))
+			return NULL;
+
+		if (unlikely(buf_state & BM_LOCKED))
+		{
+			old_buf_state = WaitBufHdrUnlocked(bufdesc);
+			continue;
+		}
+
+		/* pin the buffer if the CAS succeeds */
+		buf_state += BUF_REFCOUNT_ONE;
+
+		if (pg_atomic_compare_exchange_u32(&bufdesc->state, &old_buf_state,
+										   buf_state))
+		{
+			TrackNewBufferPin(BufferDescriptorGetBuffer(bufdesc));
+			break;
+		}
+	}
+
+	CheckBufferIsPinnedOnce(bufnum);
+
+	blknum = BufferGetBlockNumber(bufnum);
+	Assert(BlockNumberIsValid(blknum));
+
+	/* We only include contiguous blocks in the run */
+	if (BlockNumberIsValid(require) && blknum != require)
+		goto except_unpin_buffer;
+
+	/* Don't eagerly flush buffers requiring WAL flush */
+	if (BufferNeedsWALFlush(bufdesc, &lsn))
+		goto except_unpin_buffer;
+
 	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+		goto except_unpin_buffer;
+
+	/*
+	 * Now that we have the content lock, we need to recheck if we need to
+	 * flush WAL.
+	 */
+	if (BufferNeedsWALFlush(bufdesc, &lsn))
+		goto except_unlock_content;
+
+	/* Try to start an I/O operation */
+	if (!StartBufferIO(bufdesc, false, true))
+		goto except_unlock_content;
+
+	if (lsn > *max_lsn)
+		*max_lsn = lsn;
+
+	buf_state = LockBufHdr(bufdesc);
+	buf_state &= ~BM_JUST_DIRTIED;
+	UnlockBufHdr(bufdesc, buf_state);
+
+	return bufdesc;
+
+except_unlock_content:
 	LWLockRelease(content_lock);
-	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-								  &bufdesc->tag);
+
+except_unpin_buffer:
+	UnpinBuffer(bufdesc);
+	return NULL;
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index b76be264eb5..c9cf21bdae1 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -156,6 +156,31 @@ ClockSweepTick(void)
 	return victim;
 }
 
+/*
+ * Some BufferAccessStrategies support eager flushing -- which is flushing
+ * buffers in the ring before they are needed. This can lead to better I/O
+ * patterns than lazily flushing buffers immediately before reusing them.
+ */
+bool
+StrategySupportsEagerFlush(BufferAccessStrategy strategy)
+{
+	Assert(strategy);
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKWRITE:
+			return true;
+		case BAS_VACUUM:
+		case BAS_NORMAL:
+		case BAS_BULKREAD:
+			return false;
+		default:
+			elog(ERROR, "unrecognized buffer access strategy: %d",
+				 (int) strategy->btype);
+			return false;
+	}
+}
+
 /*
  * StrategyGetBuffer
  *
@@ -307,6 +332,29 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	}
 }
 
+/*
+ * Given a position in the ring, cursor, increment the position, and return
+ * the buffer at this position.
+ */
+Buffer
+StrategyNextBuffer(BufferAccessStrategy strategy, int *cursor)
+{
+	if (++(*cursor) >= strategy->nbuffers)
+		*cursor = 0;
+
+	return strategy->buffers[*cursor];
+}
+
+/*
+ * Return the current slot in the strategy ring.
+ */
+int
+StrategyGetCurrentIndex(BufferAccessStrategy strategy)
+{
+	return strategy->current;
+}
+
+
 /*
  * StrategySyncStart -- tell BgBufferSync where to start syncing
  *
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 8cffa0f73fb..409b52b3d48 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -441,6 +441,10 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 
 /* freelist.c */
+extern bool StrategySupportsEagerFlush(BufferAccessStrategy strategy);
+extern Buffer StrategyNextBuffer(BufferAccessStrategy strategy,
+								 int *cursor);
+extern int	StrategyGetCurrentIndex(BufferAccessStrategy strategy);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
-- 
2.43.0
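
The ring sweep that 0003 adds (StrategyNextBuffer() driven by NextStrategyBufToFlush()) can be illustrated outside the tree. The following is only a minimal sketch under stated assumptions: Ring, ring_next(), ring_sweep() and INVALID_BUF are hypothetical stand-ins, not PostgreSQL APIs, and the sketch skips pinning, locking and the WAL-flush check entirely. It only shows the wraparound and the two stop conditions (reaching the victim again, or hitting an unused slot).

```c
#include <assert.h>
#include <stddef.h>

#define INVALID_BUF 0			/* hypothetical stand-in for InvalidBuffer */

/* Hypothetical, simplified stand-in for a strategy ring. */
typedef struct Ring
{
	int			nbuffers;		/* number of slots in the ring */
	int			current;		/* slot holding the victim buffer */
	const int  *buffers;		/* buffer numbers; INVALID_BUF if unused */
} Ring;

/* Advance the cursor with wraparound and return the buffer in that slot. */
static int
ring_next(const Ring *ring, int *cursor)
{
	if (++(*cursor) >= ring->nbuffers)
		*cursor = 0;
	return ring->buffers[*cursor];
}

/*
 * Collect the buffers one sweep would visit: start after the victim's slot,
 * stop when the victim comes around again or at the first unused slot.
 * Returns the number of buffers stored in out.
 */
static int
ring_sweep(const Ring *ring, int *out, int out_cap)
{
	int			cursor;
	int			sweep_end;
	int			n = 0;
	int			buf;

	assert(ring != NULL && ring->nbuffers > 0);
	cursor = ring->current;
	sweep_end = ring->buffers[cursor];

	while (n < out_cap && (buf = ring_next(ring, &cursor)) != sweep_end)
	{
		if (buf == INVALID_BUF)
			break;
		out[n++] = buf;
	}
	return n;
}
```

With a full four-slot ring and the victim in slot 2, the sweep visits slots 3, 0 and 1 in that order and then stops at the victim.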

Attachment: v8-0005-Add-database-Oid-to-CkptSortItem.patch (text/x-patch; charset=US-ASCII)

From 91c591f6f075ead014360252bbc008ca609af726 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:22:11 -0400
Subject: [PATCH v8 5/7] Add database Oid to CkptSortItem

This is useful for checkpointer write combining -- which will be added
in a future commit.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 8 ++++++++
 src/include/storage/buf_internals.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a5eff9fa2b6..a0f7f686c42 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3408,6 +3408,7 @@ BufferSync(int flags)
 			item = &CkptBufferIds[num_to_scan++];
 			item->buf_id = buf_id;
 			item->tsId = bufHdr->tag.spcOid;
+			item->dbId = bufHdr->tag.dbOid;
 			item->relNumber = BufTagGetRelNumber(&bufHdr->tag);
 			item->forkNum = BufTagGetForkNum(&bufHdr->tag);
 			item->blockNum = bufHdr->tag.blockNum;
@@ -6804,6 +6805,13 @@ ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b)
 		return -1;
 	else if (a->tsId > b->tsId)
 		return 1;
+
+	/* compare database */
+	if (a->dbId < b->dbId)
+		return -1;
+	else if (a->dbId > b->dbId)
+		return 1;
+
 	/* compare relation */
 	if (a->relNumber < b->relNumber)
 		return -1;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 799c447f88e..a2cdf73ab32 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -384,6 +384,7 @@ extern uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 typedef struct CkptSortItem
 {
 	Oid			tsId;
+	Oid			dbId;
 	RelFileNumber relNumber;
 	ForkNumber	forkNum;
 	BlockNumber blockNum;
-- 
2.43.0
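
For readers following along, here is a small standalone sketch of the sort order that results once the database Oid takes part in the comparison. SortItem, sort_item_cmp() and sort_items() are hypothetical, trimmed-down stand-ins for CkptSortItem and ckpt_buforder_comparator(), not the real definitions. My reading (an assumption, the commit message only says the field is useful for write combining) is that this keeps blocks of same-numbered relations in different databases from interleaving in the sorted array.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for CkptSortItem. */
typedef struct SortItem
{
	uint32_t	tsId;
	uint32_t	dbId;
	uint32_t	relNumber;
	int			forkNum;
	uint32_t	blockNum;
} SortItem;

/* Compare one field and return from the caller if the values differ. */
#define CMP_FIELD(a, b, f) \
	do { \
		if ((a)->f < (b)->f) return -1; \
		if ((a)->f > (b)->f) return 1; \
	} while (0)

/* Order by tablespace, database, relation, fork, then block number. */
static int
sort_item_cmp(const void *pa, const void *pb)
{
	const SortItem *a = pa;
	const SortItem *b = pb;

	CMP_FIELD(a, b, tsId);
	CMP_FIELD(a, b, dbId);
	CMP_FIELD(a, b, relNumber);
	CMP_FIELD(a, b, forkNum);
	CMP_FIELD(a, b, blockNum);
	return 0;
}

static void
sort_items(SortItem *items, size_t n)
{
	assert(items != NULL || n == 0);
	qsort(items, n, sizeof(SortItem), sort_item_cmp);
}
```

Without the dbId comparison, two items with equal tsId and relNumber but different databases would be ordered by block number alone.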

Attachment: v8-0006-Implement-checkpointer-data-write-combining.patch (text/x-patch; charset=US-ASCII)

From c21d6828eb72122313c8f54692592be1d4c6c919 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 15:23:16 -0400
Subject: [PATCH v8 6/7] Implement checkpointer data write combining

When the checkpointer writes out dirty buffers, writing multiple
contiguous blocks as a single IO is a substantial performance
improvement. The checkpointer is usually bottlenecked on IO, so issuing
larger IOs leads to increased write throughput and faster checkpoints.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: BharatDB <bharatdbpg@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 221 ++++++++++++++++++++++++----
 src/backend/utils/probes.d          |   2 +-
 2 files changed, 195 insertions(+), 28 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a0f7f686c42..3273e1c44ec 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -513,6 +513,7 @@ static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
 static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
+static uint32 CheckpointerMaxBatchSize(void);
 static void BufferSync(int flags);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
@@ -3350,7 +3351,6 @@ TrackNewBufferPin(Buffer buf)
 static void
 BufferSync(int flags)
 {
-	uint32		buf_state;
 	int			buf_id;
 	int			num_to_scan;
 	int			num_spaces;
@@ -3362,6 +3362,8 @@ BufferSync(int flags)
 	int			i;
 	uint32		mask = BM_DIRTY;
 	WritebackContext wb_context;
+	uint32		max_batch_size;
+	BufWriteBatch batch;
 
 	/*
 	 * Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3392,6 +3394,7 @@ BufferSync(int flags)
 	for (buf_id = 0; buf_id < NBuffers; buf_id++)
 	{
 		BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
+		uint32		buf_state;
 
 		/*
 		 * Header spinlock is enough to examine BM_DIRTY, see comment in
@@ -3532,48 +3535,196 @@ BufferSync(int flags)
 	 */
 	num_processed = 0;
 	num_written = 0;
+	max_batch_size = CheckpointerMaxBatchSize();
 	while (!binaryheap_empty(ts_heap))
 	{
+		BlockNumber limit = max_batch_size;
 		BufferDesc *bufHdr = NULL;
 		CkptTsStatus *ts_stat = (CkptTsStatus *)
 			DatumGetPointer(binaryheap_first(ts_heap));
+		int			ts_end = ts_stat->index - ts_stat->num_scanned + ts_stat->num_to_scan;
+		int			processed = 0;
 
-		buf_id = CkptBufferIds[ts_stat->index].buf_id;
-		Assert(buf_id != -1);
+		batch.start = InvalidBlockNumber;
+		batch.max_lsn = InvalidXLogRecPtr;
+		batch.n = 0;
 
-		bufHdr = GetBufferDescriptor(buf_id);
+		while (batch.n < limit)
+		{
+			uint32		buf_state;
+			XLogRecPtr	lsn = InvalidXLogRecPtr;
+			LWLock	   *content_lock;
+			CkptSortItem item;
 
-		num_processed++;
+			if (ProcSignalBarrierPending)
+				ProcessProcSignalBarrier();
 
-		/*
-		 * We don't need to acquire the lock here, because we're only looking
-		 * at a single bit. It's possible that someone else writes the buffer
-		 * and clears the flag right after we check, but that doesn't matter
-		 * since SyncOneBuffer will then do nothing.  However, there is a
-		 * further race condition: it's conceivable that between the time we
-		 * examine the bit here and the time SyncOneBuffer acquires the lock,
-		 * someone else not only wrote the buffer but replaced it with another
-		 * page and dirtied it.  In that improbable case, SyncOneBuffer will
-		 * write the buffer though we didn't need to.  It doesn't seem worth
-		 * guarding against this, though.
-		 */
-		if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
-		{
-			if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+			/* Check if we are done with this tablespace */
+			if (ts_stat->index + processed >= ts_end)
+				break;
+
+			item = CkptBufferIds[ts_stat->index + processed];
+
+			buf_id = item.buf_id;
+			Assert(buf_id != -1);
+
+			bufHdr = GetBufferDescriptor(buf_id);
+
+			/*
+			 * If this is the first block of the batch, then check if we need
+			 * to open a new relation. Open the relation now because we have
+			 * to determine the maximum IO size based on how many blocks
+			 * remain in the file.
+			 */
+			if (!BlockNumberIsValid(batch.start))
+			{
+				Assert(batch.max_lsn == InvalidXLogRecPtr && batch.n == 0);
+				batch.rlocator.spcOid = item.tsId;
+				batch.rlocator.dbOid = item.dbId;
+				batch.rlocator.relNumber = item.relNumber;
+				batch.forkno = item.forkNum;
+				batch.start = item.blockNum;
+				batch.reln = smgropen(batch.rlocator, INVALID_PROC_NUMBER);
+				limit = smgrmaxcombine(batch.reln, batch.forkno, batch.start);
+				limit = Min(max_batch_size, limit);
+				limit = Min(GetAdditionalPinLimit(), limit);
+				/* Guarantee progress */
+				limit = Max(limit, 1);
+			}
+
+			/*
+			 * Once we hit blocks from the next relation or fork of the
+			 * relation, break out of the loop and issue the IO we've built up
+			 * so far. It is important that we don't increment processed
+			 * because we want to start the next IO with this item.
+			 */
+			if (item.dbId != batch.rlocator.dbOid)
+				break;
+
+			if (item.relNumber != batch.rlocator.relNumber)
+				break;
+
+			if (item.forkNum != batch.forkno)
+				break;
+
+			Assert(item.tsId == batch.rlocator.spcOid);
+
+			/*
+			 * If the next block is not contiguous, we can't include it in the
+			 * IO we will issue. Break out of the loop and issue what we have
+			 * so far. Do not count this item as processed -- otherwise we
+			 * will end up skipping it.
+			 */
+			if (item.blockNum != batch.start + batch.n)
+				break;
+
+			/*
+			 * We don't need to acquire the lock here, because we're only
+			 * looking at a few bits. It's possible that someone else writes
+			 * the buffer and clears the flag right after we check, but that
+			 * doesn't matter since StartBufferIO will then return false.
+			 *
+			 * If the buffer doesn't need checkpointing, don't include it in
+			 * the batch we are building. And if the buffer doesn't need
+			 * flushing, we're done with the item, so count it as processed
+			 * and break out of the loop to issue the IO so far.
+			 */
+			buf_state = pg_atomic_read_u32(&bufHdr->state);
+			if ((buf_state & (BM_CHECKPOINT_NEEDED | BM_VALID | BM_DIRTY)) !=
+				(BM_CHECKPOINT_NEEDED | BM_VALID | BM_DIRTY))
+			{
+				processed++;
+				break;
+			}
+
+			ReservePrivateRefCountEntry();
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+			PinBuffer(bufHdr, NULL, false);
+
+			/*
+			 * There is a race condition here: it's conceivable that between
+			 * the time we examine the buffer header for BM_CHECKPOINT_NEEDED
+			 * above and the time we acquire the lock, someone else not only
+			 * wrote the buffer but replaced it with another page and dirtied
+			 * it.  In that improbable case, we will write the buffer though
+			 * we didn't need to.  It doesn't seem worth guarding against
+			 * this, though.
+			 */
+			content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+			/*
+			 * We are willing to wait for the content lock on the first IO in
+			 * the batch. However, for subsequent IOs, waiting could lead to
+			 * deadlock. We have to eventually flush all eligible buffers,
+			 * though. So, if we fail to acquire the lock on a subsequent
+			 * buffer, we break out and issue the IO we've built up so far.
+			 * Then we come back and start a new IO with that buffer as the
+			 * starting buffer. As such, we must not count the item as
+			 * processed if we end up failing to acquire the content lock.
+			 */
+			if (batch.n == 0)
+				LWLockAcquire(content_lock, LW_SHARED);
+			else if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
 			{
-				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-				PendingCheckpointerStats.buffers_written++;
-				num_written++;
+				UnpinBuffer(bufHdr);
+				break;
 			}
+
+			/*
+			 * If the buffer doesn't need IO, count the item as processed,
+			 * release the buffer, and break out of the loop to issue the IO
+			 * we have built up so far.
+			 */
+			if (!StartBufferIO(bufHdr, false, true))
+			{
+				processed++;
+				LWLockRelease(content_lock);
+				UnpinBuffer(bufHdr);
+				break;
+			}
+
+			buf_state = LockBufHdr(bufHdr);
+			lsn = BufferGetLSN(bufHdr);
+			buf_state &= ~BM_JUST_DIRTIED;
+			UnlockBufHdr(bufHdr, buf_state);
+
+			/*
+			 * Keep track of the max LSN so that we can be sure to flush
+			 * enough WAL before flushing data from the buffers. See comment
+			 * in DoFlushBuffer() for more on why we don't consider the LSNs
+			 * of unlogged relations.
+			 */
+			if (buf_state & BM_PERMANENT && lsn > batch.max_lsn)
+				batch.max_lsn = lsn;
+
+			batch.bufdescs[batch.n++] = bufHdr;
+			processed++;
 		}
 
 		/*
 		 * Measure progress independent of actually having to flush the buffer
-		 * - otherwise writing become unbalanced.
+		 * - otherwise writing becomes unbalanced.
 		 */
-		ts_stat->progress += ts_stat->progress_slice;
-		ts_stat->num_scanned++;
-		ts_stat->index++;
+		num_processed += processed;
+		ts_stat->progress += ts_stat->progress_slice * processed;
+		ts_stat->num_scanned += processed;
+		ts_stat->index += processed;
+
+		/*
+		 * If we built up an IO, issue it. There's a chance we didn't find any
+		 * items referencing buffers that needed flushing this time, but we
+		 * still want to check if we should update the heap if we examined and
+		 * processed the items.
+		 */
+		if (batch.n > 0)
+		{
+			FlushBufferBatch(&batch, IOCONTEXT_NORMAL);
+			CompleteWriteBatchIO(&batch, IOCONTEXT_NORMAL, &wb_context);
+
+			TRACE_POSTGRESQL_BUFFER_BATCH_SYNC_WRITTEN(batch.n);
+			PendingCheckpointerStats.buffers_written += batch.n;
+			num_written += batch.n;
+		}
 
 		/* Have all the buffers from the tablespace been processed? */
 		if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -6406,6 +6557,22 @@ IsBufferCleanupOK(Buffer buffer)
 	return false;
 }
 
+/*
+ * The maximum number of blocks that can be written out in a single batch by
+ * the checkpointer.
+ */
+static uint32
+CheckpointerMaxBatchSize(void)
+{
+	uint32		result;
+	uint32		pin_limit = GetPinLimit();
+
+	result = Min(pin_limit, io_combine_limit);
+	result = Min(result, MAX_IO_COMBINE_LIMIT);
+	result = Max(result, 1);
+	return result;
+}
+
 
 /*
  *	Functions for buffer I/O handling
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 36dd4f8375b..d6970731ba9 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -68,7 +68,7 @@ provider postgresql {
 	probe buffer__checkpoint__sync__start();
 	probe buffer__checkpoint__done();
 	probe buffer__sync__start(int, int);
-	probe buffer__sync__written(int);
+	probe buffer__batch__sync__written(BlockNumber);
 	probe buffer__sync__done(int, int, int);
 
 	probe deadlock__found();
-- 
2.43.0
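
The batch-building loop in 0006 stops a batch at a relation or fork boundary, at a gap in block numbers, or at the size limit. A minimal sketch of just that grouping rule follows. It assumes the items are already sorted; RunItem and contiguous_run_length() are hypothetical names, and the pin, content lock, BM_CHECKPOINT_NEEDED and StartBufferIO() steps of the real loop are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down sort item (tablespace omitted for brevity). */
typedef struct RunItem
{
	uint32_t	dbId;
	uint32_t	relNumber;
	int			forkNum;
	uint32_t	blockNum;
} RunItem;

/*
 * Return how many items, beginning at index start, form one combinable run:
 * same database, relation and fork, consecutive block numbers, and at most
 * limit items.
 */
static int
contiguous_run_length(const RunItem *items, int nitems, int start, int limit)
{
	const RunItem *first;
	int			n = 0;

	assert(start >= 0 && start < nitems && limit >= 1);
	first = &items[start];

	while (n < limit && start + n < nitems)
	{
		const RunItem *it = &items[start + n];

		/* A different relation or fork ends the run */
		if (it->dbId != first->dbId ||
			it->relNumber != first->relNumber ||
			it->forkNum != first->forkNum)
			break;

		/* A gap in block numbers ends the run */
		if (it->blockNum != first->blockNum + (uint32_t) n)
			break;

		n++;
	}
	return n;
}
```

For blocks 3, 4, 5, 7, 8 of one relation, the first run has length 3 and the next run starts at block 7, which matches the patch's rule of not counting the item that broke the run as processed.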

Attachment: v8-0007-WIP-Refactor-SyncOneBuffer-for-bgwriter-only.patch (text/x-patch; charset=US-ASCII)

From 95a2b7a7dcc1d981e1fe5784db25304d5596ec72 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 16:16:58 -0400
Subject: [PATCH v8 7/7] WIP: Refactor SyncOneBuffer for bgwriter only

Only bgwriter uses SyncOneBuffer now so we can remove the
skip_recently_used parameter and make it the default.

5e89985928795f243 introduced the pattern of using a CAS loop instead of
locking the buffer header and then calling PinBuffer_Locked(). Do that
in SyncOneBuffer() so we can avoid taking the buffer header spinlock in
the common case that the buffer is recently used.

ci-os-only: windows
---
 src/backend/storage/buffer/bufmgr.c | 96 +++++++++++++++++------------
 1 file changed, 56 insertions(+), 40 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3273e1c44ec..726b5dcde28 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -515,8 +515,7 @@ static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
 static uint32 CheckpointerMaxBatchSize(void);
 static void BufferSync(int flags);
-static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
-						  WritebackContext *wb_context);
+static int	SyncOneBuffer(int buf_id, WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
 static void AbortBufferIO(Buffer buffer);
 static void shared_buffer_write_error_callback(void *arg);
@@ -3998,8 +3997,7 @@ BgBufferSync(WritebackContext *wb_context)
 	/* Execute the LRU scan */
 	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
 	{
-		int			sync_state = SyncOneBuffer(next_to_clean, true,
-											   wb_context);
+		int			sync_state = SyncOneBuffer(next_to_clean, wb_context);
 
 		if (++next_to_clean >= NBuffers)
 		{
@@ -4062,8 +4060,8 @@ BgBufferSync(WritebackContext *wb_context)
 /*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
- * If skip_recently_used is true, we don't write currently-pinned buffers, nor
- * buffers marked recently used, as these are not replacement candidates.
+ * We don't write currently-pinned buffers, nor buffers marked recently used,
+ * as these are not replacement candidates.
  *
  * Returns a bitmask containing the following flag bits:
  *	BUF_WRITTEN: we wrote the buffer.
@@ -4074,53 +4072,71 @@ BgBufferSync(WritebackContext *wb_context)
  * after locking it, but we don't care all that much.)
  */
 static int
-SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
+SyncOneBuffer(int buf_id, WritebackContext *wb_context)
 {
 	BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
 	int			result = 0;
+	uint32		old_buf_state;
 	uint32		buf_state;
 	BufferTag	tag;
 
-	/* Make sure we can handle the pin */
-	ReservePrivateRefCountEntry();
-	ResourceOwnerEnlarge(CurrentResourceOwner);
-
 	/*
-	 * Check whether buffer needs writing.
-	 *
-	 * We can make this check without taking the buffer content lock so long
-	 * as we mark pages dirty in access methods *before* logging changes with
-	 * XLogInsert(): if someone marks the buffer dirty just after our check we
-	 * don't worry because our checkpoint.redo points before log record for
-	 * upcoming changes and so we are not required to write such dirty buffer.
+	 * Check whether the buffer can be used and pin it if so. Do this using a
+	 * CAS loop, to avoid having to lock the buffer header.
 	 */
-	buf_state = LockBufHdr(bufHdr);
-
-	if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
-		BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
+	old_buf_state = pg_atomic_read_u32(&bufHdr->state);
+	for (;;)
 	{
+		buf_state = old_buf_state;
+
+		/*
+		 * We can make these checks without taking the buffer content lock so
+		 * long as we mark pages dirty in access methods *before* logging
+		 * changes with XLogInsert(): if someone marks the buffer dirty just
+		 * after our check we don't worry because our checkpoint.redo points
+		 * before log record for upcoming changes and so we are not required
+		 * to write such dirty buffer.
+		 */
+		if (BUF_STATE_GET_REFCOUNT(buf_state) != 0 ||
+			BUF_STATE_GET_USAGECOUNT(buf_state) != 0)
+		{
+			/* Don't write recently-used buffers */
+			return result;
+		}
+
 		result |= BUF_REUSABLE;
-	}
-	else if (skip_recently_used)
-	{
-		/* Caller told us not to write recently-used buffers */
-		UnlockBufHdr(bufHdr, buf_state);
-		return result;
-	}
 
-	if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
-	{
-		/* It's clean, so nothing to do */
-		UnlockBufHdr(bufHdr, buf_state);
-		return result;
+		if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
+		{
+			/* It's clean, so nothing to do */
+			return result;
+		}
+
+		if (unlikely(buf_state & BM_LOCKED))
+		{
+			old_buf_state = WaitBufHdrUnlocked(bufHdr);
+			continue;
+		}
+
+		/* Make sure we can handle the pin */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+
+		/* pin the buffer if the CAS succeeds */
+		buf_state += BUF_REFCOUNT_ONE;
+
+		if (pg_atomic_compare_exchange_u32(&bufHdr->state, &old_buf_state,
+										   buf_state))
+		{
+			TrackNewBufferPin(BufferDescriptorGetBuffer(bufHdr));
+			break;
+		}
 	}
 
 	/*
-	 * Pin it, share-lock it, write it.  (FlushBuffer will do nothing if the
-	 * buffer is clean by the time we've locked it.)
+	 * Share lock and write it out (FlushBuffer will do nothing if the buffer
+	 * is clean by the time we've locked it.)
 	 */
-	PinBuffer_Locked(bufHdr);
-
 	FlushUnlockedBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
 
 	tag = bufHdr->tag;
@@ -4128,8 +4144,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 	UnpinBuffer(bufHdr);
 
 	/*
-	 * SyncOneBuffer() is only called by checkpointer and bgwriter, so
-	 * IOContext will always be IOCONTEXT_NORMAL.
+	 * SyncOneBuffer() is only called by bgwriter, so IOContext will always be
+	 * IOCONTEXT_NORMAL.
 	 */
 	ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag);
 
-- 
2.43.0
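
The lock-free pin in the SyncOneBuffer() diff above follows a common compare-and-swap pattern: read the state word, bail out early if the buffer does not qualify, and otherwise try to publish the incremented pin count, retrying if another process changed the word in between. A minimal standalone sketch using C11 atomics is below. The bit layout, the flag names and pin_if_dirty_and_unpinned() are hypothetical and are not PostgreSQL's definitions; where the patch waits with WaitBufHdrUnlocked(), the sketch simply re-reads the state.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state-word layout for illustration only. */
#define REFCOUNT_ONE   1u
#define REFCOUNT_MASK  ((1u << 18) - 1)
#define FLAG_DIRTY     (1u << 18)
#define FLAG_LOCKED    (1u << 19)

/*
 * Try to pin a buffer that is dirty and currently unpinned.
 * Returns true with the pin count incremented, or false without
 * modifying the state word.
 */
static bool
pin_if_dirty_and_unpinned(_Atomic uint32_t *state)
{
    uint32_t old_state = atomic_load(state);

    for (;;)
    {
        uint32_t new_state = old_state;

        if ((old_state & REFCOUNT_MASK) != 0)
            return false;       /* someone else holds a pin */

        if (!(old_state & FLAG_DIRTY))
            return false;       /* clean, nothing to write */

        if (old_state & FLAG_LOCKED)
        {
            /* header is locked: re-read and re-check */
            old_state = atomic_load(state);
            continue;
        }

        new_state += REFCOUNT_ONE;

        /*
         * On failure, old_state is refreshed with the current value, so
         * the next iteration re-runs every check against fresh state.
         */
        if (atomic_compare_exchange_weak(state, &old_state, new_state))
            return true;
    }
}
```

Because every early return happens before the compare-and-swap, a buffer that fails a check is never modified, which is what lets the loop avoid taking the header lock at all.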

#15

BharatDB

bharatdbpg@gmail.com

2 months ago

In reply to: Melanie Plageman (#14)

Re: Checkpointer write combining

Hi all,

Regarding the CI failures around “max batch size” in earlier patch
versions: from what I observed, the failures arise either from boundary
conditions when io_combine_limit (GUC) is set larger than the compile-time
MAX_IO_COMBINE_LIMIT, or when the pin limits return small or zero values,
which produces out-of-range batch sizes or assertion failures in CI.

Comparing the approaches suggested in the thread, I think applying the GUC
and compile-time cap first, and then the pin limits, could be more
effective in avoiding CI failures. We should also consider the following
conditions:

1. Handle io_combine_limit == 0 explicitly (fall back to 1 for forward
   progress).
2. Cap early to a conservative compile_cap (MAX_IO_COMBINE_LIMIT - 1) to
   avoid array overflow. Otherwise, if we confirm all slots are usable,
   change to MAX_IO_COMBINE_LIMIT.
3. Apply per-strategy pin and global pin limits only if they are positive.
4. Use explicit typed comparisons to avoid signed/unsigned pitfalls and add
   a final Assert() to capture assumptions in CI.

*Implementation logic:*

uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
    uint32 max_write_batch_size;
    uint32 compile_cap = MAX_IO_COMBINE_LIMIT - 1; /* conservative cap */
    int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
    uint32 max_possible_buffer_limit = GetPinLimit();

    max_write_batch_size = (io_combine_limit == 0) ? 1 : io_combine_limit;

    if (max_write_batch_size > compile_cap)
        max_write_batch_size = compile_cap;

    if (strategy_pin_limit > 0 &&
        (uint32) strategy_pin_limit < max_write_batch_size)
        max_write_batch_size = (uint32) strategy_pin_limit;

    if (max_possible_buffer_limit > 0 &&
        max_possible_buffer_limit < max_write_batch_size)
        max_write_batch_size = max_possible_buffer_limit;

    if (max_write_batch_size == 0)
        max_write_batch_size = 1;

    Assert(max_write_batch_size <= compile_cap);
    return max_write_batch_size;
}

I hope this will be helpful for proceeding further. Looking forward to
more feedback.

Thanking you.

Regards,

Soumya

On Tue, Nov 4, 2025 at 5:04 AM Melanie Plageman <melanieplageman@gmail.com>
wrote:


Thanks for continuing to review! I've revised the patches to
incorporate all of your feedback except for where I mention below.

There were failures in CI due to issues with max batch size, so
attached v8 also seeks to fix those.

- Melanie

On Thu, Oct 16, 2025 at 12:25 AM Chao Li <li.evan.chao@gmail.com> wrote:
3 - 0003
```
+/*
+ * Return the next buffer in the ring or InvalidBuffer if the current sweep is
+ * over.
+ */
+Buffer
+StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
+{
+       if (++(*sweep_cursor) >= strategy->nbuffers)
+               *sweep_cursor = 0;
+
+       return strategy->buffers[*sweep_cursor];
+}
```
Feels the function comment is a bit confusing, because the function code
doesn’t really perform sweep, the function is just a getter. InvalidBuffer
just implies the current sweep is over.

Maybe rephrase to something like: “Return the next buffer in the range. If
InvalidBuffer is returned, that implies the current sweep is done.”

Yes, actually I think having these helpers mention the sweep is more
confusing than anything else. I've revised them to be named more
generically and updated the comments accordingly.
5 - 0004
```
+uint32
+StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
+{
+       uint32          max_possible_buffer_limit;
+       uint32          max_write_batch_size;
+       int                     strategy_pin_limit;
+
+       max_write_batch_size = io_combine_limit;
+
+       strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+       max_possible_buffer_limit = GetPinLimit();
+
+       max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+       max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+       max_write_batch_size = Max(1, max_write_batch_size);
+       max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
+       Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
+       return max_write_batch_size;
+}
```
This implementation is hard to understand. I tried to simplify it:
```
uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
uint32 max_write_batch_size = Min(GetPinLimit(),
(uint32)strategy_pin_limit);

/* Clamp to io_combine_limit and enforce minimum of 1 */
if (max_write_batch_size > io_combine_limit)
max_write_batch_size = io_combine_limit;
if (max_write_batch_size == 0)
max_write_batch_size = 1;

Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
return max_write_batch_size;
}
```

I agree that the implementation was hard to understand. I've not quite
gone with your version but I have rewritten it like this:

uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
uint32 max_write_batch_size = Min(io_combine_limit,
MAX_IO_COMBINE_LIMIT);
int strategy_pin_limit =
GetAccessStrategyPinLimit(strategy);
uint32 max_possible_buffer_limit = GetPinLimit();

/* Identify the minimum of the above */
max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
max_write_batch_size = Min(max_possible_buffer_limit,
max_write_batch_size);

/* Must allow at least 1 IO for forward progress */
max_write_batch_size = Max(1, max_write_batch_size);

return max_write_batch_size;
}

Is this better?

- Melanie

#16

Chao Li

li.evan.chao@gmail.com

about 2 months ago

In reply to: Melanie Plageman (#14)

Re: Checkpointer write combining

On Nov 4, 2025, at 07:34, Melanie Plageman <melanieplageman@gmail.com> wrote:

Thanks for continuing to review! I've revised the patches to
incorporate all of your feedback except for where I mention below.

There were failures in CI due to issues with max batch size, so
attached v8 also seeks to fix those.

- Melanie

On Thu, Oct 16, 2025 at 12:25 AM Chao Li <li.evan.chao@gmail.com> wrote:
3 - 0003
```
+/*
+ * Return the next buffer in the ring or InvalidBuffer if the current sweep is
+ * over.
+ */
+Buffer
+StrategySweepNextBuffer(BufferAccessStrategy strategy, int *sweep_cursor)
+{
+       if (++(*sweep_cursor) >= strategy->nbuffers)
+               *sweep_cursor = 0;
+
+       return strategy->buffers[*sweep_cursor];
+}
```
Feels the function comment is a bit confusing, because the function code doesn’t really perform sweep, the function is just a getter. InvalidBuffer just implies the current sweep is over.

Maybe rephrase to something like: “Return the next buffer in the range. If InvalidBuffer is returned, that implies the current sweep is done."
Yes, actually I think having these helpers mention the sweep is more
confusing than anything else. I've revised them to be named more
generically and updated the comments accordingly.
5 - 0004
```
+uint32
+StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
+{
+       uint32          max_possible_buffer_limit;
+       uint32          max_write_batch_size;
+       int                     strategy_pin_limit;
+
+       max_write_batch_size = io_combine_limit;
+
+       strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+       max_possible_buffer_limit = GetPinLimit();
+
+       max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+       max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+       max_write_batch_size = Max(1, max_write_batch_size);
+       max_write_batch_size = Min(max_write_batch_size, io_combine_limit);
+       Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
+       return max_write_batch_size;
+}
```
This implementation is hard to understand. I tried to simplify it:
```
uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
uint32 max_write_batch_size = Min(GetPinLimit(), (uint32)strategy_pin_limit);

/* Clamp to io_combine_limit and enforce minimum of 1 */
if (max_write_batch_size > io_combine_limit)
max_write_batch_size = io_combine_limit;
if (max_write_batch_size == 0)
max_write_batch_size = 1;

Assert(max_write_batch_size < MAX_IO_COMBINE_LIMIT);
return max_write_batch_size;
}
```
I agree that the implementation was hard to understand. I've not quite
gone with your version but I have rewritten it like this:

uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
uint32 max_write_batch_size = Min(io_combine_limit,
MAX_IO_COMBINE_LIMIT);
int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
uint32 max_possible_buffer_limit = GetPinLimit();

/* Identify the minimum of the above */
max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);

/* Must allow at least 1 IO for forward progress */
max_write_batch_size = Max(1, max_write_batch_size);

return max_write_batch_size;
}

Is this better?

Yes, I think your version is safer because it enforces the max limit at runtime instead of only asserting it in debug builds.

- Melanie
<v8-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v8-0002-Split-FlushBuffer-into-two-parts.patch><v8-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v8-0004-Write-combining-for-BAS_BULKWRITE.patch><v8-0005-Add-database-Oid-to-CkptSortItem.patch><v8-0006-Implement-checkpointer-data-write-combining.patch><v8-0007-WIP-Refactor-SyncOneBuffer-for-bgwriter-only.patch>

I quickly went through 0001-0006, looks good to me now. As 0007 has WIP in the subject, I skipped it.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#17

Melanie Plageman

melanieplageman@gmail.com

about 2 months ago

In reply to: BharatDB (#15)

Re: Checkpointer write combining

On Wed, Nov 12, 2025 at 6:40 AM BharatDB <bharatdbpg@gmail.com> wrote:

Considering the CI failures in earlier patch versions around “max batch size”, upon my observation I found the failures arise either from boundary conditions when io_combine_limit (GUC) is set larger than the compile-time MAX_IO_COMBINE_LIMIT

How could io_combine_limit be higher than MAX_IO_COMBINE_LIMIT? The
GUC is capped to MAX_IO_COMBINE_LIMIT -- see guc_parameters.dat.

or when pin limits return small/zero values due to which it produce out-of-range batch sizes or assertion failures in CI.

This is true. But I think it can be solved with a single comparison to
1 as in the recent version.

Comparing the approaches suggested in the thread, I think implementing (GUC + compile-time cap first, and then pin limits) could be more effective in avoiding CI failures and also we should consider the following logic conditions:

Set io_combine_limit == 0 explicitly (fallback to 1 for forward progress).

The io_combine_limit GUC has a minimum value of 1 (in guc_parameters.dat)

Use explicit typed comparisons to avoid signed/unsigned pitfalls and add a final Assert() to capture assumptions in CI.

I think my new version works.

uint32 max_write_batch_size = Min(io_combine_limit,
MAX_IO_COMBINE_LIMIT);
int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
uint32 max_possible_buffer_limit = GetPinLimit();

/* Identify the minimum of the above */
max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);

/* Must allow at least 1 IO for forward progress */
max_write_batch_size = Max(1, max_write_batch_size);

return max_write_batch_size;

GetAccessStrategyPinLimit() is the only function returning a signed
value here -- and it should always return a positive value (while I
wish we would use unsigned integers when a value will never be
negative, the strategy->nbuffers struct member was added a long time
ago). Then the Min() macro should work correctly and produce a value
that fits in a uint32 because of integer promotion rules.
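
To make the conversion behaviour described above concrete, here is a minimal standalone sketch. It assumes Min() and Max() have the usual ternary-macro shape, and clamp_batch_size() is a hypothetical stand-in that takes the three limits as parameters instead of reading io_combine_limit, GetAccessStrategyPinLimit() and GetPinLimit().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match the usual ternary-macro definitions */
#define Min(x, y) ((x) < (y) ? (x) : (y))
#define Max(x, y) ((x) > (y) ? (x) : (y))

/*
 * Hypothetical stand-in for StrategyMaxWriteBatchSize(): the inputs are
 * passed in rather than fetched from the GUC and the pin-limit functions.
 */
static uint32_t
clamp_batch_size(uint32_t combine_limit, int strategy_pin_limit,
                 uint32_t pin_limit)
{
    uint32_t batch = combine_limit;

    /*
     * int vs. uint32_t: the usual arithmetic conversions turn the int
     * operand into an unsigned value before the comparison.
     */
    batch = Min(strategy_pin_limit, batch);
    batch = Min(pin_limit, batch);

    /* must allow at least 1 IO for forward progress */
    batch = Max(1, batch);

    return batch;
}
```

With a positive strategy_pin_limit the comparison is done in unsigned arithmetic and yields the expected minimum. A negative value would convert to a very large unsigned number, so it would simply never be chosen as the minimum; a zero pin limit is caught by the final Max(1, ...).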

- Melanie

#18

Melanie Plageman

melanieplageman@gmail.com

about 2 months ago

In reply to: Chao Li (#16)

7 attachment(s)

Re: Checkpointer write combining

On Thu, Nov 13, 2025 at 3:30 AM Chao Li <li.evan.chao@gmail.com> wrote:

On Nov 4, 2025, at 07:34, Melanie Plageman <melanieplageman@gmail.com> wrote:

uint32
StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
{
uint32 max_write_batch_size = Min(io_combine_limit,
MAX_IO_COMBINE_LIMIT);
int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
uint32 max_possible_buffer_limit = GetPinLimit();

/* Identify the minimum of the above */
max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);

/* Must allow at least 1 IO for forward progress */
max_write_batch_size = Max(1, max_write_batch_size);

return max_write_batch_size;
}

Is this better?

Yes, I think your version is safer because it enforces the max limit at runtime instead of only asserting it in debug builds.

Cool. I've attached a v9 which is rebased over recent bufmgr.c
changes. In the process, I found a bit of cleanup to do.

I quickly went through 0001-0006, looks good to me now. As 0007 has WIP in the subject, I skipped it.

I no longer remember why I made that patch WIP, so I've removed that
designation.

- Melanie

Attachments:

v9-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch (text/x-patch; charset=US-ASCII)

From 59432c8962cd1d7866493492149963485f6e63e1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 10:53:48 -0400
Subject: [PATCH v9 1/7] Refactor goto into for loop in GetVictimBuffer()

GetVictimBuffer() implemented a loop to optimistically lock a clean
victim buffer using a goto. Future commits will add batch flushing
functionality to GetVictimBuffer. The new logic works better with
standard for loop flow control.

This commit is only a refactor and does not introduce any new
functionality.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 189 ++++++++++++--------------
 src/backend/storage/buffer/freelist.c |  19 ++-
 src/include/storage/buf_internals.h   |   5 +
 3 files changed, 110 insertions(+), 103 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 327ddb7adc8..90c24b8d93d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -68,10 +68,6 @@
 #include "utils/timestamp.h"
 
 
-/* Note: these two macros only work on shared buffers, not local ones! */
-#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
-#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
-
 /* Note: this macro only works on local buffers, not shared ones! */
 #define LocalBufHdrGetBlock(bufHdr) \
 	LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]
@@ -2331,125 +2327,116 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
-	/* we return here if a prospective victim buffer gets used concurrently */
-again:
-
-	/*
-	 * Select a victim buffer.  The buffer is returned pinned and owned by
-	 * this backend.
-	 */
-	buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
-	buf = BufferDescriptorGetBuffer(buf_hdr);
-
-	/*
-	 * We shouldn't have any other pins for this buffer.
-	 */
-	CheckBufferIsPinnedOnce(buf);
-
-	/*
-	 * If the buffer was dirty, try to write it out.  There is a race
-	 * condition here, in that someone might dirty it after we released the
-	 * buffer header lock above, or even while we are writing it out (since
-	 * our share-lock won't prevent hint-bit updates).  We will recheck the
-	 * dirty bit after re-locking the buffer header.
-	 */
-	if (buf_state & BM_DIRTY)
+	/* Select a victim buffer using an optimistic locking scheme. */
+	for (;;)
 	{
-		LWLock	   *content_lock;
 
-		Assert(buf_state & BM_TAG_VALID);
-		Assert(buf_state & BM_VALID);
+		/* Attempt to claim a victim buffer. Buffer is returned pinned. */
+		buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+		buf = BufferDescriptorGetBuffer(buf_hdr);
 
 		/*
-		 * We need a share-lock on the buffer contents to write it out (else
-		 * we might write invalid data, eg because someone else is compacting
-		 * the page contents while we write).  We must use a conditional lock
-		 * acquisition here to avoid deadlock.  Even though the buffer was not
-		 * pinned (and therefore surely not locked) when StrategyGetBuffer
-		 * returned it, someone else could have pinned and exclusive-locked it
-		 * by the time we get here. If we try to get the lock unconditionally,
-		 * we'd block waiting for them; if they later block waiting for us,
-		 * deadlock ensues. (This has been observed to happen when two
-		 * backends are both trying to split btree index pages, and the second
-		 * one just happens to be trying to split the page the first one got
-		 * from StrategyGetBuffer.)
+		 * We shouldn't have any other pins for this buffer.
 		 */
-		content_lock = BufferDescriptorGetContentLock(buf_hdr);
-		if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
-		{
-			/*
-			 * Someone else has locked the buffer, so give it up and loop back
-			 * to get another one.
-			 */
-			UnpinBuffer(buf_hdr);
-			goto again;
-		}
+		CheckBufferIsPinnedOnce(buf);
 
 		/*
-		 * If using a nondefault strategy, and writing the buffer would
-		 * require a WAL flush, let the strategy decide whether to go ahead
-		 * and write/reuse the buffer or to choose another victim.  We need a
-		 * lock to inspect the page LSN, so this can't be done inside
-		 * StrategyGetBuffer.
+		 * If the buffer was dirty, try to write it out.  There is a race
+		 * condition here, in that someone might dirty it after we released
+		 * the buffer header lock above, or even while we are writing it out
+		 * (since our share-lock won't prevent hint-bit updates).  We will
+		 * recheck the dirty bit after re-locking the buffer header.
 		 */
-		if (strategy != NULL)
+		if (buf_state & BM_DIRTY)
 		{
-			XLogRecPtr	lsn;
+			LWLock	   *content_lock;
 
-			/* Read the LSN while holding buffer header lock */
-			buf_state = LockBufHdr(buf_hdr);
-			lsn = BufferGetLSN(buf_hdr);
-			UnlockBufHdr(buf_hdr);
+			Assert(buf_state & BM_TAG_VALID);
+			Assert(buf_state & BM_VALID);
+
+			/*
+			 * We need a share-lock on the buffer contents to write it out
+			 * (else we might write invalid data, eg because someone else is
+			 * compacting the page contents while we write).  We must use a
+			 * conditional lock acquisition here to avoid deadlock.  Even
+			 * though the buffer was not pinned (and therefore surely not
+			 * locked) when StrategyGetBuffer returned it, someone else could
+			 * have pinned and exclusive-locked it by the time we get here. If
+			 * we try to get the lock unconditionally, we'd block waiting for
+			 * them; if they later block waiting for us, deadlock ensues.
+			 * (This has been observed to happen when two backends are both
+			 * trying to split btree index pages, and the second one just
+			 * happens to be trying to split the page the first one got from
+			 * StrategyGetBuffer.)
+			 */
+			content_lock = BufferDescriptorGetContentLock(buf_hdr);
+			if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				/*
+				 * Someone else has locked the buffer, so give it up and loop
+				 * back to get another one.
+				 */
+				UnpinBuffer(buf_hdr);
+				continue;
+			}
 
-			if (XLogNeedsFlush(lsn)
-				&& StrategyRejectBuffer(strategy, buf_hdr, from_ring))
+			/*
+			 * If using a nondefault strategy, and writing the buffer would
+			 * require a WAL flush, let the strategy decide whether to go
+			 * ahead and write/reuse the buffer or to choose another victim.
+			 * We need the content lock to inspect the page LSN, so this can't
+			 * be done inside StrategyGetBuffer.
+			 */
+			if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
 			{
 				LWLockRelease(content_lock);
 				UnpinBuffer(buf_hdr);
-				goto again;
+				continue;
 			}
-		}
 
-		/* OK, do the I/O */
-		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-		LWLockRelease(content_lock);
+			/* OK, do the I/O */
+			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
+			LWLockRelease(content_lock);
 
-		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-									  &buf_hdr->tag);
-	}
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &buf_hdr->tag);
+		}
 
 
-	if (buf_state & BM_VALID)
-	{
+		if (buf_state & BM_VALID)
+		{
+			/*
+			 * When a BufferAccessStrategy is in use, blocks evicted from
+			 * shared buffers are counted as IOOP_EVICT in the corresponding
+			 * context (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted
+			 * by a strategy in two cases: 1) while initially claiming buffers
+			 * for the strategy ring 2) to replace an existing strategy ring
+			 * buffer because it is pinned or in use and cannot be reused.
+			 *
+			 * Blocks evicted from buffers already in the strategy ring are
+			 * counted as IOOP_REUSE in the corresponding strategy context.
+			 *
+			 * At this point, we can accurately count evictions and reuses,
+			 * because we have successfully claimed the valid buffer.
+			 * Previously, we may have been forced to release the buffer due
+			 * to concurrent pinners or erroring out.
+			 */
+			pgstat_count_io_op(IOOBJECT_RELATION, io_context,
+							   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
+		}
+
 		/*
-		 * When a BufferAccessStrategy is in use, blocks evicted from shared
-		 * buffers are counted as IOOP_EVICT in the corresponding context
-		 * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
-		 * strategy in two cases: 1) while initially claiming buffers for the
-		 * strategy ring 2) to replace an existing strategy ring buffer
-		 * because it is pinned or in use and cannot be reused.
-		 *
-		 * Blocks evicted from buffers already in the strategy ring are
-		 * counted as IOOP_REUSE in the corresponding strategy context.
-		 *
-		 * At this point, we can accurately count evictions and reuses,
-		 * because we have successfully claimed the valid buffer. Previously,
-		 * we may have been forced to release the buffer due to concurrent
-		 * pinners or erroring out.
+		 * If the buffer has an entry in the buffer mapping table, delete it.
+		 * This can fail because another backend could have pinned or dirtied
+		 * the buffer.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, io_context,
-						   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
-	}
+		if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
+		{
+			UnpinBuffer(buf_hdr);
+			continue;
+		}
 
-	/*
-	 * If the buffer has an entry in the buffer mapping table, delete it. This
-	 * can fail because another backend could have pinned or dirtied the
-	 * buffer.
-	 */
-	if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
-	{
-		UnpinBuffer(buf_hdr);
-		goto again;
+		break;
 	}
 
 	/* a final set of sanity checks */
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 28d952b3534..1465984b141 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "pgstat.h"
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
@@ -780,12 +781,20 @@ IOContextForStrategy(BufferAccessStrategy strategy)
  * be written out and doing so would require flushing WAL too.  This gives us
  * a chance to choose a different victim.
  *
+ * The buffer must be pinned and content locked and the buffer header spinlock
+ * must not be held. We must hold the content lock to examine the LSN.
+ *
  * Returns true if buffer manager should ask for a new victim, and false
  * if this buffer should be written and re-used.
  */
 bool
 StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
+	XLogRecPtr	lsn;
+
+	if (!strategy)
+		return false;
+
 	/* We only do this in bulkread mode */
 	if (strategy->btype != BAS_BULKREAD)
 		return false;
@@ -795,11 +804,17 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
 		return false;
 
+	LockBufHdr(buf);
+	lsn = BufferGetLSN(buf);
+	UnlockBufHdr(buf);
+
+	if (!XLogNeedsFlush(lsn))
+		return false;
+
 	/*
-	 * Remove the dirty buffer from the ring; necessary to prevent infinite
+	 * Remove the dirty buffer from the ring; necessary to prevent an infinite
 	 * loop if all ring members are dirty.
 	 */
 	strategy->buffers[strategy->current] = InvalidBuffer;
-
 	return true;
 }
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 5400c56a965..04fdea64f83 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -486,6 +486,11 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 /*
  * Internal buffer management routines
  */
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
+
 /* bufmgr.c */
 extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
-- 
2.43.0
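
The control-flow change in the 0001 patch above, replacing a backwards goto with a for (;;) loop that uses continue to retry and break to finish, can be seen in isolation in the following sketch. next_candidate() and usable() are hypothetical helpers standing in for StrategyGetBuffer() and for the lock acquisition and invalidation steps that can fail; neither is part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical candidate source: yields 0, 1, 2, ... */
static int
next_candidate(int *cursor)
{
    return (*cursor)++;
}

/* Hypothetical acceptance test standing in for "lock and invalidate succeeded" */
static bool
usable(int candidate)
{
    return candidate % 3 == 2;
}

/* Retry expressed with a label and a backwards goto */
static int
claim_with_goto(int *cursor)
{
    int c;

again:
    c = next_candidate(cursor);
    if (!usable(c))
        goto again;

    return c;
}

/* Same behaviour expressed with a loop: continue retries, break finishes */
static int
claim_with_loop(int *cursor)
{
    int c;

    for (;;)
    {
        c = next_candidate(cursor);
        if (!usable(c))
            continue;

        break;
    }

    return c;
}
```

Both functions consume the same candidates and return the same result; the loop form just gives later code a natural place to add per-iteration work before the break.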

v9-0002-Split-FlushBuffer-into-two-parts.patch (text/x-patch; charset=US-ASCII)

From 8dff2f6ba609909f2820acde080228b799f3fb02 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 10:54:19 -0400
Subject: [PATCH v9 2/7] Split FlushBuffer() into two parts

Before adding write combining to write a batch of blocks when flushing
dirty buffers, refactor FlushBuffer() into the preparatory step and
actual buffer flushing step. This separation provides symmetry with
future code for batch flushing which necessarily separates these steps,
as it must prepare multiple buffers before flushing them together.

These steps are moved into a new FlushBuffer() helper function,
CleanVictimBuffer() which will contain both the batch flushing and
single flush code in future commits.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 142 +++++++++++++++++++---------
 1 file changed, 98 insertions(+), 44 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 90c24b8d93d..235e261738b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -533,6 +533,12 @@ static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn);
+static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln,
+						  IOObject io_object, IOContext io_context,
+						  XLogRecPtr buffer_lsn);
+static void CleanVictimBuffer(BufferDesc *bufdesc, bool from_ring,
+							  IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2394,12 +2400,8 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 				continue;
 			}
 
-			/* OK, do the I/O */
-			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-			LWLockRelease(content_lock);
-
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &buf_hdr->tag);
+			/* Content lock is released inside CleanVictimBuffer */
+			CleanVictimBuffer(buf_hdr, from_ring, io_context);
 		}
 
 
@@ -4270,54 +4272,64 @@ static void
 FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 			IOContext io_context)
 {
-	XLogRecPtr	recptr;
-	ErrorContextCallback errcallback;
-	instr_time	io_start;
-	Block		bufBlock;
-	char	   *bufToWrite;
-	uint32		buf_state;
+	XLogRecPtr	lsn;
 
-	/*
-	 * Try to start an I/O operation.  If StartBufferIO returns false, then
-	 * someone else flushed the buffer before we could, so we need not do
-	 * anything.
-	 */
-	if (!StartBufferIO(buf, false, false))
-		return;
+	if (PrepareFlushBuffer(buf, &lsn))
+		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
+}
 
-	/* Setup error traceback support for ereport() */
-	errcallback.callback = shared_buffer_write_error_callback;
-	errcallback.arg = buf;
-	errcallback.previous = error_context_stack;
-	error_context_stack = &errcallback;
+/*
+ * Prepare and write out a dirty victim buffer.
+ *
+ * Buffer must be pinned, the content lock must be held exclusively, and the
+ * buffer header spinlock must not be held. The exclusive lock is released and
+ * the buffer is returned pinned but not locked.
+ *
+ * bufdesc may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+				  bool from_ring, IOContext io_context)
+{
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	LWLock	   *content_lock;
 
-	/* Find smgr relation for buffer */
-	if (reln == NULL)
-		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+	/* Set up this victim buffer to be flushed */
+	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
+		return;
 
-	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
-										buf->tag.blockNum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber);
+	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	LWLockRelease(content_lock);
+	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+								  &bufdesc->tag);
+}
 
-	buf_state = LockBufHdr(buf);
+/*
+ * Prepare the buffer with bufdesc for writing. Returns true if the buffer
+ * actually needs writing and false otherwise. lsn returns the buffer's LSN if
+ * the table is logged.
+ */
+static bool
+PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
+	uint32		buf_state;
 
 	/*
-	 * Run PageGetLSN while holding header lock, since we don't have the
-	 * buffer locked exclusively in all cases.
+	 * Try to start an I/O operation.  If StartBufferIO returns false, then
+	 * someone else flushed the buffer before we could, so we need not do
+	 * anything.
 	 */
-	recptr = BufferGetLSN(buf);
+	if (!StartBufferIO(bufdesc, false, false))
+		return false;
 
-	/* To check if block content changes while flushing. - vadim 01/17/97 */
-	UnlockBufHdrExt(buf, buf_state,
-					0, BM_JUST_DIRTIED,
-					0);
+	*lsn = InvalidXLogRecPtr;
+	buf_state = LockBufHdr(bufdesc);
 
 	/*
-	 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
-	 * rule that log updates must hit disk before any of the data-file changes
-	 * they describe do.
+	 * Record the buffer's LSN. We will force XLOG flush up to buffer's LSN.
+	 * This implements the basic WAL rule that log updates must hit disk
+	 * before any of the data-file changes they describe do.
 	 *
 	 * However, this rule does not apply to unlogged relations, which will be
 	 * lost after a crash anyway.  Most unlogged relation pages do not bear
@@ -4330,9 +4342,51 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * happen, attempting to flush WAL through that location would fail, with
 	 * disastrous system-wide consequences.  To make sure that can't happen,
 	 * skip the flush if the buffer isn't permanent.
+	 *
+	 * We must hold the buffer header lock when examining the page LSN since
+	 * we don't have the buffer exclusively locked in all cases.
 	 */
 	if (buf_state & BM_PERMANENT)
-		XLogFlush(recptr);
+		*lsn = BufferGetLSN(bufdesc);
+
+	/* To check if block content changes while flushing. - vadim 01/17/97 */
+	UnlockBufHdrExt(bufdesc, buf_state,
+					0, BM_JUST_DIRTIED,
+					0);
+	return true;
+}
+
+/*
+ * Actually do the write I/O to clean a buffer. buf and reln may be modified.
+ */
+static void
+DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+			  IOContext io_context, XLogRecPtr buffer_lsn)
+{
+	ErrorContextCallback errcallback;
+	instr_time	io_start;
+	Block		bufBlock;
+	char	   *bufToWrite;
+
+	/* Setup error traceback support for ereport() */
+	errcallback.callback = shared_buffer_write_error_callback;
+	errcallback.arg = buf;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* Find smgr relation for buffer */
+	if (reln == NULL)
+		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+
+	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
+										buf->tag.blockNum,
+										reln->smgr_rlocator.locator.spcOid,
+										reln->smgr_rlocator.locator.dbOid,
+										reln->smgr_rlocator.locator.relNumber);
+
+	/* Force XLOG flush up to buffer's LSN */
+	if (XLogRecPtrIsValid(buffer_lsn))
+		XLogFlush(buffer_lsn);
 
 	/*
 	 * Now it's safe to write the buffer to disk. Note that no one else should
-- 
2.43.0
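
The split of FlushBuffer() into a prepare step and a write step can be
illustrated with a small standalone model. This is only a sketch under
stated assumptions: ToyBuffer, toy_prepare_flush(), toy_do_flush() and
the toy_wal_flushed_upto counter are hypothetical stand-ins, not
PostgreSQL APIs, and locking is omitted entirely. It shows the prepare
step recording an LSN only for permanent buffers, and the write step
flushing WAL up to that LSN before the data write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for XLogRecPtr and InvalidXLogRecPtr. */
typedef uint64_t ToyLsn;
#define TOY_INVALID_LSN ((ToyLsn) 0)

/* Hypothetical, heavily simplified buffer descriptor. */
typedef struct ToyBuffer
{
	bool		dirty;
	bool		io_in_progress;
	bool		permanent;		/* models BM_PERMANENT */
	ToyLsn		page_lsn;
} ToyBuffer;

static ToyLsn toy_wal_flushed_upto = 0;	/* models how far WAL is durable */
static int	toy_data_writes = 0;		/* counts data-file writes */

/*
 * Prepare step: claim the I/O and report the LSN to flush WAL through.
 * Returns false if there is nothing to write or someone else is writing.
 */
static bool
toy_prepare_flush(ToyBuffer *buf, ToyLsn *lsn)
{
	if (!buf->dirty || buf->io_in_progress)
		return false;

	buf->io_in_progress = true;

	/* Unlogged pages must never force a WAL flush. */
	*lsn = buf->permanent ? buf->page_lsn : TOY_INVALID_LSN;
	return true;
}

/* Write step: WAL first (if an LSN was recorded), then the data page. */
static void
toy_do_flush(ToyBuffer *buf, ToyLsn lsn)
{
	if (lsn != TOY_INVALID_LSN && lsn > toy_wal_flushed_upto)
		toy_wal_flushed_upto = lsn;

	toy_data_writes++;
	buf->dirty = false;
	buf->io_in_progress = false;
}

/* Same shape as the refactored FlushBuffer(): prepare, then write. */
static void
toy_flush(ToyBuffer *buf)
{
	ToyLsn		lsn;

	if (toy_prepare_flush(buf, &lsn))
		toy_do_flush(buf, lsn);
}
```

Keeping the LSN as an output of the prepare step is what lets a later
caller prepare several buffers, track the maximum LSN, and issue one
WAL flush for the whole group.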

Attachment: v9-0003-Eagerly-flush-bulkwrite-strategy-ring.patch (text/x-patch; charset=US-ASCII)

From d4d815c9ef5a270d8e9c7dde59e2121d3c1586a7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 13:15:43 -0400
Subject: [PATCH v9 3/7] Eagerly flush bulkwrite strategy ring

Operations using BAS_BULKWRITE (COPY FROM and createdb) will inevitably
need to flush buffers in the strategy ring in order to reuse them. By
eagerly flushing the buffers in a larger run, we encourage larger writes
at the kernel level and less interleaving of WAL flushes and data file
writes. The effect is mainly noticeable with multiple parallel COPY
FROMs. In this case, client backends achieve higher write throughput and
end up spending less time waiting on acquiring the lock to flush WAL.
Larger flush operations also mean less time waiting for flush operations
at the kernel level.

The heuristic for eager eviction is to only flush buffers in the
strategy ring which do not require a WAL flush.

This patch is also a step toward AIO writes, as it lines up multiple
buffers that can be issued asynchronously once the infrastructure
exists.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Earlier version Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 237 +++++++++++++++++++++++++-
 src/backend/storage/buffer/freelist.c |  48 ++++++
 src/include/storage/buf_internals.h   |   4 +
 3 files changed, 281 insertions(+), 8 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 235e261738b..57a3eae865e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -531,14 +531,25 @@ static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_c
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
+
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
-static bool PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn);
+static BufferDesc *NextStrategyBufToFlush(BufferAccessStrategy strategy,
+										  Buffer sweep_end,
+										  XLogRecPtr *lsn, int *sweep_cursor);
+
+static bool BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn);
+static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+												   RelFileLocator *rlocator, bool skip_pinned,
+												   XLogRecPtr *max_lsn);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc,
+							   XLogRecPtr *lsn);
 static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						  IOObject io_object, IOContext io_context,
 						  XLogRecPtr buffer_lsn);
-static void CleanVictimBuffer(BufferDesc *bufdesc, bool from_ring,
-							  IOContext io_context);
+static void CleanVictimBuffer(BufferAccessStrategy strategy,
+							  BufferDesc *bufdesc,
+							  bool from_ring, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2401,7 +2412,7 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 			}
 
 			/* Content lock is released inside CleanVictimBuffer */
-			CleanVictimBuffer(buf_hdr, from_ring, io_context);
+			CleanVictimBuffer(strategy, buf_hdr, from_ring, io_context);
 		}
 
 
@@ -4278,6 +4289,61 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * Returns true if the buffer needs WAL flushed before it can be written out.
+ * Caller must not already hold the buffer header spinlock.
+ */
+static bool
+BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
+	uint32		buf_state = LockBufHdr(bufdesc);
+
+	*lsn = BufferGetLSN(bufdesc);
+
+	UnlockBufHdr(bufdesc);
+
+	/*
+	 * See buffer flushing code for more details on why we condition this on
+	 * the relation being logged.
+	 */
+	return buf_state & BM_PERMANENT && XLogNeedsFlush(*lsn);
+}
+
+
+/*
+ * Returns the buffer descriptor of the buffer containing the next block we
+ * should eagerly flush or NULL when there are no further buffers to consider
+ * writing out.
+ */
+static BufferDesc *
+NextStrategyBufToFlush(BufferAccessStrategy strategy,
+					   Buffer sweep_end,
+					   XLogRecPtr *lsn, int *sweep_cursor)
+{
+	Buffer		bufnum;
+	BufferDesc *bufdesc;
+
+	while ((bufnum =
+			StrategyNextBuffer(strategy, sweep_cursor)) != sweep_end)
+	{
+		/*
+		 * For BAS_BULKWRITE, once you hit an InvalidBuffer, the remaining
+		 * buffers in the ring will be invalid.
+		 */
+		if (!BufferIsValid(bufnum))
+			break;
+
+		if ((bufdesc = PrepareOrRejectEagerFlushBuffer(bufnum,
+													   InvalidBlockNumber,
+													   NULL,
+													   true,
+													   lsn)) != NULL)
+			return bufdesc;
+	}
+
+	return NULL;
+}
+
 /*
  * Prepare and write out a dirty victim buffer.
  *
@@ -4288,21 +4354,176 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
  * bufdesc may be modified.
  */
 static void
-CleanVictimBuffer(BufferDesc *bufdesc,
+CleanVictimBuffer(BufferAccessStrategy strategy,
+				  BufferDesc *bufdesc,
 				  bool from_ring, IOContext io_context)
 {
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
+	bool		first_buffer = true;
 
 	/* Set up this victim buffer to be flushed */
 	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
 		return;
 
-	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	if (from_ring && StrategySupportsEagerFlush(strategy))
+	{
+		Buffer		sweep_end = BufferDescriptorGetBuffer(bufdesc);
+		int			cursor = StrategyGetCurrentIndex(strategy);
+
+		/* Clean victim buffer and find more to flush opportunistically */
+		do
+		{
+			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+			content_lock = BufferDescriptorGetContentLock(bufdesc);
+			LWLockRelease(content_lock);
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &bufdesc->tag);
+			/* We leave the first buffer pinned for the caller */
+			if (!first_buffer)
+				UnpinBuffer(bufdesc);
+			first_buffer = false;
+		} while ((bufdesc = NextStrategyBufToFlush(strategy, sweep_end,
+												   &max_lsn, &cursor)) != NULL);
+	}
+	else
+	{
+		DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+		content_lock = BufferDescriptorGetContentLock(bufdesc);
+		LWLockRelease(content_lock);
+		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+									  &bufdesc->tag);
+	}
+}
+
+/*
+ * Prepare bufdesc for eager flushing.
+ *
+ * Given bufnum, return the buffer descriptor of the buffer to eagerly flush,
+ * pinned and locked, or NULL if this buffer does not contain a block that
+ * should be flushed.
+ *
+ * require is the BlockNumber required by the caller. Some callers may require
+ * a specific BlockNumber to be in bufnum because they are assembling a
+ * contiguous run of blocks.
+ *
+ * If the caller needs the block to be from a specific relation, rlocator will
+ * be provided.
+ */
+static BufferDesc *
+PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+								RelFileLocator *rlocator, bool skip_pinned,
+								XLogRecPtr *max_lsn)
+{
+	BufferDesc *bufdesc;
+	uint32		old_buf_state;
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+	BlockNumber blknum;
+	LWLock	   *content_lock;
+
+	if (!BufferIsValid(bufnum))
+		return NULL;
+
+	Assert(!BufferIsLocal(bufnum));
+
+	bufdesc = GetBufferDescriptor(bufnum - 1);
+
+	/* Block may need to be in a specific relation */
+	if (rlocator &&
+		!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufdesc->tag),
+							  *rlocator))
+		return NULL;
+
+	/*
+	 * Ensure that there's a free refcount entry and resource owner slot for
+	 * the pin before pinning the buffer. While this may leak a refcount and
+	 * slot if we return without a buffer, we should use that slot the next
+	 * time we try and reserve a spot.
+	 */
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	ReservePrivateRefCountEntry();
+
+	/*
+	 * Check whether the buffer can be used and pin it if so. Do this using a
+	 * CAS loop, to avoid having to lock the buffer header. We have to lock
+	 * the buffer header later if we succeed in pinning the buffer here, but
+	 * avoiding locking the buffer header if the buffer is in use is worth it.
+	 */
+	old_buf_state = pg_atomic_read_u32(&bufdesc->state);
+
+	for (;;)
+	{
+		buf_state = old_buf_state;
+
+		if (!(buf_state & BM_DIRTY) || !(buf_state & BM_VALID))
+			return NULL;
+
+		/* We don't eagerly flush buffers used by others */
+		if (skip_pinned &&
+			(BUF_STATE_GET_REFCOUNT(buf_state) > 0 ||
+			 BUF_STATE_GET_USAGECOUNT(buf_state) > 1))
+			return NULL;
+
+		if (unlikely(buf_state & BM_LOCKED))
+		{
+			old_buf_state = WaitBufHdrUnlocked(bufdesc);
+			continue;
+		}
+
+		/* pin the buffer if the CAS succeeds */
+		buf_state += BUF_REFCOUNT_ONE;
+
+		if (pg_atomic_compare_exchange_u32(&bufdesc->state, &old_buf_state,
+										   buf_state))
+		{
+			TrackNewBufferPin(BufferDescriptorGetBuffer(bufdesc));
+			break;
+		}
+	}
+
+	CheckBufferIsPinnedOnce(bufnum);
+
+	blknum = BufferGetBlockNumber(bufnum);
+	Assert(BlockNumberIsValid(blknum));
+
+	/* We only include contiguous blocks in the run */
+	if (BlockNumberIsValid(require) && blknum != require)
+		goto except_unpin_buffer;
+
+	/* Don't eagerly flush buffers requiring WAL flush */
+	if (BufferNeedsWALFlush(bufdesc, &lsn))
+		goto except_unpin_buffer;
+
 	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+		goto except_unpin_buffer;
+
+	/*
+	 * Now that we have the content lock, we need to recheck if we need to
+	 * flush WAL.
+	 */
+	if (BufferNeedsWALFlush(bufdesc, &lsn))
+		goto except_unlock_content;
+
+	/* Try to start an I/O operation */
+	if (!StartBufferIO(bufdesc, false, true))
+		goto except_unlock_content;
+
+	if (lsn > *max_lsn)
+		*max_lsn = lsn;
+
+	buf_state = LockBufHdr(bufdesc);
+	UnlockBufHdrExt(bufdesc, buf_state, 0, BM_JUST_DIRTIED, 0);
+
+	return bufdesc;
+
+except_unlock_content:
 	LWLockRelease(content_lock);
-	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-								  &bufdesc->tag);
+
+except_unpin_buffer:
+	UnpinBuffer(bufdesc);
+	return NULL;
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 1465984b141..10301f4aab2 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -156,6 +156,31 @@ ClockSweepTick(void)
 	return victim;
 }
 
+/*
+ * Some BufferAccessStrategies support eager flushing -- which is flushing
+ * buffers in the ring before they are needed. This can lead to better I/O
+ * patterns than lazily flushing buffers immediately before reusing them.
+ */
+bool
+StrategySupportsEagerFlush(BufferAccessStrategy strategy)
+{
+	Assert(strategy);
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKWRITE:
+			return true;
+		case BAS_VACUUM:
+		case BAS_NORMAL:
+		case BAS_BULKREAD:
+			return false;
+		default:
+			elog(ERROR, "unrecognized buffer access strategy: %d",
+				 (int) strategy->btype);
+			return false;
+	}
+}
+
 /*
  * StrategyGetBuffer
  *
@@ -307,6 +332,29 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	}
 }
 
+/*
+ * Given a position in the ring, cursor, increment the position, and return
+ * the buffer at this position.
+ */
+Buffer
+StrategyNextBuffer(BufferAccessStrategy strategy, int *cursor)
+{
+	if (++(*cursor) >= strategy->nbuffers)
+		*cursor = 0;
+
+	return strategy->buffers[*cursor];
+}
+
+/*
+ * Return the current slot in the strategy ring.
+ */
+int
+StrategyGetCurrentIndex(BufferAccessStrategy strategy)
+{
+	return strategy->current;
+}
+
+
 /*
  * StrategySyncStart -- tell BgBufferSync where to start syncing
  *
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 04fdea64f83..c07e309a288 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -506,6 +506,10 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 
 /* freelist.c */
+extern bool StrategySupportsEagerFlush(BufferAccessStrategy strategy);
+extern Buffer StrategyNextBuffer(BufferAccessStrategy strategy,
+								 int *cursor);
+extern int	StrategyGetCurrentIndex(BufferAccessStrategy strategy);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
-- 
2.43.0
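
The eager-flush sweep over the strategy ring can be modeled in a few
lines. The following is a rough sketch, not the patch's code: ToyRing,
ToyBuf, toy_ring_next() and toy_eager_sweep() are hypothetical names,
pins, locks and the CAS loop are left out, and "needs a WAL flush" is
approximated as page_lsn > wal_flushed_upto. It shows the cursor
wrapping around the ring, the sweep ending at the victim buffer or at
the first empty slot, and buffers that would require a WAL flush being
skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_INVALID_BUFFER 0	/* models InvalidBuffer */
#define TOY_RING_SIZE 6

/* Hypothetical, simplified buffer state. */
typedef struct ToyBuf
{
	bool		dirty;
	uint64_t	page_lsn;
} ToyBuf;

/* Hypothetical strategy ring: 1-based buffer ids, 0 means empty slot. */
typedef struct ToyRing
{
	int			nbuffers;
	int			current;
	int			buffers[TOY_RING_SIZE];
} ToyRing;

/* Advance the cursor with wraparound and return the buffer found there. */
static int
toy_ring_next(const ToyRing *ring, int *cursor)
{
	if (++(*cursor) >= ring->nbuffers)
		*cursor = 0;
	return ring->buffers[*cursor];
}

/*
 * Sweep the ring starting after the current slot. Stops on reaching the
 * victim again or on the first empty slot. Returns how many buffers were
 * cleaned. Assumes the victim is present in the ring.
 */
static int
toy_eager_sweep(const ToyRing *ring, ToyBuf *pool,
				uint64_t wal_flushed_upto, int victim)
{
	int			cursor = ring->current;
	int			flushed = 0;
	int			buf;

	while ((buf = toy_ring_next(ring, &cursor)) != victim)
	{
		ToyBuf	   *b;

		/* Once a slot is empty, the rest of the ring is too. */
		if (buf == TOY_INVALID_BUFFER)
			break;

		b = &pool[buf - 1];

		/* Nothing to write. */
		if (!b->dirty)
			continue;

		/* Heuristic: do not eagerly flush buffers needing a WAL flush. */
		if (b->page_lsn > wal_flushed_upto)
			continue;

		b->dirty = false;		/* stands in for the actual write */
		flushed++;
	}

	return flushed;
}
```

In this model the victim itself is never touched by the sweep, which
mirrors the patch cleaning the victim first and then looking for more
buffers to flush opportunistically.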

Attachment: v9-0004-Write-combining-for-BAS_BULKWRITE.patch (text/x-patch; charset=US-ASCII)

From 943acb585388f2d7ac6d5f1acc16d2aa7c122a7d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 13:42:47 -0400
Subject: [PATCH v9 4/7] Write combining for BAS_BULKWRITE

Implement write combining for users of the bulkwrite buffer access
strategy (e.g. COPY FROM). When the buffer access strategy needs to
clean a buffer for reuse, it already opportunistically flushes some
other buffers. Now, combine any contiguous blocks from the same relation
into larger writes and issue them with smgrwritev().

The performance benefit for COPY FROM is mostly noticeable for multiple
concurrent COPY FROMs because a single COPY FROM is either CPU bound or
bound by WAL writes.

The infrastructure for flushing larger batches of IOs will be reused by
checkpointer and other processes doing writes of dirty data.

XXX: Because this sets in-place checksums for batches, it is not
committable until additional infrastructure goes in place.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_bcWRvRwZUop_d9vzF9nHAiT%2B-uPzkJ%3DS3ShZ1GqeAYOw%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 216 ++++++++++++++++++++++++--
 src/backend/storage/buffer/freelist.c |  23 +++
 src/backend/storage/page/bufpage.c    |  20 +++
 src/backend/utils/probes.d            |   2 +
 src/include/storage/buf_internals.h   |  32 ++++
 src/include/storage/bufpage.h         |   2 +
 src/tools/pgindent/typedefs.list      |   1 +
 7 files changed, 284 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 57a3eae865e..8a3a5a04d91 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -537,7 +537,11 @@ static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 static BufferDesc *NextStrategyBufToFlush(BufferAccessStrategy strategy,
 										  Buffer sweep_end,
 										  XLogRecPtr *lsn, int *sweep_cursor);
-
+static void FindFlushAdjacents(BufferAccessStrategy strategy, Buffer sweep_end,
+							   BufferDesc *batch_start,
+							   uint32 max_batch_size,
+							   BufWriteBatch *batch,
+							   int *sweep_cursor);
 static bool BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn);
 static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
 												   RelFileLocator *rlocator, bool skip_pinned,
@@ -4310,10 +4314,90 @@ BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
 }
 
 
+
+/*
+ * Given a starting buffer descriptor from a strategy ring that supports eager
+ * flushing, find additional buffers from the ring that can be combined into a
+ * single write batch with the starting buffer.
+ *
+ * max_batch_size is the maximum number of blocks that can be combined into a
+ * single write in general. This function, based on the block number of start,
+ * will determine the maximum IO size for this particular write given how much
+ * of the file remains. max_batch_size is provided by the caller so it doesn't
+ * have to be recalculated for each write.
+ *
+ * batch is an output parameter that this function will fill with the needed
+ * information to issue this IO.
+ *
+ * This function will pin and content lock all of the buffers that it
+ * assembles for the IO batch. The caller is responsible for issuing the IO.
+ */
+static void
+FindFlushAdjacents(BufferAccessStrategy strategy, Buffer sweep_end,
+				   BufferDesc *batch_start,
+				   uint32 max_batch_size,
+				   BufWriteBatch *batch,
+				   int *sweep_cursor)
+{
+	BlockNumber limit;
+
+	Assert(batch_start);
+	batch->bufdescs[0] = batch_start;
+
+	LockBufHdr(batch_start);
+	batch->max_lsn = BufferGetLSN(batch_start);
+	UnlockBufHdr(batch_start);
+
+	batch->start = batch->bufdescs[0]->tag.blockNum;
+	Assert(BlockNumberIsValid(batch->start));
+	batch->n = 1;
+	batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+	batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
+	batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+	limit = smgrmaxcombine(batch->reln, batch->forkno, batch->start);
+	limit = Min(max_batch_size, limit);
+	limit = Min(GetAdditionalPinLimit(), limit);
+
+	/*
+	 * It's possible we're not allowed any more pins or there aren't more
+	 * blocks in the target relation. In this case, just return. Our batch
+	 * will have only one buffer.
+	 */
+	if (limit <= 0)
+		return;
+
+	/* Now assemble a run of blocks to write out. */
+	for (; batch->n < limit; batch->n++)
+	{
+		Buffer		bufnum;
+
+		if ((bufnum =
+			 StrategyNextBuffer(strategy, sweep_cursor)) == sweep_end)
+			break;
+
+		/*
+		 * For BAS_BULKWRITE, once you hit an InvalidBuffer, the remaining
+		 * buffers in the ring will be invalid.
+		 */
+		if (!BufferIsValid(bufnum))
+			break;
+
+		/* Stop when we encounter a buffer that will break the run */
+		if ((batch->bufdescs[batch->n] =
+			 PrepareOrRejectEagerFlushBuffer(bufnum,
+											 batch->start + batch->n,
+											 &batch->rlocator,
+											 true,
+											 &batch->max_lsn)) == NULL)
+			break;
+	}
+}
+
 /*
  * Returns the buffer descriptor of the buffer containing the next block we
  * should eagerly flush or NULL when there are no further buffers to consider
- * writing out.
+ * writing out. This will be the start of a new batch of buffers to write out.
  */
 static BufferDesc *
 NextStrategyBufToFlush(BufferAccessStrategy strategy,
@@ -4360,7 +4444,6 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 {
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
-	bool		first_buffer = true;
 
 	/* Set up this victim buffer to be flushed */
 	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
@@ -4370,19 +4453,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 	{
 		Buffer		sweep_end = BufferDescriptorGetBuffer(bufdesc);
 		int			cursor = StrategyGetCurrentIndex(strategy);
+		uint32		max_batch_size = StrategyMaxWriteBatchSize(strategy);
+
+		/* Pin our victim again so it stays ours even after batch released */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+		IncrBufferRefCount(BufferDescriptorGetBuffer(bufdesc));
 
 		/* Clean victim buffer and find more to flush opportunistically */
 		do
 		{
-			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
-			content_lock = BufferDescriptorGetContentLock(bufdesc);
-			LWLockRelease(content_lock);
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &bufdesc->tag);
-			/* We leave the first buffer pinned for the caller */
-			if (!first_buffer)
-				UnpinBuffer(bufdesc);
-			first_buffer = false;
+			BufWriteBatch batch;
+
+			FindFlushAdjacents(strategy, sweep_end, bufdesc, max_batch_size,
+							   &batch, &cursor);
+			FlushBufferBatch(&batch, io_context);
+			CompleteWriteBatchIO(&batch, io_context, &BackendWritebackContext);
 		} while ((bufdesc = NextStrategyBufToFlush(strategy, sweep_end,
 												   &max_lsn, &cursor)) != NULL);
 	}
@@ -4526,6 +4612,70 @@ except_unpin_buffer:
 	return NULL;
 }
 
+/*
+ * Given a prepared batch of buffers, write them out as a vector.
+ */
+void
+FlushBufferBatch(BufWriteBatch *batch,
+				 IOContext io_context)
+{
+	BlockNumber blknums[MAX_IO_COMBINE_LIMIT];
+	Block		blocks[MAX_IO_COMBINE_LIMIT];
+	instr_time	io_start;
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+
+	if (XLogRecPtrIsValid(batch->max_lsn))
+		XLogFlush(batch->max_lsn);
+
+	if (batch->reln == NULL)
+		batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+#ifdef USE_ASSERT_CHECKING
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		XLogRecPtr	lsn;
+
+		Assert(!BufferNeedsWALFlush(batch->bufdescs[i], &lsn));
+	}
+#endif
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_START(batch->forkno,
+											  batch->reln->smgr_rlocator.locator.spcOid,
+											  batch->reln->smgr_rlocator.locator.dbOid,
+											  batch->reln->smgr_rlocator.locator.relNumber,
+											  batch->reln->smgr_rlocator.backend,
+											  batch->n);
+
+	/*
+	 * XXX: All blocks should be copied and then checksummed but doing so
+	 * takes a lot of extra memory and a future patch will eliminate this
+	 * requirement.
+	 */
+	for (BlockNumber i = 0; i < batch->n; i++)
+	{
+		blknums[i] = batch->start + i;
+		blocks[i] = BufHdrGetBlock(batch->bufdescs[i]);
+	}
+
+	PageSetBatchChecksumInplace((Page *) blocks, blknums, batch->n);
+
+	io_start = pgstat_prepare_io_time(track_io_timing);
+
+	smgrwritev(batch->reln, batch->forkno,
+			   batch->start, (const void **) blocks, batch->n, false);
+
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context, IOOP_WRITE,
+							io_start, batch->n, BLCKSZ);
+
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Prepare the buffer with bufdesc for writing. Returns true if the buffer
 * actually needs writing and false otherwise. lsn returns the buffer's LSN if
@@ -4687,6 +4837,48 @@ FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 	LWLockRelease(BufferDescriptorGetContentLock(buf));
 }
 
+/*
+ * Given a previously initialized batch with buffers that have already been
+ * flushed, terminate the IO on each buffer and then unlock and unpin them.
+ * This assumes all the buffers were locked and pinned. wb_context will be
+ * modified.
+ */
+void
+CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+					 WritebackContext *wb_context)
+{
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+	pgBufferUsage.shared_blks_written += batch->n;
+
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		Buffer		buffer = BufferDescriptorGetBuffer(batch->bufdescs[i]);
+
+		errcallback.arg = batch->bufdescs[i];
+
+		/* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state. */
+		TerminateBufferIO(batch->bufdescs[i], true, 0, true, false);
+		LWLockRelease(BufferDescriptorGetContentLock(batch->bufdescs[i]));
+		ReleaseBuffer(buffer);
+		ScheduleBufferTagForWriteback(wb_context, io_context,
+									  &batch->bufdescs[i]->tag);
+	}
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_DONE(batch->forkno,
+											 batch->reln->smgr_rlocator.locator.spcOid,
+											 batch->reln->smgr_rlocator.locator.dbOid,
+											 batch->reln->smgr_rlocator.locator.relNumber,
+											 batch->reln->smgr_rlocator.backend,
+											 batch->n, batch->start);
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * RelationGetNumberOfBlocksInFork
  *		Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 10301f4aab2..6eeefccfeca 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -776,6 +776,29 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	return NULL;
 }
 
+
+/*
+ * Determine the largest IO we can assemble from the given strategy ring given
+ * strategy-specific as well as global constraints on the number of pinned
+ * buffers and max IO size.
+ */
+uint32
+StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
+{
+	uint32		max_write_batch_size = Min(io_combine_limit, MAX_IO_COMBINE_LIMIT);
+	int			strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+	uint32		max_possible_buffer_limit = GetPinLimit();
+
+	/* Identify the minimum of the above */
+	max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+	max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+
+	/* Must allow at least 1 IO for forward progress */
+	max_write_batch_size = Max(1, max_write_batch_size);
+
+	return max_write_batch_size;
+}
+
 /*
  * AddBufferToRing -- add a buffer to the buffer ring
  *
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index aac6e695954..7c2ec99f939 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
 	((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
 }
+
+/*
+ * A helper to set multiple blocks' checksums
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, const BlockNumber *blknos, uint32 length)
+{
+	/* If we don't need a checksum, just return */
+	if (!DataChecksumsEnabled())
+		return;
+
+	for (uint32 i = 0; i < length; i++)
+	{
+		Page		page = pages[i];
+
+		if (PageIsNew(page))
+			continue;
+		((PageHeader) page)->pd_checksum = pg_checksum_page(page, blknos[i]);
+	}
+}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index e9e413477ba..36dd4f8375b 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -61,6 +61,8 @@ provider postgresql {
 	probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
 	probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
 	probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
+	probe buffer__batch__flush__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
+	probe buffer__batch__flush__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
 
 	probe buffer__checkpoint__start(int);
 	probe buffer__checkpoint__sync__start();
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c07e309a288..ab502c4f825 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -483,6 +483,34 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 	ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
 }
 
+/*
+ * Used to write out multiple blocks at a time in a combined IO. bufdescs
+ * contains buffer descriptors for buffers containing adjacent blocks of the
+ * same fork of the same relation.
+ */
+typedef struct BufWriteBatch
+{
+	RelFileLocator rlocator;
+	ForkNumber	forkno;
+	SMgrRelation reln;
+
+	/*
+	 * The BlockNumber of the first block in the run of contiguous blocks to
+	 * be written out as a single IO.
+	 */
+	BlockNumber start;
+
+	/*
+	 * While assembling the buffers, we keep track of the maximum LSN so that
+	 * we can flush WAL through this LSN before flushing the buffers.
+	 */
+	XLogRecPtr	max_lsn;
+
+	/* The number of valid buffers in bufdescs */
+	uint32		n;
+	BufferDesc *bufdescs[MAX_IO_COMBINE_LIMIT];
+} BufWriteBatch;
+
 /*
  * Internal buffer management routines
  */
@@ -496,6 +524,7 @@ extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
 										  IOContext io_context, BufferTag *tag);
+extern void FlushBufferBatch(BufWriteBatch *batch, IOContext io_context);
 
 extern void TrackNewBufferPin(Buffer buf);
 
@@ -507,9 +536,12 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 /* freelist.c */
 extern bool StrategySupportsEagerFlush(BufferAccessStrategy strategy);
+extern uint32 StrategyMaxWriteBatchSize(BufferAccessStrategy strategy);
 extern Buffer StrategyNextBuffer(BufferAccessStrategy strategy,
 								 int *cursor);
 extern int	StrategyGetCurrentIndex(BufferAccessStrategy strategy);
+extern void CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+								 WritebackContext *wb_context);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index abc2cf2a020..29a400a71eb 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -506,5 +506,7 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 									const void *newtup, Size newsize);
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern void PageSetBatchChecksumInplace(Page *pages, const BlockNumber *blknos,
+										uint32 length);
 
 #endif							/* BUFPAGE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bce72ae64..20e31307c6f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -350,6 +350,7 @@ BufferManagerRelation
 BufferStrategyControl
 BufferTag
 BufferUsage
+BufWriteBatch
 BuildAccumulator
 BuiltinScript
 BulkInsertState
-- 
2.43.0

v9-0005-Add-database-Oid-to-CkptSortItem.patch (text/x-patch; charset=US-ASCII)

From f53a2be31f3853db95a56b6f7e9448ca328ff4a3 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:22:11 -0400
Subject: [PATCH v9 5/7] Add database Oid to CkptSortItem

This is useful for checkpointer write combining -- which will be added
in a future commit.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 8 ++++++++
 src/include/storage/buf_internals.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8a3a5a04d91..7ecf201d1ef 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3405,6 +3405,7 @@ BufferSync(int flags)
 			item = &CkptBufferIds[num_to_scan++];
 			item->buf_id = buf_id;
 			item->tsId = bufHdr->tag.spcOid;
+			item->dbId = bufHdr->tag.dbOid;
 			item->relNumber = BufTagGetRelNumber(&bufHdr->tag);
 			item->forkNum = BufTagGetForkNum(&bufHdr->tag);
 			item->blockNum = bufHdr->tag.blockNum;
@@ -6810,6 +6811,13 @@ ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b)
 		return -1;
 	else if (a->tsId > b->tsId)
 		return 1;
+
+	/* compare database */
+	if (a->dbId < b->dbId)
+		return -1;
+	else if (a->dbId > b->dbId)
+		return 1;
+
 	/* compare relation */
 	if (a->relNumber < b->relNumber)
 		return -1;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index ab502c4f825..feb370175f0 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -449,6 +449,7 @@ extern uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 typedef struct CkptSortItem
 {
 	Oid			tsId;
+	Oid			dbId;
 	RelFileNumber relNumber;
 	ForkNumber	forkNum;
 	BlockNumber blockNum;
-- 
2.43.0
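
The ordering this patch produces can be sketched in standalone C. This is a simplified illustration, not the patch's code: SketchSortItem, CMP_FIELD, sketch_item_cmp, and sketch_sort_items are hypothetical names, and the fields are plain integers rather than Oid, RelFileNumber, and so on. It only shows the database being compared after the tablespace and before the relation, as in the comparator hunk above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for CkptSortItem; field types are simplified. */
typedef struct SketchSortItem
{
	uint32_t	tsId;		/* tablespace */
	uint32_t	dbId;		/* database (the field this patch adds) */
	uint32_t	relNumber;
	int			forkNum;
	uint32_t	blockNum;
	int			buf_id;
} SketchSortItem;

/* Return from the enclosing comparator as soon as one field differs. */
#define CMP_FIELD(a, b, f) \
	do { \
		if ((a)->f < (b)->f) return -1; \
		if ((a)->f > (b)->f) return 1; \
	} while (0)

static int
sketch_item_cmp(const void *pa, const void *pb)
{
	const SketchSortItem *a = pa;
	const SketchSortItem *b = pb;

	CMP_FIELD(a, b, tsId);
	CMP_FIELD(a, b, dbId);
	CMP_FIELD(a, b, relNumber);
	CMP_FIELD(a, b, forkNum);
	CMP_FIELD(a, b, blockNum);
	return 0;
}

/* Sort items by tablespace, database, relation, fork, then block. */
static void
sketch_sort_items(SketchSortItem *items, size_t n)
{
	assert(items != NULL || n == 0);
	qsort(items, n, sizeof(SketchSortItem), sketch_item_cmp);
}
```

With this ordering, items for the same database, relation, and fork end up adjacent and in block order, which is what the batching in 0006 relies on when it compares item.dbId against batch.rlocator.dbOid.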

v9-0006-Implement-checkpointer-data-write-combining.patch (text/x-patch; charset=US-ASCII)

From a410d8ee569b27878c22a1061ef789749c8f8c38 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 15:23:16 -0400
Subject: [PATCH v9 6/7] Implement checkpointer data write combining

When the checkpointer writes out dirty buffers, writing multiple
contiguous blocks as a single IO is a substantial performance
improvement. The checkpointer is usually bottlenecked on IO, so issuing
larger IOs leads to increased write throughput and faster checkpoints.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: BharatDB <bharatdbpg@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 224 ++++++++++++++++++++++++----
 src/backend/utils/probes.d          |   2 +-
 2 files changed, 198 insertions(+), 28 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7ecf201d1ef..bfc07edb40e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -513,6 +513,7 @@ static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
 static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
+static uint32 CheckpointerMaxBatchSize(void);
 static void BufferSync(int flags);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
@@ -3346,7 +3347,6 @@ TrackNewBufferPin(Buffer buf)
 static void
 BufferSync(int flags)
 {
-	uint32		buf_state;
 	int			buf_id;
 	int			num_to_scan;
 	int			num_spaces;
@@ -3358,6 +3358,8 @@ BufferSync(int flags)
 	int			i;
 	uint32		mask = BM_DIRTY;
 	WritebackContext wb_context;
+	uint32		max_batch_size;
+	BufWriteBatch batch;
 
 	/*
 	 * Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3389,6 +3391,7 @@ BufferSync(int flags)
 	{
 		BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
 		uint32		set_bits = 0;
+		uint32		buf_state;
 
 		/*
 		 * Header spinlock is enough to examine BM_DIRTY, see comment in
@@ -3531,48 +3534,199 @@ BufferSync(int flags)
 	 */
 	num_processed = 0;
 	num_written = 0;
+	max_batch_size = CheckpointerMaxBatchSize();
 	while (!binaryheap_empty(ts_heap))
 	{
+		BlockNumber limit = max_batch_size;
 		BufferDesc *bufHdr = NULL;
 		CkptTsStatus *ts_stat = (CkptTsStatus *)
 			DatumGetPointer(binaryheap_first(ts_heap));
+		int			ts_end = ts_stat->index - ts_stat->num_scanned + ts_stat->num_to_scan;
+		int			processed = 0;
 
-		buf_id = CkptBufferIds[ts_stat->index].buf_id;
-		Assert(buf_id != -1);
+		batch.start = InvalidBlockNumber;
+		batch.max_lsn = InvalidXLogRecPtr;
+		batch.n = 0;
 
-		bufHdr = GetBufferDescriptor(buf_id);
+		while (batch.n < limit)
+		{
+			uint32		buf_state;
+			XLogRecPtr	lsn = InvalidXLogRecPtr;
+			LWLock	   *content_lock;
+			CkptSortItem item;
 
-		num_processed++;
+			if (ProcSignalBarrierPending)
+				ProcessProcSignalBarrier();
 
-		/*
-		 * We don't need to acquire the lock here, because we're only looking
-		 * at a single bit. It's possible that someone else writes the buffer
-		 * and clears the flag right after we check, but that doesn't matter
-		 * since SyncOneBuffer will then do nothing.  However, there is a
-		 * further race condition: it's conceivable that between the time we
-		 * examine the bit here and the time SyncOneBuffer acquires the lock,
-		 * someone else not only wrote the buffer but replaced it with another
-		 * page and dirtied it.  In that improbable case, SyncOneBuffer will
-		 * write the buffer though we didn't need to.  It doesn't seem worth
-		 * guarding against this, though.
-		 */
-		if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
-		{
-			if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+			/* Check if we are done with this tablespace */
+			if (ts_stat->index + processed >= ts_end)
+				break;
+
+			item = CkptBufferIds[ts_stat->index + processed];
+
+			buf_id = item.buf_id;
+			Assert(buf_id != -1);
+
+			bufHdr = GetBufferDescriptor(buf_id);
+
+			/*
+			 * If this is the first block of the batch, then check if we need
+			 * to open a new relation. Open the relation now because we have
+			 * to determine the maximum IO size based on how many blocks
+			 * remain in the file.
+			 */
+			if (!BlockNumberIsValid(batch.start))
 			{
-				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-				PendingCheckpointerStats.buffers_written++;
-				num_written++;
+				Assert(batch.max_lsn == InvalidXLogRecPtr && batch.n == 0);
+				batch.rlocator.spcOid = item.tsId;
+				batch.rlocator.dbOid = item.dbId;
+				batch.rlocator.relNumber = item.relNumber;
+				batch.forkno = item.forkNum;
+				batch.start = item.blockNum;
+				batch.reln = smgropen(batch.rlocator, INVALID_PROC_NUMBER);
+				limit = smgrmaxcombine(batch.reln, batch.forkno, batch.start);
+				limit = Min(max_batch_size, limit);
+				limit = Min(GetAdditionalPinLimit(), limit);
+				/* Guarantee progress */
+				limit = Max(limit, 1);
 			}
+
+			/*
+			 * Once we hit blocks from the next relation or fork of the
+			 * relation, break out of the loop and issue the IO we've built up
+			 * so far. It is important that we don't increment processed
+			 * because we want to start the next IO with this item.
+			 */
+			if (item.dbId != batch.rlocator.dbOid)
+				break;
+
+			if (item.relNumber != batch.rlocator.relNumber)
+				break;
+
+			if (item.forkNum != batch.forkno)
+				break;
+
+			Assert(item.tsId == batch.rlocator.spcOid);
+
+			/*
+			 * If the next block is not contiguous, we can't include it in the
+			 * IO we will issue. Break out of the loop and issue what we have
+			 * so far. Do not count this item as processed -- otherwise we
+			 * will end up skipping it.
+			 */
+			if (item.blockNum != batch.start + batch.n)
+				break;
+
+			/*
+			 * We don't need to acquire the lock here, because we're only
+			 * looking at a few bits. It's possible that someone else writes
+			 * the buffer and clears the flag right after we check, but that
+			 * doesn't matter since StartBufferIO will then return false.
+			 *
+			 * If the buffer doesn't need checkpointing, don't include it in
+			 * the batch we are building. And if the buffer doesn't need
+			 * flushing, we're done with the item, so count it as processed
+			 * and break out of the loop to issue the IO so far.
+			 */
+			buf_state = pg_atomic_read_u32(&bufHdr->state);
+			if ((buf_state & (BM_CHECKPOINT_NEEDED | BM_VALID | BM_DIRTY)) !=
+				(BM_CHECKPOINT_NEEDED | BM_VALID | BM_DIRTY))
+			{
+				processed++;
+				break;
+			}
+
+			ReservePrivateRefCountEntry();
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+			PinBuffer(bufHdr, NULL, false);
+
+			/*
+			 * There is a race condition here: it's conceivable that between
+			 * the time we examine the buffer header for BM_CHECKPOINT_NEEDED
+			 * above and when we are now acquiring the lock that, someone else
+			 * not only wrote the buffer but replaced it with another page and
+			 * dirtied it.  In that improbable case, we will write the buffer
+			 * though we didn't need to.  It doesn't seem worth guarding
+			 * against this, though.
+			 */
+			content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+			/*
+			 * We are willing to wait for the content lock on the first IO in
+			 * the batch. However, for subsequent IOs, waiting could lead to
+			 * deadlock. We have to eventually flush all eligible buffers,
+			 * though. So, if we fail to acquire the lock on a subsequent
+			 * buffer, we break out and issue the IO we've built up so far.
+			 * Then we come back and start a new IO with that buffer as the
+			 * starting buffer. As such, we must not count the item as
+			 * processed if we end up failing to acquire the content lock.
+			 */
+			if (batch.n == 0)
+				LWLockAcquire(content_lock, LW_SHARED);
+			else if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				UnpinBuffer(bufHdr);
+				break;
+			}
+
+			/*
+			 * If the buffer doesn't need IO, count the item as processed,
+			 * release the buffer, and break out of the loop to issue the IO
+			 * we have built up so far.
+			 */
+			if (!StartBufferIO(bufHdr, false, true))
+			{
+				processed++;
+				LWLockRelease(content_lock);
+				UnpinBuffer(bufHdr);
+				break;
+			}
+
+			/*
+			 * Lock buffer header lock before examining LSN because we only
+			 * have a shared lock on the buffer.
+			 */
+			buf_state = LockBufHdr(bufHdr);
+			lsn = BufferGetLSN(bufHdr);
+			UnlockBufHdrExt(bufHdr, buf_state, 0, BM_JUST_DIRTIED, 0);
+
+			/*
+			 * Keep track of the max LSN so that we can be sure to flush
+			 * enough WAL before flushing data from the buffers. See comment
+			 * in DoFlushBuffer() for more on why we don't consider the LSNs
+			 * of unlogged relations.
+			 */
+			if (buf_state & BM_PERMANENT && lsn > batch.max_lsn)
+				batch.max_lsn = lsn;
+
+			batch.bufdescs[batch.n++] = bufHdr;
+			processed++;
 		}
 
 		/*
 		 * Measure progress independent of actually having to flush the buffer
-		 * - otherwise writing become unbalanced.
+		 * - otherwise writing becomes unbalanced.
 		 */
-		ts_stat->progress += ts_stat->progress_slice;
-		ts_stat->num_scanned++;
-		ts_stat->index++;
+		num_processed += processed;
+		ts_stat->progress += ts_stat->progress_slice * processed;
+		ts_stat->num_scanned += processed;
+		ts_stat->index += processed;
+
+		/*
+		 * If we built up an IO, issue it. There's a chance we didn't find any
+		 * items referencing buffers that needed flushing this time, but we
+		 * still want to check if we should update the heap if we examined and
+		 * processed the items.
+		 */
+		if (batch.n > 0)
+		{
+			FlushBufferBatch(&batch, IOCONTEXT_NORMAL);
+			CompleteWriteBatchIO(&batch, IOCONTEXT_NORMAL, &wb_context);
+
+			TRACE_POSTGRESQL_BUFFER_BATCH_SYNC_WRITTEN(batch.n);
+			PendingCheckpointerStats.buffers_written += batch.n;
+			num_written += batch.n;
+		}
 
 		/* Have all the buffers from the tablespace been processed? */
 		if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -6408,6 +6562,22 @@ IsBufferCleanupOK(Buffer buffer)
 	return false;
 }
 
+/*
+ * The maximum number of blocks that can be written out in a single batch by
+ * the checkpointer.
+ */
+static uint32
+CheckpointerMaxBatchSize(void)
+{
+	uint32		result;
+	uint32		pin_limit = GetPinLimit();
+
+	result = Min(pin_limit, io_combine_limit);
+	result = Min(result, MAX_IO_COMBINE_LIMIT);
+	result = Max(result, 1);
+	return result;
+}
+
 
 /*
  *	Functions for buffer I/O handling
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 36dd4f8375b..d6970731ba9 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -68,7 +68,7 @@ provider postgresql {
 	probe buffer__checkpoint__sync__start();
 	probe buffer__checkpoint__done();
 	probe buffer__sync__start(int, int);
-	probe buffer__sync__written(int);
+	probe buffer__batch__sync__written(BlockNumber);
 	probe buffer__sync__done(int, int, int);
 
 	probe deadlock__found();
-- 
2.43.0
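
The run-building rule in the new BufferSync() loop can be illustrated in isolation. The sketch below is not the patch's code: SketchItem and sketch_batch_length are hypothetical names, and it deliberately leaves out pinning, content locks, StartBufferIO(), buffer state checks, and LSN tracking. It only shows the stopping conditions taken from the patch: a change of database, relation, or fork, a non-contiguous block number, or reaching the batch limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified sort item; the tablespace is assumed fixed. */
typedef struct SketchItem
{
	uint32_t	dbId;
	uint32_t	relNumber;
	int			forkNum;
	uint32_t	blockNum;
} SketchItem;

/*
 * Return how many items, starting at items[first], form one run of
 * contiguous blocks in the same database, relation, and fork, capped
 * at limit. The result is always at least 1.
 */
static size_t
sketch_batch_length(const SketchItem *items, size_t nitems, size_t first,
					size_t limit)
{
	size_t		n = 0;

	assert(first < nitems);

	/* Guarantee progress even if the caller passes a zero limit. */
	if (limit < 1)
		limit = 1;

	while (n < limit && first + n < nitems)
	{
		const SketchItem *start = &items[first];
		const SketchItem *item = &items[first + n];

		/* A different database, relation, or fork starts a new batch. */
		if (item->dbId != start->dbId ||
			item->relNumber != start->relNumber ||
			item->forkNum != start->forkNum)
			break;

		/* A gap in block numbers also ends the batch. */
		if (item->blockNum != start->blockNum + (uint32_t) n)
			break;

		n++;
	}

	return n;
}
```

A caller would advance its index by the returned length and start the next batch at the item that ended the run, mirroring how the patch avoids counting that item as processed.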

v9-0007-Refactor-SyncOneBuffer-for-bgwriter-use-only.patch (text/x-patch; charset=US-ASCII)

From c876c4c0bd2107121aae003ffaf1ed01e8be4097 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 16:16:58 -0400
Subject: [PATCH v9 7/7] Refactor SyncOneBuffer for bgwriter use only

Since xxx, only bgwriter uses SyncOneBuffer, so we can remove the
skip_recently_used parameter and make that behavior the default.

While we are at it, 5e89985928795f243 introduced the pattern of using a
CAS loop instead of locking the buffer header and then calling
PinBuffer_Locked(). Do that in SyncOneBuffer() so we can avoid taking
the buffer header spinlock in the common case that the buffer is
recently used.
---
 src/backend/storage/buffer/bufmgr.c | 96 +++++++++++++++++------------
 1 file changed, 56 insertions(+), 40 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bfc07edb40e..fc4735038bd 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -515,8 +515,7 @@ static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
 static uint32 CheckpointerMaxBatchSize(void);
 static void BufferSync(int flags);
-static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
-						  WritebackContext *wb_context);
+static int	SyncOneBuffer(int buf_id, WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
 static void AbortBufferIO(Buffer buffer);
 static void shared_buffer_write_error_callback(void *arg);
@@ -4000,8 +3999,7 @@ BgBufferSync(WritebackContext *wb_context)
 	/* Execute the LRU scan */
 	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
 	{
-		int			sync_state = SyncOneBuffer(next_to_clean, true,
-											   wb_context);
+		int			sync_state = SyncOneBuffer(next_to_clean, wb_context);
 
 		if (++next_to_clean >= NBuffers)
 		{
@@ -4064,8 +4062,8 @@ BgBufferSync(WritebackContext *wb_context)
 /*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
- * If skip_recently_used is true, we don't write currently-pinned buffers, nor
- * buffers marked recently used, as these are not replacement candidates.
+ * We don't write currently-pinned buffers, nor buffers marked recently used,
+ * as these are not replacement candidates.
  *
  * Returns a bitmask containing the following flag bits:
  *	BUF_WRITTEN: we wrote the buffer.
@@ -4076,53 +4074,71 @@ BgBufferSync(WritebackContext *wb_context)
  * after locking it, but we don't care all that much.)
  */
 static int
-SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
+SyncOneBuffer(int buf_id, WritebackContext *wb_context)
 {
 	BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
 	int			result = 0;
+	uint32		old_buf_state;
 	uint32		buf_state;
 	BufferTag	tag;
 
-	/* Make sure we can handle the pin */
-	ReservePrivateRefCountEntry();
-	ResourceOwnerEnlarge(CurrentResourceOwner);
-
 	/*
-	 * Check whether buffer needs writing.
-	 *
-	 * We can make this check without taking the buffer content lock so long
-	 * as we mark pages dirty in access methods *before* logging changes with
-	 * XLogInsert(): if someone marks the buffer dirty just after our check we
-	 * don't worry because our checkpoint.redo points before log record for
-	 * upcoming changes and so we are not required to write such dirty buffer.
+	 * Check whether the buffer can be used and pin it if so. Do this using a
+	 * CAS loop, to avoid having to lock the buffer header.
 	 */
-	buf_state = LockBufHdr(bufHdr);
-
-	if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
-		BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
+	old_buf_state = pg_atomic_read_u32(&bufHdr->state);
+	for (;;)
 	{
+		buf_state = old_buf_state;
+
+		/*
+		 * We can make these check without taking the buffer content lock so
+		 * long as we mark pages dirty in access methods *before* logging
+		 * changes with XLogInsert(): if someone marks the buffer dirty just
+		 * after our check we don't worry because our checkpoint.redo points
+		 * before log record for upcoming changes and so we are not required
+		 * to write such dirty buffer.
+		 */
+		if (BUF_STATE_GET_REFCOUNT(buf_state) != 0 ||
+			BUF_STATE_GET_USAGECOUNT(buf_state) != 0)
+		{
+			/* Don't write recently-used buffers */
+			return result;
+		}
+
 		result |= BUF_REUSABLE;
-	}
-	else if (skip_recently_used)
-	{
-		/* Caller told us not to write recently-used buffers */
-		UnlockBufHdr(bufHdr);
-		return result;
-	}
 
-	if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
-	{
-		/* It's clean, so nothing to do */
-		UnlockBufHdr(bufHdr);
-		return result;
+		if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
+		{
+			/* It's clean, so nothing to do */
+			return result;
+		}
+
+		if (unlikely(buf_state & BM_LOCKED))
+		{
+			old_buf_state = WaitBufHdrUnlocked(bufHdr);
+			continue;
+		}
+
+		/* Make sure we can handle the pin */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+
+		/* pin the buffer if the CAS succeeds */
+		buf_state += BUF_REFCOUNT_ONE;
+
+		if (pg_atomic_compare_exchange_u32(&bufHdr->state, &old_buf_state,
+										   buf_state))
+		{
+			TrackNewBufferPin(BufferDescriptorGetBuffer(bufHdr));
+			break;
+		}
 	}
 
 	/*
-	 * Pin it, share-lock it, write it.  (FlushBuffer will do nothing if the
-	 * buffer is clean by the time we've locked it.)
+	 * Share lock and write it out (FlushBuffer will do nothing if the buffer
+	 * is clean by the time we've locked it.)
 	 */
-	PinBuffer_Locked(bufHdr);
-
 	FlushUnlockedBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
 
 	tag = bufHdr->tag;
@@ -4130,8 +4146,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 	UnpinBuffer(bufHdr);
 
 	/*
-	 * SyncOneBuffer() is only called by checkpointer and bgwriter, so
-	 * IOContext will always be IOCONTEXT_NORMAL.
+	 * SyncOneBuffer() is only called by bgwriter, so IOContext will always be
+	 * IOCONTEXT_NORMAL.
 	 */
 	ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag);
 
-- 
2.43.0

#19

Chao Li

li.evan.chao@gmail.com

about 2 months ago

In reply to: Melanie Plageman (#18)

Re: Checkpointer write combining

On Nov 19, 2025, at 02:49, Melanie Plageman <melanieplageman@gmail.com> wrote:

I no longer remember why I made that patch WIP, so I've removed that
designation.

I just reviewed 0007. It removes the second parameter, "bool skip_recently_used", from SyncOneBuffer. The function is static and is only called in one place, with skip_recently_used=true, so removing the parameter seems reasonable, and without having to consider pinned buffers separately, the function is a little simpler.

I have only one tiny comment:
```
+ * We can make these check without taking the buffer content lock so
```

As you changed “this” to “these”, “check” should be changed to “checks” accordingly.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#20

Chao Li

li.evan.chao@gmail.com

about 2 months ago

In reply to: Chao Li (#19)

Re: Checkpointer write combining

On Nov 19, 2025, at 10:00, Chao Li <li.evan.chao@gmail.com> wrote:

On Nov 19, 2025, at 02:49, Melanie Plageman <melanieplageman@gmail.com> wrote:

I no longer remember why I made that patch WIP, so I've removed that
designation.

I just reviewed 0007. It removes the second parameter, "bool skip_recently_used", from SyncOneBuffer. The function is static and is only called in one place, with skip_recently_used=true, so removing the parameter seems reasonable, and without having to consider pinned buffers separately, the function is a little simpler.

I have only one tiny comment:
```
+ * We can make these check without taking the buffer content lock so
```

As you changed “this” to “these”, “check” should be changed to “checks” accordingly.

I just got a compile error:
```
bufmgr.c:3580:33: error: no member named 'dbId' in 'struct CkptSortItem'
3580 | batch.rlocator.dbOid = item.dbId;
| ~~~~ ^
bufmgr.c:3598:13: error: no member named 'dbId' in 'struct CkptSortItem'
3598 | if (item.dbId != batch.rlocator.dbOid)
| ~~~~ ^
2 errors generated.
make[4]: *** [bufmgr.o] Error 1
```

I tried “make clean” and “make” again, which didn’t work. I think the error is introduced by 0006.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#21

BharatDB

bharatdbpg@gmail.com

about 2 months ago

In reply to: Melanie Plageman (#17)

Re: Checkpointer write combining

Hi Melanie,

Thank you for the detailed clarifications, which helped me understand
the constraints much more clearly. You are absolutely right on the
key points you raised. My initial concern came from trying to
avoid cases where strategy pin limits are unexpectedly small (0 or
negative) or where global pin limits temporarily shrink due to memory
issues or fast cycling of pins.

On Wed, Nov 19, 2025 at 12:00 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Wed, Nov 12, 2025 at 6:40 AM BharatDB <bharatdbpg@gmail.com> wrote:

Considering the CI failures in earlier patch versions around "max batch size", I observed that the failures arise either from boundary conditions when io_combine_limit (GUC) is set larger than the compile-time MAX_IO_COMBINE_LIMIT

How could io_combine_limit be higher than MAX_IO_COMBINE_LIMIT? The
GUC is capped to MAX_IO_COMBINE_LIMIT -- see guc_parameters.dat.

After revisiting guc_parameters.dat, I now see that the GUC is
strictly capped at MAX_IO_COMBINE_LIMIT, so comparisons against larger
values are unnecessary. My earlier concern assumed unbounded
user-provided values, which do not apply here.

or when pin limits return small/zero values, which produce out-of-range batch sizes or assertion failures in CI.

This is true. But I think it can be solved with a single comparison to
1 as in the recent version.

Comparing the approaches suggested in the thread, I think applying the GUC and compile-time cap first, and then the pin limits, could be more effective in avoiding CI failures. We should also consider the following conditions:

Handle io_combine_limit == 0 explicitly (fall back to 1 for forward progress).

The io_combine_limit GUC has a minimum value of 1 (in guc_parameters.dat)

Noted.

Use explicit typed comparisons to avoid signed/unsigned pitfalls and add a final Assert() to capture assumptions in CI.

I think my new version works.

uint32 max_write_batch_size = Min(io_combine_limit,
                                  MAX_IO_COMBINE_LIMIT);
int strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
uint32 max_possible_buffer_limit = GetPinLimit();

/* Identify the minimum of the above */
max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);

/* Must allow at least 1 IO for forward progress */
max_write_batch_size = Max(1, max_write_batch_size);

return max_write_batch_size;

GetAccessStrategyPinLimit() is the only function returning a signed
value here -- and it should always return a positive value (while I
wish we would use unsigned integers when a value will never be
negative, the strategy->nbuffers struct member was added a long time
ago). Then the Min() macro should work correctly and produce a value
that fits in a uint32 because of integer promotion rules.

The explanation about GetAccessStrategyPinLimit() (despite it returning
int) makes sense, and I agree that with the Min() macro and integer
promotion rules, the outcome is always safe. Therefore, explicit typed
comparisons are indeed redundant. After reviewing the existing code
paths and your updated version, in particular
max_write_batch_size = Max(1, max_write_batch_size);
it is clear that both GetAccessStrategyPinLimit() and GetPinLimit()
should always return sensible positive values and that the fallback
fully covers the forward-progress requirement. I agree that it is both
correct and sufficient for the CI failures we were seeing earlier.
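
To make the clamping order discussed above concrete, here is a standalone sketch. It is an illustration, not the patch's function: sketch_max_write_batch_size, SK_MIN, SK_MAX, and SK_MAX_IO_COMBINE_LIMIT are hypothetical names, the cap of 128 is a placeholder, and the limits are passed in as parameters instead of being read from the GUC, GetAccessStrategyPinLimit(), and GetPinLimit(). The explicit cast of the signed limit is my own choice for the sketch, and the assert encodes the thread's assumption that the value is never negative.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder cap for illustration; the real value depends on the build. */
#define SK_MAX_IO_COMBINE_LIMIT 128

#define SK_MIN(a, b) ((a) < (b) ? (a) : (b))
#define SK_MAX(a, b) ((a) > (b) ? (a) : (b))

static uint32_t
sketch_max_write_batch_size(uint32_t io_combine_limit,
							int strategy_pin_limit,
							uint32_t global_pin_limit)
{
	uint32_t	result;

	/* Assumption from the thread: the signed limit is never negative. */
	assert(strategy_pin_limit >= 0);

	/* GUC and compile-time cap first, then the two pin limits. */
	result = SK_MIN(io_combine_limit, (uint32_t) SK_MAX_IO_COMBINE_LIMIT);
	result = SK_MIN((uint32_t) strategy_pin_limit, result);
	result = SK_MIN(global_pin_limit, result);

	/* Must allow at least one block so the caller makes progress. */
	return SK_MAX(result, 1u);
}
```

Without the cast, comparing a negative int against a uint32_t would convert the negative value to a large unsigned number, so the Min() step would silently ignore it. That is the signed/unsigned pitfall mentioned earlier, and it does not arise as long as the limit is non-negative.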

Thank you for helping me understand the reasoning behind this design.
I will keep it in mind for further work on implementing write
combining.
I appreciate your patience and the opportunity to refine my
assumptions. Looking forward to more suggestions and feedback.

Thank you.

Regards,
Soumya

#22

Melanie Plageman

melanieplageman@gmail.com

about 2 months ago

In reply to: Chao Li (#20)

7 attachment(s)

Re: Checkpointer write combining

Thanks for the review!

On Tue, Nov 18, 2025 at 9:10 PM Chao Li <li.evan.chao@gmail.com> wrote:

On Nov 19, 2025, at 10:00, Chao Li <li.evan.chao@gmail.com> wrote:

I just reviewed 0007. It removes the second parameter, "bool skip_recently_used", from SyncOneBuffer. The function is static and is only called in one place, with skip_recently_used=true, so removing the parameter seems reasonable, and without having to consider pinned buffers separately, the function is a little simpler.

I have only one tiny comment:
```
+ * We can make these check without taking the buffer content lock so
```

As you changed “this” to “these”, “check” should be changed to “checks” accordingly.

I've made this change. Attached v10 has this.

I just got a compile error:
```
bufmgr.c:3580:33: error: no member named 'dbId' in 'struct CkptSortItem'
3580 | batch.rlocator.dbOid = item.dbId;
| ~~~~ ^
bufmgr.c:3598:13: error: no member named 'dbId' in 'struct CkptSortItem'
3598 | if (item.dbId != batch.rlocator.dbOid)
| ~~~~ ^
2 errors generated.
make[4]: *** [bufmgr.o] Error 1
```

I tried “make clean” and “make” again, which didn’t work. I think the error is introduced by 0006.

Are you sure you applied 0005? It is the one that adds dbId to CkptSortItem.

- Melanie

Attachments:

v10-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch (text/x-patch; charset=US-ASCII)

From 98b5bf6872b5c8b1e2d4f42c7d8a7e01a1c7858f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 10:53:48 -0400
Subject: [PATCH v10 1/7] Refactor goto into for loop in GetVictimBuffer()

GetVictimBuffer() implemented a loop to optimistically lock a clean
victim buffer using a goto. Future commits will add batch flushing
functionality to GetVictimBuffer. The new logic works better with
standard for loop flow control.

This commit is only a refactor and does not introduce any new
functionality.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 189 ++++++++++++--------------
 src/backend/storage/buffer/freelist.c |  19 ++-
 src/include/storage/buf_internals.h   |   5 +
 3 files changed, 110 insertions(+), 103 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 327ddb7adc8..90c24b8d93d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -68,10 +68,6 @@
 #include "utils/timestamp.h"
 
 
-/* Note: these two macros only work on shared buffers, not local ones! */
-#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
-#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
-
 /* Note: this macro only works on local buffers, not shared ones! */
 #define LocalBufHdrGetBlock(bufHdr) \
 	LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]
@@ -2331,125 +2327,116 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
-	/* we return here if a prospective victim buffer gets used concurrently */
-again:
-
-	/*
-	 * Select a victim buffer.  The buffer is returned pinned and owned by
-	 * this backend.
-	 */
-	buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
-	buf = BufferDescriptorGetBuffer(buf_hdr);
-
-	/*
-	 * We shouldn't have any other pins for this buffer.
-	 */
-	CheckBufferIsPinnedOnce(buf);
-
-	/*
-	 * If the buffer was dirty, try to write it out.  There is a race
-	 * condition here, in that someone might dirty it after we released the
-	 * buffer header lock above, or even while we are writing it out (since
-	 * our share-lock won't prevent hint-bit updates).  We will recheck the
-	 * dirty bit after re-locking the buffer header.
-	 */
-	if (buf_state & BM_DIRTY)
+	/* Select a victim buffer using an optimistic locking scheme. */
+	for (;;)
 	{
-		LWLock	   *content_lock;
 
-		Assert(buf_state & BM_TAG_VALID);
-		Assert(buf_state & BM_VALID);
+		/* Attempt to claim a victim buffer. Buffer is returned pinned. */
+		buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+		buf = BufferDescriptorGetBuffer(buf_hdr);
 
 		/*
-		 * We need a share-lock on the buffer contents to write it out (else
-		 * we might write invalid data, eg because someone else is compacting
-		 * the page contents while we write).  We must use a conditional lock
-		 * acquisition here to avoid deadlock.  Even though the buffer was not
-		 * pinned (and therefore surely not locked) when StrategyGetBuffer
-		 * returned it, someone else could have pinned and exclusive-locked it
-		 * by the time we get here. If we try to get the lock unconditionally,
-		 * we'd block waiting for them; if they later block waiting for us,
-		 * deadlock ensues. (This has been observed to happen when two
-		 * backends are both trying to split btree index pages, and the second
-		 * one just happens to be trying to split the page the first one got
-		 * from StrategyGetBuffer.)
+		 * We shouldn't have any other pins for this buffer.
 		 */
-		content_lock = BufferDescriptorGetContentLock(buf_hdr);
-		if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
-		{
-			/*
-			 * Someone else has locked the buffer, so give it up and loop back
-			 * to get another one.
-			 */
-			UnpinBuffer(buf_hdr);
-			goto again;
-		}
+		CheckBufferIsPinnedOnce(buf);
 
 		/*
-		 * If using a nondefault strategy, and writing the buffer would
-		 * require a WAL flush, let the strategy decide whether to go ahead
-		 * and write/reuse the buffer or to choose another victim.  We need a
-		 * lock to inspect the page LSN, so this can't be done inside
-		 * StrategyGetBuffer.
+		 * If the buffer was dirty, try to write it out.  There is a race
+		 * condition here, in that someone might dirty it after we released
+		 * the buffer header lock above, or even while we are writing it out
+		 * (since our share-lock won't prevent hint-bit updates).  We will
+		 * recheck the dirty bit after re-locking the buffer header.
 		 */
-		if (strategy != NULL)
+		if (buf_state & BM_DIRTY)
 		{
-			XLogRecPtr	lsn;
+			LWLock	   *content_lock;
 
-			/* Read the LSN while holding buffer header lock */
-			buf_state = LockBufHdr(buf_hdr);
-			lsn = BufferGetLSN(buf_hdr);
-			UnlockBufHdr(buf_hdr);
+			Assert(buf_state & BM_TAG_VALID);
+			Assert(buf_state & BM_VALID);
+
+			/*
+			 * We need a share-lock on the buffer contents to write it out
+			 * (else we might write invalid data, eg because someone else is
+			 * compacting the page contents while we write).  We must use a
+			 * conditional lock acquisition here to avoid deadlock.  Even
+			 * though the buffer was not pinned (and therefore surely not
+			 * locked) when StrategyGetBuffer returned it, someone else could
+			 * have pinned and exclusive-locked it by the time we get here. If
+			 * we try to get the lock unconditionally, we'd block waiting for
+			 * them; if they later block waiting for us, deadlock ensues.
+			 * (This has been observed to happen when two backends are both
+			 * trying to split btree index pages, and the second one just
+			 * happens to be trying to split the page the first one got from
+			 * StrategyGetBuffer.)
+			 */
+			content_lock = BufferDescriptorGetContentLock(buf_hdr);
+			if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				/*
+				 * Someone else has locked the buffer, so give it up and loop
+				 * back to get another one.
+				 */
+				UnpinBuffer(buf_hdr);
+				continue;
+			}
 
-			if (XLogNeedsFlush(lsn)
-				&& StrategyRejectBuffer(strategy, buf_hdr, from_ring))
+			/*
+			 * If using a nondefault strategy, and writing the buffer would
+			 * require a WAL flush, let the strategy decide whether to go
+			 * ahead and write/reuse the buffer or to choose another victim.
+			 * We need the content lock to inspect the page LSN, so this can't
+			 * be done inside StrategyGetBuffer.
+			 */
+			if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
 			{
 				LWLockRelease(content_lock);
 				UnpinBuffer(buf_hdr);
-				goto again;
+				continue;
 			}
-		}
 
-		/* OK, do the I/O */
-		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-		LWLockRelease(content_lock);
+			/* OK, do the I/O */
+			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
+			LWLockRelease(content_lock);
 
-		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-									  &buf_hdr->tag);
-	}
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &buf_hdr->tag);
+		}
 
 
-	if (buf_state & BM_VALID)
-	{
+		if (buf_state & BM_VALID)
+		{
+			/*
+			 * When a BufferAccessStrategy is in use, blocks evicted from
+			 * shared buffers are counted as IOOP_EVICT in the corresponding
+			 * context (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted
+			 * by a strategy in two cases: 1) while initially claiming buffers
+			 * for the strategy ring 2) to replace an existing strategy ring
+			 * buffer because it is pinned or in use and cannot be reused.
+			 *
+			 * Blocks evicted from buffers already in the strategy ring are
+			 * counted as IOOP_REUSE in the corresponding strategy context.
+			 *
+			 * At this point, we can accurately count evictions and reuses,
+			 * because we have successfully claimed the valid buffer.
+			 * Previously, we may have been forced to release the buffer due
+			 * to concurrent pinners or erroring out.
+			 */
+			pgstat_count_io_op(IOOBJECT_RELATION, io_context,
+							   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
+		}
+
 		/*
-		 * When a BufferAccessStrategy is in use, blocks evicted from shared
-		 * buffers are counted as IOOP_EVICT in the corresponding context
-		 * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
-		 * strategy in two cases: 1) while initially claiming buffers for the
-		 * strategy ring 2) to replace an existing strategy ring buffer
-		 * because it is pinned or in use and cannot be reused.
-		 *
-		 * Blocks evicted from buffers already in the strategy ring are
-		 * counted as IOOP_REUSE in the corresponding strategy context.
-		 *
-		 * At this point, we can accurately count evictions and reuses,
-		 * because we have successfully claimed the valid buffer. Previously,
-		 * we may have been forced to release the buffer due to concurrent
-		 * pinners or erroring out.
+		 * If the buffer has an entry in the buffer mapping table, delete it.
+		 * This can fail because another backend could have pinned or dirtied
+		 * the buffer.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, io_context,
-						   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
-	}
+		if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
+		{
+			UnpinBuffer(buf_hdr);
+			continue;
+		}
 
-	/*
-	 * If the buffer has an entry in the buffer mapping table, delete it. This
-	 * can fail because another backend could have pinned or dirtied the
-	 * buffer.
-	 */
-	if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
-	{
-		UnpinBuffer(buf_hdr);
-		goto again;
+		break;
 	}
 
 	/* a final set of sanity checks */
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 28d952b3534..acbabeb3c3b 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "pgstat.h"
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
@@ -780,12 +781,20 @@ IOContextForStrategy(BufferAccessStrategy strategy)
  * be written out and doing so would require flushing WAL too.  This gives us
  * a chance to choose a different victim.
  *
+ * The buffer must be pinned and content locked and the buffer header spinlock
+ * must not be held. We must hold the content lock to examine the LSN.
+ *
  * Returns true if buffer manager should ask for a new victim, and false
  * if this buffer should be written and re-used.
  */
 bool
 StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
+	XLogRecPtr	lsn;
+
+	if (!strategy)
+		return false;
+
 	/* We only do this in bulkread mode */
 	if (strategy->btype != BAS_BULKREAD)
 		return false;
@@ -795,11 +804,17 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
 		return false;
 
+	LockBufHdr(buf);
+	lsn = BufferGetLSN(buf);
+	UnlockBufHdr(buf);
+
+	if (!XLogNeedsFlush(lsn))
+		return false;
+
 	/*
-	 * Remove the dirty buffer from the ring; necessary to prevent infinite
+	 * Remove the dirty buffer from the ring; necessary to prevent an infinite
 	 * loop if all ring members are dirty.
 	 */
 	strategy->buffers[strategy->current] = InvalidBuffer;
-
 	return true;
 }
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 5400c56a965..04fdea64f83 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -486,6 +486,11 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 /*
  * Internal buffer management routines
  */
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
+
 /* bufmgr.c */
 extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
-- 
2.43.0
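
The control-flow change in the patch above (replacing the `again:` label and `goto again` with a `for (;;)` loop that uses `continue` to retry and `break` on success) can be illustrated with a small stand-alone sketch. Everything here is hypothetical and simplified: `ToyBuffer`, `toy_get_victim()`, and the boolean flags are stand-ins for the real pin, content-lock, and invalidation steps, and the retry bound exists only so the toy terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy buffer; not PostgreSQL's BufferDesc. */
typedef struct ToyBuffer
{
	bool		dirty;				/* must be written before reuse */
	bool		locked_by_other;	/* a conditional lock attempt would fail */
	bool		repinned;			/* invalidation would fail */
} ToyBuffer;

/*
 * Claim a victim from pool[0..n), advancing *cursor like a clock hand.
 * Returns the claimed index, or -1 if n candidates in a row were rejected
 * (the bound is a toy addition; the patch's loop simply keeps retrying).
 */
int
toy_get_victim(ToyBuffer *pool, int n, int *cursor)
{
	int			victim = -1;
	int			tries = 0;

	for (;;)
	{
		ToyBuffer  *buf;
		int			idx;

		if (tries++ >= n)
			return -1;

		idx = *cursor;
		*cursor = (*cursor + 1) % n;
		buf = &pool[idx];

		if (buf->dirty)
		{
			/* Could not get the lock: give up on this one and retry. */
			if (buf->locked_by_other)
				continue;

			buf->dirty = false; /* stands in for writing the buffer out */
		}

		/* Someone else grabbed it in the meantime: retry. */
		if (buf->repinned)
			continue;

		victim = idx;
		break;
	}

	return victim;
}
```

Each rejected candidate maps onto one `continue`, so the success path reads straight down to the single `break`.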

Attachment: v10-0002-Split-FlushBuffer-into-two-parts.patch (text/x-patch; charset=US-ASCII)

From 7bf89e41ee440bf7515d49058abdc3c6bc639828 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 10:54:19 -0400
Subject: [PATCH v10 2/7] Split FlushBuffer() into two parts

Before adding write combining to write a batch of blocks when flushing
dirty buffers, refactor FlushBuffer() into the preparatory step and
actual buffer flushing step. This separation provides symmetry with
future code for batch flushing which necessarily separates these steps,
as it must prepare multiple buffers before flushing them together.

These steps are moved into new helper functions, PrepareFlushBuffer() and
DoFlushBuffer(). A new GetVictimBuffer() helper, CleanVictimBuffer(), calls
both and will contain the batch flushing and single flush code in future
commits.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 142 +++++++++++++++++++---------
 1 file changed, 98 insertions(+), 44 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 90c24b8d93d..235e261738b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -533,6 +533,12 @@ static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn);
+static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln,
+						  IOObject io_object, IOContext io_context,
+						  XLogRecPtr buffer_lsn);
+static void CleanVictimBuffer(BufferDesc *bufdesc, bool from_ring,
+							  IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2394,12 +2400,8 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 				continue;
 			}
 
-			/* OK, do the I/O */
-			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-			LWLockRelease(content_lock);
-
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &buf_hdr->tag);
+			/* Content lock is released inside CleanVictimBuffer */
+			CleanVictimBuffer(buf_hdr, from_ring, io_context);
 		}
 
 
@@ -4270,54 +4272,64 @@ static void
 FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 			IOContext io_context)
 {
-	XLogRecPtr	recptr;
-	ErrorContextCallback errcallback;
-	instr_time	io_start;
-	Block		bufBlock;
-	char	   *bufToWrite;
-	uint32		buf_state;
+	XLogRecPtr	lsn;
 
-	/*
-	 * Try to start an I/O operation.  If StartBufferIO returns false, then
-	 * someone else flushed the buffer before we could, so we need not do
-	 * anything.
-	 */
-	if (!StartBufferIO(buf, false, false))
-		return;
+	if (PrepareFlushBuffer(buf, &lsn))
+		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
+}
 
-	/* Setup error traceback support for ereport() */
-	errcallback.callback = shared_buffer_write_error_callback;
-	errcallback.arg = buf;
-	errcallback.previous = error_context_stack;
-	error_context_stack = &errcallback;
+/*
+ * Prepare and write out a dirty victim buffer.
+ *
+ * Buffer must be pinned, the content lock must be held in share mode, and the
+ * buffer header spinlock must not be held. The content lock is released and
+ * the buffer is returned pinned but not locked.
+ *
+ * bufdesc may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+				  bool from_ring, IOContext io_context)
+{
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	LWLock	   *content_lock;
 
-	/* Find smgr relation for buffer */
-	if (reln == NULL)
-		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+	/* Set up this victim buffer to be flushed */
+	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
+		return;
 
-	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
-										buf->tag.blockNum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber);
+	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	LWLockRelease(content_lock);
+	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+								  &bufdesc->tag);
+}
 
-	buf_state = LockBufHdr(buf);
+/*
+ * Prepare the buffer with bufdesc for writing. Returns true if the buffer
+ * actually needs writing and false otherwise. lsn returns the buffer's LSN if
+ * the table is logged.
+ */
+static bool
+PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
+	uint32		buf_state;
 
 	/*
-	 * Run PageGetLSN while holding header lock, since we don't have the
-	 * buffer locked exclusively in all cases.
+	 * Try to start an I/O operation.  If StartBufferIO returns false, then
+	 * someone else flushed the buffer before we could, so we need not do
+	 * anything.
 	 */
-	recptr = BufferGetLSN(buf);
+	if (!StartBufferIO(bufdesc, false, false))
+		return false;
 
-	/* To check if block content changes while flushing. - vadim 01/17/97 */
-	UnlockBufHdrExt(buf, buf_state,
-					0, BM_JUST_DIRTIED,
-					0);
+	*lsn = InvalidXLogRecPtr;
+	buf_state = LockBufHdr(bufdesc);
 
 	/*
-	 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
-	 * rule that log updates must hit disk before any of the data-file changes
-	 * they describe do.
+	 * Record the buffer's LSN. We will force XLOG flush up to buffer's LSN.
+	 * This implements the basic WAL rule that log updates must hit disk
+	 * before any of the data-file changes they describe do.
 	 *
 	 * However, this rule does not apply to unlogged relations, which will be
 	 * lost after a crash anyway.  Most unlogged relation pages do not bear
@@ -4330,9 +4342,51 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * happen, attempting to flush WAL through that location would fail, with
 	 * disastrous system-wide consequences.  To make sure that can't happen,
 	 * skip the flush if the buffer isn't permanent.
+	 *
+	 * We must hold the buffer header lock when examining the page LSN since
+	 * we don't have the buffer exclusively locked in all cases.
 	 */
 	if (buf_state & BM_PERMANENT)
-		XLogFlush(recptr);
+		*lsn = BufferGetLSN(bufdesc);
+
+	/* To check if block content changes while flushing. - vadim 01/17/97 */
+	UnlockBufHdrExt(bufdesc, buf_state,
+					0, BM_JUST_DIRTIED,
+					0);
+	return true;
+}
+
+/*
+ * Actually do the write I/O to clean a buffer. buf and reln may be modified.
+ */
+static void
+DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+			  IOContext io_context, XLogRecPtr buffer_lsn)
+{
+	ErrorContextCallback errcallback;
+	instr_time	io_start;
+	Block		bufBlock;
+	char	   *bufToWrite;
+
+	/* Setup error traceback support for ereport() */
+	errcallback.callback = shared_buffer_write_error_callback;
+	errcallback.arg = buf;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* Find smgr relation for buffer */
+	if (reln == NULL)
+		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+
+	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
+										buf->tag.blockNum,
+										reln->smgr_rlocator.locator.spcOid,
+										reln->smgr_rlocator.locator.dbOid,
+										reln->smgr_rlocator.locator.relNumber);
+
+	/* Force XLOG flush up to buffer's LSN */
+	if (XLogRecPtrIsValid(buffer_lsn))
+		XLogFlush(buffer_lsn);
 
 	/*
 	 * Now it's safe to write the buffer to disk. Note that no one else should
-- 
2.43.0
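
The commit message above argues that splitting preparation from the actual write mirrors what batch flushing needs: prepare several buffers, flush WAL once, then write. A minimal stand-alone sketch of that shape follows. All names (`ToyBuffer`, `ToyWal`, `toy_prepare_flush()`, `toy_flush_batch()`, and so on) are hypothetical, and the flags only imitate what StartBufferIO(), XLogFlush(), and the smgr write do in the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLsn;
#define TOY_INVALID_LSN ((ToyLsn) 0)
#define TOY_MAX_BATCH 16

/* Hypothetical toy buffer; not PostgreSQL's BufferDesc. */
typedef struct ToyBuffer
{
	ToyLsn		page_lsn;
	bool		permanent;		/* false imitates an unlogged relation */
	bool		io_in_progress;
	bool		dirty;
} ToyBuffer;

typedef struct ToyWal
{
	ToyLsn		flushed_upto;
	int			flush_calls;
} ToyWal;

/* Step 1: claim the I/O and report the LSN that WAL must reach first. */
bool
toy_prepare_flush(ToyBuffer *buf, ToyLsn *lsn)
{
	*lsn = TOY_INVALID_LSN;
	if (buf->io_in_progress || !buf->dirty)
		return false;			/* someone else is handling it, or it is clean */
	buf->io_in_progress = true;
	if (buf->permanent)
		*lsn = buf->page_lsn;	/* unlogged pages never force a WAL flush */
	return true;
}

void
toy_wal_flush(ToyWal *wal, ToyLsn upto)
{
	if (upto == TOY_INVALID_LSN || upto <= wal->flushed_upto)
		return;
	wal->flushed_upto = upto;
	wal->flush_calls++;
}

/* Step 2: the write itself, valid only after WAL has been flushed. */
void
toy_do_write(ToyBuffer *buf)
{
	buf->dirty = false;
	buf->io_in_progress = false;
}

/* Prepare every buffer, flush WAL once to the highest LSN, then write. */
int
toy_flush_batch(ToyBuffer *bufs, int n, ToyWal *wal)
{
	bool		prepared[TOY_MAX_BATCH] = {false};
	ToyLsn		max_lsn = TOY_INVALID_LSN;
	int			nwritten = 0;

	assert(n >= 0 && n <= TOY_MAX_BATCH);

	for (int i = 0; i < n; i++)
	{
		ToyLsn		lsn;

		prepared[i] = toy_prepare_flush(&bufs[i], &lsn);
		if (prepared[i] && lsn > max_lsn)
			max_lsn = lsn;
	}

	toy_wal_flush(wal, max_lsn);

	for (int i = 0; i < n; i++)
	{
		if (prepared[i])
		{
			toy_do_write(&bufs[i]);
			nwritten++;
		}
	}

	return nwritten;
}
```

With a single fused flush function, each buffer would trigger its own WAL flush check between writes; keeping the steps separate lets the batch pay for one flush to the maximum LSN.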

Attachment: v10-0003-Eagerly-flush-bulkwrite-strategy-ring.patch (text/x-patch; charset=US-ASCII)

From cc26976f233bc9bb02915c8dbd6deb5dc28244ea Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 13:15:43 -0400
Subject: [PATCH v10 3/7] Eagerly flush bulkwrite strategy ring

Operations using BAS_BULKWRITE (COPY FROM and createdb) will inevitably
need to flush buffers in the strategy ring in order to reuse them. By
eagerly flushing the buffers in a larger run, we encourage larger writes
at the kernel level and less interleaving of WAL flushes and data file
writes. The effect is mainly noticeable with multiple parallel COPY
FROMs. In this case, client backends achieve higher write throughput and
end up spending less time waiting on acquiring the lock to flush WAL.
Larger flush operations also mean less time waiting for flush operations
at the kernel level.

The heuristic for eager eviction is to only flush buffers in the
strategy ring which do not require a WAL flush.

This patch is also a step toward AIO writes, as it lines up multiple
buffers that can be issued asynchronously once the infrastructure
exists.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Earlier version Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 237 +++++++++++++++++++++++++-
 src/backend/storage/buffer/freelist.c |  48 ++++++
 src/include/storage/buf_internals.h   |   4 +
 3 files changed, 281 insertions(+), 8 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 235e261738b..57a3eae865e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -531,14 +531,25 @@ static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_c
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
+
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
-static bool PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn);
+static BufferDesc *NextStrategyBufToFlush(BufferAccessStrategy strategy,
+										  Buffer sweep_end,
+										  XLogRecPtr *lsn, int *sweep_cursor);
+
+static bool BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn);
+static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+												   RelFileLocator *rlocator, bool skip_pinned,
+												   XLogRecPtr *max_lsn);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc,
+							   XLogRecPtr *lsn);
 static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						  IOObject io_object, IOContext io_context,
 						  XLogRecPtr buffer_lsn);
-static void CleanVictimBuffer(BufferDesc *bufdesc, bool from_ring,
-							  IOContext io_context);
+static void CleanVictimBuffer(BufferAccessStrategy strategy,
+							  BufferDesc *bufdesc,
+							  bool from_ring, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2401,7 +2412,7 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 			}
 
 			/* Content lock is released inside CleanVictimBuffer */
-			CleanVictimBuffer(buf_hdr, from_ring, io_context);
+			CleanVictimBuffer(strategy, buf_hdr, from_ring, io_context);
 		}
 
 
@@ -4278,6 +4289,61 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * Returns true if the buffer needs WAL flushed before it can be written out.
+ * Caller must not already hold the buffer header spinlock.
+ */
+static bool
+BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
+	uint32		buf_state = LockBufHdr(bufdesc);
+
+	*lsn = BufferGetLSN(bufdesc);
+
+	UnlockBufHdr(bufdesc);
+
+	/*
+	 * See buffer flushing code for more details on why we condition this on
+	 * the relation being logged.
+	 */
+	return buf_state & BM_PERMANENT && XLogNeedsFlush(*lsn);
+}
+
+
+/*
+ * Returns the buffer descriptor of the buffer containing the next block we
+ * should eagerly flush or NULL when there are no further buffers to consider
+ * writing out.
+ */
+static BufferDesc *
+NextStrategyBufToFlush(BufferAccessStrategy strategy,
+					   Buffer sweep_end,
+					   XLogRecPtr *lsn, int *sweep_cursor)
+{
+	Buffer		bufnum;
+	BufferDesc *bufdesc;
+
+	while ((bufnum =
+			StrategyNextBuffer(strategy, sweep_cursor)) != sweep_end)
+	{
+		/*
+		 * For BAS_BULKWRITE, once you hit an InvalidBuffer, the remaining
+		 * buffers in the ring will be invalid.
+		 */
+		if (!BufferIsValid(bufnum))
+			break;
+
+		if ((bufdesc = PrepareOrRejectEagerFlushBuffer(bufnum,
+													   InvalidBlockNumber,
+													   NULL,
+													   true,
+													   lsn)) != NULL)
+			return bufdesc;
+	}
+
+	return NULL;
+}
+
 /*
  * Prepare and write out a dirty victim buffer.
  *
@@ -4288,21 +4354,176 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
  * bufdesc may be modified.
  */
 static void
-CleanVictimBuffer(BufferDesc *bufdesc,
+CleanVictimBuffer(BufferAccessStrategy strategy,
+				  BufferDesc *bufdesc,
 				  bool from_ring, IOContext io_context)
 {
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
+	bool		first_buffer = true;
 
 	/* Set up this victim buffer to be flushed */
 	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
 		return;
 
-	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	if (from_ring && StrategySupportsEagerFlush(strategy))
+	{
+		Buffer		sweep_end = BufferDescriptorGetBuffer(bufdesc);
+		int			cursor = StrategyGetCurrentIndex(strategy);
+
+		/* Clean victim buffer and find more to flush opportunistically */
+		do
+		{
+			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+			content_lock = BufferDescriptorGetContentLock(bufdesc);
+			LWLockRelease(content_lock);
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &bufdesc->tag);
+			/* We leave the first buffer pinned for the caller */
+			if (!first_buffer)
+				UnpinBuffer(bufdesc);
+			first_buffer = false;
+		} while ((bufdesc = NextStrategyBufToFlush(strategy, sweep_end,
+												   &max_lsn, &cursor)) != NULL);
+	}
+	else
+	{
+		DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+		content_lock = BufferDescriptorGetContentLock(bufdesc);
+		LWLockRelease(content_lock);
+		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+									  &bufdesc->tag);
+	}
+}
+
+/*
+ * Prepare bufdesc for eager flushing.
+ *
+ * Given bufnum, return the buffer descriptor of the buffer to eagerly flush,
+ * pinned and locked, or NULL if this buffer does not contain a block that
+ * should be flushed.
+ *
+ * require is the BlockNumber required by the caller. Some callers may require
+ * a specific BlockNumber to be in bufnum because they are assembling a
+ * contiguous run of blocks.
+ *
+ * If the caller needs the block to be from a specific relation, rlocator will
+ * be provided.
+ */
+static BufferDesc *
+PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+								RelFileLocator *rlocator, bool skip_pinned,
+								XLogRecPtr *max_lsn)
+{
+	BufferDesc *bufdesc;
+	uint32		old_buf_state;
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+	BlockNumber blknum;
+	LWLock	   *content_lock;
+
+	if (!BufferIsValid(bufnum))
+		return NULL;
+
+	Assert(!BufferIsLocal(bufnum));
+
+	bufdesc = GetBufferDescriptor(bufnum - 1);
+
+	/* Block may need to be in a specific relation */
+	if (rlocator &&
+		!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufdesc->tag),
+							  *rlocator))
+		return NULL;
+
+	/*
+	 * Ensure that there's a free refcount entry and resource owner slot for
+	 * the pin before pinning the buffer. While this may leak a refcount and
+	 * slot if we return without a buffer, we should use that slot the next
+	 * time we try and reserve a spot.
+	 */
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	ReservePrivateRefCountEntry();
+
+	/*
+	 * Check whether the buffer can be used and pin it if so. Do this using a
+	 * CAS loop, to avoid having to lock the buffer header. We have to lock
+	 * the buffer header later if we succeed in pinning the buffer here, but
+	 * avoiding locking the buffer header if the buffer is in use is worth it.
+	 */
+	old_buf_state = pg_atomic_read_u32(&bufdesc->state);
+
+	for (;;)
+	{
+		buf_state = old_buf_state;
+
+		if (!(buf_state & BM_DIRTY) || !(buf_state & BM_VALID))
+			return NULL;
+
+		/* We don't eagerly flush buffers used by others */
+		if (skip_pinned &&
+			(BUF_STATE_GET_REFCOUNT(buf_state) > 0 ||
+			 BUF_STATE_GET_USAGECOUNT(buf_state) > 1))
+			return NULL;
+
+		if (unlikely(buf_state & BM_LOCKED))
+		{
+			old_buf_state = WaitBufHdrUnlocked(bufdesc);
+			continue;
+		}
+
+		/* pin the buffer if the CAS succeeds */
+		buf_state += BUF_REFCOUNT_ONE;
+
+		if (pg_atomic_compare_exchange_u32(&bufdesc->state, &old_buf_state,
+										   buf_state))
+		{
+			TrackNewBufferPin(BufferDescriptorGetBuffer(bufdesc));
+			break;
+		}
+	}
+
+	CheckBufferIsPinnedOnce(bufnum);
+
+	blknum = BufferGetBlockNumber(bufnum);
+	Assert(BlockNumberIsValid(blknum));
+
+	/* We only include contiguous blocks in the run */
+	if (BlockNumberIsValid(require) && blknum != require)
+		goto except_unpin_buffer;
+
+	/* Don't eagerly flush buffers requiring WAL flush */
+	if (BufferNeedsWALFlush(bufdesc, &lsn))
+		goto except_unpin_buffer;
+
 	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+		goto except_unpin_buffer;
+
+	/*
+	 * Now that we have the content lock, we need to recheck if we need to
+	 * flush WAL.
+	 */
+	if (BufferNeedsWALFlush(bufdesc, &lsn))
+		goto except_unlock_content;
+
+	/* Try to start an I/O operation */
+	if (!StartBufferIO(bufdesc, false, true))
+		goto except_unlock_content;
+
+	if (lsn > *max_lsn)
+		*max_lsn = lsn;
+
+	buf_state = LockBufHdr(bufdesc);
+	UnlockBufHdrExt(bufdesc, buf_state, 0, BM_JUST_DIRTIED, 0);
+
+	return bufdesc;
+
+except_unlock_content:
 	LWLockRelease(content_lock);
-	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-								  &bufdesc->tag);
+
+except_unpin_buffer:
+	UnpinBuffer(bufdesc);
+	return NULL;
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index acbabeb3c3b..4a3009d190c 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -156,6 +156,31 @@ ClockSweepTick(void)
 	return victim;
 }
 
+/*
+ * Some BufferAccessStrategies support eager flushing -- which is flushing
+ * buffers in the ring before they are needed. This can lead to better I/O
+ * patterns than lazily flushing buffers immediately before reusing them.
+ */
+bool
+StrategySupportsEagerFlush(BufferAccessStrategy strategy)
+{
+	Assert(strategy);
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKWRITE:
+			return true;
+		case BAS_VACUUM:
+		case BAS_NORMAL:
+		case BAS_BULKREAD:
+			return false;
+		default:
+			elog(ERROR, "unrecognized buffer access strategy: %d",
+				 (int) strategy->btype);
+			return false;
+	}
+}
+
 /*
  * StrategyGetBuffer
  *
@@ -307,6 +332,29 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	}
 }
 
+/*
+ * Given a position in the ring, cursor, increment the position, and return
+ * the buffer at this position.
+ */
+Buffer
+StrategyNextBuffer(BufferAccessStrategy strategy, int *cursor)
+{
+	if (++(*cursor) >= strategy->nbuffers)
+		*cursor = 0;
+
+	return strategy->buffers[*cursor];
+}
+
+/*
+ * Return the current slot in the strategy ring.
+ */
+int
+StrategyGetCurrentIndex(BufferAccessStrategy strategy)
+{
+	return strategy->current;
+}
+
+
 /*
  * StrategySyncStart -- tell BgBufferSync where to start syncing
  *
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 04fdea64f83..c07e309a288 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -506,6 +506,10 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 
 /* freelist.c */
+extern bool StrategySupportsEagerFlush(BufferAccessStrategy strategy);
+extern Buffer StrategyNextBuffer(BufferAccessStrategy strategy,
+								 int *cursor);
+extern int	StrategyGetCurrentIndex(BufferAccessStrategy strategy);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
-- 
2.43.0
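
The ring traversal used for eager flushing in the patch above (start just past the victim's slot, wrap around, stop on reaching the victim again or on the first invalid slot) can be sketched in isolation. The cursor arithmetic mirrors StrategyNextBuffer() from the patch; `ToyRing`, `toy_next_buffer()`, `toy_sweep()`, and the fixed array size are hypothetical, and the sketch only collects candidate buffer numbers instead of pinning and flushing them.

```c
#include <assert.h>

#define TOY_INVALID_BUFFER 0
#define TOY_RING_SIZE 8

/* Hypothetical toy ring; assumes 1 <= nbuffers <= TOY_RING_SIZE. */
typedef struct ToyRing
{
	int			nbuffers;
	int			current;		/* slot of the victim being cleaned */
	int			buffers[TOY_RING_SIZE];
} ToyRing;

/* Advance the cursor with wraparound and return the buffer in that slot. */
int
toy_next_buffer(const ToyRing *ring, int *cursor)
{
	if (++(*cursor) >= ring->nbuffers)
		*cursor = 0;

	return ring->buffers[*cursor];
}

/*
 * Collect up to out_cap candidate buffers following the victim's slot.
 * Returns how many were collected.
 */
int
toy_sweep(const ToyRing *ring, int *out, int out_cap)
{
	int			sweep_end = ring->buffers[ring->current];
	int			cursor = ring->current;
	int			n = 0;
	int			bufnum;

	while ((bufnum = toy_next_buffer(ring, &cursor)) != sweep_end)
	{
		/* Once one slot is unused, the rest of the ring is too. */
		if (bufnum == TOY_INVALID_BUFFER)
			break;

		if (n == out_cap)
			break;

		out[n++] = bufnum;
	}

	return n;
}
```

Passing the cursor by pointer, as the patch does, lets the caller resume the sweep from where the previous candidate was found instead of rescanning the ring.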

Attachment: v10-0004-Write-combining-for-BAS_BULKWRITE.patch (text/x-patch; charset=US-ASCII)

From eb89dfc1488a0410da06cbc5d387c41623b7bc1b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 13:42:47 -0400
Subject: [PATCH v10 4/7] Write combining for BAS_BULKWRITE

Implement write combining for users of the bulkwrite buffer access
strategy (e.g. COPY FROM). When the buffer access strategy needs to
clean a buffer for reuse, it already opportunistically flushes some
other buffers. Now, combine any contiguous blocks from the same relation
into larger writes and issue them with smgrwritev().

The performance benefit for COPY FROM is mostly noticeable for multiple
concurrent COPY FROMs because a single COPY FROM is either CPU bound or
bound by WAL writes.

The infrastructure for flushing larger batches of IOs will be reused by
checkpointer and other processes doing writes of dirty data.

XXX: Because this sets in-place checksums for batches, it is not
committable until additional infrastructure goes in place.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_bcWRvRwZUop_d9vzF9nHAiT%2B-uPzkJ%3DS3ShZ1GqeAYOw%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 216 ++++++++++++++++++++++++--
 src/backend/storage/buffer/freelist.c |  23 +++
 src/backend/storage/page/bufpage.c    |  20 +++
 src/backend/utils/probes.d            |   2 +
 src/include/storage/buf_internals.h   |  32 ++++
 src/include/storage/bufpage.h         |   2 +
 src/tools/pgindent/typedefs.list      |   1 +
 7 files changed, 284 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 57a3eae865e..8a3a5a04d91 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -537,7 +537,11 @@ static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 static BufferDesc *NextStrategyBufToFlush(BufferAccessStrategy strategy,
 										  Buffer sweep_end,
 										  XLogRecPtr *lsn, int *sweep_cursor);
-
+static void FindFlushAdjacents(BufferAccessStrategy strategy, Buffer sweep_end,
+							   BufferDesc *batch_start,
+							   uint32 max_batch_size,
+							   BufWriteBatch *batch,
+							   int *sweep_cursor);
 static bool BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn);
 static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
 												   RelFileLocator *rlocator, bool skip_pinned,
@@ -4310,10 +4314,90 @@ BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
 }
 
 
+
+/*
+ * Given a starting buffer descriptor from a strategy ring that supports eager
+ * flushing, find additional buffers from the ring that can be combined into a
+ * single write batch with the starting buffer.
+ *
+ * max_batch_size is the maximum number of blocks that can be combined into a
+ * single write in general. This function, based on the block number of start,
+ * will determine the maximum IO size for this particular write given how much
+ * of the file remains. max_batch_size is provided by the caller so it doesn't
+ * have to be recalculated for each write.
+ *
+ * batch is an output parameter that this function will fill with the needed
+ * information to issue this IO.
+ *
+ * This function will pin and content lock all of the buffers that it
+ * assembles for the IO batch. The caller is responsible for issuing the IO.
+ */
+static void
+FindFlushAdjacents(BufferAccessStrategy strategy, Buffer sweep_end,
+				   BufferDesc *batch_start,
+				   uint32 max_batch_size,
+				   BufWriteBatch *batch,
+				   int *sweep_cursor)
+{
+	BlockNumber limit;
+
+	Assert(batch_start);
+	batch->bufdescs[0] = batch_start;
+
+	LockBufHdr(batch_start);
+	batch->max_lsn = BufferGetLSN(batch_start);
+	UnlockBufHdr(batch_start);
+
+	batch->start = batch->bufdescs[0]->tag.blockNum;
+	Assert(BlockNumberIsValid(batch->start));
+	batch->n = 1;
+	batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+	batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
+	batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+	limit = smgrmaxcombine(batch->reln, batch->forkno, batch->start);
+	limit = Min(max_batch_size, limit);
+	limit = Min(GetAdditionalPinLimit(), limit);
+
+	/*
+	 * It's possible we're not allowed any more pins or there aren't more
+	 * blocks in the target relation. In this case, just return. Our batch
+	 * will have only one buffer.
+	 */
+	if (limit <= 0)
+		return;
+
+	/* Now assemble a run of blocks to write out. */
+	for (; batch->n < limit; batch->n++)
+	{
+		Buffer		bufnum;
+
+		if ((bufnum =
+			 StrategyNextBuffer(strategy, sweep_cursor)) == sweep_end)
+			break;
+
+		/*
+		 * For BAS_BULKWRITE, once you hit an InvalidBuffer, the remaining
+		 * buffers in the ring will be invalid.
+		 */
+		if (!BufferIsValid(bufnum))
+			break;
+
+		/* Stop when we encounter a buffer that will break the run */
+		if ((batch->bufdescs[batch->n] =
+			 PrepareOrRejectEagerFlushBuffer(bufnum,
+											 batch->start + batch->n,
+											 &batch->rlocator,
+											 true,
+											 &batch->max_lsn)) == NULL)
+			break;
+	}
+}
+
 /*
  * Returns the buffer descriptor of the buffer containing the next block we
  * should eagerly flush or NULL when there are no further buffers to consider
- * writing out.
+ * writing out. This will be the start of a new batch of buffers to write out.
  */
 static BufferDesc *
 NextStrategyBufToFlush(BufferAccessStrategy strategy,
@@ -4360,7 +4444,6 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 {
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 	LWLock	   *content_lock;
-	bool		first_buffer = true;
 
 	/* Set up this victim buffer to be flushed */
 	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
@@ -4370,19 +4453,22 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 	{
 		Buffer		sweep_end = BufferDescriptorGetBuffer(bufdesc);
 		int			cursor = StrategyGetCurrentIndex(strategy);
+		uint32		max_batch_size = StrategyMaxWriteBatchSize(strategy);
+
+		/* Pin our victim again so it stays ours even after batch released */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+		IncrBufferRefCount(BufferDescriptorGetBuffer(bufdesc));
 
 		/* Clean victim buffer and find more to flush opportunistically */
 		do
 		{
-			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
-			content_lock = BufferDescriptorGetContentLock(bufdesc);
-			LWLockRelease(content_lock);
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &bufdesc->tag);
-			/* We leave the first buffer pinned for the caller */
-			if (!first_buffer)
-				UnpinBuffer(bufdesc);
-			first_buffer = false;
+			BufWriteBatch batch;
+
+			FindFlushAdjacents(strategy, sweep_end, bufdesc, max_batch_size,
+							   &batch, &cursor);
+			FlushBufferBatch(&batch, io_context);
+			CompleteWriteBatchIO(&batch, io_context, &BackendWritebackContext);
 		} while ((bufdesc = NextStrategyBufToFlush(strategy, sweep_end,
 												   &max_lsn, &cursor)) != NULL);
 	}
@@ -4526,6 +4612,70 @@ except_unpin_buffer:
 	return NULL;
 }
 
+/*
+ * Given a prepared batch of buffers, write them out as a vector.
+ */
+void
+FlushBufferBatch(BufWriteBatch *batch,
+				 IOContext io_context)
+{
+	BlockNumber blknums[MAX_IO_COMBINE_LIMIT];
+	Block		blocks[MAX_IO_COMBINE_LIMIT];
+	instr_time	io_start;
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+
+	if (XLogRecPtrIsValid(batch->max_lsn))
+		XLogFlush(batch->max_lsn);
+
+	if (batch->reln == NULL)
+		batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+#ifdef USE_ASSERT_CHECKING
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		XLogRecPtr	lsn;
+
+		Assert(!BufferNeedsWALFlush(batch->bufdescs[i], &lsn));
+	}
+#endif
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_START(batch->forkno,
+											  batch->reln->smgr_rlocator.locator.spcOid,
+											  batch->reln->smgr_rlocator.locator.dbOid,
+											  batch->reln->smgr_rlocator.locator.relNumber,
+											  batch->reln->smgr_rlocator.backend,
+											  batch->n);
+
+	/*
+	 * XXX: All blocks should be copied and then checksummed but doing so
+	 * takes a lot of extra memory and a future patch will eliminate this
+	 * requirement.
+	 */
+	for (BlockNumber i = 0; i < batch->n; i++)
+	{
+		blknums[i] = batch->start + i;
+		blocks[i] = BufHdrGetBlock(batch->bufdescs[i]);
+	}
+
+	PageSetBatchChecksumInplace((Page *) blocks, blknums, batch->n);
+
+	io_start = pgstat_prepare_io_time(track_io_timing);
+
+	smgrwritev(batch->reln, batch->forkno,
+			   batch->start, (const void **) blocks, batch->n, false);
+
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context, IOOP_WRITE,
+							io_start, batch->n, BLCKSZ);
+
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Prepare the buffer with bufdesc for writing. Returns true if the buffer
  * acutally needs writing and false otherwise. lsn returns the buffer's LSN if
@@ -4687,6 +4837,48 @@ FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 	LWLockRelease(BufferDescriptorGetContentLock(buf));
 }
 
+/*
+ * Given a previously initialized batch with buffers that have already been
+ * flushed, terminate the IO on each buffer and then unlock and unpin them.
+ * This assumes all the buffers were locked and pinned. wb_context will be
+ * modified.
+ */
+void
+CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+					 WritebackContext *wb_context)
+{
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+	pgBufferUsage.shared_blks_written += batch->n;
+
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		Buffer		buffer = BufferDescriptorGetBuffer(batch->bufdescs[i]);
+
+		errcallback.arg = batch->bufdescs[i];
+
+		/* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state. */
+		TerminateBufferIO(batch->bufdescs[i], true, 0, true, false);
+		LWLockRelease(BufferDescriptorGetContentLock(batch->bufdescs[i]));
+		ScheduleBufferTagForWriteback(wb_context, io_context,
+									  &batch->bufdescs[i]->tag);
+		ReleaseBuffer(buffer);
+	}
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_DONE(batch->forkno,
+											 batch->reln->smgr_rlocator.locator.spcOid,
+											 batch->reln->smgr_rlocator.locator.dbOid,
+											 batch->reln->smgr_rlocator.locator.relNumber,
+											 batch->reln->smgr_rlocator.backend,
+											 batch->n, batch->start);
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * RelationGetNumberOfBlocksInFork
  *		Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4a3009d190c..189274fc0c0 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -776,6 +776,29 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	return NULL;
 }
 
+
+/*
+ * Determine the largest IO we can assemble from the given strategy ring given
+ * strategy-specific as well as global constraints on the number of pinned
+ * buffers and max IO size.
+ */
+uint32
+StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
+{
+	uint32		max_write_batch_size = Min(io_combine_limit, MAX_IO_COMBINE_LIMIT);
+	int			strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+	uint32		max_possible_buffer_limit = GetPinLimit();
+
+	/* Identify the minimum of the above */
+	max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+	max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+
+	/* Must allow at least 1 IO for forward progress */
+	max_write_batch_size = Max(1, max_write_batch_size);
+
+	return max_write_batch_size;
+}
+
 /*
  * AddBufferToRing -- add a buffer to the buffer ring
  *
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index aac6e695954..7c2ec99f939 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
 	((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
 }
+
+/*
+ * A helper to set the checksums of multiple blocks
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, const BlockNumber *blknos, uint32 length)
+{
+	/* If we don't need a checksum, just return */
+	if (!DataChecksumsEnabled())
+		return;
+
+	for (uint32 i = 0; i < length; i++)
+	{
+		Page		page = pages[i];
+
+		if (PageIsNew(page))
+			continue;
+		((PageHeader) page)->pd_checksum = pg_checksum_page(page, blknos[i]);
+	}
+}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index e9e413477ba..36dd4f8375b 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -61,6 +61,8 @@ provider postgresql {
 	probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
 	probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
 	probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
+	probe buffer__batch__flush__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
+	probe buffer__batch__flush__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
 
 	probe buffer__checkpoint__start(int);
 	probe buffer__checkpoint__sync__start();
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c07e309a288..ab502c4f825 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -483,6 +483,34 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 	ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
 }
 
+/*
+ * Used to write out multiple blocks at a time in a combined IO. bufdescs
+ * contains buffer descriptors for buffers containing adjacent blocks of the
+ * same fork of the same relation.
+ */
+typedef struct BufWriteBatch
+{
+	RelFileLocator rlocator;
+	ForkNumber	forkno;
+	SMgrRelation reln;
+
+	/*
+	 * The BlockNumber of the first block in the run of contiguous blocks to
+	 * be written out as a single IO.
+	 */
+	BlockNumber start;
+
+	/*
+	 * While assembling the buffers, we keep track of the maximum LSN so that
+	 * we can flush WAL through this LSN before flushing the buffers.
+	 */
+	XLogRecPtr	max_lsn;
+
+	/* The number of valid buffers in bufdescs */
+	uint32		n;
+	BufferDesc *bufdescs[MAX_IO_COMBINE_LIMIT];
+} BufWriteBatch;
+
 /*
  * Internal buffer management routines
  */
@@ -496,6 +524,7 @@ extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
 										  IOContext io_context, BufferTag *tag);
+extern void FlushBufferBatch(BufWriteBatch *batch, IOContext io_context);
 
 extern void TrackNewBufferPin(Buffer buf);
 
@@ -507,9 +536,12 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 /* freelist.c */
 extern bool StrategySupportsEagerFlush(BufferAccessStrategy strategy);
+extern uint32 StrategyMaxWriteBatchSize(BufferAccessStrategy strategy);
 extern Buffer StrategyNextBuffer(BufferAccessStrategy strategy,
 								 int *cursor);
 extern int	StrategyGetCurrentIndex(BufferAccessStrategy strategy);
+extern void CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+								 WritebackContext *wb_context);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index abc2cf2a020..29a400a71eb 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -506,5 +506,7 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 									const void *newtup, Size newsize);
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern void PageSetBatchChecksumInplace(Page *pages, const BlockNumber *blknos,
+										uint32 length);
 
 #endif							/* BUFPAGE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 57f2a9ccdc5..7ad60afe702 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -350,6 +350,7 @@ BufferManagerRelation
 BufferStrategyControl
 BufferTag
 BufferUsage
+BufWriteBatch
 BuildAccumulator
 BuiltinScript
 BulkInsertState
-- 
2.43.0
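
The limit computation in FindFlushAdjacents() and StrategyMaxWriteBatchSize() above can be read as a chain of clamps. Below is a minimal standalone sketch of that idea, not the patch's code. The names sketch_batch_limit, sketch_blocks_left_in_segment and SKETCH_RELSEG_SIZE are hypothetical. It also assumes the per-file limit returned by smgrmaxcombine() behaves like "blocks remaining before the end of the current segment file", with a segment of 131072 blocks (1 GB at an 8 kB block size).

```c
#include <stdint.h>

/* Hypothetical stand-in for the number of blocks in one segment file. */
#define SKETCH_RELSEG_SIZE 131072u

static uint32_t
sketch_min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/*
 * Number of blocks from 'start' up to the end of its segment file.
 * Assumption: a combined write must not cross a segment boundary.
 */
static uint32_t
sketch_blocks_left_in_segment(uint32_t start)
{
    return SKETCH_RELSEG_SIZE - (start % SKETCH_RELSEG_SIZE);
}

/*
 * Clamp the batch size by the file-level limit, the configured combine
 * limit, and the number of buffers we may still pin. Always allow at
 * least one block so the caller makes forward progress.
 */
static uint32_t
sketch_batch_limit(uint32_t start, uint32_t combine_limit, uint32_t pin_limit)
{
    uint32_t limit = sketch_blocks_left_in_segment(start);

    limit = sketch_min_u32(limit, combine_limit);
    limit = sketch_min_u32(limit, pin_limit);

    return limit > 0 ? limit : 1;
}
```

For example, under these assumptions a batch starting two blocks before a segment boundary is limited to two blocks, however large the combine limit is. The floor of one block corresponds to the single-buffer batch that FindFlushAdjacents() falls back to when no additional pins are available.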

Attachment: v10-0005-Add-database-Oid-to-CkptSortItem.patch (text/x-patch; charset=US-ASCII)

From 962dbb9e2967a41c20f9bfda0edad551a061094c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:22:11 -0400
Subject: [PATCH v10 5/7] Add database Oid to CkptSortItem

This is useful for checkpointer write combining -- which will be added
in a future commit.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 8 ++++++++
 src/include/storage/buf_internals.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8a3a5a04d91..7ecf201d1ef 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3405,6 +3405,7 @@ BufferSync(int flags)
 			item = &CkptBufferIds[num_to_scan++];
 			item->buf_id = buf_id;
 			item->tsId = bufHdr->tag.spcOid;
+			item->dbId = bufHdr->tag.dbOid;
 			item->relNumber = BufTagGetRelNumber(&bufHdr->tag);
 			item->forkNum = BufTagGetForkNum(&bufHdr->tag);
 			item->blockNum = bufHdr->tag.blockNum;
@@ -6810,6 +6811,13 @@ ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b)
 		return -1;
 	else if (a->tsId > b->tsId)
 		return 1;
+
+	/* compare database */
+	if (a->dbId < b->dbId)
+		return -1;
+	else if (a->dbId > b->dbId)
+		return 1;
+
 	/* compare relation */
 	if (a->relNumber < b->relNumber)
 		return -1;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index ab502c4f825..feb370175f0 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -449,6 +449,7 @@ extern uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 typedef struct CkptSortItem
 {
 	Oid			tsId;
+	Oid			dbId;
 	RelFileNumber relNumber;
 	ForkNumber	forkNum;
 	BlockNumber blockNum;
-- 
2.43.0
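
One way to see why the database Oid belongs in the sort key: a relation file number is only unique within a tablespace and database, so if two databases share a tablespace, items with equal relNumber can interleave unless the database is compared too. The following is a standalone sketch of such an ordering, not the patch's comparator. SketchSortItem, sketch_item_cmp and sketch_sort_items are hypothetical names, and plain qsort() stands in for whatever sort the real code uses.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified analogue of CkptSortItem. */
typedef struct SketchSortItem
{
    uint32_t tsId;
    uint32_t dbId;
    uint32_t relNumber;
    int      forkNum;
    uint32_t blockNum;
} SketchSortItem;

static int
sketch_cmp_u32(uint32_t a, uint32_t b)
{
    return (a > b) - (a < b);
}

/* Order by tablespace, database, relation, fork, then block number. */
static int
sketch_item_cmp(const void *pa, const void *pb)
{
    const SketchSortItem *a = pa;
    const SketchSortItem *b = pb;
    int         c;

    if ((c = sketch_cmp_u32(a->tsId, b->tsId)) != 0)
        return c;
    if ((c = sketch_cmp_u32(a->dbId, b->dbId)) != 0)
        return c;
    if ((c = sketch_cmp_u32(a->relNumber, b->relNumber)) != 0)
        return c;
    if (a->forkNum != b->forkNum)
        return a->forkNum < b->forkNum ? -1 : 1;
    return sketch_cmp_u32(a->blockNum, b->blockNum);
}

static void
sketch_sort_items(SketchSortItem *items, size_t nitems)
{
    qsort(items, nitems, sizeof(SketchSortItem), sketch_item_cmp);
}
```

With this ordering, blocks 0 and 1 of relation 10 in database 1 sort next to each other even when database 2 also has a relation 10 in the same tablespace, which is what lets a later pass treat adjacent items as candidates for one combined write.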

Attachment: v10-0006-Implement-checkpointer-data-write-combining.patch (text/x-patch; charset=US-ASCII)

From a06ece1b615f3c5ee374818a31f08a6b6ff897c2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 15:23:16 -0400
Subject: [PATCH v10 6/7] Implement checkpointer data write combining

When the checkpointer writes out dirty buffers, writing multiple
contiguous blocks as a single IO is a substantial performance
improvement. The checkpointer is usually bottlenecked on IO, so issuing
larger IOs leads to increased write throughput and faster checkpoints.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Soumya <bharatdbpg@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 224 ++++++++++++++++++++++++----
 src/backend/utils/probes.d          |   2 +-
 2 files changed, 198 insertions(+), 28 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7ecf201d1ef..bfc07edb40e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -513,6 +513,7 @@ static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
 static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
+static uint32 CheckpointerMaxBatchSize(void);
 static void BufferSync(int flags);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
@@ -3346,7 +3347,6 @@ TrackNewBufferPin(Buffer buf)
 static void
 BufferSync(int flags)
 {
-	uint32		buf_state;
 	int			buf_id;
 	int			num_to_scan;
 	int			num_spaces;
@@ -3358,6 +3358,8 @@ BufferSync(int flags)
 	int			i;
 	uint32		mask = BM_DIRTY;
 	WritebackContext wb_context;
+	uint32		max_batch_size;
+	BufWriteBatch batch;
 
 	/*
 	 * Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3389,6 +3391,7 @@ BufferSync(int flags)
 	{
 		BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
 		uint32		set_bits = 0;
+		uint32		buf_state;
 
 		/*
 		 * Header spinlock is enough to examine BM_DIRTY, see comment in
@@ -3531,48 +3534,199 @@ BufferSync(int flags)
 	 */
 	num_processed = 0;
 	num_written = 0;
+	max_batch_size = CheckpointerMaxBatchSize();
 	while (!binaryheap_empty(ts_heap))
 	{
+		BlockNumber limit = max_batch_size;
 		BufferDesc *bufHdr = NULL;
 		CkptTsStatus *ts_stat = (CkptTsStatus *)
 			DatumGetPointer(binaryheap_first(ts_heap));
+		int			ts_end = ts_stat->index - ts_stat->num_scanned + ts_stat->num_to_scan;
+		int			processed = 0;
 
-		buf_id = CkptBufferIds[ts_stat->index].buf_id;
-		Assert(buf_id != -1);
+		batch.start = InvalidBlockNumber;
+		batch.max_lsn = InvalidXLogRecPtr;
+		batch.n = 0;
 
-		bufHdr = GetBufferDescriptor(buf_id);
+		while (batch.n < limit)
+		{
+			uint32		buf_state;
+			XLogRecPtr	lsn = InvalidXLogRecPtr;
+			LWLock	   *content_lock;
+			CkptSortItem item;
 
-		num_processed++;
+			if (ProcSignalBarrierPending)
+				ProcessProcSignalBarrier();
 
-		/*
-		 * We don't need to acquire the lock here, because we're only looking
-		 * at a single bit. It's possible that someone else writes the buffer
-		 * and clears the flag right after we check, but that doesn't matter
-		 * since SyncOneBuffer will then do nothing.  However, there is a
-		 * further race condition: it's conceivable that between the time we
-		 * examine the bit here and the time SyncOneBuffer acquires the lock,
-		 * someone else not only wrote the buffer but replaced it with another
-		 * page and dirtied it.  In that improbable case, SyncOneBuffer will
-		 * write the buffer though we didn't need to.  It doesn't seem worth
-		 * guarding against this, though.
-		 */
-		if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
-		{
-			if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+			/* Check if we are done with this tablespace */
+			if (ts_stat->index + processed >= ts_end)
+				break;
+
+			item = CkptBufferIds[ts_stat->index + processed];
+
+			buf_id = item.buf_id;
+			Assert(buf_id != -1);
+
+			bufHdr = GetBufferDescriptor(buf_id);
+
+			/*
+			 * If this is the first block of the batch, then check if we need
+			 * to open a new relation. Open the relation now because we have
+			 * to determine the maximum IO size based on how many blocks
+			 * remain in the file.
+			 */
+			if (!BlockNumberIsValid(batch.start))
 			{
-				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-				PendingCheckpointerStats.buffers_written++;
-				num_written++;
+				Assert(batch.max_lsn == InvalidXLogRecPtr && batch.n == 0);
+				batch.rlocator.spcOid = item.tsId;
+				batch.rlocator.dbOid = item.dbId;
+				batch.rlocator.relNumber = item.relNumber;
+				batch.forkno = item.forkNum;
+				batch.start = item.blockNum;
+				batch.reln = smgropen(batch.rlocator, INVALID_PROC_NUMBER);
+				limit = smgrmaxcombine(batch.reln, batch.forkno, batch.start);
+				limit = Min(max_batch_size, limit);
+				limit = Min(GetAdditionalPinLimit(), limit);
+				/* Guarantee progress */
+				limit = Max(limit, 1);
 			}
+
+			/*
+			 * Once we hit blocks from the next relation or fork of the
+			 * relation, break out of the loop and issue the IO we've built up
+			 * so far. It is important that we don't increment processed
+			 * because we want to start the next IO with this item.
+			 */
+			if (item.dbId != batch.rlocator.dbOid)
+				break;
+
+			if (item.relNumber != batch.rlocator.relNumber)
+				break;
+
+			if (item.forkNum != batch.forkno)
+				break;
+
+			Assert(item.tsId == batch.rlocator.spcOid);
+
+			/*
+			 * If the next block is not contiguous, we can't include it in the
+			 * IO we will issue. Break out of the loop and issue what we have
+			 * so far. Do not count this item as processed -- otherwise we
+			 * will end up skipping it.
+			 */
+			if (item.blockNum != batch.start + batch.n)
+				break;
+
+			/*
+			 * We don't need to acquire the lock here, because we're only
+			 * looking at a few bits. It's possible that someone else writes
+			 * the buffer and clears the flag right after we check, but that
+			 * doesn't matter since StartBufferIO will then return false.
+			 *
+			 * If the buffer doesn't need checkpointing, don't include it in
+			 * the batch we are building. And if the buffer doesn't need
+			 * flushing, we're done with the item, so count it as processed
+			 * and break out of the loop to issue the IO so far.
+			 */
+			buf_state = pg_atomic_read_u32(&bufHdr->state);
+			if ((buf_state & (BM_CHECKPOINT_NEEDED | BM_VALID | BM_DIRTY)) !=
+				(BM_CHECKPOINT_NEEDED | BM_VALID | BM_DIRTY))
+			{
+				processed++;
+				break;
+			}
+
+			ReservePrivateRefCountEntry();
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+			PinBuffer(bufHdr, NULL, false);
+
+			/*
+			 * There is a race condition here: it's conceivable that between
+			 * the time we examine the buffer header for BM_CHECKPOINT_NEEDED
+			 * above and the time we acquire the content lock, someone else
+			 * not only wrote the buffer but replaced it with another page and
+			 * dirtied it.  In that improbable case, we will write the buffer
+			 * though we didn't need to.  It doesn't seem worth guarding
+			 * against this, though.
+			 */
+			content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+			/*
+			 * We are willing to wait for the content lock on the first IO in
+			 * the batch. However, for subsequent IOs, waiting could lead to
+			 * deadlock. We have to eventually flush all eligible buffers,
+			 * though. So, if we fail to acquire the lock on a subsequent
+			 * buffer, we break out and issue the IO we've built up so far.
+			 * Then we come back and start a new IO with that buffer as the
+			 * starting buffer. As such, we must not count the item as
+			 * processed if we end up failing to acquire the content lock.
+			 */
+			if (batch.n == 0)
+				LWLockAcquire(content_lock, LW_SHARED);
+			else if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				UnpinBuffer(bufHdr);
+				break;
+			}
+
+			/*
+			 * If the buffer doesn't need IO, count the item as processed,
+			 * release the buffer, and break out of the loop to issue the IO
+			 * we have built up so far.
+			 */
+			if (!StartBufferIO(bufHdr, false, true))
+			{
+				processed++;
+				LWLockRelease(content_lock);
+				UnpinBuffer(bufHdr);
+				break;
+			}
+
+			/*
+			 * Lock buffer header lock before examining LSN because we only
+			 * have a shared lock on the buffer.
+			 */
+			buf_state = LockBufHdr(bufHdr);
+			lsn = BufferGetLSN(bufHdr);
+			UnlockBufHdrExt(bufHdr, buf_state, 0, BM_JUST_DIRTIED, 0);
+
+			/*
+			 * Keep track of the max LSN so that we can be sure to flush
+			 * enough WAL before flushing data from the buffers. See comment
+			 * in DoFlushBuffer() for more on why we don't consider the LSNs
+			 * of unlogged relations.
+			 */
+			if (buf_state & BM_PERMANENT && lsn > batch.max_lsn)
+				batch.max_lsn = lsn;
+
+			batch.bufdescs[batch.n++] = bufHdr;
+			processed++;
 		}
 
 		/*
 		 * Measure progress independent of actually having to flush the buffer
-		 * - otherwise writing become unbalanced.
+		 * - otherwise writing becomes unbalanced.
 		 */
-		ts_stat->progress += ts_stat->progress_slice;
-		ts_stat->num_scanned++;
-		ts_stat->index++;
+		num_processed += processed;
+		ts_stat->progress += ts_stat->progress_slice * processed;
+		ts_stat->num_scanned += processed;
+		ts_stat->index += processed;
+
+		/*
+		 * If we built up an IO, issue it. There's a chance we didn't find any
+		 * items referencing buffers that needed flushing this time, but we
+		 * still want to check if we should update the heap if we examined and
+		 * processed the items.
+		 */
+		if (batch.n > 0)
+		{
+			FlushBufferBatch(&batch, IOCONTEXT_NORMAL);
+			CompleteWriteBatchIO(&batch, IOCONTEXT_NORMAL, &wb_context);
+
+			TRACE_POSTGRESQL_BUFFER_BATCH_SYNC_WRITTEN(batch.n);
+			PendingCheckpointerStats.buffers_written += batch.n;
+			num_written += batch.n;
+		}
 
 		/* Have all the buffers from the tablespace been processed? */
 		if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -6408,6 +6562,22 @@ IsBufferCleanupOK(Buffer buffer)
 	return false;
 }
 
+/*
+ * The maximum number of blocks that can be written out in a single batch by
+ * the checkpointer.
+ */
+static uint32
+CheckpointerMaxBatchSize(void)
+{
+	uint32		result;
+	uint32		pin_limit = GetPinLimit();
+
+	result = Min(pin_limit, io_combine_limit);
+	result = Min(result, MAX_IO_COMBINE_LIMIT);
+	result = Max(result, 1);
+	return result;
+}
+
 
 /*
  *	Functions for buffer I/O handling
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 36dd4f8375b..d6970731ba9 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -68,7 +68,7 @@ provider postgresql {
 	probe buffer__checkpoint__sync__start();
 	probe buffer__checkpoint__done();
 	probe buffer__sync__start(int, int);
-	probe buffer__sync__written(int);
+	probe buffer__batch__sync__written(BlockNumber);
 	probe buffer__sync__done(int, int, int);
 
 	probe deadlock__found();
-- 
2.43.0
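
The inner loop added to BufferSync() above walks the sorted items and cuts a batch whenever the relation or fork changes, the next block is not contiguous, or the size limit is reached. Here is a simplified standalone sketch of just that splitting rule. SketchItem, sketch_run_length and sketch_count_batches are hypothetical names. The sketch deliberately leaves out everything else the patch does per buffer: the dirty-state check, pinning, content locking, StartBufferIO(), and LSN tracking.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one sorted checkpoint item. */
typedef struct SketchItem
{
    uint32_t dbId;
    uint32_t relNumber;
    int      forkNum;
    uint32_t blockNum;
} SketchItem;

/*
 * Length of the combinable run that starts at items[0]: same database,
 * relation and fork, consecutive block numbers, at most 'limit' items.
 */
static size_t
sketch_run_length(const SketchItem *items, size_t nitems, size_t limit)
{
    size_t      n;

    if (nitems == 0)
        return 0;
    if (limit == 0)
        limit = 1;              /* guarantee progress */

    for (n = 1; n < nitems && n < limit; n++)
    {
        if (items[n].dbId != items[0].dbId ||
            items[n].relNumber != items[0].relNumber ||
            items[n].forkNum != items[0].forkNum)
            break;              /* next relation or fork starts a new IO */
        if (items[n].blockNum != items[0].blockNum + (uint32_t) n)
            break;              /* not contiguous with the run so far */
    }
    return n;
}

/* Count how many write batches a sorted item array would be split into. */
static size_t
sketch_count_batches(const SketchItem *items, size_t nitems, size_t limit)
{
    size_t      pos = 0;
    size_t      batches = 0;

    while (pos < nitems)
    {
        pos += sketch_run_length(items + pos, nitems - pos, limit);
        batches++;
    }
    return batches;
}
```

The item that ends a run is not consumed by it and becomes the start of the next run. This matches the patch comments about not counting such an item as processed.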

Attachment: v10-0007-Refactor-SyncOneBuffer-for-bgwriter-use-only.patch (text/x-patch; charset=US-ASCII)

From 524d031716b61e44d29fed8a1d72b64125eebc1c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 16:16:58 -0400
Subject: [PATCH v10 7/7] Refactor SyncOneBuffer for bgwriter use only

Since xxx, only bgwriter uses SyncOneBuffer, so we can remove the
skip_recently_used parameter and make that behavior the default.

While we are at it, 5e89985928795f243 introduced the pattern of using a
CAS loop instead of locking the buffer header and then calling
PinBuffer_Locked(). Do that in SyncOneBuffer() so we can avoid taking
the buffer header spinlock in the common case that the buffer is
recently used.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 96 +++++++++++++++++------------
 1 file changed, 56 insertions(+), 40 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bfc07edb40e..378ae71a5ee 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -515,8 +515,7 @@ static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
 static uint32 CheckpointerMaxBatchSize(void);
 static void BufferSync(int flags);
-static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
-						  WritebackContext *wb_context);
+static int	SyncOneBuffer(int buf_id, WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
 static void AbortBufferIO(Buffer buffer);
 static void shared_buffer_write_error_callback(void *arg);
@@ -4000,8 +3999,7 @@ BgBufferSync(WritebackContext *wb_context)
 	/* Execute the LRU scan */
 	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
 	{
-		int			sync_state = SyncOneBuffer(next_to_clean, true,
-											   wb_context);
+		int			sync_state = SyncOneBuffer(next_to_clean, wb_context);
 
 		if (++next_to_clean >= NBuffers)
 		{
@@ -4064,8 +4062,8 @@ BgBufferSync(WritebackContext *wb_context)
 /*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
- * If skip_recently_used is true, we don't write currently-pinned buffers, nor
- * buffers marked recently used, as these are not replacement candidates.
+ * We don't write currently-pinned buffers, nor buffers marked recently used,
+ * as these are not replacement candidates.
  *
  * Returns a bitmask containing the following flag bits:
  *	BUF_WRITTEN: we wrote the buffer.
@@ -4076,53 +4074,71 @@ BgBufferSync(WritebackContext *wb_context)
  * after locking it, but we don't care all that much.)
  */
 static int
-SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
+SyncOneBuffer(int buf_id, WritebackContext *wb_context)
 {
 	BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
 	int			result = 0;
+	uint32		old_buf_state;
 	uint32		buf_state;
 	BufferTag	tag;
 
-	/* Make sure we can handle the pin */
-	ReservePrivateRefCountEntry();
-	ResourceOwnerEnlarge(CurrentResourceOwner);
-
 	/*
-	 * Check whether buffer needs writing.
-	 *
-	 * We can make this check without taking the buffer content lock so long
-	 * as we mark pages dirty in access methods *before* logging changes with
-	 * XLogInsert(): if someone marks the buffer dirty just after our check we
-	 * don't worry because our checkpoint.redo points before log record for
-	 * upcoming changes and so we are not required to write such dirty buffer.
+	 * Check whether the buffer can be used and pin it if so. Do this using a
+	 * CAS loop, to avoid having to lock the buffer header.
 	 */
-	buf_state = LockBufHdr(bufHdr);
-
-	if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
-		BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
+	old_buf_state = pg_atomic_read_u32(&bufHdr->state);
+	for (;;)
 	{
+		buf_state = old_buf_state;
+
+		/*
+		 * We can make these checks without taking the buffer content lock so
+		 * long as we mark pages dirty in access methods *before* logging
+		 * changes with XLogInsert(): if someone marks the buffer dirty just
+		 * after our check we don't worry because our checkpoint.redo points
+		 * before log record for upcoming changes and so we are not required
+		 * to write such dirty buffer.
+		 */
+		if (BUF_STATE_GET_REFCOUNT(buf_state) != 0 ||
+			BUF_STATE_GET_USAGECOUNT(buf_state) != 0)
+		{
+			/* Don't write recently-used buffers */
+			return result;
+		}
+
 		result |= BUF_REUSABLE;
-	}
-	else if (skip_recently_used)
-	{
-		/* Caller told us not to write recently-used buffers */
-		UnlockBufHdr(bufHdr);
-		return result;
-	}
 
-	if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
-	{
-		/* It's clean, so nothing to do */
-		UnlockBufHdr(bufHdr);
-		return result;
+		if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
+		{
+			/* It's clean, so nothing to do */
+			return result;
+		}
+
+		if (unlikely(buf_state & BM_LOCKED))
+		{
+			old_buf_state = WaitBufHdrUnlocked(bufHdr);
+			continue;
+		}
+
+		/* Make sure we can handle the pin */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+
+		/* pin the buffer if the CAS succeeds */
+		buf_state += BUF_REFCOUNT_ONE;
+
+		if (pg_atomic_compare_exchange_u32(&bufHdr->state, &old_buf_state,
+										   buf_state))
+		{
+			TrackNewBufferPin(BufferDescriptorGetBuffer(bufHdr));
+			break;
+		}
 	}
 
 	/*
-	 * Pin it, share-lock it, write it.  (FlushBuffer will do nothing if the
-	 * buffer is clean by the time we've locked it.)
+	 * Share lock and write it out (FlushBuffer will do nothing if the buffer
+	 * is clean by the time we've locked it.)
 	 */
-	PinBuffer_Locked(bufHdr);
-
 	FlushUnlockedBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
 
 	tag = bufHdr->tag;
@@ -4130,8 +4146,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 	UnpinBuffer(bufHdr);
 
 	/*
-	 * SyncOneBuffer() is only called by checkpointer and bgwriter, so
-	 * IOContext will always be IOCONTEXT_NORMAL.
+	 * SyncOneBuffer() is only called by bgwriter, so IOContext will always be
+	 * IOCONTEXT_NORMAL.
 	 */
 	ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag);
 
-- 
2.43.0
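
For readers less familiar with the pattern, the lock-free pin that this patch introduces in SyncOneBuffer() can be sketched with C11 atomics. This is only an illustration under stated assumptions: DemoBuffer, the DEMO_* bit layout, and demo_pin_if_dirty_and_unpinned() are hypothetical names, not PostgreSQL's, and the sketch leaves out the BM_LOCKED wait, the usage-count check, and the resource-owner bookkeeping that the real code performs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packed state word; NOT PostgreSQL's real bit layout. */
#define DEMO_REFCOUNT_ONE	1u
#define DEMO_REFCOUNT_MASK	0x0000FFFFu
#define DEMO_DIRTY			(1u << 16)

typedef struct DemoBuffer
{
	_Atomic uint32_t state;
} DemoBuffer;

/*
 * Pin the buffer only if it is dirty and nobody else has it pinned.
 * Returns true if we took a pin, false if the buffer was skipped.
 */
static bool
demo_pin_if_dirty_and_unpinned(DemoBuffer *buf)
{
	uint32_t	old_state;

	assert(buf != NULL);
	old_state = atomic_load(&buf->state);

	for (;;)
	{
		uint32_t	new_state = old_state;

		/* Pinned by someone else: not a replacement candidate. */
		if ((old_state & DEMO_REFCOUNT_MASK) != 0)
			return false;

		/* Clean: nothing to write. */
		if (!(old_state & DEMO_DIRTY))
			return false;

		new_state += DEMO_REFCOUNT_ONE;

		/*
		 * If the state changed under us, the CAS fails, old_state is
		 * refreshed with the current value, and the checks run again.
		 */
		if (atomic_compare_exchange_weak(&buf->state, &old_state, new_state))
			return true;
	}
}
```

The point of the loop is that the checks and the pin are applied to the same observed state word, so no header spinlock is needed to keep them consistent.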

#23

Chao Li

li.evan.chao@gmail.com

about 2 months ago

In reply to: Melanie Plageman (#22)

Re: Checkpointer write combining

On Nov 20, 2025, at 06:12, Melanie Plageman <melanieplageman@gmail.com> wrote:

Are you sure you applied 0005? It is the one that adds dbId to CkptSortItem.

My bad. Yes, I forgot to download and apply 0005.

- Melanie

I went through the whole patch set again and have a few more comments:

1 - 0002
```
+/*
+ * Prepare and write out a dirty victim buffer.
+ *
+ * Buffer must be pinned, the content lock must be held exclusively, and the
+ * buffer header spinlock must not be held. The exclusive lock is released and
+ * the buffer is returned pinned but not locked.
+ *
+ * bufdesc may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+				  bool from_ring, IOContext io_context)
+{
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	LWLock	   *content_lock;

-	/* Find smgr relation for buffer */
-	if (reln == NULL)
-		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+	/* Set up this victim buffer to be flushed */
+	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
+		return;
```

I believe when PrepareFlushBuffer(bufdesc, &max_lsn) is false, before “return”, we should release “content_lock”.

Because the function comment clearly says “the content lock must be held exclusively”. Also, looking at the code where CleanVictimBuffer() is called:
```
if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
{
	LWLockRelease(content_lock);
	UnpinBuffer(buf_hdr);
	continue;
}

/* Content lock is released inside CleanVictimBuffer */
CleanVictimBuffer(buf_hdr, from_ring, io_context);
```

In the previous “if” clause, it releases content_lock.

2 - 0002
```
+ * Buffer must be pinned, the content lock must be held exclusively, and the
+ * buffer header spinlock must not be held. The exclusive lock is released and
+ * the buffer is returned pinned but not locked.
+ *
+ * bufdesc may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+				  bool from_ring, IOContext io_context)
```

The function comment says “the content lock must be held exclusively”, but in the calling code:
```
content_lock = BufferDescriptorGetContentLock(buf_hdr);
if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
{
	/*
	 * Someone else has locked the buffer, so give it up and loop
	 * back to get another one.
	 */
	UnpinBuffer(buf_hdr);
	continue;
}

/*
 * If using a nondefault strategy, and writing the buffer would
 * require a WAL flush, let the strategy decide whether to go
 * ahead and write/reuse the buffer or to choose another victim.
 * We need the content lock to inspect the page LSN, so this can't
 * be done inside StrategyGetBuffer.
 */
if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
{
	LWLockRelease(content_lock);
	UnpinBuffer(buf_hdr);
	continue;
}

/* Content lock is released inside CleanVictimBuffer */
CleanVictimBuffer(buf_hdr, from_ring, io_context);
```
Content_lock is acquired with LW_SHARED.

3 - 0003

In CleanVictimBuffer(), more logic is added, but the content_lock leak problem is still there.

4 - 0003
```
+/*
+ * Returns true if the buffer needs WAL flushed before it can be written out.
+ * Caller must not already hold the buffer header spinlock.
+ */
+static bool
+BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
+	uint32		buf_state = LockBufHdr(bufdesc);
+
+	*lsn = BufferGetLSN(bufdesc);
+
+	UnlockBufHdr(bufdesc);
+
+	/*
+	 * See buffer flushing code for more details on why we condition this on
+	 * the relation being logged.
+	 */
+	return buf_state & BM_PERMANENT && XLogNeedsFlush(*lsn);
+}
```

This is a new function. I wonder whether we should update “lsn” only when the buffer is BM_PERMANENT? In the existing code:
```
if (buf_state & BM_PERMANENT)
	XLogFlush(recptr);
```

XLogFlush should only happen when BM_PERMANENT.

5 - 0004 - I am wondering whether there could be a race condition here?

PageSetBatchChecksumInplace() is called once after all pages were pinned earlier, but other backends may modify the page contents while the batch is being assembled, because batching only holds content_lock per page temporarily, not across the entire run.
So that:
• Page A pinned + content lock acquired + LSN read → content lock released
• Another backend modifies Page A and sets new LSN, dirties buffer
• Page A is written by this batch using an outdated checksum / outdated page contents

6 - 0006 - Ah, 0006 seems to have solved comment 4 about BM_PERMANENT.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#24

Melanie Plageman

melanieplageman@gmail.com

about 2 months ago

In reply to: Chao Li (#23)

7 attachment(s)

Re: Checkpointer write combining

Thanks for continuing to review this!

On Wed, Nov 19, 2025 at 9:32 PM Chao Li <li.evan.chao@gmail.com> wrote:

+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+                                 bool from_ring, IOContext io_context)
+{
+       XLogRecPtr      max_lsn = InvalidXLogRecPtr;
+       LWLock     *content_lock;

-       /* Find smgr relation for buffer */
-       if (reln == NULL)
-               reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+       /* Set up this victim buffer to be flushed */
+       if (!PrepareFlushBuffer(bufdesc, &max_lsn))
+               return;

I believe when PrepareFlushBuffer(bufdesc, &max_lsn) is false, before “return”, we should release “content_lock”.

Oh wow, you are so right. That's a big mistake! I would have thought a
test would fail, but it seems we don't have coverage of this. I've
fixed the code in the attached v11. I'll see how difficult it is to
write a test that covers the case where we try to do IO but someone
else has already done it.
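
To make the contract concrete, here is a minimal sketch of the shape of the fix: a function that is documented to release the caller's lock has to do so on the early-return path as well. Everything here is a stand-in (DemoLock, DemoVictim, and the demo_* functions are hypothetical and use a plain counter rather than an LWLock); it only mirrors the control flow of CleanVictimBuffer() in v11-0002.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock that only counts holders; a stand-in for a content lock. */
typedef struct DemoLock
{
	int			held;
} DemoLock;

static void
demo_lock_acquire(DemoLock *lock)
{
	lock->held++;
}

static void
demo_lock_release(DemoLock *lock)
{
	assert(lock->held > 0);
	lock->held--;
}

typedef struct DemoVictim
{
	DemoLock	content_lock;
	bool		io_already_done;	/* someone else flushed it first */
	int			writes;				/* number of writes we performed */
} DemoVictim;

/* Returns false when there is nothing left for us to write. */
static bool
demo_prepare_flush(const DemoVictim *victim)
{
	return !victim->io_already_done;
}

/*
 * Caller holds content_lock. The lock is released on every path,
 * including the early return when no write is needed.
 */
static void
demo_clean_victim(DemoVictim *victim)
{
	if (!demo_prepare_flush(victim))
	{
		demo_lock_release(&victim->content_lock);
		return;
	}

	victim->writes++;			/* stands in for the actual write */
	demo_lock_release(&victim->content_lock);
}
```

A test for the real code would need to reach the first branch, which is the case where another process completes the I/O between victim selection and the flush attempt.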

+ * Buffer must be pinned, the content lock must be held exclusively, and the
+ * buffer header spinlock must not be held. The exclusive lock is released and
+ * the buffer is returned pinned but not locked.
The function comment says "the content lock must be held exclusively”, but from the code:
Content_lock is acquired with LW_SHARED.

Thanks, I've updated the comment.

+static bool
+BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
+       uint32          buf_state = LockBufHdr(bufdesc);
+
+       *lsn = BufferGetLSN(bufdesc);
+
+       UnlockBufHdr(bufdesc);
+
+       /*
+        * See buffer flushing code for more details on why we condition this on
+        * the relation being logged.
+        */
+       return buf_state & BM_PERMANENT && XLogNeedsFlush(*lsn);
+}

This is a new function. I wonder whether we should update “lsn” only when the buffer is BM_PERMANENT?

That makes sense. I don't use the lsn unless the buffer is logged
(BM_PERMANENT), but I think it is weird for the function to set the
LSN if it is returning false. I've modified the function to set it to
InvalidXLogRecPtr when it would return false. I've also added an
assert before XLogFlush() to make sure the buffer is logged before
flushing the WAL.
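
Roughly, the revised behavior looks like the sketch below. It is not the v11 code: DemoLsn, DemoBuffer, demo_flushed_upto, and the demo_* functions are hypothetical stand-ins, the buffer header locking is omitted, and demo_wal_needs_flush() is a simplified placeholder for XLogNeedsFlush().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t DemoLsn;

#define DEMO_INVALID_LSN	((DemoLsn) 0)
#define DEMO_PERMANENT		(1u << 0)	/* stand-in for BM_PERMANENT */

typedef struct DemoBuffer
{
	uint32_t	flags;
	DemoLsn		page_lsn;
} DemoBuffer;

/* Pretend WAL has been flushed up to this position. */
static DemoLsn demo_flushed_upto = 100;

static bool
demo_wal_needs_flush(DemoLsn lsn)
{
	return lsn > demo_flushed_upto;
}

/*
 * Returns true if WAL must be flushed before the buffer can be written.
 * *lsn is set to the page LSN only in that case; otherwise it is set to
 * the invalid LSN so callers cannot act on a value they should not use.
 */
static bool
demo_buffer_needs_wal_flush(const DemoBuffer *buf, DemoLsn *lsn)
{
	assert(buf != NULL && lsn != NULL);

	if ((buf->flags & DEMO_PERMANENT) && demo_wal_needs_flush(buf->page_lsn))
	{
		*lsn = buf->page_lsn;
		return true;
	}

	*lsn = DEMO_INVALID_LSN;
	return false;
}
```

With this shape, a caller can assert that the LSN is valid, and therefore that the buffer is logged, right before flushing WAL.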

5 - 0004 - I am wondering whether there could be a race condition here?

PageSetBatchChecksumInplace() is called once after all pages were pinned earlier, but other backends may modify the page contents while the batch is being assembled, because batching only holds content_lock per page temporarily, not across the entire run.
So that:
• Page A pinned + content lock acquired + LSN read → content lock released
• Another backend modifies Page A and sets new LSN, dirties buffer
• Page A is written by this batch using an outdated checksum / outdated page contents

So, there is a race condition but it is slightly different from the
scenario you mention. We actually hold the content lock until after
doing the write. That means someone else can't get an exclusive lock
and modify the tuple data in the buffer. However, queries can set hint
bits while only holding a share lock. That can update the LSN (if an
FPI is required), which would cause bad things to happen when we write
out a buffer with an outdated checksum. What we do in master is make a
copy of the buffer, checksum it, and write out the copied buffer (see
PageSetChecksumCopy() and its function comment).
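
As an illustration of why the copy matters, here is a small sketch of the copy-then-checksum idea. It is an assumption-laden toy: DemoPage, demo_checksum(), demo_page_verify(), and demo_set_checksum_copy() are hypothetical, the page is 64 bytes, and the checksum is a simple rolling hash rather than PostgreSQL's page checksum algorithm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 64

typedef struct DemoPage
{
	uint16_t	checksum;
	uint8_t		data[DEMO_PAGE_SIZE];
} DemoPage;

/* Toy checksum; NOT the algorithm PostgreSQL uses. */
static uint16_t
demo_checksum(const uint8_t *data, size_t len)
{
	uint32_t	sum = 0;

	for (size_t i = 0; i < len; i++)
		sum = sum * 31u + data[i];
	return (uint16_t) (sum ^ (sum >> 16));
}

static bool
demo_page_verify(const DemoPage *page)
{
	return page->checksum == demo_checksum(page->data, sizeof(page->data));
}

/*
 * Snapshot the shared page into private scratch memory, then checksum
 * the snapshot. The returned page is what gets written, so later
 * changes to the shared page cannot invalidate its checksum.
 */
static const DemoPage *
demo_set_checksum_copy(const DemoPage *shared, DemoPage *scratch)
{
	assert(shared != NULL && scratch != NULL);
	memcpy(scratch, shared, sizeof(*scratch));
	scratch->checksum = demo_checksum(scratch->data, sizeof(scratch->data));
	return scratch;
}
```

If a hint bit flips in the shared page after the snapshot is taken, the snapshot still verifies. Checksumming in place gives no such guarantee unless something prevents hint-bit updates for the duration of the write.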

I have an XXX in PageSetBatchChecksumInplace() and in the commit
message for this patch explaining that it isn't committable until some
ongoing work Andres is doing adding a new lock type for setting hint
bits is committed. So, this specific part of this patch is WIP.

- Melanie

Attachments:

v11-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch (text/x-patch; charset=US-ASCII)

From 171fc1e7c2d4a1d4b8fccf338a1203d7fcff0b7f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 10:53:48 -0400
Subject: [PATCH v11 1/7] Refactor goto into for loop in GetVictimBuffer()

GetVictimBuffer() implemented a loop to optimistically lock a clean
victim buffer using a goto. Future commits will add batch flushing
functionality to GetVictimBuffer. The new logic works better with
standard for loop flow control.

This commit is only a refactor and does not introduce any new
functionality.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 189 ++++++++++++--------------
 src/backend/storage/buffer/freelist.c |  19 ++-
 src/include/storage/buf_internals.h   |   5 +
 3 files changed, 110 insertions(+), 103 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 327ddb7adc8..90c24b8d93d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -68,10 +68,6 @@
 #include "utils/timestamp.h"
 
 
-/* Note: these two macros only work on shared buffers, not local ones! */
-#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
-#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
-
 /* Note: this macro only works on local buffers, not shared ones! */
 #define LocalBufHdrGetBlock(bufHdr) \
 	LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]
@@ -2331,125 +2327,116 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 	ReservePrivateRefCountEntry();
 	ResourceOwnerEnlarge(CurrentResourceOwner);
 
-	/* we return here if a prospective victim buffer gets used concurrently */
-again:
-
-	/*
-	 * Select a victim buffer.  The buffer is returned pinned and owned by
-	 * this backend.
-	 */
-	buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
-	buf = BufferDescriptorGetBuffer(buf_hdr);
-
-	/*
-	 * We shouldn't have any other pins for this buffer.
-	 */
-	CheckBufferIsPinnedOnce(buf);
-
-	/*
-	 * If the buffer was dirty, try to write it out.  There is a race
-	 * condition here, in that someone might dirty it after we released the
-	 * buffer header lock above, or even while we are writing it out (since
-	 * our share-lock won't prevent hint-bit updates).  We will recheck the
-	 * dirty bit after re-locking the buffer header.
-	 */
-	if (buf_state & BM_DIRTY)
+	/* Select a victim buffer using an optimistic locking scheme. */
+	for (;;)
 	{
-		LWLock	   *content_lock;
 
-		Assert(buf_state & BM_TAG_VALID);
-		Assert(buf_state & BM_VALID);
+		/* Attempt to claim a victim buffer. Buffer is returned pinned. */
+		buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+		buf = BufferDescriptorGetBuffer(buf_hdr);
 
 		/*
-		 * We need a share-lock on the buffer contents to write it out (else
-		 * we might write invalid data, eg because someone else is compacting
-		 * the page contents while we write).  We must use a conditional lock
-		 * acquisition here to avoid deadlock.  Even though the buffer was not
-		 * pinned (and therefore surely not locked) when StrategyGetBuffer
-		 * returned it, someone else could have pinned and exclusive-locked it
-		 * by the time we get here. If we try to get the lock unconditionally,
-		 * we'd block waiting for them; if they later block waiting for us,
-		 * deadlock ensues. (This has been observed to happen when two
-		 * backends are both trying to split btree index pages, and the second
-		 * one just happens to be trying to split the page the first one got
-		 * from StrategyGetBuffer.)
+		 * We shouldn't have any other pins for this buffer.
 		 */
-		content_lock = BufferDescriptorGetContentLock(buf_hdr);
-		if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
-		{
-			/*
-			 * Someone else has locked the buffer, so give it up and loop back
-			 * to get another one.
-			 */
-			UnpinBuffer(buf_hdr);
-			goto again;
-		}
+		CheckBufferIsPinnedOnce(buf);
 
 		/*
-		 * If using a nondefault strategy, and writing the buffer would
-		 * require a WAL flush, let the strategy decide whether to go ahead
-		 * and write/reuse the buffer or to choose another victim.  We need a
-		 * lock to inspect the page LSN, so this can't be done inside
-		 * StrategyGetBuffer.
+		 * If the buffer was dirty, try to write it out.  There is a race
+		 * condition here, in that someone might dirty it after we released
+		 * the buffer header lock above, or even while we are writing it out
+		 * (since our share-lock won't prevent hint-bit updates).  We will
+		 * recheck the dirty bit after re-locking the buffer header.
 		 */
-		if (strategy != NULL)
+		if (buf_state & BM_DIRTY)
 		{
-			XLogRecPtr	lsn;
+			LWLock	   *content_lock;
 
-			/* Read the LSN while holding buffer header lock */
-			buf_state = LockBufHdr(buf_hdr);
-			lsn = BufferGetLSN(buf_hdr);
-			UnlockBufHdr(buf_hdr);
+			Assert(buf_state & BM_TAG_VALID);
+			Assert(buf_state & BM_VALID);
+
+			/*
+			 * We need a share-lock on the buffer contents to write it out
+			 * (else we might write invalid data, eg because someone else is
+			 * compacting the page contents while we write).  We must use a
+			 * conditional lock acquisition here to avoid deadlock.  Even
+			 * though the buffer was not pinned (and therefore surely not
+			 * locked) when StrategyGetBuffer returned it, someone else could
+			 * have pinned and exclusive-locked it by the time we get here. If
+			 * we try to get the lock unconditionally, we'd block waiting for
+			 * them; if they later block waiting for us, deadlock ensues.
+			 * (This has been observed to happen when two backends are both
+			 * trying to split btree index pages, and the second one just
+			 * happens to be trying to split the page the first one got from
+			 * StrategyGetBuffer.)
+			 */
+			content_lock = BufferDescriptorGetContentLock(buf_hdr);
+			if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				/*
+				 * Someone else has locked the buffer, so give it up and loop
+				 * back to get another one.
+				 */
+				UnpinBuffer(buf_hdr);
+				continue;
+			}
 
-			if (XLogNeedsFlush(lsn)
-				&& StrategyRejectBuffer(strategy, buf_hdr, from_ring))
+			/*
+			 * If using a nondefault strategy, and writing the buffer would
+			 * require a WAL flush, let the strategy decide whether to go
+			 * ahead and write/reuse the buffer or to choose another victim.
+			 * We need the content lock to inspect the page LSN, so this can't
+			 * be done inside StrategyGetBuffer.
+			 */
+			if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
 			{
 				LWLockRelease(content_lock);
 				UnpinBuffer(buf_hdr);
-				goto again;
+				continue;
 			}
-		}
 
-		/* OK, do the I/O */
-		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-		LWLockRelease(content_lock);
+			/* OK, do the I/O */
+			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
+			LWLockRelease(content_lock);
 
-		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-									  &buf_hdr->tag);
-	}
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &buf_hdr->tag);
+		}
 
 
-	if (buf_state & BM_VALID)
-	{
+		if (buf_state & BM_VALID)
+		{
+			/*
+			 * When a BufferAccessStrategy is in use, blocks evicted from
+			 * shared buffers are counted as IOOP_EVICT in the corresponding
+			 * context (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted
+			 * by a strategy in two cases: 1) while initially claiming buffers
+			 * for the strategy ring 2) to replace an existing strategy ring
+			 * buffer because it is pinned or in use and cannot be reused.
+			 *
+			 * Blocks evicted from buffers already in the strategy ring are
+			 * counted as IOOP_REUSE in the corresponding strategy context.
+			 *
+			 * At this point, we can accurately count evictions and reuses,
+			 * because we have successfully claimed the valid buffer.
+			 * Previously, we may have been forced to release the buffer due
+			 * to concurrent pinners or erroring out.
+			 */
+			pgstat_count_io_op(IOOBJECT_RELATION, io_context,
+							   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
+		}
+
 		/*
-		 * When a BufferAccessStrategy is in use, blocks evicted from shared
-		 * buffers are counted as IOOP_EVICT in the corresponding context
-		 * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
-		 * strategy in two cases: 1) while initially claiming buffers for the
-		 * strategy ring 2) to replace an existing strategy ring buffer
-		 * because it is pinned or in use and cannot be reused.
-		 *
-		 * Blocks evicted from buffers already in the strategy ring are
-		 * counted as IOOP_REUSE in the corresponding strategy context.
-		 *
-		 * At this point, we can accurately count evictions and reuses,
-		 * because we have successfully claimed the valid buffer. Previously,
-		 * we may have been forced to release the buffer due to concurrent
-		 * pinners or erroring out.
+		 * If the buffer has an entry in the buffer mapping table, delete it.
+		 * This can fail because another backend could have pinned or dirtied
+		 * the buffer.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, io_context,
-						   from_ring ? IOOP_REUSE : IOOP_EVICT, 1, 0);
-	}
+		if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
+		{
+			UnpinBuffer(buf_hdr);
+			continue;
+		}
 
-	/*
-	 * If the buffer has an entry in the buffer mapping table, delete it. This
-	 * can fail because another backend could have pinned or dirtied the
-	 * buffer.
-	 */
-	if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
-	{
-		UnpinBuffer(buf_hdr);
-		goto again;
+		break;
 	}
 
 	/* a final set of sanity checks */
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 28d952b3534..acbabeb3c3b 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "pgstat.h"
 #include "port/atomics.h"
 #include "storage/buf_internals.h"
@@ -780,12 +781,20 @@ IOContextForStrategy(BufferAccessStrategy strategy)
  * be written out and doing so would require flushing WAL too.  This gives us
  * a chance to choose a different victim.
  *
+ * The buffer must be pinned and content locked and the buffer header spinlock
+ * must not be held. We must hold the content lock to examine the LSN.
+ *
  * Returns true if buffer manager should ask for a new victim, and false
  * if this buffer should be written and re-used.
  */
 bool
 StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
+	XLogRecPtr	lsn;
+
+	if (!strategy)
+		return false;
+
 	/* We only do this in bulkread mode */
 	if (strategy->btype != BAS_BULKREAD)
 		return false;
@@ -795,11 +804,17 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
 		return false;
 
+	LockBufHdr(buf);
+	lsn = BufferGetLSN(buf);
+	UnlockBufHdr(buf);
+
+	if (!XLogNeedsFlush(lsn))
+		return false;
+
 	/*
-	 * Remove the dirty buffer from the ring; necessary to prevent infinite
+	 * Remove the dirty buffer from the ring; necessary to prevent an infinite
 	 * loop if all ring members are dirty.
 	 */
 	strategy->buffers[strategy->current] = InvalidBuffer;
-
 	return true;
 }
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 5400c56a965..04fdea64f83 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -486,6 +486,11 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 /*
  * Internal buffer management routines
  */
+
+/* Note: these two macros only work on shared buffers, not local ones! */
+#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufferGetLSN(bufHdr)	(PageGetLSN(BufHdrGetBlock(bufHdr)))
+
 /* bufmgr.c */
 extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
-- 
2.43.0

v11-0002-Split-FlushBuffer-into-two-parts.patch (text/x-patch; charset=US-ASCII)

From 8d2ec82321118daaabbac64785f4de5f7ef9d298 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 10:54:19 -0400
Subject: [PATCH v11 2/7] Split FlushBuffer() into two parts

Before adding write combining to write a batch of blocks when flushing
dirty buffers, refactor FlushBuffer() into the preparatory step and
actual buffer flushing step. This separation provides symmetry with
future code for batch flushing which necessarily separates these steps,
as it must prepare multiple buffers before flushing them together.

These steps are moved into two new helper functions that FlushBuffer()
calls. They are also used by a new function, CleanVictimBuffer(), which
will contain both the batch flushing and single flush code in future
commits.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 143 +++++++++++++++++++---------
 1 file changed, 99 insertions(+), 44 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 90c24b8d93d..c88ff08493c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -533,6 +533,12 @@ static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn);
+static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln,
+						  IOObject io_object, IOContext io_context,
+						  XLogRecPtr buffer_lsn);
+static void CleanVictimBuffer(BufferDesc *bufdesc, bool from_ring,
+							  IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2394,12 +2400,8 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 				continue;
 			}
 
-			/* OK, do the I/O */
-			FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-			LWLockRelease(content_lock);
-
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &buf_hdr->tag);
+			/* Content lock is released inside CleanVictimBuffer */
+			CleanVictimBuffer(buf_hdr, from_ring, io_context);
 		}
 
 
@@ -4270,54 +4272,65 @@ static void
 FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 			IOContext io_context)
 {
-	XLogRecPtr	recptr;
-	ErrorContextCallback errcallback;
-	instr_time	io_start;
-	Block		bufBlock;
-	char	   *bufToWrite;
-	uint32		buf_state;
+	XLogRecPtr	lsn;
 
-	/*
-	 * Try to start an I/O operation.  If StartBufferIO returns false, then
-	 * someone else flushed the buffer before we could, so we need not do
-	 * anything.
-	 */
-	if (!StartBufferIO(buf, false, false))
-		return;
+	if (PrepareFlushBuffer(buf, &lsn))
+		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
+}
 
-	/* Setup error traceback support for ereport() */
-	errcallback.callback = shared_buffer_write_error_callback;
-	errcallback.arg = buf;
-	errcallback.previous = error_context_stack;
-	error_context_stack = &errcallback;
+/*
+ * Prepare and write out a dirty victim buffer.
+ *
+ * Buffer must be pinned, the content lock must be held, and the buffer header
+ * spinlock must not be held. The content lock is released and the buffer is
+ * returned pinned but not locked.
+ *
+ * bufdesc may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+				  bool from_ring, IOContext io_context)
+{
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
 
-	/* Find smgr relation for buffer */
-	if (reln == NULL)
-		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+	/* Set up this victim buffer to be flushed */
+	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
+	{
+		LWLockRelease(BufferDescriptorGetContentLock(bufdesc));
+		return;
+	}
 
-	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
-										buf->tag.blockNum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber);
+	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+	LWLockRelease(BufferDescriptorGetContentLock(bufdesc));
+	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+								  &bufdesc->tag);
+}
 
-	buf_state = LockBufHdr(buf);
+/*
+ * Prepare the buffer with bufdesc for writing. Returns true if the buffer
+ * actually needs writing and false otherwise. lsn returns the buffer's LSN if
+ * the table is logged.
+ */
+static bool
+PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
+	uint32		buf_state;
 
 	/*
-	 * Run PageGetLSN while holding header lock, since we don't have the
-	 * buffer locked exclusively in all cases.
+	 * Try to start an I/O operation.  If StartBufferIO returns false, then
+	 * someone else flushed the buffer before we could, so we need not do
+	 * anything.
 	 */
-	recptr = BufferGetLSN(buf);
+	if (!StartBufferIO(bufdesc, false, false))
+		return false;
 
-	/* To check if block content changes while flushing. - vadim 01/17/97 */
-	UnlockBufHdrExt(buf, buf_state,
-					0, BM_JUST_DIRTIED,
-					0);
+	*lsn = InvalidXLogRecPtr;
+	buf_state = LockBufHdr(bufdesc);
 
 	/*
-	 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
-	 * rule that log updates must hit disk before any of the data-file changes
-	 * they describe do.
+	 * Record the buffer's LSN. We will force XLOG flush up to buffer's LSN.
+	 * This implements the basic WAL rule that log updates must hit disk
+	 * before any of the data-file changes they describe do.
 	 *
 	 * However, this rule does not apply to unlogged relations, which will be
 	 * lost after a crash anyway.  Most unlogged relation pages do not bear
@@ -4330,9 +4343,51 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * happen, attempting to flush WAL through that location would fail, with
 	 * disastrous system-wide consequences.  To make sure that can't happen,
 	 * skip the flush if the buffer isn't permanent.
+	 *
+	 * We must hold the buffer header lock when examining the page LSN since
+	 * we don't have the buffer exclusively locked in all cases.
 	 */
 	if (buf_state & BM_PERMANENT)
-		XLogFlush(recptr);
+		*lsn = BufferGetLSN(bufdesc);
+
+	/* To check if block content changes while flushing. - vadim 01/17/97 */
+	UnlockBufHdrExt(bufdesc, buf_state,
+					0, BM_JUST_DIRTIED,
+					0);
+	return true;
+}
+
+/*
+ * Actually do the write I/O to clean a buffer. buf and reln may be modified.
+ */
+static void
+DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+			  IOContext io_context, XLogRecPtr buffer_lsn)
+{
+	ErrorContextCallback errcallback;
+	instr_time	io_start;
+	Block		bufBlock;
+	char	   *bufToWrite;
+
+	/* Setup error traceback support for ereport() */
+	errcallback.callback = shared_buffer_write_error_callback;
+	errcallback.arg = buf;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* Find smgr relation for buffer */
+	if (reln == NULL)
+		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+
+	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
+										buf->tag.blockNum,
+										reln->smgr_rlocator.locator.spcOid,
+										reln->smgr_rlocator.locator.dbOid,
+										reln->smgr_rlocator.locator.relNumber);
+
+	/* Force XLOG flush up to buffer's LSN */
+	if (XLogRecPtrIsValid(buffer_lsn))
+		XLogFlush(buffer_lsn);
 
 	/*
 	 * Now it's safe to write the buffer to disk. Note that no one else should
-- 
2.43.0

Attachment: v11-0003-Eagerly-flush-bulkwrite-strategy-ring.patch (text/x-patch; charset=US-ASCII)

From ddb6b8e78b56d331707c22c7d37af4087182e83c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 13:15:43 -0400
Subject: [PATCH v11 3/7] Eagerly flush bulkwrite strategy ring

Operations using BAS_BULKWRITE (COPY FROM and createdb) will inevitably
need to flush buffers in the strategy ring in order to reuse them. By
eagerly flushing the buffers in a larger run, we encourage larger writes
at the kernel level and less interleaving of WAL flushes and data file
writes. The effect is mainly noticeable with multiple parallel COPY
FROMs. In this case, client backends achieve higher write throughput and
end up spending less time waiting on acquiring the lock to flush WAL.
Larger flush operations also mean less time waiting for flush operations
at the kernel level.

The heuristic for eager eviction is to only flush buffers in the
strategy ring which do not require a WAL flush.

This patch is also a step toward AIO writes, as it lines up multiple
buffers that can be issued asynchronously once the infrastructure
exists.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Earlier version Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 249 +++++++++++++++++++++++++-
 src/backend/storage/buffer/freelist.c |  48 +++++
 src/include/storage/buf_internals.h   |   4 +
 3 files changed, 292 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c88ff08493c..9b39129da42 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -531,14 +531,25 @@ static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_c
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
+
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
-static bool PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn);
+static BufferDesc *NextStrategyBufToFlush(BufferAccessStrategy strategy,
+										  Buffer sweep_end,
+										  XLogRecPtr *lsn, int *sweep_cursor);
+
+static bool BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn);
+static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+												   RelFileLocator *rlocator, bool skip_pinned,
+												   XLogRecPtr *max_lsn);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc,
+							   XLogRecPtr *lsn);
 static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						  IOObject io_object, IOContext io_context,
 						  XLogRecPtr buffer_lsn);
-static void CleanVictimBuffer(BufferDesc *bufdesc, bool from_ring,
-							  IOContext io_context);
+static void CleanVictimBuffer(BufferAccessStrategy strategy,
+							  BufferDesc *bufdesc,
+							  bool from_ring, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2401,7 +2412,7 @@ GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 			}
 
 			/* Content lock is released inside CleanVictimBuffer */
-			CleanVictimBuffer(buf_hdr, from_ring, io_context);
+			CleanVictimBuffer(strategy, buf_hdr, from_ring, io_context);
 		}
 
 
@@ -4278,6 +4289,69 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * Returns true if the buffer needs WAL flushed before it can be written out.
+ * Caller must not already hold the buffer header spinlock. If the buffer is
+ * unlogged, *lsn shouldn't be used by the caller and is set to
+ * InvalidXLogRecPtr.
+ */
+static bool
+BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
+	uint32		buf_state;
+
+	buf_state = LockBufHdr(bufdesc);
+	*lsn = BufferGetLSN(bufdesc);
+	UnlockBufHdr(bufdesc);
+
+	/*
+	 * See buffer flushing code for more details on why we condition this on
+	 * the relation being logged.
+	 */
+	if (!(buf_state & BM_PERMANENT))
+	{
+		*lsn = InvalidXLogRecPtr;
+		return false;
+	}
+
+	return XLogNeedsFlush(*lsn);
+}
+
+
+/*
+ * Returns the buffer descriptor of the buffer containing the next block we
+ * should eagerly flush or NULL when there are no further buffers to consider
+ * writing out.
+ */
+static BufferDesc *
+NextStrategyBufToFlush(BufferAccessStrategy strategy,
+					   Buffer sweep_end,
+					   XLogRecPtr *lsn, int *sweep_cursor)
+{
+	Buffer		bufnum;
+	BufferDesc *bufdesc;
+
+	while ((bufnum =
+			StrategyNextBuffer(strategy, sweep_cursor)) != sweep_end)
+	{
+		/*
+		 * For BAS_BULKWRITE, once you hit an InvalidBuffer, the remaining
+		 * buffers in the ring will be invalid.
+		 */
+		if (!BufferIsValid(bufnum))
+			break;
+
+		if ((bufdesc = PrepareOrRejectEagerFlushBuffer(bufnum,
+													   InvalidBlockNumber,
+													   NULL,
+													   true,
+													   lsn)) != NULL)
+			return bufdesc;
+	}
+
+	return NULL;
+}
+
 /*
  * Prepare and write out a dirty victim buffer.
  *
@@ -4288,10 +4362,12 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
  * bufdesc may be modified.
  */
 static void
-CleanVictimBuffer(BufferDesc *bufdesc,
+CleanVictimBuffer(BufferAccessStrategy strategy,
+				  BufferDesc *bufdesc,
 				  bool from_ring, IOContext io_context)
 {
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	bool		first_buffer = true;
 
 	/* Set up this victim buffer to be flushed */
 	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
@@ -4300,10 +4376,162 @@ CleanVictimBuffer(BufferDesc *bufdesc,
 		return;
 	}
 
-	DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
-	LWLockRelease(BufferDescriptorGetContentLock(bufdesc));
-	ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-								  &bufdesc->tag);
+	if (from_ring && StrategySupportsEagerFlush(strategy))
+	{
+		Buffer		sweep_end = BufferDescriptorGetBuffer(bufdesc);
+		int			cursor = StrategyGetCurrentIndex(strategy);
+
+		/* Clean victim buffer and find more to flush opportunistically */
+		do
+		{
+			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+			LWLockRelease(BufferDescriptorGetContentLock(bufdesc));
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &bufdesc->tag);
+			/* We leave the first buffer pinned for the caller */
+			if (!first_buffer)
+				UnpinBuffer(bufdesc);
+			first_buffer = false;
+		} while ((bufdesc = NextStrategyBufToFlush(strategy, sweep_end,
+												   &max_lsn, &cursor)) != NULL);
+	}
+	else
+	{
+		DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
+		LWLockRelease(BufferDescriptorGetContentLock(bufdesc));
+		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+									  &bufdesc->tag);
+	}
+}
+
+/*
+ * Prepare bufdesc for eager flushing.
+ *
+ * Given bufnum, return the buffer descriptor of the buffer to eagerly flush,
+ * pinned and locked, or NULL if this buffer does not contain a block that
+ * should be flushed.
+ *
+ * require is the BlockNumber required by the caller. Some callers may require
+ * a specific BlockNumber to be in bufnum because they are assembling a
+ * contiguous run of blocks.
+ *
+ * If the caller needs the block to be from a specific relation, rlocator will
+ * be provided.
+ */
+static BufferDesc *
+PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
+								RelFileLocator *rlocator, bool skip_pinned,
+								XLogRecPtr *max_lsn)
+{
+	BufferDesc *bufdesc;
+	uint32		old_buf_state;
+	uint32		buf_state;
+	XLogRecPtr	lsn;
+	BlockNumber blknum;
+	LWLock	   *content_lock;
+
+	if (!BufferIsValid(bufnum))
+		return NULL;
+
+	Assert(!BufferIsLocal(bufnum));
+
+	bufdesc = GetBufferDescriptor(bufnum - 1);
+
+	/* Block may need to be in a specific relation */
+	if (rlocator &&
+		!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufdesc->tag),
+							  *rlocator))
+		return NULL;
+
+	/*
+	 * Ensure that there's a free refcount entry and resource owner slot for
+	 * the pin before pinning the buffer. While this may leak a refcount and
+	 * slot if we return without a buffer, we should use that slot the next
+	 * time we try and reserve a spot.
+	 */
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	ReservePrivateRefCountEntry();
+
+	/*
+	 * Check whether the buffer can be used and pin it if so. Do this using a
+	 * CAS loop, to avoid having to lock the buffer header. We have to lock
+	 * the buffer header later if we succeed in pinning the buffer here, but
+	 * avoiding locking the buffer header if the buffer is in use is worth it.
+	 */
+	old_buf_state = pg_atomic_read_u32(&bufdesc->state);
+
+	for (;;)
+	{
+		buf_state = old_buf_state;
+
+		if (!(buf_state & BM_DIRTY) || !(buf_state & BM_VALID))
+			return NULL;
+
+		/* We don't eagerly flush buffers used by others */
+		if (skip_pinned &&
+			(BUF_STATE_GET_REFCOUNT(buf_state) > 0 ||
+			 BUF_STATE_GET_USAGECOUNT(buf_state) > 1))
+			return NULL;
+
+		if (unlikely(buf_state & BM_LOCKED))
+		{
+			old_buf_state = WaitBufHdrUnlocked(bufdesc);
+			continue;
+		}
+
+		/* pin the buffer if the CAS succeeds */
+		buf_state += BUF_REFCOUNT_ONE;
+
+		if (pg_atomic_compare_exchange_u32(&bufdesc->state, &old_buf_state,
+										   buf_state))
+		{
+			TrackNewBufferPin(BufferDescriptorGetBuffer(bufdesc));
+			break;
+		}
+	}
+
+	CheckBufferIsPinnedOnce(bufnum);
+
+	blknum = BufferGetBlockNumber(bufnum);
+	Assert(BlockNumberIsValid(blknum));
+
+	/* We only include contiguous blocks in the run */
+	if (BlockNumberIsValid(require) && blknum != require)
+		goto except_unpin_buffer;
+
+	/* Don't eagerly flush buffers requiring WAL flush */
+	if (BufferNeedsWALFlush(bufdesc, &lsn))
+		goto except_unpin_buffer;
+
+	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+		goto except_unpin_buffer;
+
+	/*
+	 * Now that we have the content lock, we need to recheck if we need to
+	 * flush WAL.
+	 */
+	if (BufferNeedsWALFlush(bufdesc, &lsn))
+		goto except_unlock_content;
+
+	/* Try to start an I/O operation */
+	if (!StartBufferIO(bufdesc, false, true))
+		goto except_unlock_content;
+
+	if (lsn > *max_lsn)
+		*max_lsn = lsn;
+
+	buf_state = LockBufHdr(bufdesc);
+	UnlockBufHdrExt(bufdesc, buf_state, 0, BM_JUST_DIRTIED, 0);
+
+	return bufdesc;
+
+except_unlock_content:
+	LWLockRelease(content_lock);
+
+except_unpin_buffer:
+	UnpinBuffer(bufdesc);
+	return NULL;
 }
 
 /*
@@ -4387,7 +4615,10 @@ DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 
 	/* Force XLOG flush up to buffer's LSN */
 	if (XLogRecPtrIsValid(buffer_lsn))
+	{
+		Assert(pg_atomic_read_u32(&buf->state) & BM_PERMANENT);
 		XLogFlush(buffer_lsn);
+	}
 
 	/*
 	 * Now it's safe to write the buffer to disk. Note that no one else should
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index acbabeb3c3b..4a3009d190c 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -156,6 +156,31 @@ ClockSweepTick(void)
 	return victim;
 }
 
+/*
+ * Some BufferAccessStrategies support eager flushing -- which is flushing
+ * buffers in the ring before they are needed. This can lead to better I/O
+ * patterns than lazily flushing buffers immediately before reusing them.
+ */
+bool
+StrategySupportsEagerFlush(BufferAccessStrategy strategy)
+{
+	Assert(strategy);
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKWRITE:
+			return true;
+		case BAS_VACUUM:
+		case BAS_NORMAL:
+		case BAS_BULKREAD:
+			return false;
+		default:
+			elog(ERROR, "unrecognized buffer access strategy: %d",
+				 (int) strategy->btype);
+			return false;
+	}
+}
+
 /*
  * StrategyGetBuffer
  *
@@ -307,6 +332,29 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	}
 }
 
+/*
+ * Given a position in the ring, cursor, increment the position, and return
+ * the buffer at this position.
+ */
+Buffer
+StrategyNextBuffer(BufferAccessStrategy strategy, int *cursor)
+{
+	if (++(*cursor) >= strategy->nbuffers)
+		*cursor = 0;
+
+	return strategy->buffers[*cursor];
+}
+
+/*
+ * Return the current slot in the strategy ring.
+ */
+int
+StrategyGetCurrentIndex(BufferAccessStrategy strategy)
+{
+	return strategy->current;
+}
+
+
 /*
  * StrategySyncStart -- tell BgBufferSync where to start syncing
  *
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 04fdea64f83..c07e309a288 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -506,6 +506,10 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 
 /* freelist.c */
+extern bool StrategySupportsEagerFlush(BufferAccessStrategy strategy);
+extern Buffer StrategyNextBuffer(BufferAccessStrategy strategy,
+								 int *cursor);
+extern int	StrategyGetCurrentIndex(BufferAccessStrategy strategy);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
-- 
2.43.0
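
Not part of the patch: a minimal standalone sketch of the ring sweep that 0003 performs, assuming the ring's current slot holds the victim buffer. The Toy* names are hypothetical stand-ins, not PostgreSQL types. It only models the cursor logic of StrategyNextBuffer() and NextStrategyBufToFlush(): advance with wraparound, stop on returning to the victim (sweep_end), and stop early at the first invalid slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for PostgreSQL's Buffer type and InvalidBuffer. */
typedef int ToyBuffer;
#define TOY_INVALID_BUFFER 0

/* Toy strategy ring: only the fields the sweep needs. */
typedef struct ToyRing
{
	int			nbuffers;	/* number of slots in the ring */
	int			current;	/* slot of the victim buffer (assumed) */
	const ToyBuffer *buffers;	/* ring contents */
} ToyRing;

/* Advance the cursor by one slot, wrapping at the end of the ring. */
static ToyBuffer
toy_ring_next(const ToyRing *ring, int *cursor)
{
	if (++(*cursor) >= ring->nbuffers)
		*cursor = 0;
	return ring->buffers[*cursor];
}

/*
 * Count the slots an eager-flush sweep would visit after the victim. The
 * sweep ends when it comes back around to the victim or hits an invalid
 * (not yet filled) slot.
 */
static int
toy_sweep_count(const ToyRing *ring)
{
	ToyBuffer	sweep_end;
	ToyBuffer	buf;
	int			cursor = ring->current;
	int			visited = 0;

	assert(ring->nbuffers > 0);
	sweep_end = ring->buffers[ring->current];

	while ((buf = toy_ring_next(ring, &cursor)) != sweep_end)
	{
		if (buf == TOY_INVALID_BUFFER)
			break;
		visited++;
	}
	return visited;
}
```

In the patch itself each visited slot is additionally filtered by PrepareOrRejectEagerFlushBuffer(), so the number of buffers actually flushed can be smaller than the number of slots visited.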

Attachment: v11-0004-Write-combining-for-BAS_BULKWRITE.patch (text/x-patch; charset=US-ASCII)

From 700a5103cb054c0e3ce4f48265972168a6a47286 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 13:42:47 -0400
Subject: [PATCH v11 4/7] Write combining for BAS_BULKWRITE

Implement write combining for users of the bulkwrite buffer access
strategy (e.g. COPY FROM). When the buffer access strategy needs to
clean a buffer for reuse, it already opportunistically flushes some
other buffers. Now, combine any contiguous blocks from the same relation
into larger writes and issue them with smgrwritev().

The performance benefit for COPY FROM is mostly noticeable for multiple
concurrent COPY FROMs because a single COPY FROM is either CPU bound or
bound by WAL writes.

The infrastructure for flushing larger batches of IOs will be reused by
checkpointer and other processes doing writes of dirty data.

XXX: Because this sets in-place checksums for batches, it is not
committable until additional infrastructure goes in place.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_bcWRvRwZUop_d9vzF9nHAiT%2B-uPzkJ%3DS3ShZ1GqeAYOw%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 216 ++++++++++++++++++++++++--
 src/backend/storage/buffer/freelist.c |  23 +++
 src/backend/storage/page/bufpage.c    |  20 +++
 src/backend/utils/probes.d            |   2 +
 src/include/storage/buf_internals.h   |  32 ++++
 src/include/storage/bufpage.h         |   2 +
 src/tools/pgindent/typedefs.list      |   1 +
 7 files changed, 285 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9b39129da42..a10969da77e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -537,7 +537,11 @@ static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 static BufferDesc *NextStrategyBufToFlush(BufferAccessStrategy strategy,
 										  Buffer sweep_end,
 										  XLogRecPtr *lsn, int *sweep_cursor);
-
+static void FindFlushAdjacents(BufferAccessStrategy strategy, Buffer sweep_end,
+							   BufferDesc *batch_start,
+							   uint32 max_batch_size,
+							   BufWriteBatch *batch,
+							   int *sweep_cursor);
 static bool BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn);
 static BufferDesc *PrepareOrRejectEagerFlushBuffer(Buffer bufnum, BlockNumber require,
 												   RelFileLocator *rlocator, bool skip_pinned,
@@ -4318,10 +4322,90 @@ BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
 }
 
 
+
+/*
+ * Given a starting buffer descriptor from a strategy ring that supports eager
+ * flushing, find additional buffers from the ring that can be combined into a
+ * single write batch with the starting buffer.
+ *
+ * max_batch_size is the maximum number of blocks that can be combined into a
+ * single write in general. This function, based on the block number of start,
+ * will determine the maximum IO size for this particular write given how much
+ * of the file remains. max_batch_size is provided by the caller so it doesn't
+ * have to be recalculated for each write.
+ *
+ * batch is an output parameter that this function will fill with the needed
+ * information to issue this IO.
+ *
+ * This function will pin and content lock all of the buffers that it
+ * assembles for the IO batch. The caller is responsible for issuing the IO.
+ */
+static void
+FindFlushAdjacents(BufferAccessStrategy strategy, Buffer sweep_end,
+				   BufferDesc *batch_start,
+				   uint32 max_batch_size,
+				   BufWriteBatch *batch,
+				   int *sweep_cursor)
+{
+	BlockNumber limit;
+
+	Assert(batch_start);
+	batch->bufdescs[0] = batch_start;
+
+	LockBufHdr(batch_start);
+	batch->max_lsn = BufferGetLSN(batch_start);
+	UnlockBufHdr(batch_start);
+
+	batch->start = batch->bufdescs[0]->tag.blockNum;
+	Assert(BlockNumberIsValid(batch->start));
+	batch->n = 1;
+	batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+	batch->rlocator = BufTagGetRelFileLocator(&batch->bufdescs[0]->tag);
+	batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+	limit = smgrmaxcombine(batch->reln, batch->forkno, batch->start);
+	limit = Min(max_batch_size, limit);
+	limit = Min(GetAdditionalPinLimit(), limit);
+
+	/*
+	 * It's possible we're not allowed any more pins or there aren't more
+	 * blocks in the target relation. In this case, just return. Our batch
+	 * will have only one buffer.
+	 */
+	if (limit <= 0)
+		return;
+
+	/* Now assemble a run of blocks to write out. */
+	for (; batch->n < limit; batch->n++)
+	{
+		Buffer		bufnum;
+
+		if ((bufnum =
+			 StrategyNextBuffer(strategy, sweep_cursor)) == sweep_end)
+			break;
+
+		/*
+		 * For BAS_BULKWRITE, once you hit an InvalidBuffer, the remaining
+		 * buffers in the ring will be invalid.
+		 */
+		if (!BufferIsValid(bufnum))
+			break;
+
+		/* Stop when we encounter a buffer that will break the run */
+		if ((batch->bufdescs[batch->n] =
+			 PrepareOrRejectEagerFlushBuffer(bufnum,
+											 batch->start + batch->n,
+											 &batch->rlocator,
+											 true,
+											 &batch->max_lsn)) == NULL)
+			break;
+	}
+}
+
 /*
  * Returns the buffer descriptor of the buffer containing the next block we
  * should eagerly flush or NULL when there are no further buffers to consider
- * writing out.
+ * writing out. This will be the start of a new batch of buffers to write out.
  */
 static BufferDesc *
 NextStrategyBufToFlush(BufferAccessStrategy strategy,
@@ -4367,7 +4451,6 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 				  bool from_ring, IOContext io_context)
 {
 	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
-	bool		first_buffer = true;
 
 	/* Set up this victim buffer to be flushed */
 	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
@@ -4380,18 +4463,23 @@ CleanVictimBuffer(BufferAccessStrategy strategy,
 	{
 		Buffer		sweep_end = BufferDescriptorGetBuffer(bufdesc);
 		int			cursor = StrategyGetCurrentIndex(strategy);
+		uint32		max_batch_size = StrategyMaxWriteBatchSize(strategy);
+
+		/* Pin our victim again so it stays ours even after batch released */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+		IncrBufferRefCount(BufferDescriptorGetBuffer(bufdesc));
 
 		/* Clean victim buffer and find more to flush opportunistically */
 		do
 		{
-			DoFlushBuffer(bufdesc, NULL, IOOBJECT_RELATION, io_context, max_lsn);
-			LWLockRelease(BufferDescriptorGetContentLock(bufdesc));
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &bufdesc->tag);
-			/* We leave the first buffer pinned for the caller */
-			if (!first_buffer)
-				UnpinBuffer(bufdesc);
-			first_buffer = false;
+			BufWriteBatch batch;
+
+			FindFlushAdjacents(strategy, sweep_end, bufdesc, max_batch_size,
+							   &batch, &cursor);
+			FlushBufferBatch(&batch, io_context);
+			/* Content locks released inside CompleteWriteBatchIO */
+			CompleteWriteBatchIO(&batch, io_context, &BackendWritebackContext);
 		} while ((bufdesc = NextStrategyBufToFlush(strategy, sweep_end,
 												   &max_lsn, &cursor)) != NULL);
 	}
@@ -4534,6 +4622,70 @@ except_unpin_buffer:
 	return NULL;
 }
 
+/*
+ * Given a prepared batch of buffers, write them out as a vector.
+ */
+void
+FlushBufferBatch(BufWriteBatch *batch,
+				 IOContext io_context)
+{
+	BlockNumber blknums[MAX_IO_COMBINE_LIMIT];
+	Block		blocks[MAX_IO_COMBINE_LIMIT];
+	instr_time	io_start;
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+
+	if (XLogRecPtrIsValid(batch->max_lsn))
+		XLogFlush(batch->max_lsn);
+
+	if (batch->reln == NULL)
+		batch->reln = smgropen(batch->rlocator, INVALID_PROC_NUMBER);
+
+#ifdef USE_ASSERT_CHECKING
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		XLogRecPtr	lsn;
+
+		Assert(!BufferNeedsWALFlush(batch->bufdescs[i], &lsn));
+	}
+#endif
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_START(batch->forkno,
+											  batch->reln->smgr_rlocator.locator.spcOid,
+											  batch->reln->smgr_rlocator.locator.dbOid,
+											  batch->reln->smgr_rlocator.locator.relNumber,
+											  batch->reln->smgr_rlocator.backend,
+											  batch->n);
+
+	/*
+	 * XXX: All blocks should be copied and then checksummed but doing so
+	 * takes a lot of extra memory and a future patch will eliminate this
+	 * requirement.
+	 */
+	for (BlockNumber i = 0; i < batch->n; i++)
+	{
+		blknums[i] = batch->start + i;
+		blocks[i] = BufHdrGetBlock(batch->bufdescs[i]);
+	}
+
+	PageSetBatchChecksumInplace((Page *) blocks, blknums, batch->n);
+
+	io_start = pgstat_prepare_io_time(track_io_timing);
+
+	smgrwritev(batch->reln, batch->forkno,
+			   batch->start, (const void **) blocks, batch->n, false);
+
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context, IOOP_WRITE,
+							io_start, batch->n, BLCKSZ);
+
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Prepare the buffer with bufdesc for writing. Returns true if the buffer
  * acutally needs writing and false otherwise. lsn returns the buffer's LSN if
@@ -4698,6 +4850,48 @@ FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 	LWLockRelease(BufferDescriptorGetContentLock(buf));
 }
 
+/*
+ * Given a previously initialized batch with buffers that have already been
+ * flushed, terminate the IO on each buffer and then unlock and unpin them.
+ * This assumes all the buffers were locked and pinned. wb_context will be
+ * modified.
+ */
+void
+CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+					 WritebackContext *wb_context)
+{
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+	pgBufferUsage.shared_blks_written += batch->n;
+
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		Buffer		buffer = BufferDescriptorGetBuffer(batch->bufdescs[i]);
+
+		errcallback.arg = batch->bufdescs[i];
+
+		/* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state. */
+		TerminateBufferIO(batch->bufdescs[i], true, 0, true, false);
+		LWLockRelease(BufferDescriptorGetContentLock(batch->bufdescs[i]));
+		ReleaseBuffer(buffer);
+		ScheduleBufferTagForWriteback(wb_context, io_context,
+									  &batch->bufdescs[i]->tag);
+	}
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_DONE(batch->forkno,
+											 batch->reln->smgr_rlocator.locator.spcOid,
+											 batch->reln->smgr_rlocator.locator.dbOid,
+											 batch->reln->smgr_rlocator.locator.relNumber,
+											 batch->reln->smgr_rlocator.backend,
+											 batch->n, batch->start);
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * RelationGetNumberOfBlocksInFork
  *		Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 4a3009d190c..189274fc0c0 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -776,6 +776,29 @@ GetBufferFromRing(BufferAccessStrategy strategy, uint32 *buf_state)
 	return NULL;
 }
 
+
+/*
+ * Determine the largest IO we can assemble from the given strategy ring given
+ * strategy-specific as well as global constraints on the number of pinned
+ * buffers and max IO size.
+ */
+uint32
+StrategyMaxWriteBatchSize(BufferAccessStrategy strategy)
+{
+	uint32		max_write_batch_size = Min(io_combine_limit, MAX_IO_COMBINE_LIMIT);
+	int			strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+	uint32		max_possible_buffer_limit = GetPinLimit();
+
+	/* Identify the minimum of the above */
+	max_write_batch_size = Min(strategy_pin_limit, max_write_batch_size);
+	max_write_batch_size = Min(max_possible_buffer_limit, max_write_batch_size);
+
+	/* Must allow at least 1 IO for forward progress */
+	max_write_batch_size = Max(1, max_write_batch_size);
+
+	return max_write_batch_size;
+}
+
 /*
  * AddBufferToRing -- add a buffer to the buffer ring
  *
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index aac6e695954..7c2ec99f939 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
 	((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
 }
+
+/*
+ * A helper to set checksums for multiple blocks
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, const BlockNumber *blknos, uint32 length)
+{
+	/* If we don't need a checksum, just return */
+	if (!DataChecksumsEnabled())
+		return;
+
+	for (uint32 i = 0; i < length; i++)
+	{
+		Page		page = pages[i];
+
+		if (PageIsNew(page))
+			continue;
+		((PageHeader) page)->pd_checksum = pg_checksum_page(page, blknos[i]);
+	}
+}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index e9e413477ba..36dd4f8375b 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -61,6 +61,8 @@ provider postgresql {
 	probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
 	probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
 	probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
+	probe buffer__batch__flush__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
+	probe buffer__batch__flush__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
 
 	probe buffer__checkpoint__start(int);
 	probe buffer__checkpoint__sync__start();
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index c07e309a288..ab502c4f825 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -483,6 +483,34 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 	ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
 }
 
+/*
+ * Used to write out multiple blocks at a time in a combined IO. bufdescs
+ * contains buffer descriptors for buffers containing adjacent blocks of the
+ * same fork of the same relation.
+ */
+typedef struct BufWriteBatch
+{
+	RelFileLocator rlocator;
+	ForkNumber	forkno;
+	SMgrRelation reln;
+
+	/*
+	 * The BlockNumber of the first block in the run of contiguous blocks to
+	 * be written out as a single IO.
+	 */
+	BlockNumber start;
+
+	/*
+	 * While assembling the buffers, we keep track of the maximum LSN so that
+	 * we can flush WAL through this LSN before flushing the buffers.
+	 */
+	XLogRecPtr	max_lsn;
+
+	/* The number of valid buffers in bufdescs */
+	uint32		n;
+	BufferDesc *bufdescs[MAX_IO_COMBINE_LIMIT];
+} BufWriteBatch;
+
 /*
  * Internal buffer management routines
  */
@@ -496,6 +524,7 @@ extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
 										  IOContext io_context, BufferTag *tag);
+extern void FlushBufferBatch(BufWriteBatch *batch, IOContext io_context);
 
 extern void TrackNewBufferPin(Buffer buf);
 
@@ -507,9 +536,12 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 /* freelist.c */
 extern bool StrategySupportsEagerFlush(BufferAccessStrategy strategy);
+extern uint32 StrategyMaxWriteBatchSize(BufferAccessStrategy strategy);
 extern Buffer StrategyNextBuffer(BufferAccessStrategy strategy,
 								 int *cursor);
 extern int	StrategyGetCurrentIndex(BufferAccessStrategy strategy);
+extern void CompleteWriteBatchIO(BufWriteBatch *batch, IOContext io_context,
+								 WritebackContext *wb_context);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index abc2cf2a020..29a400a71eb 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -506,5 +506,7 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 									const void *newtup, Size newsize);
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern void PageSetBatchChecksumInplace(Page *pages, const BlockNumber *blknos,
+										uint32 length);
 
 #endif							/* BUFPAGE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c751c25a04d..2132a5b4b4b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -350,6 +350,7 @@ BufferManagerRelation
 BufferStrategyControl
 BufferTag
 BufferUsage
+BufWriteBatch
 BuildAccumulator
 BuiltinScript
 BulkInsertState
-- 
2.43.0
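
Not part of the patch: a rough standalone sketch of the two sizing decisions in 0004, assuming plain integers in place of the real limits. toy_max_write_batch_size() mirrors the clamping in StrategyMaxWriteBatchSize() (minimum of the combine limit and the two pin limits, but never below 1). toy_run_length() mirrors the contiguity rule in FindFlushAdjacents(): the batch grows only while the next candidate holds block start + n. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ToyBlockNumber;

static uint32_t
toy_min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/*
 * Largest batch we may assemble: bounded by the IO combine limit, the
 * strategy's pin limit, and the global pin limit. At least 1 so that the
 * victim buffer itself can always be written.
 */
static uint32_t
toy_max_write_batch_size(uint32_t io_combine_limit,
						 uint32_t strategy_pin_limit,
						 uint32_t global_pin_limit)
{
	uint32_t	limit = toy_min_u32(io_combine_limit, strategy_pin_limit);

	limit = toy_min_u32(limit, global_pin_limit);
	return limit > 0 ? limit : 1;
}

/*
 * Length of the contiguous run starting at blocks[0], given candidate block
 * numbers in ring order. The run stops at the first block that is not
 * exactly one past the previous one, or when limit is reached.
 */
static uint32_t
toy_run_length(const ToyBlockNumber *blocks, uint32_t nblocks, uint32_t limit)
{
	uint32_t	n;

	assert(nblocks > 0);
	n = 1;						/* the starting block is always included */
	while (n < limit && n < nblocks && blocks[n] == blocks[0] + n)
		n++;
	return n;
}
```

The real function also requires each candidate to be from the same relation, dirty, not in use by others, and not in need of a WAL flush; the sketch leaves those checks out.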

Attachment: v11-0005-Add-database-Oid-to-CkptSortItem.patch (text/x-patch; charset=US-ASCII)

From d346ca112b6c676f1a2d5e1f743d2cc89bba7bb0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 2 Sep 2025 15:22:11 -0400
Subject: [PATCH v11 5/7] Add database Oid to CkptSortItem

Checkpointer write combining, which will be added in a future commit,
needs the database Oid to build the RelFileLocator for each batch and to
end a batch at a database boundary.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 8 ++++++++
 src/include/storage/buf_internals.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a10969da77e..68f6d4f2f45 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3405,6 +3405,7 @@ BufferSync(int flags)
 			item = &CkptBufferIds[num_to_scan++];
 			item->buf_id = buf_id;
 			item->tsId = bufHdr->tag.spcOid;
+			item->dbId = bufHdr->tag.dbOid;
 			item->relNumber = BufTagGetRelNumber(&bufHdr->tag);
 			item->forkNum = BufTagGetForkNum(&bufHdr->tag);
 			item->blockNum = bufHdr->tag.blockNum;
@@ -6823,6 +6824,13 @@ ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b)
 		return -1;
 	else if (a->tsId > b->tsId)
 		return 1;
+
+	/* compare database */
+	if (a->dbId < b->dbId)
+		return -1;
+	else if (a->dbId > b->dbId)
+		return 1;
+
 	/* compare relation */
 	if (a->relNumber < b->relNumber)
 		return -1;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index ab502c4f825..feb370175f0 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -449,6 +449,7 @@ extern uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 typedef struct CkptSortItem
 {
 	Oid			tsId;
+	Oid			dbId;
 	RelFileNumber relNumber;
 	ForkNumber	forkNum;
 	BlockNumber blockNum;
-- 
2.43.0
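
The ordering that the patch above gives ckpt_buforder_comparator() can be sketched outside the server as follows. This is only an illustration under stated assumptions: SortItem, CMP_FIELD, sort_item_cmp and sort_items are hypothetical stand-ins, the field types are simplified, and qsort() replaces the sort the server actually uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for CkptSortItem; field types are simplified. */
typedef struct SortItem
{
	uint32_t	tsId;
	uint32_t	dbId;		/* the field added by this patch */
	uint32_t	relNumber;
	int			forkNum;
	uint32_t	blockNum;
} SortItem;

/* Return from the comparator as soon as one field differs. */
#define CMP_FIELD(a, b, f) \
	do { \
		if ((a)->f < (b)->f) return -1; \
		if ((a)->f > (b)->f) return 1; \
	} while (0)

/* Order by tablespace, database, relation, fork, then block. */
static int
sort_item_cmp(const void *pa, const void *pb)
{
	const SortItem *a = pa;
	const SortItem *b = pb;

	CMP_FIELD(a, b, tsId);
	CMP_FIELD(a, b, dbId);
	CMP_FIELD(a, b, relNumber);
	CMP_FIELD(a, b, forkNum);
	CMP_FIELD(a, b, blockNum);
	return 0;
}

/* Sort an array of items in place (qsort is an assumption of this sketch). */
static void
sort_items(SortItem *items, size_t n)
{
	assert(n == 0 || items != NULL);
	qsort(items, n, sizeof(SortItem), sort_item_cmp);
}
```

With the database compared before the relation, blocks of one relation fork end up adjacent and in block order, even when two databases in the same tablespace happen to use the same relNumber.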

v11-0006-Implement-checkpointer-data-write-combining.patch (text/x-patch; charset=US-ASCII)

From 9302f6c197275f8d0dd38f491722a293e82105f1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 15:23:16 -0400
Subject: [PATCH v11 6/7] Implement checkpointer data write combining

When the checkpointer writes out dirty buffers, writing multiple
contiguous blocks as a single IO is a substantial performance
improvement. The checkpointer is usually bottlenecked on IO, so issuing
larger IOs leads to increased write throughput and faster checkpoints.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Soumya <bharatdbpg@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 224 ++++++++++++++++++++++++----
 src/backend/utils/probes.d          |   2 +-
 2 files changed, 198 insertions(+), 28 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 68f6d4f2f45..8cc2fc06646 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -513,6 +513,7 @@ static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
 static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
+static uint32 CheckpointerMaxBatchSize(void);
 static void BufferSync(int flags);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
@@ -3346,7 +3347,6 @@ TrackNewBufferPin(Buffer buf)
 static void
 BufferSync(int flags)
 {
-	uint32		buf_state;
 	int			buf_id;
 	int			num_to_scan;
 	int			num_spaces;
@@ -3358,6 +3358,8 @@ BufferSync(int flags)
 	int			i;
 	uint32		mask = BM_DIRTY;
 	WritebackContext wb_context;
+	uint32		max_batch_size;
+	BufWriteBatch batch;
 
 	/*
 	 * Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3389,6 +3391,7 @@ BufferSync(int flags)
 	{
 		BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
 		uint32		set_bits = 0;
+		uint32		buf_state;
 
 		/*
 		 * Header spinlock is enough to examine BM_DIRTY, see comment in
@@ -3531,48 +3534,199 @@ BufferSync(int flags)
 	 */
 	num_processed = 0;
 	num_written = 0;
+	max_batch_size = CheckpointerMaxBatchSize();
 	while (!binaryheap_empty(ts_heap))
 	{
+		BlockNumber limit = max_batch_size;
 		BufferDesc *bufHdr = NULL;
 		CkptTsStatus *ts_stat = (CkptTsStatus *)
 			DatumGetPointer(binaryheap_first(ts_heap));
+		int			ts_end = ts_stat->index - ts_stat->num_scanned + ts_stat->num_to_scan;
+		int			processed = 0;
 
-		buf_id = CkptBufferIds[ts_stat->index].buf_id;
-		Assert(buf_id != -1);
+		batch.start = InvalidBlockNumber;
+		batch.max_lsn = InvalidXLogRecPtr;
+		batch.n = 0;
 
-		bufHdr = GetBufferDescriptor(buf_id);
+		while (batch.n < limit)
+		{
+			uint32		buf_state;
+			XLogRecPtr	lsn = InvalidXLogRecPtr;
+			LWLock	   *content_lock;
+			CkptSortItem item;
 
-		num_processed++;
+			if (ProcSignalBarrierPending)
+				ProcessProcSignalBarrier();
 
-		/*
-		 * We don't need to acquire the lock here, because we're only looking
-		 * at a single bit. It's possible that someone else writes the buffer
-		 * and clears the flag right after we check, but that doesn't matter
-		 * since SyncOneBuffer will then do nothing.  However, there is a
-		 * further race condition: it's conceivable that between the time we
-		 * examine the bit here and the time SyncOneBuffer acquires the lock,
-		 * someone else not only wrote the buffer but replaced it with another
-		 * page and dirtied it.  In that improbable case, SyncOneBuffer will
-		 * write the buffer though we didn't need to.  It doesn't seem worth
-		 * guarding against this, though.
-		 */
-		if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
-		{
-			if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+			/* Check if we are done with this tablespace */
+			if (ts_stat->index + processed >= ts_end)
+				break;
+
+			item = CkptBufferIds[ts_stat->index + processed];
+
+			buf_id = item.buf_id;
+			Assert(buf_id != -1);
+
+			bufHdr = GetBufferDescriptor(buf_id);
+
+			/*
+			 * If this is the first block of the batch, then check if we need
+			 * to open a new relation. Open the relation now because we have
+			 * to determine the maximum IO size based on how many blocks
+			 * remain in the file.
+			 */
+			if (!BlockNumberIsValid(batch.start))
 			{
-				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-				PendingCheckpointerStats.buffers_written++;
-				num_written++;
+				Assert(batch.max_lsn == InvalidXLogRecPtr && batch.n == 0);
+				batch.rlocator.spcOid = item.tsId;
+				batch.rlocator.dbOid = item.dbId;
+				batch.rlocator.relNumber = item.relNumber;
+				batch.forkno = item.forkNum;
+				batch.start = item.blockNum;
+				batch.reln = smgropen(batch.rlocator, INVALID_PROC_NUMBER);
+				limit = smgrmaxcombine(batch.reln, batch.forkno, batch.start);
+				limit = Min(max_batch_size, limit);
+				limit = Min(GetAdditionalPinLimit(), limit);
+				/* Guarantee progress */
+				limit = Max(limit, 1);
 			}
+
+			/*
+			 * Once we hit blocks from the next relation or fork of the
+			 * relation, break out of the loop and issue the IO we've built up
+			 * so far. It is important that we don't increment processed
+			 * because we want to start the next IO with this item.
+			 */
+			if (item.dbId != batch.rlocator.dbOid)
+				break;
+
+			if (item.relNumber != batch.rlocator.relNumber)
+				break;
+
+			if (item.forkNum != batch.forkno)
+				break;
+
+			Assert(item.tsId == batch.rlocator.spcOid);
+
+			/*
+			 * If the next block is not contiguous, we can't include it in the
+			 * IO we will issue. Break out of the loop and issue what we have
+			 * so far. Do not count this item as processed -- otherwise we
+			 * will end up skipping it.
+			 */
+			if (item.blockNum != batch.start + batch.n)
+				break;
+
+			/*
+			 * We don't need to acquire the lock here, because we're only
+			 * looking at a few bits. It's possible that someone else writes
+			 * the buffer and clears the flag right after we check, but that
+			 * doesn't matter since StartBufferIO will then return false.
+			 *
+			 * If the buffer doesn't need checkpointing, don't include it in
+			 * the batch we are building. And if the buffer doesn't need
+			 * flushing, we're done with the item, so count it as processed
+			 * and break out of the loop to issue the IO so far.
+			 */
+			buf_state = pg_atomic_read_u32(&bufHdr->state);
+			if ((buf_state & (BM_CHECKPOINT_NEEDED | BM_VALID | BM_DIRTY)) !=
+				(BM_CHECKPOINT_NEEDED | BM_VALID | BM_DIRTY))
+			{
+				processed++;
+				break;
+			}
+
+			ReservePrivateRefCountEntry();
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+			PinBuffer(bufHdr, NULL, false);
+
+			/*
+			 * There is a race condition here: it's conceivable that between
+			 * the time we examine the buffer header for BM_CHECKPOINT_NEEDED
+			 * above and the time we acquire the content lock below, someone
+			 * else not only wrote the buffer but replaced it with another
+			 * page and dirtied it.  In that improbable case, we will write
+			 * the buffer though we didn't need to.  It doesn't seem worth
+			 * guarding against this, though.
+			 */
+			content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+			/*
+			 * We are willing to wait for the content lock on the first buffer
+			 * in the batch. However, for later buffers, waiting could lead to
+			 * deadlock. We have to eventually flush all eligible buffers,
+			 * though. So, if we fail to acquire the lock on a subsequent
+			 * buffer, we break out and issue the IO we've built up so far.
+			 * Then we come back and start a new IO with that buffer as the
+			 * starting buffer. As such, we must not count the item as
+			 * processed if we end up failing to acquire the content lock.
+			 */
+			if (batch.n == 0)
+				LWLockAcquire(content_lock, LW_SHARED);
+			else if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				UnpinBuffer(bufHdr);
+				break;
+			}
+
+			/*
+			 * If the buffer doesn't need IO, count the item as processed,
+			 * release the buffer, and break out of the loop to issue the IO
+			 * we have built up so far.
+			 */
+			if (!StartBufferIO(bufHdr, false, true))
+			{
+				processed++;
+				LWLockRelease(content_lock);
+				UnpinBuffer(bufHdr);
+				break;
+			}
+
+			/*
+			 * Take the buffer header lock before examining the LSN because we
+			 * only have a shared lock on the buffer.
+			 */
+			buf_state = LockBufHdr(bufHdr);
+			lsn = BufferGetLSN(bufHdr);
+			UnlockBufHdrExt(bufHdr, buf_state, 0, BM_JUST_DIRTIED, 0);
+
+			/*
+			 * Keep track of the max LSN so that we can be sure to flush
+			 * enough WAL before flushing data from the buffers. See comment
+			 * in DoFlushBuffer() for more on why we don't consider the LSNs
+			 * of unlogged relations.
+			 */
+			if (buf_state & BM_PERMANENT && lsn > batch.max_lsn)
+				batch.max_lsn = lsn;
+
+			batch.bufdescs[batch.n++] = bufHdr;
+			processed++;
 		}
 
 		/*
 		 * Measure progress independent of actually having to flush the buffer
-		 * - otherwise writing become unbalanced.
+		 * - otherwise writing becomes unbalanced.
 		 */
-		ts_stat->progress += ts_stat->progress_slice;
-		ts_stat->num_scanned++;
-		ts_stat->index++;
+		num_processed += processed;
+		ts_stat->progress += ts_stat->progress_slice * processed;
+		ts_stat->num_scanned += processed;
+		ts_stat->index += processed;
+
+		/*
+		 * If we built up an IO, issue it. There's a chance we didn't find any
+		 * items referencing buffers that needed flushing this time, but we
+		 * still want to check if we should update the heap if we examined and
+		 * processed the items.
+		 */
+		if (batch.n > 0)
+		{
+			FlushBufferBatch(&batch, IOCONTEXT_NORMAL);
+			CompleteWriteBatchIO(&batch, IOCONTEXT_NORMAL, &wb_context);
+
+			TRACE_POSTGRESQL_BUFFER_BATCH_SYNC_WRITTEN(batch.n);
+			PendingCheckpointerStats.buffers_written += batch.n;
+			num_written += batch.n;
+		}
 
 		/* Have all the buffers from the tablespace been processed? */
 		if (ts_stat->num_scanned == ts_stat->num_to_scan)
@@ -6421,6 +6575,22 @@ IsBufferCleanupOK(Buffer buffer)
 	return false;
 }
 
+/*
+ * The maximum number of blocks that can be written out in a single batch by
+ * the checkpointer.
+ */
+static uint32
+CheckpointerMaxBatchSize(void)
+{
+	uint32		result;
+	uint32		pin_limit = GetPinLimit();
+
+	result = Min(pin_limit, io_combine_limit);
+	result = Min(result, MAX_IO_COMBINE_LIMIT);
+	result = Max(result, 1);
+	return result;
+}
+
 
 /*
  *	Functions for buffer I/O handling
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 36dd4f8375b..d6970731ba9 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -68,7 +68,7 @@ provider postgresql {
 	probe buffer__checkpoint__sync__start();
 	probe buffer__checkpoint__done();
 	probe buffer__sync__start(int, int);
-	probe buffer__sync__written(int);
+	probe buffer__batch__sync__written(BlockNumber);
 	probe buffer__sync__done(int, int, int);
 
 	probe deadlock__found();
-- 
2.43.0
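
The batching rule in the BufferSync() loop above can be sketched as a standalone function. This is a simplified illustration, not the patch's code: Item, needs_write and build_batch are hypothetical names, and pinning, content locks, StartBufferIO() and LSN tracking are all left out. It keeps only the rules for when an item joins the batch and when it counts as processed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one sorted checkpoint item. */
typedef struct Item
{
	uint32_t	dbId;
	uint32_t	relNumber;
	int			forkNum;
	uint32_t	blockNum;
	int			needs_write;	/* stands in for the buffer state checks */
} Item;

/*
 * Build one batch starting at items[0].  Returns how many items were
 * consumed ("processed"); *batch_start and *batch_len describe the
 * contiguous run of blocks to write (batch_len may be 0).
 */
static size_t
build_batch(const Item *items, size_t nitems, uint32_t limit,
			uint32_t *batch_start, uint32_t *batch_len)
{
	size_t		processed = 0;

	assert(limit >= 1);			/* guarantee progress */
	*batch_len = 0;
	*batch_start = (nitems > 0) ? items[0].blockNum : 0;

	while (*batch_len < limit && processed < nitems)
	{
		const Item *first = &items[0];
		const Item *item = &items[processed];

		/* A different database, relation or fork starts the next batch. */
		if (item->dbId != first->dbId ||
			item->relNumber != first->relNumber ||
			item->forkNum != first->forkNum)
			break;

		/* A gap in block numbers also ends the batch; keep the item. */
		if (item->blockNum != *batch_start + *batch_len)
			break;

		/* Nothing to write: consume the item and issue what we have. */
		if (!item->needs_write)
		{
			processed++;
			break;
		}

		(*batch_len)++;
		processed++;
	}

	return processed;
}
```

A caller would advance its index by the return value and issue one write of batch_len blocks whenever batch_len is nonzero, much as the loop above advances ts_stat->index by processed.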

v11-0007-Refactor-SyncOneBuffer-for-bgwriter-use-only.patch (text/x-patch; charset=US-ASCII)

From c614f3edcc135cad62bec92939469970389e752f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 15 Oct 2025 16:16:58 -0400
Subject: [PATCH v11 7/7] Refactor SyncOneBuffer for bgwriter use only

Since xxx, only bgwriter uses SyncOneBuffer, so we can remove the
skip_recently_used parameter and make that behavior the default.

While we are at it, 5e89985928795f243 introduced the pattern of using a
CAS loop instead of locking the buffer header and then calling
PinBuffer_Locked(). Do that in SyncOneBuffer() so we can avoid taking
the buffer header spinlock in the common case that the buffer is
recently used.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 96 +++++++++++++++++------------
 1 file changed, 56 insertions(+), 40 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8cc2fc06646..73f2594207f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -515,8 +515,7 @@ static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
 static uint32 CheckpointerMaxBatchSize(void);
 static void BufferSync(int flags);
-static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
-						  WritebackContext *wb_context);
+static int	SyncOneBuffer(int buf_id, WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
 static void AbortBufferIO(Buffer buffer);
 static void shared_buffer_write_error_callback(void *arg);
@@ -4000,8 +3999,7 @@ BgBufferSync(WritebackContext *wb_context)
 	/* Execute the LRU scan */
 	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
 	{
-		int			sync_state = SyncOneBuffer(next_to_clean, true,
-											   wb_context);
+		int			sync_state = SyncOneBuffer(next_to_clean, wb_context);
 
 		if (++next_to_clean >= NBuffers)
 		{
@@ -4064,8 +4062,8 @@ BgBufferSync(WritebackContext *wb_context)
 /*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
- * If skip_recently_used is true, we don't write currently-pinned buffers, nor
- * buffers marked recently used, as these are not replacement candidates.
+ * We don't write currently-pinned buffers, nor buffers marked recently used,
+ * as these are not replacement candidates.
  *
  * Returns a bitmask containing the following flag bits:
  *	BUF_WRITTEN: we wrote the buffer.
@@ -4076,53 +4074,71 @@ BgBufferSync(WritebackContext *wb_context)
  * after locking it, but we don't care all that much.)
  */
 static int
-SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
+SyncOneBuffer(int buf_id, WritebackContext *wb_context)
 {
 	BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
 	int			result = 0;
+	uint32		old_buf_state;
 	uint32		buf_state;
 	BufferTag	tag;
 
-	/* Make sure we can handle the pin */
-	ReservePrivateRefCountEntry();
-	ResourceOwnerEnlarge(CurrentResourceOwner);
-
 	/*
-	 * Check whether buffer needs writing.
-	 *
-	 * We can make this check without taking the buffer content lock so long
-	 * as we mark pages dirty in access methods *before* logging changes with
-	 * XLogInsert(): if someone marks the buffer dirty just after our check we
-	 * don't worry because our checkpoint.redo points before log record for
-	 * upcoming changes and so we are not required to write such dirty buffer.
+	 * Check whether the buffer can be used and pin it if so. Do this using a
+	 * CAS loop, to avoid having to lock the buffer header.
 	 */
-	buf_state = LockBufHdr(bufHdr);
-
-	if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
-		BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
+	old_buf_state = pg_atomic_read_u32(&bufHdr->state);
+	for (;;)
 	{
+		buf_state = old_buf_state;
+
+		/*
+		 * We can make these checks without taking the buffer content lock so
+		 * long as we mark pages dirty in access methods *before* logging
+		 * changes with XLogInsert(): if someone marks the buffer dirty just
+		 * after our check we don't worry because our checkpoint.redo points
+		 * before the log record for the upcoming changes, and so we are not
+		 * required to write such a dirty buffer.
+		 */
+		if (BUF_STATE_GET_REFCOUNT(buf_state) != 0 ||
+			BUF_STATE_GET_USAGECOUNT(buf_state) != 0)
+		{
+			/* Don't write recently-used buffers */
+			return result;
+		}
+
 		result |= BUF_REUSABLE;
-	}
-	else if (skip_recently_used)
-	{
-		/* Caller told us not to write recently-used buffers */
-		UnlockBufHdr(bufHdr);
-		return result;
-	}
 
-	if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
-	{
-		/* It's clean, so nothing to do */
-		UnlockBufHdr(bufHdr);
-		return result;
+		if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
+		{
+			/* It's clean, so nothing to do */
+			return result;
+		}
+
+		if (unlikely(buf_state & BM_LOCKED))
+		{
+			old_buf_state = WaitBufHdrUnlocked(bufHdr);
+			continue;
+		}
+
+		/* Make sure we can handle the pin */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+
+		/* pin the buffer if the CAS succeeds */
+		buf_state += BUF_REFCOUNT_ONE;
+
+		if (pg_atomic_compare_exchange_u32(&bufHdr->state, &old_buf_state,
+										   buf_state))
+		{
+			TrackNewBufferPin(BufferDescriptorGetBuffer(bufHdr));
+			break;
+		}
 	}
 
 	/*
-	 * Pin it, share-lock it, write it.  (FlushBuffer will do nothing if the
-	 * buffer is clean by the time we've locked it.)
+	 * Share-lock it and write it out.  (FlushBuffer will do nothing if the
+	 * buffer is clean by the time we've locked it.)
 	 */
-	PinBuffer_Locked(bufHdr);
-
 	FlushUnlockedBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
 
 	tag = bufHdr->tag;
@@ -4130,8 +4146,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 	UnpinBuffer(bufHdr);
 
 	/*
-	 * SyncOneBuffer() is only called by checkpointer and bgwriter, so
-	 * IOContext will always be IOCONTEXT_NORMAL.
+	 * SyncOneBuffer() is only called by bgwriter, so IOContext will always be
+	 * IOCONTEXT_NORMAL.
 	 */
 	ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag);
 
-- 
2.43.0
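
The CAS-loop pinning pattern used in SyncOneBuffer() above can be illustrated with C11 atomics. This is a minimal sketch under assumptions, not PostgreSQL's implementation: the state layout, the REFCOUNT_* and FLAG_* macros, and pin_if_idle_and_dirty are hypothetical, and the usage-count check, the BM_LOCKED wait and the refcount bookkeeping are omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state layout: low 16 bits hold the refcount, flags above. */
#define REFCOUNT_ONE	1u
#define REFCOUNT_MASK	0x0000FFFFu
#define FLAG_DIRTY		(1u << 16)
#define FLAG_VALID		(1u << 17)

/*
 * Pin the buffer only if it is unpinned, valid and dirty.  Returns true
 * if the pin was taken.  No lock is held; the compare-and-swap succeeds
 * only if the state is still the value the checks were made against.
 */
static bool
pin_if_idle_and_dirty(_Atomic uint32_t *state)
{
	uint32_t	old_state;

	assert(state != NULL);
	old_state = atomic_load(state);

	for (;;)
	{
		uint32_t	new_state = old_state;

		/* Someone else has it pinned: not a candidate. */
		if ((old_state & REFCOUNT_MASK) != 0)
			return false;

		/* Clean or invalid: nothing to write. */
		if (!(old_state & FLAG_VALID) || !(old_state & FLAG_DIRTY))
			return false;

		new_state += REFCOUNT_ONE;

		/* On failure old_state is reloaded with the current value; retry. */
		if (atomic_compare_exchange_weak(state, &old_state, new_state))
			return true;
	}
}
```

Because every check is re-run against the refreshed value after a failed exchange, the early-return paths never need to release anything, which is why the patch can drop the UnlockBufHdr() calls on those paths.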