Streaming I/O, vectored I/O (WIP)

Started by Thomas Munroover 2 years ago67 messages

thomas.munro@gmail.com

over 2 years ago

14 attachment(s)

Hi,

Currently PostgreSQL reads (and writes) data files 8KB at a time.
That's because we call ReadBuffer() one block at a time, with no
opportunity for lower layers to do better than that. This thread is
about a model where you say which block you'll want next with a
callback, and then you pull the buffers out of a "stream". That way,
the streaming infrastructure can look as far into the future as it
wants, and then:

* systematically issue POSIX_FADV_WILLNEED for random access,
replacing patchy ad hoc advice
* build larger vectored I/Os; eg one preadv() call can replace 16 pread() calls

That's more efficient, and it goes faster. It's better even on
systems without 'advice' and/or vectored I/O support, because some
I/Os can be merged into wider simple pread/pwrite calls, and various
other small efficiencies come from batching.

The real goal, though, is to make it easier for later work to replace
the I/O subsystem with true asynchronous and concurrent I/O, as
required to get decent performance with direct I/O (and, at a wild
guess, the magic network smgr replacements that many of our colleagues
on this list work on). Client code such as access methods wouldn't
need to change again to benefit from that, as it would be fully
insulated by the streaming abstraction.

There are more kinds of streaming I/O that would be useful, such as
raw unbuffered files, and of course writes, and I've attached some
early incomplete demo code for writes (just for fun), but the main
idea I want to share in this thread is the idea of replacing lots of
ReadBuffer() calls with the streaming model. That's the thing with
the most potential users throughout the source tree and AMs, and I've
attached some work-in-progress examples of half a dozen use cases.

=== 1. Vectored I/O through the layers ===

* Provide vectored variants of FileRead() and FileWrite().
* Provide vectored variants of smgrread() and smgrwrite().
* Provide vectored variant of ReadBuffer().
* Provide multi-block smgrprefetch().

=== 2. Streaming read API ===

* Give SMgrRelation pointers a well-defined lifetime.
* Provide basic streaming read API.

=== 3. Example users of streaming read API ===

* Use streaming reads in pg_prewarm. [TM]
* WIP: Use streaming reads in heapam scans. [AF]
* WIP: Use streaming reads in vacuum. [AF]
* WIP: Use streaming reads in nbtree vacuum scan. [AF]
* WIP: Use streaming reads in bitmap heapscan. [MP]
* WIP: Use streaming reads in recovery. [TM]

=== 4. Some less developed work on vectored writes ===

* WIP: Provide vectored variant of FlushBuffer().
* WIP: Use vectored writes in checkpointer.

All of these are WIP; those marked WIP above are double-WIP. But
there's enough to demo the concept and discuss. Here are some
assorted notes:

* probably need to split block-count and I/O-count in stats system?
* streaming needs to "ramp up", instead of going straight to big reads
* the buffer pin limit is somewhat primitive
* more study of buffer pool correctness required
* 16 block/128KB size limit is not exactly arbitrary but not well
researched (by me at least)
* various TODOs in user patches

A bit about where this code came from and how it relates to the "AIO"
project[1]https://wiki.postgresql.org/wiki/AIO: The idea and terminology 'streaming I/O' are due to
Andres Freund. This implementation of it is mine, and to keep this
mailing list fun, he hasn't reviewed it yet. The example user patches
are by Andres, Melanie Plageman and myself, and were cherry picked
from the AIO branch, where they originally ran on top of Andres's
truly asynchronous 'streaming read', which is completely different
code. It has (or will have) exactly the same API, but it does much
more, with much more infrastructure. But the AIO branch is far too
much to propose at once.

We might have been a little influenced by a recent discussion on
pgsql-performance[2]/messages/by-id/218fa2e0-bc58-e469-35dd-c5cb35906064@gmx.net that I could summarise as "why do you guys need
to do all this fancy AIO stuff, just give me bigger reads!". That was
actually a bit of a special case, I think (something is wrong with
btrfs's prefetch heuristics?), but in conversation we realised that
converting parts of PostgreSQL over to a stream-oriented model could
be done independently of AIO, and could offer some nice incremental
benefits already. So I worked on producing this code with an
identical API that just maps on to old fashioned synchronous I/O
calls, except bigger and better.

The "example user" patches would be proposed separately in their own
threads after some more work, but I wanted to demonstrate the wide
applicability of this style of API in this preview. Some of these
make use of the ability to attach a bit of extra data to each buffer
-- see Melanie's bitmap heapscan patch, for example. In later
revisions I'll probably just pick one or two examples to work with for
a smaller core patch set, and then the rest can be developed
separately. (We thought about btree scans too as a nice high value
area to tackle, but Tomas Vondra is hacking in that area and we didn't
want to step on his toes.)

[1]: https://wiki.postgresql.org/wiki/AIO
[2]: /messages/by-id/218fa2e0-bc58-e469-35dd-c5cb35906064@gmx.net

Attachments:

v1-0011-WIP-Use-streaming-reads-in-bitmap-heapscan.patchtext/x-patch; charset=US-ASCII; name=v1-0011-WIP-Use-streaming-reads-in-bitmap-heapscan.patchDownload

From 5860ccb743a3981ade55e1d394d43bde5cc9d342 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 22 Aug 2023 16:48:30 +1200
Subject: [PATCH v1 11/14] WIP: Use streaming reads in bitmap heapscan.

XXX Cherry-picked from
https://github.com/melanieplageman/postgres/tree/bhs_aio and lightly
modified by TM, for demonstration purposes.

Author: Melanie Plageman <melanieplageman@gmail.com>
---
 src/backend/access/gin/ginget.c           |  17 +-
 src/backend/access/gin/ginscan.c          |   7 +
 src/backend/access/heap/heapam.c          |  76 ++++
 src/backend/access/heap/heapam_handler.c  |  78 ++--
 src/backend/commands/explain.c            |  21 +-
 src/backend/executor/nodeBitmapHeapscan.c | 521 ++--------------------
 src/backend/nodes/tidbitmap.c             |  76 ++--
 src/include/access/heapam.h               |   7 +
 src/include/access/relscan.h              |   8 +
 src/include/access/tableam.h              |  71 ++-
 src/include/executor/nodeBitmapHeapscan.h |   1 +
 src/include/nodes/execnodes.h             |  39 +-
 src/include/nodes/tidbitmap.h             |   4 +-
 13 files changed, 289 insertions(+), 637 deletions(-)

diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 1f0214498c..c6a45a4e03 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -375,7 +375,10 @@ restartScanEntry:
 			if (entry->matchBitmap)
 			{
 				if (entry->matchIterator)
+				{
 					tbm_end_iterate(entry->matchIterator);
+					pfree(entry->matchResult);
+				}
 				entry->matchIterator = NULL;
 				tbm_free(entry->matchBitmap);
 				entry->matchBitmap = NULL;
@@ -388,6 +391,9 @@ restartScanEntry:
 		if (entry->matchBitmap && !tbm_is_empty(entry->matchBitmap))
 		{
 			entry->matchIterator = tbm_begin_iterate(entry->matchBitmap);
+			entry->matchResult = (TBMIterateResult *)
+				palloc0(sizeof(TBMIterateResult) + MaxHeapTuplesPerPage *
+						sizeof(OffsetNumber));
 			entry->isFinished = false;
 		}
 	}
@@ -825,21 +831,24 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
 		{
 			/*
 			 * If we've exhausted all items on this block, move to next block
-			 * in the bitmap.
+			 * in the bitmap. tbm_iterate() sets matchResult->blockno to
+			 * InvalidBlockNumber when the bitmap is exhausted.
 			 */
-			while (entry->matchResult == NULL ||
+			while ((!BlockNumberIsValid(entry->matchResult->blockno)) ||
 				   (entry->matchResult->ntuples >= 0 &&
 					entry->offset >= entry->matchResult->ntuples) ||
 				   entry->matchResult->blockno < advancePastBlk ||
 				   (ItemPointerIsLossyPage(&advancePast) &&
 					entry->matchResult->blockno == advancePastBlk))
 			{
-				entry->matchResult = tbm_iterate(entry->matchIterator);
 
-				if (entry->matchResult == NULL)
+				tbm_iterate(entry->matchIterator, entry->matchResult);
+				if (!BlockNumberIsValid(entry->matchResult->blockno))
 				{
 					ItemPointerSetInvalid(&entry->curItem);
 					tbm_end_iterate(entry->matchIterator);
+					pfree(entry->matchResult);
+					entry->matchResult = NULL;
 					entry->matchIterator = NULL;
 					entry->isFinished = true;
 					break;
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index ae7b0e9bb8..e82baf590e 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -246,7 +246,14 @@ ginFreeScanKeys(GinScanOpaque so)
 		if (entry->list)
 			pfree(entry->list);
 		if (entry->matchIterator)
+		{
 			tbm_end_iterate(entry->matchIterator);
+			if (entry->matchResult)
+			{
+				pfree(entry->matchResult);
+				entry->matchResult = NULL;
+			}
+		}
 		if (entry->matchBitmap)
 			tbm_free(entry->matchBitmap);
 	}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 01cca53586..ad6b7df483 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -226,6 +226,60 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static bool
+bitmapheap_pgsr_next_single(PgStreamingRead *pgsr,
+							uintptr_t pgsr_private, void *io_private,
+							BufferManagerRelation *bmr, ForkNumber *fork, BlockNumber *block,
+							ReadBufferMode *mode)
+{
+	TBMIterateResult *tbmres;
+	HeapScanDesc hdesc = (HeapScanDesc) pgsr_private;
+
+	*bmr = BMR_REL(hdesc->rs_base.rs_rd);
+	*fork = MAIN_FORKNUM;
+	*mode = RBM_NORMAL;
+
+	tbmres = (TBMIterateResult *) io_private;
+
+	for (;;)
+	{
+		if (hdesc->rs_base.shared_tbmiterator)
+			tbm_shared_iterate(hdesc->rs_base.shared_tbmiterator, tbmres);
+		else
+			tbm_iterate(hdesc->rs_base.tbmiterator, tbmres);
+
+		if (!BlockNumberIsValid(tbmres->blockno))
+			return false;
+
+		/*
+		 * Ignore any claimed entries past what we think is the end of the
+		 * relation. It may have been extended after the start of our scan (we
+		 * only hold an AccessShareLock, and it could be inserts from this
+		 * backend).  We don't take this optimization in SERIALIZABLE
+		 * isolation though, as we need to examine all invisible tuples
+		 * reachable by the index.
+		 */
+		if (!IsolationIsSerializable() && tbmres->blockno >= hdesc->rs_nblocks)
+			continue;
+
+		if (tbmres->ntuples >= 0)
+			hdesc->rs_base.exact_pages++;
+		else
+			hdesc->rs_base.lossy_pages++;
+
+		if (hdesc->rs_base.rs_flags & SO_CAN_SKIP_FETCH &&
+			!tbmres->recheck &&
+			VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno, &hdesc->vmbuffer))
+		{
+			hdesc->empty_tuples += tbmres->ntuples;
+			continue;
+		}
+
+		*block = tbmres->blockno;
+		return true;
+	}
+}
+
 static bool
 heap_pgsr_next_single(PgStreamingRead *pgsr,
 					  uintptr_t pgsr_private, void *per_io_data,
@@ -443,6 +497,11 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 		pg_streaming_read_free(scan->pgsr);
 		scan->pgsr = NULL;
 	}
+	if (BufferIsValid(scan->vmbuffer))
+	{
+		ReleaseBuffer(scan->vmbuffer);
+		scan->vmbuffer = InvalidBuffer;
+	}
 
 	/*
 	 * FIXME: This probably should be done in the !rs_inited blocks instead.
@@ -456,6 +515,17 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 		else
 			scan->pgsr = heap_pgsr_single_alloc(scan);
 	}
+	else if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
+	{
+		int			iodepth = Max(Min(128, NBuffers / 128), 1);
+
+		scan->pgsr = pg_streaming_read_buffer_alloc(iodepth,
+													offsetof(TBMIterateResult, offsets) + MaxHeapTuplesPerPage * sizeof(OffsetNumber),
+													(uintptr_t) scan, scan->rs_strategy, bitmapheap_pgsr_next_single);
+
+
+		scan->rs_inited = true;
+	}
 }
 
 /*
@@ -1140,6 +1210,12 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_strategy = NULL;	/* set in initscan */
 
 	scan->pgsr = NULL;
+	scan->vmbuffer = InvalidBuffer;
+	scan->empty_tuples = 0;
+	scan->rs_base.exact_pages = 0;
+	scan->rs_base.lossy_pages = 0;
+	scan->rs_base.tbmiterator = NULL;
+	scan->rs_base.shared_tbmiterator = NULL;
 
 	/*
 	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 26378cbb4b..c665910b27 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "storage/streaming_read.h"
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
@@ -2114,38 +2115,55 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
  * Executor related callbacks for the heap AM
  * ------------------------------------------------------------------------
  */
-
 static bool
-heapam_scan_bitmap_next_block(TableScanDesc scan,
-							  TBMIterateResult *tbmres)
+heapam_scan_bitmap_next_block(TableScanDesc scan, bool *recheck)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
-	BlockNumber block = tbmres->blockno;
+	TBMIterateResult *tbmres;
+	void	   *io_private;
 	Buffer		buffer;
 	Snapshot	snapshot;
 	int			ntup;
 
-	hscan->rs_cindex = 0;
-	hscan->rs_ntuples = 0;
+	Assert(hscan->pgsr);
 
 	/*
-	 * Ignore any claimed entries past what we think is the end of the
-	 * relation. It may have been extended after the start of our scan (we
-	 * only hold an AccessShareLock, and it could be inserts from this
-	 * backend).  We don't take this optimization in SERIALIZABLE isolation
-	 * though, as we need to examine all invisible tuples reachable by the
-	 * index.
+	 * Release the buffer containing the previous block and reset the per-page
+	 * counters. Reset hscan->rs_ntuples here just to be safe.
 	 */
-	if (!IsolationIsSerializable() && block >= hscan->rs_nblocks)
-		return false;
+	if (BufferIsValid(hscan->rs_cbuf))
+	{
+		ReleaseBuffer(hscan->rs_cbuf);
+		hscan->rs_cbuf = InvalidBuffer;
+	}
+
+	hscan->rs_cindex = 0;
+	hscan->rs_ntuples = 0;
+
+	hscan->rs_cbuf = pg_streaming_read_buffer_get_next(hscan->pgsr, &io_private);
+
+	if (BufferIsInvalid(hscan->rs_cbuf))
+	{
+		if (BufferIsValid(hscan->vmbuffer))
+		{
+			ReleaseBuffer(hscan->vmbuffer);
+			hscan->vmbuffer = InvalidBuffer;
+		}
+
+		return hscan->empty_tuples > 0;
+	}
+
+	Assert(io_private);
+
+	tbmres = (TBMIterateResult *) io_private;
+
+	Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);
+
+	*recheck = tbmres->recheck;
+
+	hscan->rs_cblock = tbmres->blockno;
+	hscan->rs_ntuples = tbmres->ntuples;
 
-	/*
-	 * Acquire pin on the target heap page, trading in any pin we held before.
-	 */
-	hscan->rs_cbuf = ReleaseAndReadBuffer(hscan->rs_cbuf,
-										  scan->rs_rd,
-										  block);
-	hscan->rs_cblock = block;
 	buffer = hscan->rs_cbuf;
 	snapshot = scan->rs_snapshot;
 
@@ -2181,7 +2199,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 			ItemPointerData tid;
 			HeapTupleData heapTuple;
 
-			ItemPointerSet(&tid, block, offnum);
+			ItemPointerSet(&tid, tbmres->blockno, offnum);
 			if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
 									   &heapTuple, NULL, true))
 				hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
@@ -2209,7 +2227,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 			loctup.t_data = (HeapTupleHeader) PageGetItem(page, lp);
 			loctup.t_len = ItemIdGetLength(lp);
 			loctup.t_tableOid = scan->rs_rd->rd_id;
-			ItemPointerSet(&loctup.t_self, block, offnum);
+			ItemPointerSet(&loctup.t_self, tbmres->blockno, offnum);
 			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
 			if (valid)
 			{
@@ -2227,25 +2245,31 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 	Assert(ntup <= MaxHeapTuplesPerPage);
 	hscan->rs_ntuples = ntup;
 
-	return ntup > 0;
+	return true;
 }
 
 static bool
-heapam_scan_bitmap_next_tuple(TableScanDesc scan,
-							  TBMIterateResult *tbmres,
-							  TupleTableSlot *slot)
+heapam_scan_bitmap_next_tuple(TableScanDesc scan, TupleTableSlot *slot)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 	OffsetNumber targoffset;
 	Page		page;
 	ItemId		lp;
 
+	if (hscan->empty_tuples)
+	{
+		ExecStoreAllNullTuple(slot);
+		hscan->empty_tuples--;
+		return true;
+	}
+
 	/*
 	 * Out of range?  If so, nothing more to look at on this page
 	 */
 	if (hscan->rs_cindex < 0 || hscan->rs_cindex >= hscan->rs_ntuples)
 		return false;
 
+	Assert(BufferIsValid(hscan->rs_cbuf));
 	targoffset = hscan->rs_vistuples[hscan->rs_cindex];
 	page = BufferGetPage(hscan->rs_cbuf);
 	lp = PageGetItemId(page, targoffset);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8570b14f62..e654ada418 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/relscan.h"
 #include "access/xact.h"
 #include "catalog/pg_type.h"
 #include "commands/createas.h"
@@ -112,7 +113,7 @@ static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_memoize_info(MemoizeState *mstate, List *ancestors,
 							  ExplainState *es);
 static void show_hashagg_info(AggState *aggstate, ExplainState *es);
-static void show_tidbitmap_info(BitmapHeapScanState *planstate,
+static void show_tidbitmap_info(TableScanDesc tdesc,
 								ExplainState *es);
 static void show_instrumentation_count(const char *qlabel, int which,
 									   PlanState *planstate, ExplainState *es);
@@ -1819,7 +1820,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			if (es->analyze)
-				show_tidbitmap_info((BitmapHeapScanState *) planstate, es);
+				show_tidbitmap_info(((BitmapHeapScanState *) planstate)->ss.ss_currentScanDesc, es);
 			break;
 		case T_SampleScan:
 			show_tablesample(((SampleScan *) plan)->tablesample,
@@ -3408,25 +3409,25 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
  * If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
  */
 static void
-show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
+show_tidbitmap_info(TableScanDesc tdesc, ExplainState *es)
 {
 	if (es->format != EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainPropertyInteger("Exact Heap Blocks", NULL,
-							   planstate->exact_pages, es);
+							   tdesc->exact_pages, es);
 		ExplainPropertyInteger("Lossy Heap Blocks", NULL,
-							   planstate->lossy_pages, es);
+							   tdesc->lossy_pages, es);
 	}
 	else
 	{
-		if (planstate->exact_pages > 0 || planstate->lossy_pages > 0)
+		if (tdesc->exact_pages > 0 || tdesc->lossy_pages > 0)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "Heap Blocks:");
-			if (planstate->exact_pages > 0)
-				appendStringInfo(es->str, " exact=%ld", planstate->exact_pages);
-			if (planstate->lossy_pages > 0)
-				appendStringInfo(es->str, " lossy=%ld", planstate->lossy_pages);
+			if (tdesc->exact_pages > 0)
+				appendStringInfo(es->str, " exact=%ld", tdesc->exact_pages);
+			if (tdesc->lossy_pages > 0)
+				appendStringInfo(es->str, " lossy=%ld", tdesc->lossy_pages);
 			appendStringInfoChar(es->str, '\n');
 		}
 	}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index f35df0b8bf..32518e8d31 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -54,11 +54,6 @@
 
 static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
 static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
-static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
-												TBMIterateResult *tbmres);
-static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
-static inline void BitmapPrefetch(BitmapHeapScanState *node,
-								  TableScanDesc scan);
 static bool BitmapShouldInitializeSharedState(ParallelBitmapHeapState *pstate);
 
 
@@ -74,10 +69,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	ExprContext *econtext;
 	TableScanDesc scan;
 	TIDBitmap  *tbm;
-	TBMIterator *tbmiterator = NULL;
-	TBMSharedIterator *shared_tbmiterator = NULL;
-	TBMIterateResult *tbmres;
-	TupleTableSlot *slot;
+	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 	ParallelBitmapHeapState *pstate = node->pstate;
 	dsa_area   *dsa = node->ss.ps.state->es_query_dsa;
 
@@ -88,23 +80,10 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	slot = node->ss.ss_ScanTupleSlot;
 	scan = node->ss.ss_currentScanDesc;
 	tbm = node->tbm;
-	if (pstate == NULL)
-		tbmiterator = node->tbmiterator;
-	else
-		shared_tbmiterator = node->shared_tbmiterator;
-	tbmres = node->tbmres;
 
 	/*
 	 * If we haven't yet performed the underlying index scan, do it, and begin
 	 * the iteration over the bitmap.
-	 *
-	 * For prefetching, we use *two* iterators, one for the pages we are
-	 * actually scanning and another that runs ahead of the first for
-	 * prefetching.  node->prefetch_pages tracks exactly how many pages ahead
-	 * the prefetch iterator is.  Also, node->prefetch_target tracks the
-	 * desired prefetch distance, which starts small and increases up to the
-	 * node->prefetch_maximum.  This is to avoid doing a lot of prefetching in
-	 * a scan that stops after a few tuples because of a LIMIT.
 	 */
 	if (!node->initialized)
 	{
@@ -116,17 +95,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 				elog(ERROR, "unrecognized result from subplan");
 
 			node->tbm = tbm;
-			node->tbmiterator = tbmiterator = tbm_begin_iterate(tbm);
-			node->tbmres = tbmres = NULL;
-
-#ifdef USE_PREFETCH
-			if (node->prefetch_maximum > 0)
-			{
-				node->prefetch_iterator = tbm_begin_iterate(tbm);
-				node->prefetch_pages = 0;
-				node->prefetch_target = -1;
-			}
-#endif							/* USE_PREFETCH */
+			scan->tbmiterator = tbm_begin_iterate(tbm);
 		}
 		else
 		{
@@ -148,194 +117,59 @@ BitmapHeapNext(BitmapHeapScanState *node)
 				 * dsa_pointer of the iterator state which will be used by
 				 * multiple processes to iterate jointly.
 				 */
-				pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
-#ifdef USE_PREFETCH
-				if (node->prefetch_maximum > 0)
-				{
-					pstate->prefetch_iterator =
-						tbm_prepare_shared_iterate(tbm);
-
-					/*
-					 * We don't need the mutex here as we haven't yet woke up
-					 * others.
-					 */
-					pstate->prefetch_pages = 0;
-					pstate->prefetch_target = -1;
-				}
-#endif
+				pstate->tbmiterator =
+					tbm_prepare_shared_iterate(tbm);
 
 				/* We have initialized the shared state so wake up others. */
 				BitmapDoneInitializingSharedState(pstate);
 			}
 
 			/* Allocate a private iterator and attach the shared state to it */
-			node->shared_tbmiterator = shared_tbmiterator =
+			scan->shared_tbmiterator =
 				tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
-			node->tbmres = tbmres = NULL;
-
-#ifdef USE_PREFETCH
-			if (node->prefetch_maximum > 0)
-			{
-				node->shared_prefetch_iterator =
-					tbm_attach_shared_iterate(dsa, pstate->prefetch_iterator);
-			}
-#endif							/* USE_PREFETCH */
 		}
 		node->initialized = true;
+
+		/* Get the first block. if none, end of scan */
+		if (!table_scan_bitmap_next_block(scan, &node->recheck))
+			goto exit;
 	}
 
-	for (;;)
+	do
 	{
-		bool		skip_fetch;
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * Get next page of results if needed
-		 */
-		if (tbmres == NULL)
-		{
-			if (!pstate)
-				node->tbmres = tbmres = tbm_iterate(tbmiterator);
-			else
-				node->tbmres = tbmres = tbm_shared_iterate(shared_tbmiterator);
-			if (tbmres == NULL)
-			{
-				/* no more entries in the bitmap */
-				break;
-			}
-
-			BitmapAdjustPrefetchIterator(node, tbmres);
-
-			/*
-			 * We can skip fetching the heap page if we don't need any fields
-			 * from the heap, and the bitmap entries don't need rechecking,
-			 * and all tuples on the page are visible to our transaction.
-			 *
-			 * XXX: It's a layering violation that we do these checks above
-			 * tableam, they should probably moved below it at some point.
-			 */
-			skip_fetch = (node->can_skip_fetch &&
-						  !tbmres->recheck &&
-						  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-										 tbmres->blockno,
-										 &node->vmbuffer));
-
-			if (skip_fetch)
-			{
-				/* can't be lossy in the skip_fetch case */
-				Assert(tbmres->ntuples >= 0);
-
-				/*
-				 * The number of tuples on this page is put into
-				 * node->return_empty_tuples.
-				 */
-				node->return_empty_tuples = tbmres->ntuples;
-			}
-			else if (!table_scan_bitmap_next_block(scan, tbmres))
-			{
-				/* AM doesn't think this block is valid, skip */
-				continue;
-			}
-
-			if (tbmres->ntuples >= 0)
-				node->exact_pages++;
-			else
-				node->lossy_pages++;
-
-			/* Adjust the prefetch target */
-			BitmapAdjustPrefetchTarget(node);
-		}
-		else
+		while (table_scan_bitmap_next_tuple(scan, slot))
 		{
-			/*
-			 * Continuing in previously obtained page.
-			 */
-
-#ifdef USE_PREFETCH
-
-			/*
-			 * Try to prefetch at least a few pages even before we get to the
-			 * second page if we don't stop reading after the first tuple.
-			 */
-			if (!pstate)
-			{
-				if (node->prefetch_target < node->prefetch_maximum)
-					node->prefetch_target++;
-			}
-			else if (pstate->prefetch_target < node->prefetch_maximum)
-			{
-				/* take spinlock while updating shared state */
-				SpinLockAcquire(&pstate->mutex);
-				if (pstate->prefetch_target < node->prefetch_maximum)
-					pstate->prefetch_target++;
-				SpinLockRelease(&pstate->mutex);
-			}
-#endif							/* USE_PREFETCH */
-		}
+			CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * We issue prefetch requests *after* fetching the current page to try
-		 * to avoid having prefetching interfere with the main I/O. Also, this
-		 * should happen only when we have determined there is still something
-		 * to do on the current page, else we may uselessly prefetch the same
-		 * page we are just about to request for real.
-		 *
-		 * XXX: It's a layering violation that we do these checks above
-		 * tableam, they should probably moved below it at some point.
-		 */
-		BitmapPrefetch(node, scan);
-
-		if (node->return_empty_tuples > 0)
-		{
-			/*
-			 * If we don't have to fetch the tuple, just return nulls.
-			 */
-			ExecStoreAllNullTuple(slot);
+			if (!node->recheck)
+				return slot;
 
-			if (--node->return_empty_tuples == 0)
-			{
-				/* no more tuples to return in the next round */
-				node->tbmres = tbmres = NULL;
-			}
-		}
-		else
-		{
-			/*
-			 * Attempt to fetch tuple from AM.
-			 */
-			if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
-			{
-				/* nothing more to look at on this page */
-				node->tbmres = tbmres = NULL;
-				continue;
-			}
+			econtext->ecxt_scantuple = slot;
+			if (ExecQualAndReset(node->bitmapqualorig, econtext))
+				return slot;
 
-			/*
-			 * If we are using lossy info, we have to recheck the qual
-			 * conditions at every tuple.
-			 */
-			if (tbmres->recheck)
-			{
-				econtext->ecxt_scantuple = slot;
-				if (!ExecQualAndReset(node->bitmapqualorig, econtext))
-				{
-					/* Fails recheck, so drop it and loop back for another */
-					InstrCountFiltered2(node, 1);
-					ExecClearTuple(slot);
-					continue;
-				}
-			}
+			/* Fails recheck, so drop it and loop back for another */
+			InstrCountFiltered2(node, 1);
+			ExecClearTuple(slot);
 		}
+	} while (table_scan_bitmap_next_block(scan, &node->recheck));
 
-		/* OK to return this tuple */
-		return slot;
-	}
+exit:
 
 	/*
-	 * if we get here it means we are at the end of the scan..
+	 * Release iterator
 	 */
-	return ExecClearTuple(slot);
+	if (scan->tbmiterator)
+	{
+		tbm_end_iterate(scan->tbmiterator);
+		scan->tbmiterator = NULL;
+	}
+	else if (scan->shared_tbmiterator)
+	{
+		tbm_end_shared_iterate(scan->shared_tbmiterator);
+		scan->shared_tbmiterator = NULL;
+	}
+	return NULL;
 }
 
 /*
@@ -353,215 +187,6 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
 	ConditionVariableBroadcast(&pstate->cv);
 }
 
-/*
- *	BitmapAdjustPrefetchIterator - Adjust the prefetch iterator
- */
-static inline void
-BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
-							 TBMIterateResult *tbmres)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
-		if (node->prefetch_pages > 0)
-		{
-			/* The main iterator has closed the distance by one page */
-			node->prefetch_pages--;
-		}
-		else if (prefetch_iterator)
-		{
-			/* Do not let the prefetch iterator get behind the main one */
-			TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
-
-			if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
-				elog(ERROR, "prefetch and main iterators are out of sync");
-		}
-		return;
-	}
-
-	if (node->prefetch_maximum > 0)
-	{
-		TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
-		SpinLockAcquire(&pstate->mutex);
-		if (pstate->prefetch_pages > 0)
-		{
-			pstate->prefetch_pages--;
-			SpinLockRelease(&pstate->mutex);
-		}
-		else
-		{
-			/* Release the mutex before iterating */
-			SpinLockRelease(&pstate->mutex);
-
-			/*
-			 * In case of shared mode, we can not ensure that the current
-			 * blockno of the main iterator and that of the prefetch iterator
-			 * are same.  It's possible that whatever blockno we are
-			 * prefetching will be processed by another process.  Therefore,
-			 * we don't validate the blockno here as we do in non-parallel
-			 * case.
-			 */
-			if (prefetch_iterator)
-				tbm_shared_iterate(prefetch_iterator);
-		}
-	}
-#endif							/* USE_PREFETCH */
-}
-
-/*
- * BitmapAdjustPrefetchTarget - Adjust the prefetch target
- *
- * Increase prefetch target if it's not yet at the max.  Note that
- * we will increase it to zero after fetching the very first
- * page/tuple, then to one after the second tuple is fetched, then
- * it doubles as later pages are fetched.
- */
-static inline void
-BitmapAdjustPrefetchTarget(BitmapHeapScanState *node)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		if (node->prefetch_target >= node->prefetch_maximum)
-			 /* don't increase any further */ ;
-		else if (node->prefetch_target >= node->prefetch_maximum / 2)
-			node->prefetch_target = node->prefetch_maximum;
-		else if (node->prefetch_target > 0)
-			node->prefetch_target *= 2;
-		else
-			node->prefetch_target++;
-		return;
-	}
-
-	/* Do an unlocked check first to save spinlock acquisitions. */
-	if (pstate->prefetch_target < node->prefetch_maximum)
-	{
-		SpinLockAcquire(&pstate->mutex);
-		if (pstate->prefetch_target >= node->prefetch_maximum)
-			 /* don't increase any further */ ;
-		else if (pstate->prefetch_target >= node->prefetch_maximum / 2)
-			pstate->prefetch_target = node->prefetch_maximum;
-		else if (pstate->prefetch_target > 0)
-			pstate->prefetch_target *= 2;
-		else
-			pstate->prefetch_target++;
-		SpinLockRelease(&pstate->mutex);
-	}
-#endif							/* USE_PREFETCH */
-}
-
-/*
- * BitmapPrefetch - Prefetch, if prefetch_pages are behind prefetch_target
- */
-static inline void
-BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
-		if (prefetch_iterator)
-		{
-			while (node->prefetch_pages < node->prefetch_target)
-			{
-				TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
-				bool		skip_fetch;
-
-				if (tbmpre == NULL)
-				{
-					/* No more pages to prefetch */
-					tbm_end_iterate(prefetch_iterator);
-					node->prefetch_iterator = NULL;
-					break;
-				}
-				node->prefetch_pages++;
-
-				/*
-				 * If we expect not to have to actually read this heap page,
-				 * skip this prefetch call, but continue to run the prefetch
-				 * logic normally.  (Would it be better not to increment
-				 * prefetch_pages?)
-				 *
-				 * This depends on the assumption that the index AM will
-				 * report the same recheck flag for this future heap page as
-				 * it did for the current heap page; which is not a certainty
-				 * but is true in many cases.
-				 */
-				skip_fetch = (node->can_skip_fetch &&
-							  (node->tbmres ? !node->tbmres->recheck : false) &&
-							  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-											 tbmpre->blockno,
-											 &node->pvmbuffer));
-
-				if (!skip_fetch)
-					PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
-			}
-		}
-
-		return;
-	}
-
-	if (pstate->prefetch_pages < pstate->prefetch_target)
-	{
-		TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
-		if (prefetch_iterator)
-		{
-			while (1)
-			{
-				TBMIterateResult *tbmpre;
-				bool		do_prefetch = false;
-				bool		skip_fetch;
-
-				/*
-				 * Recheck under the mutex. If some other process has already
-				 * done enough prefetching then we need not to do anything.
-				 */
-				SpinLockAcquire(&pstate->mutex);
-				if (pstate->prefetch_pages < pstate->prefetch_target)
-				{
-					pstate->prefetch_pages++;
-					do_prefetch = true;
-				}
-				SpinLockRelease(&pstate->mutex);
-
-				if (!do_prefetch)
-					return;
-
-				tbmpre = tbm_shared_iterate(prefetch_iterator);
-				if (tbmpre == NULL)
-				{
-					/* No more pages to prefetch */
-					tbm_end_shared_iterate(prefetch_iterator);
-					node->shared_prefetch_iterator = NULL;
-					break;
-				}
-
-				/* As above, skip prefetch if we expect not to need page */
-				skip_fetch = (node->can_skip_fetch &&
-							  (node->tbmres ? !node->tbmres->recheck : false) &&
-							  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-											 tbmpre->blockno,
-											 &node->pvmbuffer));
-
-				if (!skip_fetch)
-					PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
-			}
-		}
-	}
-#endif							/* USE_PREFETCH */
-}
-
 /*
  * BitmapHeapRecheck -- access method routine to recheck a tuple in EvalPlanQual
  */
@@ -607,29 +232,10 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
 	table_rescan(node->ss.ss_currentScanDesc, NULL);
 
 	/* release bitmaps and buffers if any */
-	if (node->tbmiterator)
-		tbm_end_iterate(node->tbmiterator);
-	if (node->prefetch_iterator)
-		tbm_end_iterate(node->prefetch_iterator);
-	if (node->shared_tbmiterator)
-		tbm_end_shared_iterate(node->shared_tbmiterator);
-	if (node->shared_prefetch_iterator)
-		tbm_end_shared_iterate(node->shared_prefetch_iterator);
 	if (node->tbm)
 		tbm_free(node->tbm);
-	if (node->vmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->vmbuffer);
-	if (node->pvmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->pvmbuffer);
 	node->tbm = NULL;
-	node->tbmiterator = NULL;
-	node->tbmres = NULL;
-	node->prefetch_iterator = NULL;
 	node->initialized = false;
-	node->shared_tbmiterator = NULL;
-	node->shared_prefetch_iterator = NULL;
-	node->vmbuffer = InvalidBuffer;
-	node->pvmbuffer = InvalidBuffer;
 
 	ExecScanReScan(&node->ss);
 
@@ -675,20 +281,8 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
 	/*
 	 * release bitmaps and buffers if any
 	 */
-	if (node->tbmiterator)
-		tbm_end_iterate(node->tbmiterator);
-	if (node->prefetch_iterator)
-		tbm_end_iterate(node->prefetch_iterator);
 	if (node->tbm)
 		tbm_free(node->tbm);
-	if (node->shared_tbmiterator)
-		tbm_end_shared_iterate(node->shared_tbmiterator);
-	if (node->shared_prefetch_iterator)
-		tbm_end_shared_iterate(node->shared_prefetch_iterator);
-	if (node->vmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->vmbuffer);
-	if (node->pvmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->pvmbuffer);
 
 	/*
 	 * close heap scan
@@ -707,6 +301,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 {
 	BitmapHeapScanState *scanstate;
 	Relation	currentRelation;
+	bool		can_skip_fetch;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -726,32 +321,10 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->ss.ps.ExecProcNode = ExecBitmapHeapScan;
 
 	scanstate->tbm = NULL;
-	scanstate->tbmiterator = NULL;
-	scanstate->tbmres = NULL;
-	scanstate->return_empty_tuples = 0;
-	scanstate->vmbuffer = InvalidBuffer;
-	scanstate->pvmbuffer = InvalidBuffer;
-	scanstate->exact_pages = 0;
-	scanstate->lossy_pages = 0;
-	scanstate->prefetch_iterator = NULL;
-	scanstate->prefetch_pages = 0;
-	scanstate->prefetch_target = 0;
 	scanstate->pscan_len = 0;
 	scanstate->initialized = false;
-	scanstate->shared_tbmiterator = NULL;
-	scanstate->shared_prefetch_iterator = NULL;
 	scanstate->pstate = NULL;
 
-	/*
-	 * We can potentially skip fetching heap pages if we do not need any
-	 * columns of the table, either for checking non-indexable quals or for
-	 * returning data.  This test is a bit simplistic, as it checks the
-	 * stronger condition that there's no qual or return tlist at all.  But in
-	 * most cases it's probably not worth working harder than that.
-	 */
-	scanstate->can_skip_fetch = (node->scan.plan.qual == NIL &&
-								 node->scan.plan.targetlist == NIL);
-
 	/*
 	 * Miscellaneous initialization
 	 *
@@ -790,19 +363,22 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->bitmapqualorig =
 		ExecInitQual(node->bitmapqualorig, (PlanState *) scanstate);
 
+	scanstate->ss.ss_currentRelation = currentRelation;
+
 	/*
-	 * Maximum number of prefetches for the tablespace if configured,
-	 * otherwise the current value of the effective_io_concurrency GUC.
+	 * We can potentially skip fetching heap pages if we do not need any
+	 * columns of the table, either for checking non-indexable quals or for
+	 * returning data.  This test is a bit simplistic, as it checks the
+	 * stronger condition that there's no qual or return tlist at all.  But in
+	 * most cases it's probably not worth working harder than that.
 	 */
-	scanstate->prefetch_maximum =
-		get_tablespace_io_concurrency(currentRelation->rd_rel->reltablespace);
-
-	scanstate->ss.ss_currentRelation = currentRelation;
+	can_skip_fetch = (node->scan.plan.qual == NIL &&
+					  node->scan.plan.targetlist == NIL);
 
 	scanstate->ss.ss_currentScanDesc = table_beginscan_bm(currentRelation,
 														  estate->es_snapshot,
 														  0,
-														  NULL);
+														  NULL, can_skip_fetch);
 
 	/*
 	 * all done.
@@ -887,12 +463,9 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
 	pstate = shm_toc_allocate(pcxt->toc, node->pscan_len);
 
 	pstate->tbmiterator = 0;
-	pstate->prefetch_iterator = 0;
 
 	/* Initialize the mutex */
 	SpinLockInit(&pstate->mutex);
-	pstate->prefetch_pages = 0;
-	pstate->prefetch_target = 0;
 	pstate->state = BM_INITIAL;
 
 	ConditionVariableInit(&pstate->cv);
@@ -924,11 +497,7 @@ ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
 	if (DsaPointerIsValid(pstate->tbmiterator))
 		tbm_free_shared_area(dsa, pstate->tbmiterator);
 
-	if (DsaPointerIsValid(pstate->prefetch_iterator))
-		tbm_free_shared_area(dsa, pstate->prefetch_iterator);
-
 	pstate->tbmiterator = InvalidDsaPointer;
-	pstate->prefetch_iterator = InvalidDsaPointer;
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index 29a1858441..052c8ff91a 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -180,7 +180,6 @@ struct TBMIterator
 	int			spageptr;		/* next spages index */
 	int			schunkptr;		/* next schunks index */
 	int			schunkbit;		/* next bit to check in current schunk */
-	TBMIterateResult output;	/* MUST BE LAST (because variable-size) */
 };
 
 /*
@@ -221,7 +220,6 @@ struct TBMSharedIterator
 	PTEntryArray *ptbase;		/* pagetable element array */
 	PTIterationArray *ptpages;	/* sorted exact page index list */
 	PTIterationArray *ptchunks; /* sorted lossy page index list */
-	TBMIterateResult output;	/* MUST BE LAST (because variable-size) */
 };
 
 /* Local function prototypes */
@@ -695,8 +693,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
 	 * Create the TBMIterator struct, with enough trailing space to serve the
 	 * needs of the TBMIterateResult sub-struct.
 	 */
-	iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
-									  MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+	iterator = (TBMIterator *) palloc(sizeof(TBMIterator));
 	iterator->tbm = tbm;
 
 	/*
@@ -957,20 +954,21 @@ tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
 /*
  * tbm_iterate - scan through next page of a TIDBitmap
  *
- * Returns a TBMIterateResult representing one page, or NULL if there are
- * no more pages to scan.  Pages are guaranteed to be delivered in numerical
- * order.  If result->ntuples < 0, then the bitmap is "lossy" and failed to
- * remember the exact tuples to look at on this page --- the caller must
- * examine all tuples on the page and check if they meet the intended
- * condition.  If result->recheck is true, only the indicated tuples need
- * be examined, but the condition must be rechecked anyway.  (For ease of
- * testing, recheck is always set true when ntuples < 0.)
+ * Caller must pass in a TBMIterateResult to be filled.
+ *
+ * Pages are guaranteed to be delivered in numerical order.  tbmres->blockno is
+ * set to InvalidBlockNumber when there are no more pages to scan. If
+ * tbmres->ntuples < 0, then the bitmap is "lossy" and failed to remember the
+ * exact tuples to look at on this page --- the caller must examine all tuples
+ * on the page and check if they meet the intended condition.  If
+ * tbmres->recheck is true, only the indicated tuples need be examined, but the
+ * condition must be rechecked anyway.  (For ease of testing, recheck is always
+ * set true when ntuples < 0.)
  */
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)
 {
 	TIDBitmap  *tbm = iterator->tbm;
-	TBMIterateResult *output = &(iterator->output);
 
 	Assert(tbm->iterating == TBM_ITERATING_PRIVATE);
 
@@ -998,6 +996,7 @@ tbm_iterate(TBMIterator *iterator)
 	 * If both chunk and per-page data remain, must output the numerically
 	 * earlier page.
 	 */
+	Assert(tbmres);
 	if (iterator->schunkptr < tbm->nchunks)
 	{
 		PagetableEntry *chunk = tbm->schunks[iterator->schunkptr];
@@ -1008,11 +1007,11 @@ tbm_iterate(TBMIterator *iterator)
 			chunk_blockno < tbm->spages[iterator->spageptr]->blockno)
 		{
 			/* Return a lossy page indicator from the chunk */
-			output->blockno = chunk_blockno;
-			output->ntuples = -1;
-			output->recheck = true;
+			tbmres->blockno = chunk_blockno;
+			tbmres->ntuples = -1;
+			tbmres->recheck = true;
 			iterator->schunkbit++;
-			return output;
+			return;
 		}
 	}
 
@@ -1028,16 +1027,17 @@ tbm_iterate(TBMIterator *iterator)
 			page = tbm->spages[iterator->spageptr];
 
 		/* scan bitmap to extract individual offset numbers */
-		ntuples = tbm_extract_page_tuple(page, output);
-		output->blockno = page->blockno;
-		output->ntuples = ntuples;
-		output->recheck = page->recheck;
+		ntuples = tbm_extract_page_tuple(page, tbmres);
+		tbmres->blockno = page->blockno;
+		tbmres->ntuples = ntuples;
+		tbmres->recheck = page->recheck;
 		iterator->spageptr++;
-		return output;
+		return;
 	}
 
 	/* Nothing more in the bitmap */
-	return NULL;
+	tbmres->blockno = InvalidBlockNumber;
+	return;
 }
 
 /*
@@ -1047,10 +1047,9 @@ tbm_iterate(TBMIterator *iterator)
  *	across multiple processes.  We need to acquire the iterator LWLock,
  *	before accessing the shared members.
  */
-TBMIterateResult *
-tbm_shared_iterate(TBMSharedIterator *iterator)
+void
+tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres)
 {
-	TBMIterateResult *output = &iterator->output;
 	TBMSharedIteratorState *istate = iterator->state;
 	PagetableEntry *ptbase = NULL;
 	int		   *idxpages = NULL;
@@ -1101,13 +1100,13 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
 			chunk_blockno < ptbase[idxpages[istate->spageptr]].blockno)
 		{
 			/* Return a lossy page indicator from the chunk */
-			output->blockno = chunk_blockno;
-			output->ntuples = -1;
-			output->recheck = true;
+			tbmres->blockno = chunk_blockno;
+			tbmres->ntuples = -1;
+			tbmres->recheck = true;
 			istate->schunkbit++;
 
 			LWLockRelease(&istate->lock);
-			return output;
+			return;
 		}
 	}
 
@@ -1117,21 +1116,22 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
 		int			ntuples;
 
 		/* scan bitmap to extract individual offset numbers */
-		ntuples = tbm_extract_page_tuple(page, output);
-		output->blockno = page->blockno;
-		output->ntuples = ntuples;
-		output->recheck = page->recheck;
+		ntuples = tbm_extract_page_tuple(page, tbmres);
+		tbmres->blockno = page->blockno;
+		tbmres->ntuples = ntuples;
+		tbmres->recheck = page->recheck;
 		istate->spageptr++;
 
 		LWLockRelease(&istate->lock);
 
-		return output;
+		return;
 	}
 
 	LWLockRelease(&istate->lock);
 
 	/* Nothing more in the bitmap */
-	return NULL;
+	tbmres->blockno = InvalidBlockNumber;
+	return;
 }
 
 /*
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 852d47b1f8..8cb4be8570 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -20,6 +20,7 @@
 #include "access/skey.h"
 #include "access/table.h"		/* for backward compatibility */
 #include "access/tableam.h"
+#include "nodes/execnodes.h"
 #include "nodes/lockoptions.h"
 #include "nodes/primnodes.h"
 #include "storage/bufpage.h"
@@ -76,9 +77,15 @@ typedef struct HeapScanDescData
 	struct PgStreamingRead *pgsr;
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
+
+	/*
+	 * MFIXME: not sure if vmbuffer is being released in the right place.
+	 */
+	Buffer		vmbuffer;		/* current VM buffer for BHS */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
 	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
+	int			empty_tuples;
 }			HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
 
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index d03360eac0..6642e1a754 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -16,6 +16,7 @@
 
 #include "access/htup_details.h"
 #include "access/itup.h"
+#include "nodes/tidbitmap.h"
 #include "port/atomics.h"
 #include "storage/buf.h"
 #include "storage/spin.h"
@@ -40,6 +41,13 @@ typedef struct TableScanDescData
 	ItemPointerData rs_mintid;
 	ItemPointerData rs_maxtid;
 
+	/* Only used for Bitmap table scans */
+	TBMIterator *tbmiterator;
+	TBMSharedIterator *shared_tbmiterator;
+	long		exact_pages;
+	long		lossy_pages;
+	int			empty_tuples;
+
 	/*
 	 * Information about type and behaviour of the scan, a bitmask of members
 	 * of the ScanOptions enum (see tableam.h).
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 230bc39cc0..864ca4ed10 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -32,6 +32,7 @@ extern PGDLLIMPORT char *default_table_access_method;
 extern PGDLLIMPORT bool synchronize_seqscans;
 
 
+struct BitmapHeapScanState;
 struct BulkInsertStateData;
 struct IndexInfo;
 struct SampleScanState;
@@ -61,7 +62,8 @@ typedef enum ScanOptions
 	SO_ALLOW_PAGEMODE = 1 << 8,
 
 	/* unregister snapshot at scan end? */
-	SO_TEMP_SNAPSHOT = 1 << 9
+	SO_TEMP_SNAPSHOT = 1 << 9,
+	SO_CAN_SKIP_FETCH = 1 << 10
 } ScanOptions;
 
 /*
@@ -773,53 +775,36 @@ typedef struct TableAmRoutine
 	 */
 
 	/*
-	 * Prepare to fetch / check / return tuples from `tbmres->blockno` as part
-	 * of a bitmap table scan. `scan` was started via table_beginscan_bm().
-	 * Return false if there are no tuples to be found on the page, true
-	 * otherwise.
+	 * Prepare to fetch / check / return tuples obtained from the streaming
+	 * read helper as part of a bitmap table scan. `scan` was started via
+	 * table_beginscan_bm(). Return false if the relation is exhausted and
+	 * true otherwise.
 	 *
 	 * This will typically read and pin the target block, and do the necessary
 	 * work to allow scan_bitmap_next_tuple() to return tuples (e.g. it might
 	 * make sense to perform tuple visibility checks at this time). For some
-	 * AMs it will make more sense to do all the work referencing `tbmres`
-	 * contents here, for others it might be better to defer more work to
-	 * scan_bitmap_next_tuple.
+	 * AMs, it might be better to defer more work to scan_bitmap_next_tuple().
 	 *
-	 * If `tbmres->blockno` is -1, this is a lossy scan and all visible tuples
-	 * on the page have to be returned, otherwise the tuples at offsets in
-	 * `tbmres->offsets` need to be returned.
+	 * After examining the page from the TBMIterateResult returned by the
+	 * streaming read helper, most AMs will set relevant fields in the `scan`
+	 * for the benefit of scan_bitmap_next_tuple(). All AMs must set
+	 * `recheck`, passed in by the caller, based on the value in the
+	 * TBMIterateResult so that BitmapHeapNext() can determine whether or not
+	 * to yield tuples.
 	 *
-	 * XXX: Currently this may only be implemented if the AM uses md.c as its
-	 * storage manager, and uses ItemPointer->ip_blkid in a manner that maps
-	 * blockids directly to the underlying storage. nodeBitmapHeapscan.c
-	 * performs prefetching directly using that interface.  This probably
-	 * needs to be rectified at a later point.
-	 *
-	 * XXX: Currently this may only be implemented if the AM uses the
-	 * visibilitymap, as nodeBitmapHeapscan.c unconditionally accesses it to
-	 * perform prefetching.  This probably needs to be rectified at a later
-	 * point.
-	 *
-	 * Optional callback, but either both scan_bitmap_next_block and
-	 * scan_bitmap_next_tuple need to exist, or neither.
+	 * Optional callback, but scan_bitmap_setup, scan_bitmap_next_block, and
+	 * scan_bitmap_next_tuple need to exist, or none of them.
 	 */
-	bool		(*scan_bitmap_next_block) (TableScanDesc scan,
-										   struct TBMIterateResult *tbmres);
+	bool		(*scan_bitmap_next_block) (TableScanDesc scan, bool *recheck);
 
 	/*
 	 * Fetch the next tuple of a bitmap table scan into `slot` and return true
 	 * if a visible tuple was found, false otherwise.
 	 *
-	 * For some AMs it will make more sense to do all the work referencing
-	 * `tbmres` contents in scan_bitmap_next_block, for others it might be
-	 * better to defer more work to this callback.
-	 *
-	 * Optional callback, but either both scan_bitmap_next_block and
-	 * scan_bitmap_next_tuple need to exist, or neither.
+	 * Optional callback, but scan_bitmap_setup, scan_bitmap_next_block, and
+	 * scan_bitmap_next_tuple need to exist, or none of them.
 	 */
-	bool		(*scan_bitmap_next_tuple) (TableScanDesc scan,
-										   struct TBMIterateResult *tbmres,
-										   TupleTableSlot *slot);
+	bool		(*scan_bitmap_next_tuple) (TableScanDesc scan, TupleTableSlot *slot);
 
 	/*
 	 * Prepare to fetch tuples from the next block in a sample scan. Return
@@ -944,10 +929,13 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
  */
 static inline TableScanDesc
 table_beginscan_bm(Relation rel, Snapshot snapshot,
-				   int nkeys, struct ScanKeyData *key)
+				   int nkeys, struct ScanKeyData *key, bool can_skip_fetch)
 {
 	uint32		flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
 
+	if (can_skip_fetch)
+		flags |= SO_CAN_SKIP_FETCH;
+
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
 
@@ -1955,8 +1943,7 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
  * used after verifying the presence (at plan time or such).
  */
 static inline bool
-table_scan_bitmap_next_block(TableScanDesc scan,
-							 struct TBMIterateResult *tbmres)
+table_scan_bitmap_next_block(TableScanDesc scan, bool *recheck)
 {
 	/*
 	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1966,8 +1953,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
 	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
 		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
 
-	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
-														   tbmres);
+	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck);
 }
 
 /*
@@ -1979,9 +1965,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
  * returned false.
  */
 static inline bool
-table_scan_bitmap_next_tuple(TableScanDesc scan,
-							 struct TBMIterateResult *tbmres,
-							 TupleTableSlot *slot)
+table_scan_bitmap_next_tuple(TableScanDesc scan, TupleTableSlot *slot)
 {
 	/*
 	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
@@ -1992,7 +1976,6 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
 
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
-														   tbmres,
 														   slot);
 }
 
diff --git a/src/include/executor/nodeBitmapHeapscan.h b/src/include/executor/nodeBitmapHeapscan.h
index 3a267a7fbd..578989ae7f 100644
--- a/src/include/executor/nodeBitmapHeapscan.h
+++ b/src/include/executor/nodeBitmapHeapscan.h
@@ -28,5 +28,6 @@ extern void ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
 										  ParallelContext *pcxt);
 extern void ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
 										   ParallelWorkerContext *pwcxt);
+extern void table_bitmap_scan_setup(BitmapHeapScanState *scanstate);
 
 #endif							/* NODEBITMAPHEAPSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cb714f4a19..6aad5f1d70 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1676,11 +1676,7 @@ typedef enum
 /* ----------------
  *	 ParallelBitmapHeapState information
  *		tbmiterator				iterator for scanning current pages
- *		prefetch_iterator		iterator for prefetching ahead of current page
- *		mutex					mutual exclusion for the prefetching variable
- *								and state
- *		prefetch_pages			# pages prefetch iterator is ahead of current
- *		prefetch_target			current target prefetch distance
+ *		mutex					mutual exclusion for the state
  *		state					current state of the TIDBitmap
  *		cv						conditional wait variable
  *		phs_snapshot_data		snapshot data shared to workers
@@ -1689,10 +1685,7 @@ typedef enum
 typedef struct ParallelBitmapHeapState
 {
 	dsa_pointer tbmiterator;
-	dsa_pointer prefetch_iterator;
 	slock_t		mutex;
-	int			prefetch_pages;
-	int			prefetch_target;
 	SharedBitmapState state;
 	ConditionVariable cv;
 	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
@@ -1703,22 +1696,9 @@ typedef struct ParallelBitmapHeapState
  *
  *		bitmapqualorig	   execution state for bitmapqualorig expressions
  *		tbm				   bitmap obtained from child index scan(s)
- *		tbmiterator		   iterator for scanning current pages
- *		tbmres			   current-page data
- *		can_skip_fetch	   can we potentially skip tuple fetches in this scan?
- *		return_empty_tuples number of empty tuples to return
- *		vmbuffer		   buffer for visibility-map lookups
- *		pvmbuffer		   ditto, for prefetched pages
- *		exact_pages		   total number of exact pages retrieved
- *		lossy_pages		   total number of lossy pages retrieved
- *		prefetch_iterator  iterator for prefetching ahead of current page
- *		prefetch_pages	   # pages prefetch iterator is ahead of current
- *		prefetch_target    current target prefetch distance
- *		prefetch_maximum   maximum value for prefetch_target
+ *		recheck			   whether or not the tuple needs to be rechecked before returning
  *		pscan_len		   size of the shared memory for parallel bitmap
  *		initialized		   is node is ready to iterate
- *		shared_tbmiterator	   shared iterator
- *		shared_prefetch_iterator shared iterator for prefetching
  *		pstate			   shared state for parallel bitmap scan
  * ----------------
  */
@@ -1727,22 +1707,9 @@ typedef struct BitmapHeapScanState
 	ScanState	ss;				/* its first field is NodeTag */
 	ExprState  *bitmapqualorig;
 	TIDBitmap  *tbm;
-	TBMIterator *tbmiterator;
-	TBMIterateResult *tbmres;
-	bool		can_skip_fetch;
-	int			return_empty_tuples;
-	Buffer		vmbuffer;
-	Buffer		pvmbuffer;
-	long		exact_pages;
-	long		lossy_pages;
-	TBMIterator *prefetch_iterator;
-	int			prefetch_pages;
-	int			prefetch_target;
-	int			prefetch_maximum;
+	bool		recheck;
 	Size		pscan_len;
 	bool		initialized;
-	TBMSharedIterator *shared_tbmiterator;
-	TBMSharedIterator *shared_prefetch_iterator;
 	ParallelBitmapHeapState *pstate;
 } BitmapHeapScanState;
 
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index b64e36437a..e15d77dc81 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -64,8 +64,8 @@ extern bool tbm_is_empty(const TIDBitmap *tbm);
 
 extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
 extern dsa_pointer tbm_prepare_shared_iterate(TIDBitmap *tbm);
-extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
-extern TBMIterateResult *tbm_shared_iterate(TBMSharedIterator *iterator);
+extern void tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres);
+extern void tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres);
 extern void tbm_end_iterate(TBMIterator *iterator);
 extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
 extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
-- 
2.39.2

v1-0012-WIP-Use-streaming-reads-in-recovery.patchtext/x-patch; charset=US-ASCII; name=v1-0012-WIP-Use-streaming-reads-in-recovery.patchDownload

From b10722996a4f27c44e5b6b07eac306a6cc017ca5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 23 May 2022 10:49:17 +1200
Subject: [PATCH v1 12/14] WIP: Use streaming reads in recovery.

Replace xlogprefetcher.c's local I/O queue mechanism with the standard
'streaming buffer read' pattern.  Benefits:

* 8KB posix_fadvise() and pread() calls can be batched into larger
  vector I/O calls that transfer ranges of blocks up to 128KB at a time

* the prefetcher receives a stream of already-pinned buffers, which also
  means that buffers that are accessed multiple times within the
  prefetching window can stay pinned; this applies whether or not I/O
  is necessary, so it provides a performance benefit even to well cached
  recovery workloads

Adopting centralized infrastructure also means that future proposed
improvements (for example asynchronous I/O) will automatically apply
here too.

Previously, the GUC "recovery_prefetch" had a default setting "try",
for the benefit of systems with no posix_fadvise() system call, where
there was no point in looking ahead in the WAL.  Now that it is useful
on all system to do so, to build vector I/Os and consolidate pins, the
option is available everywhere so there is no point in the "try"
setting.  It can still be turned off, though, and it is still limited by
maintenance_io_concurrency (0 on systems with no posix_fadvise()).

XXX TODO kill most of the pg_stat_recovery_prefetch hit/miss counters
and push data into the standard pgstatio infrastructure/views.

XXX TODO need to work on the code that handles changes to recovery
settings during recovery (ie buffers on already-decoded records might be
out of sync)

Author: Thomas Munro <thomas.munro@gmail.com>
---
 doc/src/sgml/config.sgml                    |  10 +-
 src/backend/access/transam/xlogprefetcher.c | 601 ++++++++++----------
 src/backend/access/transam/xlogreader.c     |  12 +
 src/backend/access/transam/xlogrecovery.c   |   5 +-
 src/backend/access/transam/xlogutils.c      |  74 ++-
 src/backend/storage/buffer/bufmgr.c         |   6 +-
 src/backend/storage/freespace/freespace.c   |   2 +-
 src/backend/utils/misc/guc_tables.c         |   6 +-
 src/include/access/xlogprefetcher.h         |   3 +-
 src/include/access/xlogreader.h             |   1 +
 src/include/access/xlogutils.h              |   4 +-
 11 files changed, 380 insertions(+), 344 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 694d667bf9..521d3981ec 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3737,13 +3737,9 @@ include_dir 'conf.d'
        <para>
         Whether to try to prefetch blocks that are referenced in the WAL that
         are not yet in the buffer pool, during recovery.  Valid values are
-        <literal>off</literal>, <literal>on</literal> and
-        <literal>try</literal> (the default).  The setting
-        <literal>try</literal> enables
-        prefetching only if the operating system provides the
-        <function>posix_fadvise</function> function, which is currently used
-        to implement prefetching.  Note that some operating systems provide the
-        function, but it doesn't do anything.
+        <literal>off</literal> and <literal>on</literal> (the default).
+        This allows <function>posix_fadvise</function> hints to be provided
+        to some operating systems, and I/O to be performed more efficiently.
        </para>
        <para>
         Prefetching blocks that will soon be needed can reduce I/O wait times
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 539928cb85..ec068cf1ea 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -27,6 +27,7 @@
 
 #include "postgres.h"
 
+#include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xlogprefetcher.h"
 #include "access/xlogreader.h"
@@ -44,6 +45,7 @@
 #include "storage/bufmgr.h"
 #include "storage/shmem.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/guc_hooks.h"
 #include "utils/hsearch.h"
 
@@ -53,12 +55,6 @@
  */
 #define XLOGPREFETCHER_STATS_DISTANCE BLCKSZ
 
-/*
- * To detect repeated access to the same block and skip useless extra system
- * calls, we remember a small window of recently prefetched blocks.
- */
-#define XLOGPREFETCHER_SEQ_WINDOW_SIZE 4
-
 /*
  * When maintenance_io_concurrency is not saturated, we're prepared to look
  * ahead up to N times that number of block references.
@@ -69,58 +65,13 @@
 /* #define XLOGPREFETCHER_DEBUG_LEVEL LOG */
 
 /* GUCs */
-int			recovery_prefetch = RECOVERY_PREFETCH_TRY;
+int			recovery_prefetch = RECOVERY_PREFETCH_ON;
 
-#ifdef USE_PREFETCH
 #define RecoveryPrefetchEnabled() \
-		(recovery_prefetch != RECOVERY_PREFETCH_OFF && \
-		 maintenance_io_concurrency > 0)
-#else
-#define RecoveryPrefetchEnabled() false
-#endif
+		(recovery_prefetch != RECOVERY_PREFETCH_OFF)
 
 static int	XLogPrefetchReconfigureCount = 0;
 
-/*
- * Enum used to report whether an IO should be started.
- */
-typedef enum
-{
-	LRQ_NEXT_NO_IO,
-	LRQ_NEXT_IO,
-	LRQ_NEXT_AGAIN
-} LsnReadQueueNextStatus;
-
-/*
- * Type of callback that can decide which block to prefetch next.  For now
- * there is only one.
- */
-typedef LsnReadQueueNextStatus (*LsnReadQueueNextFun) (uintptr_t lrq_private,
-													   XLogRecPtr *lsn);
-
-/*
- * A simple circular queue of LSNs, using to control the number of
- * (potentially) inflight IOs.  This stands in for a later more general IO
- * control mechanism, which is why it has the apparently unnecessary
- * indirection through a function pointer.
- */
-typedef struct LsnReadQueue
-{
-	LsnReadQueueNextFun next;
-	uintptr_t	lrq_private;
-	uint32		max_inflight;
-	uint32		inflight;
-	uint32		completed;
-	uint32		head;
-	uint32		tail;
-	uint32		size;
-	struct
-	{
-		bool		io;
-		XLogRecPtr	lsn;
-	}			queue[FLEXIBLE_ARRAY_MEMBER];
-} LsnReadQueue;
-
 /*
  * A prefetcher.  This is a mechanism that wraps an XLogReader, prefetching
  * blocks that will be soon be referenced, to try to avoid IO stalls.
@@ -139,16 +90,11 @@ struct XLogPrefetcher
 	HTAB	   *filter_table;
 	dlist_head	filter_queue;
 
-	/* Book-keeping to avoid repeat prefetches. */
-	RelFileLocator recent_rlocator[XLOGPREFETCHER_SEQ_WINDOW_SIZE];
-	BlockNumber recent_block[XLOGPREFETCHER_SEQ_WINDOW_SIZE];
-	int			recent_idx;
-
 	/* Book-keeping to disable prefetching temporarily. */
 	XLogRecPtr	no_readahead_until;
 
 	/* IO depth manager. */
-	LsnReadQueue *streaming_read;
+	PgStreamingRead *streaming_read;
 
 	XLogRecPtr	begin_ptr;
 
@@ -197,103 +143,16 @@ static inline bool XLogPrefetcherIsFiltered(XLogPrefetcher *prefetcher,
 											BlockNumber blockno);
 static inline void XLogPrefetcherCompleteFilters(XLogPrefetcher *prefetcher,
 												 XLogRecPtr replaying_lsn);
-static LsnReadQueueNextStatus XLogPrefetcherNextBlock(uintptr_t pgsr_private,
-													  XLogRecPtr *lsn);
+static bool XLogPrefetcherNextBlock(PgStreamingRead *pgsr,
+									uintptr_t pgsr_prviate,
+									void *per_io_data,
+									BufferManagerRelation *bmr,
+									ForkNumber *forknum,
+									BlockNumber *blocknum,
+									ReadBufferMode *mode);
 
 static XLogPrefetchStats *SharedStats;
 
-static inline LsnReadQueue *
-lrq_alloc(uint32 max_distance,
-		  uint32 max_inflight,
-		  uintptr_t lrq_private,
-		  LsnReadQueueNextFun next)
-{
-	LsnReadQueue *lrq;
-	uint32		size;
-
-	Assert(max_distance >= max_inflight);
-
-	size = max_distance + 1;	/* full ring buffer has a gap */
-	lrq = palloc(offsetof(LsnReadQueue, queue) + sizeof(lrq->queue[0]) * size);
-	lrq->lrq_private = lrq_private;
-	lrq->max_inflight = max_inflight;
-	lrq->size = size;
-	lrq->next = next;
-	lrq->head = 0;
-	lrq->tail = 0;
-	lrq->inflight = 0;
-	lrq->completed = 0;
-
-	return lrq;
-}
-
-static inline void
-lrq_free(LsnReadQueue *lrq)
-{
-	pfree(lrq);
-}
-
-static inline uint32
-lrq_inflight(LsnReadQueue *lrq)
-{
-	return lrq->inflight;
-}
-
-static inline uint32
-lrq_completed(LsnReadQueue *lrq)
-{
-	return lrq->completed;
-}
-
-static inline void
-lrq_prefetch(LsnReadQueue *lrq)
-{
-	/* Try to start as many IOs as we can within our limits. */
-	while (lrq->inflight < lrq->max_inflight &&
-		   lrq->inflight + lrq->completed < lrq->size - 1)
-	{
-		Assert(((lrq->head + 1) % lrq->size) != lrq->tail);
-		switch (lrq->next(lrq->lrq_private, &lrq->queue[lrq->head].lsn))
-		{
-			case LRQ_NEXT_AGAIN:
-				return;
-			case LRQ_NEXT_IO:
-				lrq->queue[lrq->head].io = true;
-				lrq->inflight++;
-				break;
-			case LRQ_NEXT_NO_IO:
-				lrq->queue[lrq->head].io = false;
-				lrq->completed++;
-				break;
-		}
-		lrq->head++;
-		if (lrq->head == lrq->size)
-			lrq->head = 0;
-	}
-}
-
-static inline void
-lrq_complete_lsn(LsnReadQueue *lrq, XLogRecPtr lsn)
-{
-	/*
-	 * We know that LSNs before 'lsn' have been replayed, so we can now assume
-	 * that any IOs that were started before then have finished.
-	 */
-	while (lrq->tail != lrq->head &&
-		   lrq->queue[lrq->tail].lsn < lsn)
-	{
-		if (lrq->queue[lrq->tail].io)
-			lrq->inflight--;
-		else
-			lrq->completed--;
-		lrq->tail++;
-		if (lrq->tail == lrq->size)
-			lrq->tail = 0;
-	}
-	if (RecoveryPrefetchEnabled())
-		lrq_prefetch(lrq);
-}
-
 size_t
 XLogPrefetchShmemSize(void)
 {
@@ -395,7 +254,23 @@ XLogPrefetcherAllocate(XLogReaderState *reader)
 void
 XLogPrefetcherFree(XLogPrefetcher *prefetcher)
 {
-	lrq_free(prefetcher->streaming_read);
+	/*
+	 * Redo records are normally expected to read and then release any buffers
+	 * referenced by a WAL record, but we may need to release buffers for the
+	 * final block if recovery ends without replaying a record.
+	 */
+	if (prefetcher->reader->record)
+	{
+		DecodedXLogRecord *last_record = prefetcher->reader->record;
+
+		for (int i = 0; i <= last_record->max_block_id; ++i)
+		{
+			if (last_record->blocks[i].in_use &&
+				BufferIsValid(last_record->blocks[i].prefetch_buffer))
+				ReleaseBuffer(last_record->blocks[i].prefetch_buffer);
+		}
+	}
+	pg_streaming_read_free(prefetcher->streaming_read);
 	hash_destroy(prefetcher->filter_table);
 	pfree(prefetcher);
 }
@@ -415,8 +290,8 @@ XLogPrefetcherGetReader(XLogPrefetcher *prefetcher)
 void
 XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher)
 {
-	uint32		io_depth;
-	uint32		completed;
+	int			ios;
+	int			pins;
 	int64		wal_distance;
 
 
@@ -432,13 +307,13 @@ XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher)
 		wal_distance = 0;
 	}
 
-	/* How many IOs are currently in flight and completed? */
-	io_depth = lrq_inflight(prefetcher->streaming_read);
-	completed = lrq_completed(prefetcher->streaming_read);
+	/* How many IOs are currently in progress, and how many pins do we have? */
+	ios = pg_streaming_read_ios(prefetcher->streaming_read);
+	pins = pg_streaming_read_pins(prefetcher->streaming_read);
 
 	/* Update the instantaneous stats visible in pg_stat_recovery_prefetch. */
-	SharedStats->io_depth = io_depth;
-	SharedStats->block_distance = io_depth + completed;
+	SharedStats->io_depth = ios;
+	SharedStats->block_distance = pins;
 	SharedStats->wal_distance = wal_distance;
 
 	prefetcher->next_stats_shm_lsn =
@@ -446,23 +321,18 @@ XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher)
 }
 
 /*
- * A callback that examines the next block reference in the WAL, and possibly
- * starts an IO so that a later read will be fast.
- *
- * Returns LRQ_NEXT_AGAIN if no more WAL data is available yet.
- *
- * Returns LRQ_NEXT_IO if the next block reference is for a main fork block
- * that isn't in the buffer pool, and the kernel has been asked to start
- * reading it to make a future read system call faster. An LSN is written to
- * *lsn, and the I/O will be considered to have completed once that LSN is
- * replayed.
- *
- * Returns LRQ_NEXT_NO_IO if we examined the next block reference and found
- * that it was already in the buffer pool, or we decided for various reasons
- * not to prefetch.
+ * A PgStreamingRead callback that generates a stream of block references by
+ * looking ahead in the WAL, which XLogPrefetcherReadRecord() will later
+ * retrieve.
  */
-static LsnReadQueueNextStatus
-XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
+static bool
+XLogPrefetcherNextBlock(PgStreamingRead *pgsr,
+						uintptr_t pgsr_private,
+						void *per_io_data,
+						BufferManagerRelation *bmr,
+						ForkNumber *forknum,
+						BlockNumber *blocknum,
+						ReadBufferMode *mode)
 {
 	XLogPrefetcher *prefetcher = (XLogPrefetcher *) pgsr_private;
 	XLogReaderState *reader = prefetcher->reader;
@@ -475,6 +345,8 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 	for (;;)
 	{
 		DecodedXLogRecord *record;
+		uint8		rmid;
+		uint8		record_type;
 
 		/* Try to read a new future record, if we don't already have one. */
 		if (prefetcher->record == NULL)
@@ -491,7 +363,7 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 
 			/* Readahead is disabled until we replay past a certain point. */
 			if (nonblocking && replaying_lsn <= prefetcher->no_readahead_until)
-				return LRQ_NEXT_AGAIN;
+				return false;
 
 			record = XLogReadAhead(prefetcher->reader, nonblocking);
 			if (record == NULL)
@@ -505,23 +377,20 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 					prefetcher->no_readahead_until =
 						prefetcher->reader->decode_queue_tail->lsn;
 
-				return LRQ_NEXT_AGAIN;
-			}
-
-			/*
-			 * If prefetching is disabled, we don't need to analyze the record
-			 * or issue any prefetches.  We just need to cause one record to
-			 * be decoded.
-			 */
-			if (!RecoveryPrefetchEnabled())
-			{
-				*lsn = InvalidXLogRecPtr;
-				return LRQ_NEXT_NO_IO;
+				return false;
 			}
 
 			/* We have a new record to process. */
 			prefetcher->record = record;
 			prefetcher->next_block_id = 0;
+
+			/*
+			 * If prefetching is disabled, we don't need to analyze the
+			 * record.  We just needed to cause one record to be decoded, so
+			 * we can give up now.
+			 */
+			if (!RecoveryPrefetchEnabled())
+				return false;
 		}
 		else
 		{
@@ -529,15 +398,15 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 			record = prefetcher->record;
 		}
 
+		rmid = record->header.xl_rmid;
+		record_type = record->header.xl_info & ~XLR_INFO_MASK;
+
 		/*
 		 * Check for operations that require us to filter out block ranges, or
 		 * pause readahead completely.
 		 */
 		if (replaying_lsn < record->lsn)
 		{
-			uint8		rmid = record->header.xl_rmid;
-			uint8		record_type = record->header.xl_info & ~XLR_INFO_MASK;
-
 			if (rmid == RM_XLOG_ID)
 			{
 				if (record_type == XLOG_CHECKPOINT_SHUTDOWN ||
@@ -589,6 +458,51 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 						 rlocator.dbOid,
 						 LSN_FORMAT_ARGS(record->lsn));
 #endif
+
+					/*
+					 * Don't prefetch anything in the source database either,
+					 * because FlushDatabaseBuffers() doesn't like to see any
+					 * pinned buffers.
+					 */
+					rlocator.dbOid = xlrec->src_db_id;
+					XLogPrefetcherAddFilter(prefetcher, rlocator, 0, record->lsn);
+
+#ifdef XLOGPREFETCHER_DEBUG_LEVEL
+					elog(XLOGPREFETCHER_DEBUG_LEVEL,
+						 "suppressing prefetch in database %u until %X/%X is replayed due to raw file copy",
+						 rlocator.dbOid,
+						 LSN_FORMAT_ARGS(record->lsn));
+#endif
+				}
+				else if (record_type == XLOG_DBASE_CREATE_WAL_LOG)
+				{
+					xl_dbase_create_wal_log_rec *xlrec =
+						(xl_dbase_create_wal_log_rec *) record->main_data;
+					RelFileLocator rlocator =
+					{InvalidOid, xlrec->db_id, InvalidRelFileNumber};
+
+					/*
+					 * As above, we don't want to pin buffers on the other
+					 * side of a DROP, CREATE DATABASE sequence that recycles
+					 * database OIDs.
+					 */
+					XLogPrefetcherAddFilter(prefetcher, rlocator, 0, record->lsn);
+				}
+				else if (record_type == XLOG_DBASE_DROP)
+				{
+					xl_dbase_drop_rec *xlrec =
+						(xl_dbase_drop_rec *) record->main_data;
+					RelFileLocator rlocator =
+					{InvalidOid, xlrec->db_id, InvalidRelFileNumber};
+
+					/*
+					 * XLOG_DBASE_DROP can be used while moving between
+					 * tablespaces, and in that case it invalidates buffers
+					 * from all tablespaces.  That means that we also have to
+					 * wait for the drop before prefetching anything in this
+					 * whole DB.
+					 */
+					XLogPrefetcherAddFilter(prefetcher, rlocator, 0, record->lsn);
 				}
 			}
 			else if (rmid == RM_SMGR_ID)
@@ -645,6 +559,53 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 #endif
 				}
 			}
+			else if (rmid == RM_XACT_ID)
+			{
+				/*
+				 * If this is a COMMIT/ABORT that drops relations, the
+				 * SMgrRelation pointers to those relations that we have
+				 * already queued for streaming read will become dangling
+				 * pointers after this is replayed due to smgrclose(), so
+				 * pause prefetching here until that's done.
+				 */
+				if (unlikely(record->header.xl_info & XACT_XINFO_HAS_RELFILELOCATORS))
+				{
+					RelFileLocator *xlocators;
+					int			nrels = 0;
+
+					if (record_type == XLOG_XACT_COMMIT ||
+						record_type == XLOG_XACT_COMMIT_PREPARED)
+					{
+						xl_xact_parsed_commit parsed;
+
+						ParseCommitRecord(record->header.xl_info,
+										  (xl_xact_commit *) record->main_data,
+										  &parsed);
+
+						xlocators = parsed.xlocators;
+						nrels = parsed.nrels;
+					}
+					else if (record_type == XLOG_XACT_ABORT)
+					{
+						xl_xact_parsed_abort parsed;
+
+						ParseAbortRecord(record->header.xl_info,
+										 (xl_xact_abort *) record->main_data,
+										 &parsed);
+
+						xlocators = parsed.xlocators;
+						nrels = parsed.nrels;
+					}
+
+					/*
+					 * Filter out these relations until the record is
+					 * replayed.
+					 */
+					for (int i = 0; i < nrels; ++i)
+						XLogPrefetcherAddFilter(prefetcher, xlocators[i], 0,
+												record->lsn);
+				}
+			}
 		}
 
 		/* Scan the block references, starting where we left off last time. */
@@ -653,69 +614,39 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 			int			block_id = prefetcher->next_block_id++;
 			DecodedBkpBlock *block = &record->blocks[block_id];
 			SMgrRelation reln;
-			PrefetchBufferResult result;
 
 			if (!block->in_use)
 				continue;
 
 			Assert(!BufferIsValid(block->prefetch_buffer));
+			Assert(!block->prefetch_buffer_streamed);
 
 			/*
-			 * Record the LSN of this record.  When it's replayed,
-			 * LsnReadQueue will consider any IOs submitted for earlier LSNs
-			 * to be finished.
+			 * We only stream the main fork only, for now.  Some of the other
+			 * forks require RBM_ZERO_ON_ERROR and have unusual logging
+			 * protocols.
 			 */
-			*lsn = record->lsn;
-
-			/* We don't try to prefetch anything but the main fork for now. */
 			if (block->forknum != MAIN_FORKNUM)
-			{
-				return LRQ_NEXT_NO_IO;
-			}
+				continue;
 
 			/*
-			 * If there is a full page image attached, we won't be reading the
-			 * page, so don't bother trying to prefetch.
+			 * FPI_FOR_HINT is special, with full_page_writes = off and
+			 * wal_log_hints = true, the record is solely used to include
+			 * knowledge about modified blocks in the WAL.
 			 */
-			if (block->has_image)
+			if (rmid == RM_XLOG_ID && record_type == XLOG_FPI_FOR_HINT &&
+				!block->has_image)
 			{
 				XLogPrefetchIncrement(&SharedStats->skip_fpw);
-				return LRQ_NEXT_NO_IO;
-			}
-
-			/* There is no point in reading a page that will be zeroed. */
-			if (block->flags & BKPBLOCK_WILL_INIT)
-			{
-				XLogPrefetchIncrement(&SharedStats->skip_init);
-				return LRQ_NEXT_NO_IO;
+				continue;
 			}
 
 			/* Should we skip prefetching this block due to a filter? */
 			if (XLogPrefetcherIsFiltered(prefetcher, block->rlocator, block->blkno))
 			{
 				XLogPrefetchIncrement(&SharedStats->skip_new);
-				return LRQ_NEXT_NO_IO;
-			}
-
-			/* There is no point in repeatedly prefetching the same block. */
-			for (int i = 0; i < XLOGPREFETCHER_SEQ_WINDOW_SIZE; ++i)
-			{
-				if (block->blkno == prefetcher->recent_block[i] &&
-					RelFileLocatorEquals(block->rlocator, prefetcher->recent_rlocator[i]))
-				{
-					/*
-					 * XXX If we also remembered where it was, we could set
-					 * recent_buffer so that recovery could skip smgropen()
-					 * and a buffer table lookup.
-					 */
-					XLogPrefetchIncrement(&SharedStats->skip_rep);
-					return LRQ_NEXT_NO_IO;
-				}
+				continue;
 			}
-			prefetcher->recent_rlocator[prefetcher->recent_idx] = block->rlocator;
-			prefetcher->recent_block[prefetcher->recent_idx] = block->blkno;
-			prefetcher->recent_idx =
-				(prefetcher->recent_idx + 1) % XLOGPREFETCHER_SEQ_WINDOW_SIZE;
 
 			/*
 			 * We could try to have a fast path for repeated references to the
@@ -744,7 +675,7 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 				XLogPrefetcherAddFilter(prefetcher, block->rlocator, 0,
 										record->lsn);
 				XLogPrefetchIncrement(&SharedStats->skip_new);
-				return LRQ_NEXT_NO_IO;
+				continue;
 			}
 
 			/*
@@ -766,41 +697,33 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 				XLogPrefetcherAddFilter(prefetcher, block->rlocator, block->blkno,
 										record->lsn);
 				XLogPrefetchIncrement(&SharedStats->skip_new);
-				return LRQ_NEXT_NO_IO;
+				continue;
 			}
 
-			/* Try to initiate prefetching. */
-			result = PrefetchSharedBuffer(reln, block->forknum, block->blkno);
-			if (BufferIsValid(result.recent_buffer))
-			{
-				/* Cache hit, nothing to do. */
-				XLogPrefetchIncrement(&SharedStats->hit);
-				block->prefetch_buffer = result.recent_buffer;
-				return LRQ_NEXT_NO_IO;
-			}
-			else if (result.initiated_io)
-			{
-				/* Cache miss, I/O (presumably) started. */
-				XLogPrefetchIncrement(&SharedStats->prefetch);
-				block->prefetch_buffer = InvalidBuffer;
-				return LRQ_NEXT_IO;
-			}
-			else if ((io_direct_flags & IO_DIRECT_DATA) == 0)
-			{
-				/*
-				 * This shouldn't be possible, because we already determined
-				 * that the relation exists on disk and is big enough.
-				 * Something is wrong with the cache invalidation for
-				 * smgrexists(), smgrnblocks(), or the file was unlinked or
-				 * truncated beneath our feet?
-				 */
-				elog(ERROR,
-					 "could not prefetch relation %u/%u/%u block %u",
-					 reln->smgr_rlocator.locator.spcOid,
-					 reln->smgr_rlocator.locator.dbOid,
-					 reln->smgr_rlocator.locator.relNumber,
-					 block->blkno);
-			}
+			/*
+			 * Stream this block!  It will be looked up and pinned by the
+			 * streaming read (possibly combining I/Os with nearby blocks),
+			 * and XLogPrefetcherReadRecord() will consume it from the
+			 * streaming read object, and make it available for
+			 * XLogReadBufferForRedo() to provide to a redo routine.
+			 */
+			block->prefetch_buffer_streamed = true;
+
+			/*
+			 * If there's an FPI or redo is just going to zero it, we can skip
+			 * a useless I/O, and stream a pinned buffer that
+			 * XLogReadBufferForRedo() will pass to ZeroBuffer().
+			 */
+			if (block->apply_image ||
+				(block->flags & BKPBLOCK_WILL_INIT))
+				*mode = RBM_WILL_ZERO;
+			else
+				*mode = RBM_NORMAL;
+			*bmr = BMR_SMGR(reln, RELPERSISTENCE_PERMANENT);
+			*forknum = block->forknum;
+			*blocknum = block->blkno;
+
+			return true;
 		}
 
 		/*
@@ -815,7 +738,7 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 		 */
 		if (prefetcher->reader->decode_queue_tail &&
 			prefetcher->reader->decode_queue_tail->lsn == prefetcher->begin_ptr)
-			return LRQ_NEXT_AGAIN;
+			return false;
 
 		/* Advance to the next record. */
 		prefetcher->record = NULL;
@@ -989,34 +912,64 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	DecodedXLogRecord *record;
 	XLogRecPtr	replayed_up_to;
 
+#ifdef USE_ASSERT_CHECKING
+
+	/*
+	 * The recovery code (ie individual redo routines) should have called
+	 * XLogReadBufferForRedo() for all registered buffers.  Here we'll assert
+	 * that that's the case.
+	 */
+	if (prefetcher->reader->record)
+	{
+		DecodedXLogRecord *last_record = prefetcher->reader->record;
+
+		for (int i = 0; i <= last_record->max_block_id; ++i)
+		{
+			if (last_record->blocks[i].in_use &&
+				BufferIsValid(last_record->blocks[i].prefetch_buffer))
+				elog(ERROR,
+					 "redo routine did not read buffer pinned by prefetcher, LSN %X/%X",
+					 LSN_FORMAT_ARGS(last_record->lsn));
+		}
+	}
+#endif
+
 	/*
 	 * See if it's time to reset the prefetching machinery, because a relevant
 	 * GUC was changed.
 	 */
 	if (unlikely(XLogPrefetchReconfigureCount != prefetcher->reconfigure_count))
 	{
-		uint32		max_distance;
-		uint32		max_inflight;
+		int			max_ios;
 
 		if (prefetcher->streaming_read)
-			lrq_free(prefetcher->streaming_read);
+			pg_streaming_read_free(prefetcher->streaming_read);
 
 		if (RecoveryPrefetchEnabled())
 		{
-			Assert(maintenance_io_concurrency > 0);
-			max_inflight = maintenance_io_concurrency;
-			max_distance = max_inflight * XLOGPREFETCHER_DISTANCE_MULTIPLIER;
+			/*
+			 * maintenance_io_concurrency might be 0 on some systems, but we
+			 * need at least 1 to function.
+			 */
+			max_ios = Max(maintenance_io_concurrency, 1);
 		}
 		else
 		{
-			max_inflight = 1;
-			max_distance = 1;
+			/*
+			 * Non-zero values are needed, but the NextBlock callback will
+			 * never actually do any prefetching, so the streaming read will
+			 * not be used except as a way to force records to be decoded one
+			 * by one.
+			 */
+			max_ios = 1;
 		}
 
-		prefetcher->streaming_read = lrq_alloc(max_distance,
-											   max_inflight,
-											   (uintptr_t) prefetcher,
-											   XLogPrefetcherNextBlock);
+		prefetcher->streaming_read =
+			pg_streaming_read_buffer_alloc(max_ios,
+										   0,
+										   (uintptr_t) prefetcher,
+										   NULL,
+										   XLogPrefetcherNextBlock);
 
 		prefetcher->reconfigure_count = XLogPrefetchReconfigureCount;
 	}
@@ -1035,20 +988,15 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	XLogPrefetcherCompleteFilters(prefetcher, replayed_up_to);
 
 	/*
-	 * All IO initiated by earlier WAL is now completed.  This might trigger
-	 * further prefetching.
-	 */
-	lrq_complete_lsn(prefetcher->streaming_read, replayed_up_to);
-
-	/*
-	 * If there's nothing queued yet, then start prefetching to cause at least
-	 * one record to be queued.
+	 * If there's nothing queued yet, then start prefetching.  Normally this
+	 * happens automatically when we call streaming_read_get_next() below to
+	 * complete earlier IOs, but if we didn't have a special case for an empty
+	 * queue we'd never be able to get started.
 	 */
 	if (!XLogReaderHasQueuedRecordOrError(prefetcher->reader))
 	{
-		Assert(lrq_inflight(prefetcher->streaming_read) == 0);
-		Assert(lrq_completed(prefetcher->streaming_read) == 0);
-		lrq_prefetch(prefetcher->streaming_read);
+		pg_streaming_read_reset(prefetcher->streaming_read);
+		pg_streaming_read_prefetch(prefetcher->streaming_read);
 	}
 
 	/* Read the next record. */
@@ -1078,23 +1026,64 @@ XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 	if (unlikely(record->lsn >= prefetcher->next_stats_shm_lsn))
 		XLogPrefetcherComputeStats(prefetcher);
 
-	Assert(record == prefetcher->reader->record);
+	/*
+	 * Make sure that any IOs we initiated earlier for this record are
+	 * completed, by pulling the buffers out of the StreamingRead.
+	 */
+	for (int block_id = 0; block_id <= record->max_block_id; ++block_id)
+	{
+		DecodedBkpBlock *block = &record->blocks[block_id];
+		Buffer		buffer;
 
-	return &record->header;
-}
+		if (!block->in_use)
+			continue;
 
-bool
-check_recovery_prefetch(int *new_value, void **extra, GucSource source)
-{
-#ifndef USE_PREFETCH
-	if (*new_value == RECOVERY_PREFETCH_ON)
-	{
-		GUC_check_errdetail("recovery_prefetch is not supported on platforms that lack posix_fadvise().");
-		return false;
-	}
+		/*
+		 * If we haven't streamed this buffer, then
+		 * XLogReadBufferForRedoExtended() will just have to read it the
+		 * traditional way.
+		 */
+		if (!block->prefetch_buffer_streamed)
+			continue;
+
+		Assert(!BufferIsValid(block->prefetch_buffer));
+
+		/*
+		 * Otherwise we already have a pinned buffer waiting for us in the
+		 * streaming read.
+		 */
+		buffer = pg_streaming_read_buffer_get_next(prefetcher->streaming_read, NULL);
+
+		if (buffer == InvalidBuffer)
+			elog(PANIC, "unexpectedly ran out of buffers in streaming read");
+
+		block->prefetch_buffer = buffer;
+		block->prefetch_buffer_streamed = false;
+
+		/*
+		 * Assert that we're in sync with XLogPrefetcherNextBlock(), which is
+		 * feeding blocks into the far end of the pipe.  For every decoded
+		 * block that has the prefetch_buffer_streamed flag set, in order, we
+		 * expect the corresponding already-pinned buffer to be the next to
+		 * come out of streaming_read.
+		 */
+#ifdef USE_ASSERT_CHECKING
+		{
+			RelFileLocator rlocator;
+			ForkNumber	forknum;
+			BlockNumber blocknum;
+
+			BufferGetTag(buffer, &rlocator, &forknum, &blocknum);
+			Assert(RelFileLocatorEquals(rlocator, block->rlocator));
+			Assert(forknum == block->forknum);
+			Assert(blocknum == block->blkno);
+		}
 #endif
+	}
 
-	return true;
+	Assert(record == prefetcher->reader->record);
+
+	return &record->header;
 }
 
 void
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index c9f9f6e98f..de6cd926ad 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1750,6 +1750,7 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
 
 			blk->prefetch_buffer = InvalidBuffer;
+			blk->prefetch_buffer_streamed = false;
 
 			COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 			/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
@@ -1978,6 +1979,9 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
  * If the WAL record contains a block reference with the given ID, *rlocator,
  * *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and
  * returns true.  Otherwise returns false.
+ *
+ * If prefetch_buffer is not NULL, the buffer is already pinned, and ownership
+ * of the pin is transferred to the caller.
  */
 bool
 XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
@@ -1998,7 +2002,15 @@ XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
 	if (blknum)
 		*blknum = bkpb->blkno;
 	if (prefetch_buffer)
+	{
 		*prefetch_buffer = bkpb->prefetch_buffer;
+
+		/*
+		 * Clear this flag is so that we can assert that redo records take
+		 * ownership of all buffers pinned by xlogprefetcher.c.
+		 */
+		bkpb->prefetch_buffer = InvalidBuffer;
+	}
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..204ec48072 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1551,8 +1551,8 @@ ShutdownWalRecovery(void)
 		close(readFile);
 		readFile = -1;
 	}
-	XLogReaderFree(xlogreader);
 	XLogPrefetcherFree(xlogprefetcher);
+	XLogReaderFree(xlogreader);
 
 	if (ArchiveRecoveryRequested)
 	{
@@ -2420,8 +2420,7 @@ verifyBackupPageConsistency(XLogReaderState *record)
 		 * temporary page.
 		 */
 		buf = XLogReadBufferExtended(rlocator, forknum, blkno,
-									 RBM_NORMAL_NO_LOG,
-									 InvalidBuffer);
+									 RBM_NORMAL_NO_LOG);
 		if (!BufferIsValid(buf))
 			continue;
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index bdc0f7e1da..b3382f9534 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -360,13 +360,13 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	RelFileLocator rlocator;
 	ForkNumber	forknum;
 	BlockNumber blkno;
-	Buffer		prefetch_buffer;
+	Buffer		streamed_buffer;
 	Page		page;
 	bool		zeromode;
 	bool		willinit;
 
 	if (!XLogRecGetBlockTagExtended(record, block_id, &rlocator, &forknum, &blkno,
-									&prefetch_buffer))
+									&streamed_buffer))
 	{
 		/* Caller specified a bogus block_id */
 		elog(PANIC, "failed to locate backup block with ID %d in WAL record",
@@ -384,13 +384,47 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	if (!willinit && zeromode)
 		elog(PANIC, "block to be initialized in redo routine must be marked with WILL_INIT flag in the WAL record");
 
+	/* Has xlogprefetcher.c streamed a pinned buffer to us? */
+	if (BufferIsValid(streamed_buffer))
+	{
+#ifdef USE_ASSERT_CHECKING
+		RelFileLocator xrlocator;
+		ForkNumber	xforknum;
+		BlockNumber xblocknum;
+
+		/* It must match the block requested in the WAL. */
+		BufferGetTag(streamed_buffer, &xrlocator, &xforknum, &xblocknum);
+		Assert(RelFileLocatorEquals(xrlocator, rlocator));
+		Assert(xforknum == forknum);
+		Assert(xblocknum == blkno);
+#endif
+	}
+
 	/* If it has a full-page image and it should be restored, do it. */
 	if (XLogRecBlockImageApply(record, block_id))
 	{
 		Assert(XLogRecHasBlockImage(record, block_id));
-		*buf = XLogReadBufferExtended(rlocator, forknum, blkno,
-									  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK,
-									  prefetch_buffer);
+		if (BufferIsValid(streamed_buffer))
+		{
+			/*
+			 * The prefetcher has pinned this buffer, but used RBM_WILL_ZERO
+			 * when it saw that an image would be restored, so the buffer
+			 * wasn't made BM_VALID.  The only thing we are allowed to do now
+			 * is call ZeroBuffer().  Doing this at the last moment means that
+			 * we only lock one block at a time, when the redo routine is
+			 * ready for it, and can tell us which kind of lock it wants.
+			 *
+			 * XXX There doesn't seem to be a good reason to zero a page
+			 * before copying an image over the top, but this matches the
+			 * traditional behavior for now.
+			 */
+			ZeroBuffer(streamed_buffer,
+					   get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_CLEANUP_LOCK);
+			*buf = streamed_buffer;
+		}
+		else
+			*buf = XLogReadBufferExtended(rlocator, forknum, blkno,
+										  get_cleanup_lock ? RBM_ZERO_AND_CLEANUP_LOCK : RBM_ZERO_AND_LOCK);
 		page = BufferGetPage(*buf);
 		if (!RestoreBlockImage(record, block_id, page))
 			ereport(ERROR,
@@ -421,7 +455,22 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	}
 	else
 	{
-		*buf = XLogReadBufferExtended(rlocator, forknum, blkno, mode, prefetch_buffer);
+		if (BufferIsValid(streamed_buffer))
+		{
+			/*
+			 * If the redo routine wants a zeroed page, we do that here.  The
+			 * prefetcher streamed us a pinned but not necessarily BM_VALID
+			 * buffer, because it saw the BKPBLOCK_WILL_INIT flag and used
+			 * PGSR_NEXT_WILL_ZERO to avoid I/O for a page that will be
+			 * zeroed.  We only lock one page at a time, when the redo routine
+			 * is ready to tell us which type of lock it wants.
+			 */
+			if (zeromode)
+				ZeroBuffer(streamed_buffer, mode);
+			*buf = streamed_buffer;
+		}
+		else
+			*buf = XLogReadBufferExtended(rlocator, forknum, blkno, mode);
 		if (BufferIsValid(*buf))
 		{
 			if (mode != RBM_ZERO_AND_LOCK && mode != RBM_ZERO_AND_CLEANUP_LOCK)
@@ -472,8 +521,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  */
 Buffer
 XLogReadBufferExtended(RelFileLocator rlocator, ForkNumber forknum,
-					   BlockNumber blkno, ReadBufferMode mode,
-					   Buffer recent_buffer)
+					   BlockNumber blkno, ReadBufferMode mode)
 {
 	BlockNumber lastblock;
 	Buffer		buffer;
@@ -481,15 +529,6 @@ XLogReadBufferExtended(RelFileLocator rlocator, ForkNumber forknum,
 
 	Assert(blkno != P_NEW);
 
-	/* Do we have a clue where the buffer might be already? */
-	if (BufferIsValid(recent_buffer) &&
-		mode == RBM_NORMAL &&
-		ReadRecentBuffer(rlocator, forknum, blkno, recent_buffer))
-	{
-		buffer = recent_buffer;
-		goto recent_buffer_fast_path;
-	}
-
 	/* Open the relation at smgr level */
 	smgr = smgropen(rlocator, InvalidBackendId);
 
@@ -533,7 +572,6 @@ XLogReadBufferExtended(RelFileLocator rlocator, ForkNumber forknum,
 									 mode);
 	}
 
-recent_buffer_fast_path:
 	if (mode == RBM_NORMAL)
 	{
 		/* check that page has been initialized */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 678e3239a9..04dbede971 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4977,7 +4977,11 @@ LockBufferForCleanup(Buffer buffer)
 	Assert(BufferIsPinned(buffer));
 	Assert(PinCountWaitBuf == NULL);
 
-	CheckBufferIsPinnedOnce(buffer);
+#if 0
+	/* Due to prefetching, recovery might hold more than one local pin. */
+	if (!InRecovery)
+		BufferCheckOneLocalPin(buffer);
+#endif
 
 	/* Nobody else to wait for */
 	if (BufferIsLocal(buffer))
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index fb9440ff72..dbc9d87d42 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -212,7 +212,7 @@ XLogRecordPageWithFreeSpace(RelFileLocator rlocator, BlockNumber heapBlk,
 
 	/* If the page doesn't exist already, extend */
 	buf = XLogReadBufferExtended(rlocator, FSM_FORKNUM, blkno,
-								 RBM_ZERO_ON_ERROR, InvalidBuffer);
+								 RBM_ZERO_ON_ERROR);
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 	page = BufferGetPage(buf);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index e565a3092f..08dca1d8e6 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -375,7 +375,7 @@ static const struct config_enum_entry huge_pages_status_options[] = {
 static const struct config_enum_entry recovery_prefetch_options[] = {
 	{"off", RECOVERY_PREFETCH_OFF, false},
 	{"on", RECOVERY_PREFETCH_ON, false},
-	{"try", RECOVERY_PREFETCH_TRY, false},
+	{"try", RECOVERY_PREFETCH_ON, false},
 	{"true", RECOVERY_PREFETCH_ON, true},
 	{"false", RECOVERY_PREFETCH_OFF, true},
 	{"yes", RECOVERY_PREFETCH_ON, true},
@@ -4895,8 +4895,8 @@ struct config_enum ConfigureNamesEnum[] =
 			gettext_noop("Look ahead in the WAL to find references to uncached data.")
 		},
 		&recovery_prefetch,
-		RECOVERY_PREFETCH_TRY, recovery_prefetch_options,
-		check_recovery_prefetch, assign_recovery_prefetch, NULL
+		RECOVERY_PREFETCH_ON, recovery_prefetch_options,
+		NULL, assign_recovery_prefetch, NULL
 	},
 
 	{
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
index 7dd7f20ad0..91c9b25050 100644
--- a/src/include/access/xlogprefetcher.h
+++ b/src/include/access/xlogprefetcher.h
@@ -24,8 +24,7 @@ extern PGDLLIMPORT int recovery_prefetch;
 typedef enum
 {
 	RECOVERY_PREFETCH_OFF,
-	RECOVERY_PREFETCH_ON,
-	RECOVERY_PREFETCH_TRY
+	RECOVERY_PREFETCH_ON
 }			RecoveryPrefetchValue;
 
 struct XLogPrefetcher;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index da32c7db77..c11c364cbf 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -128,6 +128,7 @@ typedef struct
 
 	/* Prefetching workspace. */
 	Buffer		prefetch_buffer;
+	bool		prefetch_buffer_streamed;
 
 	/* copy of the fork_flags field from the XLogRecordBlockHeader */
 	uint8		flags;
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 5b77b11f50..daf064924e 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -88,10 +88,8 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 													uint8 block_id,
 													ReadBufferMode mode, bool get_cleanup_lock,
 													Buffer *buf);
-
 extern Buffer XLogReadBufferExtended(RelFileLocator rlocator, ForkNumber forknum,
-									 BlockNumber blkno, ReadBufferMode mode,
-									 Buffer recent_buffer);
+									 BlockNumber blkno, ReadBufferMode mode);
 
 extern Relation CreateFakeRelcacheEntry(RelFileLocator rlocator);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
-- 
2.39.2

v1-0013-WIP-Provide-vectored-variant-of-FlushBuffer.patchtext/x-patch; charset=US-ASCII; name=v1-0013-WIP-Provide-vectored-variant-of-FlushBuffer.patchDownload

From 631478435f4aff6e55fecd03ea27aca37e5f232e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 27 Jul 2023 10:58:26 +1200
Subject: [PATCH v1 13/14] WIP: Provide vectored variant of FlushBuffer().

FlushBuffers() is just like FlushBuffer() except that it tries to write
out multiple consecutive disk blocks at a time with smgrwritev().  The
traditional FlushBuffer() call is now implemented in terms of it.

Author: Thomas Munro <thomas.munro@gmail.com>
---
 src/backend/storage/buffer/bufmgr.c | 235 ++++++++++++++++++++--------
 src/include/storage/buf_internals.h |  10 ++
 2 files changed, 181 insertions(+), 64 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 04dbede971..61eceb4b99 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -53,6 +53,7 @@
 #include "storage/smgr.h"
 #include "storage/standby.h"
 #include "utils/memdebug.h"
+#include "utils/memutils.h"
 #include "utils/ps_status.h"
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
@@ -478,6 +479,11 @@ static void WaitIO(BufferDesc *buf);
 static bool StartBufferIO(BufferDesc *buf, bool forInput);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits);
+struct shared_buffer_write_error_info
+{
+	BufferDesc *buf;
+	int			nblocks;
+};
 static void shared_buffer_write_error_callback(void *arg);
 static void local_buffer_write_error_callback(void *arg);
 static BufferDesc *BufferAlloc(SMgrRelation smgr,
@@ -488,6 +494,8 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
 							   bool *foundPtr, bool *allocPtr,
 							   IOContext io_context);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
+static int	FlushBuffers(BufferDesc **bufs, int nbuffers, SMgrRelation reln,
+						 IOObject io_object, IOContext io_context);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
@@ -3463,8 +3471,12 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
 }
 
 /*
- * FlushBuffer
- *		Physically write out a shared buffer.
+ * FlushBuffers
+ *		Physically write out shared buffers.
+ *
+ * The buffers do not have to be consecutive in memory but must refer to
+ * consecutive blocks of the same relation fork in increasing order of block
+ * number.
  *
  * NOTE: this actually just passes the buffer contents to the kernel; the
  * real write to disk won't happen until the kernel feels like it.  This
@@ -3472,66 +3484,148 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
  * However, we will need to force the changes to disk via fsync before
  * we can checkpoint WAL.
  *
- * The caller must hold a pin on the buffer and have share-locked the
+ * The caller must hold a pin on the buffers and have share-locked the
  * buffer contents.  (Note: a share-lock does not prevent updates of
  * hint bits in the buffer, so the page could change while the write
  * is in progress, but we assume that that will not invalidate the data
  * written.)
  *
  * If the caller has an smgr reference for the buffer's relation, pass it
- * as the second parameter.  If not, pass NULL.
+ * as the third parameter.  If not, pass NULL.
  */
-static void
-FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
-			IOContext io_context)
+static int
+FlushBuffers(BufferDesc **bufs, int nbuffers, SMgrRelation reln,
+			 IOObject io_object, IOContext io_context)
 {
-	XLogRecPtr	recptr;
+	XLogRecPtr	max_recptr;
+	struct shared_buffer_write_error_info errinfo;
 	ErrorContextCallback errcallback;
 	instr_time	io_start;
-	Block		bufBlock;
-	char	   *bufToWrite;
-	uint32		buf_state;
+	int			first_start_io_index;
+	void	   *bufBlocks[MAX_BUFFERS_PER_TRANSFER];
+	bool		need_checksums = DataChecksumsEnabled();
+	static PGIOAlignedBlock *copies;
+
+	Assert(nbuffers > 0);
+	nbuffers = Min(nbuffers, lengthof(bufBlocks));
 
 	/*
-	 * Try to start an I/O operation.  If StartBufferIO returns false, then
-	 * someone else flushed the buffer before we could, so we need not do
-	 * anything.
+	 * Update page checksums if desired.  Since we have only shared lock on
+	 * the buffer, other processes might be updating hint bits in it, so we
+	 * must copy the page to private storage if we do checksumming.
+	 *
+	 * XXX:TODO kill static local memory, or better yet, kill unlocked hint
+	 * bit modifications so we don't need this
 	 */
-	if (!StartBufferIO(buf, false))
-		return;
+	if (need_checksums)
+	{
+		if (!copies)
+			copies = MemoryContextAllocAligned(TopMemoryContext,
+											   BLCKSZ * lengthof(bufBlocks),
+											   PG_IO_ALIGN_SIZE,
+											   0);
+		for (int i = 0; i < nbuffers; ++i)
+		{
+			memcpy(&copies[i], BufHdrGetBlock(bufs[i]), BLCKSZ);
+			PageSetChecksumInplace((Page) &copies[i],
+								   bufs[0]->tag.blockNum + i);
+		}
+	}
+
+	/*
+	 * Try to start an I/O operation on as many buffers as we can.  If
+	 * StartBufferIO returns false, then someone else flushed the buffer
+	 * before we could, so we need not do anything any we give up on any
+	 * remaining buffers, as they are not consecutive with the blocks we've
+	 * collected.
+	 */
+	first_start_io_index = -1;
+	for (int i = 0; i < nbuffers; ++i)
+	{
+		/* Must be consecutive blocks. */
+		if (i > 0)
+			Assert(BufferTagsConsecutive(&bufs[i - 1]->tag, &bufs[i]->tag));
+
+		if (!StartBufferIO(bufs[i], false))
+		{
+			if (first_start_io_index >= 0)
+			{
+				/*
+				 * We can't go any further because later blocks after this gap
+				 * would not be consecutive.  Cap nbuffers here.
+				 */
+				nbuffers = i;
+				break;
+			}
+			else
+			{
+				/*
+				 * Still searching for the first block we can win the right to
+				 * start I/O on, so it doesn't matter that we couldn't get
+				 * this one.
+				 */
+			}
+		}
+		else
+		{
+			/* Keep track of the first block we started I/O on. */
+			if (first_start_io_index < 0)
+				first_start_io_index = i;
+		}
+		/* Collect the source buffers (copies or shared buffers). */
+		bufBlocks[i] = need_checksums ? &copies[i] : BufHdrGetBlock(bufs[i]);
+	}
+
+	/* If we can't write even one buffer, then we're done. */
+	if (first_start_io_index < 0)
+		return nbuffers;
 
 	/* Setup error traceback support for ereport() */
+	errinfo.buf = bufs[first_start_io_index];
+	errinfo.nblocks = nbuffers;
 	errcallback.callback = shared_buffer_write_error_callback;
-	errcallback.arg = (void *) buf;
+	errcallback.arg = &errinfo;
 	errcallback.previous = error_context_stack;
 	error_context_stack = &errcallback;
 
 	/* Find smgr relation for buffer */
 	if (reln == NULL)
-		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), InvalidBackendId);
+		reln = smgropen(BufTagGetRelFileLocator(&bufs[first_start_io_index]->tag),
+						InvalidBackendId);
 
-	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
-										buf->tag.blockNum,
+	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&bufs[first_start_io_index]->tag),
+										bufs[first_start_io_index]->tag.blockNum,
 										reln->smgr_rlocator.locator.spcOid,
 										reln->smgr_rlocator.locator.dbOid,
 										reln->smgr_rlocator.locator.relNumber);
 
-	buf_state = LockBufHdr(buf);
+	/* Find the highest LSN across all the I/O-started buffers. */
+	max_recptr = 0;
+	for (int i = first_start_io_index; i < nbuffers; ++i)
+	{
+		XLogRecPtr	page_recptr;
+		uint32		buf_state;
 
-	/*
-	 * Run PageGetLSN while holding header lock, since we don't have the
-	 * buffer locked exclusively in all cases.
-	 */
-	recptr = BufferGetLSN(buf);
+		buf_state = LockBufHdr(bufs[i]);
 
-	/* To check if block content changes while flushing. - vadim 01/17/97 */
-	buf_state &= ~BM_JUST_DIRTIED;
-	UnlockBufHdr(buf, buf_state);
+		/*
+		 * Run PageGetLSN while holding header lock, since we don't have the
+		 * buffer locked exclusively in all cases.
+		 */
+		page_recptr = BufferGetLSN(bufs[i]);
+
+		/* To check if block content changes while flushing. - vadim 01/17/97 */
+		buf_state &= ~BM_JUST_DIRTIED;
+		UnlockBufHdr(bufs[i], buf_state);
+
+		if (buf_state & BM_PERMANENT)
+			max_recptr = Max(page_recptr, max_recptr);
+	}
 
 	/*
-	 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
-	 * rule that log updates must hit disk before any of the data-file changes
-	 * they describe do.
+	 * Force XLOG flush up to buffers' greatest LSN.  This implements the
+	 * basic WAL rule that log updates must hit disk before any of the
+	 * data-file changes they describe do.
 	 *
 	 * However, this rule does not apply to unlogged relations, which will be
 	 * lost after a crash anyway.  Most unlogged relation pages do not bear
@@ -3545,33 +3639,23 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * disastrous system-wide consequences.  To make sure that can't happen,
 	 * skip the flush if the buffer isn't permanent.
 	 */
-	if (buf_state & BM_PERMANENT)
-		XLogFlush(recptr);
+	if (max_recptr > 0)
+		XLogFlush(max_recptr);
 
 	/*
-	 * Now it's safe to write buffer to disk. Note that no one else should
+	 * Now it's safe to write buffers to disk. Note that no one else should
 	 * have been able to write it while we were busy with log flushing because
-	 * only one process at a time can set the BM_IO_IN_PROGRESS bit.
+	 * only one process at a time can set the BM_IO_IN_PROGRESS bits.
 	 */
-	bufBlock = BufHdrGetBlock(buf);
-
-	/*
-	 * Update page checksum if desired.  Since we have only shared lock on the
-	 * buffer, other processes might be updating hint bits in it, so we must
-	 * copy the page to private storage if we do checksumming.
-	 */
-	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
 
 	io_start = pgstat_prepare_io_time();
 
-	/*
-	 * bufToWrite is either the shared buffer or a copy, as appropriate.
-	 */
-	smgrwrite(reln,
-			  BufTagGetForkNum(&buf->tag),
-			  buf->tag.blockNum,
-			  bufToWrite,
-			  false);
+	smgrwritev(reln,
+			   BufTagGetForkNum(&bufs[first_start_io_index]->tag),
+			   bufs[first_start_io_index]->tag.blockNum,
+			   (const void **) &bufBlocks[first_start_io_index],
+			   nbuffers - first_start_io_index,
+			   false);
 
 	/*
 	 * When a strategy is in use, only flushes of dirty buffers already in the
@@ -3594,22 +3678,38 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
 							IOOP_WRITE, io_start, 1);
 
-	pgBufferUsage.shared_blks_written++;
+	pgBufferUsage.shared_blks_written += nbuffers - first_start_io_index;
 
 	/*
-	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
+	 * Mark the buffers as clean (unless BM_JUST_DIRTIED has become set) and
 	 * end the BM_IO_IN_PROGRESS state.
 	 */
-	TerminateBufferIO(buf, true, 0);
+	for (int i = first_start_io_index; i < nbuffers; ++i)
+		TerminateBufferIO(bufs[i], true, 0);
 
-	TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(BufTagGetForkNum(&buf->tag),
-									   buf->tag.blockNum,
+	TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(BufTagGetForkNum(&bufs[first_start_io_index]->tag),
+									   bufs[0]->tag.blockNum,
 									   reln->smgr_rlocator.locator.spcOid,
 									   reln->smgr_rlocator.locator.dbOid,
 									   reln->smgr_rlocator.locator.relNumber);
 
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
+
+	return nbuffers;
+}
+
+/*
+ * FlushBuffer
+ *		Physically write out just one shared buffer.
+ *
+ * Single-block variant of FlushBuffers().
+ */
+static void
+FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+			IOContext io_context)
+{
+	FlushBuffers(&buf, 1, reln, io_object, io_context);
 }
 
 /*
@@ -5411,16 +5511,23 @@ AbortBufferIO(Buffer buffer)
 static void
 shared_buffer_write_error_callback(void *arg)
 {
-	BufferDesc *bufHdr = (BufferDesc *) arg;
+	struct shared_buffer_write_error_info *info = arg;
 
 	/* Buffer is pinned, so we can read the tag without locking the spinlock */
-	if (bufHdr != NULL)
+	if (info != NULL)
 	{
-		char	   *path = relpathperm(BufTagGetRelFileLocator(&bufHdr->tag),
-									   BufTagGetForkNum(&bufHdr->tag));
-
-		errcontext("writing block %u of relation %s",
-				   bufHdr->tag.blockNum, path);
+		char	   *path = relpathperm(BufTagGetRelFileLocator(&info->buf->tag),
+									   BufTagGetForkNum(&info->buf->tag));
+
+		if (info->nblocks > 1)
+			errcontext("writing blocks %u..%u of relation %s",
+					   info->buf->tag.blockNum,
+					   info->buf->tag.blockNum + info->nblocks - 1,
+					   path);
+		else
+			errcontext("writing block %u of relation %s",
+					   info->buf->tag.blockNum,
+					   path);
 		pfree(path);
 	}
 }
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 427a989aaf..6a493d2b96 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -159,6 +159,16 @@ BufferTagsEqual(const BufferTag *tag1, const BufferTag *tag2)
 		(tag1->forkNum == tag2->forkNum);
 }
 
+static inline bool
+BufferTagsConsecutive(const BufferTag *tag1, const BufferTag *tag2)
+{
+	return (tag1->spcOid == tag2->spcOid) &&
+		(tag1->dbOid == tag2->dbOid) &&
+		(tag1->relNumber == tag2->relNumber) &&
+		(tag1->forkNum == tag2->forkNum) &&
+		(tag1->blockNum + 1 == tag2->blockNum);
+}
+
 static inline bool
 BufTagMatchesRelFileLocator(const BufferTag *tag,
 							const RelFileLocator *rlocator)
-- 
2.39.2

v1-0014-WIP-Use-vector-writes-in-checkpointer.patchtext/x-patch; charset=US-ASCII; name=v1-0014-WIP-Use-vector-writes-in-checkpointer.patchDownload

From d8ea06b87cbbb5c68a0fb80e7779981f153ac6e3 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 27 Jul 2023 15:53:13 +1200
Subject: [PATCH v1 14/14] WIP: Use vector writes in checkpointer.

Experimental hack to teach the checkpointer to write out consecutive
blocks with FlushBuffers().  This would be refactored to use
"streaming write", but the basic idea is shown here.

Author: Thomas Munro <thomas.munro@gmail.com>
---
 src/backend/storage/buffer/bufmgr.c | 167 ++++++++++++++++++++++++----
 src/include/storage/buf_internals.h |   4 +-
 2 files changed, 148 insertions(+), 23 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 61eceb4b99..7de95b04bb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -495,7 +495,8 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
 							   IOContext io_context);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static int	FlushBuffers(BufferDesc **bufs, int nbuffers, SMgrRelation reln,
-						 IOObject io_object, IOContext io_context);
+						 IOObject io_object, IOContext io_context,
+						 int *first_written_index, int *written);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
@@ -1824,7 +1825,7 @@ again:
 		LWLockRelease(content_lock);
 
 		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-									  &buf_hdr->tag);
+									  &buf_hdr->tag, 1);
 	}
 
 
@@ -2608,6 +2609,72 @@ UnpinBuffer(BufferDesc *buf)
 #define ST_DEFINE
 #include <lib/sort_template.h>
 
+/*
+ * Flush a range of already pinned buffers that hold consecutive blocks of a
+ * relation fork.  They are not pinned on return.  Returns the number that
+ * were written out (if this is less than nbuffers, it is because another
+ * backend already wrote some out).
+ */
+static int
+SyncBuffers(BufferDesc **bufs, int nbuffers,
+			WritebackContext *wb_context)
+{
+	int			total_written = 0;
+
+	while (nbuffers > 0)
+	{
+		int			nlocked;
+
+		/* Lock first buffer. */
+		LWLockAcquire(BufferDescriptorGetContentLock(bufs[0]), LW_SHARED);
+		nlocked = 1;
+
+		/* Lock as many more as we can without waiting, to avoid deadlocks. */
+		while (nlocked < nbuffers &&
+			   LWLockConditionalAcquire(BufferDescriptorGetContentLock(bufs[nlocked]),
+										LW_SHARED))
+			nlocked++;
+
+		while (nlocked > 0)
+		{
+			int			flushed;
+			int			written;
+			int			first_written_index;
+
+			/*
+			 * Flush as many as we can with a single write, which may be fewer
+			 * than requested if buffers in this range turn out to have been
+			 * flushed already, creating gaps between flushable block ranges.
+			 */
+			flushed = FlushBuffers(bufs, nlocked, NULL, IOOBJECT_RELATION,
+								   IOCONTEXT_NORMAL,
+								   &first_written_index, &written);
+			total_written += written;
+
+			/* Unlock in reverse order (currently more efficient). */
+			for (int i = flushed - 1; i >= 0; --i)
+				LWLockRelease(BufferDescriptorGetContentLock(bufs[i]));
+
+			/* Queue writeback control. */
+			if (written > 0)
+				ScheduleBufferTagForWriteback(wb_context,
+											  IOCONTEXT_NORMAL,
+											  &bufs[first_written_index]->tag,
+											  written);
+
+			/* Unpin. */
+			for (int i = 0; i < flushed; ++i)
+				UnpinBuffer(bufs[i]);
+
+			bufs += flushed;
+			nlocked -= flushed;
+			nbuffers -= flushed;
+		}
+	}
+
+	return total_written;
+}
+
 /*
  * BufferSync -- Write out all dirty buffers in the pool.
  *
@@ -2633,9 +2700,8 @@ BufferSync(int flags)
 	int			i;
 	int			mask = BM_DIRTY;
 	WritebackContext wb_context;
-
-	/* Make sure we can handle the pin inside SyncOneBuffer */
-	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+	BufferDesc *range_buffers[MAX_BUFFERS_PER_TRANSFER];
+	int			range_nblocks = 0;
 
 	/*
 	 * Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -2832,11 +2898,40 @@ BufferSync(int flags)
 		 */
 		if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
 		{
-			if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+			ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+			ReservePrivateRefCountEntry();
+
+			buf_state = LockBufHdr(bufHdr);
+			if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
+			{
+				/* It's clean, so nothing to do */
+				UnlockBufHdr(bufHdr, buf_state);
+			}
+			else
 			{
-				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-				PendingCheckpointerStats.buf_written_checkpoints++;
-				num_written++;
+				PinBuffer_Locked(bufHdr);
+
+				/*
+				 * Check if we need to sync the range of buffers we've
+				 * collected so far, because we've collected enough already or
+				 * because our newly pinned buffer is not consecutive with the
+				 * last one.
+				 */
+				if (range_nblocks == lengthof(range_buffers) ||
+					(range_nblocks > 0 &&
+					 !BufferTagsConsecutive(&range_buffers[range_nblocks - 1]->tag,
+											&bufHdr->tag)))
+				{
+					int			written;
+
+					written = SyncBuffers(range_buffers, range_nblocks, &wb_context);
+					range_nblocks = 0;
+					num_written += written;
+					PendingCheckpointerStats.buf_written_checkpoints += written;
+				}
+
+				/* Collect this pinned buffer in our range. */
+				range_buffers[range_nblocks++] = bufHdr;
 			}
 		}
 
@@ -2867,6 +2962,16 @@ BufferSync(int flags)
 		CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
 	}
 
+	/* Sync our final range. */
+	if (range_nblocks > 0)
+	{
+		int			written;
+
+		written = SyncBuffers(range_buffers, range_nblocks, &wb_context);
+		num_written += written;
+		PendingCheckpointerStats.buf_written_checkpoints += written;
+	}
+
 	/*
 	 * Issue all pending flushes. Only checkpointer calls BufferSync(), so
 	 * IOContext will always be IOCONTEXT_NORMAL.
@@ -3245,6 +3350,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 	 * buffer is clean by the time we've locked it.)
 	 */
 	PinBuffer_Locked(bufHdr);
+
 	LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
 
 	FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
@@ -3259,7 +3365,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 	 * SyncOneBuffer() is only called by checkpointer and bgwriter, so
 	 * IOContext will always be IOCONTEXT_NORMAL.
 	 */
-	ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag);
+	ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag, 1);
 
 	return result | BUF_WRITTEN;
 }
@@ -3495,7 +3601,8 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
  */
 static int
 FlushBuffers(BufferDesc **bufs, int nbuffers, SMgrRelation reln,
-			 IOObject io_object, IOContext io_context)
+			 IOObject io_object, IOContext io_context,
+			 int *first_written_index, int *written)
 {
 	XLogRecPtr	max_recptr;
 	struct shared_buffer_write_error_info errinfo;
@@ -3578,7 +3685,13 @@ FlushBuffers(BufferDesc **bufs, int nbuffers, SMgrRelation reln,
 
 	/* If we can't write even one buffer, then we're done. */
 	if (first_start_io_index < 0)
+	{
+		if (first_written_index)
+			*first_written_index = 0;
+		if (written)
+			*written = 0;
 		return nbuffers;
+	}
 
 	/* Setup error traceback support for ereport() */
 	errinfo.buf = bufs[first_start_io_index];
@@ -3696,6 +3809,12 @@ FlushBuffers(BufferDesc **bufs, int nbuffers, SMgrRelation reln,
 	/* Pop the error context stack */
 	error_context_stack = errcallback.previous;
 
+	/* Report the range of buffers that we actually wrote, if any. */
+	if (first_written_index)
+		*first_written_index = first_start_io_index;
+	if (written)
+		*written = nbuffers - first_start_io_index;
+
 	return nbuffers;
 }
 
@@ -3709,7 +3828,7 @@ static void
 FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 			IOContext io_context)
 {
-	FlushBuffers(&buf, 1, reln, io_object, io_context);
+	FlushBuffers(&buf, 1, reln, io_object, io_context, NULL, NULL);
 }
 
 /*
@@ -5734,11 +5853,11 @@ WritebackContextInit(WritebackContext *context, int *max_pending)
 }
 
 /*
- * Add buffer to list of pending writeback requests.
+ * Add buffer tag range to list of pending writeback requests.
  */
 void
 ScheduleBufferTagForWriteback(WritebackContext *wb_context, IOContext io_context,
-							  BufferTag *tag)
+							  BufferTag *tag, int nblocks)
 {
 	PendingWriteback *pending;
 
@@ -5756,6 +5875,7 @@ ScheduleBufferTagForWriteback(WritebackContext *wb_context, IOContext io_context
 		pending = &wb_context->pending_writebacks[wb_context->nr_pending++];
 
 		pending->tag = *tag;
+		pending->nblocks = nblocks;
 	}
 
 	/*
@@ -5812,10 +5932,11 @@ IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context)
 		int			ahead;
 		BufferTag	tag;
 		RelFileLocator currlocator;
-		Size		nblocks = 1;
+		Size		nblocks;
 
 		cur = &wb_context->pending_writebacks[i];
 		tag = cur->tag;
+		nblocks = cur->nblocks;
 		currlocator = BufTagGetRelFileLocator(&tag);
 
 		/*
@@ -5824,6 +5945,8 @@ IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context)
 		 */
 		for (ahead = 0; i + ahead + 1 < wb_context->nr_pending; ahead++)
 		{
+			BlockNumber this_end;
+			BlockNumber next_end;
 
 			next = &wb_context->pending_writebacks[i + ahead + 1];
 
@@ -5833,15 +5956,15 @@ IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context)
 				BufTagGetForkNum(&cur->tag) != BufTagGetForkNum(&next->tag))
 				break;
 
-			/* ok, block queued twice, skip */
-			if (cur->tag.blockNum == next->tag.blockNum)
-				continue;
-
-			/* only merge consecutive writes */
-			if (cur->tag.blockNum + 1 != next->tag.blockNum)
+			/* only merge consecutive or overlapping writes */
+			if (next->tag.blockNum > tag.blockNum + nblocks)
 				break;
 
-			nblocks++;
+			/* find the nblocks value that covers the end of both */
+			this_end = tag.blockNum + nblocks;
+			next_end = next->tag.blockNum + next->nblocks;
+			nblocks = Max(this_end, next_end) - tag.blockNum;
+
 			cur = next;
 		}
 
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 6a493d2b96..0b9a3e3fa8 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -300,6 +300,7 @@ typedef struct PendingWriteback
 {
 	/* could store different types of pending flushes here */
 	BufferTag	tag;
+	int			nblocks;
 } PendingWriteback;
 
 /* struct forward declared in bufmgr.h */
@@ -400,7 +401,8 @@ extern PGDLLIMPORT CkptSortItem *CkptBufferIds;
 extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
-										  IOContext io_context, BufferTag *tag);
+										  IOContext io_context, BufferTag *tag,
+										  int nblocks);
 
 /* freelist.c */
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
-- 
2.39.2

v1-0001-Provide-vectored-variants-of-FileRead-and-FileWri.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Provide-vectored-variants-of-FileRead-and-FileWri.patchDownload

From 2d79ed35fd906dc53fb7ac7695cfec92544dd7ac Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 19 Jul 2023 12:32:51 +1200
Subject: [PATCH v1 01/14] Provide vectored variants of FileRead() and
 FileWrite().

FileReadV() and FileWriteV() adapt preadv() and pwritev() for
PostgreSQL's virtual file descriptors.  The traditional FileRead() and
FileWrite() functions are implemented in terms of the new functions.

Author: Thomas Munro <thomas.munro@gmail.com>
---
 src/backend/storage/file/fd.c | 39 +++++++++++++++++++++++++++--------
 src/include/storage/fd.h      | 30 +++++++++++++++++++++++++--
 2 files changed, 58 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b490a76ba7..ac5b981cce 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2085,8 +2085,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 }
 
 int
-FileRead(File file, void *buffer, size_t amount, off_t offset,
-		 uint32 wait_event_info)
+FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+		  uint32 wait_event_info)
 {
 	int			returnCode;
 	Vfd		   *vfdP;
@@ -2106,7 +2106,14 @@ FileRead(File file, void *buffer, size_t amount, off_t offset,
 
 retry:
 	pgstat_report_wait_start(wait_event_info);
-	returnCode = pg_pread(vfdP->fd, buffer, amount, offset);
+	/* Avoid a slightly more expensive kernel call if there is no benefit. */
+	if (iovcnt == 1)
+		returnCode = pg_pread(vfdP->fd,
+							  iov[0].iov_base,
+							  iov[0].iov_len,
+							  offset);
+	else
+		returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
 	pgstat_report_wait_end();
 
 	if (returnCode < 0)
@@ -2141,8 +2148,8 @@ retry:
 }
 
 int
-FileWrite(File file, const void *buffer, size_t amount, off_t offset,
-		  uint32 wait_event_info)
+FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+		   uint32 wait_event_info)
 {
 	int			returnCode;
 	Vfd		   *vfdP;
@@ -2170,7 +2177,14 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 	 */
 	if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
 	{
-		off_t		past_write = offset + amount;
+		size_t		size = 0;
+		off_t		past_write;
+
+		/* Compute the total transfer size. */
+		for (int i = 0; i < iovcnt; ++i)
+			size += iov[i].iov_len;
+
+		past_write = offset + size;
 
 		if (past_write > vfdP->fileSize)
 		{
@@ -2188,11 +2202,18 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 retry:
 	errno = 0;
 	pgstat_report_wait_start(wait_event_info);
-	returnCode = pg_pwrite(VfdCache[file].fd, buffer, amount, offset);
+	/* Avoid a slightly more expensive kernel call if there is no benefit. */
+	if (iovcnt == 1)
+		returnCode = pg_pwrite(vfdP->fd,
+							   iov[0].iov_base,
+							   iov[0].iov_len,
+							   offset);
+	else
+		returnCode = pg_pwritev(vfdP->fd, iov, iovcnt, offset);
 	pgstat_report_wait_end();
 
 	/* if write didn't set errno, assume problem is no disk space */
-	if (returnCode != amount && errno == 0)
+	if (returnCode < 0 && errno == 0)
 		errno = ENOSPC;
 
 	if (returnCode >= 0)
@@ -2202,7 +2223,7 @@ retry:
 		 */
 		if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
 		{
-			off_t		past_write = offset + amount;
+			off_t		past_write = offset + returnCode;
 
 			if (past_write > vfdP->fileSize)
 			{
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 6791a406fc..a354ae5a7f 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -43,6 +43,8 @@
 #ifndef FD_H
 #define FD_H
 
+#include "port/pg_iovec.h"
+
 #include <dirent.h>
 #include <fcntl.h>
 
@@ -111,8 +113,8 @@ extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fil
 extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
 extern int	FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
-extern int	FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
-extern int	FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
+extern int	FileReadV(File file, const struct iovec *ioc, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int	FileWriteV(File file, const struct iovec *ioc, int iovcnt, off_t offset, uint32 wait_event_info);
 extern int	FileSync(File file, uint32 wait_event_info);
 extern int	FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
 extern int	FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
@@ -195,6 +197,30 @@ extern int	durable_unlink(const char *fname, int elevel);
 extern void SyncDataDirectory(void);
 extern int	data_sync_elevel(int elevel);
 
+static inline int
+FileRead(File file, void *buffer, size_t amount, off_t offset,
+		 uint32 wait_event_info)
+{
+	struct iovec iov = {
+		.iov_base = buffer,
+		.iov_len = amount
+	};
+
+	return FileReadV(file, &iov, 1, offset, wait_event_info);
+}
+
+static inline int
+FileWrite(File file, const void *buffer, size_t amount, off_t offset,
+		  uint32 wait_event_info)
+{
+	struct iovec iov = {
+		.iov_base = unconstify(void *, buffer),
+		.iov_len = amount
+	};
+
+	return FileWriteV(file, &iov, 1, offset, wait_event_info);
+}
+
 /* Filename components */
 #define PG_TEMP_FILES_DIR "pgsql_tmp"
 #define PG_TEMP_FILE_PREFIX "pgsql_tmp"
-- 
2.39.2

v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patchtext/x-patch; charset=US-ASCII; name=v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patchDownload

From 0c6aa2f886281da2b1a76ee694267984d421891a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 19 Jul 2023 14:50:41 +1200
Subject: [PATCH v1 02/14] Provide vectored variants of smgrread() and
 smgrwrite().

smgrreadv() and smgrwritev() map to FileReadV() and FileWriteV(), and
thence the OS's vector I/O routines, after dealing with segmentation.
These functions allow for contiguous regions of relation files to be
transferred into and out of non-neighboring buffers with a single system
call.  The traditional smgrread() and smgrwrite() functions are
implemented in terms of the new functions.

The md.c implementation automatically handles short transfers by
retrying.  They were probably always possible even when we were only
doing 8kB reads and write, but with larger transfers they become more
likely.

Author: Thomas Munro <thomas.munro@gmail.com>
---
 src/backend/storage/smgr/md.c   | 353 +++++++++++++++++++++++---------
 src/backend/storage/smgr/smgr.c |  37 ++--
 src/include/storage/fd.h        |   2 +-
 src/include/storage/md.h        |   9 +-
 src/include/storage/smgr.h      |  25 ++-
 5 files changed, 304 insertions(+), 122 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index fdecbad170..d1383220bd 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -452,7 +452,7 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
 /*
  * mdextend() -- Add a block to the specified relation.
  *
- * The semantics are nearly the same as mdwrite(): write at the
+ * The semantics are nearly the same as mdwritev(): write at the
  * specified position.  However, this is to be used for the case of
  * extending a relation (i.e., blocknum is at or beyond the current
  * EOF).  Note that we assume writing a block beyond current EOF
@@ -737,136 +737,297 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 }
 
 /*
- * mdread() -- Read the specified block from a relation.
+ * Convert an array of buffer address into an array of iovec objects, and
+ * return the number that were required.  'iov' must have enough space for up
+ * to PG_IOV_MAX elements.
  */
-void
-mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-	   void *buffer)
+static int
+buffers_to_iov(struct iovec *iov, void **buffers, int nblocks)
 {
-	off_t		seekpos;
-	int			nbytes;
-	MdfdVec    *v;
+	struct iovec *iovp;
+	int			iovcnt;
 
-	/* If this build supports direct I/O, the buffer must be I/O aligned. */
-	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
-		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+	Assert(nblocks >= 1);
+	Assert(nblocks <= PG_IOV_MAX);
 
-	TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber,
-										reln->smgr_rlocator.backend);
+	/* If this build supports direct I/O, buffers must be I/O aligned. */
+	for (int i = 0; i < nblocks; ++i)
+	{
+		if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
+			Assert((uintptr_t) buffers[i] ==
+				   TYPEALIGN(PG_IO_ALIGN_SIZE, buffers[i]));
+	}
 
-	v = _mdfd_getseg(reln, forknum, blocknum, false,
-					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+	/* Start the first iovec off with the first buffer. */
+	iovp = &iov[0];
+	iovp->iov_base = buffers[0];
+	iovp->iov_len = BLCKSZ;
+	iovcnt = 1;
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	/* Merge in the rest. */
+	for (int i = 1; i < nblocks; ++i)
+	{
+		void	   *buffer = buffers[i];
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		if (((char *) iovp->iov_base + iovp->iov_len) == buffer)
+		{
+			/* Contiguous with the last iovec. */
+			iovp->iov_len += BLCKSZ;
+		}
+		else
+		{
+			/* Need a new iovec. */
+			iovp++;
+			iovp->iov_base = buffer;
+			iovp->iov_len = BLCKSZ;
+			iovcnt++;
+		}
+	}
+
+	return iovcnt;
+}
 
-	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
+/*
+ * To continue a vector I/O after a short transfer, we need a way to remove
+ * the whole or partial iovecs that have already been transferred.  We assume
+ * that partial transfers with direct I/O will happen on suitable alignment
+ * boundaries, so that we don't have to make further adjustments for that.
+ */
+static int
+adjust_iovec_for_partial_transfer(struct iovec *iov,
+								  int iovcnt,
+								  size_t transferred)
+{
+	struct iovec *keep = iov;
 
-	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
-									   reln->smgr_rlocator.locator.spcOid,
-									   reln->smgr_rlocator.locator.dbOid,
-									   reln->smgr_rlocator.locator.relNumber,
-									   reln->smgr_rlocator.backend,
-									   nbytes,
-									   BLCKSZ);
+	Assert(iovcnt > 0);
 
-	if (nbytes != BLCKSZ)
+	/* Remove wholly transferred iovecs. */
+	while (keep->iov_len <= transferred)
 	{
-		if (nbytes < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read block %u in file \"%s\": %m",
-							blocknum, FilePathName(v->mdfd_vfd))));
+		transferred -= keep->iov_len;
+		keep++;
+		iovcnt--;
+	}
+	if (keep != iov)
+		memmove(iov, keep, sizeof(*keep) * iovcnt);
 
-		/*
-		 * Short read: we are at or past EOF, or we read a partial block at
-		 * EOF.  Normally this is an error; upper levels should never try to
-		 * read a nonexistent block.  However, if zero_damaged_pages is ON or
-		 * we are InRecovery, we should instead return zeroes without
-		 * complaining.  This allows, for example, the case of trying to
-		 * update a block that was later truncated away.
-		 */
-		if (zero_damaged_pages || InRecovery)
-			MemSet(buffer, 0, BLCKSZ);
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
-							blocknum, FilePathName(v->mdfd_vfd),
-							nbytes, BLCKSZ)));
+	/* Shouldn't be called with 'transferred' >= total size. */
+	Assert(iovcnt > 0);
+
+	/* Adjust leading iovec, which may have been partially transferred. */
+	Assert(iov->iov_len > transferred);
+	iov->iov_base = (char *) iov->iov_base + transferred;
+	iov->iov_len -= transferred;
+
+	return iovcnt;
+}
+
+/*
+ * mdreadv() -- Read the specified blocks from a relation.
+ */
+void
+mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		void **buffers, BlockNumber nblocks)
+{
+	while (nblocks > 0)
+	{
+		struct iovec iov[PG_IOV_MAX];
+		int			iovcnt;
+		off_t		seekpos;
+		int			nbytes;
+		MdfdVec    *v;
+		BlockNumber nblocks_this_segment;
+		size_t		transferred_this_segment;
+		size_t		size_this_segment;
+
+		v = _mdfd_getseg(reln, forknum, blocknum, false,
+						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+		nblocks_this_segment =
+			Min(nblocks,
+				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+		nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
+
+		iovcnt = buffers_to_iov(iov, buffers, nblocks_this_segment);
+		size_this_segment = nblocks_this_segment * BLCKSZ;
+		transferred_this_segment = 0;
+
+		/* Inner loop to continue after a short reads. */
+		for (;;)
+		{
+			TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
+												reln->smgr_rlocator.locator.spcOid,
+												reln->smgr_rlocator.locator.dbOid,
+												reln->smgr_rlocator.locator.relNumber,
+												reln->smgr_rlocator.backend);
+			nbytes = FileReadV(v->mdfd_vfd, iov, iovcnt, seekpos,
+							   WAIT_EVENT_DATA_FILE_READ);
+			TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
+											   reln->smgr_rlocator.locator.spcOid,
+											   reln->smgr_rlocator.locator.dbOid,
+											   reln->smgr_rlocator.locator.relNumber,
+											   reln->smgr_rlocator.backend,
+											   nbytes,
+											   BLCKSZ);
+
+			if (nbytes < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not read %u blocks at block %u in file \"%s\": %m",
+								nblocks_this_segment, blocknum,
+								FilePathName(v->mdfd_vfd))));
+
+			if (nbytes == 0)
+			{
+				/*
+				 * We are at or past EOF, or we read a partial block at EOF.
+				 * Normally this is an error; upper levels should never try to
+				 * read a nonexistent block.  However, if zero_damaged_pages
+				 * is ON or we are InRecovery, we should instead return zeroes
+				 * without complaining.  This allows, for example, the case of
+				 * trying to update a block that was later truncated away.
+				 */
+				if (zero_damaged_pages || InRecovery)
+				{
+					for (BlockNumber i = transferred_this_segment / BLCKSZ;
+						 i < nblocks_this_segment;
+						 ++i)
+						memset(buffers[i], 0, BLCKSZ);
+					break;
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("could not read %u blocks at block %u in file \"%s\": read only %zu of %zu bytes",
+									nblocks_this_segment,
+									blocknum,
+									FilePathName(v->mdfd_vfd),
+									transferred_this_segment,
+									size_this_segment)));
+			}
+
+			/* One loop should usually be enough. */
+			transferred_this_segment += nbytes;
+			Assert(transferred_this_segment <= size_this_segment);
+			if (transferred_this_segment == size_this_segment)
+				break;
+
+			/* Adjust position and vectors after a short transfer. */
+			seekpos += nbytes;
+			iovcnt = adjust_iovec_for_partial_transfer(iov, iovcnt, nbytes);
+		}
+
+		nblocks -= nblocks_this_segment;
+		buffers += nblocks_this_segment;
+		blocknum += nblocks_this_segment;
 	}
 }
 
 /*
- * mdwrite() -- Write the supplied block at the appropriate location.
+ * mdwritev() -- Write the supplied blocks at the appropriate location.
  *
  * This is to be used only for updating already-existing blocks of a
  * relation (ie, those before the current EOF).  To extend a relation,
  * use mdextend().
  */
 void
-mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		const void *buffer, bool skipFsync)
+mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
-	off_t		seekpos;
-	int			nbytes;
-	MdfdVec    *v;
-
-	/* If this build supports direct I/O, the buffer must be I/O aligned. */
-	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
-		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
-
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert(blocknum < mdnblocks(reln, forknum));
 #endif
 
-	TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum,
-										 reln->smgr_rlocator.locator.spcOid,
-										 reln->smgr_rlocator.locator.dbOid,
-										 reln->smgr_rlocator.locator.relNumber,
-										 reln->smgr_rlocator.backend);
+	while (nblocks > 0)
+	{
+		struct iovec iov[PG_IOV_MAX];
+		int			iovcnt;
+		off_t		seekpos;
+		int			nbytes;
+		MdfdVec    *v;
+		BlockNumber nblocks_this_segment;
+		size_t		transferred_this_segment;
+		size_t		size_this_segment;
 
-	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
-					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+		v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
+						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
 
-	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+		nblocks_this_segment =
+			Min(nblocks,
+				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+		nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
 
-	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber,
-										reln->smgr_rlocator.backend,
-										nbytes,
-										BLCKSZ);
+		iovcnt = buffers_to_iov(iov, (void **) buffers, nblocks_this_segment);
+		size_this_segment = nblocks_this_segment * BLCKSZ;
+		transferred_this_segment = 0;
 
-	if (nbytes != BLCKSZ)
-	{
-		if (nbytes < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not write block %u in file \"%s\": %m",
-							blocknum, FilePathName(v->mdfd_vfd))));
-		/* short write: complain appropriately */
-		ereport(ERROR,
-				(errcode(ERRCODE_DISK_FULL),
-				 errmsg("could not write block %u in file \"%s\": wrote only %d of %d bytes",
-						blocknum,
-						FilePathName(v->mdfd_vfd),
-						nbytes, BLCKSZ),
-				 errhint("Check free disk space.")));
-	}
+		/* Inner loop to continue after a short writes. */
+		for (;;)
+		{
+			TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum,
+												 reln->smgr_rlocator.locator.spcOid,
+												 reln->smgr_rlocator.locator.dbOid,
+												 reln->smgr_rlocator.locator.relNumber,
+												 reln->smgr_rlocator.backend);
+			nbytes = FileWriteV(v->mdfd_vfd, iov, iovcnt, seekpos,
+								WAIT_EVENT_DATA_FILE_WRITE);
+			TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
+												reln->smgr_rlocator.locator.spcOid,
+												reln->smgr_rlocator.locator.dbOid,
+												reln->smgr_rlocator.locator.relNumber,
+												reln->smgr_rlocator.backend,
+												nbytes,
+												BLCKSZ);
+
+			if (nbytes < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write %u blocks at block %u in file \"%s\": %m",
+								nblocks_this_segment, blocknum,
+								FilePathName(v->mdfd_vfd))));
 
-	if (!skipFsync && !SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, v);
+			if (nbytes == 0)
+			{
+				/* short write: complain appropriately */
+				ereport(ERROR,
+						(errcode(ERRCODE_DISK_FULL),
+						 errmsg("could not write %u blocks at block %u in file \"%s\": wrote only %zu of %zu bytes",
+								nblocks_this_segment,
+								blocknum,
+								FilePathName(v->mdfd_vfd),
+								transferred_this_segment,
+								size_this_segment),
+						 errhint("Check free disk space.")));
+			}
+
+			/* One loop should usually be enough. */
+			transferred_this_segment += nbytes;
+			Assert(transferred_this_segment <= size_this_segment);
+			if (transferred_this_segment == size_this_segment)
+				break;
+
+			/* Adjust position and vectors after a short transfer. */
+			seekpos += nbytes;
+			iovcnt = adjust_iovec_for_partial_transfer(iov, iovcnt, nbytes);
+		}
+
+		if (!skipFsync && !SmgrIsTemp(reln))
+			register_dirty_segment(reln, forknum, v);
+
+		nblocks -= nblocks_this_segment;
+		buffers += nblocks_this_segment;
+		blocknum += nblocks_this_segment;
+	}
 }
 
 /*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5d0f3d515c..b46ac3cb09 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -55,10 +55,13 @@ typedef struct f_smgr
 									BlockNumber blocknum, int nblocks, bool skipFsync);
 	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber blocknum);
-	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
-							  BlockNumber blocknum, void *buffer);
-	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
-							   BlockNumber blocknum, const void *buffer, bool skipFsync);
+	void		(*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
+							   BlockNumber blocknum,
+							   void **buffers, BlockNumber nblocks);
+	void		(*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
+								BlockNumber blocknum,
+								const void **buffers, BlockNumber nblocks,
+								bool skipFsync);
 	void		(*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
 								   BlockNumber blocknum, BlockNumber nblocks);
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -80,8 +83,8 @@ static const f_smgr smgrsw[] = {
 		.smgr_extend = mdextend,
 		.smgr_zeroextend = mdzeroextend,
 		.smgr_prefetch = mdprefetch,
-		.smgr_read = mdread,
-		.smgr_write = mdwrite,
+		.smgr_readv = mdreadv,
+		.smgr_writev = mdwritev,
 		.smgr_writeback = mdwriteback,
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
@@ -551,22 +554,23 @@ smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 }
 
 /*
- * smgrread() -- read a particular block from a relation into the supplied
- *				 buffer.
+ * smgrreadv() -- read a particular block range from a relation into the
+ *				 supplied buffers.
  *
  * This routine is called from the buffer manager in order to
  * instantiate pages in the shared buffer cache.  All storage managers
  * return pages in the format that POSTGRES expects.
  */
 void
-smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 void *buffer)
+smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		  void **buffers, BlockNumber nblocks)
 {
-	smgrsw[reln->smgr_which].smgr_read(reln, forknum, blocknum, buffer);
+	smgrsw[reln->smgr_which].smgr_readv(reln, forknum, blocknum, buffers,
+										nblocks);
 }
 
 /*
- * smgrwrite() -- Write the supplied buffer out.
+ * smgrwritev() -- Write the supplied buffers out.
  *
  * This is to be used only for updating already-existing blocks of a
  * relation (ie, those before the current EOF).  To extend a relation,
@@ -581,14 +585,13 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * do not require fsync.
  */
 void
-smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		  const void *buffer, bool skipFsync)
+smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		   const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
-	smgrsw[reln->smgr_which].smgr_write(reln, forknum, blocknum,
-										buffer, skipFsync);
+	smgrsw[reln->smgr_which].smgr_writev(reln, forknum, blocknum,
+										 buffers, nblocks, skipFsync);
 }
 
-
 /*
  * smgrwriteback() -- Trigger kernel writeback for the supplied range of
  *					   blocks.
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index a354ae5a7f..26cf97586f 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -214,7 +214,7 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 		  uint32 wait_event_info)
 {
 	struct iovec iov = {
-		.iov_base = unconstify(void *, buffer),
+		.iov_base = (void *) buffer,
 		.iov_len = amount
 	};
 
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 941879ee6a..0f73d9af49 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -32,10 +32,11 @@ extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum, int nblocks, bool skipFsync);
 extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum);
-extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-				   void *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
-					BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					void **buffers, BlockNumber nblocks);
+extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum,
+					 const void **buffers, BlockNumber nblocks, bool skipFsync);
 extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
 						BlockNumber blocknum, BlockNumber nblocks);
 extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..ae93910132 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -96,10 +96,13 @@ extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
 						   BlockNumber blocknum, int nblocks, bool skipFsync);
 extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum);
-extern void smgrread(SMgrRelation reln, ForkNumber forknum,
-					 BlockNumber blocknum, void *buffer);
-extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
-					  BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
+					  BlockNumber blocknum,
+					  void **buffer, BlockNumber nblocks);
+extern void smgrwritev(SMgrRelation reln, ForkNumber forknum,
+					   BlockNumber blocknum,
+					   const void **buffer, BlockNumber nblocks,
+					   bool skipFsync);
 extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
 						  BlockNumber blocknum, BlockNumber nblocks);
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -110,4 +113,18 @@ extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void AtEOXact_SMgr(void);
 extern bool ProcessBarrierSmgrRelease(void);
 
+static inline void
+smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 void *buffer)
+{
+	smgrreadv(reln, forknum, blocknum, &buffer, 1);
+}
+
+static inline void
+smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		  const void *buffer, bool skipFsync)
+{
+	smgrwritev(reln, forknum, blocknum, &buffer, 1, skipFsync);
+}
+
 #endif							/* SMGR_H */
-- 
2.39.2

v1-0003-Provide-vectored-variant-of-ReadBuffer.patchtext/x-patch; charset=US-ASCII; name=v1-0003-Provide-vectored-variant-of-ReadBuffer.patchDownload

From f89b04963a80f66d9f9c4b8153a8afbff33b026e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v1 03/14] Provide vectored variant of ReadBuffer().

PrepareReadBuffer() followed by CompleteReadBuffers() or ZeroBuffer() is
equivalent to ReadBuffer(), except that the physical read for multiple
adjacent blocks can be merged into one vectored system call.

ReadBuffer() is now implemented in terms of these functions.

Author: Thomas Munro <thomas.munro@gmail.com>
---
 src/backend/storage/buffer/bufmgr.c   | 541 +++++++++++++++++---------
 src/backend/storage/buffer/localbuf.c |   4 +-
 src/include/storage/buf_internals.h   |   3 +-
 src/include/storage/bufmgr.h          |  20 +
 4 files changed, 383 insertions(+), 185 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3bd82dbfca..a009b9da53 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -447,7 +447,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
 								ReadBufferMode mode, BufferAccessStrategy strategy,
 								bool *hit);
@@ -485,7 +485,8 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
 							   ForkNumber forkNum,
 							   BlockNumber blockNum,
 							   BufferAccessStrategy strategy,
-							   bool *foundPtr, IOContext io_context);
+							   bool *foundPtr, bool *allocPtr,
+							   IOContext io_context);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
@@ -773,7 +774,7 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 	 * miss.
 	 */
 	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
+	buf = ReadBuffer_common(BMR_REL(reln),
 							forkNum, blockNum, mode, strategy, &hit);
 	if (hit)
 		pgstat_count_buffer_hit(reln);
@@ -800,8 +801,9 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 
 	SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -975,7 +977,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 		bool		hit;
 
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
+		buffer = ReadBuffer_common(bmr,
 								   fork, extend_to - 1, mode, strategy,
 								   &hit);
 	}
@@ -989,18 +991,18 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
+	Buffer		buffer;
+	bool		allocated;
 
-	*hit = false;
+	Assert(mode == RBM_ZERO_AND_LOCK ||
+		   mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+		   mode == RBM_NORMAL ||
+		   mode == RBM_NORMAL_NO_LOG ||
+		   mode == RBM_ZERO_ON_ERROR);
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1019,178 +1021,326 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		*hit = false;
+
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	/* Make sure we will have room to remember the buffer pin */
-	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+	buffer = PrepareReadBuffer(bmr,
+							   forkNum,
+							   blockNum,
+							   strategy,
+							   hit,
+							   &allocated);
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	/* At this point we do NOT hold any locks. */
+
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+	{
+		/* if we just want zeroes and a lock, we're done */
+		ZeroBuffer(buffer, mode);
+	}
+	else if (!*hit)
+	{
+		/* we might need to perform I/O */
+		CompleteReadBuffers(bmr,
+							&buffer,
+							forkNum,
+							blockNum,
+							1,
+							mode == RBM_ZERO_ON_ERROR,
+							strategy);
+	}
+
+	return buffer;
+}
 
+/*
+ * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
+ * the returned buffer can be used immediately.  Otherwise, a physical read
+ * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
+ * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the
+ * caller has the opportunity to coalesce reads of neighboring blocks into one
+ * CompleteReadBuffers() call.
+ *
+ * *found is set to true for a hit, and false for a miss.
+ *
+ * *allocated is set to true for a miss that allocates a buffer for the first
+ * time.  If there are multiple calls to PrepareReadBuffer() for the same
+ * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
+ * read, then only the first such call will receive *allocated == true, which
+ * the caller might use to issue just one prefetch hint.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *found,
+				  bool *allocated)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, found, allocated);
+		if (*found)
+			pgBufferUsage.local_blks_hit++;
+		else
+			pgBufferUsage.local_blks_read++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, found, allocated, io_context);
+		if (*found)
+			pgBufferUsage.shared_blks_hit++;
+		else
+			pgBufferUsage.shared_blks_read++;
+	}
+	if (bmr.rel)
+	{
+		pgstat_count_buffer_read(bmr.rel);
+		if (*found)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*found)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+	return BufferDescriptorGetBuffer(bufHdr);
+}
+
+static inline bool
+CompleteReadBuffersCanStartIO(Buffer buffer)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
 
-		return BufferDescriptorGetBuffer(bufHdr);
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true);
+}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+/*
+ * Complete a set reads prepared with PrepareReadBuffers().  The buffers must
+ * cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+					Buffer *buffers,
+					ForkNumber forknum,
+					BlockNumber blocknum,
+					int nblocks,
+					bool zero_on_error,
+					BufferAccessStrategy strategy)
+{
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
 
-	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
-	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
 	else
 	{
-		instr_time	io_start = pgstat_prepare_io_time();
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	for (int i = 0; i < nblocks; ++i)
+	{
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+#ifdef USE_ASSERT_CHECKING
+
+		/*
+		 * We could get all the information from buffer headers, but it can be
+		 * expensive to access buffer header cache lines so we make the caller
+		 * provide all the information we need, and assert that it is
+		 * consistent.
+		 */
+		{
+			RelFileLocator xlocator;
+			ForkNumber	xforknum;
+			BlockNumber xblocknum;
+
+			BufferGetTag(buffers[i], &xlocator, &xforknum, &xblocknum);
+			Assert(RelFileLocatorEquals(bmr.smgr->smgr_rlocator.locator, xlocator));
+			Assert(xforknum == forknum);
+			Assert(xblocknum == blocknum + i);
+		}
+#endif
+
+		/* Skip this block if someone else has already completed it. */
+		if (!CompleteReadBuffersCanStartIO(buffers[i]))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 *
+			 * XXX We may want to redesign the tracing information to report
+			 * this, if there is still interest in these static tracing points
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		/* We found a buffer that we have to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?
+		 */
+		while ((i + 1) < nblocks &&
+			   CompleteReadBuffersCanStartIO(buffers[i + 1]))
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time();
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start, 1);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
+		{
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if (zero_on_error || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
@@ -1204,11 +1354,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.  A read should be
+ * performed with CompleteReadBuffers().
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1220,7 +1367,9 @@ static BufferDesc *
 BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			BlockNumber blockNum,
 			BufferAccessStrategy strategy,
-			bool *foundPtr, IOContext io_context)
+			bool *foundPtr,
+			bool *allocatedPtr,
+			IOContext io_context)
 {
 	BufferTag	newTag;			/* identity of requested block */
 	uint32		newHash;		/* hash value for newTag */
@@ -1258,24 +1407,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		LWLockRelease(newPartitionLock);
 
 		*foundPtr = true;
+		*allocatedPtr = false;
 
 		if (!valid)
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1307,6 +1448,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		BufferDesc *existing_buf_hdr;
 		bool		valid;
 
+		*allocatedPtr = false;
+
 		/*
 		 * Got a collision. Someone has already done what we were about to do.
 		 * We'll just handle this as if it were found in the buffer pool in
@@ -1341,24 +1484,18 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
 	}
 
+	/* Report that we had to allocate a buffer. */
+	*allocatedPtr = true;
+
 	/*
 	 * Need to lock the buffer header too in order to change its tag.
 	 */
@@ -1385,15 +1522,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -2293,7 +2424,11 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was used.  We'll check by
+		 * loading the flags without locking.  This is racy, but it's OK to
+		 * return false spuriously: when ReadBuffer() or CompleteReadBuffers()
+		 * call StartBufferIO(), they'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2302,7 +2437,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -4752,6 +4887,46 @@ ConditionalLockBuffer(Buffer buffer)
 									LW_EXCLUSIVE);
 }
 
+/*
+ * Zero a buffer, and lock it as RBM_ZERO_AND_LOCK or
+ * RBM_ZERO_AND_CLEANUP_LOCK would.  The buffer must be already pinned.  It
+ * does not have to be valid, but it is valid and locked on return.
+ */
+void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * Verify that this backend is pinning the buffer exactly once.
  *
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1735ec7141..9ec410d6c0 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -115,7 +115,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
-				 bool *foundPtr)
+				 bool *foundPtr, bool *allocPtr)
 {
 	BufferTag	newTag;			/* identity of requested block */
 	LocalBufferLookupEnt *hresult;
@@ -141,6 +141,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		Assert(BufferTagsEqual(&bufHdr->tag, &newTag));
 
 		*foundPtr = PinLocalBuffer(bufHdr, true);
+		*allocPtr = false;
 	}
 	else
 	{
@@ -167,6 +168,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
 		*foundPtr = false;
+		*allocPtr = true;
 	}
 
 	return bufHdr;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index bc79a329a1..427a989aaf 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -422,7 +422,8 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
 												ForkNumber forkNum,
 												BlockNumber blockNum);
 extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
-									BlockNumber blockNum, bool *foundPtr);
+									BlockNumber blockNum, bool *foundPtr,
+									bool *allocPtr);
 extern BlockNumber ExtendBufferedRelLocal(BufferManagerRelation bmr,
 										  ForkNumber fork,
 										  uint32 flags,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index b379c76e27..5a2f66ed47 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,8 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "pgstat.h"
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -47,6 +49,8 @@ typedef enum
 	RBM_ZERO_AND_CLEANUP_LOCK,	/* Like RBM_ZERO_AND_LOCK, but locks the page
 								 * in "cleanup" mode */
 	RBM_ZERO_ON_ERROR,			/* Read, but return an all-zeros page on error */
+	RBM_WILL_ZERO,				/* Don't read from disk, caller will call
+								 * ZeroBuffer() */
 	RBM_NORMAL_NO_LOG			/* Don't log page as invalid during WAL
 								 * replay; otherwise same as RBM_NORMAL */
 } ReadBufferMode;
@@ -158,6 +162,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/* Maximum number of buffers for multi-buffer I/O functions. */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,19 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+extern Buffer PrepareReadBuffer(BufferManagerRelation bmr,
+								ForkNumber forkNum,
+								BlockNumber blockNum,
+								BufferAccessStrategy strategy,
+								bool *found,
+								bool *allocated);
+extern void CompleteReadBuffers(BufferManagerRelation bmr,
+								Buffer *buffers,
+								ForkNumber forknum,
+								BlockNumber blocknum,
+								int nblocks,
+								bool zero_on_error,
+								BufferAccessStrategy strategy);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -245,6 +264,7 @@ extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
 
 extern void AbortBufferIO(Buffer buffer);
 
-- 
2.39.2

v1-0004-Provide-multi-block-smgrprefetch.patchtext/x-patch; charset=US-ASCII; name=v1-0004-Provide-multi-block-smgrprefetch.patchDownload

From 61584e5178fe5d9d147aed127abb7aa3dfdaf67f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 08:53:15 +1200
Subject: [PATCH v1 04/14] Provide multi-block smgrprefetch().

Previously smgrprefetch() would issue POSIX_FADV_WILLNEED advice for a
single block at a time.  Add an nblocks argument so that we can do the
same for a range of blocks.

Author: Thomas Munro <thomas.munro@gmail.com>
---
 src/backend/storage/buffer/bufmgr.c   |  2 +-
 src/backend/storage/buffer/localbuf.c |  2 +-
 src/backend/storage/smgr/md.c         | 36 +++++++++++++++++++--------
 src/backend/storage/smgr/smgr.c       |  7 +++---
 src/include/storage/md.h              |  2 +-
 src/include/storage/smgr.h            |  2 +-
 6 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a009b9da53..d647708f7f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -543,7 +543,7 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
 		 * recovery if the relation file doesn't exist.
 		 */
 		if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
-			smgrprefetch(smgr_reln, forkNum, blockNum))
+			smgrprefetch(smgr_reln, forkNum, blockNum, 1))
 		{
 			result.initiated_io = true;
 		}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 9ec410d6c0..099904e694 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -93,7 +93,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 #ifdef USE_PREFETCH
 		/* Not in buffers, so initiate prefetch */
 		if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
-			smgrprefetch(smgr, forkNum, blockNum))
+			smgrprefetch(smgr, forkNum, blockNum, 1))
 		{
 			result.initiated_io = true;
 		}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index d1383220bd..7e226e1e0f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -710,27 +710,41 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- * mdprefetch() -- Initiate asynchronous read of the specified block of a relation
+ * mdprefetch() -- Initiate asynchronous read of the specified blocks of a relation
  */
 bool
-mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		   int nblocks)
 {
 #ifdef USE_PREFETCH
-	off_t		seekpos;
-	MdfdVec    *v;
 
 	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
 
-	v = _mdfd_getseg(reln, forknum, blocknum, false,
-					 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
-	if (v == NULL)
-		return false;
+	while (nblocks > 0)
+	{
+		off_t		seekpos;
+		MdfdVec    *v;
+		int			nblocks_this_segment;
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		v = _mdfd_getseg(reln, forknum, blocknum, false,
+						 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
+		if (v == NULL)
+			return false;
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+		nblocks_this_segment =
+			Min(nblocks,
+				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+		(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ * nblocks_this_segment,
+							WAIT_EVENT_DATA_FILE_PREFETCH);
+
+		blocknum += nblocks_this_segment;
+		nblocks -= nblocks_this_segment;
+	}
 #endif							/* USE_PREFETCH */
 
 	return true;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index b46ac3cb09..e594158f47 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -54,7 +54,7 @@ typedef struct f_smgr
 	void		(*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
 									BlockNumber blocknum, int nblocks, bool skipFsync);
 	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
-								  BlockNumber blocknum);
+								  BlockNumber blocknum, int nblocks);
 	void		(*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
 							   BlockNumber blocknum,
 							   void **buffers, BlockNumber nblocks);
@@ -548,9 +548,10 @@ smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * record).
  */
 bool
-smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+			 int nblocks)
 {
-	return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
+	return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum, nblocks);
 }
 
 /*
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 0f73d9af49..3dc651ec17 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -31,7 +31,7 @@ extern void mdextend(SMgrRelation reln, ForkNumber forknum,
 extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum, int nblocks, bool skipFsync);
 extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
-					   BlockNumber blocknum);
+					   BlockNumber blocknum, int nblocks);
 extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 					void **buffers, BlockNumber nblocks);
 extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index ae93910132..6e62dc053c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -95,7 +95,7 @@ extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
 extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
 						   BlockNumber blocknum, int nblocks, bool skipFsync);
 extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
-						 BlockNumber blocknum);
+						 BlockNumber blocknum, int nblocks);
 extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
 					  BlockNumber blocknum,
 					  void **buffer, BlockNumber nblocks);
-- 
2.39.2

v1-0005-Give-SMgrRelation-pointers-a-well-defined-lifetim.patchtext/x-patch; charset=US-ASCII; name=v1-0005-Give-SMgrRelation-pointers-a-well-defined-lifetim.patchDownload

From 92d00b22ea04817725f77996ead9cbd007f4e351 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 10 Aug 2023 18:16:31 +1200
Subject: [PATCH v1 05/14] Give SMgrRelation pointers a well-defined lifetime.

After calling smgropen(), it was not clear how long you could continue
to use the result, because various code paths including cache
invalidation could call smgrclose(), which freed the memory.

Guarantee that the object won't be destroyed until the end of the
current transaction, or in recovery, the commit/abort record that
destroys the underlying storage.

The existing smgrclose() function now closes files and forgets all state
except the rlocator, like smgrrelease() used to do, and additionally
"disowns" the object so that it is disconnected from any Relation and
queued for destruction at AtEOXact_SMgr(), unless it is re-owned by a
Relation again before then.

A new smgrdestroy() function is used by rare places that know there
should be no other references to the SMgrRelation.

The short version:

 * smgrrelease() is removed
 * smgrclose() now releases resources, but doesn't destroy until EOX
 * smgrdestroy() now frees memory, and should rarely be used

Existing code should be unaffected, but it is now possible for code that
has an SMgrRelation object to use it repeatedly during a transaction as
long as the storage hasn't been physically dropped.  Such code would
normally hold a lock on the relation.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
---
 src/backend/access/transam/xlogutils.c |  2 +-
 src/backend/postmaster/bgwriter.c      |  8 +++--
 src/backend/postmaster/checkpointer.c  | 15 ++++----
 src/backend/storage/smgr/smgr.c        | 49 ++++++++++++++++----------
 src/include/storage/smgr.h             |  4 +--
 src/include/utils/rel.h                |  6 ----
 6 files changed, 46 insertions(+), 38 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 43f7b31205..bdc0f7e1da 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -657,7 +657,7 @@ XLogDropDatabase(Oid dbid)
 	 * This is unnecessarily heavy-handed, as it will close SMgrRelation
 	 * objects for other databases as well. DROP DATABASE occurs seldom enough
 	 * that it's not worth introducing a variant of smgrclose for just this
-	 * purpose. XXX: Or should we rather leave the smgr entries dangling?
+	 * purpose.
 	 */
 	smgrcloseall();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index caad642ec9..5255b63dec 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -245,10 +245,12 @@ BackgroundWriterMain(void)
 		if (FirstCallSinceLastCheckpoint())
 		{
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the bgwriter does
+			 * not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 		}
 
 		/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index ace9893d95..87552ba2fd 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -458,10 +458,12 @@ CheckpointerMain(void)
 				ckpt_performed = CreateRestartPoint(flags);
 
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the checkpointer
+			 * does not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 
 			/*
 			 * Indicate checkpoint completion to any waiting backends.
@@ -944,11 +946,8 @@ RequestCheckpoint(int flags)
 		 */
 		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
 
-		/*
-		 * After any checkpoint, close all smgr files.  This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
+		/* Free all smgr objects, as CheckpointerMain() normally would. */
+		smgrdestroyall();
 
 		return;
 	}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index e594158f47..013300a21b 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -147,7 +147,9 @@ smgrshutdown(int code, Datum arg)
 /*
  * smgropen() -- Return an SMgrRelation object, creating it if need be.
  *
- * This does not attempt to actually open the underlying file.
+ * This does not attempt to actually open the underlying files.  The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transaction backends.
  */
 SMgrRelation
 smgropen(RelFileLocator rlocator, BackendId backend)
@@ -257,10 +259,10 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
  */
 void
-smgrclose(SMgrRelation reln)
+smgrdestroy(SMgrRelation reln)
 {
 	SMgrRelation *owner;
 	ForkNumber	forknum;
@@ -287,12 +289,14 @@ smgrclose(SMgrRelation reln)
 }
 
 /*
- * smgrrelease() -- Release all resources used by this object.
+ * smgrclose() -- Release all resources used by this object.
  *
- * The object remains valid.
+ * The object remains valid, but is moved to the unknown list where it will
+ * be destroyed by AtEOXact_SMgr().  It may be re-owned if it is accessed by a
+ * relation before then.
  */
 void
-smgrrelease(SMgrRelation reln)
+smgrclose(SMgrRelation reln)
 {
 	for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 	{
@@ -300,15 +304,20 @@ smgrrelease(SMgrRelation reln)
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	}
 	reln->smgr_targblock = InvalidBlockNumber;
+
+	if (reln->smgr_owner)
+	{
+		*reln->smgr_owner = NULL;
+		reln->smgr_owner = NULL;
+		dlist_push_tail(&unowned_relns, &reln->node);
+	}
 }
 
 /*
- * smgrreleaseall() -- Release resources used by all objects.
- *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * smgrcloseall() -- Close all objects.
  */
 void
-smgrreleaseall(void)
+smgrcloseall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -320,14 +329,17 @@ smgrreleaseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrrelease(reln);
+		smgrclose(reln);
 }
 
 /*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrdestroyall() -- Destroy all SMgrRelation objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those registered with smgrsetowner().
  */
 void
-smgrcloseall(void)
+smgrdestroyall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -339,7 +351,7 @@ smgrcloseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrclose(reln);
+		smgrdestroy(reln);
 }
 
 /*
@@ -731,7 +743,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
  * AtEOXact_SMgr
  *
  * This routine is called during transaction commit or abort (it doesn't
- * particularly care which).  All transient SMgrRelation objects are closed.
+ * particularly care which).  All transient SMgrRelation objects are
+ * destroyed.
  *
  * We do this as a compromise between wanting transient SMgrRelations to
  * live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -745,7 +758,7 @@ AtEOXact_SMgr(void)
 	dlist_mutable_iter iter;
 
 	/*
-	 * Zap all unowned SMgrRelations.  We rely on smgrclose() to remove each
+	 * Zap all unowned SMgrRelations.  We rely on smgrdestroy() to remove each
 	 * one from the list.
 	 */
 	dlist_foreach_modify(iter, &unowned_relns)
@@ -755,7 +768,7 @@ AtEOXact_SMgr(void)
 
 		Assert(rel->smgr_owner == NULL);
 
-		smgrclose(rel);
+		smgrdestroy(rel);
 	}
 }
 
@@ -766,6 +779,6 @@ AtEOXact_SMgr(void)
 bool
 ProcessBarrierSmgrRelease(void)
 {
-	smgrreleaseall();
+	smgrcloseall();
 	return true;
 }
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 6e62dc053c..2656f10f1c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,8 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
 extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
-extern void smgrrelease(SMgrRelation reln);
-extern void smgrreleaseall(void);
+extern void smgrdestroy(SMgrRelation reln);
+extern void smgrdestroyall(void);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 1426a353cd..a1402c1dd2 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
  *
  * Very little code is authorized to touch rel->rd_smgr directly.  Instead
  * use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period.  Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation.  It's quite cheap in
- * comparison to whatever an smgr function is going to do.
  */
 static inline SMgrRelation
 RelationGetSmgr(Relation rel)
-- 
2.39.2

v1-0006-Provide-basic-streaming-read-API.patchtext/x-patch; charset=US-ASCII; name=v1-0006-Provide-basic-streaming-read-API.patchDownload

From 89ac3d5aae4f36c3c536dcd6b198e23b9f9e2024 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 28 Jul 2023 22:35:32 +1200
Subject: [PATCH v1 06/14] Provide basic streaming read API.

"Streaming reads" can be used as a more efficient replacement for
ReadBuffer() calls.

The client code supplies a callback that can say which block to read
next, and then consumes individual buffers one at a time.  This division
allows the PgStreamingRead object to build larger calls to
CompleteReadBuffers(), which in turn produce large preadv() calls
instead of traditional single-block pread() calls.  Unless the read is
purely sequential, posix_fadvise() calls are also issued.

This API is intended as a stepping stone, allowing for true asynchronous
implementation in later work.  Code that adapts to the streaming read
API would automatically benefit.

Author: Thomas Munro <thomas.munro@gmail.com>
---
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 436 +++++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c      |   2 +-
 src/backend/storage/meson.build          |   1 +
 src/include/storage/bufmgr.h             |   1 +
 src/include/storage/streaming_read.h     |  38 ++
 src/tools/pgindent/typedefs.list         |   2 +
 9 files changed, 499 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..eec03f6f2b 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 0000000000..bcab44c802
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 0000000000..156e87cab7
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 0000000000..ee5a7e961d
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,436 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of clusters of buffers.
+ *
+ * For hits and RBM_WILL_ZERO, need_to_complete is false, we have just one
+ * buffer in each cluster, already pinned and ready for use.
+ *
+ * For misses that require a physical read, need_to_complete is true, and
+ * buffers[] holds a group of of neighboring blocks, so we can complete them
+ * with a single call to CompleteReadBuffers().  We can also issue a single
+ * prefetch for it as soon as it has grown to its largest possible size, if
+ * our random access heuristics determine that is a good idea.
+ */
+typedef struct PgStreamingReadCluster
+{
+	bool		advice_issued;
+	bool		need_complete;
+
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int			nblocks;
+
+	int			per_io_data_index[MAX_BUFFERS_PER_TRANSFER];
+	bool		need_advice[MAX_BUFFERS_PER_TRANSFER];
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+} PgStreamingReadCluster;
+
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			ios_in_progress_trigger;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	bool		finished;
+	uintptr_t	pgsr_private;
+	PgStreamingReadBufferDetermineNextCB next_cb;
+	BufferAccessStrategy strategy;
+
+	/* Next expected prefetch, for sequential prefetch avoidance. */
+	BufferManagerRelation seq_bmr;
+	ForkNumber	seq_forknum;
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-I/O private data. */
+	size_t		per_io_data_size;
+	void	   *per_io_data;
+	int			per_io_data_next;
+
+	/* Circular buffer of clusters. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadCluster clusters[FLEXIBLE_ARRAY_MEMBER];
+};
+
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int max_ios,
+							   size_t per_io_data_size,
+							   uintptr_t pgsr_private,
+							   BufferAccessStrategy strategy,
+							   PgStreamingReadBufferDetermineNextCB determine_next_cb)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_pinned_buffers;
+
+	Assert(max_ios > 0);
+
+	/*
+	 * We allow twice as many buffers to be pinned as I/Os.  This allows us to
+	 * look further ahead for blocks that need to be read in.
+	 */
+	max_pinned_buffers = max_ios * 2;
+
+	/* Don't allow this backend to pin too many buffers. */
+	LimitAdditionalPins((uint32 *) &max_pinned_buffers);
+	max_pinned_buffers = Max(2, max_pinned_buffers);
+	max_ios = max_pinned_buffers / 2;
+	Assert(max_ios > 0);
+	Assert(max_pinned_buffers > 0);
+	Assert(max_pinned_buffers > max_ios);
+
+	/*
+	 * pgsr->clusters is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0).  So we need two extra elements.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, clusters) +
+				sizeof(pgsr->clusters[0]) * size);
+
+	pgsr->per_io_data_size = per_io_data_size;
+	pgsr->max_ios = max_ios;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->next_cb = determine_next_cb;
+	pgsr->size = size;
+
+	/*
+	 * We look ahead when the number of pinned buffers falls below this
+	 * number.  This encourages the formation of large vectored reads.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(max_ios, max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space the callback to store extra data along with each block. */
+	if (per_io_data_size)
+		pgsr->per_io_data = palloc(per_io_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+/*
+ * Issue WILLNEED advice for the head cluster, and allocate a new head
+ * cluster.
+ *
+ * We don't have true asynchronous I/O to actually submit, but this is
+ * equivalent because it might start I/O on systems that understand WILLNEED
+ * advice.  We count it as an I/O in progress.
+ */
+static PgStreamingReadCluster *
+pg_streaming_read_submit(PgStreamingRead *pgsr)
+{
+	PgStreamingReadCluster *head_cluster;
+
+	head_cluster = &pgsr->clusters[pgsr->head];
+	Assert(head_cluster->nblocks > 0);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * Don't bother with advice if there will be no call to
+	 * CompleteReadBuffers() or direct I/O is enabled.
+	 */
+	if (head_cluster->need_complete &&
+		(io_direct_flags & IO_DIRECT_DATA) == 0)
+	{
+		/*
+		 * Purely sequential advice is known to hurt performance on some
+		 * systems, so only issue it if this looks random.
+		 */
+		if (head_cluster->bmr.smgr != pgsr->seq_bmr.smgr ||
+			head_cluster->bmr.rel != pgsr->seq_bmr.rel ||
+			head_cluster->forknum != pgsr->seq_forknum ||
+			head_cluster->blocknum != pgsr->seq_blocknum)
+		{
+			SMgrRelation smgr =
+				head_cluster->bmr.smgr ? head_cluster->bmr.smgr
+				: RelationGetSmgr(head_cluster->bmr.rel);
+
+			Assert(!head_cluster->advice_issued);
+
+			for (int i = 0; i < head_cluster->nblocks; i++)
+			{
+				if (head_cluster->need_advice[i])
+				{
+					BlockNumber first_blocknum = head_cluster->blocknum + i;
+					int			nblocks = 1;
+
+					/*
+					 * How many adjacent blocks can we merge with to reduce
+					 * system calls?  Usually this is all of them, unless
+					 * there are overlapping reads and our timing is unlucky.
+					 */
+					while ((i + 1) < head_cluster->nblocks &&
+						   head_cluster->need_advice[i + 1])
+					{
+						nblocks++;
+						i++;
+					}
+
+					smgrprefetch(smgr,
+								 head_cluster->forknum,
+								 first_blocknum,
+								 nblocks);
+				}
+
+			}
+
+			/*
+			 * Count this as an I/O that is concurrently in progress.  We
+			 * might have called smgrprefetch() more than once, if some of the
+			 * buffers in the range were already in buffer pool but not valid
+			 * yet, because of a concurrent read, but for now we choose to
+			 * track this as one I/O.
+			 */
+			head_cluster->advice_issued = true;
+			pgsr->ios_in_progress++;
+		}
+
+		/* Remember the point after this, for the above heuristics. */
+		pgsr->seq_bmr = head_cluster->bmr;
+		pgsr->seq_forknum = head_cluster->forknum;
+		pgsr->seq_blocknum = head_cluster->blocknum + head_cluster->nblocks;
+	}
+#endif
+
+	/* Create a new head cluster.  There must be space. */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	head_cluster = &pgsr->clusters[pgsr->head];
+	head_cluster->nblocks = 0;
+
+	return head_cluster;
+}
+
+void
+pg_streaming_read_prefetch(PgStreamingRead *pgsr)
+{
+	/* If we're finished or can't start one more I/O, then no prefetching. */
+	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a large cluster.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BufferManagerRelation bmr;
+		ForkNumber	forknum;
+		BlockNumber blocknum;
+		ReadBufferMode mode;
+		Buffer		buffer;
+		bool		found;
+		bool		allocated;
+		bool		need_complete;
+		PgStreamingReadCluster *head_cluster;
+		void	   *per_io_data;
+
+		/* Do we have a full-sized cluster? */
+		head_cluster = &pgsr->clusters[pgsr->head];
+		if (head_cluster->nblocks == lengthof(head_cluster->buffers))
+		{
+			Assert(head_cluster->need_complete);
+			head_cluster = pg_streaming_read_submit(pgsr);
+
+			/*
+			 * Give up now if I/O is saturated or we couldn't form another
+			 * full cluster after this.
+			 */
+			if (pgsr->ios_in_progress == pgsr->max_ios ||
+				pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+				break;
+		}
+
+		per_io_data = (char *) pgsr->per_io_data +
+			pgsr->per_io_data_size * pgsr->per_io_data_next;
+
+		/*
+		 * Try to find out which block the callback wants to read next.  False
+		 * indicates end-of-stream (but the client can restart).
+		 */
+		if (!pgsr->next_cb(pgsr, pgsr->pgsr_private, per_io_data,
+						   &bmr, &forknum, &blocknum, &mode))
+		{
+			pgsr->finished = true;
+			break;
+		}
+
+		Assert(mode == RBM_NORMAL || mode == RBM_WILL_ZERO);
+		Assert(pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+		buffer = PrepareReadBuffer(bmr,
+								   forknum,
+								   blocknum,
+								   pgsr->strategy,
+								   &found,
+								   &allocated);
+		pgsr->pinned_buffers++;
+
+		need_complete = !found && mode != RBM_WILL_ZERO;
+
+		/* Is there a head cluster that we can't extend? */
+		head_cluster = &pgsr->clusters[pgsr->head];
+		if (head_cluster->nblocks > 0 &&
+			(!need_complete ||
+			 !head_cluster->need_complete ||
+			 head_cluster->bmr.smgr != bmr.smgr ||
+			 head_cluster->bmr.rel != bmr.rel ||
+			 head_cluster->forknum != forknum ||
+			 head_cluster->blocknum + head_cluster->nblocks != blocknum))
+		{
+			/* Submit it so we can start a new one. */
+			head_cluster = pg_streaming_read_submit(pgsr);
+			Assert(head_cluster->nblocks == 0);
+		}
+
+		if (head_cluster->nblocks == 0)
+		{
+			/* Initialize the cluster. */
+			head_cluster->bmr = bmr;
+			head_cluster->forknum = forknum;
+			head_cluster->blocknum = blocknum;
+			head_cluster->advice_issued = false;
+			head_cluster->need_complete = need_complete;
+		}
+		else
+		{
+			/* We'll extend an existing cluster by one buffer. */
+			Assert(head_cluster->bmr.smgr == bmr.smgr);
+			Assert(head_cluster->bmr.rel == bmr.rel);
+			Assert(head_cluster->forknum == forknum);
+			Assert(head_cluster->blocknum + head_cluster->nblocks == blocknum);
+			Assert(head_cluster->need_complete);
+		}
+
+		head_cluster->per_io_data_index[head_cluster->nblocks] = pgsr->per_io_data_next++;
+		head_cluster->need_advice[head_cluster->nblocks] = allocated;
+		head_cluster->buffers[head_cluster->nblocks] = buffer;
+		head_cluster->nblocks++;
+
+		if (pgsr->per_io_data_next == pgsr->max_pinned_buffers)
+			pgsr->per_io_data_next = 0;
+
+	} while (pgsr->ios_in_progress < pgsr->max_ios &&
+			 pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+	/*
+	 * Initiate as soon as we can if we can't prepare any more reads right
+	 * now.  This makes sure we issue the advice as soon as possible, since
+	 * any other backend that tries to read the same block won't do that.
+	 */
+	if (pgsr->clusters[pgsr->head].nblocks > 0)
+		pg_streaming_read_submit(pgsr);
+}
+
+void
+pg_streaming_read_reset(PgStreamingRead *pgsr)
+{
+	pgsr->finished = false;
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_io_data)
+{
+	pg_streaming_read_prefetch(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadCluster *tail_cluster;
+
+		tail_cluster = &pgsr->clusters[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * cluster?
+		 */
+		if (tail_cluster->need_complete)
+		{
+			CompleteReadBuffers(tail_cluster->bmr,
+								tail_cluster->buffers,
+								tail_cluster->forknum,
+								tail_cluster->blocknum,
+								tail_cluster->nblocks,
+								false,
+								pgsr->strategy);
+			tail_cluster->need_complete = false;
+
+			/* We only counted this I/O as running if we issued advice. */
+			if (tail_cluster->advice_issued)
+				pgsr->ios_in_progress--;
+		}
+
+		/* Are there more buffers available in this cluster? */
+		if (pgsr->next_tail_buffer < tail_cluster->nblocks)
+		{
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (per_io_data)
+				*per_io_data = (char *) pgsr->per_io_data +
+					tail_cluster->per_io_data_index[pgsr->next_tail_buffer] *
+					pgsr->per_io_data_size;
+
+			return tail_cluster->buffers[pgsr->next_tail_buffer++];
+		}
+
+		/* Advance tail to next cluster, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+	}
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop reading ahead, and unpin anything that wasn't consumed. */
+	pgsr->finished = true;
+	for (;;)
+	{
+		buffer = pg_streaming_read_buffer_get_next(pgsr, NULL);
+		if (buffer == InvalidBuffer)
+			break;
+		ReleaseBuffer(buffer);
+	}
+
+	if (pgsr->per_io_data)
+		pfree(pgsr->per_io_data);
+	pfree(pgsr);
+}
+
+int
+pg_streaming_read_ios(PgStreamingRead *pgsr)
+{
+	return pgsr->ios_in_progress;
+}
+
+int
+pg_streaming_read_pins(PgStreamingRead *pgsr)
+{
+	return pgsr->pinned_buffers;
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d647708f7f..678e3239a9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1878,7 +1878,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 6ea9faa439..b196d9f885 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 5a2f66ed47..435239955e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -271,6 +271,7 @@ extern void AbortBufferIO(Buffer buffer);
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
 extern void TestForOldSnapshot_impl(Snapshot snapshot, Relation relation);
+extern void LimitAdditionalPins(uint32 *additional_pins);
 
 /* in buf_init.c */
 extern void InitBufferPool(void);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 0000000000..7717cdf056
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,38 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/*
+ * For most sequential access, callers can user this size to build full sized
+ * reads without pinning many extra buffers.
+ */
+#define PG_STREAMING_READ_DEFAULT_MAX_IOS MAX_BUFFERS_PER_TRANSFER
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
+													  uintptr_t pgsr_private,
+													  void *per_io_private,
+													  BufferManagerRelation *bmr,
+													  ForkNumber *forkNum,
+													  BlockNumber *blockNum,
+													  ReadBufferMode *mode);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int max_ios,
+													   size_t per_io_private_size,
+													   uintptr_t pgsr_private,
+													   BufferAccessStrategy strategy,
+													   PgStreamingReadBufferDetermineNextCB determine_next_cb);
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_io_private);
+extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+extern int pg_streaming_read_ios(PgStreamingRead *pgsr);
+extern int pg_streaming_read_pins(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 49a33c0387..a0752fa30e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2076,6 +2076,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadCluster
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
-- 
2.39.2

v1-0007-Use-streaming-reads-in-pg_prewarm.patchtext/x-patch; charset=US-ASCII; name=v1-0007-Use-streaming-reads-in-pg_prewarm.patchDownload

From 8b7fac9da6c2cd5c0ba5a8ff5ad19bbb5b161b2e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 09:28:42 +1200
Subject: [PATCH v1 07/14] Use streaming reads in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use streaming reads.  This
provides a simple example of such a transformation, and generates fewer
system calls.

Author: Thomas Munro <thomas.munro@gmail.com>
---
 contrib/pg_prewarm/pg_prewarm.c | 53 ++++++++++++++++++++++++++++++++-
 1 file changed, 52 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index e464d0d4d2..30f4704541 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,38 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static bool
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+							   uintptr_t pgsr_private,
+							   void *per_io_data,
+							   BufferManagerRelation *bmr,
+							   ForkNumber *forknum,
+							   BlockNumber *blocknum,
+							   ReadBufferMode *mode)
+{
+	struct pg_prewarm_streaming_read_private *p =
+		(struct pg_prewarm_streaming_read_private *) pgsr_private;
+
+	if (p->blocknum <= p->last_block)
+	{
+		*bmr = p->bmr;
+		*forknum = p->forknum;
+		*mode = RBM_NORMAL;
+		*blocknum = p->blocknum++;
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +216,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		PgStreamingRead *pgsr;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.bmr = BMR_REL(rel);
+		p.forknum = forkNumber;
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		pgsr = pg_streaming_read_buffer_alloc(PG_STREAMING_READ_DEFAULT_MAX_IOS,
+											  0,
+											  (uintptr_t) &p,
+											  NULL,
+											  pg_prewarm_streaming_read_next);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Close relation, release lock. */
-- 
2.39.2

v1-0008-WIP-Use-streaming-reads-in-heapam-scans.patchtext/x-patch; charset=US-ASCII; name=v1-0008-WIP-Use-streaming-reads-in-heapam-scans.patchDownload

From 091bc69a8b02e755d55b27d58fb1c17e9ab9a6dc Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 12 Apr 2023 18:16:06 -0700
Subject: [PATCH v1 08/14] WIP: Use streaming reads in heapam scans.

XXX Cherry-picked from https://github.com/anarazel/postgres/tree/aio and
lightly modified by TM, for demonstration purposes.

Author: Andres Freund <andres@anarazel.de>
---
 src/backend/access/heap/heapam.c         | 220 +++++++++++++++++++++--
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/include/access/heapam.h              |   5 +-
 3 files changed, 207 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6a66214a58..01cca53586 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -65,6 +65,7 @@
 #include "storage/smgr.h"
 #include "storage/spin.h"
 #include "storage/standby.h"
+#include "storage/streaming_read.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
@@ -225,6 +226,99 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static bool
+heap_pgsr_next_single(PgStreamingRead *pgsr,
+					  uintptr_t pgsr_private, void *per_io_data,
+					  BufferManagerRelation *bmr, ForkNumber *fork,
+					  BlockNumber *block, ReadBufferMode *mode)
+{
+	HeapScanDesc scan = (HeapScanDesc) pgsr_private;
+	BlockNumber blockno;
+
+	Assert(!scan->rs_base.rs_parallel);
+	Assert(scan->rs_nblocks > 0);
+
+	if (scan->rs_prefetch_block == InvalidBlockNumber)
+	{
+		scan->rs_prefetch_block = blockno = scan->rs_startblock;
+	}
+	else
+	{
+		blockno = ++scan->rs_prefetch_block;
+
+		/* wrap back to the start of the heap */
+		if (blockno >= scan->rs_nblocks)
+			scan->rs_prefetch_block = blockno = 0;
+
+		/* we're done if we're back at where we started */
+		if (blockno == scan->rs_startblock)
+			return false;
+
+		/* check if the limit imposed by heap_setscanlimits() is met */
+		if (scan->rs_numblocks != InvalidBlockNumber)
+		{
+			if (--scan->rs_numblocks == 0)
+				return false;
+		}
+	}
+
+	*bmr = BMR_REL(scan->rs_base.rs_rd);
+	*fork = MAIN_FORKNUM;
+	*block = blockno;
+	*mode = RBM_NORMAL;
+
+	return true;
+}
+
+static bool
+heap_pgsr_next_parallel(PgStreamingRead *pgsr,
+						uintptr_t pgsr_private, void *per_io_data,
+						BufferManagerRelation *bmr, ForkNumber *fork,
+						BlockNumber *block, ReadBufferMode *mode)
+{
+	HeapScanDesc scan = (HeapScanDesc) pgsr_private;
+	ParallelBlockTableScanDesc pbscan =
+		(ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+	ParallelBlockTableScanWorker pbscanwork =
+		scan->rs_parallelworkerdata;
+	BlockNumber blockno;
+
+	Assert(scan->rs_base.rs_parallel);
+	Assert(scan->rs_nblocks > 0);
+
+	/* Note that other processes might have already finished the scan */
+	blockno = table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
+												pbscanwork, pbscan);
+	if (blockno == InvalidBlockNumber)
+		return false;
+
+	*bmr = BMR_REL(scan->rs_base.rs_rd);
+	*fork = MAIN_FORKNUM;
+	*block = blockno;
+	*mode = RBM_NORMAL;
+
+	return true;
+}
+
+static PgStreamingRead *
+heap_pgsr_single_alloc(HeapScanDesc scan)
+{
+	return pg_streaming_read_buffer_alloc(PG_STREAMING_READ_DEFAULT_MAX_IOS,
+										  0, (uintptr_t) scan,
+										  scan->rs_strategy,
+										  heap_pgsr_next_single);
+}
+
+static PgStreamingRead *
+heap_pgsr_parallel_alloc(HeapScanDesc scan)
+{
+	return pg_streaming_read_buffer_alloc(PG_STREAMING_READ_DEFAULT_MAX_IOS,
+										  0, (uintptr_t) scan,
+										  scan->rs_strategy,
+										  heap_pgsr_next_parallel);
+}
+
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -342,6 +436,26 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	 */
 	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN)
 		pgstat_count_heap_scan(scan->rs_base.rs_rd);
+
+	scan->rs_prefetch_block = InvalidBlockNumber;
+	if (scan->pgsr)
+	{
+		pg_streaming_read_free(scan->pgsr);
+		scan->pgsr = NULL;
+	}
+
+	/*
+	 * FIXME: This probably should be done in the !rs_inited blocks instead.
+	 */
+	scan->pgsr = NULL;
+	if (!RelationUsesLocalBuffers(scan->rs_base.rs_rd) &&
+		(scan->rs_base.rs_flags & SO_TYPE_SEQSCAN))
+	{
+		if (scan->rs_base.rs_parallel)
+			scan->pgsr = heap_pgsr_parallel_alloc(scan);
+		else
+			scan->pgsr = heap_pgsr_single_alloc(scan);
+	}
 }
 
 /*
@@ -374,7 +488,7 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
  * which tuples on the page are visible.
  */
 void
-heapgetpage(TableScanDesc sscan, BlockNumber block)
+heapgetpage(TableScanDesc sscan, BlockNumber block, Buffer pgsr_buffer)
 {
 	HeapScanDesc scan = (HeapScanDesc) sscan;
 	Buffer		buffer;
@@ -401,9 +515,20 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
+	if (BufferIsValid(pgsr_buffer))
+	{
+		Assert(scan->pgsr);
+		Assert(BufferGetBlockNumber(pgsr_buffer) == block);
+		scan->rs_cbuf = pgsr_buffer;
+	}
+	else
+	{
+		Assert(!scan->pgsr);
+
+		/* read page using selected strategy */
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
+										   RBM_NORMAL, scan->rs_strategy);
+	}
 	scan->rs_cblock = block;
 
 	if (!(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE))
@@ -491,7 +616,7 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
  * of the pages before we can get a chance to get our first page.
  */
 static BlockNumber
-heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
+heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir, Buffer *pgsr_buf)
 {
 	Assert(!scan->rs_inited);
 
@@ -501,16 +626,25 @@ heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 
 	if (ScanDirectionIsForward(dir))
 	{
+		if (scan->rs_base.rs_parallel != NULL)
+			table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
+													 scan->rs_parallelworkerdata,
+													 (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+
+		/* FIXME: Integrate more neatly */
+		if (scan->pgsr)
+		{
+			*pgsr_buf = pg_streaming_read_buffer_get_next(scan->pgsr, NULL);
+			if (*pgsr_buf == InvalidBuffer)
+				return InvalidBlockNumber;
+			return BufferGetBlockNumber(*pgsr_buf);
+		}
+
 		/* serial scan */
 		if (scan->rs_base.rs_parallel == NULL)
 			return scan->rs_startblock;
 		else
 		{
-			/* parallel scan */
-			table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
-													 scan->rs_parallelworkerdata,
-													 (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
-
 			/* may return InvalidBlockNumber if there are no more blocks */
 			return table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
 													 scan->rs_parallelworkerdata,
@@ -530,6 +664,12 @@ heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 		 */
 		scan->rs_base.rs_flags &= ~SO_ALLOW_SYNC;
 
+		if (scan->pgsr)
+		{
+			pg_streaming_read_free(scan->pgsr);
+			scan->pgsr = NULL;
+		}
+
 		/*
 		 * Start from last page of the scan.  Ensure we take into account
 		 * rs_numblocks if it's been adjusted by heap_setscanlimits().
@@ -635,11 +775,38 @@ heapgettup_continue_page(HeapScanDesc scan, ScanDirection dir, int *linesleft,
  * heap_setscanlimits().
  */
 static inline BlockNumber
-heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir)
+heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir,
+						 Buffer *pgsr_buf)
 {
 	if (ScanDirectionIsForward(dir))
 	{
-		if (scan->rs_base.rs_parallel == NULL)
+		if (scan->pgsr)
+		{
+#ifdef USE_ASSERT_CHECKING
+			block++;
+
+			/* wrap back to the start of the heap */
+			if (block >= scan->rs_nblocks)
+				block = 0;
+
+			/* we're done if we're back at where we started */
+			if (block == scan->rs_startblock)
+				block = InvalidBlockNumber;
+			else if (scan->rs_numblocks != InvalidBlockNumber)
+			{
+				if (--scan->rs_numblocks == 0)
+					block = InvalidBlockNumber;
+			}
+#endif
+			*pgsr_buf = pg_streaming_read_buffer_get_next(scan->pgsr, NULL);
+			if (*pgsr_buf == InvalidBuffer)
+				return InvalidBlockNumber;
+
+			Assert(scan->rs_base.rs_parallel ||
+				   block == BufferGetBlockNumber(*pgsr_buf));
+			return BufferGetBlockNumber(*pgsr_buf);
+		}
+		else if (scan->rs_base.rs_parallel == NULL)
 		{
 			block++;
 
@@ -684,6 +851,12 @@ heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir
 	}
 	else
 	{
+		if (scan->pgsr)
+		{
+			pg_streaming_read_free(scan->pgsr);
+			scan->pgsr = NULL;
+		}
+
 		/* we're done if the last block is the start position */
 		if (block == scan->rs_startblock)
 			return InvalidBlockNumber;
@@ -733,13 +906,14 @@ heapgettup(HeapScanDesc scan,
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
 	BlockNumber block;
+	Buffer		pgsr_buf = InvalidBuffer;
 	Page		page;
 	OffsetNumber lineoff;
 	int			linesleft;
 
 	if (unlikely(!scan->rs_inited))
 	{
-		block = heapgettup_initial_block(scan, dir);
+		block = heapgettup_initial_block(scan, dir, &pgsr_buf);
 		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
 		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
 		scan->rs_inited = true;
@@ -760,7 +934,7 @@ heapgettup(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapgetpage((TableScanDesc) scan, block, pgsr_buf);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -812,9 +986,10 @@ continue_page:
 		 * it's time to move to the next.
 		 */
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
+		pgsr_buf = InvalidBuffer;
 
 		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
+		block = heapgettup_advance_block(scan, block, dir, &pgsr_buf);
 	}
 
 	/* end of scan */
@@ -848,13 +1023,14 @@ heapgettup_pagemode(HeapScanDesc scan,
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
 	BlockNumber block;
+	Buffer		pgsr_buf = InvalidBuffer;
 	Page		page;
 	int			lineindex;
 	int			linesleft;
 
 	if (unlikely(!scan->rs_inited))
 	{
-		block = heapgettup_initial_block(scan, dir);
+		block = heapgettup_initial_block(scan, dir, &pgsr_buf);
 		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
 		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
 		scan->rs_inited = true;
@@ -882,7 +1058,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapgetpage((TableScanDesc) scan, block, pgsr_buf);
 		page = BufferGetPage(scan->rs_cbuf);
 		TestForOldSnapshot(scan->rs_base.rs_snapshot, scan->rs_base.rs_rd, page);
 		linesleft = scan->rs_ntuples;
@@ -915,7 +1091,7 @@ continue_page:
 		}
 
 		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
+		block = heapgettup_advance_block(scan, block, dir, &pgsr_buf);
 	}
 
 	/* end of scan */
@@ -963,6 +1139,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_base.rs_parallel = parallel_scan;
 	scan->rs_strategy = NULL;	/* set in initscan */
 
+	scan->pgsr = NULL;
+
 	/*
 	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
 	 */
@@ -1069,6 +1247,12 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	if (scan->pgsr)
+	{
+		pg_streaming_read_free(scan->pgsr);
+		scan->pgsr = NULL;
+	}
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 5a17112c91..26378cbb4b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2339,7 +2339,7 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 		return false;
 	}
 
-	heapgetpage(scan, blockno);
+	heapgetpage(scan, blockno, InvalidBuffer);
 	hscan->rs_inited = true;
 
 	return true;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index faf5026519..852d47b1f8 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -59,6 +59,7 @@ typedef struct HeapScanDescData
 	bool		rs_inited;		/* false = scan not init'd yet */
 	OffsetNumber rs_coffset;	/* current offset # in non-page-at-a-time mode */
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
+	BlockNumber rs_prefetch_block;	/* block being prefetched */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
 
@@ -72,6 +73,8 @@ typedef struct HeapScanDescData
 	 */
 	ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
 
+	struct PgStreamingRead *pgsr;
+
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
@@ -212,7 +215,7 @@ extern TableScanDesc heap_beginscan(Relation relation, Snapshot snapshot,
 									uint32 flags);
 extern void heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk,
 							   BlockNumber numBlks);
-extern void heapgetpage(TableScanDesc sscan, BlockNumber block);
+extern void heapgetpage(TableScanDesc sscan, BlockNumber block, Buffer buffer);
 extern void heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 						bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(TableScanDesc sscan);
-- 
2.39.2

v1-0009-WIP-Use-streaming-reads-in-vacuum.patchtext/x-patch; charset=US-ASCII; name=v1-0009-WIP-Use-streaming-reads-in-vacuum.patchDownload

From faa521e91fda707b5f3ac2f760c5bd70ae66b53b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 12 Apr 2023 15:44:19 -0700
Subject: [PATCH v1 09/14] WIP: Use streaming reads in vacuum.

XXX Cherry-picked from https://github.com/anarazel/postgres/tree/aio and
lightly modified by TM, for demonstration purposes (I still need to
understand why regress/regress's stats test reuse counter doesn't like
this).

Author: Andres Freund <andres@anarazel.de>
---
 src/backend/access/heap/vacuumlazy.c | 305 +++++++++++++++++++++------
 src/test/regress/expected/stats.out  |  10 +-
 src/test/regress/sql/stats.sql       |   5 +-
 src/tools/pgindent/typedefs.list     |   2 +
 4 files changed, 244 insertions(+), 78 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 6a41ee635d..cd1a14c112 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
+#include "storage/streaming_read.h"
 #include "tcop/tcopprot.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -195,6 +196,17 @@ typedef struct LVRelState
 	BlockNumber missed_dead_pages;	/* # pages with missed dead tuples */
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 
+	/*
+	 * State managed by vacuum_scan_pgsr_next et al follows
+	 */
+	BlockNumber blkno_prefetch;
+	BlockNumber next_unskippable_block;
+	Buffer		vmbuffer_prefetch;
+	Buffer		vmbuffer_data;
+
+	bool		next_unskippable_allvis;
+	bool		skipping_current_range;
+
 	/* Statistics output by us, for table */
 	double		new_rel_tuples; /* new estimated total # of tuples */
 	double		new_live_tuples;	/* new estimated total # of live tuples */
@@ -785,6 +797,58 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	}
 }
 
+static bool
+vacuum_scan_pgsr_next(PgStreamingRead *pgsr,
+					  uintptr_t pgsr_private, void *io_private,
+					  BufferManagerRelation *bmr, ForkNumber *fork, BlockNumber *block,
+					  ReadBufferMode *mode)
+{
+	LVRelState *vacrel = (LVRelState *) pgsr_private;
+
+	while (vacrel->blkno_prefetch < vacrel->rel_pages)
+	{
+		if (vacrel->blkno_prefetch == vacrel->next_unskippable_block)
+		{
+			/*
+			 * XXX: Does a vacuum_delay_point, it'd be better if this were
+			 * outside the callback...
+			 */
+			vacrel->next_unskippable_block = lazy_scan_skip(vacrel,
+															&vacrel->vmbuffer_prefetch,
+															vacrel->blkno_prefetch + 1,
+															&vacrel->next_unskippable_allvis,
+															&vacrel->skipping_current_range);
+			Assert(vacrel->next_unskippable_block >= vacrel->blkno_prefetch + 1);
+		}
+		else
+		{
+			/*
+			 * Can't skip this page safely.  Must scan the page.  But
+			 * determine the next skippable range after the page first.
+			 */
+			Assert(vacrel->blkno_prefetch < vacrel->rel_pages - 1);
+
+			if (vacrel->skipping_current_range)
+			{
+				vacrel->blkno_prefetch++;
+				continue;
+			}
+		}
+
+		Assert(vacrel->blkno_prefetch < vacrel->rel_pages);
+
+		*bmr = BMR_REL(vacrel->rel);
+		*fork = MAIN_FORKNUM;
+		*block = vacrel->blkno_prefetch++;
+		*mode = RBM_NORMAL;
+
+		return true;
+	}
+
+	Assert(vacrel->blkno_prefetch == vacrel->rel_pages);
+	return false;
+}
+
 /*
  *	lazy_scan_heap() -- workhorse function for VACUUM
  *
@@ -825,19 +889,24 @@ static void
 lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-				blkno,
-				next_unskippable_block,
 				next_fsm_block_to_vacuum = 0;
 	VacDeadItems *dead_items = vacrel->dead_items;
-	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
 		PROGRESS_VACUUM_MAX_DEAD_TUPLES
 	};
 	int64		initprog_val[3];
+	PgStreamingRead *pgsr;
+
+	{
+		int			iodepth = Max(Min(128, NBuffers / 128), 1);
+
+		pgsr = pg_streaming_read_buffer_alloc(iodepth, 0,
+											  (uintptr_t) vacrel,
+											  vacrel->bstrategy,
+											  vacuum_scan_pgsr_next);
+	}
 
 	/* Report that we're scanning the heap, advertising total # of blocks */
 	initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
@@ -846,42 +915,44 @@ lazy_scan_heap(LVRelState *vacrel)
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
 	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
-	for (blkno = 0; blkno < rel_pages; blkno++)
+	vacrel->next_unskippable_block =
+		lazy_scan_skip(vacrel,
+					   &vacrel->vmbuffer_prefetch,
+					   0,
+					   &vacrel->next_unskippable_allvis,
+					   &vacrel->skipping_current_range);
+
+	while (true)
 	{
 		Buffer		buf;
+		BlockNumber blkno;
 		Page		page;
-		bool		all_visible_according_to_vm;
+		bool		all_visible_according_to_vm = false;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
-		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
+		buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
+		if (!BufferIsValid(buf))
+			break;
 
-			Assert(next_unskippable_block >= blkno + 1);
-		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
+		CheckBufferIsPinnedOnce(buf);
 
-			if (skipping_current_range)
-				continue;
+		page = BufferGetPage(buf);
 
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/* FIXME: should have more efficient way to determine this. */
+		blkno = BufferGetBlockNumber(buf);
+
+		pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+
+		update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
+								 blkno, InvalidOffsetNumber);
+
+		vacuum_delay_point();
 
+		/*
+		 * We're not skipping this page using the visibility map, and so it is
+		 * (by definition) a scanned page.  Any tuples from this page are now
+		 * guaranteed to be counted below, after some preparatory checks.
+		 */
 		vacrel->scanned_pages++;
 
 		/* Report as block scanned, update error traceback information */
@@ -918,10 +989,16 @@ lazy_scan_heap(LVRelState *vacrel)
 			 * correctness, but we do it anyway to avoid holding the pin
 			 * across a lengthy, unrelated operation.
 			 */
-			if (BufferIsValid(vmbuffer))
+			if (BufferIsValid(vacrel->vmbuffer_data))
+			{
+				ReleaseBuffer(vacrel->vmbuffer_data);
+				vacrel->vmbuffer_data = InvalidBuffer;
+			}
+
+			if (BufferIsValid(vacrel->vmbuffer_prefetch))
 			{
-				ReleaseBuffer(vmbuffer);
-				vmbuffer = InvalidBuffer;
+				ReleaseBuffer(vacrel->vmbuffer_prefetch);
+				vacrel->vmbuffer_prefetch = InvalidBuffer;
 			}
 
 			/* Perform a round of index and heap vacuuming */
@@ -941,12 +1018,26 @@ lazy_scan_heap(LVRelState *vacrel)
 										 PROGRESS_VACUUM_PHASE_SCAN_HEAP);
 		}
 
+		/*
+		 * Normally, the fact that we can't skip this block must mean that
+		 * it's not all-visible.  But in an aggressive vacuum we know only
+		 * that it's not all-frozen, so it might still be all-visible.
+		 *
+		 * FIXME: deduplicate
+		 */
+		if (vacrel->aggressive &&
+			VM_ALL_VISIBLE(vacrel->rel, blkno, &vacrel->vmbuffer_data))
+			all_visible_according_to_vm = true;
+
 		/*
 		 * Pin the visibility map page in case we need to mark the page
 		 * all-visible.  In most cases this will be very cheap, because we'll
 		 * already have the correct page pinned anyway.
+		 *
+		 * FIXME: This needs to be updated based on the fact that there's now
+		 * two different pins.
 		 */
-		visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
+		visibilitymap_pin(vacrel->rel, blkno, &vacrel->vmbuffer_data);
 
 		/*
 		 * We need a buffer cleanup lock to prune HOT chains and defragment
@@ -954,9 +1045,6 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * a cleanup lock right away, we may be able to settle for reduced
 		 * processing using lazy_scan_noprune.
 		 */
-		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-								 vacrel->bstrategy);
-		page = BufferGetPage(buf);
 		if (!ConditionalLockBufferForCleanup(buf))
 		{
 			bool		hastup,
@@ -966,7 +1054,7 @@ lazy_scan_heap(LVRelState *vacrel)
 
 			/* Check for new or empty pages before lazy_scan_noprune call */
 			if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
-									   vmbuffer))
+									   vacrel->vmbuffer_data))
 			{
 				/* Processed as new/empty page (lock and pin released) */
 				continue;
@@ -1004,7 +1092,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/* Check for new or empty pages before lazy_scan_prune call */
-		if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
+		if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vacrel->vmbuffer_data))
 		{
 			/* Processed as new/empty page (lock and pin released) */
 			continue;
@@ -1041,7 +1129,7 @@ lazy_scan_heap(LVRelState *vacrel)
 			{
 				Size		freespace;
 
-				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
+				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vacrel->vmbuffer_data);
 
 				/* Forget the LP_DEAD items that we just vacuumed */
 				dead_items->num_items = 0;
@@ -1114,7 +1202,7 @@ lazy_scan_heap(LVRelState *vacrel)
 			PageSetAllVisible(page);
 			MarkBufferDirty(buf);
 			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, prunestate.visibility_cutoff_xid,
+							  vacrel->vmbuffer_data, prunestate.visibility_cutoff_xid,
 							  flags);
 		}
 
@@ -1125,11 +1213,11 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * with buffer lock before concluding that the VM is corrupt.
 		 */
 		else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
-				 visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0)
+				 visibilitymap_get_status(vacrel->rel, blkno, &vacrel->vmbuffer_data))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
 				 vacrel->relname, blkno);
-			visibilitymap_clear(vacrel->rel, blkno, vmbuffer,
+			visibilitymap_clear(vacrel->rel, blkno, vacrel->vmbuffer_data,
 								VISIBILITYMAP_VALID_BITS);
 		}
 
@@ -1154,7 +1242,7 @@ lazy_scan_heap(LVRelState *vacrel)
 				 vacrel->relname, blkno);
 			PageClearAllVisible(page);
 			MarkBufferDirty(buf);
-			visibilitymap_clear(vacrel->rel, blkno, vmbuffer,
+			visibilitymap_clear(vacrel->rel, blkno, vacrel->vmbuffer_data,
 								VISIBILITYMAP_VALID_BITS);
 		}
 
@@ -1165,7 +1253,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		 */
 		else if (all_visible_according_to_vm && prunestate.all_visible &&
 				 prunestate.all_frozen &&
-				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
+				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vacrel->vmbuffer_data))
 		{
 			/*
 			 * Avoid relying on all_visible_according_to_vm as a proxy for the
@@ -1187,7 +1275,7 @@ lazy_scan_heap(LVRelState *vacrel)
 			 */
 			Assert(!TransactionIdIsValid(prunestate.visibility_cutoff_xid));
 			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, InvalidTransactionId,
+							  vacrel->vmbuffer_data, InvalidTransactionId,
 							  VISIBILITYMAP_ALL_VISIBLE |
 							  VISIBILITYMAP_ALL_FROZEN);
 		}
@@ -1229,11 +1317,19 @@ lazy_scan_heap(LVRelState *vacrel)
 	}
 
 	vacrel->blkno = InvalidBlockNumber;
-	if (BufferIsValid(vmbuffer))
-		ReleaseBuffer(vmbuffer);
+	if (BufferIsValid(vacrel->vmbuffer_data))
+	{
+		ReleaseBuffer(vacrel->vmbuffer_data);
+		vacrel->vmbuffer_data = InvalidBuffer;
+	}
+	if (BufferIsValid(vacrel->vmbuffer_prefetch))
+	{
+		ReleaseBuffer(vacrel->vmbuffer_prefetch);
+		vacrel->vmbuffer_prefetch = InvalidBuffer;
+	}
 
 	/* report that everything is now scanned */
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, vacrel->rel_pages);
 
 	/* now we can compute the new value for pg_class.reltuples */
 	vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
@@ -1248,6 +1344,8 @@ lazy_scan_heap(LVRelState *vacrel)
 		Max(vacrel->new_live_tuples, 0) + vacrel->recently_dead_tuples +
 		vacrel->missed_dead_tuples;
 
+	pg_streaming_read_free(pgsr);
+
 	/*
 	 * Do index vacuuming (call each index's ambulkdelete routine), then do
 	 * related heap vacuuming
@@ -1259,11 +1357,11 @@ lazy_scan_heap(LVRelState *vacrel)
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
 	 * not there were indexes, and whether or not we bypassed index vacuuming.
 	 */
-	if (blkno > next_fsm_block_to_vacuum)
-		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
+	if (vacrel->rel_pages > next_fsm_block_to_vacuum)
+		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, vacrel->rel_pages);
 
 	/* report all blocks vacuumed */
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, vacrel->rel_pages);
 
 	/* Do final index cleanup (call each index's amvacuumcleanup routine) */
 	if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
@@ -2414,6 +2512,57 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 	return allindexes;
 }
 
+typedef struct VacuumHeapBlockState
+{
+	BlockNumber blkno;
+	int			start_tupindex;
+	int			end_tupindex;
+} VacuumHeapBlockState;
+
+typedef struct VacuumHeapState
+{
+	Relation	relation;
+	LVRelState *vacrel;
+	BlockNumber last_block;
+	int			next_tupindex;
+} VacuumHeapState;
+
+static bool
+vacuum_heap_pgsr_next(PgStreamingRead *pgsr,
+					  uintptr_t pgsr_private,
+					  void *io_private,
+					  BufferManagerRelation *bmr, ForkNumber *forkNum,
+					  BlockNumber *block,
+					  ReadBufferMode *mode)
+{
+	VacuumHeapState *vhs = (VacuumHeapState *) pgsr_private;
+	VacDeadItems *dead_items = vhs->vacrel->dead_items;
+	VacuumHeapBlockState *bs = io_private;
+
+	if (vhs->next_tupindex == dead_items->num_items)
+		return false;
+
+	bs->blkno = ItemPointerGetBlockNumber(&dead_items->items[vhs->next_tupindex]);
+	bs->start_tupindex = vhs->next_tupindex;
+	bs->end_tupindex = vhs->next_tupindex;
+
+	for (; vhs->next_tupindex < dead_items->num_items; vhs->next_tupindex++)
+	{
+		BlockNumber curblkno = ItemPointerGetBlockNumber(&dead_items->items[vhs->next_tupindex]);
+
+		if (bs->blkno != curblkno)
+			break;				/* past end of tuples for this block */
+		bs->end_tupindex = vhs->next_tupindex;
+	}
+
+	*bmr = BMR_REL(vhs->relation);
+	*forkNum = MAIN_FORKNUM;
+	*block = bs->blkno;
+	*mode = RBM_NORMAL;
+
+	return true;
+}
+
 /*
  *	lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
  *
@@ -2437,6 +2586,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 {
 	int			index = 0;
 	BlockNumber vacuumed_pages = 0;
+	VacuumHeapState vhs;
+	PgStreamingRead *pgsr;
 	Buffer		vmbuffer = InvalidBuffer;
 	LVSavedErrInfo saved_err_info;
 
@@ -2453,40 +2604,56 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 InvalidBlockNumber, InvalidOffsetNumber);
 
-	while (index < vacrel->dead_items->num_items)
+	vhs.relation = vacrel->rel;
+	vhs.vacrel = vacrel;
+	vhs.last_block = InvalidBlockNumber;
+	vhs.next_tupindex = 0;
+	pgsr = pg_streaming_read_buffer_alloc(512,
+										  sizeof(VacuumHeapBlockState),
+										  (uintptr_t) &vhs,
+										  vacrel->bstrategy,
+										  vacuum_heap_pgsr_next);
+	while (true)
 	{
-		BlockNumber blkno;
-		Buffer		buf;
+		VacuumHeapBlockState *bs;
 		Page		page;
 		Size		freespace;
+		Buffer		buffer;
 
-		vacuum_delay_point();
+		buffer = pg_streaming_read_buffer_get_next(pgsr, (void **) &bs);
+		if (!BufferIsValid(buffer))
+			break;
 
-		blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
-		vacrel->blkno = blkno;
+		Assert(bs->blkno == BufferGetBlockNumber(buffer));
+		Assert(bs->start_tupindex == index);
+		vacrel->blkno = bs->blkno;
 
 		/*
 		 * Pin the visibility map page in case we need to mark the page
 		 * all-visible.  In most cases this will be very cheap, because we'll
 		 * already have the correct page pinned anyway.
 		 */
-		visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
+		visibilitymap_pin(vacrel->rel, bs->blkno, &vmbuffer);
 
 		/* We need a non-cleanup exclusive lock to mark dead_items unused */
-		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-								 vacrel->bstrategy);
-		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
-		index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		index = lazy_vacuum_heap_page(vacrel, bs->blkno, buffer, index,
+									  vmbuffer);
 
-		/* Now that we've vacuumed the page, record its available space */
-		page = BufferGetPage(buf);
+		/* Now that we've compacted the page, record its available space */
+		page = BufferGetPage(buffer);
 		freespace = PageGetHeapFreeSpace(page);
 
-		UnlockReleaseBuffer(buf);
-		RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+		UnlockReleaseBuffer(buffer);
+		RecordPageWithFreeSpace(vacrel->rel, bs->blkno, freespace);
 		vacuumed_pages++;
+
+		Assert(bs->end_tupindex + 1 == index);
+		vacuum_delay_point();
 	}
 
+	pg_streaming_read_free(pgsr);
+
 	vacrel->blkno = InvalidBlockNumber;
 	if (BufferIsValid(vmbuffer))
 		ReleaseBuffer(vmbuffer);
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 94187e59cf..b20b9191e0 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1467,13 +1467,9 @@ SELECT :io_sum_vac_strategy_after_reads > :io_sum_vac_strategy_before_reads;
  t
 (1 row)
 
-SELECT (:io_sum_vac_strategy_after_reuses + :io_sum_vac_strategy_after_evictions) >
-  (:io_sum_vac_strategy_before_reuses + :io_sum_vac_strategy_before_evictions);
- ?column? 
-----------
- t
-(1 row)
-
+-- TM: This breaks with streaming read: investigate!
+-- SELECT (:io_sum_vac_strategy_after_reuses + :io_sum_vac_strategy_after_evictions) >
+--   (:io_sum_vac_strategy_before_reuses + :io_sum_vac_strategy_before_evictions);
 RESET wal_skip_threshold;
 -- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
 -- BufferAccessStrategy, are tracked in pg_stat_io.
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 1e21e55c6d..a8fec6d000 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -735,8 +735,9 @@ SELECT pg_stat_force_next_flush();
 SELECT sum(reuses) AS reuses, sum(reads) AS reads, sum(evictions) AS evictions
   FROM pg_stat_io WHERE context = 'vacuum' \gset io_sum_vac_strategy_after_
 SELECT :io_sum_vac_strategy_after_reads > :io_sum_vac_strategy_before_reads;
-SELECT (:io_sum_vac_strategy_after_reuses + :io_sum_vac_strategy_after_evictions) >
-  (:io_sum_vac_strategy_before_reuses + :io_sum_vac_strategy_before_evictions);
+-- TM: This breaks with streaming read: investigate!
+-- SELECT (:io_sum_vac_strategy_after_reuses + :io_sum_vac_strategy_after_evictions) >
+--   (:io_sum_vac_strategy_before_reuses + :io_sum_vac_strategy_before_evictions);
 RESET wal_skip_threshold;
 
 -- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a0752fa30e..f4b88c9382 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2934,6 +2934,8 @@ VacDeadItems
 VacErrPhase
 VacObjFilter
 VacOptValue
+VacuumHeapBlockState
+VacuumHeapState
 VacuumParams
 VacuumRelation
 VacuumStmt
-- 
2.39.2

v1-0010-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patchtext/x-patch; charset=US-ASCII; name=v1-0010-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patchDownload

From c2ca448331460c813085a0a6ff09d53ef9f14b37 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 12 Feb 2021 14:19:10 -0800
Subject: [PATCH v1 10/14] WIP: Use streaming reads in nbtree vacuum scan.

XXX Cherry-picked from https://github.com/anarazel/postgres/tree/aio and
lightly modified by TM, for demonstration purposes.

Author: Andres Freund <andres@anarazel.de>
---
 src/backend/access/nbtree/nbtree.c | 77 +++++++++++++++++++++++++-----
 src/include/access/nbtree.h        |  3 ++
 2 files changed, 67 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 62bc9917f1..5b72144fd4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -34,6 +34,7 @@
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
@@ -81,7 +82,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
 static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 IndexBulkDeleteCallback callback, void *callback_state,
 						 BTCycleId cycleid);
-static void btvacuumpage(BTVacState *vstate, BlockNumber scanblkno);
+static void btvacuumpage(BTVacState *vstate, Buffer buf, BlockNumber scanblkno);
 static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
 										  IndexTuple posting,
 										  OffsetNumber updatedoffset,
@@ -890,6 +891,31 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+static bool
+btvacuumscan_pgsr_next(PgStreamingRead *pgsr, uintptr_t pgsr_private,
+					   void *io_private,
+					   BufferManagerRelation *bmr, ForkNumber *fork,
+					   BlockNumber *block, ReadBufferMode *mode)
+{
+	BTVacState *vstate = (BTVacState *) pgsr_private;
+
+	if (vstate->nextblock == vstate->num_pages)
+		return false;
+
+	/*
+	 * We can't use _bt_getbuf() here because it always applies
+	 * _bt_checkpage(), which will barf on an all-zero page. We want to
+	 * recycle all-zero pages, not fail.  Also, we want to use a nondefault
+	 * buffer access strategy. Also want to use AIO.
+	 */
+	*bmr = BMR_REL(vstate->info->index);
+	*fork = MAIN_FORKNUM;
+	*block = vstate->nextblock++;
+	*mode = RBM_NORMAL;
+
+	return true;
+}
+
 /*
  * btvacuumscan --- scan the index for VACUUMing purposes
  *
@@ -983,6 +1009,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	scanblkno = BTREE_METAPAGE + 1;
 	for (;;)
 	{
+		PgStreamingRead *pgsr;
+
 		/* Get the current relation length */
 		if (needLock)
 			LockRelationForExtension(rel, ExclusiveLock);
@@ -997,14 +1025,36 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		/* Quit if we've scanned the whole relation */
 		if (scanblkno >= num_pages)
 			break;
+
+		vstate.nextblock = scanblkno;
+		vstate.num_pages = num_pages;
+
+		pgsr = pg_streaming_read_buffer_alloc(512, 0, (uintptr_t) &vstate,
+											  vstate.info->strategy,
+											  btvacuumscan_pgsr_next);
+
 		/* Iterate over pages, then loop back to recheck length */
 		for (; scanblkno < num_pages; scanblkno++)
 		{
-			btvacuumpage(&vstate, scanblkno);
+			Buffer		buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
+
+			Assert(BufferIsValid(buf));
+			Assert(BufferGetBlockNumber(buf) == scanblkno);
+
+			btvacuumpage(&vstate, buf, scanblkno);
+
 			if (info->report_progress)
 				pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
 											 scanblkno);
+
+			/* call vacuum_delay_point while not holding any buffer lock */
+			vacuum_delay_point();
 		}
+
+		if (pg_streaming_read_buffer_get_next(pgsr, NULL) != 0)
+			elog(ERROR, "aio and plain pos out of sync");
+
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Set statistics num_pages field to final size of index */
@@ -1037,7 +1087,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
  * recycled (i.e. before the page split).
  */
 static void
-btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
+btvacuumpage(BTVacState *vstate, Buffer buf, BlockNumber scanblkno)
 {
 	IndexVacuumInfo *info = vstate->info;
 	IndexBulkDeleteResult *stats = vstate->stats;
@@ -1048,28 +1098,28 @@ btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
 	bool		attempt_pagedel;
 	BlockNumber blkno,
 				backtrack_to;
-	Buffer		buf;
 	Page		page;
 	BTPageOpaque opaque;
 
 	blkno = scanblkno;
 
+	Assert(BufferIsValid(buf));
 backtrack:
 
 	attempt_pagedel = false;
 	backtrack_to = P_NONE;
 
-	/* call vacuum_delay_point while not holding any buffer lock */
-	vacuum_delay_point();
-
 	/*
-	 * We can't use _bt_getbuf() here because it always applies
-	 * _bt_checkpage(), which will barf on an all-zero page. We want to
-	 * recycle all-zero pages, not fail.  Also, we want to use a nondefault
-	 * buffer access strategy.
+	 * If we're not backtracking we already read the page asynchronously. Not
+	 * using AIO when backtracking isn't too tragic - it's relatively rare,
+	 * and the buffer quite possibly is still in shared buffers.
 	 */
-	buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-							 info->strategy);
+	if (!BufferIsValid(buf))
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								 info->strategy);
+
+	Assert(BufferGetBlockNumber(buf) == blkno);
+
 	_bt_lockbuf(rel, buf, BT_READ);
 	page = BufferGetPage(buf);
 	opaque = NULL;
@@ -1356,6 +1406,7 @@ backtrack:
 	if (backtrack_to != P_NONE)
 	{
 		blkno = backtrack_to;
+		buf = InvalidBuffer;
 		goto backtrack;
 	}
 }
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 8891fa7973..51009d24bd 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -343,6 +343,9 @@ typedef struct BTVacState
 	int			maxbufsize;		/* max bufsize that respects work_mem */
 	BTPendingFSM *pendingpages; /* One entry per newly deleted page */
 	int			npendingpages;	/* current # valid pendingpages */
+
+	BlockNumber nextblock;
+	BlockNumber num_pages;
 } BTVacState;
 
 /*
-- 
2.39.2

Heikki Linnakangas

hlinnaka@iki.fi

over 2 years ago

In reply to: Thomas Munro (#1)

Re: Streaming I/O, vectored I/O (WIP)

On 31/08/2023 07:00, Thomas Munro wrote:

Currently PostgreSQL reads (and writes) data files 8KB at a time.
That's because we call ReadBuffer() one block at a time, with no
opportunity for lower layers to do better than that. This thread is
about a model where you say which block you'll want next with a
callback, and then you pull the buffers out of a "stream".

I love this idea! Makes it a lot easier to perform prefetch, as
evidenced by the 011-WIP-Use-streaming-reads-in-bitmap-heapscan.patch:

13 files changed, 289 insertions(+), 637 deletions(-)

I'm a bit disappointed and surprised by
v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:

4 files changed, 244 insertions(+), 78 deletions(-)

The current prefetching logic in vacuumlazy.c is pretty hairy, so I
hoped that this would simplify it. I didn't look closely at that patch,
so maybe it's simpler even though it's more code.

There are more kinds of streaming I/O that would be useful, such as
raw unbuffered files, and of course writes, and I've attached some
early incomplete demo code for writes (just for fun), but the main
idea I want to share in this thread is the idea of replacing lots of
ReadBuffer() calls with the streaming model.

All this makes sense. Some random comments on the patches:

+	/* Avoid a slightly more expensive kernel call if there is no benefit. */
+	if (iovcnt == 1)
+		returnCode = pg_pread(vfdP->fd,
+							  iov[0].iov_base,
+							  iov[0].iov_len,
+							  offset);
+	else
+		returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);

How about pushing down this optimization to pg_preadv() itself?
pg_readv() is currently just a macro if the system provides preadv(),
but it could be a "static inline" that does the above dance. I think
that optimization is platform-dependent anyway, pread() might not be any
faster on some OSs. In particular, if the system doesn't provide
preadv() and we use the implementation in src/port/preadv.c, it's the
same kernel call anyway.

v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch

No smgrextendv()? I guess none of the patches here needed it.

/*
* Prepare to read a block. The buffer is pinned. If this is a 'hit', then
* the returned buffer can be used immediately. Otherwise, a physical read
* should be completed with CompleteReadBuffers(). PrepareReadBuffer()
* followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the
* caller has the opportunity to coalesce reads of neighboring blocks into one
* CompleteReadBuffers() call.
*
* *found is set to true for a hit, and false for a miss.
*
* *allocated is set to true for a miss that allocates a buffer for the first
* time. If there are multiple calls to PrepareReadBuffer() for the same
* block before CompleteReadBuffers() or ReadBuffer_common() finishes the
* read, then only the first such call will receive *allocated == true, which
* the caller might use to issue just one prefetch hint.
*/
Buffer
PrepareReadBuffer(BufferManagerRelation bmr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *found,
bool *allocated)

If you decide you don't want to perform the read, after all, is there a
way to abort it without calling CompleteReadBuffers()? Looking at the
later patch that introduces the streaming read API, seems that it
finishes all the reads, so I suppose we don't need an abort function.
Does it all get cleaned up correctly on error?

/*
* Convert an array of buffer address into an array of iovec objects, and
* return the number that were required. 'iov' must have enough space for up
* to PG_IOV_MAX elements.
*/
static int
buffers_to_iov(struct iovec *iov, void **buffers, int nblocks)

The comment is a bit inaccurate. There's an assertion that If nblocks
<= PG_IOV_MAX, so while it's true that 'iov' must have enough space for
up to PG_IOV_MAX elements, that's only because we also assume that
nblocks <= PG_IOV_MAX.

I don't see anything in the callers (mdreadv() and mdwritev()) to
prevent them from passing nblocks > PG_IOV_MAX.

in streaming_read.h:

typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
uintptr_t pgsr_private,
void *per_io_private,
BufferManagerRelation *bmr,
ForkNumber *forkNum,
BlockNumber *blockNum,
ReadBufferMode *mode);

I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
each read. I see that you used that in the WAL replay prefetching, so I
guess that makes sense.

extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_io_private);
extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
extern void pg_streaming_read_free(PgStreamingRead *pgsr);

Do we need to expose pg_streaming_read_prefetch()? It's only used in the
WAL replay prefetching patch, and only after calling
pg_streaming_read_reset(). Could pg_streaming_read_reset() call
pg_streaming_read_prefetch() directly? Is there any need to "reset" the
stream, without also starting prefetching?

In v1-0012-WIP-Use-streaming-reads-in-recovery.patch:

@@ -1978,6 +1979,9 @@ XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id,
* If the WAL record contains a block reference with the given ID, *rlocator,
* *forknum, *blknum and *prefetch_buffer are filled in (if not NULL), and
* returns true.  Otherwise returns false.
+ *
+ * If prefetch_buffer is not NULL, the buffer is already pinned, and ownership
+ * of the pin is transferred to the caller.
*/
bool
XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
@@ -1998,7 +2002,15 @@ XLogRecGetBlockTagExtended(XLogReaderState *record, uint8 block_id,
if (blknum)
*blknum = bkpb->blkno;
if (prefetch_buffer)
+	{
*prefetch_buffer = bkpb->prefetch_buffer;
+
+		/*
+		 * Clear this flag is so that we can assert that redo records take
+		 * ownership of all buffers pinned by xlogprefetcher.c.
+		 */
+		bkpb->prefetch_buffer = InvalidBuffer;
+	}
return true;
}

Could these changes be committed independently of all the other changes?

--
Heikki Linnakangas
Neon (https://neon.tech)

Andres Freund

andres@anarazel.de

over 2 years ago

In reply to: Heikki Linnakangas (#2)

Re: Streaming I/O, vectored I/O (WIP)

Hi,

On 2023-09-27 21:33:15 +0300, Heikki Linnakangas wrote:

I'm a bit disappointed and surprised by
v1-0009-WIP-Use-streaming-reads-in-vacuum.patch though:

4 files changed, 244 insertions(+), 78 deletions(-)

The current prefetching logic in vacuumlazy.c is pretty hairy, so I hoped
that this would simplify it. I didn't look closely at that patch, so maybe
it's simpler even though it's more code.

A good chunk of the changes is pretty boring stuff. A good chunk of the
remainder could be simplified a lot - it's partially there because vacuumlazy
changed a lot over the last couple years and because a bit more refactoring is
needed. I do think it's actually simpler in some ways - besides being more
efficient...

v1-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patch

No smgrextendv()? I guess none of the patches here needed it.

I can't really imagine needing it anytime soon - due to the desire to avoid
ENOSPC for pages in the buffer pool the common pattern is to extend relations
with zeroes on disk, then populate those buffers in memory. It's possible that
you could use something like smgrextendv() when operating directly on the smgr
level - but then I suspect you're going to be better off to extend the
relation to the right size in one operation and then just use smgrwritev() to
write out the contents.

/*
* Prepare to read a block. The buffer is pinned. If this is a 'hit', then
* the returned buffer can be used immediately. Otherwise, a physical read
* should be completed with CompleteReadBuffers(). PrepareReadBuffer()
* followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the
* caller has the opportunity to coalesce reads of neighboring blocks into one
* CompleteReadBuffers() call.
*
* *found is set to true for a hit, and false for a miss.
*
* *allocated is set to true for a miss that allocates a buffer for the first
* time. If there are multiple calls to PrepareReadBuffer() for the same
* block before CompleteReadBuffers() or ReadBuffer_common() finishes the
* read, then only the first such call will receive *allocated == true, which
* the caller might use to issue just one prefetch hint.
*/
Buffer
PrepareReadBuffer(BufferManagerRelation bmr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *found,
bool *allocated)

If you decide you don't want to perform the read, after all, is there a way
to abort it without calling CompleteReadBuffers()?

When would that be needed?

Looking at the later patch that introduces the streaming read API, seems
that it finishes all the reads, so I suppose we don't need an abort
function. Does it all get cleaned up correctly on error?

I think it should. The buffer error handling is one of the areas where I
really would like to have some way of testing the various cases, it's easy to
get things wrong, and basically impossible to write reliable tests for with
our current infrastructure.

typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
uintptr_t pgsr_private,
void *per_io_private,
BufferManagerRelation *bmr,
ForkNumber *forkNum,
BlockNumber *blockNum,
ReadBufferMode *mode);

I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
each read. I see that you used that in the WAL replay prefetching, so I
guess that makes sense.

Yea, that's the origin - I don't like it, but I don't really have a better
idea.

extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_io_private);
extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
extern void pg_streaming_read_free(PgStreamingRead *pgsr);

Do we need to expose pg_streaming_read_prefetch()? It's only used in the WAL
replay prefetching patch, and only after calling pg_streaming_read_reset().
Could pg_streaming_read_reset() call pg_streaming_read_prefetch() directly?
Is there any need to "reset" the stream, without also starting prefetching?

Heh, I think this is a discussion Thomas I were having before...

Greetings,

Andres Freund

Thomas Munro

thomas.munro@gmail.com

over 2 years ago

In reply to: Andres Freund (#3)

Re: Streaming I/O, vectored I/O (WIP)

On Thu, Sep 28, 2023 at 9:13 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-09-27 21:33:15 +0300, Heikki Linnakangas wrote:

Looking at the later patch that introduces the streaming read API, seems
that it finishes all the reads, so I suppose we don't need an abort
function. Does it all get cleaned up correctly on error?

I think it should. The buffer error handling is one of the areas where I
really would like to have some way of testing the various cases, it's easy to
get things wrong, and basically impossible to write reliable tests for with
our current infrastructure.

One thing to highlight is that this patch doesn't create a new state
in that case. In master, we already have the concept of a buffer with
BM_TAG_VALID but not BM_VALID and not BM_IO_IN_PROGRESS, reachable if
there is an I/O error. Eventually another reader will try the I/O
again, or the buffer will fall out of the pool. With this patch it's
the same, it's just a wider window: more kinds of errors might be
thrown in code between Prepare() and Complete() before we even have
BM_IO_IN_PROGRESS. So there is nothing extra to clean up. Right?

Yeah, it would be nice to test buffer pool logic directly. Perhaps
with a C unit test framework[1]/messages/by-id/CA+hUKG+ajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN+icg@mail.gmail.com and pluggable smgr[2]/messages/by-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=PCNWEMrA@mail.gmail.com we could mock up
cases like I/O errors...

typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
uintptr_t pgsr_private,
void *per_io_private,
BufferManagerRelation *bmr,
ForkNumber *forkNum,
BlockNumber *blockNum,
ReadBufferMode *mode);

I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
each read. I see that you used that in the WAL replay prefetching, so I
guess that makes sense.

Yea, that's the origin - I don't like it, but I don't really have a better
idea.

Another idea I considered was that streams could be associated with a
single relation, but recovery could somehow manage a set of them.
From a certain point of view, that makes sense (we could be redoing
work that was created by multiple concurrent streams at 'do' time, and
with the approach shown here some clustering opportunities available
at do time are lost at redo time), but it's not at all clear that it's
worth the overheads or complexity, and I couldn't immediately figure
out how to do it. But I doubt there would ever be any other users of
a single stream with multiple relations, and I agree that this is
somehow not quite satisfying... Perhaps we should think about that
some more...

[1]: /messages/by-id/CA+hUKG+ajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN+icg@mail.gmail.com
[2]: /messages/by-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK=PCNWEMrA@mail.gmail.com

Thomas Munro

thomas.munro@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#2)

8 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

+     /* Avoid a slightly more expensive kernel call if there is no benefit. */
+     if (iovcnt == 1)
+             returnCode = pg_pread(vfdP->fd,
+                                                       iov[0].iov_base,
+                                                       iov[0].iov_len,
+                                                       offset);
+     else
+             returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
How about pushing down this optimization to pg_preadv() itself?
pg_readv() is currently just a macro if the system provides preadv(),
but it could be a "static inline" that does the above dance. I think
that optimization is platform-dependent anyway, pread() might not be any
faster on some OSs. In particular, if the system doesn't provide
preadv() and we use the implementation in src/port/preadv.c, it's the
same kernel call anyway.

Done. I like it, I just feel a bit bad about moving the p*v()
replacement functions around a couple of times already! I figured it
might as well be static inline even if we use the fallback (= Solaris
and Windows).

I don't see anything in the callers (mdreadv() and mdwritev()) to
prevent them from passing nblocks > PG_IOV_MAX.

The outer loop in md*v() copes with segment boundaries and also makes
sure lengthof(iov) AKA PG_IOV_MAX isn't exceeded (though that couldn't
happen with the current caller).

in streaming_read.h:

typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
uintptr_t pgsr_private,
void *per_io_private,
BufferManagerRelation *bmr,
ForkNumber *forkNum,
BlockNumber *blockNum,
ReadBufferMode *mode);

I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
each read. I see that you used that in the WAL replay prefetching, so I
guess that makes sense.

In this version I have introduced an alternative simple callback.
It's approximately what we had already tried out in an earlier version
before I started streamifying recovery, but in this version you can
choose, so recovery can opt for the wider callback.

I've added some ramp-up logic. The idea is that after we streamify
everything in sight, we don't want to penalise users that don't really
need more than one or two blocks, but don't know that yet. Here is
how the system calls look when you do pg_prewarm():

pread64(32, ..., 8192, 0) = 8192 <--- start with just one block
pread64(32, ..., 16384, 8192) = 16384
pread64(32, ..., 32768, 24576) = 32768
pread64(32, ..., 65536, 57344) = 65536
pread64(32, ..., 131072, 122880) = 131072 <--- soon reading 16
blocks at a time
pread64(32, ..., 131072, 253952) = 131072
pread64(32, ..., 131072, 385024) = 131072

I guess it could be done in quite a few different ways and I'm open to
better ideas. This way inserts prefetching stalls but ramps up
quickly and is soon out of the way. I wonder if we would want to make
that a policy that a caller can disable, if you want to skip the
ramp-up and go straight for the largest possible I/O size? Then I
think we'd need a 'flags' argument to the streaming read constructor
functions.

A small detour: While contemplating how this interacts with parallel
sequential scan, which also has a notion of ramping up, I noticed
another problem. One parallel seq scan process does this:

fadvise64(32, 35127296, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 35127296) = 131072
preadv(32, [...], 2, 35258368) = 131072
fadvise64(32, 36175872, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 36175872) = 131072
preadv(32, [...], 2, 36306944) = 131072
...

We don't really want those fadvise() calls. We don't get them with
parallelism disabled, because streaming_read.c is careful not to
generate advice for sequential workloads based on ancient wisdom from
this mailing list, re-confirmed on recent Linux: WILLNEED hints
actually get in the way of Linux's own prefetching and slow you down,
so we only want them for truly random access. But the logic can't see
that another process is making holes in this process's sequence. The
two obvious solutions are (1) pass in a flag at the start saying "I
promise this is sequential even if it doesn't look like it, no hints
please" and (2) invent "shared" (cross-process) streaming reads, and
teach all the parallel seq scan processes to get their buffers from
there.

Idea (2) is interesting to think about but even if it is a useful idea
(not sure) it is certainly overkill just to solve this little problem
for now. So perhaps I should implement (1), which would be another
reason to add a flags argument. It's not a perfect solution though
because some more 'data driven' parallel scans (indexes, bitmaps, ...)
have a similar problem that is less amenable to top-down kludgery.

I've included just the pg_prewarm example user for now while we
discuss the basic infrastructure. The rest are rebased and in my
public Github branch streaming-read (repo macdice/postgres) if anyone
is interested (don't mind the red CI failures, they're just saying I
ran out of monthly CI credits on the 29th, so close...)

Attachments:

v2-0004-Provide-vectored-variant-of-ReadBuffer.patchapplication/octet-stream; name=v2-0004-Provide-vectored-variant-of-ReadBuffer.patchDownload

From db5de8ab5a1a804f41006239302fdce954cab331 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v2 4/8] Provide vectored variant of ReadBuffer().

PrepareReadBuffer() followed by CompleteReadBuffers() or ZeroBuffer() is
equivalent to ReadBuffer(), except that the physical read for multiple
adjacent blocks can be merged into one vectored system call.

ReadBuffer() is now implemented in terms of these functions.

Author: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 540 +++++++++++++++++---------
 src/backend/storage/buffer/localbuf.c |   4 +-
 src/include/storage/buf_internals.h   |   3 +-
 src/include/storage/bufmgr.h          |  20 +
 4 files changed, 384 insertions(+), 183 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f7c67d504c..8ae3a72053 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -472,7 +472,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
 								ReadBufferMode mode, BufferAccessStrategy strategy,
 								bool *hit);
@@ -512,7 +512,8 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
 							   ForkNumber forkNum,
 							   BlockNumber blockNum,
 							   BufferAccessStrategy strategy,
-							   bool *foundPtr, IOContext io_context);
+							   bool *foundPtr, bool *allocPtr,
+							   IOContext io_context);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
@@ -800,7 +801,7 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 	 * miss.
 	 */
 	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
+	buf = ReadBuffer_common(BMR_REL(reln),
 							forkNum, blockNum, mode, strategy, &hit);
 	if (hit)
 		pgstat_count_buffer_hit(reln);
@@ -827,8 +828,9 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 
 	SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -1002,7 +1004,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 		bool		hit;
 
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
+		buffer = ReadBuffer_common(bmr,
 								   fork, extend_to - 1, mode, strategy,
 								   &hit);
 	}
@@ -1016,18 +1018,18 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
+	Buffer		buffer;
+	bool		allocated;
 
-	*hit = false;
+	Assert(mode == RBM_ZERO_AND_LOCK ||
+		   mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+		   mode == RBM_NORMAL ||
+		   mode == RBM_NORMAL_NO_LOG ||
+		   mode == RBM_ZERO_ON_ERROR);
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1046,175 +1048,326 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		*hit = false;
+
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	buffer = PrepareReadBuffer(bmr,
+							   forkNum,
+							   blockNum,
+							   strategy,
+							   hit,
+							   &allocated);
+
+	/* At this point we do NOT hold any locks. */
+
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+	{
+		/* if we just want zeroes and a lock, we're done */
+		ZeroBuffer(buffer, mode);
+	}
+	else if (!*hit)
+	{
+		/* we might need to perform I/O */
+		CompleteReadBuffers(bmr,
+							&buffer,
+							forkNum,
+							blockNum,
+							1,
+							mode == RBM_ZERO_ON_ERROR,
+							strategy);
+	}
+
+	return buffer;
+}
+
+/*
+ * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
+ * the returned buffer can be used immediately.  Otherwise, a physical read
+ * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
+ * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the
+ * caller has the opportunity to coalesce reads of neighboring blocks into one
+ * CompleteReadBuffers() call.
+ *
+ * *found is set to true for a hit, and false for a miss.
+ *
+ * *allocated is set to true for a miss that allocates a buffer for the first
+ * time.  If there are multiple calls to PrepareReadBuffer() for the same
+ * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
+ * read, then only the first such call will receive *allocated == true, which
+ * the caller might use to issue just one prefetch hint.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *found,
+				  bool *allocated)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
+	Assert(blockNum != P_NEW);
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, found, allocated);
+		if (*found)
+			pgBufferUsage.local_blks_hit++;
+		else
+			pgBufferUsage.local_blks_read++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, found, allocated, io_context);
+		if (*found)
+			pgBufferUsage.shared_blks_hit++;
+		else
+			pgBufferUsage.shared_blks_read++;
+	}
+	if (bmr.rel)
+	{
+		pgstat_count_buffer_read(bmr.rel);
+		if (*found)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*found)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+	return BufferDescriptorGetBuffer(bufHdr);
+}
+
+static inline bool
+CompleteReadBuffersCanStartIO(Buffer buffer)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
 
-		return BufferDescriptorGetBuffer(bufHdr);
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true);
+}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+/*
+ * Complete a set reads prepared with PrepareReadBuffers().  The buffers must
+ * cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+					Buffer *buffers,
+					ForkNumber forknum,
+					BlockNumber blocknum,
+					int nblocks,
+					bool zero_on_error,
+					BufferAccessStrategy strategy)
+{
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
 
-	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
-	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
 	else
 	{
-		instr_time	io_start = pgstat_prepare_io_time();
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	for (int i = 0; i < nblocks; ++i)
+	{
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
+
+#ifdef USE_ASSERT_CHECKING
+
+		/*
+		 * We could get all the information from buffer headers, but it can be
+		 * expensive to access buffer header cache lines so we make the caller
+		 * provide all the information we need, and assert that it is
+		 * consistent.
+		 */
+		{
+			RelFileLocator xlocator;
+			ForkNumber	xforknum;
+			BlockNumber xblocknum;
+
+			BufferGetTag(buffers[i], &xlocator, &xforknum, &xblocknum);
+			Assert(RelFileLocatorEquals(bmr.smgr->smgr_rlocator.locator, xlocator));
+			Assert(xforknum == forknum);
+			Assert(xblocknum == blocknum + i);
+		}
+#endif
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		/* Skip this block if someone else has already completed it. */
+		if (!CompleteReadBuffersCanStartIO(buffers[i]))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 *
+			 * XXX We may want to redesign the tracing information to report
+			 * this, if there is still interest in these static tracing points
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		/* We found a buffer that we have to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?
+		 */
+		while ((i + 1) < nblocks &&
+			   CompleteReadBuffersCanStartIO(buffers[i + 1]))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time();
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start, 1);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if (zero_on_error || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
@@ -1228,11 +1381,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.  A read should be
+ * performed with CompleteReadBuffers().
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1244,7 +1394,9 @@ static BufferDesc *
 BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			BlockNumber blockNum,
 			BufferAccessStrategy strategy,
-			bool *foundPtr, IOContext io_context)
+			bool *foundPtr,
+			bool *allocatedPtr,
+			IOContext io_context)
 {
 	BufferTag	newTag;			/* identity of requested block */
 	uint32		newHash;		/* hash value for newTag */
@@ -1286,24 +1438,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		LWLockRelease(newPartitionLock);
 
 		*foundPtr = true;
+		*allocatedPtr = false;
 
 		if (!valid)
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1335,6 +1479,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		BufferDesc *existing_buf_hdr;
 		bool		valid;
 
+		*allocatedPtr = false;
+
 		/*
 		 * Got a collision. Someone has already done what we were about to do.
 		 * We'll just handle this as if it were found in the buffer pool in
@@ -1368,24 +1514,18 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
 	}
 
+	/* Report that we had to allocate a buffer. */
+	*allocatedPtr = true;
+
 	/*
 	 * Need to lock the buffer header too in order to change its tag.
 	 */
@@ -1412,15 +1552,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -2381,7 +2515,11 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was used.  We'll check by
+		 * loading the flags without locking.  This is racy, but it's OK to
+		 * return false spuriously: when ReadBuffer() or CompleteReadBuffers()
+		 * call StartBufferIO(), they'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2390,7 +2528,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -4845,6 +4983,46 @@ ConditionalLockBuffer(Buffer buffer)
 									LW_EXCLUSIVE);
 }
 
+/*
+ * Zero a buffer, and lock it as RBM_ZERO_AND_LOCK or
+ * RBM_ZERO_AND_CLEANUP_LOCK would.  The buffer must be already pinned.  It
+ * does not have to be valid, but it is valid and locked on return.
+ */
+void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * Verify that this backend is pinning the buffer exactly once.
  *
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 4efb34b75a..ee9307b612 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -116,7 +116,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
-				 bool *foundPtr)
+				 bool *foundPtr, bool *allocPtr)
 {
 	BufferTag	newTag;			/* identity of requested block */
 	LocalBufferLookupEnt *hresult;
@@ -144,6 +144,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		Assert(BufferTagsEqual(&bufHdr->tag, &newTag));
 
 		*foundPtr = PinLocalBuffer(bufHdr, true);
+		*allocPtr = false;
 	}
 	else
 	{
@@ -170,6 +171,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
 		*foundPtr = false;
+		*allocPtr = true;
 	}
 
 	return bufHdr;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 2c4fd92e39..58f6fb0bec 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -450,7 +450,8 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
 												ForkNumber forkNum,
 												BlockNumber blockNum);
 extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
-									BlockNumber blockNum, bool *foundPtr);
+									BlockNumber blockNum, bool *foundPtr,
+									bool *allocPtr);
 extern BlockNumber ExtendBufferedRelLocal(BufferManagerRelation bmr,
 										  ForkNumber fork,
 										  uint32 flags,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 41e26d3e20..e29ca85077 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,8 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "pgstat.h"
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -47,6 +49,8 @@ typedef enum
 	RBM_ZERO_AND_CLEANUP_LOCK,	/* Like RBM_ZERO_AND_LOCK, but locks the page
 								 * in "cleanup" mode */
 	RBM_ZERO_ON_ERROR,			/* Read, but return an all-zeros page on error */
+	RBM_WILL_ZERO,				/* Don't read from disk, caller will call
+								 * ZeroBuffer() */
 	RBM_NORMAL_NO_LOG,			/* Don't log page as invalid during WAL
 								 * replay; otherwise same as RBM_NORMAL */
 } ReadBufferMode;
@@ -158,6 +162,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/* Maximum number of buffers for multi-buffer I/O functions. */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,19 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+extern Buffer PrepareReadBuffer(BufferManagerRelation bmr,
+								ForkNumber forkNum,
+								BlockNumber blockNum,
+								BufferAccessStrategy strategy,
+								bool *found,
+								bool *allocated);
+extern void CompleteReadBuffers(BufferManagerRelation bmr,
+								Buffer *buffers,
+								ForkNumber forknum,
+								BlockNumber blocknum,
+								int nblocks,
+								bool zero_on_error,
+								BufferAccessStrategy strategy);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -247,6 +266,7 @@ extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
-- 
2.42.1

v2-0005-Provide-multi-block-smgrprefetch.patchapplication/octet-stream; name=v2-0005-Provide-multi-block-smgrprefetch.patchDownload

From 3b1a8475289034b5a876f5d4b0a41e857e24c957 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 08:53:15 +1200
Subject: [PATCH v2 5/8] Provide multi-block smgrprefetch().

Previously smgrprefetch() would issue POSIX_FADV_WILLNEED advice for a
single block at a time.  Add an nblocks argument so that we can do the
same for a range of blocks.

Author: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   |  2 +-
 src/backend/storage/buffer/localbuf.c |  2 +-
 src/backend/storage/smgr/md.c         | 36 +++++++++++++++++++--------
 src/backend/storage/smgr/smgr.c       |  7 +++---
 src/include/storage/md.h              |  2 +-
 src/include/storage/smgr.h            |  2 +-
 6 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8ae3a72053..1d79907020 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -570,7 +570,7 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
 		 * recovery if the relation file doesn't exist.
 		 */
 		if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
-			smgrprefetch(smgr_reln, forkNum, blockNum))
+			smgrprefetch(smgr_reln, forkNum, blockNum, 1))
 		{
 			result.initiated_io = true;
 		}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index ee9307b612..00f001b0b9 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -94,7 +94,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 #ifdef USE_PREFETCH
 		/* Not in buffers, so initiate prefetch */
 		if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
-			smgrprefetch(smgr, forkNum, blockNum))
+			smgrprefetch(smgr, forkNum, blockNum, 1))
 		{
 			result.initiated_io = true;
 		}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index d1383220bd..7e226e1e0f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -710,27 +710,41 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- * mdprefetch() -- Initiate asynchronous read of the specified block of a relation
+ * mdprefetch() -- Initiate asynchronous read of the specified blocks of a relation
  */
 bool
-mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		   int nblocks)
 {
 #ifdef USE_PREFETCH
-	off_t		seekpos;
-	MdfdVec    *v;
 
 	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
 
-	v = _mdfd_getseg(reln, forknum, blocknum, false,
-					 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
-	if (v == NULL)
-		return false;
+	while (nblocks > 0)
+	{
+		off_t		seekpos;
+		MdfdVec    *v;
+		int			nblocks_this_segment;
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		v = _mdfd_getseg(reln, forknum, blocknum, false,
+						 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
+		if (v == NULL)
+			return false;
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+		nblocks_this_segment =
+			Min(nblocks,
+				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+		(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ * nblocks_this_segment,
+							WAIT_EVENT_DATA_FILE_PREFETCH);
+
+		blocknum += nblocks_this_segment;
+		nblocks -= nblocks_this_segment;
+	}
 #endif							/* USE_PREFETCH */
 
 	return true;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index b46ac3cb09..e594158f47 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -54,7 +54,7 @@ typedef struct f_smgr
 	void		(*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
 									BlockNumber blocknum, int nblocks, bool skipFsync);
 	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
-								  BlockNumber blocknum);
+								  BlockNumber blocknum, int nblocks);
 	void		(*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
 							   BlockNumber blocknum,
 							   void **buffers, BlockNumber nblocks);
@@ -548,9 +548,10 @@ smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * record).
  */
 bool
-smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+			 int nblocks)
 {
-	return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
+	return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum, nblocks);
 }
 
 /*
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 0f73d9af49..3dc651ec17 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -31,7 +31,7 @@ extern void mdextend(SMgrRelation reln, ForkNumber forknum,
 extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum, int nblocks, bool skipFsync);
 extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
-					   BlockNumber blocknum);
+					   BlockNumber blocknum, int nblocks);
 extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 					void **buffers, BlockNumber nblocks);
 extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index ae93910132..6e62dc053c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -95,7 +95,7 @@ extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
 extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
 						   BlockNumber blocknum, int nblocks, bool skipFsync);
 extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
-						 BlockNumber blocknum);
+						 BlockNumber blocknum, int nblocks);
 extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
 					  BlockNumber blocknum,
 					  void **buffer, BlockNumber nblocks);
-- 
2.42.1

v2-0006-Give-SMgrRelation-pointers-a-well-defined-lifetim.patchapplication/octet-stream; name=v2-0006-Give-SMgrRelation-pointers-a-well-defined-lifetim.patchDownload

From eb344e3c3d5320060e1fa68876ebebc7a5dca2e4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 10 Aug 2023 18:16:31 +1200
Subject: [PATCH v2 6/8] Give SMgrRelation pointers a well-defined lifetime.

After calling smgropen(), it was not clear how long you could continue
to use the result, because various code paths including cache
invalidation could call smgrclose(), which freed the memory.

Guarantee that the object won't be destroyed until the end of the
current transaction, or in recovery, the commit/abort record that
destroys the underlying storage.

The existing smgrclose() function now closes files and forgets all state
except the rlocator, like smgrrelease() used to do, and additionally
"disowns" the object so that it is disconnected from any Relation and
queued for destruction at AtEOXact_SMgr(), unless it is re-owned by a
Relation again before then.

A new smgrdestroy() function is used by rare places that know there
should be no other references to the SMgrRelation.

The short version:

 * smgrrelease() is removed
 * smgrclose() now releases resources, but doesn't destroy until EOX
 * smgrdestroy() now frees memory, and should rarely be used

Existing code should be unaffected, but it is now possible for code that
has an SMgrRelation object to use it repeatedly during a transaction as
long as the storage hasn't been physically dropped.  Such code would
normally hold a lock on the relation.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/access/transam/xlogutils.c |  2 +-
 src/backend/postmaster/bgwriter.c      |  8 +++--
 src/backend/postmaster/checkpointer.c  | 15 ++++----
 src/backend/storage/smgr/smgr.c        | 49 ++++++++++++++++----------
 src/include/storage/smgr.h             |  4 +--
 src/include/utils/rel.h                |  6 ----
 6 files changed, 46 insertions(+), 38 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 43f7b31205..bdc0f7e1da 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -657,7 +657,7 @@ XLogDropDatabase(Oid dbid)
 	 * This is unnecessarily heavy-handed, as it will close SMgrRelation
 	 * objects for other databases as well. DROP DATABASE occurs seldom enough
 	 * that it's not worth introducing a variant of smgrclose for just this
-	 * purpose. XXX: Or should we rather leave the smgr entries dangling?
+	 * purpose.
 	 */
 	smgrcloseall();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d02dc17b9c..2f0d0515f6 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -246,10 +246,12 @@ BackgroundWriterMain(void)
 		if (FirstCallSinceLastCheckpoint())
 		{
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the bgwriter does
+			 * not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 		}
 
 		/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 42c807d352..4e2289ab20 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -449,10 +449,12 @@ CheckpointerMain(void)
 				ckpt_performed = CreateRestartPoint(flags);
 
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the checkpointer
+			 * does not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 
 			/*
 			 * Indicate checkpoint completion to any waiting backends.
@@ -935,11 +937,8 @@ RequestCheckpoint(int flags)
 		 */
 		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
 
-		/*
-		 * After any checkpoint, close all smgr files.  This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
+		/* Free all smgr objects, as CheckpointerMain() normally would. */
+		smgrdestroyall();
 
 		return;
 	}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index e594158f47..013300a21b 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -147,7 +147,9 @@ smgrshutdown(int code, Datum arg)
 /*
  * smgropen() -- Return an SMgrRelation object, creating it if need be.
  *
- * This does not attempt to actually open the underlying file.
+ * This does not attempt to actually open the underlying files.  The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transaction backends.
  */
 SMgrRelation
 smgropen(RelFileLocator rlocator, BackendId backend)
@@ -257,10 +259,10 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
  */
 void
-smgrclose(SMgrRelation reln)
+smgrdestroy(SMgrRelation reln)
 {
 	SMgrRelation *owner;
 	ForkNumber	forknum;
@@ -287,12 +289,14 @@ smgrclose(SMgrRelation reln)
 }
 
 /*
- * smgrrelease() -- Release all resources used by this object.
+ * smgrclose() -- Release all resources used by this object.
  *
- * The object remains valid.
+ * The object remains valid, but is moved to the unknown list where it will
+ * be destroyed by AtEOXact_SMgr().  It may be re-owned if it is accessed by a
+ * relation before then.
  */
 void
-smgrrelease(SMgrRelation reln)
+smgrclose(SMgrRelation reln)
 {
 	for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 	{
@@ -300,15 +304,20 @@ smgrrelease(SMgrRelation reln)
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	}
 	reln->smgr_targblock = InvalidBlockNumber;
+
+	if (reln->smgr_owner)
+	{
+		*reln->smgr_owner = NULL;
+		reln->smgr_owner = NULL;
+		dlist_push_tail(&unowned_relns, &reln->node);
+	}
 }
 
 /*
- * smgrreleaseall() -- Release resources used by all objects.
- *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * smgrcloseall() -- Close all objects.
  */
 void
-smgrreleaseall(void)
+smgrcloseall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -320,14 +329,17 @@ smgrreleaseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrrelease(reln);
+		smgrclose(reln);
 }
 
 /*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrdestroyall() -- Destroy all SMgrRelation objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those registered with smgrsetowner().
  */
 void
-smgrcloseall(void)
+smgrdestroyall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -339,7 +351,7 @@ smgrcloseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrclose(reln);
+		smgrdestroy(reln);
 }
 
 /*
@@ -731,7 +743,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
  * AtEOXact_SMgr
  *
  * This routine is called during transaction commit or abort (it doesn't
- * particularly care which).  All transient SMgrRelation objects are closed.
+ * particularly care which).  All transient SMgrRelation objects are
+ * destroyed.
  *
  * We do this as a compromise between wanting transient SMgrRelations to
  * live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -745,7 +758,7 @@ AtEOXact_SMgr(void)
 	dlist_mutable_iter iter;
 
 	/*
-	 * Zap all unowned SMgrRelations.  We rely on smgrclose() to remove each
+	 * Zap all unowned SMgrRelations.  We rely on smgrdestroy() to remove each
 	 * one from the list.
 	 */
 	dlist_foreach_modify(iter, &unowned_relns)
@@ -755,7 +768,7 @@ AtEOXact_SMgr(void)
 
 		Assert(rel->smgr_owner == NULL);
 
-		smgrclose(rel);
+		smgrdestroy(rel);
 	}
 }
 
@@ -766,6 +779,6 @@ AtEOXact_SMgr(void)
 bool
 ProcessBarrierSmgrRelease(void)
 {
-	smgrreleaseall();
+	smgrcloseall();
 	return true;
 }
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 6e62dc053c..2656f10f1c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,8 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
 extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
-extern void smgrrelease(SMgrRelation reln);
-extern void smgrreleaseall(void);
+extern void smgrdestroy(SMgrRelation reln);
+extern void smgrdestroyall(void);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 0ad613c4b8..c00318918d 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
  *
  * Very little code is authorized to touch rel->rd_smgr directly.  Instead
  * use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period.  Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation.  It's quite cheap in
- * comparison to whatever an smgr function is going to do.
  */
 static inline SMgrRelation
 RelationGetSmgr(Relation rel)
-- 
2.42.1

v2-0007-Provide-basic-streaming-read-API.patchapplication/octet-stream; name=v2-0007-Provide-basic-streaming-read-API.patchDownload

From 484cfc9659b7ec69a41279db26eb61e6f7ae9b35 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 28 Jul 2023 22:35:32 +1200
Subject: [PATCH v2 7/8] Provide basic streaming read API.

"Streaming reads" can be used as a more efficient replacement for
ReadBuffer() calls.

The client code supplies a callback that can say which block to read
next, and then consumes individual buffers one at a time.  This division
allows the PgStreamingRead object to build larger calls to
CompleteReadBuffers(), which in turn produce large preadv() calls
instead of traditional single-block pread() calls.  Unless the read is
purely sequential, posix_fadvise() calls are also issued.

This API is intended as a stepping stone, allowing for true asynchronous
implementation in later work.  Code that adapts to the streaming read
API would automatically benefit.  This work is based on lots of
discussions with Andres Freund on how to prepare that pathway.

Author: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 540 +++++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c      |   2 +-
 src/backend/storage/meson.build          |   1 +
 src/include/storage/bufmgr.h             |   2 +
 src/include/storage/streaming_read.h     |  53 +++
 src/tools/pgindent/typedefs.list         |   2 +
 9 files changed, 619 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..eec03f6f2b 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 0000000000..bcab44c802
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 0000000000..156e87cab7
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 0000000000..de538af7f0
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,540 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of clusters of buffers.
+ *
+ * For hits and RBM_WILL_ZERO, need_to_complete is false, we have just one
+ * buffer in each cluster, already pinned and ready for use.
+ *
+ * For misses that require a physical read, need_to_complete is true, and
+ * buffers[] holds a group of of neighboring blocks, so we can complete them
+ * with a single call to CompleteReadBuffers().  We can also issue a single
+ * prefetch for it as soon as it has grown to its largest possible size, if
+ * our random access heuristics determine that is a good idea.
+ */
+typedef struct PgStreamingReadCluster
+{
+	bool		advice_issued;
+	bool		need_complete;
+
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int			nblocks;
+
+	int			per_io_data_index[MAX_BUFFERS_PER_TRANSFER];
+	bool		need_advice[MAX_BUFFERS_PER_TRANSFER];
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+} PgStreamingReadCluster;
+
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			ios_in_progress_trigger;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	int			ramp_up_pin_limit;
+	int			ramp_up_pin_stall;
+	bool		finished;
+	uintptr_t	pgsr_private;
+	PgStreamingReadBufferCB simple_cb;
+	PgStreamingReadBufferExtendedCB extended_cb;
+	BufferAccessStrategy strategy;
+
+	/* If using a simple callback, relation and fork are fixed. */
+	BufferManagerRelation simple_bmr;
+	ForkNumber	simple_forknum;
+
+	/* Next expected prefetch, for sequential prefetch avoidance. */
+	BufferManagerRelation seq_bmr;
+	ForkNumber	seq_forknum;
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-I/O private data. */
+	size_t		per_io_data_size;
+	void	   *per_io_data;
+	int			per_io_data_next;
+
+	/* Circular buffer of clusters. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadCluster clusters[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int max_ios,
+										size_t per_io_data_size,
+										uintptr_t pgsr_private,
+										BufferAccessStrategy strategy)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_pinned_buffers;
+
+	Assert(max_ios > 0);
+
+	/*
+	 * We allow twice as many buffers to be pinned as I/Os.  This allows us to
+	 * look further ahead for blocks that need to be read in.
+	 */
+	max_pinned_buffers = max_ios * 2;
+
+	/* Don't allow this backend to pin too many buffers. */
+	LimitAdditionalPins((uint32 *) &max_pinned_buffers);
+	max_pinned_buffers = Max(2, max_pinned_buffers);
+	max_ios = max_pinned_buffers / 2;
+	Assert(max_ios > 0);
+	Assert(max_pinned_buffers > 0);
+	Assert(max_pinned_buffers > max_ios);
+
+	/*
+	 * pgsr->clusters is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0).  So we need two extra elements.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, clusters) +
+				sizeof(pgsr->clusters[0]) * size);
+
+	pgsr->per_io_data_size = per_io_data_size;
+	pgsr->max_ios = max_ios;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+	/*
+	 * We don't try to reach max_pinned_buffers immediately, we start of with
+	 * a limit of 1 and increase that quickly as a way of ramping up.  This is
+	 * intended to avoid generating I/O for callers that give up after a small
+	 * number of pages.
+	 *
+	 * XXX Should this be a policy choice for the caller to specify?
+	 */
+	pgsr->ramp_up_pin_limit = 1;
+
+	/*
+	 * We look ahead when the number of pinned buffers falls below this
+	 * number.  This encourages the formation of large vectored reads.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(max_ios, max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space the callback to store extra data along with each block. */
+	if (per_io_data_size)
+		pgsr->per_io_data = palloc(per_io_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int max_ios,
+							   size_t per_io_data_size,
+							   uintptr_t pgsr_private,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *result;
+
+	result = pg_streaming_read_buffer_alloc_internal(max_ios,
+													 per_io_data_size,
+													 pgsr_private,
+													 strategy);
+	result->simple_cb = next_block_cb;
+	result->simple_bmr = bmr;
+	result->simple_forknum = forknum;
+
+	return result;
+}
+
+
+PgStreamingRead *
+pg_streaming_read_buffer_alloc_ext(int max_ios,
+								   size_t per_io_data_size,
+								   uintptr_t pgsr_private,
+								   BufferAccessStrategy strategy,
+								   PgStreamingReadBufferExtendedCB next_block_cb)
+{
+	PgStreamingRead *result;
+
+	result = pg_streaming_read_buffer_alloc_internal(max_ios,
+													 per_io_data_size,
+													 pgsr_private,
+													 strategy);
+	result->extended_cb = next_block_cb;
+
+	return result;
+}
+
+/*
+ * Issue WILLNEED advice for the head cluster, and allocate a new head
+ * cluster.
+ *
+ * We don't have true asynchronous I/O to actually submit, but this is
+ * equivalent because it might start I/O on systems that understand WILLNEED
+ * advice.  We count it as an I/O in progress.
+ */
+static PgStreamingReadCluster *
+pg_streaming_read_submit(PgStreamingRead *pgsr)
+{
+	PgStreamingReadCluster *head_cluster;
+
+	head_cluster = &pgsr->clusters[pgsr->head];
+	Assert(head_cluster->nblocks > 0);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * Don't bother with advice if there will be no call to
+	 * CompleteReadBuffers() or direct I/O is enabled.
+	 */
+	if (head_cluster->need_complete &&
+		(io_direct_flags & IO_DIRECT_DATA) == 0)
+	{
+		/*
+		 * Purely sequential advice is known to hurt performance on some
+		 * systems, so only issue it if this looks random.
+		 */
+		if (head_cluster->bmr.smgr != pgsr->seq_bmr.smgr ||
+			head_cluster->bmr.rel != pgsr->seq_bmr.rel ||
+			head_cluster->forknum != pgsr->seq_forknum ||
+			head_cluster->blocknum != pgsr->seq_blocknum)
+		{
+			SMgrRelation smgr =
+				head_cluster->bmr.smgr ? head_cluster->bmr.smgr
+				: RelationGetSmgr(head_cluster->bmr.rel);
+
+			Assert(!head_cluster->advice_issued);
+
+			for (int i = 0; i < head_cluster->nblocks; i++)
+			{
+				if (head_cluster->need_advice[i])
+				{
+					BlockNumber first_blocknum = head_cluster->blocknum + i;
+					int			nblocks = 1;
+
+					/*
+					 * How many adjacent blocks can we merge with to reduce
+					 * system calls?  Usually this is all of them, unless
+					 * there are overlapping reads and our timing is unlucky.
+					 */
+					while ((i + 1) < head_cluster->nblocks &&
+						   head_cluster->need_advice[i + 1])
+					{
+						nblocks++;
+						i++;
+					}
+
+					smgrprefetch(smgr,
+								 head_cluster->forknum,
+								 first_blocknum,
+								 nblocks);
+				}
+
+			}
+
+			/*
+			 * Count this as an I/O that is concurrently in progress.  We
+			 * might have called smgrprefetch() more than once, if some of the
+			 * buffers in the range were already in buffer pool but not valid
+			 * yet, because of a concurrent read, but for now we choose to
+			 * track this as one I/O.
+			 */
+			head_cluster->advice_issued = true;
+			pgsr->ios_in_progress++;
+		}
+
+		/* Remember the point after this, for the above heuristics. */
+		pgsr->seq_bmr = head_cluster->bmr;
+		pgsr->seq_forknum = head_cluster->forknum;
+		pgsr->seq_blocknum = head_cluster->blocknum + head_cluster->nblocks;
+	}
+#endif
+
+	/* Create a new head cluster.  There must be space. */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	head_cluster = &pgsr->clusters[pgsr->head];
+	head_cluster->nblocks = 0;
+
+	return head_cluster;
+}
+
+void
+pg_streaming_read_prefetch(PgStreamingRead *pgsr)
+{
+	/*
+	 * If we're still ramping up, we may have to stall to wait for buffers to
+	 * be consumed first before we do any more prefetching.
+	 */
+	if (pgsr->ramp_up_pin_stall > 0)
+	{
+		Assert(pgsr->pinned_buffers > 0);
+		return;
+	}
+
+	/* If we're finished or can't start one more I/O, then no prefetching. */
+	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a full cluster.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BufferManagerRelation bmr;
+		ForkNumber	forknum;
+		BlockNumber blocknum;
+		ReadBufferMode mode;
+		Buffer		buffer;
+		bool		found;
+		bool		allocated;
+		bool		need_complete;
+		PgStreamingReadCluster *head_cluster;
+		void	   *per_io_data;
+
+		/* Do we have a full-sized cluster? */
+		head_cluster = &pgsr->clusters[pgsr->head];
+		if (head_cluster->nblocks == lengthof(head_cluster->buffers))
+		{
+			Assert(head_cluster->need_complete);
+			head_cluster = pg_streaming_read_submit(pgsr);
+
+			/*
+			 * Give up now if I/O is saturated, or we wouldn't be able form
+			 * another full cluster after this due to the pin limit.  Once
+			 * ramp_up_limit is not limiting us, we want to be able to create
+			 * full sized clusters as much as possible.
+			 */
+			if (pgsr->ios_in_progress == pgsr->max_ios ||
+				pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+				break;
+		}
+
+		per_io_data = (char *) pgsr->per_io_data +
+			pgsr->per_io_data_size * pgsr->per_io_data_next;
+
+		/* Find out which block the callback wants to read next. */
+		if (pgsr->simple_cb)
+		{
+			/* Simple callbacks use fixed relation, fork and normal mode. */
+			blocknum = pgsr->simple_cb(pgsr, pgsr->pgsr_private, per_io_data);
+			if (blocknum == InvalidBlockNumber)
+			{
+				pgsr->finished = true;
+				break;
+			}
+			bmr = pgsr->simple_bmr;
+			forknum = pgsr->simple_forknum;
+			mode = RBM_NORMAL;
+		}
+		else
+		{
+			/* Extended callbacks choose the relation, fork, mode. */
+			if (!pgsr->extended_cb(pgsr, pgsr->pgsr_private, per_io_data,
+								   &bmr, &forknum, &blocknum, &mode))
+			{
+				pgsr->finished = true;
+				break;
+			}
+		}
+
+		Assert(mode == RBM_NORMAL || mode == RBM_WILL_ZERO);
+		Assert(pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+		buffer = PrepareReadBuffer(bmr,
+								   forknum,
+								   blocknum,
+								   pgsr->strategy,
+								   &found,
+								   &allocated);
+		pgsr->pinned_buffers++;
+
+		need_complete = !found && mode != RBM_WILL_ZERO;
+
+		/* Is there a head cluster that we can't extend? */
+		head_cluster = &pgsr->clusters[pgsr->head];
+		if (head_cluster->nblocks > 0 &&
+			(!need_complete ||
+			 !head_cluster->need_complete ||
+			 head_cluster->bmr.smgr != bmr.smgr ||
+			 head_cluster->bmr.rel != bmr.rel ||
+			 head_cluster->forknum != forknum ||
+			 head_cluster->blocknum + head_cluster->nblocks != blocknum))
+		{
+			/* Submit it so we can start a new one. */
+			head_cluster = pg_streaming_read_submit(pgsr);
+			Assert(head_cluster->nblocks == 0);
+		}
+
+		if (head_cluster->nblocks == 0)
+		{
+			/* Initialize the cluster. */
+			head_cluster->bmr = bmr;
+			head_cluster->forknum = forknum;
+			head_cluster->blocknum = blocknum;
+			head_cluster->advice_issued = false;
+			head_cluster->need_complete = need_complete;
+		}
+		else
+		{
+			/* We'll extend an existing cluster by one buffer. */
+			Assert(head_cluster->bmr.smgr == bmr.smgr);
+			Assert(head_cluster->bmr.rel == bmr.rel);
+			Assert(head_cluster->forknum == forknum);
+			Assert(head_cluster->blocknum + head_cluster->nblocks == blocknum);
+			Assert(head_cluster->need_complete);
+		}
+
+		head_cluster->per_io_data_index[head_cluster->nblocks] = pgsr->per_io_data_next++;
+		head_cluster->need_advice[head_cluster->nblocks] = allocated;
+		head_cluster->buffers[head_cluster->nblocks] = buffer;
+		head_cluster->nblocks++;
+
+		if (pgsr->per_io_data_next == pgsr->max_pinned_buffers)
+			pgsr->per_io_data_next = 0;
+
+	} while (pgsr->ios_in_progress < pgsr->max_ios &&
+			 pgsr->pinned_buffers < pgsr->max_pinned_buffers &&
+			 pgsr->pinned_buffers < pgsr->ramp_up_pin_limit);
+
+	/* If we've hit the ramp-up limit, insert a stall. */
+	if (pgsr->pinned_buffers >= pgsr->ramp_up_pin_limit)
+	{
+		/* Can't get here if an earlier stall hasn't finished. */
+		Assert(pgsr->ramp_up_pin_stall == 0);
+		/* Don't do any more prefetching until these buffers are consumed. */
+		pgsr->ramp_up_pin_stall = pgsr->ramp_up_pin_limit;
+		/* Double it.  It will soon be out of the way. */
+		pgsr->ramp_up_pin_limit *= 2;
+	}
+
+	/* Submit the head cluster, if there is one. */
+	if (pgsr->clusters[pgsr->head].nblocks > 0)
+		pg_streaming_read_submit(pgsr);
+}
+
+void
+pg_streaming_read_reset(PgStreamingRead *pgsr)
+{
+	pgsr->finished = false;
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_io_data)
+{
+	pg_streaming_read_prefetch(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadCluster *tail_cluster;
+
+		tail_cluster = &pgsr->clusters[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * cluster?
+		 */
+		if (tail_cluster->need_complete)
+		{
+			CompleteReadBuffers(tail_cluster->bmr,
+								tail_cluster->buffers,
+								tail_cluster->forknum,
+								tail_cluster->blocknum,
+								tail_cluster->nblocks,
+								false,
+								pgsr->strategy);
+			tail_cluster->need_complete = false;
+
+			/* We only counted this I/O as running if we issued advice. */
+			if (tail_cluster->advice_issued)
+				pgsr->ios_in_progress--;
+		}
+
+		/* Are there more buffers available in this cluster? */
+		if (pgsr->next_tail_buffer < tail_cluster->nblocks)
+		{
+			int			buffer_index;
+			Buffer		buffer;
+
+			buffer_index = pgsr->next_tail_buffer++;
+			buffer = tail_cluster->buffers[buffer_index];
+
+			Assert(BufferIsValid(buffer));
+
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (pgsr->ramp_up_pin_stall > 0)
+				pgsr->ramp_up_pin_stall--;
+
+			if (per_io_data)
+				*per_io_data = (char *) pgsr->per_io_data +
+					tail_cluster->per_io_data_index[buffer_index] *
+					pgsr->per_io_data_size;
+
+			return buffer;
+		}
+
+		/* Advance tail to next cluster, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+	Assert(pgsr->ramp_up_pin_stall == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop reading ahead, and unpin anything that wasn't consumed. */
+	pgsr->finished = true;
+	for (;;)
+	{
+		buffer = pg_streaming_read_buffer_get_next(pgsr, NULL);
+		if (buffer == InvalidBuffer)
+			break;
+		ReleaseBuffer(buffer);
+	}
+
+	if (pgsr->per_io_data)
+		pfree(pgsr->per_io_data);
+	pfree(pgsr);
+}
+
+int
+pg_streaming_read_ios(PgStreamingRead *pgsr)
+{
+	return pgsr->ios_in_progress;
+}
+
+int
+pg_streaming_read_pins(PgStreamingRead *pgsr)
+{
+	return pgsr->pinned_buffers;
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1d79907020..179f90ab99 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1908,7 +1908,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 6ea9faa439..b196d9f885 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index e29ca85077..6e1e99a4db 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -270,6 +270,8 @@ extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 0000000000..dbf8c279d9
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,53 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/*
+ * For most sequential access, callers can user this size to build full sized
+ * reads without pinning many extra buffers.
+ */
+#define PG_STREAMING_READ_DEFAULT_MAX_IOS MAX_BUFFERS_PER_TRANSFER
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* "Simple" callback that returns a block number. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												uintptr_t pgsr_private,
+												void *per_io_private);
+
+/* Callback allowing the relation, fork, block and mode to vary. */
+typedef bool (*PgStreamingReadBufferExtendedCB) (PgStreamingRead *pgsr,
+												 uintptr_t pgsr_private,
+												 void *per_io_private,
+												 BufferManagerRelation *bmr,
+												 ForkNumber *forknum,
+												 BlockNumber *blockNum,
+												 ReadBufferMode *mode);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int max_ios,
+													   size_t per_io_private_size,
+													   uintptr_t pgsr_private,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc_ext(int max_ios,
+														   size_t per_io_private_size,
+														   uintptr_t pgsr_private,
+														   BufferAccessStrategy strategy,
+														   PgStreamingReadBufferExtendedCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_io_private);
+extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+extern int pg_streaming_read_ios(PgStreamingRead *pgsr);
+extern int pg_streaming_read_pins(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index dba3498a13..7d00eff1d5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2081,6 +2081,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadCluster
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
-- 
2.42.1

v2-0008-Use-streaming-reads-in-pg_prewarm.patchapplication/octet-stream; name=v2-0008-Use-streaming-reads-in-pg_prewarm.patchDownload

From 1d26ac1a225a191e79892c666435f7fb564e3217 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 09:28:42 +1200
Subject: [PATCH v2 8/8] Use streaming reads in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use streaming reads.  This
provides a simple example of such a transformation, and generates fewer
system calls.

Author: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 contrib/pg_prewarm/pg_prewarm.c | 41 ++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 01fc2c8ad9..b56d90bc2a 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,26 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+							   uintptr_t pgsr_private,
+							   void *per_io_data)
+{
+	struct pg_prewarm_streaming_read_private *p =
+		(struct pg_prewarm_streaming_read_private *) pgsr_private;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +204,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		PgStreamingRead *pgsr;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		pgsr = pg_streaming_read_buffer_alloc(PG_STREAMING_READ_DEFAULT_MAX_IOS,
+											  0,
+											  (uintptr_t) &p,
+											  NULL,
+											  BMR_REL(rel),
+											  forkNumber,
+											  pg_prewarm_streaming_read_next);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Close relation, release lock. */
-- 
2.42.1

v2-0001-Optimize-pg_readv-pg_pwritev-for-single-vector.patchapplication/octet-stream; name=v2-0001-Optimize-pg_readv-pg_pwritev-for-single-vector.patchDownload

From df0341fd3126fab496c73354dae0e52ba6acb395 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 26 Nov 2023 09:59:05 +1300
Subject: [PATCH v2 1/8] Optimize pg_readv/pg_pwritev for single vector.

For some systems that have preadv and pwritev (almost all Unixes), it is
known to be slightly faster to call plain old pread and pwrite if you
have only one vector, because the kernel avoids extra work following
pointers, copying data in, and possibly allocating space.  It is
debatable where we should do that, but since we already have to use pg_
wrappers for other reasons, that suggests the idea of doing it in there.
Avoid adding extra function call overheads by making it static inline,
which might also allow the compiler to avoid the branch in some cases
where it can see iovcnt's value.

For systems that don't have preadv and pwritev (currently: Windows and
[closed] Solaris), we might as well pull the replacement functions up
into the static inline functions too.

Suggested-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 configure                   | 23 -----------
 configure.ac                |  7 +---
 src/include/port/pg_iovec.h | 76 +++++++++++++++++++++++++++++++++----
 src/port/meson.build        |  2 -
 src/port/preadv.c           | 43 ---------------------
 src/port/pwritev.c          | 43 ---------------------
 src/tools/msvc/Mkvcbuild.pm |  2 +-
 7 files changed, 72 insertions(+), 124 deletions(-)
 delete mode 100644 src/port/preadv.c
 delete mode 100644 src/port/pwritev.c

diff --git a/configure b/configure
index c064115038..617da20553 100755
--- a/configure
+++ b/configure
@@ -16056,9 +16056,6 @@ cat >>confdefs.h <<_ACEOF
 #define HAVE_DECL_STRNLEN $ac_have_decl
 _ACEOF
 
-
-# We can't use AC_REPLACE_FUNCS to replace these functions, because it
-# won't handle deployment target restrictions on macOS
 ac_fn_c_check_decl "$LINENO" "preadv" "ac_cv_have_decl_preadv" "#include <sys/uio.h>
 "
 if test "x$ac_cv_have_decl_preadv" = xyes; then :
@@ -16070,16 +16067,6 @@ fi
 cat >>confdefs.h <<_ACEOF
 #define HAVE_DECL_PREADV $ac_have_decl
 _ACEOF
-if test $ac_have_decl = 1; then :
-
-else
-  case " $LIBOBJS " in
-  *" preadv.$ac_objext "* ) ;;
-  *) LIBOBJS="$LIBOBJS preadv.$ac_objext"
- ;;
-esac
-
-fi
 
 ac_fn_c_check_decl "$LINENO" "pwritev" "ac_cv_have_decl_pwritev" "#include <sys/uio.h>
 "
@@ -16092,16 +16079,6 @@ fi
 cat >>confdefs.h <<_ACEOF
 #define HAVE_DECL_PWRITEV $ac_have_decl
 _ACEOF
-if test $ac_have_decl = 1; then :
-
-else
-  case " $LIBOBJS " in
-  *" pwritev.$ac_objext "* ) ;;
-  *) LIBOBJS="$LIBOBJS pwritev.$ac_objext"
- ;;
-esac
-
-fi
 
 
 # This is probably only present on macOS, but may as well check always
diff --git a/configure.ac b/configure.ac
index f220b379b3..547bfdb514 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1815,11 +1815,8 @@ AC_CHECK_DECLS(posix_fadvise, [], [], [#include <fcntl.h>])
 
 AC_CHECK_DECLS(fdatasync, [], [], [#include <unistd.h>])
 AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
-
-# We can't use AC_REPLACE_FUNCS to replace these functions, because it
-# won't handle deployment target restrictions on macOS
-AC_CHECK_DECLS([preadv], [], [AC_LIBOBJ(preadv)], [#include <sys/uio.h>])
-AC_CHECK_DECLS([pwritev], [], [AC_LIBOBJ(pwritev)], [#include <sys/uio.h>])
+AC_CHECK_DECLS([preadv], [], [], [#include <sys/uio.h>])
+AC_CHECK_DECLS([pwritev], [], [], [#include <sys/uio.h>])
 
 # This is probably only present on macOS, but may as well check always
 AC_CHECK_DECLS(F_FULLFSYNC, [], [], [#include <fcntl.h>])
diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h
index 689799c425..0637d6acab 100644
--- a/src/include/port/pg_iovec.h
+++ b/src/include/port/pg_iovec.h
@@ -17,6 +17,7 @@
 
 #include <limits.h>
 #include <sys/uio.h>
+#include <unistd.h>
 
 #else
 
@@ -36,20 +37,81 @@ struct iovec
 #define PG_IOV_MAX Min(IOV_MAX, 32)
 
 /*
- * Note that pg_preadv and pg_pwritev have a pg_ prefix as a warning that the
- * Windows implementations have the side-effect of changing the file position.
+ * Like preadv(), but with a prefix to remind us of a side-effect: on Windows
+ * this changes the current file position.
  */
-
+static inline ssize_t
+pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+{
 #if HAVE_DECL_PREADV
-#define pg_preadv preadv
+	/*
+	 * Avoid a small amount of argument copying overhead in the kernel if
+	 * there is only one iovec.
+	 */
+	if (iovcnt == 1)
+		return pread(fd, iov[0].iov_base, iov[0].iov_len, offset);
+	else
+		return preadv(fd, iov, iovcnt, offset);
 #else
-extern ssize_t pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset);
+	ssize_t		sum = 0;
+	ssize_t		part;
+
+	for (int i = 0; i < iovcnt; ++i)
+	{
+		part = pg_pread(fd, iov[i].iov_base, iov[i].iov_len, offset);
+		if (part < 0)
+		{
+			if (i == 0)
+				return -1;
+			else
+				return sum;
+		}
+		sum += part;
+		offset += part;
+		if (part < iov[i].iov_len)
+			return sum;
+	}
+	return sum;
 #endif
+}
 
+/*
+ * Like pwritev(), but with a prefix to remind us of a side-effect: on Windows
+ * this changes the current file position.
+ */
+static inline ssize_t
+pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+{
 #if HAVE_DECL_PWRITEV
-#define pg_pwritev pwritev
+	/*
+	 * Avoid a small amount of argument copying overhead in the kernel if
+	 * there is only one iovec.
+	 */
+	if (iovcnt == 1)
+		return pwrite(fd, iov[0].iov_base, iov[0].iov_len, offset);
+	else
+		return pwritev(fd, iov, iovcnt, offset);
 #else
-extern ssize_t pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset);
+	ssize_t		sum = 0;
+	ssize_t		part;
+
+	for (int i = 0; i < iovcnt; ++i)
+	{
+		part = pg_pwrite(fd, iov[i].iov_base, iov[i].iov_len, offset);
+		if (part < 0)
+		{
+			if (i == 0)
+				return -1;
+			else
+				return sum;
+		}
+		sum += part;
+		offset += part;
+		if (part < iov[i].iov_len)
+			return sum;
+	}
+	return sum;
 #endif
+}
 
 #endif							/* PG_IOVEC_H */
diff --git a/src/port/meson.build b/src/port/meson.build
index a0d0a9583a..576a48b48c 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -66,8 +66,6 @@ replace_funcs_neg = [
   ['getpeereid'],
   ['inet_aton'],
   ['mkdtemp'],
-  ['preadv', 'HAVE_DECL_PREADV'],
-  ['pwritev', 'HAVE_DECL_PWRITEV'],
   ['strlcat'],
   ['strlcpy'],
   ['strnlen'],
diff --git a/src/port/preadv.c b/src/port/preadv.c
deleted file mode 100644
index e762283e67..0000000000
--- a/src/port/preadv.c
+++ /dev/null
@@ -1,43 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * preadv.c
- *	  Implementation of preadv(2) for platforms that lack one.
- *
- * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
- *
- * IDENTIFICATION
- *	  src/port/preadv.c
- *
- *-------------------------------------------------------------------------
- */
-
-
-#include "c.h"
-
-#include <unistd.h>
-
-#include "port/pg_iovec.h"
-
-ssize_t
-pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
-{
-	ssize_t		sum = 0;
-	ssize_t		part;
-
-	for (int i = 0; i < iovcnt; ++i)
-	{
-		part = pg_pread(fd, iov[i].iov_base, iov[i].iov_len, offset);
-		if (part < 0)
-		{
-			if (i == 0)
-				return -1;
-			else
-				return sum;
-		}
-		sum += part;
-		offset += part;
-		if (part < iov[i].iov_len)
-			return sum;
-	}
-	return sum;
-}
diff --git a/src/port/pwritev.c b/src/port/pwritev.c
deleted file mode 100644
index 519de45037..0000000000
--- a/src/port/pwritev.c
+++ /dev/null
@@ -1,43 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * pwritev.c
- *	  Implementation of pwritev(2) for platforms that lack one.
- *
- * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
- *
- * IDENTIFICATION
- *	  src/port/pwritev.c
- *
- *-------------------------------------------------------------------------
- */
-
-
-#include "c.h"
-
-#include <unistd.h>
-
-#include "port/pg_iovec.h"
-
-ssize_t
-pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset)
-{
-	ssize_t		sum = 0;
-	ssize_t		part;
-
-	for (int i = 0; i < iovcnt; ++i)
-	{
-		part = pg_pwrite(fd, iov[i].iov_base, iov[i].iov_len, offset);
-		if (part < 0)
-		{
-			if (i == 0)
-				return -1;
-			else
-				return sum;
-		}
-		sum += part;
-		offset += part;
-		if (part < iov[i].iov_len)
-			return sum;
-	}
-	return sum;
-}
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 84f648c174..d26d1a84e8 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -104,7 +104,7 @@ sub mkvcbuild
 	  inet_net_ntop.c kill.c open.c
 	  snprintf.c strlcat.c strlcpy.c dirmod.c noblock.c path.c
 	  dirent.c getopt.c getopt_long.c
-	  preadv.c pwritev.c pg_bitutils.c
+	  pg_bitutils.c
 	  pg_strong_random.c pgcheckdir.c pgmkdirp.c pgsleep.c pgstrcasecmp.c
 	  pqsignal.c mkdtemp.c qsort.c qsort_arg.c bsearch_arg.c quotes.c system.c
 	  strerror.c tar.c
-- 
2.42.1

v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri.patchapplication/octet-stream; name=v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri.patchDownload

From 2f5cbdacace9e6b58c5680f0c039175a5c74a898 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 19 Jul 2023 12:32:51 +1200
Subject: [PATCH v2 2/8] Provide vectored variants of FileRead() and
 FileWrite().

FileReadV() and FileWriteV() adapt preadv() and pwritev() for
PostgreSQL's virtual file descriptors.  The traditional FileRead() and
FileWrite() functions are implemented in terms of the new functions.

Author: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/file/fd.c | 25 ++++++++++++++++---------
 src/include/storage/fd.h      | 34 ++++++++++++++++++++++++++++++++--
 2 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f691ba0932..8fd7cb39c6 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2110,8 +2110,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 }
 
 int
-FileRead(File file, void *buffer, size_t amount, off_t offset,
-		 uint32 wait_event_info)
+FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+		  uint32 wait_event_info)
 {
 	int			returnCode;
 	Vfd		   *vfdP;
@@ -2131,7 +2131,7 @@ FileRead(File file, void *buffer, size_t amount, off_t offset,
 
 retry:
 	pgstat_report_wait_start(wait_event_info);
-	returnCode = pg_pread(vfdP->fd, buffer, amount, offset);
+	returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
 	pgstat_report_wait_end();
 
 	if (returnCode < 0)
@@ -2166,8 +2166,8 @@ retry:
 }
 
 int
-FileWrite(File file, const void *buffer, size_t amount, off_t offset,
-		  uint32 wait_event_info)
+FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+		   uint32 wait_event_info)
 {
 	int			returnCode;
 	Vfd		   *vfdP;
@@ -2195,7 +2195,14 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 	 */
 	if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
 	{
-		off_t		past_write = offset + amount;
+		size_t		size = 0;
+		off_t		past_write;
+
+		/* Compute the total transfer size. */
+		for (int i = 0; i < iovcnt; ++i)
+			size += iov[i].iov_len;
+
+		past_write = offset + size;
 
 		if (past_write > vfdP->fileSize)
 		{
@@ -2213,11 +2220,11 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 retry:
 	errno = 0;
 	pgstat_report_wait_start(wait_event_info);
-	returnCode = pg_pwrite(VfdCache[file].fd, buffer, amount, offset);
+	returnCode = pg_pwritev(vfdP->fd, iov, iovcnt, offset);
 	pgstat_report_wait_end();
 
 	/* if write didn't set errno, assume problem is no disk space */
-	if (returnCode != amount && errno == 0)
+	if (returnCode < 0 && errno == 0)
 		errno = ENOSPC;
 
 	if (returnCode >= 0)
@@ -2227,7 +2234,7 @@ retry:
 		 */
 		if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
 		{
-			off_t		past_write = offset + amount;
+			off_t		past_write = offset + returnCode;
 
 			if (past_write > vfdP->fileSize)
 			{
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index d9d5d9da5f..1e01d28c0d 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -43,6 +43,8 @@
 #ifndef FD_H
 #define FD_H
 
+#include "port/pg_iovec.h"
+
 #include <dirent.h>
 #include <fcntl.h>
 
@@ -105,8 +107,8 @@ extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fil
 extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
 extern int	FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
-extern int	FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
-extern int	FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
+extern int	FileReadV(File file, const struct iovec *ioc, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int	FileWriteV(File file, const struct iovec *ioc, int iovcnt, off_t offset, uint32 wait_event_info);
 extern int	FileSync(File file, uint32 wait_event_info);
 extern int	FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
 extern int	FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
@@ -189,4 +191,32 @@ extern int	durable_unlink(const char *fname, int elevel);
 extern void SyncDataDirectory(void);
 extern int	data_sync_elevel(int elevel);
 
+static inline int
+FileRead(File file, void *buffer, size_t amount, off_t offset,
+		 uint32 wait_event_info)
+{
+	struct iovec iov = {
+		.iov_base = buffer,
+		.iov_len = amount
+	};
+
+	return FileReadV(file, &iov, 1, offset, wait_event_info);
+}
+
+static inline int
+FileWrite(File file, const void *buffer, size_t amount, off_t offset,
+		  uint32 wait_event_info)
+{
+	struct iovec iov = {
+		.iov_base = unconstify(void *, buffer),
+		.iov_len = amount
+	};
+
+	return FileWriteV(file, &iov, 1, offset, wait_event_info);
+}
+
+/* Filename components */
+#define PG_TEMP_FILES_DIR "pgsql_tmp"
+#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
+
 #endif							/* FD_H */
-- 
2.42.1

v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patchapplication/octet-stream; name=v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patchDownload

From 32752f2beed547f718887e4e6bd54f64a2e067e0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 19 Jul 2023 14:50:41 +1200
Subject: [PATCH v2 3/8] Provide vectored variants of smgrread() and
 smgrwrite().

smgrreadv() and smgrwritev() map to FileReadV() and FileWriteV(), and
thence the OS's vector I/O routines, after dealing with segmentation.
These functions allow for contiguous regions of relation files to be
transferred into and out of non-neighboring buffers with a single system
call.  The traditional smgrread() and smgrwrite() functions are
implemented in terms of the new functions.

The md.c implementation automatically handles short transfers by
retrying.  They were probably always possible even when we were only
doing 8kB reads and write, but with larger transfers they become more
likely.

Author: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/smgr/md.c   | 353 +++++++++++++++++++++++---------
 src/backend/storage/smgr/smgr.c |  37 ++--
 src/include/storage/fd.h        |   2 +-
 src/include/storage/md.h        |   9 +-
 src/include/storage/smgr.h      |  25 ++-
 5 files changed, 304 insertions(+), 122 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index fdecbad170..d1383220bd 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -452,7 +452,7 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
 /*
  * mdextend() -- Add a block to the specified relation.
  *
- * The semantics are nearly the same as mdwrite(): write at the
+ * The semantics are nearly the same as mdwritev(): write at the
  * specified position.  However, this is to be used for the case of
  * extending a relation (i.e., blocknum is at or beyond the current
  * EOF).  Note that we assume writing a block beyond current EOF
@@ -737,136 +737,297 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 }
 
 /*
- * mdread() -- Read the specified block from a relation.
+ * Convert an array of buffer address into an array of iovec objects, and
+ * return the number that were required.  'iov' must have enough space for up
+ * to PG_IOV_MAX elements.
  */
-void
-mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-	   void *buffer)
+static int
+buffers_to_iov(struct iovec *iov, void **buffers, int nblocks)
 {
-	off_t		seekpos;
-	int			nbytes;
-	MdfdVec    *v;
+	struct iovec *iovp;
+	int			iovcnt;
 
-	/* If this build supports direct I/O, the buffer must be I/O aligned. */
-	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
-		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+	Assert(nblocks >= 1);
+	Assert(nblocks <= PG_IOV_MAX);
 
-	TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber,
-										reln->smgr_rlocator.backend);
+	/* If this build supports direct I/O, buffers must be I/O aligned. */
+	for (int i = 0; i < nblocks; ++i)
+	{
+		if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
+			Assert((uintptr_t) buffers[i] ==
+				   TYPEALIGN(PG_IO_ALIGN_SIZE, buffers[i]));
+	}
 
-	v = _mdfd_getseg(reln, forknum, blocknum, false,
-					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+	/* Start the first iovec off with the first buffer. */
+	iovp = &iov[0];
+	iovp->iov_base = buffers[0];
+	iovp->iov_len = BLCKSZ;
+	iovcnt = 1;
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	/* Merge in the rest. */
+	for (int i = 1; i < nblocks; ++i)
+	{
+		void	   *buffer = buffers[i];
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		if (((char *) iovp->iov_base + iovp->iov_len) == buffer)
+		{
+			/* Contiguous with the last iovec. */
+			iovp->iov_len += BLCKSZ;
+		}
+		else
+		{
+			/* Need a new iovec. */
+			iovp++;
+			iovp->iov_base = buffer;
+			iovp->iov_len = BLCKSZ;
+			iovcnt++;
+		}
+	}
+
+	return iovcnt;
+}
 
-	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
+/*
+ * To continue a vector I/O after a short transfer, we need a way to remove
+ * the whole or partial iovecs that have already been transferred.  We assume
+ * that partial transfers with direct I/O will happen on suitable alignment
+ * boundaries, so that we don't have to make further adjustments for that.
+ */
+static int
+adjust_iovec_for_partial_transfer(struct iovec *iov,
+								  int iovcnt,
+								  size_t transferred)
+{
+	struct iovec *keep = iov;
 
-	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
-									   reln->smgr_rlocator.locator.spcOid,
-									   reln->smgr_rlocator.locator.dbOid,
-									   reln->smgr_rlocator.locator.relNumber,
-									   reln->smgr_rlocator.backend,
-									   nbytes,
-									   BLCKSZ);
+	Assert(iovcnt > 0);
 
-	if (nbytes != BLCKSZ)
+	/* Remove wholly transferred iovecs. */
+	while (keep->iov_len <= transferred)
 	{
-		if (nbytes < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read block %u in file \"%s\": %m",
-							blocknum, FilePathName(v->mdfd_vfd))));
+		transferred -= keep->iov_len;
+		keep++;
+		iovcnt--;
+	}
+	if (keep != iov)
+		memmove(iov, keep, sizeof(*keep) * iovcnt);
 
-		/*
-		 * Short read: we are at or past EOF, or we read a partial block at
-		 * EOF.  Normally this is an error; upper levels should never try to
-		 * read a nonexistent block.  However, if zero_damaged_pages is ON or
-		 * we are InRecovery, we should instead return zeroes without
-		 * complaining.  This allows, for example, the case of trying to
-		 * update a block that was later truncated away.
-		 */
-		if (zero_damaged_pages || InRecovery)
-			MemSet(buffer, 0, BLCKSZ);
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
-							blocknum, FilePathName(v->mdfd_vfd),
-							nbytes, BLCKSZ)));
+	/* Shouldn't be called with 'transferred' >= total size. */
+	Assert(iovcnt > 0);
+
+	/* Adjust leading iovec, which may have been partially transferred. */
+	Assert(iov->iov_len > transferred);
+	iov->iov_base = (char *) iov->iov_base + transferred;
+	iov->iov_len -= transferred;
+
+	return iovcnt;
+}
+
+/*
+ * mdreadv() -- Read the specified blocks from a relation.
+ */
+void
+mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		void **buffers, BlockNumber nblocks)
+{
+	while (nblocks > 0)
+	{
+		struct iovec iov[PG_IOV_MAX];
+		int			iovcnt;
+		off_t		seekpos;
+		int			nbytes;
+		MdfdVec    *v;
+		BlockNumber nblocks_this_segment;
+		size_t		transferred_this_segment;
+		size_t		size_this_segment;
+
+		v = _mdfd_getseg(reln, forknum, blocknum, false,
+						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+		nblocks_this_segment =
+			Min(nblocks,
+				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+		nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
+
+		iovcnt = buffers_to_iov(iov, buffers, nblocks_this_segment);
+		size_this_segment = nblocks_this_segment * BLCKSZ;
+		transferred_this_segment = 0;
+
+		/* Inner loop to continue after a short reads. */
+		for (;;)
+		{
+			TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
+												reln->smgr_rlocator.locator.spcOid,
+												reln->smgr_rlocator.locator.dbOid,
+												reln->smgr_rlocator.locator.relNumber,
+												reln->smgr_rlocator.backend);
+			nbytes = FileReadV(v->mdfd_vfd, iov, iovcnt, seekpos,
+							   WAIT_EVENT_DATA_FILE_READ);
+			TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
+											   reln->smgr_rlocator.locator.spcOid,
+											   reln->smgr_rlocator.locator.dbOid,
+											   reln->smgr_rlocator.locator.relNumber,
+											   reln->smgr_rlocator.backend,
+											   nbytes,
+											   BLCKSZ);
+
+			if (nbytes < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not read %u blocks at block %u in file \"%s\": %m",
+								nblocks_this_segment, blocknum,
+								FilePathName(v->mdfd_vfd))));
+
+			if (nbytes == 0)
+			{
+				/*
+				 * We are at or past EOF, or we read a partial block at EOF.
+				 * Normally this is an error; upper levels should never try to
+				 * read a nonexistent block.  However, if zero_damaged_pages
+				 * is ON or we are InRecovery, we should instead return zeroes
+				 * without complaining.  This allows, for example, the case of
+				 * trying to update a block that was later truncated away.
+				 */
+				if (zero_damaged_pages || InRecovery)
+				{
+					for (BlockNumber i = transferred_this_segment / BLCKSZ;
+						 i < nblocks_this_segment;
+						 ++i)
+						memset(buffers[i], 0, BLCKSZ);
+					break;
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("could not read %u blocks at block %u in file \"%s\": read only %zu of %zu bytes",
+									nblocks_this_segment,
+									blocknum,
+									FilePathName(v->mdfd_vfd),
+									transferred_this_segment,
+									size_this_segment)));
+			}
+
+			/* One loop should usually be enough. */
+			transferred_this_segment += nbytes;
+			Assert(transferred_this_segment <= size_this_segment);
+			if (transferred_this_segment == size_this_segment)
+				break;
+
+			/* Adjust position and vectors after a short transfer. */
+			seekpos += nbytes;
+			iovcnt = adjust_iovec_for_partial_transfer(iov, iovcnt, nbytes);
+		}
+
+		nblocks -= nblocks_this_segment;
+		buffers += nblocks_this_segment;
+		blocknum += nblocks_this_segment;
 	}
 }
 
 /*
- * mdwrite() -- Write the supplied block at the appropriate location.
+ * mdwritev() -- Write the supplied blocks at the appropriate location.
  *
  * This is to be used only for updating already-existing blocks of a
  * relation (ie, those before the current EOF).  To extend a relation,
  * use mdextend().
  */
 void
-mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		const void *buffer, bool skipFsync)
+mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
-	off_t		seekpos;
-	int			nbytes;
-	MdfdVec    *v;
-
-	/* If this build supports direct I/O, the buffer must be I/O aligned. */
-	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
-		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
-
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert(blocknum < mdnblocks(reln, forknum));
 #endif
 
-	TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum,
-										 reln->smgr_rlocator.locator.spcOid,
-										 reln->smgr_rlocator.locator.dbOid,
-										 reln->smgr_rlocator.locator.relNumber,
-										 reln->smgr_rlocator.backend);
+	while (nblocks > 0)
+	{
+		struct iovec iov[PG_IOV_MAX];
+		int			iovcnt;
+		off_t		seekpos;
+		int			nbytes;
+		MdfdVec    *v;
+		BlockNumber nblocks_this_segment;
+		size_t		transferred_this_segment;
+		size_t		size_this_segment;
 
-	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
-					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+		v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
+						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
 
-	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+		nblocks_this_segment =
+			Min(nblocks,
+				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+		nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
 
-	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber,
-										reln->smgr_rlocator.backend,
-										nbytes,
-										BLCKSZ);
+		iovcnt = buffers_to_iov(iov, (void **) buffers, nblocks_this_segment);
+		size_this_segment = nblocks_this_segment * BLCKSZ;
+		transferred_this_segment = 0;
 
-	if (nbytes != BLCKSZ)
-	{
-		if (nbytes < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not write block %u in file \"%s\": %m",
-							blocknum, FilePathName(v->mdfd_vfd))));
-		/* short write: complain appropriately */
-		ereport(ERROR,
-				(errcode(ERRCODE_DISK_FULL),
-				 errmsg("could not write block %u in file \"%s\": wrote only %d of %d bytes",
-						blocknum,
-						FilePathName(v->mdfd_vfd),
-						nbytes, BLCKSZ),
-				 errhint("Check free disk space.")));
-	}
+		/* Inner loop to continue after a short writes. */
+		for (;;)
+		{
+			TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum,
+												 reln->smgr_rlocator.locator.spcOid,
+												 reln->smgr_rlocator.locator.dbOid,
+												 reln->smgr_rlocator.locator.relNumber,
+												 reln->smgr_rlocator.backend);
+			nbytes = FileWriteV(v->mdfd_vfd, iov, iovcnt, seekpos,
+								WAIT_EVENT_DATA_FILE_WRITE);
+			TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
+												reln->smgr_rlocator.locator.spcOid,
+												reln->smgr_rlocator.locator.dbOid,
+												reln->smgr_rlocator.locator.relNumber,
+												reln->smgr_rlocator.backend,
+												nbytes,
+												BLCKSZ);
+
+			if (nbytes < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write %u blocks at block %u in file \"%s\": %m",
+								nblocks_this_segment, blocknum,
+								FilePathName(v->mdfd_vfd))));
 
-	if (!skipFsync && !SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, v);
+			if (nbytes == 0)
+			{
+				/* short write: complain appropriately */
+				ereport(ERROR,
+						(errcode(ERRCODE_DISK_FULL),
+						 errmsg("could not write %u blocks at block %u in file \"%s\": wrote only %zu of %zu bytes",
+								nblocks_this_segment,
+								blocknum,
+								FilePathName(v->mdfd_vfd),
+								transferred_this_segment,
+								size_this_segment),
+						 errhint("Check free disk space.")));
+			}
+
+			/* One loop should usually be enough. */
+			transferred_this_segment += nbytes;
+			Assert(transferred_this_segment <= size_this_segment);
+			if (transferred_this_segment == size_this_segment)
+				break;
+
+			/* Adjust position and vectors after a short transfer. */
+			seekpos += nbytes;
+			iovcnt = adjust_iovec_for_partial_transfer(iov, iovcnt, nbytes);
+		}
+
+		if (!skipFsync && !SmgrIsTemp(reln))
+			register_dirty_segment(reln, forknum, v);
+
+		nblocks -= nblocks_this_segment;
+		buffers += nblocks_this_segment;
+		blocknum += nblocks_this_segment;
+	}
 }
 
 /*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5d0f3d515c..b46ac3cb09 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -55,10 +55,13 @@ typedef struct f_smgr
 									BlockNumber blocknum, int nblocks, bool skipFsync);
 	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber blocknum);
-	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
-							  BlockNumber blocknum, void *buffer);
-	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
-							   BlockNumber blocknum, const void *buffer, bool skipFsync);
+	void		(*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
+							   BlockNumber blocknum,
+							   void **buffers, BlockNumber nblocks);
+	void		(*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
+								BlockNumber blocknum,
+								const void **buffers, BlockNumber nblocks,
+								bool skipFsync);
 	void		(*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
 								   BlockNumber blocknum, BlockNumber nblocks);
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -80,8 +83,8 @@ static const f_smgr smgrsw[] = {
 		.smgr_extend = mdextend,
 		.smgr_zeroextend = mdzeroextend,
 		.smgr_prefetch = mdprefetch,
-		.smgr_read = mdread,
-		.smgr_write = mdwrite,
+		.smgr_readv = mdreadv,
+		.smgr_writev = mdwritev,
 		.smgr_writeback = mdwriteback,
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
@@ -551,22 +554,23 @@ smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 }
 
 /*
- * smgrread() -- read a particular block from a relation into the supplied
- *				 buffer.
+ * smgrreadv() -- read a particular block range from a relation into the
+ *				 supplied buffers.
  *
  * This routine is called from the buffer manager in order to
  * instantiate pages in the shared buffer cache.  All storage managers
  * return pages in the format that POSTGRES expects.
  */
 void
-smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 void *buffer)
+smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		  void **buffers, BlockNumber nblocks)
 {
-	smgrsw[reln->smgr_which].smgr_read(reln, forknum, blocknum, buffer);
+	smgrsw[reln->smgr_which].smgr_readv(reln, forknum, blocknum, buffers,
+										nblocks);
 }
 
 /*
- * smgrwrite() -- Write the supplied buffer out.
+ * smgrwritev() -- Write the supplied buffers out.
  *
  * This is to be used only for updating already-existing blocks of a
  * relation (ie, those before the current EOF).  To extend a relation,
@@ -581,14 +585,13 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * do not require fsync.
  */
 void
-smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		  const void *buffer, bool skipFsync)
+smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		   const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
-	smgrsw[reln->smgr_which].smgr_write(reln, forknum, blocknum,
-										buffer, skipFsync);
+	smgrsw[reln->smgr_which].smgr_writev(reln, forknum, blocknum,
+										 buffers, nblocks, skipFsync);
 }
 
-
 /*
  * smgrwriteback() -- Trigger kernel writeback for the supplied range of
  *					   blocks.
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 1e01d28c0d..a145cf25cf 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -208,7 +208,7 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 		  uint32 wait_event_info)
 {
 	struct iovec iov = {
-		.iov_base = unconstify(void *, buffer),
+		.iov_base = (void *) buffer,
 		.iov_len = amount
 	};
 
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 941879ee6a..0f73d9af49 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -32,10 +32,11 @@ extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum, int nblocks, bool skipFsync);
 extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum);
-extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-				   void *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
-					BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					void **buffers, BlockNumber nblocks);
+extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum,
+					 const void **buffers, BlockNumber nblocks, bool skipFsync);
 extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
 						BlockNumber blocknum, BlockNumber nblocks);
 extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..ae93910132 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -96,10 +96,13 @@ extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
 						   BlockNumber blocknum, int nblocks, bool skipFsync);
 extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum);
-extern void smgrread(SMgrRelation reln, ForkNumber forknum,
-					 BlockNumber blocknum, void *buffer);
-extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
-					  BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
+					  BlockNumber blocknum,
+					  void **buffer, BlockNumber nblocks);
+extern void smgrwritev(SMgrRelation reln, ForkNumber forknum,
+					   BlockNumber blocknum,
+					   const void **buffer, BlockNumber nblocks,
+					   bool skipFsync);
 extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
 						  BlockNumber blocknum, BlockNumber nblocks);
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -110,4 +113,18 @@ extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void AtEOXact_SMgr(void);
 extern bool ProcessBarrierSmgrRelease(void);
 
+static inline void
+smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 void *buffer)
+{
+	smgrreadv(reln, forknum, blocknum, &buffer, 1);
+}
+
+static inline void
+smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		  const void *buffer, bool skipFsync)
+{
+	smgrwritev(reln, forknum, blocknum, &buffer, 1, skipFsync);
+}
+
 #endif							/* SMGR_H */
-- 
2.42.1

Heikki Linnakangas

hlinnaka@iki.fi

about 2 years ago

In reply to: Thomas Munro (#5)

Re: Streaming I/O, vectored I/O (WIP)

On 28/11/2023 14:17, Thomas Munro wrote:

On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
+     /* Avoid a slightly more expensive kernel call if there is no benefit. */
+     if (iovcnt == 1)
+             returnCode = pg_pread(vfdP->fd,
+                                                       iov[0].iov_base,
+                                                       iov[0].iov_len,
+                                                       offset);
+     else
+             returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
How about pushing down this optimization to pg_preadv() itself?
pg_readv() is currently just a macro if the system provides preadv(),
but it could be a "static inline" that does the above dance. I think
that optimization is platform-dependent anyway, pread() might not be any
faster on some OSs. In particular, if the system doesn't provide
preadv() and we use the implementation in src/port/preadv.c, it's the
same kernel call anyway.
Done. I like it, I just feel a bit bad about moving the p*v()
replacement functions around a couple of times already! I figured it
might as well be static inline even if we use the fallback (= Solaris
and Windows).

LGTM. I think this 0001 patch is ready for commit, independently of the
rest of the patches.

In v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch, fd.h:

+/* Filename components */
+#define PG_TEMP_FILES_DIR "pgsql_tmp"
+#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
+

These seem out of place, we already have them in common/file_utils.h.
Other than that,
v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch and
v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch look
good to me.

--
Heikki Linnakangas
Neon (https://neon.tech)

Heikki Linnakangas

hlinnaka@iki.fi

about 2 years ago

In reply to: Thomas Munro (#5)

Re: Streaming I/O, vectored I/O (WIP)

On 28/11/2023 14:17, Thomas Munro wrote:

On Thu, Sep 28, 2023 at 7:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

in streaming_read.h:

typedef bool (*PgStreamingReadBufferDetermineNextCB) (PgStreamingRead *pgsr,
uintptr_t pgsr_private,
void *per_io_private,
BufferManagerRelation *bmr,
ForkNumber *forkNum,
BlockNumber *blockNum,
ReadBufferMode *mode);

I was surprised that 'bmr', 'forkNum' and 'mode' are given separately on
each read. I see that you used that in the WAL replay prefetching, so I
guess that makes sense.

In this version I have introduced an alternative simple callback.
It's approximately what we had already tried out in an earlier version
before I started streamifying recovery, but in this version you can
choose, so recovery can opt for the wider callback.

Ok. Two APIs is a bit redundant, but because most callers would prefer
the simpler API, that's probably a good tradeoff.

I've added some ramp-up logic. The idea is that after we streamify
everything in sight, we don't want to penalise users that don't really
need more than one or two blocks, but don't know that yet. Here is
how the system calls look when you do pg_prewarm():

pread64(32, ..., 8192, 0) = 8192 <--- start with just one block
pread64(32, ..., 16384, 8192) = 16384
pread64(32, ..., 32768, 24576) = 32768
pread64(32, ..., 65536, 57344) = 65536
pread64(32, ..., 131072, 122880) = 131072 <--- soon reading 16
blocks at a time
pread64(32, ..., 131072, 253952) = 131072
pread64(32, ..., 131072, 385024) = 131072

I guess it could be done in quite a few different ways and I'm open to
better ideas. This way inserts prefetching stalls but ramps up
quickly and is soon out of the way. I wonder if we would want to make
that a policy that a caller can disable, if you want to skip the
ramp-up and go straight for the largest possible I/O size? Then I
think we'd need a 'flags' argument to the streaming read constructor
functions.

I think a 'flags' argument and a way to opt-out of the slow start would
make sense. pg_prewarm in particular knows that it will read the whole
relation.

A small detour: While contemplating how this interacts with parallel
sequential scan, which also has a notion of ramping up, I noticed
another problem. One parallel seq scan process does this:

fadvise64(32, 35127296, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 35127296) = 131072
preadv(32, [...], 2, 35258368) = 131072
fadvise64(32, 36175872, 131072, POSIX_FADV_WILLNEED) = 0
preadv(32, [...], 2, 36175872) = 131072
preadv(32, [...], 2, 36306944) = 131072
...

We don't really want those fadvise() calls. We don't get them with
parallelism disabled, because streaming_read.c is careful not to
generate advice for sequential workloads based on ancient wisdom from
this mailing list, re-confirmed on recent Linux: WILLNEED hints
actually get in the way of Linux's own prefetching and slow you down,
so we only want them for truly random access. But the logic can't see
that another process is making holes in this process's sequence.

Hmm, aside from making the sequential pattern invisible to this process,
are we defeating Linux's logic too, just by performing the reads from
multiple processes? The processes might issue the reads to the kernel
out-of-order.

How bad is the slowdown when you issue WILLNEED hints on sequential access?

The two obvious solutions are (1) pass in a flag at the start saying
"I promise this is sequential even if it doesn't look like it, no
hints please" and (2) invent "shared" (cross-process) streaming
reads, and teach all the parallel seq scan processes to get their
buffers from there.

Idea (2) is interesting to think about but even if it is a useful idea
(not sure) it is certainly overkill just to solve this little problem
for now. So perhaps I should implement (1), which would be another
reason to add a flags argument. It's not a perfect solution though
because some more 'data driven' parallel scans (indexes, bitmaps, ...)
have a similar problem that is less amenable to top-down kludgery.

(1) seems fine to me.

--
Heikki Linnakangas
Neon (https://neon.tech)

Melanie Plageman

melanieplageman@gmail.com

about 2 years ago

In reply to: Thomas Munro (#5)

Re: Streaming I/O, vectored I/O (WIP)

On Wed, Nov 29, 2023 at 01:17:19AM +1300, Thomas Munro wrote:

Thanks for posting a new version. I've included a review of 0004.

I've included just the pg_prewarm example user for now while we
discuss the basic infrastructure. The rest are rebased and in my
public Github branch streaming-read (repo macdice/postgres) if anyone
is interested (don't mind the red CI failures, they're just saying I
ran out of monthly CI credits on the 29th, so close...)

I agree it makes sense to commit the interface with just prewarm as a
user. Then we can start new threads for the various streaming read users
(e.g. vacuum, sequential scan, bitmapheapscan).

From db5de8ab5a1a804f41006239302fdce954cab331 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v2 4/8] Provide vectored variant of ReadBuffer().

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f7c67d504c..8ae3a72053 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1046,175 +1048,326 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
flags |= EB_LOCK_FIRST;

-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		*hit = false;
+
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
}

-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	buffer = PrepareReadBuffer(bmr,
+							   forkNum,
+							   blockNum,
+							   strategy,
+							   hit,
+							   &allocated);
+
+	/* At this point we do NOT hold any locks. */
+
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+	{
+		/* if we just want zeroes and a lock, we're done */
+		ZeroBuffer(buffer, mode);
+	}
+	else if (!*hit)
+	{
+		/* we might need to perform I/O */
+		CompleteReadBuffers(bmr,
+							&buffer,
+							forkNum,
+							blockNum,
+							1,
+							mode == RBM_ZERO_ON_ERROR,
+							strategy);
+	}
+
+	return buffer;
+}
+
+/*
+ * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
+ * the returned buffer can be used immediately.  Otherwise, a physical read
+ * should be completed with CompleteReadBuffers().  PrepareReadBuffer()
+ * followed by CompleteReadBuffers() is equivalent ot ReadBuffer(), but the

ot -> to

+ * caller has the opportunity to coalesce reads of neighboring blocks into one
+ * CompleteReadBuffers() call.
+ *
+ * *found is set to true for a hit, and false for a miss.
+ *
+ * *allocated is set to true for a miss that allocates a buffer for the first
+ * time.  If there are multiple calls to PrepareReadBuffer() for the same
+ * block before CompleteReadBuffers() or ReadBuffer_common() finishes the
+ * read, then only the first such call will receive *allocated == true, which
+ * the caller might use to issue just one prefetch hint.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *found,
+				  bool *allocated)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;

+	Assert(blockNum != P_NEW);
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
if (isLocalBuf)
{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
io_context = IOCONTEXT_NORMAL;
io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
}
else
{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
io_context = IOContextForStrategy(strategy);
io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;

You've lost this test in your new version. You can do the same thing
(avoid counting zeroed buffers as blocks read) by moving this
pgBufferUsage.shared/local_blks_read++ back into ReadBuffer_common()
where you know if you called ZeroBuffer() or CompleteReadBuffers().

}

-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);

-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, found, allocated);
+		if (*found)
+			pgBufferUsage.local_blks_hit++;
+		else
+			pgBufferUsage.local_blks_read++;

See comment above.

+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, found, allocated, io_context);
+		if (*found)
+			pgBufferUsage.shared_blks_hit++;
+		else
+			pgBufferUsage.shared_blks_read++;
+	}
+	if (bmr.rel)
+	{
+		pgstat_count_buffer_read(bmr.rel);

This is double-counting reads. You've left the call in
ReadBufferExtended() as well as adding this here. It should be fine to
remove it from ReadBufferExtended(). Because you test bmr.rel, leaving
the call here in PrepareReadBuffer() wouldn't have an effect on
ReadBuffer_common() callers who don't pass a relation (like recovery).
The other current callers of ReadBuffer_common() (by way of
ExtendBufferedRelTo()) who do pass a relation are visibility map and
freespace map extension, and I don't think we track relation stats for
the VM and FSM.

This does continue the practice of counting zeroed buffers as reads in
table-level stats. But, that is the same as master.

-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+/*
+ * Complete a set reads prepared with PrepareReadBuffers().  The buffers must
+ * cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+					Buffer *buffers,
+					ForkNumber forknum,
+					BlockNumber blocknum,
+					int nblocks,
+					bool zero_on_error,
+					BufferAccessStrategy strategy)
+{

...

-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		/* We found a buffer that we have to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;

-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?
+		 */
+		while ((i + 1) < nblocks &&
+			   CompleteReadBuffersCanStartIO(buffers[i + 1]))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time();
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start, 1);

I'd pass io_buffers_len as cnt to pgstat_count_io_op_time(). op_bytes
will be BLCKSZ and multiplying that by the number of reads should
produce the number of bytes read.

diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 4efb34b75a..ee9307b612 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -116,7 +116,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
*/
BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
-				 bool *foundPtr)
+				 bool *foundPtr, bool *allocPtr)
{
BufferTag	newTag;			/* identity of requested block */
LocalBufferLookupEnt *hresult;
@@ -144,6 +144,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
Assert(BufferTagsEqual(&bufHdr->tag, &newTag));

*foundPtr = PinLocalBuffer(bufHdr, true);
+ *allocPtr = false;
}
else
{
@@ -170,6 +171,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);

*foundPtr = false;
+ *allocPtr = true;
}

I would prefer you use consistent naming for
allocPtr/allocatedPtr/allocated. I also think that all the functions
taking it as an output argument should explain what it is
(BufferAlloc()/LocalBufferAlloc(), etc). I found myself doing a bit of
digging around to figure it out. You have a nice comment about it above
PrepareReadBuffer(). I think you may need to resign yourself to
restating that bit (or some version of it) for all of the functions
taking it as an argument.

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 41e26d3e20..e29ca85077 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,8 @@
#ifndef BUFMGR_H
#define BUFMGR_H

+#include "pgstat.h"

I don't know what we are supposed to do, but I would have included this
in bufmgr.c (where I actually needed it) instead of including it here.

+#include "port/pg_iovec.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -47,6 +49,8 @@ typedef enum
RBM_ZERO_AND_CLEANUP_LOCK, /* Like RBM_ZERO_AND_LOCK, but locks the page
* in "cleanup" mode */
RBM_ZERO_ON_ERROR, /* Read, but return an all-zeros page on error */

+	RBM_WILL_ZERO,				/* Don't read from disk, caller will call
+								 * ZeroBuffer() */

It's confusing that this (RBM_WILL_ZERO) is part of this commit since it
isn't used in this commit.

RBM_NORMAL_NO_LOG, /* Don't log page as invalid during WAL
* replay; otherwise same as RBM_NORMAL */
} ReadBufferMode;

- Melanie

Thomas Munro

thomas.munro@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#6)

2 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

On Wed, Nov 29, 2023 at 1:44 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

LGTM. I think this 0001 patch is ready for commit, independently of the
rest of the patches.

Done.

In v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch, fd.h:
+/* Filename components */
+#define PG_TEMP_FILES_DIR "pgsql_tmp"
+#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
+
These seem out of place, we already have them in common/file_utils.h.

Yeah, they moved from there in f39b2658 and I messed up the rebase. Fixed.

Other than that,
v2-0002-Provide-vectored-variants-of-FileRead-and-FileWri-1.patch and
v2-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patch look
good to me.

One thing I wasn't 100% happy with was the treatment of ENOSPC. A few
callers believe that short writes set errno: they error out with a
message including %m. We have historically set errno = ENOSPC inside
FileWrite() if the write size was unexpectedly small AND the kernel
didn't set errno to a non-zero value (having set it to zero ourselves
earlier). In FileWriteV(), I didn't want to do that because it is
expensive to compute the total write size from the vector array and we
managed to measure an effect due to that in some workloads.

Note that the smgr patch actually handles short writes by continuing,
instead of raising an error. Short writes do already occur in the
wild on various systems for various rare technical reasons other than
ENOSPC I have heard (imagine transient failure to acquire some
temporary memory that the kernel chooses not to wait for, stuff like
that, though certainly many people and programs believe they should
not happen[1]https://utcc.utoronto.ca/~cks/space/blog/unix/WritesNotShortOften), and it seems like a good idea to actually handle them
as our write sizes increase and the probability of short writes might
presumably increase.

With the previous version of the patch, we'd have to change a couple
of other callers not to believe that short writes are errors and set
errno (callers are inconsistent on this point). I don't really love
that we have "fake" system errors but I also want to stay focused
here, so in this new version V3 I tried a new approach: I realised I
can just always set errno without needing the total size, so that
(undocumented) aspect of the interface doesn't change. The point
being that it doesn't matter if you clobber errno with a bogus value
when the write was non-short. Thoughts?

[1]: https://utcc.utoronto.ca/~cks/space/blog/unix/WritesNotShortOften

Attachments:

v3-0001-Provide-vectored-variants-of-FileRead-and-FileWri.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Provide-vectored-variants-of-FileRead-and-FileWri.patchDownload

From 7808d7241088f8dc02d7d828dc4ee0227e8e2a01 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 19 Jul 2023 12:32:51 +1200
Subject: [PATCH v3 1/7] Provide vectored variants of FileRead() and
 FileWrite().

FileReadV() and FileWriteV() adapt preadv() and pwritev() for
PostgreSQL's virtual file descriptors.  The traditional FileRead() and
FileWrite() functions are implemented in terms of the new functions.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/file/fd.c | 33 +++++++++++++++++++++++----------
 src/include/storage/fd.h      | 30 ++++++++++++++++++++++++++++--
 2 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f691ba0932..aa89c755b3 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2110,8 +2110,8 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 }
 
 int
-FileRead(File file, void *buffer, size_t amount, off_t offset,
-		 uint32 wait_event_info)
+FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+		  uint32 wait_event_info)
 {
 	int			returnCode;
 	Vfd		   *vfdP;
@@ -2131,7 +2131,7 @@ FileRead(File file, void *buffer, size_t amount, off_t offset,
 
 retry:
 	pgstat_report_wait_start(wait_event_info);
-	returnCode = pg_pread(vfdP->fd, buffer, amount, offset);
+	returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
 	pgstat_report_wait_end();
 
 	if (returnCode < 0)
@@ -2166,8 +2166,8 @@ retry:
 }
 
 int
-FileWrite(File file, const void *buffer, size_t amount, off_t offset,
-		  uint32 wait_event_info)
+FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+		   uint32 wait_event_info)
 {
 	int			returnCode;
 	Vfd		   *vfdP;
@@ -2195,7 +2195,14 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 	 */
 	if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
 	{
-		off_t		past_write = offset + amount;
+		size_t		size = 0;
+		off_t		past_write;
+
+		/* Compute the total transfer size. */
+		for (int i = 0; i < iovcnt; ++i)
+			size += iov[i].iov_len;
+
+		past_write = offset + size;
 
 		if (past_write > vfdP->fileSize)
 		{
@@ -2213,11 +2220,17 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 retry:
 	errno = 0;
 	pgstat_report_wait_start(wait_event_info);
-	returnCode = pg_pwrite(VfdCache[file].fd, buffer, amount, offset);
+	returnCode = pg_pwritev(vfdP->fd, iov, iovcnt, offset);
 	pgstat_report_wait_end();
 
-	/* if write didn't set errno, assume problem is no disk space */
-	if (returnCode != amount && errno == 0)
+	/*
+	 * Some callers expect short writes to set errno, even though that isn't
+	 * the behavior of the underlying system calls.  We don't want to waste
+	 * CPU cycles adding up the total write size here, so we'll just set it to
+	 * ENOSPC even on success, if the syscall didn't set it.  On successful
+	 * non-short write, no caller should consult errno.
+	 */
+	if (errno == 0)
 		errno = ENOSPC;
 
 	if (returnCode >= 0)
@@ -2227,7 +2240,7 @@ retry:
 		 */
 		if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
 		{
-			off_t		past_write = offset + amount;
+			off_t		past_write = offset + returnCode;
 
 			if (past_write > vfdP->fileSize)
 			{
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index d9d5d9da5f..a1d6056d9d 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -43,6 +43,8 @@
 #ifndef FD_H
 #define FD_H
 
+#include "port/pg_iovec.h"
+
 #include <dirent.h>
 #include <fcntl.h>
 
@@ -105,8 +107,8 @@ extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fil
 extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
 extern int	FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
-extern int	FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
-extern int	FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
+extern int	FileReadV(File file, const struct iovec *ioc, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int	FileWriteV(File file, const struct iovec *ioc, int iovcnt, off_t offset, uint32 wait_event_info);
 extern int	FileSync(File file, uint32 wait_event_info);
 extern int	FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
 extern int	FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
@@ -189,4 +191,28 @@ extern int	durable_unlink(const char *fname, int elevel);
 extern void SyncDataDirectory(void);
 extern int	data_sync_elevel(int elevel);
 
+static inline int
+FileRead(File file, void *buffer, size_t amount, off_t offset,
+		 uint32 wait_event_info)
+{
+	struct iovec iov = {
+		.iov_base = buffer,
+		.iov_len = amount
+	};
+
+	return FileReadV(file, &iov, 1, offset, wait_event_info);
+}
+
+static inline int
+FileWrite(File file, const void *buffer, size_t amount, off_t offset,
+		  uint32 wait_event_info)
+{
+	struct iovec iov = {
+		.iov_base = unconstify(void *, buffer),
+		.iov_len = amount
+	};
+
+	return FileWriteV(file, &iov, 1, offset, wait_event_info);
+}
+
 #endif							/* FD_H */
-- 
2.42.1

v3-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patchtext/x-patch; charset=US-ASCII; name=v3-0002-Provide-vectored-variants-of-smgrread-and-smgrwri.patchDownload

From 5673ae46baa2cf8fa006f04462b9a60004873c65 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 19 Jul 2023 14:50:41 +1200
Subject: [PATCH v3 2/7] Provide vectored variants of smgrread() and
 smgrwrite().

smgrreadv() and smgrwritev() map to FileReadV() and FileWriteV(), and
thence the OS's vector I/O routines, after dealing with segmentation.
These functions allow for contiguous regions of relation files to be
transferred into and out of non-neighboring buffers with a single system
call.  The traditional smgrread() and smgrwrite() functions are
implemented in terms of the new functions.

The md.c implementation automatically handles short transfers by
retrying.  They were probably always possible even when we were only
doing 8kB reads and write, but with larger transfers they might become
more likely.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/smgr/md.c   | 353 +++++++++++++++++++++++---------
 src/backend/storage/smgr/smgr.c |  37 ++--
 src/include/storage/fd.h        |   2 +-
 src/include/storage/md.h        |   9 +-
 src/include/storage/smgr.h      |  25 ++-
 5 files changed, 304 insertions(+), 122 deletions(-)

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index fdecbad170..d1383220bd 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -452,7 +452,7 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
 /*
  * mdextend() -- Add a block to the specified relation.
  *
- * The semantics are nearly the same as mdwrite(): write at the
+ * The semantics are nearly the same as mdwritev(): write at the
  * specified position.  However, this is to be used for the case of
  * extending a relation (i.e., blocknum is at or beyond the current
  * EOF).  Note that we assume writing a block beyond current EOF
@@ -737,136 +737,297 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 }
 
 /*
- * mdread() -- Read the specified block from a relation.
+ * Convert an array of buffer address into an array of iovec objects, and
+ * return the number that were required.  'iov' must have enough space for up
+ * to PG_IOV_MAX elements.
  */
-void
-mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-	   void *buffer)
+static int
+buffers_to_iov(struct iovec *iov, void **buffers, int nblocks)
 {
-	off_t		seekpos;
-	int			nbytes;
-	MdfdVec    *v;
+	struct iovec *iovp;
+	int			iovcnt;
 
-	/* If this build supports direct I/O, the buffer must be I/O aligned. */
-	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
-		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
+	Assert(nblocks >= 1);
+	Assert(nblocks <= PG_IOV_MAX);
 
-	TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber,
-										reln->smgr_rlocator.backend);
+	/* If this build supports direct I/O, buffers must be I/O aligned. */
+	for (int i = 0; i < nblocks; ++i)
+	{
+		if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
+			Assert((uintptr_t) buffers[i] ==
+				   TYPEALIGN(PG_IO_ALIGN_SIZE, buffers[i]));
+	}
 
-	v = _mdfd_getseg(reln, forknum, blocknum, false,
-					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+	/* Start the first iovec off with the first buffer. */
+	iovp = &iov[0];
+	iovp->iov_base = buffers[0];
+	iovp->iov_len = BLCKSZ;
+	iovcnt = 1;
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	/* Merge in the rest. */
+	for (int i = 1; i < nblocks; ++i)
+	{
+		void	   *buffer = buffers[i];
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		if (((char *) iovp->iov_base + iovp->iov_len) == buffer)
+		{
+			/* Contiguous with the last iovec. */
+			iovp->iov_len += BLCKSZ;
+		}
+		else
+		{
+			/* Need a new iovec. */
+			iovp++;
+			iovp->iov_base = buffer;
+			iovp->iov_len = BLCKSZ;
+			iovcnt++;
+		}
+	}
+
+	return iovcnt;
+}
 
-	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
+/*
+ * To continue a vector I/O after a short transfer, we need a way to remove
+ * the whole or partial iovecs that have already been transferred.  We assume
+ * that partial transfers with direct I/O will happen on suitable alignment
+ * boundaries, so that we don't have to make further adjustments for that.
+ */
+static int
+adjust_iovec_for_partial_transfer(struct iovec *iov,
+								  int iovcnt,
+								  size_t transferred)
+{
+	struct iovec *keep = iov;
 
-	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
-									   reln->smgr_rlocator.locator.spcOid,
-									   reln->smgr_rlocator.locator.dbOid,
-									   reln->smgr_rlocator.locator.relNumber,
-									   reln->smgr_rlocator.backend,
-									   nbytes,
-									   BLCKSZ);
+	Assert(iovcnt > 0);
 
-	if (nbytes != BLCKSZ)
+	/* Remove wholly transferred iovecs. */
+	while (keep->iov_len <= transferred)
 	{
-		if (nbytes < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read block %u in file \"%s\": %m",
-							blocknum, FilePathName(v->mdfd_vfd))));
+		transferred -= keep->iov_len;
+		keep++;
+		iovcnt--;
+	}
+	if (keep != iov)
+		memmove(iov, keep, sizeof(*keep) * iovcnt);
 
-		/*
-		 * Short read: we are at or past EOF, or we read a partial block at
-		 * EOF.  Normally this is an error; upper levels should never try to
-		 * read a nonexistent block.  However, if zero_damaged_pages is ON or
-		 * we are InRecovery, we should instead return zeroes without
-		 * complaining.  This allows, for example, the case of trying to
-		 * update a block that was later truncated away.
-		 */
-		if (zero_damaged_pages || InRecovery)
-			MemSet(buffer, 0, BLCKSZ);
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
-							blocknum, FilePathName(v->mdfd_vfd),
-							nbytes, BLCKSZ)));
+	/* Shouldn't be called with 'transferred' >= total size. */
+	Assert(iovcnt > 0);
+
+	/* Adjust leading iovec, which may have been partially transferred. */
+	Assert(iov->iov_len > transferred);
+	iov->iov_base = (char *) iov->iov_base + transferred;
+	iov->iov_len -= transferred;
+
+	return iovcnt;
+}
+
+/*
+ * mdreadv() -- Read the specified blocks from a relation.
+ */
+void
+mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		void **buffers, BlockNumber nblocks)
+{
+	while (nblocks > 0)
+	{
+		struct iovec iov[PG_IOV_MAX];
+		int			iovcnt;
+		off_t		seekpos;
+		int			nbytes;
+		MdfdVec    *v;
+		BlockNumber nblocks_this_segment;
+		size_t		transferred_this_segment;
+		size_t		size_this_segment;
+
+		v = _mdfd_getseg(reln, forknum, blocknum, false,
+						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+		nblocks_this_segment =
+			Min(nblocks,
+				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+		nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
+
+		iovcnt = buffers_to_iov(iov, buffers, nblocks_this_segment);
+		size_this_segment = nblocks_this_segment * BLCKSZ;
+		transferred_this_segment = 0;
+
+		/* Inner loop to continue after a short reads. */
+		for (;;)
+		{
+			TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
+												reln->smgr_rlocator.locator.spcOid,
+												reln->smgr_rlocator.locator.dbOid,
+												reln->smgr_rlocator.locator.relNumber,
+												reln->smgr_rlocator.backend);
+			nbytes = FileReadV(v->mdfd_vfd, iov, iovcnt, seekpos,
+							   WAIT_EVENT_DATA_FILE_READ);
+			TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
+											   reln->smgr_rlocator.locator.spcOid,
+											   reln->smgr_rlocator.locator.dbOid,
+											   reln->smgr_rlocator.locator.relNumber,
+											   reln->smgr_rlocator.backend,
+											   nbytes,
+											   BLCKSZ);
+
+			if (nbytes < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not read %u blocks at block %u in file \"%s\": %m",
+								nblocks_this_segment, blocknum,
+								FilePathName(v->mdfd_vfd))));
+
+			if (nbytes == 0)
+			{
+				/*
+				 * We are at or past EOF, or we read a partial block at EOF.
+				 * Normally this is an error; upper levels should never try to
+				 * read a nonexistent block.  However, if zero_damaged_pages
+				 * is ON or we are InRecovery, we should instead return zeroes
+				 * without complaining.  This allows, for example, the case of
+				 * trying to update a block that was later truncated away.
+				 */
+				if (zero_damaged_pages || InRecovery)
+				{
+					for (BlockNumber i = transferred_this_segment / BLCKSZ;
+						 i < nblocks_this_segment;
+						 ++i)
+						memset(buffers[i], 0, BLCKSZ);
+					break;
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("could not read %u blocks at block %u in file \"%s\": read only %zu of %zu bytes",
+									nblocks_this_segment,
+									blocknum,
+									FilePathName(v->mdfd_vfd),
+									transferred_this_segment,
+									size_this_segment)));
+			}
+
+			/* One loop should usually be enough. */
+			transferred_this_segment += nbytes;
+			Assert(transferred_this_segment <= size_this_segment);
+			if (transferred_this_segment == size_this_segment)
+				break;
+
+			/* Adjust position and vectors after a short transfer. */
+			seekpos += nbytes;
+			iovcnt = adjust_iovec_for_partial_transfer(iov, iovcnt, nbytes);
+		}
+
+		nblocks -= nblocks_this_segment;
+		buffers += nblocks_this_segment;
+		blocknum += nblocks_this_segment;
 	}
 }
 
 /*
- * mdwrite() -- Write the supplied block at the appropriate location.
+ * mdwritev() -- Write the supplied blocks at the appropriate location.
  *
  * This is to be used only for updating already-existing blocks of a
  * relation (ie, those before the current EOF).  To extend a relation,
  * use mdextend().
  */
 void
-mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		const void *buffer, bool skipFsync)
+mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
-	off_t		seekpos;
-	int			nbytes;
-	MdfdVec    *v;
-
-	/* If this build supports direct I/O, the buffer must be I/O aligned. */
-	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
-		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
-
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert(blocknum < mdnblocks(reln, forknum));
 #endif
 
-	TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum,
-										 reln->smgr_rlocator.locator.spcOid,
-										 reln->smgr_rlocator.locator.dbOid,
-										 reln->smgr_rlocator.locator.relNumber,
-										 reln->smgr_rlocator.backend);
+	while (nblocks > 0)
+	{
+		struct iovec iov[PG_IOV_MAX];
+		int			iovcnt;
+		off_t		seekpos;
+		int			nbytes;
+		MdfdVec    *v;
+		BlockNumber nblocks_this_segment;
+		size_t		transferred_this_segment;
+		size_t		size_this_segment;
 
-	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
-					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+		v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
+						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
 
-	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+		nblocks_this_segment =
+			Min(nblocks,
+				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+		nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
 
-	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber,
-										reln->smgr_rlocator.backend,
-										nbytes,
-										BLCKSZ);
+		iovcnt = buffers_to_iov(iov, (void **) buffers, nblocks_this_segment);
+		size_this_segment = nblocks_this_segment * BLCKSZ;
+		transferred_this_segment = 0;
 
-	if (nbytes != BLCKSZ)
-	{
-		if (nbytes < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not write block %u in file \"%s\": %m",
-							blocknum, FilePathName(v->mdfd_vfd))));
-		/* short write: complain appropriately */
-		ereport(ERROR,
-				(errcode(ERRCODE_DISK_FULL),
-				 errmsg("could not write block %u in file \"%s\": wrote only %d of %d bytes",
-						blocknum,
-						FilePathName(v->mdfd_vfd),
-						nbytes, BLCKSZ),
-				 errhint("Check free disk space.")));
-	}
+		/* Inner loop to continue after a short writes. */
+		for (;;)
+		{
+			TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum,
+												 reln->smgr_rlocator.locator.spcOid,
+												 reln->smgr_rlocator.locator.dbOid,
+												 reln->smgr_rlocator.locator.relNumber,
+												 reln->smgr_rlocator.backend);
+			nbytes = FileWriteV(v->mdfd_vfd, iov, iovcnt, seekpos,
+								WAIT_EVENT_DATA_FILE_WRITE);
+			TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
+												reln->smgr_rlocator.locator.spcOid,
+												reln->smgr_rlocator.locator.dbOid,
+												reln->smgr_rlocator.locator.relNumber,
+												reln->smgr_rlocator.backend,
+												nbytes,
+												BLCKSZ);
+
+			if (nbytes < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write %u blocks at block %u in file \"%s\": %m",
+								nblocks_this_segment, blocknum,
+								FilePathName(v->mdfd_vfd))));
 
-	if (!skipFsync && !SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, v);
+			if (nbytes == 0)
+			{
+				/* short write: complain appropriately */
+				ereport(ERROR,
+						(errcode(ERRCODE_DISK_FULL),
+						 errmsg("could not write %u blocks at block %u in file \"%s\": wrote only %zu of %zu bytes",
+								nblocks_this_segment,
+								blocknum,
+								FilePathName(v->mdfd_vfd),
+								transferred_this_segment,
+								size_this_segment),
+						 errhint("Check free disk space.")));
+			}
+
+			/* One loop should usually be enough. */
+			transferred_this_segment += nbytes;
+			Assert(transferred_this_segment <= size_this_segment);
+			if (transferred_this_segment == size_this_segment)
+				break;
+
+			/* Adjust position and vectors after a short transfer. */
+			seekpos += nbytes;
+			iovcnt = adjust_iovec_for_partial_transfer(iov, iovcnt, nbytes);
+		}
+
+		if (!skipFsync && !SmgrIsTemp(reln))
+			register_dirty_segment(reln, forknum, v);
+
+		nblocks -= nblocks_this_segment;
+		buffers += nblocks_this_segment;
+		blocknum += nblocks_this_segment;
+	}
 }
 
 /*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5d0f3d515c..b46ac3cb09 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -55,10 +55,13 @@ typedef struct f_smgr
 									BlockNumber blocknum, int nblocks, bool skipFsync);
 	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber blocknum);
-	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
-							  BlockNumber blocknum, void *buffer);
-	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
-							   BlockNumber blocknum, const void *buffer, bool skipFsync);
+	void		(*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
+							   BlockNumber blocknum,
+							   void **buffers, BlockNumber nblocks);
+	void		(*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
+								BlockNumber blocknum,
+								const void **buffers, BlockNumber nblocks,
+								bool skipFsync);
 	void		(*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
 								   BlockNumber blocknum, BlockNumber nblocks);
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -80,8 +83,8 @@ static const f_smgr smgrsw[] = {
 		.smgr_extend = mdextend,
 		.smgr_zeroextend = mdzeroextend,
 		.smgr_prefetch = mdprefetch,
-		.smgr_read = mdread,
-		.smgr_write = mdwrite,
+		.smgr_readv = mdreadv,
+		.smgr_writev = mdwritev,
 		.smgr_writeback = mdwriteback,
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
@@ -551,22 +554,23 @@ smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 }
 
 /*
- * smgrread() -- read a particular block from a relation into the supplied
- *				 buffer.
+ * smgrreadv() -- read a particular block range from a relation into the
+ *				 supplied buffers.
  *
  * This routine is called from the buffer manager in order to
  * instantiate pages in the shared buffer cache.  All storage managers
  * return pages in the format that POSTGRES expects.
  */
 void
-smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 void *buffer)
+smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		  void **buffers, BlockNumber nblocks)
 {
-	smgrsw[reln->smgr_which].smgr_read(reln, forknum, blocknum, buffer);
+	smgrsw[reln->smgr_which].smgr_readv(reln, forknum, blocknum, buffers,
+										nblocks);
 }
 
 /*
- * smgrwrite() -- Write the supplied buffer out.
+ * smgrwritev() -- Write the supplied buffers out.
  *
  * This is to be used only for updating already-existing blocks of a
  * relation (ie, those before the current EOF).  To extend a relation,
@@ -581,14 +585,13 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * do not require fsync.
  */
 void
-smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		  const void *buffer, bool skipFsync)
+smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		   const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
-	smgrsw[reln->smgr_which].smgr_write(reln, forknum, blocknum,
-										buffer, skipFsync);
+	smgrsw[reln->smgr_which].smgr_writev(reln, forknum, blocknum,
+										 buffers, nblocks, skipFsync);
 }
 
-
 /*
  * smgrwriteback() -- Trigger kernel writeback for the supplied range of
  *					   blocks.
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index a1d6056d9d..7025af618a 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -208,7 +208,7 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 		  uint32 wait_event_info)
 {
 	struct iovec iov = {
-		.iov_base = unconstify(void *, buffer),
+		.iov_base = (void *) buffer,
 		.iov_len = amount
 	};
 
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 941879ee6a..0f73d9af49 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -32,10 +32,11 @@ extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum, int nblocks, bool skipFsync);
 extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum);
-extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-				   void *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
-					BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					void **buffers, BlockNumber nblocks);
+extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum,
+					 const void **buffers, BlockNumber nblocks, bool skipFsync);
 extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
 						BlockNumber blocknum, BlockNumber nblocks);
 extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..ae93910132 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -96,10 +96,13 @@ extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
 						   BlockNumber blocknum, int nblocks, bool skipFsync);
 extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum);
-extern void smgrread(SMgrRelation reln, ForkNumber forknum,
-					 BlockNumber blocknum, void *buffer);
-extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
-					  BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
+					  BlockNumber blocknum,
+					  void **buffer, BlockNumber nblocks);
+extern void smgrwritev(SMgrRelation reln, ForkNumber forknum,
+					   BlockNumber blocknum,
+					   const void **buffer, BlockNumber nblocks,
+					   bool skipFsync);
 extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
 						  BlockNumber blocknum, BlockNumber nblocks);
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -110,4 +113,18 @@ extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void AtEOXact_SMgr(void);
 extern bool ProcessBarrierSmgrRelease(void);
 
+static inline void
+smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 void *buffer)
+{
+	smgrreadv(reln, forknum, blocknum, &buffer, 1);
+}
+
+static inline void
+smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		  const void *buffer, bool skipFsync)
+{
+	smgrwritev(reln, forknum, blocknum, &buffer, 1, skipFsync);
+}
+
 #endif							/* SMGR_H */
-- 
2.42.1

#10

Heikki Linnakangas

hlinnaka@iki.fi

about 2 years ago

In reply to: Thomas Munro (#9)

Re: Streaming I/O, vectored I/O (WIP)

On 29/11/2023 21:39, Thomas Munro wrote:

One thing I wasn't 100% happy with was the treatment of ENOSPC. A few
callers believe that short writes set errno: they error out with a
message including %m. We have historically set errno = ENOSPC inside
FileWrite() if the write size was unexpectedly small AND the kernel
didn't set errno to a non-zero value (having set it to zero ourselves
earlier). In FileWriteV(), I didn't want to do that because it is
expensive to compute the total write size from the vector array and we
managed to measure an effect due to that in some workloads.

Note that the smgr patch actually handles short writes by continuing,
instead of raising an error. Short writes do already occur in the
wild on various systems for various rare technical reasons other than
ENOSPC I have heard (imagine transient failure to acquire some
temporary memory that the kernel chooses not to wait for, stuff like
that, though certainly many people and programs believe they should
not happen[1]), and it seems like a good idea to actually handle them
as our write sizes increase and the probability of short writes might
presumably increase.

Maybe we should bite the bullet and always retry short writes in
FileWriteV(). Is that what you meant by "handling them"?

If the total size is expensive to calculate, how about passing it as an
extra argument? Presumably it is cheap for the callers to calculate at
the same time that they build the iovec array?

With the previous version of the patch, we'd have to change a couple
of other callers not to believe that short writes are errors and set
errno (callers are inconsistent on this point). I don't really love
that we have "fake" system errors but I also want to stay focused
here, so in this new version V3 I tried a new approach: I realised I
can just always set errno without needing the total size, so that
(undocumented) aspect of the interface doesn't change. The point
being that it doesn't matter if you clobber errno with a bogus value
when the write was non-short. Thoughts?

Feels pretty ugly, but I don't see anything outright wrong with that.

--
Heikki Linnakangas
Neon (https://neon.tech)

#11

Thomas Munro

thomas.munro@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#10)

Re: Streaming I/O, vectored I/O (WIP)

On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 29/11/2023 21:39, Thomas Munro wrote:

One thing I wasn't 100% happy with was the treatment of ENOSPC. A few
callers believe that short writes set errno: they error out with a
message including %m. We have historically set errno = ENOSPC inside
FileWrite() if the write size was unexpectedly small AND the kernel
didn't set errno to a non-zero value (having set it to zero ourselves
earlier). In FileWriteV(), I didn't want to do that because it is
expensive to compute the total write size from the vector array and we
managed to measure an effect due to that in some workloads.

Note that the smgr patch actually handles short writes by continuing,
instead of raising an error. Short writes do already occur in the
wild on various systems for various rare technical reasons other than
ENOSPC I have heard (imagine transient failure to acquire some
temporary memory that the kernel chooses not to wait for, stuff like
that, though certainly many people and programs believe they should
not happen[1]), and it seems like a good idea to actually handle them
as our write sizes increase and the probability of short writes might
presumably increase.

Maybe we should bite the bullet and always retry short writes in
FileWriteV(). Is that what you meant by "handling them"?
If the total size is expensive to calculate, how about passing it as an
extra argument? Presumably it is cheap for the callers to calculate at
the same time that they build the iovec array?

It's cheap for md.c, because it already has nblocks_this_segment.
That's one reason I chose to put the retry there. If we push it down
to fd.c in order to be able to help other callers, you're right that
we could pass in the total size (and I guess assert that it's
correct), but that is sort of annoyingly redundant and further from
the interface we're wrapping.

There is another problem with pushing it down to fd.c, though.
Suppose you try to write 8192 bytes, and the kernel says "you wrote
4096 bytes" so your loop goes around again with the second half the
data and now the kernel says "-1, ENOSPC". What are you going to do?
fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
so you'd either have to return -1, ENOSPC (converting short writes
into actual errors, a lie because you did write some data), or return
4096 (and possibly also set errno = ENOSPC as we have always done).
So you can't really handle this problem at this level, can you?
Unless you decide that fd.c should get into the business of raising
errors for I/O failures, which would be a bit of a departure.

That's why I did the retry higher up in md.c.

With the previous version of the patch, we'd have to change a couple
of other callers not to believe that short writes are errors and set
errno (callers are inconsistent on this point). I don't really love
that we have "fake" system errors but I also want to stay focused
here, so in this new version V3 I tried a new approach: I realised I
can just always set errno without needing the total size, so that
(undocumented) aspect of the interface doesn't change. The point
being that it doesn't matter if you clobber errno with a bogus value
when the write was non-short. Thoughts?

Feels pretty ugly, but I don't see anything outright wrong with that.

Cool. I would consider cleaning up all the callers and get rid of
this ENOSPC stuff in independent work, but I didn't want discussion of
that (eg what external/extension code knows about this API?) to derail
THIS project, hence desire to preserve existing behaviour.

#12

Andres Freund

andres@anarazel.de

about 2 years ago

In reply to: Thomas Munro (#11)

Re: Streaming I/O, vectored I/O (WIP)

Hi,

On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:

On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 29/11/2023 21:39, Thomas Munro wrote:

One thing I wasn't 100% happy with was the treatment of ENOSPC. A few
callers believe that short writes set errno: they error out with a
message including %m. We have historically set errno = ENOSPC inside
FileWrite() if the write size was unexpectedly small AND the kernel
didn't set errno to a non-zero value (having set it to zero ourselves
earlier). In FileWriteV(), I didn't want to do that because it is
expensive to compute the total write size from the vector array and we
managed to measure an effect due to that in some workloads.

Note that the smgr patch actually handles short writes by continuing,
instead of raising an error. Short writes do already occur in the
wild on various systems for various rare technical reasons other than
ENOSPC I have heard (imagine transient failure to acquire some
temporary memory that the kernel chooses not to wait for, stuff like
that, though certainly many people and programs believe they should
not happen[1]), and it seems like a good idea to actually handle them
as our write sizes increase and the probability of short writes might
presumably increase.

Maybe we should bite the bullet and always retry short writes in
FileWriteV(). Is that what you meant by "handling them"?
If the total size is expensive to calculate, how about passing it as an
extra argument? Presumably it is cheap for the callers to calculate at
the same time that they build the iovec array?

It's cheap for md.c, because it already has nblocks_this_segment.
That's one reason I chose to put the retry there. If we push it down
to fd.c in order to be able to help other callers, you're right that
we could pass in the total size (and I guess assert that it's
correct), but that is sort of annoyingly redundant and further from
the interface we're wrapping.

There is another problem with pushing it down to fd.c, though.
Suppose you try to write 8192 bytes, and the kernel says "you wrote
4096 bytes" so your loop goes around again with the second half the
data and now the kernel says "-1, ENOSPC". What are you going to do?
fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
so you'd either have to return -1, ENOSPC (converting short writes
into actual errors, a lie because you did write some data), or return
4096 (and possibly also set errno = ENOSPC as we have always done).
So you can't really handle this problem at this level, can you?
Unless you decide that fd.c should get into the business of raising
errors for I/O failures, which would be a bit of a departure.

That's why I did the retry higher up in md.c.

I think that's the right call. I think for AIO we can't do retry handling
purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
buy us that much in md.c anyway, we still need to handle the cross segment
case and such, from what I can tell?

Greetings,

Andres Freund

#13

Thomas Munro

thomas.munro@gmail.com

about 2 years ago

In reply to: Andres Freund (#12)

Re: Streaming I/O, vectored I/O (WIP)

On Sat, Dec 9, 2023 at 7:25 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:

On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Maybe we should bite the bullet and always retry short writes in
FileWriteV(). Is that what you meant by "handling them"?
If the total size is expensive to calculate, how about passing it as an
extra argument? Presumably it is cheap for the callers to calculate at
the same time that they build the iovec array?

There is another problem with pushing it down to fd.c, though.
Suppose you try to write 8192 bytes, and the kernel says "you wrote
4096 bytes" so your loop goes around again with the second half the
data and now the kernel says "-1, ENOSPC". What are you going to do?
fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
so you'd either have to return -1, ENOSPC (converting short writes
into actual errors, a lie because you did write some data), or return
4096 (and possibly also set errno = ENOSPC as we have always done).
So you can't really handle this problem at this level, can you?
Unless you decide that fd.c should get into the business of raising
errors for I/O failures, which would be a bit of a departure.

That's why I did the retry higher up in md.c.

I think that's the right call. I think for AIO we can't do retry handling
purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
buy us that much in md.c anyway, we still need to handle the cross segment
case and such, from what I can tell?

Heikki, what do you think about this: we could go with the v3 fd.c
and md.c patches, but move adjust_iovec_for_partial_transfer() into
src/common/file_utils.c, so that at least that slightly annoying part
of the job is available for re-use by future code that faces the same
problem?

Note that in file_utils.c we already have pg_pwritev_with_retry(),
which is clearly related to all this: that is a function that
guarantees to either complete the full pwritev() or throw an ERROR,
but leaves it undefined whether any data has been written on ERROR.
It has to add up the size too, and it adjusts the iovec array at the
same time, so it wouldn't use adjust_iovec_for_partial_transfer().
This is essentially the type of interface that I declined to put into
fd.c's FileWrite() and FileRead() because I feel like it doesn't fit
with the existing functions' primary business of adding vfd support to
well known basic I/O functions that return bytes transferred and set
errno. Perhaps someone might later want to introduce File*WithRetry()
wrappers or something if that proves useful? I wouldn't want them for
md.c though because I already know the size.

#14

Heikki Linnakangas

hlinnaka@iki.fi

about 2 years ago

In reply to: Thomas Munro (#13)

Re: Streaming I/O, vectored I/O (WIP)

On 09/12/2023 02:41, Thomas Munro wrote:

On Sat, Dec 9, 2023 at 7:25 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-11-30 13:01:46 +1300, Thomas Munro wrote:

On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Maybe we should bite the bullet and always retry short writes in
FileWriteV(). Is that what you meant by "handling them"?
If the total size is expensive to calculate, how about passing it as an
extra argument? Presumably it is cheap for the callers to calculate at
the same time that they build the iovec array?

There is another problem with pushing it down to fd.c, though.
Suppose you try to write 8192 bytes, and the kernel says "you wrote
4096 bytes" so your loop goes around again with the second half the
data and now the kernel says "-1, ENOSPC". What are you going to do?
fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
so you'd either have to return -1, ENOSPC (converting short writes
into actual errors, a lie because you did write some data), or return
4096 (and possibly also set errno = ENOSPC as we have always done).
So you can't really handle this problem at this level, can you?
Unless you decide that fd.c should get into the business of raising
errors for I/O failures, which would be a bit of a departure.

That's why I did the retry higher up in md.c.

I think that's the right call. I think for AIO we can't do retry handling
purely in fd.c, or at least it'd be quite awkward. It doesn't seem like it'd
buy us that much in md.c anyway, we still need to handle the cross segment
case and such, from what I can tell?

Heikki, what do you think about this: we could go with the v3 fd.c
and md.c patches, but move adjust_iovec_for_partial_transfer() into
src/common/file_utils.c, so that at least that slightly annoying part
of the job is available for re-use by future code that faces the same
problem?

Ok, works for me.

--
Heikki Linnakangas
Neon (https://neon.tech)

#15

Thomas Munro

thomas.munro@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#14)

4 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

On Sat, Dec 9, 2023 at 10:23 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Ok, works for me.

I finished up making a few more improvements:

1. I eventually figured out how to generalise
compute_remaining_iovec() (as I now call it) so that the existing
pg_pwritev_with_retry() in file_utils.c could also use it, so that's
now done in a patch of its own.

2. FileReadV/FileWriteV patch:

* further simplification of the traditional ENOSPC 'guess'
* unconstify() changed to raw cast (pending [1]/messages/by-id/CA+hUKGK3OXFjkOyZiw-DgL2bUqk9by1uGuCnViJX786W+fyDSw@mail.gmail.com)
* fixed the DO_DB()-wrapped debugging code

3. smgrreadv/smgrwritev patch:

* improved ENOSPC handling
* improve description of EOF and ENOSPC handling
* fixed the sizes reported in dtrace static probes
* fixed some words in the docs about that
* changed error messages to refer to "blocks %u..%u"

4. smgrprefetch-with-nblocks patch has no change, hasn't drawn any
comments hopefully because it is uncontroversial.

I'm planning to commit these fairly soon.

[1]: /messages/by-id/CA+hUKGK3OXFjkOyZiw-DgL2bUqk9by1uGuCnViJX786W+fyDSw@mail.gmail.com

Attachments:

v4-0001-Provide-helper-routine-for-partial-vector-I-O-ret.patchapplication/octet-stream; name=v4-0001-Provide-helper-routine-for-partial-vector-I-O-ret.patchDownload

From 559fdd043c9a4de2b8cda3a779a39ccd4af97374 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 11 Dec 2023 15:02:58 +1300
Subject: [PATCH v4 1/8] Provide helper routine for partial vector I/O retry.

compute_remaining_iovec() is a re-usable routine for cases where
pg_readv()/pg_writev() reports a short transfer.  This will gain a new
user in a later patch, but can already replace the open-coded equivalent
code in the existing pg_pwritev_with_retry() function.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/common/file_utils.c         | 80 ++++++++++++++++++++++-----------
 src/include/common/file_utils.h |  5 +++
 2 files changed, 60 insertions(+), 25 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index abe5129412..54ef9c9561 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -581,6 +581,55 @@ get_dirent_type(const char *path,
 	return result;
 }
 
+/*
+ * Compute what remains to be transferred after a possibly partial vectored
+ * read or write.  The region of 'source' that begins after 'transferred'
+ * bytes is written to 'destination', and its length is returned.  'source'
+ * and 'destination' may point to the same array, in which case it is adjusted
+ * in-place; otherwise 'destination' must have enough space for 'iovcnt'
+ * elements.  Returns 0 for a fully completed transfer, though some callers
+ * may already know the total size and be able to avoid calling this function
+ * in that common case.
+ */
+int
+compute_remaining_iovec(struct iovec *destination,
+						const struct iovec *source,
+						int iovcnt,
+						size_t transferred)
+{
+	Assert(iovcnt > 0);
+
+	/* Skip wholly transferred iovecs. */
+	while (source->iov_len <= transferred)
+	{
+		transferred -= source->iov_len;
+		source++;
+		iovcnt--;
+
+		/* All iovecs transferred? */
+		if (iovcnt == 0)
+		{
+			/*
+			 * We don't expect the kernel to transfer more than we think we
+			 * asked for, or something is out of sync.
+			 */
+			Assert(transferred == 0);
+			return 0;
+		}
+	}
+
+	/* Copy the remaining iovecs to the front of the array. */
+	if (source != destination)
+		memmove(destination, source, sizeof(*source) * iovcnt);
+
+	/* Adjust leading iovec, which may have been partially transferred. */
+	Assert(destination->iov_len > transferred);
+	destination->iov_base = (char *) destination->iov_base + transferred;
+	destination->iov_len -= transferred;
+
+	return iovcnt;
+}
+
 /*
  * pg_pwritev_with_retry
  *
@@ -601,7 +650,7 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 		return -1;
 	}
 
-	for (;;)
+	do
 	{
 		/* Write as much as we can. */
 		part = pg_pwritev(fd, iov, iovcnt, offset);
@@ -616,33 +665,14 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 		sum += part;
 		offset += part;
 
-		/* Step over iovecs that are done. */
-		while (iovcnt > 0 && iov->iov_len <= part)
-		{
-			part -= iov->iov_len;
-			++iov;
-			--iovcnt;
-		}
-
-		/* Are they all done? */
-		if (iovcnt == 0)
-		{
-			/* We don't expect the kernel to write more than requested. */
-			Assert(part == 0);
-			break;
-		}
-
 		/*
-		 * Move whatever's left to the front of our mutable copy and adjust
-		 * the leading iovec.
+		 * See what is left.  On the first loop we used the caller's array,
+		 * but later loops we'll use our local copy that we are allowed to
+		 * mutate.
 		 */
-		Assert(iovcnt > 0);
-		memmove(iov_copy, iov, sizeof(*iov) * iovcnt);
-		Assert(iov->iov_len > part);
-		iov_copy[0].iov_base = (char *) iov_copy[0].iov_base + part;
-		iov_copy[0].iov_len -= part;
+		iovcnt = compute_remaining_iovec(iov_copy, iov, iovcnt, part);
 		iov = iov_copy;
-	}
+	} while (iovcnt > 0);
 
 	return sum;
 }
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 3bb20170cb..02a940e310 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -46,6 +46,11 @@ extern PGFileType get_dirent_type(const char *path,
 								  bool look_through_symlinks,
 								  int elevel);
 
+extern int	compute_remaining_iovec(struct iovec *destination,
+									const struct iovec *source,
+									int iovcnt,
+									size_t transferred);
+
 extern ssize_t pg_pwritev_with_retry(int fd,
 									 const struct iovec *iov,
 									 int iovcnt,
-- 
2.39.3 (Apple Git-145)

v4-0002-Provide-vectored-variants-of-FileRead-and-FileWri.patchapplication/octet-stream; name=v4-0002-Provide-vectored-variants-of-FileRead-and-FileWri.patchDownload

From 3f38077d266d089747b468d83d3c936f90ffcabf Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 11 Dec 2023 09:41:27 +1300
Subject: [PATCH v4 2/8] Provide vectored variants of FileRead() and
 FileWrite().

FileReadV() and FileWriteV() adapt pg_preadv() and pg_pwritev() for
fd.c's virtual file descriptors.  The simple FileRead() and FileWrite()
functions are now implemented in terms of the new functions, to avoid
duplicating error and limit logic.

The traditional behavior of reporting a "fake" ENOSPC error is
preserved, but simplified.  It's now always set for non-failing system
calls, for the benefit of callers that raise an error with %m for short
writes.  (We should probably consider getting rid of that expectation.)

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/file/fd.c | 43 ++++++++++++++++++++---------------
 src/include/storage/fd.h      | 32 +++++++++++++++++++++++---
 2 files changed, 54 insertions(+), 21 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index a185fb3d08..17e910da29 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2110,18 +2110,18 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info)
 }
 
 int
-FileRead(File file, void *buffer, size_t amount, off_t offset,
-		 uint32 wait_event_info)
+FileReadV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+		  uint32 wait_event_info)
 {
 	int			returnCode;
 	Vfd		   *vfdP;
 
 	Assert(FileIsValid(file));
 
-	DO_DB(elog(LOG, "FileRead: %d (%s) " INT64_FORMAT " %zu %p",
+	DO_DB(elog(LOG, "FileReadV: %d (%s) " INT64_FORMAT " %d",
 			   file, VfdCache[file].fileName,
 			   (int64) offset,
-			   amount, buffer));
+			   iovcnt));
 
 	returnCode = FileAccess(file);
 	if (returnCode < 0)
@@ -2131,7 +2131,7 @@ FileRead(File file, void *buffer, size_t amount, off_t offset,
 
 retry:
 	pgstat_report_wait_start(wait_event_info);
-	returnCode = pg_pread(vfdP->fd, buffer, amount, offset);
+	returnCode = pg_preadv(vfdP->fd, iov, iovcnt, offset);
 	pgstat_report_wait_end();
 
 	if (returnCode < 0)
@@ -2166,18 +2166,18 @@ retry:
 }
 
 int
-FileWrite(File file, const void *buffer, size_t amount, off_t offset,
-		  uint32 wait_event_info)
+FileWriteV(File file, const struct iovec *iov, int iovcnt, off_t offset,
+		   uint32 wait_event_info)
 {
 	int			returnCode;
 	Vfd		   *vfdP;
 
 	Assert(FileIsValid(file));
 
-	DO_DB(elog(LOG, "FileWrite: %d (%s) " INT64_FORMAT " %zu %p",
+	DO_DB(elog(LOG, "FileWriteV: %d (%s) " INT64_FORMAT " %d",
 			   file, VfdCache[file].fileName,
 			   (int64) offset,
-			   amount, buffer));
+			   iovcnt));
 
 	returnCode = FileAccess(file);
 	if (returnCode < 0)
@@ -2195,7 +2195,10 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 	 */
 	if (temp_file_limit >= 0 && (vfdP->fdstate & FD_TEMP_FILE_LIMIT))
 	{
-		off_t		past_write = offset + amount;
+		off_t		past_write = offset;
+
+		for (int i = 0; i < iovcnt; ++i)
+			past_write += iov[i].iov_len;
 
 		if (past_write > vfdP->fileSize)
 		{
@@ -2211,23 +2214,27 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
 	}
 
 retry:
-	errno = 0;
 	pgstat_report_wait_start(wait_event_info);
-	returnCode = pg_pwrite(VfdCache[file].fd, buffer, amount, offset);
+	returnCode = pg_pwritev(vfdP->fd, iov, iovcnt, offset);
 	pgstat_report_wait_end();
 
-	/* if write didn't set errno, assume problem is no disk space */
-	if (returnCode != amount && errno == 0)
-		errno = ENOSPC;
-
 	if (returnCode >= 0)
 	{
+		/*
+		 * Some callers expect short writes to set errno, and traditionally we
+		 * have assumed that they imply disk space shortage.  We don't want to
+		 * waste CPU cycles adding up the total size here, so we'll just set
+		 * errno to ENOSPC for any successful write (partial or full) in case
+		 * such a caller determines that the write is short.
+		 */
+		errno = ENOSPC;
+
 		/*
 		 * Maintain fileSize and temporary_files_size if it's a temp file.
 		 */
 		if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
 		{
-			off_t		past_write = offset + amount;
+			off_t		past_write = offset + returnCode;
 
 			if (past_write > vfdP->fileSize)
 			{
@@ -2239,7 +2246,7 @@ retry:
 	else
 	{
 		/*
-		 * See comments in FileRead()
+		 * See comments in FileReadV()
 		 */
 #ifdef WIN32
 		DWORD		error = GetLastError();
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index d9d5d9da5f..f885d0ef15 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -15,7 +15,7 @@
 /*
  * calls:
  *
- *	File {Close, Read, Write, Size, Sync}
+ *	File {Close, Read, ReadV, Write, WriteV, Size, Sync}
  *	{Path Name Open, Allocate, Free} File
  *
  * These are NOT JUST RENAMINGS OF THE UNIX ROUTINES.
@@ -43,6 +43,8 @@
 #ifndef FD_H
 #define FD_H
 
+#include "port/pg_iovec.h"
+
 #include <dirent.h>
 #include <fcntl.h>
 
@@ -105,8 +107,8 @@ extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fil
 extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
 extern int	FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info);
-extern int	FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
-extern int	FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
+extern int	FileReadV(File file, const struct iovec *ioc, int iovcnt, off_t offset, uint32 wait_event_info);
+extern int	FileWriteV(File file, const struct iovec *ioc, int iovcnt, off_t offset, uint32 wait_event_info);
 extern int	FileSync(File file, uint32 wait_event_info);
 extern int	FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
 extern int	FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
@@ -189,4 +191,28 @@ extern int	durable_unlink(const char *fname, int elevel);
 extern void SyncDataDirectory(void);
 extern int	data_sync_elevel(int elevel);
 
+static inline int
+FileRead(File file, void *buffer, size_t amount, off_t offset,
+		 uint32 wait_event_info)
+{
+	struct iovec iov = {
+		.iov_base = buffer,
+		.iov_len = amount
+	};
+
+	return FileReadV(file, &iov, 1, offset, wait_event_info);
+}
+
+static inline int
+FileWrite(File file, const void *buffer, size_t amount, off_t offset,
+		  uint32 wait_event_info)
+{
+	struct iovec iov = {
+		.iov_base = (void *) buffer,
+		.iov_len = amount
+	};
+
+	return FileWriteV(file, &iov, 1, offset, wait_event_info);
+}
+
 #endif							/* FD_H */
-- 
2.39.3 (Apple Git-145)

v4-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patchapplication/octet-stream; name=v4-0003-Provide-vectored-variants-of-smgrread-and-smgrwri.patchDownload

From 38ad5ab077cdbba64da8ea019293030b81b8394e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 19 Jul 2023 14:50:41 +1200
Subject: [PATCH v4 3/8] Provide vectored variants of smgrread() and
 smgrwrite().

smgrreadv() and smgrwritev() map to FileReadV() and FileWriteV(), and
thence the OS's vector I/O routines, after dealing with segmentation.
These functions allow for contiguous regions of relation files to be
transferred into and out of non-neighboring buffers with a single system
call.  The traditional smgrread() and smgrwrite() functions are
implemented in terms of the new functions.

The md.c implementation automatically handles short transfers by
retrying, where previously it would fail.  This is in line with what
xlog.c does with its writes, based on the suspicion that larger I/Os
might be more likely to be partially completed.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 doc/src/sgml/monitoring.sgml    |   4 +-
 src/backend/storage/smgr/md.c   | 327 ++++++++++++++++++++++----------
 src/backend/storage/smgr/smgr.c |  37 ++--
 src/include/storage/md.h        |   9 +-
 src/include/storage/smgr.h      |  25 ++-
 5 files changed, 280 insertions(+), 122 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 42509042ad..4f8058d8b1 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6868,7 +6868,7 @@ FROM pg_stat_get_backend_idset() AS backendid;
       arg5 is the ID of the backend which created the temporary relation for a
       local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared buffer.
       arg6 is the number of bytes actually read, while arg7 is the number
-      requested (if these are different it indicates trouble).</entry>
+      requested (if these are different it indicates a short read).</entry>
     </row>
     <row>
      <entry><literal>smgr-md-write-start</literal></entry>
@@ -6890,7 +6890,7 @@ FROM pg_stat_get_backend_idset() AS backendid;
       arg5 is the ID of the backend which created the temporary relation for a
       local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared buffer.
       arg6 is the number of bytes actually written, while arg7 is the number
-      requested (if these are different it indicates trouble).</entry>
+      requested (if these are different it indicates a short write).</entry>
     </row>
     <row>
      <entry><literal>sort-start</literal></entry>
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index fdecbad170..8e58637792 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,6 +28,7 @@
 #include "access/xlog.h"
 #include "access/xlogutils.h"
 #include "commands/tablespace.h"
+#include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
@@ -452,7 +453,7 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
 /*
  * mdextend() -- Add a block to the specified relation.
  *
- * The semantics are nearly the same as mdwrite(): write at the
+ * The semantics are nearly the same as mdwritev(): write at the
  * specified position.  However, this is to be used for the case of
  * extending a relation (i.e., blocknum is at or beyond the current
  * EOF).  Note that we assume writing a block beyond current EOF
@@ -737,138 +738,274 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 }
 
 /*
- * mdread() -- Read the specified block from a relation.
+ * Convert an array of buffer address into an array of iovec objects, and
+ * return the number that were required.  'iov' must have enough space for up
+ * to 'nblocks' elements, but the number used may be less depending on
+ * merging.  In the case of a run of fully contiguous buffers, a single iovec
+ * will be populated that can be handled as a plain non-vectored I/O.
  */
-void
-mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-	   void *buffer)
+static int
+buffers_to_iovec(struct iovec *iov, void **buffers, int nblocks)
 {
-	off_t		seekpos;
-	int			nbytes;
-	MdfdVec    *v;
+	struct iovec *iovp;
+	int			iovcnt;
 
-	/* If this build supports direct I/O, the buffer must be I/O aligned. */
-	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
-		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
-
-	TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber,
-										reln->smgr_rlocator.backend);
+	Assert(nblocks >= 1);
 
-	v = _mdfd_getseg(reln, forknum, blocknum, false,
-					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+	/* If this build supports direct I/O, buffers must be I/O aligned. */
+	for (int i = 0; i < nblocks; ++i)
+	{
+		if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
+			Assert((uintptr_t) buffers[i] ==
+				   TYPEALIGN(PG_IO_ALIGN_SIZE, buffers[i]));
+	}
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+	/* Start the first iovec off with the first buffer. */
+	iovp = &iov[0];
+	iovp->iov_base = buffers[0];
+	iovp->iov_len = BLCKSZ;
+	iovcnt = 1;
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+	/* Try to merge the rest. */
+	for (int i = 1; i < nblocks; ++i)
+	{
+		void	   *buffer = buffers[i];
 
-	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_READ);
+		if (((char *) iovp->iov_base + iovp->iov_len) == buffer)
+		{
+			/* Contiguous with the last iovec. */
+			iovp->iov_len += BLCKSZ;
+		}
+		else
+		{
+			/* Need a new iovec. */
+			iovp++;
+			iovp->iov_base = buffer;
+			iovp->iov_len = BLCKSZ;
+			iovcnt++;
+		}
+	}
 
-	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
-									   reln->smgr_rlocator.locator.spcOid,
-									   reln->smgr_rlocator.locator.dbOid,
-									   reln->smgr_rlocator.locator.relNumber,
-									   reln->smgr_rlocator.backend,
-									   nbytes,
-									   BLCKSZ);
+	return iovcnt;
+}
 
-	if (nbytes != BLCKSZ)
+/*
+ * mdreadv() -- Read the specified blocks from a relation.
+ */
+void
+mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		void **buffers, BlockNumber nblocks)
+{
+	while (nblocks > 0)
 	{
-		if (nbytes < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not read block %u in file \"%s\": %m",
-							blocknum, FilePathName(v->mdfd_vfd))));
+		struct iovec iov[PG_IOV_MAX];
+		int			iovcnt;
+		off_t		seekpos;
+		int			nbytes;
+		MdfdVec    *v;
+		BlockNumber nblocks_this_segment;
+		size_t		transferred_this_segment;
+		size_t		size_this_segment;
+
+		v = _mdfd_getseg(reln, forknum, blocknum, false,
+						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+		nblocks_this_segment =
+			Min(nblocks,
+				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+		nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
+
+		iovcnt = buffers_to_iovec(iov, buffers, nblocks_this_segment);
+		size_this_segment = nblocks_this_segment * BLCKSZ;
+		transferred_this_segment = 0;
 
 		/*
-		 * Short read: we are at or past EOF, or we read a partial block at
-		 * EOF.  Normally this is an error; upper levels should never try to
-		 * read a nonexistent block.  However, if zero_damaged_pages is ON or
-		 * we are InRecovery, we should instead return zeroes without
-		 * complaining.  This allows, for example, the case of trying to
-		 * update a block that was later truncated away.
+		 * Inner loop to continue after a short read.  We'll keep going until
+		 * we hit EOF rather than assuming that a short read means we hit the
+		 * end.
 		 */
-		if (zero_damaged_pages || InRecovery)
-			MemSet(buffer, 0, BLCKSZ);
-		else
-			ereport(ERROR,
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
-							blocknum, FilePathName(v->mdfd_vfd),
-							nbytes, BLCKSZ)));
+		for (;;)
+		{
+			TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
+												reln->smgr_rlocator.locator.spcOid,
+												reln->smgr_rlocator.locator.dbOid,
+												reln->smgr_rlocator.locator.relNumber,
+												reln->smgr_rlocator.backend);
+			nbytes = FileReadV(v->mdfd_vfd, iov, iovcnt, seekpos,
+							   WAIT_EVENT_DATA_FILE_READ);
+			TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
+											   reln->smgr_rlocator.locator.spcOid,
+											   reln->smgr_rlocator.locator.dbOid,
+											   reln->smgr_rlocator.locator.relNumber,
+											   reln->smgr_rlocator.backend,
+											   nbytes,
+											   size_this_segment - transferred_this_segment);
+
+#ifdef SIMULATE_SHORT_READ
+			nbytes = Min(nbytes, 4096);
+#endif
+
+			if (nbytes < 0)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not read blocks %u..%u in file \"%s\": %m",
+								blocknum,
+								blocknum + nblocks_this_segment - 1,
+								FilePathName(v->mdfd_vfd))));
+
+			if (nbytes == 0)
+			{
+				/*
+				 * We are at or past EOF, or we read a partial block at EOF.
+				 * Normally this is an error; upper levels should never try to
+				 * read a nonexistent block.  However, if zero_damaged_pages
+				 * is ON or we are InRecovery, we should instead return zeroes
+				 * without complaining.  This allows, for example, the case of
+				 * trying to update a block that was later truncated away.
+				 */
+				if (zero_damaged_pages || InRecovery)
+				{
+					for (BlockNumber i = transferred_this_segment / BLCKSZ;
+						 i < nblocks_this_segment;
+						 ++i)
+						memset(buffers[i], 0, BLCKSZ);
+					break;
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("could not read blocks %u..%u in file \"%s\": read only %zu of %zu bytes",
+									blocknum,
+									blocknum + nblocks_this_segment - 1,
+									FilePathName(v->mdfd_vfd),
+									transferred_this_segment,
+									size_this_segment)));
+			}
+
+			/* One loop should usually be enough. */
+			transferred_this_segment += nbytes;
+			Assert(transferred_this_segment <= size_this_segment);
+			if (transferred_this_segment == size_this_segment)
+				break;
+
+			/* Adjust position and vectors after a short read. */
+			seekpos += nbytes;
+			iovcnt = compute_remaining_iovec(iov, iov, iovcnt, nbytes);
+		}
+
+		nblocks -= nblocks_this_segment;
+		buffers += nblocks_this_segment;
+		blocknum += nblocks_this_segment;
 	}
 }
 
 /*
- * mdwrite() -- Write the supplied block at the appropriate location.
+ * mdwritev() -- Write the supplied blocks at the appropriate location.
  *
  * This is to be used only for updating already-existing blocks of a
  * relation (ie, those before the current EOF).  To extend a relation,
  * use mdextend().
  */
 void
-mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		const void *buffer, bool skipFsync)
+mdwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
-	off_t		seekpos;
-	int			nbytes;
-	MdfdVec    *v;
-
-	/* If this build supports direct I/O, the buffer must be I/O aligned. */
-	if (PG_O_DIRECT != 0 && PG_IO_ALIGN_SIZE <= BLCKSZ)
-		Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer));
-
 	/* This assert is too expensive to have on normally ... */
 #ifdef CHECK_WRITE_VS_EXTEND
 	Assert(blocknum < mdnblocks(reln, forknum));
 #endif
 
-	TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum,
-										 reln->smgr_rlocator.locator.spcOid,
-										 reln->smgr_rlocator.locator.dbOid,
-										 reln->smgr_rlocator.locator.relNumber,
-										 reln->smgr_rlocator.backend);
+	while (nblocks > 0)
+	{
+		struct iovec iov[PG_IOV_MAX];
+		int			iovcnt;
+		off_t		seekpos;
+		int			nbytes;
+		MdfdVec    *v;
+		BlockNumber nblocks_this_segment;
+		size_t		transferred_this_segment;
+		size_t		size_this_segment;
 
-	v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
-					 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
+		v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
+						 EXTENSION_FAIL | EXTENSION_CREATE_RECOVERY);
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
 
-	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_WRITE);
+		nblocks_this_segment =
+			Min(nblocks,
+				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+		nblocks_this_segment = Min(nblocks_this_segment, lengthof(iov));
 
-	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber,
-										reln->smgr_rlocator.backend,
-										nbytes,
-										BLCKSZ);
+		iovcnt = buffers_to_iovec(iov, (void **) buffers, nblocks_this_segment);
+		size_this_segment = nblocks_this_segment * BLCKSZ;
+		transferred_this_segment = 0;
 
-	if (nbytes != BLCKSZ)
-	{
-		if (nbytes < 0)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not write block %u in file \"%s\": %m",
-							blocknum, FilePathName(v->mdfd_vfd))));
-		/* short write: complain appropriately */
-		ereport(ERROR,
-				(errcode(ERRCODE_DISK_FULL),
-				 errmsg("could not write block %u in file \"%s\": wrote only %d of %d bytes",
-						blocknum,
-						FilePathName(v->mdfd_vfd),
-						nbytes, BLCKSZ),
-				 errhint("Check free disk space.")));
-	}
+		/*
+		 * Inner loop to continue after a short write.  If the reason is that
+		 * we're out of disk space, a future attempt should get an ENOSPC
+		 * error from the kernel.
+		 */
+		for (;;)
+		{
+			TRACE_POSTGRESQL_SMGR_MD_WRITE_START(forknum, blocknum,
+												 reln->smgr_rlocator.locator.spcOid,
+												 reln->smgr_rlocator.locator.dbOid,
+												 reln->smgr_rlocator.locator.relNumber,
+												 reln->smgr_rlocator.backend);
+			nbytes = FileWriteV(v->mdfd_vfd, iov, iovcnt, seekpos,
+								WAIT_EVENT_DATA_FILE_WRITE);
+			TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
+												reln->smgr_rlocator.locator.spcOid,
+												reln->smgr_rlocator.locator.dbOid,
+												reln->smgr_rlocator.locator.relNumber,
+												reln->smgr_rlocator.backend,
+												nbytes,
+												size_this_segment - transferred_this_segment);
+
+#ifdef SIMULATE_SHORT_WRITE
+			nbytes = Min(nbytes, 4096);
+#endif
 
-	if (!skipFsync && !SmgrIsTemp(reln))
-		register_dirty_segment(reln, forknum, v);
+			if (nbytes < 0)
+			{
+				bool		enospc = errno == ENOSPC;
+
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write blocks %u..%u in file \"%s\": %m",
+								blocknum,
+								blocknum + nblocks_this_segment - 1,
+								FilePathName(v->mdfd_vfd)),
+						 enospc ? errhint("Check free disk space.") : 0));
+			}
+
+			/* One loop should usually be enough. */
+			transferred_this_segment += nbytes;
+			Assert(transferred_this_segment <= size_this_segment);
+			if (transferred_this_segment == size_this_segment)
+				break;
+
+			/* Adjust position and iovecs after a short write. */
+			seekpos += nbytes;
+			iovcnt = compute_remaining_iovec(iov, iov, iovcnt, nbytes);
+		}
+
+		if (!skipFsync && !SmgrIsTemp(reln))
+			register_dirty_segment(reln, forknum, v);
+
+		nblocks -= nblocks_this_segment;
+		buffers += nblocks_this_segment;
+		blocknum += nblocks_this_segment;
+	}
 }
 
+
 /*
  * mdwriteback() -- Tell the kernel to write pages back to storage.
  *
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4c55264933..15633838c9 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -55,10 +55,13 @@ typedef struct f_smgr
 									BlockNumber blocknum, int nblocks, bool skipFsync);
 	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber blocknum);
-	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
-							  BlockNumber blocknum, void *buffer);
-	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
-							   BlockNumber blocknum, const void *buffer, bool skipFsync);
+	void		(*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
+							   BlockNumber blocknum,
+							   void **buffers, BlockNumber nblocks);
+	void		(*smgr_writev) (SMgrRelation reln, ForkNumber forknum,
+								BlockNumber blocknum,
+								const void **buffers, BlockNumber nblocks,
+								bool skipFsync);
 	void		(*smgr_writeback) (SMgrRelation reln, ForkNumber forknum,
 								   BlockNumber blocknum, BlockNumber nblocks);
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -80,8 +83,8 @@ static const f_smgr smgrsw[] = {
 		.smgr_extend = mdextend,
 		.smgr_zeroextend = mdzeroextend,
 		.smgr_prefetch = mdprefetch,
-		.smgr_read = mdread,
-		.smgr_write = mdwrite,
+		.smgr_readv = mdreadv,
+		.smgr_writev = mdwritev,
 		.smgr_writeback = mdwriteback,
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
@@ -553,22 +556,23 @@ smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
 }
 
 /*
- * smgrread() -- read a particular block from a relation into the supplied
- *				 buffer.
+ * smgrreadv() -- read a particular block range from a relation into the
+ *				 supplied buffers.
  *
  * This routine is called from the buffer manager in order to
  * instantiate pages in the shared buffer cache.  All storage managers
  * return pages in the format that POSTGRES expects.
  */
 void
-smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 void *buffer)
+smgrreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		  void **buffers, BlockNumber nblocks)
 {
-	smgrsw[reln->smgr_which].smgr_read(reln, forknum, blocknum, buffer);
+	smgrsw[reln->smgr_which].smgr_readv(reln, forknum, blocknum, buffers,
+										nblocks);
 }
 
 /*
- * smgrwrite() -- Write the supplied buffer out.
+ * smgrwritev() -- Write the supplied buffers out.
  *
  * This is to be used only for updating already-existing blocks of a
  * relation (ie, those before the current EOF).  To extend a relation,
@@ -583,14 +587,13 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * do not require fsync.
  */
 void
-smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		  const void *buffer, bool skipFsync)
+smgrwritev(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		   const void **buffers, BlockNumber nblocks, bool skipFsync)
 {
-	smgrsw[reln->smgr_which].smgr_write(reln, forknum, blocknum,
-										buffer, skipFsync);
+	smgrsw[reln->smgr_which].smgr_writev(reln, forknum, blocknum,
+										 buffers, nblocks, skipFsync);
 }
 
-
 /*
  * smgrwriteback() -- Trigger kernel writeback for the supplied range of
  *					   blocks.
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 941879ee6a..0f73d9af49 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -32,10 +32,11 @@ extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum, int nblocks, bool skipFsync);
 extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber blocknum);
-extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-				   void *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
-					BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+					void **buffers, BlockNumber nblocks);
+extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
+					 BlockNumber blocknum,
+					 const void **buffers, BlockNumber nblocks, bool skipFsync);
 extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
 						BlockNumber blocknum, BlockNumber nblocks);
 extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..ae93910132 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -96,10 +96,13 @@ extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
 						   BlockNumber blocknum, int nblocks, bool skipFsync);
 extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum);
-extern void smgrread(SMgrRelation reln, ForkNumber forknum,
-					 BlockNumber blocknum, void *buffer);
-extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
-					  BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
+					  BlockNumber blocknum,
+					  void **buffer, BlockNumber nblocks);
+extern void smgrwritev(SMgrRelation reln, ForkNumber forknum,
+					   BlockNumber blocknum,
+					   const void **buffer, BlockNumber nblocks,
+					   bool skipFsync);
 extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
 						  BlockNumber blocknum, BlockNumber nblocks);
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -110,4 +113,18 @@ extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void AtEOXact_SMgr(void);
 extern bool ProcessBarrierSmgrRelease(void);
 
+static inline void
+smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 void *buffer)
+{
+	smgrreadv(reln, forknum, blocknum, &buffer, 1);
+}
+
+static inline void
+smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		  const void *buffer, bool skipFsync)
+{
+	smgrwritev(reln, forknum, blocknum, &buffer, 1, skipFsync);
+}
+
 #endif							/* SMGR_H */
-- 
2.39.3 (Apple Git-145)

v4-0004-Provide-multi-block-smgrprefetch.patchapplication/octet-stream; name=v4-0004-Provide-multi-block-smgrprefetch.patchDownload

From 84221895fdcaa13b51f7839def1347924cab4246 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 08:53:15 +1200
Subject: [PATCH v4 4/8] Provide multi-block smgrprefetch().

Previously smgrprefetch() would issue POSIX_FADV_WILLNEED advice for a
single block at a time.  Add an nblocks argument so that we can do the
same for a range of blocks.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Author: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   |  2 +-
 src/backend/storage/buffer/localbuf.c |  2 +-
 src/backend/storage/smgr/md.c         | 36 +++++++++++++++++++--------
 src/backend/storage/smgr/smgr.c       |  7 +++---
 src/include/storage/md.h              |  2 +-
 src/include/storage/smgr.h            |  2 +-
 6 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f7c67d504c..de8b5edbfb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -569,7 +569,7 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln,
 		 * recovery if the relation file doesn't exist.
 		 */
 		if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
-			smgrprefetch(smgr_reln, forkNum, blockNum))
+			smgrprefetch(smgr_reln, forkNum, blockNum, 1))
 		{
 			result.initiated_io = true;
 		}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index aebcf146b4..c75ed184e0 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -94,7 +94,7 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
 #ifdef USE_PREFETCH
 		/* Not in buffers, so initiate prefetch */
 		if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
-			smgrprefetch(smgr, forkNum, blockNum))
+			smgrprefetch(smgr, forkNum, blockNum, 1))
 		{
 			result.initiated_io = true;
 		}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 8e58637792..48e86b38c6 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -711,27 +711,41 @@ mdclose(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- * mdprefetch() -- Initiate asynchronous read of the specified block of a relation
+ * mdprefetch() -- Initiate asynchronous read of the specified blocks of a relation
  */
 bool
-mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		   int nblocks)
 {
 #ifdef USE_PREFETCH
-	off_t		seekpos;
-	MdfdVec    *v;
 
 	Assert((io_direct_flags & IO_DIRECT_DATA) == 0);
 
-	v = _mdfd_getseg(reln, forknum, blocknum, false,
-					 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
-	if (v == NULL)
-		return false;
+	while (nblocks > 0)
+	{
+		off_t		seekpos;
+		MdfdVec    *v;
+		int			nblocks_this_segment;
 
-	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
+		v = _mdfd_getseg(reln, forknum, blocknum, false,
+						 InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL);
+		if (v == NULL)
+			return false;
 
-	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+		seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));
 
-	(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ, WAIT_EVENT_DATA_FILE_PREFETCH);
+		Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+		nblocks_this_segment =
+			Min(nblocks,
+				RELSEG_SIZE - (blocknum % ((BlockNumber) RELSEG_SIZE)));
+
+		(void) FilePrefetch(v->mdfd_vfd, seekpos, BLCKSZ * nblocks_this_segment,
+							WAIT_EVENT_DATA_FILE_PREFETCH);
+
+		blocknum += nblocks_this_segment;
+		nblocks -= nblocks_this_segment;
+	}
 #endif							/* USE_PREFETCH */
 
 	return true;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 15633838c9..7ce31fd30a 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -54,7 +54,7 @@ typedef struct f_smgr
 	void		(*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
 									BlockNumber blocknum, int nblocks, bool skipFsync);
 	bool		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
-								  BlockNumber blocknum);
+								  BlockNumber blocknum, int nblocks);
 	void		(*smgr_readv) (SMgrRelation reln, ForkNumber forknum,
 							   BlockNumber blocknum,
 							   void **buffers, BlockNumber nblocks);
@@ -550,9 +550,10 @@ smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
  * record).
  */
 bool
-smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+			 int nblocks)
 {
-	return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum);
+	return smgrsw[reln->smgr_which].smgr_prefetch(reln, forknum, blocknum, nblocks);
 }
 
 /*
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 0f73d9af49..3dc651ec17 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -31,7 +31,7 @@ extern void mdextend(SMgrRelation reln, ForkNumber forknum,
 extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum, int nblocks, bool skipFsync);
 extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
-					   BlockNumber blocknum);
+					   BlockNumber blocknum, int nblocks);
 extern void mdreadv(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 					void **buffers, BlockNumber nblocks);
 extern void mdwritev(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index ae93910132..6e62dc053c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -95,7 +95,7 @@ extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
 extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
 						   BlockNumber blocknum, int nblocks, bool skipFsync);
 extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
-						 BlockNumber blocknum);
+						 BlockNumber blocknum, int nblocks);
 extern void smgrreadv(SMgrRelation reln, ForkNumber forknum,
 					  BlockNumber blocknum,
 					  void **buffer, BlockNumber nblocks);
-- 
2.39.3 (Apple Git-145)

#16

Heikki Linnakangas

hlinnaka@iki.fi

about 2 years ago

In reply to: Thomas Munro (#15)

Re: Streaming I/O, vectored I/O (WIP)

On 11/12/2023 11:12, Thomas Munro wrote:

1. I eventually figured out how to generalise
compute_remaining_iovec() (as I now call it) so that the existing
pg_pwritev_with_retry() in file_utils.c could also use it, so that's
now done in a patch of its own.

In compute_remaining_iovec():

'source' and 'destination' may point to the same array, in which
case it is adjusted in-place; otherwise 'destination' must have enough
space for 'iovcnt' elements.

Is there any use case for not adjusting it in place?
pg_pwritev_with_retry() takes a const iovec array, but maybe just remove
the 'const' and document that it scribbles on it?

I'm planning to commit these fairly soon.

--
Heikki Linnakangas
Neon (https://neon.tech)

#17

Thomas Munro

thomas.munro@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#16)

Re: Streaming I/O, vectored I/O (WIP)

On Mon, Dec 11, 2023 at 10:28 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 11/12/2023 11:12, Thomas Munro wrote:

1. I eventually figured out how to generalise
compute_remaining_iovec() (as I now call it) so that the existing
pg_pwritev_with_retry() in file_utils.c could also use it, so that's
now done in a patch of its own.

In compute_remaining_iovec():

'source' and 'destination' may point to the same array, in which
case it is adjusted in-place; otherwise 'destination' must have enough
space for 'iovcnt' elements.

Is there any use case for not adjusting it in place?
pg_pwritev_with_retry() takes a const iovec array, but maybe just remove
the 'const' and document that it scribbles on it?

I guess I just wanted to preserve pg_pwritev_with_retry()'s existing
prototype, primarily because it matches standard pwritev()/writev().

#18

Cédric Villemain

cedric.villemain+pgsql@abcsql.com

about 2 years ago

In reply to: Thomas Munro (#15)

Re: Streaming I/O, vectored I/O (WIP)

Le 11/12/2023 à 10:12, Thomas Munro a écrit :

3. smgrreadv/smgrwritev patch:

* improved ENOSPC handling
* improve description of EOF and ENOSPC handling
* fixed the sizes reported in dtrace static probes
* fixed some words in the docs about that
* changed error messages to refer to "blocks %u..%u"

4. smgrprefetch-with-nblocks patch has no change, hasn't drawn any
comments hopefully because it is uncontroversial.

I'm planning to commit these fairly soon.

Thanks, very useful additions.
Not sure what you have already done to come next...

I have 2 smalls patches here:
* to use range prefetch in pg_prewarm (smgrprefetch only at the moment,
using smgrreadv to come next).
* to support nblocks=0 in smgrprefetch (posix_fadvise supports a len=0
to apply flag from offset to end of file).

Should I add to commitfest ?
---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D

#19

Melanie Plageman

melanieplageman@gmail.com

about 2 years ago

In reply to: Thomas Munro (#5)

Re: Streaming I/O, vectored I/O (WIP)

I've written a new version of the vacuum streaming read user on top of
the rebased patch set [1]https://github.com/melanieplageman/postgres/tree/stepwise_vac_streaming_read. It differs substantially from Andres' and
includes several refactoring patches that can apply on top of master.
As such, I've proposed those in a separate thread [2]/messages/by-id/CAAKRu_Yf3gvXGcCnqqfoq0Q8LX8UM-e-qbm_B1LeZh60f8WhWA@mail.gmail.com. I noticed mac
and windows fail to build on CI for my branch with the streaming read
code. I haven't had a chance to investigate -- but I must have done
something wrong on rebase.

- Melanie

[1]: https://github.com/melanieplageman/postgres/tree/stepwise_vac_streaming_read
[2]: /messages/by-id/CAAKRu_Yf3gvXGcCnqqfoq0Q8LX8UM-e-qbm_B1LeZh60f8WhWA@mail.gmail.com

#20

Thomas Munro

thomas.munro@gmail.com

about 2 years ago

In reply to: Melanie Plageman (#8)

10 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

On Wed, Nov 29, 2023 at 2:21 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Thanks for posting a new version. I've included a review of 0004.

Thanks! I committed the patches up as far as smgr.c before the
holidays. The next thing to commit would be the bufmgr.c one for
vectored ReadBuffer(), v5-0001. Here's my response to your review of
that, which triggered quite a few changes.

See also new version of the streaming_read.c patch, with change list
at end. (I'll talk about v5-0002, the SMgrRelation lifetime one, over
on the separate thread about that where Heikki posted a better
version.)

ot -> to

Fixed.

- if (found)
- pgBufferUsage.shared_blks_hit++;
- else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
- mode == RBM_ZERO_ON_ERROR)
- pgBufferUsage.shared_blks_read++;

You've lost this test in your new version. You can do the same thing
(avoid counting zeroed buffers as blocks read) by moving this
pgBufferUsage.shared/local_blks_read++ back into ReadBuffer_common()
where you know if you called ZeroBuffer() or CompleteReadBuffers().

Yeah, right.

After thinking about that some more, that's true and that placement
would be good, but only if we look just at this patch, where we are
chopping ReadBuffer() into two parts (PrepareReadBuffer() and
CompleteReadBuffers()), and then putting them back together again.
However, soon we'll want to use the two functions separately, and we
won't call ReadBuffer[_common]().

New idea: PrepareReadBuffer() can continue to be in charge of bumping
{local,shared}_blks_hit, but {local,shared}_blks_read++ can happen in
CompleteReadBuffers(). There is no counter for zeroed buffers, but if
there ever is in the future, it can go into ZeroBuffer().

In this version, I've moved that into CompleteReadBuffers(), along
with a new comment to explain a pre-existing deficiency in the whole
scheme: there is a race where you finish up counting a read but
someone else actually does the read, and also counts it. I'm trying
to preserve the existing bean counting logic to the extent possible
across this refactoring.

+     }
+     else
+     {
+             bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+                                                      strategy, found, allocated, io_context);
+             if (*found)
+                     pgBufferUsage.shared_blks_hit++;
+             else
+                     pgBufferUsage.shared_blks_read++;
+     }
+     if (bmr.rel)
+     {
+             pgstat_count_buffer_read(bmr.rel);
This is double-counting reads. You've left the call in
ReadBufferExtended() as well as adding this here. It should be fine to
remove it from ReadBufferExtended(). Because you test bmr.rel, leaving
the call here in PrepareReadBuffer() wouldn't have an effect on
ReadBuffer_common() callers who don't pass a relation (like recovery).
The other current callers of ReadBuffer_common() (by way of
ExtendBufferedRelTo()) who do pass a relation are visibility map and
freespace map extension, and I don't think we track relation stats for
the VM and FSM.

Oh yeah. Right. Fixed.

This does continue the practice of counting zeroed buffers as reads in
table-level stats. But, that is the same as master.

Right. It is a little strange that pgstast_count_buffer_read()
finishes up in a different location than
pgBufferUsage.{local,shared}_blks_read++, but that's precisely due to
this pre-existing difference in accounting policy. That generally
seems like POLA failure, so I've added a comment to help us remember
about that, for another day.

+             io_start = pgstat_prepare_io_time();
+             smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+             pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start, 1);
I'd pass io_buffers_len as cnt to pgstat_count_io_op_time(). op_bytes
will be BLCKSZ and multiplying that by the number of reads should
produce the number of bytes read.

OK, thanks, fixed.

BufferDesc *
LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
-                              bool *foundPtr)
+                              bool *foundPtr, bool *allocPtr)
{
BufferTag       newTag;                 /* identity of requested block */
LocalBufferLookupEnt *hresult;
@@ -144,6 +144,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
Assert(BufferTagsEqual(&bufHdr->tag, &newTag));
*foundPtr = PinLocalBuffer(bufHdr, true);
+ *allocPtr = false;
...

I would prefer you use consistent naming for
allocPtr/allocatedPtr/allocated. I also think that all the functions
taking it as an output argument should explain what it is
(BufferAlloc()/LocalBufferAlloc(), etc). I found myself doing a bit of
digging around to figure it out. You have a nice comment about it above
PrepareReadBuffer(). I think you may need to resign yourself to
restating that bit (or some version of it) for all of the functions
taking it as an argument.

I worked on addressing your complaints, but while doing so I had
second thoughts about even having that argument. The need for it came
up while working on streaming recovery. Without it, whenever there
were repeated references to the same block (as happens very often in
the WAL), it'd issue extra useless POSIX_FADV_WILLNEED syscalls, so I
wanted to find a way to distinguish the *first* miss for a given
block, to be able to issue the advice just once before the actual read
happens.

But now I see that other users of streaming reads probably won't
repeatedly stream the same block, and I am not proposing streaming
recovery for PG 17. Time to simplify. I decided to kick allocatedPtr
out to think about some more, but I really hope we'll be able to start
a real background I/O instead of issuing advice in PG 18 proposals,
and that'll be negotiated via IO_IN_PROGRESS, so then we'd never need
allocatedPtr.

Thus, removed.

#ifndef BUFMGR_H
#define BUFMGR_H

+#include "pgstat.h"

I don't know what we are supposed to do, but I would have included this
in bufmgr.c (where I actually needed it) instead of including it here.

Fixed.

+#include "port/pg_iovec.h"
#include "storage/block.h"
#include "storage/buf.h"
#include "storage/bufpage.h"
@@ -47,6 +49,8 @@ typedef enum
RBM_ZERO_AND_CLEANUP_LOCK, /* Like RBM_ZERO_AND_LOCK, but locks the page
* in "cleanup" mode */
RBM_ZERO_ON_ERROR, /* Read, but return an all-zeros page on error */
+     RBM_WILL_ZERO,                          /* Don't read from disk, caller will call
+                                                              * ZeroBuffer() */
It's confusing that this (RBM_WILL_ZERO) is part of this commit since it
isn't used in this commit.

Yeah. Removed.

Bikeshedding call: I am open to better suggestions for the names
PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
grammatically clumsy.

I now also have a much simplified version of the streaming read patch.
The short version is that it has all advanced features removed, so
that now it *only* does the clustering required to build up large
CompleteReadBuffers() calls. That's short and sweet, and enough for
pg_prewarm to demonstrate 128KB reads, a good first step.

Then WILLNEED advice, and then ramp-up are added as separate patches,
for easier review. I've got some more patches in development that
would re-add "extended" multi-relation mode with wider callback that
can also stream zeroed buffers, as required so far only by recovery --
but I can propose that later.

Other improvements:

* Melanie didn't like the overloaded term "cluster" (off-list
feedback). Now I use "block range" to describe a range of contiguous
blocks (blocknum, nblocks).
* KK didn't like "prefetch" being used with various meanings (off-list
feedback). Better words found.
* pgsr_private might as well be void *; uintptr_t is theoretically
more general but doesn't seem to buy much for realistic use.
* per_io_data might as well be called per_buffer_data (it's not per
"I/O" and that isn't the level of abstraction for this API anyway
which is about blocks and buffers).
* I reordered some function arguments that jumped out after the above changes.
* Once I started down the path of using a flags field to control
various policy stuff as discussed up-thread, it started to seem
clearer that callers probably shouldn't directly control I/O depth,
which was all a bit ad-hoc and unfinished before. I think we'd likely
want that to be centralised. New idea: it should be enough to be able
to specify policies as required now and in future with flags.
Thoughts?

I've also included 4 of the WIP patches from earlier (based on an
obsolete version of the vacuum thing, sorry I know you have a much
better one now), which can be mostly ignored. Here I just wanted to
share working code to drive vectored reads with different access
patterns, and also show the API interactions and how they might each
set flag bits. For example, parallel seq scan uses
PGSR_FLAG_SEQUENTIAL to insist that its scans are sequential despite
appearances, pg_prewarm uses PGSR_FLAG_FULL to declare that it'll read
the whole relation so there is no point in ramping up, and the
vacuum-related stuff uses PGSR_FLAG_MAINTENANCE to select tuning based
on maintenance_io_concurrency instead of effective_io_concurrency.

Attachments:

v5-0001-Provide-vectored-variant-of-ReadBuffer.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Provide-vectored-variant-of-ReadBuffer.patchDownload

From 6b02e701950319c1032db55191867506c829d31d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v5 01/10] Provide vectored variant of ReadBuffer().

PrepareReadBuffer() followed by CompleteReadBuffers() or ZeroBuffer() is
equivalent to ReadBuffer(), except that CompleteReadBuffers() can take
a range of buffers to be handled with a single smgrreadv() call.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 532 +++++++++++++++++---------
 src/backend/storage/buffer/localbuf.c |   7 +-
 src/include/storage/bufmgr.h          |  19 +
 3 files changed, 369 insertions(+), 189 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6d..d462ef7460 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -472,7 +472,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
 								ReadBufferMode mode, BufferAccessStrategy strategy,
 								bool *hit);
@@ -795,15 +795,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
+	buf = ReadBuffer_common(BMR_REL(reln),
 							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+
 	return buf;
 }
 
@@ -827,8 +821,9 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 
 	SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -1002,7 +997,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 		bool		hit;
 
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
+		buffer = ReadBuffer_common(bmr,
 								   fork, extend_to - 1, mode, strategy,
 								   &hit);
 	}
@@ -1016,18 +1011,11 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	Buffer		buffer;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1046,175 +1034,331 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		*hit = false;
+
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	buffer = PrepareReadBuffer(bmr,
+							   forkNum,
+							   blockNum,
+							   strategy,
+							   hit);
+
+	/* At this point we do NOT hold any locks. */
+
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+	{
+		/* if we just want zeroes and a lock, we're done */
+		ZeroBuffer(buffer, mode);
+	}
+	else if (!*hit)
+	{
+		/* we might need to perform I/O */
+		CompleteReadBuffers(bmr,
+							&buffer,
+							forkNum,
+							blockNum,
+							1,
+							mode == RBM_ZERO_ON_ERROR,
+							strategy);
+	}
+
+	return buffer;
+}
+
+/*
+ * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
+ * the returned buffer can be used immediately.  Otherwise, a physical read
+ * should be completed with CompleteReadBuffers(), or the buffer should be
+ * zeroed with ZeroBuffer().  PrepareReadBuffer() followed by
+ * CompleteReadBuffers() or ZeroBuffer() is equivalent to ReadBuffer(), but
+ * the caller has the opportunity to combine reads of multiple neighboring
+ * blocks into one CompleteReadBuffers() call.
+ *
+ * *foundPtr is set to true for a hit, and false for a miss.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
 
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * CompleteReadBuffers() (so, not for hits, and not for buffers that
+		 * are zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+	return BufferDescriptorGetBuffer(bufHdr);
+}
+
+static inline bool
+CompleteReadBuffersCanStartIO(Buffer buffer)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
 
-		return BufferDescriptorGetBuffer(bufHdr);
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true);
+}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+/*
+ * Complete a set reads prepared with PrepareReadBuffers().  The buffers must
+ * cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+					Buffer *buffers,
+					ForkNumber forknum,
+					BlockNumber blocknum,
+					int nblocks,
+					bool zero_on_error,
+					BufferAccessStrategy strategy)
+{
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+#ifdef USE_ASSERT_CHECKING
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		/*
+		 * We could get all the information from buffer headers, but it can be
+		 * expensive to access buffer header cache lines so we make the caller
+		 * provide all the information we need, and assert that it is
+		 * consistent.
+		 */
+		{
+			RelFileLocator xlocator;
+			ForkNumber	xforknum;
+			BlockNumber xblocknum;
+
+			BufferGetTag(buffers[i], &xlocator, &xforknum, &xblocknum);
+			Assert(RelFileLocatorEquals(bmr.smgr->smgr_rlocator.locator, xlocator));
+			Assert(xforknum == forknum);
+			Assert(xblocknum == blocknum + i);
+		}
+#endif
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/* Skip this block if someone else has already completed it. */
+		if (!CompleteReadBuffersCanStartIO(buffers[i]))
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
+
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?
+		 */
+		while ((i + 1) < nblocks &&
+			   CompleteReadBuffersCanStartIO(buffers[i + 1]))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
+		{
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if (zero_on_error || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
@@ -1228,11 +1372,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.  A read should be
+ * performed with CompleteReadBuffers().
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1291,19 +1432,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1368,19 +1500,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1412,15 +1535,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -2381,7 +2498,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if PrepareReadBuffer() was called and
+		 * CompleteReadBuffers() hasn't been called yet.  We'll check by
+		 * loading the flags without locking.  This is racy, but it's OK to
+		 * return false spuriously: when CompleteReadBuffers() calls
+		 * StartBufferIO(), it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2390,7 +2512,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -4845,6 +4967,46 @@ ConditionalLockBuffer(Buffer buffer)
 									LW_EXCLUSIVE);
 }
 
+/*
+ * Zero a buffer, and lock it as RBM_ZERO_AND_LOCK or
+ * RBM_ZERO_AND_CLEANUP_LOCK would.  The buffer must be already pinned.  It
+ * does not have to be valid, but it is valid and locked on return.
+ */
+void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * Verify that this backend is pinning the buffer exactly once.
  *
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1be4f4f8da..e79a4967ee 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d335..eb41e455a0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+extern Buffer PrepareReadBuffer(BufferManagerRelation bmr,
+								ForkNumber forkNum,
+								BlockNumber blockNum,
+								BufferAccessStrategy strategy,
+								bool *foundPtr);
+extern void CompleteReadBuffers(BufferManagerRelation bmr,
+								Buffer *buffers,
+								ForkNumber forknum,
+								BlockNumber blocknum,
+								int nblocks,
+								bool zero_on_error,
+								BufferAccessStrategy strategy);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -247,6 +265,7 @@ extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
-- 
2.39.2

v5-0002-Give-SMgrRelation-pointers-a-well-defined-lifetim.patchtext/x-patch; charset=US-ASCII; name=v5-0002-Give-SMgrRelation-pointers-a-well-defined-lifetim.patchDownload

From 21998ff7585e20b53f8c6f3306920acf22bf9d2d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 10 Aug 2023 18:16:31 +1200
Subject: [PATCH v5 02/10] Give SMgrRelation pointers a well-defined lifetime.

XXX This is a stand-in version of this patch, see
https://www.postgresql.org/message-id/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
for later verisons being discussed separately.
---
 src/backend/access/transam/xlogutils.c |  2 +-
 src/backend/postmaster/bgwriter.c      |  8 +++--
 src/backend/postmaster/checkpointer.c  | 15 ++++----
 src/backend/storage/smgr/smgr.c        | 49 ++++++++++++++++----------
 src/include/storage/smgr.h             |  4 +--
 src/include/utils/rel.h                |  6 ----
 6 files changed, 46 insertions(+), 38 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index aa8667abd1..8775b5789b 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -657,7 +657,7 @@ XLogDropDatabase(Oid dbid)
 	 * This is unnecessarily heavy-handed, as it will close SMgrRelation
 	 * objects for other databases as well. DROP DATABASE occurs seldom enough
 	 * that it's not worth introducing a variant of smgrclose for just this
-	 * purpose. XXX: Or should we rather leave the smgr entries dangling?
+	 * purpose.
 	 */
 	smgrcloseall();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d7d6cc0cd7..13e5376619 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -246,10 +246,12 @@ BackgroundWriterMain(void)
 		if (FirstCallSinceLastCheckpoint())
 		{
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the bgwriter does
+			 * not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 		}
 
 		/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5e949fc885..5d843b6142 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -469,10 +469,12 @@ CheckpointerMain(void)
 				ckpt_performed = CreateRestartPoint(flags);
 
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the checkpointer
+			 * does not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 
 			/*
 			 * Indicate checkpoint completion to any waiting backends.
@@ -958,11 +960,8 @@ RequestCheckpoint(int flags)
 		 */
 		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
 
-		/*
-		 * After any checkpoint, close all smgr files.  This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
+		/* Free all smgr objects, as CheckpointerMain() normally would. */
+		smgrdestroyall();
 
 		return;
 	}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c7..0d7272e796 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -147,7 +147,9 @@ smgrshutdown(int code, Datum arg)
 /*
  * smgropen() -- Return an SMgrRelation object, creating it if need be.
  *
- * This does not attempt to actually open the underlying file.
+ * This does not attempt to actually open the underlying files.  The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transaction backends.
  */
 SMgrRelation
 smgropen(RelFileLocator rlocator, BackendId backend)
@@ -259,10 +261,10 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
  */
 void
-smgrclose(SMgrRelation reln)
+smgrdestroy(SMgrRelation reln)
 {
 	SMgrRelation *owner;
 	ForkNumber	forknum;
@@ -289,12 +291,14 @@ smgrclose(SMgrRelation reln)
 }
 
 /*
- * smgrrelease() -- Release all resources used by this object.
+ * smgrclose() -- Release all resources used by this object.
  *
- * The object remains valid.
+ * The object remains valid, but is moved to the unknown list where it will
+ * be destroyed by AtEOXact_SMgr().  It may be re-owned if it is accessed by a
+ * relation before then.
  */
 void
-smgrrelease(SMgrRelation reln)
+smgrclose(SMgrRelation reln)
 {
 	for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 	{
@@ -302,15 +306,20 @@ smgrrelease(SMgrRelation reln)
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	}
 	reln->smgr_targblock = InvalidBlockNumber;
+
+	if (reln->smgr_owner)
+	{
+		*reln->smgr_owner = NULL;
+		reln->smgr_owner = NULL;
+		dlist_push_tail(&unowned_relns, &reln->node);
+	}
 }
 
 /*
- * smgrreleaseall() -- Release resources used by all objects.
- *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * smgrcloseall() -- Close all objects.
  */
 void
-smgrreleaseall(void)
+smgrcloseall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -322,14 +331,17 @@ smgrreleaseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrrelease(reln);
+		smgrclose(reln);
 }
 
 /*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrdestroyall() -- Destroy all SMgrRelation objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those registered with smgrsetowner().
  */
 void
-smgrcloseall(void)
+smgrdestroyall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -341,7 +353,7 @@ smgrcloseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrclose(reln);
+		smgrdestroy(reln);
 }
 
 /*
@@ -733,7 +745,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
  * AtEOXact_SMgr
  *
  * This routine is called during transaction commit or abort (it doesn't
- * particularly care which).  All transient SMgrRelation objects are closed.
+ * particularly care which).  All transient SMgrRelation objects are
+ * destroyed.
  *
  * We do this as a compromise between wanting transient SMgrRelations to
  * live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -747,7 +760,7 @@ AtEOXact_SMgr(void)
 	dlist_mutable_iter iter;
 
 	/*
-	 * Zap all unowned SMgrRelations.  We rely on smgrclose() to remove each
+	 * Zap all unowned SMgrRelations.  We rely on smgrdestroy() to remove each
 	 * one from the list.
 	 */
 	dlist_foreach_modify(iter, &unowned_relns)
@@ -757,7 +770,7 @@ AtEOXact_SMgr(void)
 
 		Assert(rel->smgr_owner == NULL);
 
-		smgrclose(rel);
+		smgrdestroy(rel);
 	}
 }
 
@@ -768,6 +781,6 @@ AtEOXact_SMgr(void)
 bool
 ProcessBarrierSmgrRelease(void)
 {
-	smgrreleaseall();
+	smgrcloseall();
 	return true;
 }
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a056..d8ffe397fa 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,8 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
 extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
-extern void smgrrelease(SMgrRelation reln);
-extern void smgrreleaseall(void);
+extern void smgrdestroy(SMgrRelation reln);
+extern void smgrdestroyall(void);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a584b1ddff..6636cc82c0 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
  *
  * Very little code is authorized to touch rel->rd_smgr directly.  Instead
  * use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period.  Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation.  It's quite cheap in
- * comparison to whatever an smgr function is going to do.
  */
 static inline SMgrRelation
 RelationGetSmgr(Relation rel)
-- 
2.39.2

v5-0003-Provide-API-for-streaming-reads-of-relations.patchtext/x-patch; charset=US-ASCII; name=v5-0003-Provide-API-for-streaming-reads-of-relations.patchDownload

From c9d2a0c5ab307f643d908a3431165595541b4b6f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 28 Jul 2023 22:35:32 +1200
Subject: [PATCH v5 03/10] Provide API for "streaming" reads of relations.

"Streaming reads" can be used as a more efficient replacement for
a series of ReadBuffer() calls.

The client code supplies a callback that can say which block to read
next, and then consumes individual buffers one at a time.  This division
allows streaming_read.c to build up calls to CompleteReadBuffers(),
which in turn produce larger preadv() calls instead of traditional
single-block pread() calls.

LimitAdditionalPins() is given external linkage, as the standard way of
making sure we don't take too many pins.  It was originally developed by
commit 31966b151e6a which has similar concerns (it was for multi-block
extension, this is for multi-block reading).

This work is based on an API sketched out by Andres Freund, and many
discussions with him about how to pave the way for asynchronous I/O in
future work.  The main aim is to create an abstraction that insulates
client code from future improvements to the I/O subsystem.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 357 +++++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c      |   2 +-
 src/backend/storage/buffer/localbuf.c    |   7 +-
 src/backend/storage/meson.build          |   1 +
 src/include/storage/bufmgr.h             |   3 +
 src/include/storage/streaming_read.h     |  37 +++
 src/tools/pgindent/typedefs.list         |   2 +
 10 files changed, 425 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..eec03f6f2b 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 0000000000..bcab44c802
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 0000000000..39aef2a84a
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 0000000000..3279956754
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,357 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ *
+ * For hits, need_to_complete is false and there is just one block per
+ * range, already pinned and ready for use.
+ *
+ * For misses, need_to_complete is true and buffers[] holds a range of
+ * blocks that are contiguous in storage (though the buffers may not be
+ * contiguous in memory), so we can complete them with a single call to
+ * CompleteReadBuffers().
+ */
+typedef struct PgStreamingReadRange
+{
+	bool		need_complete;
+	BlockNumber blocknum;
+	int			nblocks;
+	int			per_buffer_data_index[MAX_BUFFERS_PER_TRANSFER];
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+} PgStreamingReadRange;
+
+struct PgStreamingRead
+{
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	bool		finished;
+	void	   *pgsr_private;
+	PgStreamingReadBufferCB callback;
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	/* Space for optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+	int			per_buffer_data_next;
+
+	/* Circular buffer of ranges. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+										void *pgsr_private,
+										size_t per_buffer_data_size,
+										BufferAccessStrategy strategy)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_ios;
+	uint32		max_pinned_buffers;
+
+	/*
+	 * For now, max_ios is a nominal value because we don't generate I/O
+	 * concurrency yet.  Later this will serve more purpose.
+	 */
+	if (flags & PGSR_FLAG_MAINTENANCE)
+		max_ios = maintenance_io_concurrency;
+	else
+		max_ios = effective_io_concurrency;
+
+	/*
+	 * The desired level of I/O concurrency controls how far ahead we are
+	 * willing to look ahead.  We also clamp it to at least
+	 * MAX_BUFFER_PER_TRANFER so that we can have a chance to build up a full
+	 * sized read, even when max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+	/*
+	 * Don't allow this backend to pin too many buffers.  For now we'll apply
+	 * the limit for the shared buffer pool and the local buffer pool, without
+	 * worrying which it is.
+	 */
+	LimitAdditionalPins(&max_pinned_buffers);
+	LimitAdditionalLocalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0), therefore we need two extra elements
+	 * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+	 * maxmimum possible number of occupied ranges of the smallest possible
+	 * size of one.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, ranges) +
+				sizeof(pgsr->ranges[0]) * size);
+
+	pgsr->per_buffer_data_size = per_buffer_data_size;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+	/*
+	 * We want to avoid creating ranges that are smaller than they could be
+	 * just because we hit max_pinned_buffers.  We only look ahead when the
+	 * number of pinned buffers falls below this trigger number, or put
+	 * another way, we stop looking ahead when we wouldn't be able to build a
+	 * "full sized" range.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space the callback to store extra data along with each block. */
+	if (per_buffer_data_size)
+		pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+							   void *pgsr_private,
+							   size_t per_buffer_data_size,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *result;
+
+	result = pg_streaming_read_buffer_alloc_internal(flags,
+													 pgsr_private,
+													 per_buffer_data_size,
+													 strategy);
+	result->callback = next_block_cb;
+	result->bmr = bmr;
+	result->forknum = forknum;
+
+	return result;
+}
+
+/*
+ * Start building a new range.  This is called after the previous one
+ * reached maximum size, or the callback's next block can't be merged with it.
+ *
+ * Since the previous head range has now reached its full potential size,
+ * this is also the place where we would issue advice or start an asynchronous
+ * I/O, in future development.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_new_range(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *head_range;
+
+	head_range = &pgsr->ranges[pgsr->head];
+	Assert(head_range->nblocks > 0);
+
+	/* Create a new head range.  There must be space. */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	head_range = &pgsr->ranges[pgsr->head];
+	head_range->nblocks = 0;
+
+	return head_range;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+	/* If we're finished, then we can't look ahead. */
+	if (pgsr->finished)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a full range.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BufferManagerRelation bmr;
+		ForkNumber	forknum;
+		BlockNumber blocknum;
+		Buffer		buffer;
+		bool		found;
+		bool		need_complete;
+		PgStreamingReadRange *head_range;
+		void	   *per_buffer_data;
+
+		/* Do we have a full-sized range? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks == lengthof(head_range->buffers))
+		{
+			Assert(head_range->need_complete);
+			head_range = pg_streaming_read_new_range(pgsr);
+
+			/*
+			 * Give up now if we wouldn't be able form another full range
+			 * after this due to the pin limit.
+			 */
+			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+				break;
+		}
+
+		per_buffer_data = (char *) pgsr->per_buffer_data +
+			pgsr->per_buffer_data_size * pgsr->per_buffer_data_next;
+
+		/* Find out which block the callback wants to read next. */
+		blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			pgsr->finished = true;
+			break;
+		}
+		bmr = pgsr->bmr;
+		forknum = pgsr->forknum;
+
+		Assert(pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+		buffer = PrepareReadBuffer(bmr,
+								   forknum,
+								   blocknum,
+								   pgsr->strategy,
+								   &found);
+		pgsr->pinned_buffers++;
+
+		need_complete = !found;
+
+		/* Is there a head range that we can't extend? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks > 0 &&
+			(!need_complete ||
+			 !head_range->need_complete ||
+			 head_range->blocknum + head_range->nblocks != blocknum))
+		{
+			/* Yes, time to start building a new one. */
+			head_range = pg_streaming_read_new_range(pgsr);
+			Assert(head_range->nblocks == 0);
+		}
+
+		if (head_range->nblocks == 0)
+		{
+			/* Initialize a new range beginning at this block. */
+			head_range->blocknum = blocknum;
+			head_range->need_complete = need_complete;
+		}
+		else
+		{
+			/* We can extend an existing range by one block. */
+			Assert(head_range->blocknum + head_range->nblocks == blocknum);
+			Assert(head_range->need_complete);
+		}
+
+		head_range->per_buffer_data_index[head_range->nblocks] = pgsr->per_buffer_data_next++;
+		head_range->buffers[head_range->nblocks] = buffer;
+		head_range->nblocks++;
+
+		if (pgsr->per_buffer_data_next == pgsr->max_pinned_buffers)
+			pgsr->per_buffer_data_next = 0;
+
+	} while (pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+	if (pgsr->ranges[pgsr->head].nblocks > 0)
+		pg_streaming_read_new_range(pgsr);
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+	pg_streaming_read_look_ahead(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadRange *tail_range;
+
+		tail_range = &pgsr->ranges[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * range?
+		 */
+		if (tail_range->need_complete)
+		{
+			CompleteReadBuffers(pgsr->bmr,
+								tail_range->buffers,
+								pgsr->forknum,
+								tail_range->blocknum,
+								tail_range->nblocks,
+								false,
+								pgsr->strategy);
+			tail_range->need_complete = false;
+		}
+
+		/* Are there more buffers available in this range? */
+		if (pgsr->next_tail_buffer < tail_range->nblocks)
+		{
+			int			buffer_index;
+			Buffer		buffer;
+
+			buffer_index = pgsr->next_tail_buffer++;
+			buffer = tail_range->buffers[buffer_index];
+
+			Assert(BufferIsValid(buffer));
+
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (per_buffer_data)
+				*per_buffer_data = (char *) pgsr->per_buffer_data +
+					tail_range->per_buffer_data_index[buffer_index] *
+					pgsr->per_buffer_data_size;
+
+			return buffer;
+		}
+
+		/* Advance tail to next range, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead, and unpin anything that wasn't consumed. */
+	pgsr->finished = true;
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	if (pgsr->per_buffer_data)
+		pfree(pgsr->per_buffer_data);
+	pfree(pgsr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d462ef7460..0319d88453 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1891,7 +1891,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index e79a4967ee..717b8f58da 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -287,7 +287,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca2..739d13293f 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index eb41e455a0..a38f1acb37 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -269,6 +269,9 @@ extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 0000000000..a445ea5c45
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,37 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												void *pgsr_private,
+												void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+													   void *pgsr_private,
+													   size_t per_buffer_private_size,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5fd46b7bd1..c42558a0a3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2089,6 +2089,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
-- 
2.39.2

v5-0004-Use-streaming-reads-in-pg_prewarm.patchtext/x-patch; charset=US-ASCII; name=v5-0004-Use-streaming-reads-in-pg_prewarm.patchDownload

From a1daeb110aad9d1c2582101cdcfd9eecfc8e51e9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 09:28:42 +1200
Subject: [PATCH v5 04/10] Use streaming reads in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use streaming reads.  This
provides a simple example of such a transformation, and generates fewer
system calls.

Reviewed-by:
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 contrib/pg_prewarm/pg_prewarm.c | 40 ++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e4..9617bf130b 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+							   void *pgsr_private,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		PgStreamingRead *pgsr;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+											  &p,
+											  0,
+											  NULL,
+											  BMR_REL(rel),
+											  forkNumber,
+											  pg_prewarm_streaming_read_next);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Close relation, release lock. */
-- 
2.39.2

v5-0005-Issue-advice-for-random-streaming-reads.patchtext/x-patch; charset=US-ASCII; name=v5-0005-Issue-advice-for-random-streaming-reads.patchDownload

From 8027a52d95b5865700b879581c8c4848fcf2a501 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 10 Jan 2024 00:36:14 +1300
Subject: [PATCH v5 05/10] Issue advice for random streaming reads.

For random access to blocks that are not yet in our buffer pool, issue
POSIX_FADV_WILLNEED advice as soon as we have formed the largest range
we can.  (This is the point at which we'd start a real asynchronous
read, given more infrastructure.)

The total number of assumed-to-be-running I/Os is tracked and kept at or
below the requested number.

Even though we try to avoid generating purely sequential hints, based on
evidence that that is slower than just letting the kernel (at least on
Linux) detect sequential access, at least one future user needs a way to
disable advice, so provide a flag for that.
---
 src/backend/storage/aio/streaming_read.c | 100 ++++++++++++++++++++---
 src/include/storage/streaming_read.h     |   8 ++
 2 files changed, 97 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index 3279956754..19605090fe 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -16,6 +16,7 @@
  */
 typedef struct PgStreamingReadRange
 {
+	bool		advice_issued;
 	bool		need_complete;
 	BlockNumber blocknum;
 	int			nblocks;
@@ -25,6 +26,9 @@ typedef struct PgStreamingReadRange
 
 struct PgStreamingRead
 {
+	int			max_ios;
+	int			ios_in_progress;
+	int			ios_in_progress_trigger;
 	int			max_pinned_buffers;
 	int			pinned_buffers;
 	int			pinned_buffers_trigger;
@@ -36,6 +40,11 @@ struct PgStreamingRead
 	BufferManagerRelation bmr;
 	ForkNumber	forknum;
 
+	bool		advice_enabled;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
 	/* Space for optional per-buffer private data. */
 	size_t		per_buffer_data_size;
 	void	   *per_buffer_data;
@@ -59,9 +68,12 @@ pg_streaming_read_buffer_alloc_internal(int flags,
 	int			max_ios;
 	uint32		max_pinned_buffers;
 
+
 	/*
-	 * For now, max_ios is a nominal value because we don't generate I/O
-	 * concurrency yet.  Later this will serve more purpose.
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice to the kernel to tell it that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
 	 */
 	if (flags & PGSR_FLAG_MAINTENANCE)
 		max_ios = maintenance_io_concurrency;
@@ -76,6 +88,12 @@ pg_streaming_read_buffer_alloc_internal(int flags,
 	 */
 	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
 
+	/*
+	 * The *_io_concurrency GUCs, we might have 0.  We want to allow at least
+	 * one, to keep our gating logic simple.
+	 */
+	max_ios = Max(max_ios, 1);
+
 	/*
 	 * Don't allow this backend to pin too many buffers.  For now we'll apply
 	 * the limit for the shared buffer pool and the local buffer pool, without
@@ -99,12 +117,25 @@ pg_streaming_read_buffer_alloc_internal(int flags,
 		palloc0(offsetof(PgStreamingRead, ranges) +
 				sizeof(pgsr->ranges[0]) * size);
 
+	pgsr->max_ios = max_ios;
 	pgsr->per_buffer_data_size = per_buffer_data_size;
 	pgsr->max_pinned_buffers = max_pinned_buffers;
 	pgsr->pgsr_private = pgsr_private;
 	pgsr->strategy = strategy;
 	pgsr->size = size;
 
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  As long as direct I/O isn't
+	 * enabled, and the caller hasn't promised sequential access, we can use
+	 * it.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & PGSR_FLAG_SEQUENTIAL) == 0)
+		pgsr->advice_enabled = true;
+#endif
+
 	/*
 	 * We want to avoid creating ranges that are smaller than they could be
 	 * just because we hit max_pinned_buffers.  We only look ahead when the
@@ -154,9 +185,9 @@ pg_streaming_read_buffer_alloc(int flags,
  * Start building a new range.  This is called after the previous one
  * reached maximum size, or the callback's next block can't be merged with it.
  *
- * Since the previous head range has now reached its full potential size,
- * this is also the place where we would issue advice or start an asynchronous
- * I/O, in future development.
+ * Since the previous head range has now reached its full potential size, this
+ * is also a good time to issue 'prefetch' advice, because we know that'll
+ * soon be reading.  In future, we could start an actual I/O here.
  */
 static PgStreamingReadRange *
 pg_streaming_read_new_range(PgStreamingRead *pgsr)
@@ -166,6 +197,40 @@ pg_streaming_read_new_range(PgStreamingRead *pgsr)
 	head_range = &pgsr->ranges[pgsr->head];
 	Assert(head_range->nblocks > 0);
 
+	/*
+	 * If a call to CompleteReadBuffers() will be needed, and we can issue
+	 * advice to the kernel to get the read started.  We suppress it if the
+	 * access pattern appears to be completely sequential, though, because on
+	 * some systems that interfers with the kernel's own sequential read ahead
+	 * heurstics and hurts performance.
+	 */
+	if (pgsr->advice_enabled)
+	{
+		BlockNumber blocknum = head_range->blocknum;
+		int			nblocks = head_range->nblocks;
+
+		if (head_range->need_complete && blocknum != pgsr->seq_blocknum)
+		{
+			SMgrRelation smgr =
+				pgsr->bmr.smgr ? pgsr->bmr.smgr :
+				RelationGetSmgr(pgsr->bmr.rel);
+
+			Assert(!head_range->advice_issued);
+
+			smgrprefetch(smgr, pgsr->forknum, blocknum, nblocks);
+
+			/*
+			 * Count this as an I/O that is concurrently in progress, though
+			 * we don't really know if the kernel generates a physical I/O.
+			 */
+			head_range->advice_issued = true;
+			pgsr->ios_in_progress++;
+		}
+
+		/* Remember the block after this range, for sequence detection. */
+		pgsr->seq_blocknum = blocknum + nblocks;
+	}
+
 	/* Create a new head range.  There must be space. */
 	Assert(pgsr->size > pgsr->max_pinned_buffers);
 	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
@@ -180,8 +245,10 @@ pg_streaming_read_new_range(PgStreamingRead *pgsr)
 static void
 pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
 {
-	/* If we're finished, then we can't look ahead. */
-	if (pgsr->finished)
+	/*
+	 * If we're finished or can't start more I/O, then don't look ahead.
+	 */
+	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
 		return;
 
 	/*
@@ -210,10 +277,11 @@ pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
 			head_range = pg_streaming_read_new_range(pgsr);
 
 			/*
-			 * Give up now if we wouldn't be able form another full range
-			 * after this due to the pin limit.
+			 * Give up now if I/O is saturated, or we wouldn't be able form
+			 * another full range after this due to the pin limit.
 			 */
-			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger ||
+				pgsr->ios_in_progress == pgsr->max_ios)
 				break;
 		}
 
@@ -258,6 +326,7 @@ pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
 			/* Initialize a new range beginning at this block. */
 			head_range->blocknum = blocknum;
 			head_range->need_complete = need_complete;
+			head_range->advice_issued = false;
 		}
 		else
 		{
@@ -273,7 +342,8 @@ pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
 		if (pgsr->per_buffer_data_next == pgsr->max_pinned_buffers)
 			pgsr->per_buffer_data_next = 0;
 
-	} while (pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+	} while (pgsr->pinned_buffers < pgsr->max_pinned_buffers &&
+			 pgsr->ios_in_progress < pgsr->max_ios);
 
 	if (pgsr->ranges[pgsr->head].nblocks > 0)
 		pg_streaming_read_new_range(pgsr);
@@ -305,6 +375,14 @@ pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
 								false,
 								pgsr->strategy);
 			tail_range->need_complete = false;
+
+			/*
+			 * We don't really know if the kernel generated an physical I/O
+			 * when we issued advice, let alone when it finished, but it has
+			 * certainly finished after a read call returns.
+			 */
+			if (tail_range->advice_issued)
+				pgsr->ios_in_progress--;
 		}
 
 		/* Are there more buffers available in this range? */
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
index a445ea5c45..40c3408c54 100644
--- a/src/include/storage/streaming_read.h
+++ b/src/include/storage/streaming_read.h
@@ -14,6 +14,14 @@
  */
 #define PGSR_FLAG_MAINTENANCE 0x01
 
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
 struct PgStreamingRead;
 typedef struct PgStreamingRead PgStreamingRead;
 
-- 
2.39.2

v5-0006-Allow-streaming-reads-to-ramp-up-in-size.patchtext/x-patch; charset=US-ASCII; name=v5-0006-Allow-streaming-reads-to-ramp-up-in-size.patchDownload

From 30e59a9eee0ac1b10c8016ad3a4e524202f0ce5e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 10 Jan 2024 02:20:07 +1300
Subject: [PATCH v5 06/10] Allow streaming reads to ramp up in size.

Instead of assuming that the consumer of buffers will consume all blocks
until the end of the stream, start by looking ahead just one block, and
doubling each time until we reach the full possible distance.  This can
be disabled by callers that already know they will read all available
blocks, for example pg_prewarm.
---
 contrib/pg_prewarm/pg_prewarm.c          |  2 +-
 src/backend/storage/aio/streaming_read.c | 39 +++++++++++++++++++++++-
 src/include/storage/streaming_read.h     |  7 +++++
 3 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 9617bf130b..1cc84bcb0c 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -214,7 +214,7 @@ pg_prewarm(PG_FUNCTION_ARGS)
 		p.blocknum = first_block;
 		p.last_block = last_block;
 
-		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_FULL,
 											  &p,
 											  0,
 											  NULL,
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index 19605090fe..bb5b863be7 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -33,6 +33,8 @@ struct PgStreamingRead
 	int			pinned_buffers;
 	int			pinned_buffers_trigger;
 	int			next_tail_buffer;
+	int			ramp_up_pin_limit;
+	int			ramp_up_pin_stall;
 	bool		finished;
 	void	   *pgsr_private;
 	PgStreamingReadBufferCB callback;
@@ -136,6 +138,16 @@ pg_streaming_read_buffer_alloc_internal(int flags,
 		pgsr->advice_enabled = true;
 #endif
 
+	/*
+	 * We start off building small ranges, but double that quickly, for the
+	 * benefit of users that don't know how far ahead they'll read.  This can
+	 * be disabled by users that already know they'll read all the way.
+	 */
+	if (flags & PGSR_FLAG_FULL)
+		pgsr->ramp_up_pin_limit = INT_MAX;
+	else
+		pgsr->ramp_up_pin_limit = 1;
+
 	/*
 	 * We want to avoid creating ranges that are smaller than they could be
 	 * just because we hit max_pinned_buffers.  We only look ahead when the
@@ -245,6 +257,16 @@ pg_streaming_read_new_range(PgStreamingRead *pgsr)
 static void
 pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
 {
+	/*
+	 * If we're still ramping up, we may have to stall to wait for buffers to
+	 * be consumed first before we do any more prefetching.
+	 */
+	if (pgsr->ramp_up_pin_stall > 0)
+	{
+		Assert(pgsr->pinned_buffers > 0);
+		return;
+	}
+
 	/*
 	 * If we're finished or can't start more I/O, then don't look ahead.
 	 */
@@ -343,7 +365,19 @@ pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
 			pgsr->per_buffer_data_next = 0;
 
 	} while (pgsr->pinned_buffers < pgsr->max_pinned_buffers &&
-			 pgsr->ios_in_progress < pgsr->max_ios);
+			 pgsr->ios_in_progress < pgsr->max_ios &&
+			 pgsr->pinned_buffers < pgsr->ramp_up_pin_limit);
+
+	/* If we've hit the ramp-up limit, insert a stall. */
+	if (pgsr->pinned_buffers >= pgsr->ramp_up_pin_limit)
+	{
+		/* Can't get here if an earlier stall hasn't finished. */
+		Assert(pgsr->ramp_up_pin_stall == 0);
+		/* Don't do any more prefetching until these buffers are consumed. */
+		pgsr->ramp_up_pin_stall = pgsr->ramp_up_pin_limit;
+		/* Double it.  It will soon be out of the way. */
+		pgsr->ramp_up_pin_limit *= 2;
+	}
 
 	if (pgsr->ranges[pgsr->head].nblocks > 0)
 		pg_streaming_read_new_range(pgsr);
@@ -400,6 +434,9 @@ pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
 			Assert(pgsr->pinned_buffers > 0);
 			pgsr->pinned_buffers--;
 
+			if (pgsr->ramp_up_pin_stall > 0)
+				pgsr->ramp_up_pin_stall--;
+
 			if (per_buffer_data)
 				*per_buffer_data = (char *) pgsr->per_buffer_data +
 					tail_range->per_buffer_data_index[buffer_index] *
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
index 40c3408c54..c4d3892bb2 100644
--- a/src/include/storage/streaming_read.h
+++ b/src/include/storage/streaming_read.h
@@ -22,6 +22,13 @@
  */
 #define PGSR_FLAG_SEQUENTIAL 0x02
 
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define PGSR_FLAG_FULL 0x04
+
 struct PgStreamingRead;
 typedef struct PgStreamingRead PgStreamingRead;
 
-- 
2.39.2

v5-0007-WIP-Use-streaming-reads-in-heapam-scans.patchtext/x-patch; charset=US-ASCII; name=v5-0007-WIP-Use-streaming-reads-in-heapam-scans.patchDownload

From 766703de5f03bf9d2bda73a0bb703f1c49bb2178 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 12 Apr 2023 18:16:06 -0700
Subject: [PATCH v5 07/10] WIP: Use streaming reads in heapam scans.

XXX Cherry-picked from https://github.com/anarazel/postgres/tree/aio and
lightly modified by TM, for demonstration purposes.

Author: Andres Freund <andres@anarazel.de>
---
 src/backend/access/heap/heapam.c         | 205 +++++++++++++++++++++--
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/include/access/heapam.h              |   5 +-
 3 files changed, 192 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 707460a536..76a53f68b6 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -65,6 +65,7 @@
 #include "storage/smgr.h"
 #include "storage/spin.h"
 #include "storage/standby.h"
+#include "storage/streaming_read.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
@@ -225,6 +226,89 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+					  void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) pgsr_private;
+	BlockNumber blockno;
+
+	Assert(!scan->rs_base.rs_parallel);
+	Assert(scan->rs_nblocks > 0);
+
+	if (scan->rs_prefetch_block == InvalidBlockNumber)
+	{
+		scan->rs_prefetch_block = blockno = scan->rs_startblock;
+	}
+	else
+	{
+		blockno = ++scan->rs_prefetch_block;
+
+		/* wrap back to the start of the heap */
+		if (blockno >= scan->rs_nblocks)
+			scan->rs_prefetch_block = blockno = 0;
+
+		/* we're done if we're back at where we started */
+		if (blockno == scan->rs_startblock)
+			return InvalidBlockNumber;
+
+		/* check if the limit imposed by heap_setscanlimits() is met */
+		if (scan->rs_numblocks != InvalidBlockNumber)
+		{
+			if (--scan->rs_numblocks == 0)
+				return InvalidBlockNumber;
+		}
+	}
+
+	return blockno;
+}
+
+static BlockNumber
+heap_pgsr_next_parallel(PgStreamingRead *pgsr, void *pgsr_private,
+						void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) pgsr_private;
+	ParallelBlockTableScanDesc pbscan =
+		(ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+	ParallelBlockTableScanWorker pbscanwork =
+		scan->rs_parallelworkerdata;
+	BlockNumber blockno;
+
+	Assert(scan->rs_base.rs_parallel);
+	Assert(scan->rs_nblocks > 0);
+
+	/* Note that other processes might have already finished the scan */
+	blockno = table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
+												pbscanwork, pbscan);
+
+	return blockno;
+}
+
+static PgStreamingRead *
+heap_pgsr_single_alloc(HeapScanDesc scan)
+{
+	return pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+										  scan,
+										  0,
+										  scan->rs_strategy,
+										  BMR_REL(scan->rs_base.rs_rd),
+										  MAIN_FORKNUM,
+										  heap_pgsr_next_single);
+}
+
+static PgStreamingRead *
+heap_pgsr_parallel_alloc(HeapScanDesc scan)
+{
+	return pg_streaming_read_buffer_alloc(PGSR_FLAG_SEQUENTIAL,
+										  scan,
+										  0,
+										  scan->rs_strategy,
+										  BMR_REL(scan->rs_base.rs_rd),
+										  MAIN_FORKNUM,
+										  heap_pgsr_next_parallel);
+}
+
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -342,6 +426,26 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	 */
 	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN)
 		pgstat_count_heap_scan(scan->rs_base.rs_rd);
+
+	scan->rs_prefetch_block = InvalidBlockNumber;
+	if (scan->pgsr)
+	{
+		pg_streaming_read_free(scan->pgsr);
+		scan->pgsr = NULL;
+	}
+
+	/*
+	 * FIXME: This probably should be done in the !rs_inited blocks instead.
+	 */
+	scan->pgsr = NULL;
+	if (!RelationUsesLocalBuffers(scan->rs_base.rs_rd) &&
+		(scan->rs_base.rs_flags & SO_TYPE_SEQSCAN))
+	{
+		if (scan->rs_base.rs_parallel)
+			scan->pgsr = heap_pgsr_parallel_alloc(scan);
+		else
+			scan->pgsr = heap_pgsr_single_alloc(scan);
+	}
 }
 
 /*
@@ -374,7 +478,7 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
  * which tuples on the page are visible.
  */
 void
-heapgetpage(TableScanDesc sscan, BlockNumber block)
+heapgetpage(TableScanDesc sscan, BlockNumber block, Buffer pgsr_buffer)
 {
 	HeapScanDesc scan = (HeapScanDesc) sscan;
 	Buffer		buffer;
@@ -401,9 +505,20 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
+	if (BufferIsValid(pgsr_buffer))
+	{
+		Assert(scan->pgsr);
+		Assert(BufferGetBlockNumber(pgsr_buffer) == block);
+		scan->rs_cbuf = pgsr_buffer;
+	}
+	else
+	{
+		Assert(!scan->pgsr);
+
+		/* read page using selected strategy */
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
+										   RBM_NORMAL, scan->rs_strategy);
+	}
 	scan->rs_cblock = block;
 
 	if (!(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE))
@@ -490,7 +605,7 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
  * of the pages before we can get a chance to get our first page.
  */
 static BlockNumber
-heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
+heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir, Buffer *pgsr_buf)
 {
 	Assert(!scan->rs_inited);
 
@@ -500,16 +615,25 @@ heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 
 	if (ScanDirectionIsForward(dir))
 	{
+		if (scan->rs_base.rs_parallel != NULL)
+			table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
+													 scan->rs_parallelworkerdata,
+													 (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+
+		/* FIXME: Integrate more neatly */
+		if (scan->pgsr)
+		{
+			*pgsr_buf = pg_streaming_read_buffer_get_next(scan->pgsr, NULL);
+			if (*pgsr_buf == InvalidBuffer)
+				return InvalidBlockNumber;
+			return BufferGetBlockNumber(*pgsr_buf);
+		}
+
 		/* serial scan */
 		if (scan->rs_base.rs_parallel == NULL)
 			return scan->rs_startblock;
 		else
 		{
-			/* parallel scan */
-			table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
-													 scan->rs_parallelworkerdata,
-													 (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
-
 			/* may return InvalidBlockNumber if there are no more blocks */
 			return table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
 													 scan->rs_parallelworkerdata,
@@ -529,6 +653,12 @@ heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 		 */
 		scan->rs_base.rs_flags &= ~SO_ALLOW_SYNC;
 
+		if (scan->pgsr)
+		{
+			pg_streaming_read_free(scan->pgsr);
+			scan->pgsr = NULL;
+		}
+
 		/*
 		 * Start from last page of the scan.  Ensure we take into account
 		 * rs_numblocks if it's been adjusted by heap_setscanlimits().
@@ -630,11 +760,33 @@ heapgettup_continue_page(HeapScanDesc scan, ScanDirection dir, int *linesleft,
  * heap_setscanlimits().
  */
 static inline BlockNumber
-heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir)
+heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir,
+						 Buffer *pgsr_buf)
 {
 	if (ScanDirectionIsForward(dir))
 	{
-		if (scan->rs_base.rs_parallel == NULL)
+		if (scan->pgsr)
+		{
+#ifdef USE_ASSERT_CHECKING
+			block++;
+
+			/* wrap back to the start of the heap */
+			if (block >= scan->rs_nblocks)
+				block = 0;
+
+			/* we're done if we're back at where we started */
+			if (block == scan->rs_startblock)
+				block = InvalidBlockNumber;
+#endif
+			*pgsr_buf = pg_streaming_read_buffer_get_next(scan->pgsr, NULL);
+			if (*pgsr_buf == InvalidBuffer)
+				return InvalidBlockNumber;
+
+			Assert(scan->rs_base.rs_parallel ||
+				   block == BufferGetBlockNumber(*pgsr_buf));
+			return BufferGetBlockNumber(*pgsr_buf);
+		}
+		else if (scan->rs_base.rs_parallel == NULL)
 		{
 			block++;
 
@@ -679,6 +831,12 @@ heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir
 	}
 	else
 	{
+		if (scan->pgsr)
+		{
+			pg_streaming_read_free(scan->pgsr);
+			scan->pgsr = NULL;
+		}
+
 		/* we're done if the last block is the start position */
 		if (block == scan->rs_startblock)
 			return InvalidBlockNumber;
@@ -728,13 +886,14 @@ heapgettup(HeapScanDesc scan,
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
 	BlockNumber block;
+	Buffer		pgsr_buf = InvalidBuffer;
 	Page		page;
 	OffsetNumber lineoff;
 	int			linesleft;
 
 	if (unlikely(!scan->rs_inited))
 	{
-		block = heapgettup_initial_block(scan, dir);
+		block = heapgettup_initial_block(scan, dir, &pgsr_buf);
 		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
 		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
 		scan->rs_inited = true;
@@ -755,7 +914,7 @@ heapgettup(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapgetpage((TableScanDesc) scan, block, pgsr_buf);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -807,9 +966,10 @@ continue_page:
 		 * it's time to move to the next.
 		 */
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
+		pgsr_buf = InvalidBuffer;
 
 		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
+		block = heapgettup_advance_block(scan, block, dir, &pgsr_buf);
 	}
 
 	/* end of scan */
@@ -843,13 +1003,14 @@ heapgettup_pagemode(HeapScanDesc scan,
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
 	BlockNumber block;
+	Buffer		pgsr_buf = InvalidBuffer;
 	Page		page;
 	int			lineindex;
 	int			linesleft;
 
 	if (unlikely(!scan->rs_inited))
 	{
-		block = heapgettup_initial_block(scan, dir);
+		block = heapgettup_initial_block(scan, dir, &pgsr_buf);
 		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
 		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
 		scan->rs_inited = true;
@@ -876,7 +1037,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapgetpage((TableScanDesc) scan, block, pgsr_buf);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
@@ -908,7 +1069,7 @@ continue_page:
 		}
 
 		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
+		block = heapgettup_advance_block(scan, block, dir, &pgsr_buf);
 	}
 
 	/* end of scan */
@@ -956,6 +1117,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_base.rs_parallel = parallel_scan;
 	scan->rs_strategy = NULL;	/* set in initscan */
 
+	scan->pgsr = NULL;
+
 	/*
 	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
 	 */
@@ -1062,6 +1225,12 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	if (scan->pgsr)
+	{
+		pg_streaming_read_free(scan->pgsr);
+		scan->pgsr = NULL;
+	}
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d15a02b2be..6127f9d75a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2335,7 +2335,7 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 		return false;
 	}
 
-	heapgetpage(scan, blockno);
+	heapgetpage(scan, blockno, InvalidBuffer);
 	hscan->rs_inited = true;
 
 	return true;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 932ec0d6f2..bc53f9f4d2 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -59,6 +59,7 @@ typedef struct HeapScanDescData
 	bool		rs_inited;		/* false = scan not init'd yet */
 	OffsetNumber rs_coffset;	/* current offset # in non-page-at-a-time mode */
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
+	BlockNumber rs_prefetch_block;	/* block being prefetched */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
 
@@ -72,6 +73,8 @@ typedef struct HeapScanDescData
 	 */
 	ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
 
+	struct PgStreamingRead *pgsr;
+
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
@@ -246,7 +249,7 @@ extern TableScanDesc heap_beginscan(Relation relation, Snapshot snapshot,
 									uint32 flags);
 extern void heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk,
 							   BlockNumber numBlks);
-extern void heapgetpage(TableScanDesc sscan, BlockNumber block);
+extern void heapgetpage(TableScanDesc sscan, BlockNumber block, Buffer buffer);
 extern void heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 						bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(TableScanDesc sscan);
-- 
2.39.2

v5-0008-WIP-Use-streaming-reads-in-vacuum.patchtext/x-patch; charset=US-ASCII; name=v5-0008-WIP-Use-streaming-reads-in-vacuum.patchDownload

From 6fd70c119ac96e0af251bc22a99bb894cc200b9a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 12 Apr 2023 15:44:19 -0700
Subject: [PATCH v5 08/10] WIP: Use streaming reads in vacuum.

XXX Cherry-picked from https://github.com/anarazel/postgres/tree/aio and
lightly modified by TM, for demonstration purposes (I still need to
understand why regress/regress's stats test reuse counter doesn't like
this).

Author: Andres Freund <andres@anarazel.de>
---
 src/backend/access/heap/vacuumlazy.c | 291 ++++++++++++++++++++-------
 src/test/regress/expected/stats.out  |  10 +-
 src/test/regress/sql/stats.sql       |   5 +-
 src/tools/pgindent/typedefs.list     |   2 +
 4 files changed, 230 insertions(+), 78 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index abbba8947f..10fb2142d2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -59,6 +59,7 @@
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
+#include "storage/streaming_read.h"
 #include "tcop/tcopprot.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -195,6 +196,17 @@ typedef struct LVRelState
 	BlockNumber missed_dead_pages;	/* # pages with missed dead tuples */
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 
+	/*
+	 * State managed by vacuum_scan_pgsr_next et al follows
+	 */
+	BlockNumber blkno_prefetch;
+	BlockNumber next_unskippable_block;
+	Buffer		vmbuffer_prefetch;
+	Buffer		vmbuffer_data;
+
+	bool		next_unskippable_allvis;
+	bool		skipping_current_range;
+
 	/* Statistics output by us, for table */
 	double		new_rel_tuples; /* new estimated total # of tuples */
 	double		new_live_tuples;	/* new estimated total # of live tuples */
@@ -785,6 +797,50 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	}
 }
 
+static BlockNumber
+vacuum_scan_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private, void *per_buffer_data)
+{
+	LVRelState *vacrel = (LVRelState *) pgsr_private;
+
+	while (vacrel->blkno_prefetch < vacrel->rel_pages)
+	{
+		if (vacrel->blkno_prefetch == vacrel->next_unskippable_block)
+		{
+			/*
+			 * XXX: Does a vacuum_delay_point, it'd be better if this were
+			 * outside the callback...
+			 */
+			vacrel->next_unskippable_block = lazy_scan_skip(vacrel,
+															&vacrel->vmbuffer_prefetch,
+															vacrel->blkno_prefetch + 1,
+															&vacrel->next_unskippable_allvis,
+															&vacrel->skipping_current_range);
+			Assert(vacrel->next_unskippable_block >= vacrel->blkno_prefetch + 1);
+		}
+		else
+		{
+			/*
+			 * Can't skip this page safely.  Must scan the page.  But
+			 * determine the next skippable range after the page first.
+			 */
+			Assert(vacrel->blkno_prefetch < vacrel->rel_pages - 1);
+
+			if (vacrel->skipping_current_range)
+			{
+				vacrel->blkno_prefetch++;
+				continue;
+			}
+		}
+
+		Assert(vacrel->blkno_prefetch < vacrel->rel_pages);
+
+		return vacrel->blkno_prefetch++;
+	}
+
+	Assert(vacrel->blkno_prefetch == vacrel->rel_pages);
+	return InvalidBlockNumber;
+}
+
 /*
  *	lazy_scan_heap() -- workhorse function for VACUUM
  *
@@ -825,19 +881,25 @@ static void
 lazy_scan_heap(LVRelState *vacrel)
 {
 	BlockNumber rel_pages = vacrel->rel_pages,
-				blkno,
-				next_unskippable_block,
 				next_fsm_block_to_vacuum = 0;
 	VacDeadItems *dead_items = vacrel->dead_items;
-	Buffer		vmbuffer = InvalidBuffer;
-	bool		next_unskippable_allvis,
-				skipping_current_range;
 	const int	initprog_index[] = {
 		PROGRESS_VACUUM_PHASE,
 		PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
 		PROGRESS_VACUUM_MAX_DEAD_TUPLES
 	};
 	int64		initprog_val[3];
+	PgStreamingRead *pgsr;
+
+	{
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_MAINTENANCE,
+											  vacrel,
+											  0,
+											  vacrel->bstrategy,
+											  BMR_REL(vacrel->rel),
+											  MAIN_FORKNUM,
+											  vacuum_scan_pgsr_next);
+	}
 
 	/* Report that we're scanning the heap, advertising total # of blocks */
 	initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
@@ -846,42 +908,44 @@ lazy_scan_heap(LVRelState *vacrel)
 	pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
 
 	/* Set up an initial range of skippable blocks using the visibility map */
-	next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer, 0,
-											&next_unskippable_allvis,
-											&skipping_current_range);
-	for (blkno = 0; blkno < rel_pages; blkno++)
+	vacrel->next_unskippable_block =
+		lazy_scan_skip(vacrel,
+					   &vacrel->vmbuffer_prefetch,
+					   0,
+					   &vacrel->next_unskippable_allvis,
+					   &vacrel->skipping_current_range);
+
+	while (true)
 	{
 		Buffer		buf;
+		BlockNumber blkno;
 		Page		page;
-		bool		all_visible_according_to_vm;
+		bool		all_visible_according_to_vm = false;
 		LVPagePruneState prunestate;
 
-		if (blkno == next_unskippable_block)
-		{
-			/*
-			 * Can't skip this page safely.  Must scan the page.  But
-			 * determine the next skippable range after the page first.
-			 */
-			all_visible_according_to_vm = next_unskippable_allvis;
-			next_unskippable_block = lazy_scan_skip(vacrel, &vmbuffer,
-													blkno + 1,
-													&next_unskippable_allvis,
-													&skipping_current_range);
+		buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
+		if (!BufferIsValid(buf))
+			break;
 
-			Assert(next_unskippable_block >= blkno + 1);
-		}
-		else
-		{
-			/* Last page always scanned (may need to set nonempty_pages) */
-			Assert(blkno < rel_pages - 1);
+		CheckBufferIsPinnedOnce(buf);
 
-			if (skipping_current_range)
-				continue;
+		page = BufferGetPage(buf);
 
-			/* Current range is too small to skip -- just scan the page */
-			all_visible_according_to_vm = true;
-		}
+		/* FIXME: should have more efficient way to determine this. */
+		blkno = BufferGetBlockNumber(buf);
+
+		pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+
+		update_vacuum_error_info(vacrel, NULL, VACUUM_ERRCB_PHASE_SCAN_HEAP,
+								 blkno, InvalidOffsetNumber);
+
+		vacuum_delay_point();
 
+		/*
+		 * We're not skipping this page using the visibility map, and so it is
+		 * (by definition) a scanned page.  Any tuples from this page are now
+		 * guaranteed to be counted below, after some preparatory checks.
+		 */
 		vacrel->scanned_pages++;
 
 		/* Report as block scanned, update error traceback information */
@@ -918,10 +982,16 @@ lazy_scan_heap(LVRelState *vacrel)
 			 * correctness, but we do it anyway to avoid holding the pin
 			 * across a lengthy, unrelated operation.
 			 */
-			if (BufferIsValid(vmbuffer))
+			if (BufferIsValid(vacrel->vmbuffer_data))
 			{
-				ReleaseBuffer(vmbuffer);
-				vmbuffer = InvalidBuffer;
+				ReleaseBuffer(vacrel->vmbuffer_data);
+				vacrel->vmbuffer_data = InvalidBuffer;
+			}
+
+			if (BufferIsValid(vacrel->vmbuffer_prefetch))
+			{
+				ReleaseBuffer(vacrel->vmbuffer_prefetch);
+				vacrel->vmbuffer_prefetch = InvalidBuffer;
 			}
 
 			/* Perform a round of index and heap vacuuming */
@@ -941,12 +1011,26 @@ lazy_scan_heap(LVRelState *vacrel)
 										 PROGRESS_VACUUM_PHASE_SCAN_HEAP);
 		}
 
+		/*
+		 * Normally, the fact that we can't skip this block must mean that
+		 * it's not all-visible.  But in an aggressive vacuum we know only
+		 * that it's not all-frozen, so it might still be all-visible.
+		 *
+		 * FIXME: deduplicate
+		 */
+		if (vacrel->aggressive &&
+			VM_ALL_VISIBLE(vacrel->rel, blkno, &vacrel->vmbuffer_data))
+			all_visible_according_to_vm = true;
+
 		/*
 		 * Pin the visibility map page in case we need to mark the page
 		 * all-visible.  In most cases this will be very cheap, because we'll
 		 * already have the correct page pinned anyway.
+		 *
+		 * FIXME: This needs to be updated based on the fact that there's now
+		 * two different pins.
 		 */
-		visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
+		visibilitymap_pin(vacrel->rel, blkno, &vacrel->vmbuffer_data);
 
 		/*
 		 * We need a buffer cleanup lock to prune HOT chains and defragment
@@ -954,9 +1038,6 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * a cleanup lock right away, we may be able to settle for reduced
 		 * processing using lazy_scan_noprune.
 		 */
-		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-								 vacrel->bstrategy);
-		page = BufferGetPage(buf);
 		if (!ConditionalLockBufferForCleanup(buf))
 		{
 			bool		hastup,
@@ -966,7 +1047,7 @@ lazy_scan_heap(LVRelState *vacrel)
 
 			/* Check for new or empty pages before lazy_scan_noprune call */
 			if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
-									   vmbuffer))
+									   vacrel->vmbuffer_data))
 			{
 				/* Processed as new/empty page (lock and pin released) */
 				continue;
@@ -1004,7 +1085,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		}
 
 		/* Check for new or empty pages before lazy_scan_prune call */
-		if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
+		if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vacrel->vmbuffer_data))
 		{
 			/* Processed as new/empty page (lock and pin released) */
 			continue;
@@ -1041,7 +1122,7 @@ lazy_scan_heap(LVRelState *vacrel)
 			{
 				Size		freespace;
 
-				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
+				lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vacrel->vmbuffer_data);
 
 				/* Forget the LP_DEAD items that we just vacuumed */
 				dead_items->num_items = 0;
@@ -1120,7 +1201,7 @@ lazy_scan_heap(LVRelState *vacrel)
 			PageSetAllVisible(page);
 			MarkBufferDirty(buf);
 			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, prunestate.visibility_cutoff_xid,
+							  vacrel->vmbuffer_data, prunestate.visibility_cutoff_xid,
 							  flags);
 		}
 
@@ -1131,11 +1212,11 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * with buffer lock before concluding that the VM is corrupt.
 		 */
 		else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
-				 visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0)
+				 visibilitymap_get_status(vacrel->rel, blkno, &vacrel->vmbuffer_data))
 		{
 			elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
 				 vacrel->relname, blkno);
-			visibilitymap_clear(vacrel->rel, blkno, vmbuffer,
+			visibilitymap_clear(vacrel->rel, blkno, vacrel->vmbuffer_data,
 								VISIBILITYMAP_VALID_BITS);
 		}
 
@@ -1160,7 +1241,7 @@ lazy_scan_heap(LVRelState *vacrel)
 				 vacrel->relname, blkno);
 			PageClearAllVisible(page);
 			MarkBufferDirty(buf);
-			visibilitymap_clear(vacrel->rel, blkno, vmbuffer,
+			visibilitymap_clear(vacrel->rel, blkno, vacrel->vmbuffer_data,
 								VISIBILITYMAP_VALID_BITS);
 		}
 
@@ -1171,7 +1252,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		 */
 		else if (all_visible_according_to_vm && prunestate.all_visible &&
 				 prunestate.all_frozen &&
-				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vmbuffer))
+				 !VM_ALL_FROZEN(vacrel->rel, blkno, &vacrel->vmbuffer_data))
 		{
 			/*
 			 * Avoid relying on all_visible_according_to_vm as a proxy for the
@@ -1193,7 +1274,7 @@ lazy_scan_heap(LVRelState *vacrel)
 			 */
 			Assert(!TransactionIdIsValid(prunestate.visibility_cutoff_xid));
 			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
-							  vmbuffer, InvalidTransactionId,
+							  vacrel->vmbuffer_data, InvalidTransactionId,
 							  VISIBILITYMAP_ALL_VISIBLE |
 							  VISIBILITYMAP_ALL_FROZEN);
 		}
@@ -1235,11 +1316,19 @@ lazy_scan_heap(LVRelState *vacrel)
 	}
 
 	vacrel->blkno = InvalidBlockNumber;
-	if (BufferIsValid(vmbuffer))
-		ReleaseBuffer(vmbuffer);
+	if (BufferIsValid(vacrel->vmbuffer_data))
+	{
+		ReleaseBuffer(vacrel->vmbuffer_data);
+		vacrel->vmbuffer_data = InvalidBuffer;
+	}
+	if (BufferIsValid(vacrel->vmbuffer_prefetch))
+	{
+		ReleaseBuffer(vacrel->vmbuffer_prefetch);
+		vacrel->vmbuffer_prefetch = InvalidBuffer;
+	}
 
 	/* report that everything is now scanned */
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, vacrel->rel_pages);
 
 	/* now we can compute the new value for pg_class.reltuples */
 	vacrel->new_live_tuples = vac_estimate_reltuples(vacrel->rel, rel_pages,
@@ -1254,6 +1343,8 @@ lazy_scan_heap(LVRelState *vacrel)
 		Max(vacrel->new_live_tuples, 0) + vacrel->recently_dead_tuples +
 		vacrel->missed_dead_tuples;
 
+	pg_streaming_read_free(pgsr);
+
 	/*
 	 * Do index vacuuming (call each index's ambulkdelete routine), then do
 	 * related heap vacuuming
@@ -1265,11 +1356,11 @@ lazy_scan_heap(LVRelState *vacrel)
 	 * Vacuum the remainder of the Free Space Map.  We must do this whether or
 	 * not there were indexes, and whether or not we bypassed index vacuuming.
 	 */
-	if (blkno > next_fsm_block_to_vacuum)
-		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
+	if (vacrel->rel_pages > next_fsm_block_to_vacuum)
+		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, vacrel->rel_pages);
 
 	/* report all blocks vacuumed */
-	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
+	pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, vacrel->rel_pages);
 
 	/* Do final index cleanup (call each index's amvacuumcleanup routine) */
 	if (vacrel->nindexes > 0 && vacrel->do_index_cleanup)
@@ -2398,6 +2489,48 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 	return allindexes;
 }
 
+typedef struct VacuumHeapBlockState
+{
+	BlockNumber blkno;
+	int			start_tupindex;
+	int			end_tupindex;
+} VacuumHeapBlockState;
+
+typedef struct VacuumHeapState
+{
+	Relation	relation;
+	LVRelState *vacrel;
+	BlockNumber last_block;
+	int			next_tupindex;
+} VacuumHeapState;
+
+static BlockNumber
+vacuum_heap_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private,
+					  void *per_buffer_data)
+{
+	VacuumHeapState *vhs = (VacuumHeapState *) pgsr_private;
+	VacDeadItems *dead_items = vhs->vacrel->dead_items;
+	VacuumHeapBlockState *bs = per_buffer_data;
+
+	if (vhs->next_tupindex == dead_items->num_items)
+		return InvalidBlockNumber;
+
+	bs->blkno = ItemPointerGetBlockNumber(&dead_items->items[vhs->next_tupindex]);
+	bs->start_tupindex = vhs->next_tupindex;
+	bs->end_tupindex = vhs->next_tupindex;
+
+	for (; vhs->next_tupindex < dead_items->num_items; vhs->next_tupindex++)
+	{
+		BlockNumber curblkno = ItemPointerGetBlockNumber(&dead_items->items[vhs->next_tupindex]);
+
+		if (bs->blkno != curblkno)
+			break;				/* past end of tuples for this block */
+		bs->end_tupindex = vhs->next_tupindex;
+	}
+
+	return bs->blkno;
+}
+
 /*
  *	lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
  *
@@ -2421,6 +2554,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 {
 	int			index = 0;
 	BlockNumber vacuumed_pages = 0;
+	VacuumHeapState vhs;
+	PgStreamingRead *pgsr;
 	Buffer		vmbuffer = InvalidBuffer;
 	LVSavedErrInfo saved_err_info;
 
@@ -2437,40 +2572,58 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 							 VACUUM_ERRCB_PHASE_VACUUM_HEAP,
 							 InvalidBlockNumber, InvalidOffsetNumber);
 
-	while (index < vacrel->dead_items->num_items)
+	vhs.relation = vacrel->rel;
+	vhs.vacrel = vacrel;
+	vhs.last_block = InvalidBlockNumber;
+	vhs.next_tupindex = 0;
+	pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_MAINTENANCE,
+										  &vhs,
+										  sizeof(VacuumHeapBlockState),
+										  vacrel->bstrategy,
+										  BMR_REL(vacrel->rel),
+										  MAIN_FORKNUM,
+										  vacuum_heap_pgsr_next);
+	while (true)
 	{
-		BlockNumber blkno;
-		Buffer		buf;
+		VacuumHeapBlockState *bs;
 		Page		page;
 		Size		freespace;
+		Buffer		buffer;
 
-		vacuum_delay_point();
+		buffer = pg_streaming_read_buffer_get_next(pgsr, (void **) &bs);
+		if (!BufferIsValid(buffer))
+			break;
 
-		blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
-		vacrel->blkno = blkno;
+		Assert(bs->blkno == BufferGetBlockNumber(buffer));
+		Assert(bs->start_tupindex == index);
+		vacrel->blkno = bs->blkno;
 
 		/*
 		 * Pin the visibility map page in case we need to mark the page
 		 * all-visible.  In most cases this will be very cheap, because we'll
 		 * already have the correct page pinned anyway.
 		 */
-		visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
+		visibilitymap_pin(vacrel->rel, bs->blkno, &vmbuffer);
 
 		/* We need a non-cleanup exclusive lock to mark dead_items unused */
-		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-								 vacrel->bstrategy);
-		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
-		index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		index = lazy_vacuum_heap_page(vacrel, bs->blkno, buffer, index,
+									  vmbuffer);
 
-		/* Now that we've vacuumed the page, record its available space */
-		page = BufferGetPage(buf);
+		/* Now that we've compacted the page, record its available space */
+		page = BufferGetPage(buffer);
 		freespace = PageGetHeapFreeSpace(page);
 
-		UnlockReleaseBuffer(buf);
-		RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
+		UnlockReleaseBuffer(buffer);
+		RecordPageWithFreeSpace(vacrel->rel, bs->blkno, freespace);
 		vacuumed_pages++;
+
+		Assert(bs->end_tupindex + 1 == index);
+		vacuum_delay_point();
 	}
 
+	pg_streaming_read_free(pgsr);
+
 	vacrel->blkno = InvalidBlockNumber;
 	if (BufferIsValid(vmbuffer))
 		ReleaseBuffer(vmbuffer);
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 346e10a3d2..a3db9ce752 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1485,13 +1485,9 @@ SELECT :io_sum_vac_strategy_after_reads > :io_sum_vac_strategy_before_reads;
  t
 (1 row)
 
-SELECT (:io_sum_vac_strategy_after_reuses + :io_sum_vac_strategy_after_evictions) >
-  (:io_sum_vac_strategy_before_reuses + :io_sum_vac_strategy_before_evictions);
- ?column? 
-----------
- t
-(1 row)
-
+-- TM: This breaks with streaming read: investigate!
+-- SELECT (:io_sum_vac_strategy_after_reuses + :io_sum_vac_strategy_after_evictions) >
+--   (:io_sum_vac_strategy_before_reuses + :io_sum_vac_strategy_before_evictions);
 RESET wal_skip_threshold;
 -- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
 -- BufferAccessStrategy, are tracked in pg_stat_io.
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index e3b4ca96e8..962d9d56db 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -744,8 +744,9 @@ SELECT pg_stat_force_next_flush();
 SELECT sum(reuses) AS reuses, sum(reads) AS reads, sum(evictions) AS evictions
   FROM pg_stat_io WHERE context = 'vacuum' \gset io_sum_vac_strategy_after_
 SELECT :io_sum_vac_strategy_after_reads > :io_sum_vac_strategy_before_reads;
-SELECT (:io_sum_vac_strategy_after_reuses + :io_sum_vac_strategy_after_evictions) >
-  (:io_sum_vac_strategy_before_reuses + :io_sum_vac_strategy_before_evictions);
+-- TM: This breaks with streaming read: investigate!
+-- SELECT (:io_sum_vac_strategy_after_reuses + :io_sum_vac_strategy_after_evictions) >
+--   (:io_sum_vac_strategy_before_reuses + :io_sum_vac_strategy_before_evictions);
 RESET wal_skip_threshold;
 
 -- Test that extends done by a CTAS, which uses a BAS_BULKWRITE
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c42558a0a3..68f9f1c0ac 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2957,6 +2957,8 @@ VacDeadItems
 VacErrPhase
 VacObjFilter
 VacOptValue
+VacuumHeapBlockState
+VacuumHeapState
 VacuumParams
 VacuumRelation
 VacuumStmt
-- 
2.39.2

v5-0009-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patchtext/x-patch; charset=US-ASCII; name=v5-0009-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patchDownload

From f86c4ed30676baa040d370877bc8361e0066e15f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 12 Feb 2021 14:19:10 -0800
Subject: [PATCH v5 09/10] WIP: Use streaming reads in nbtree vacuum scan.

XXX Cherry-picked from https://github.com/anarazel/postgres/tree/aio and
lightly modified by TM, for demonstration purposes.

Author: Andres Freund <andres@anarazel.de>
---
 src/backend/access/nbtree/nbtree.c | 74 ++++++++++++++++++++++++------
 src/include/access/nbtree.h        |  3 ++
 2 files changed, 64 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 696d79c085..2f9ff09c32 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -34,6 +34,7 @@
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
@@ -81,7 +82,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
 static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 IndexBulkDeleteCallback callback, void *callback_state,
 						 BTCycleId cycleid);
-static void btvacuumpage(BTVacState *vstate, BlockNumber scanblkno);
+static void btvacuumpage(BTVacState *vstate, Buffer buf, BlockNumber scanblkno);
 static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
 										  IndexTuple posting,
 										  OffsetNumber updatedoffset,
@@ -893,6 +894,24 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+static BlockNumber
+btvacuumscan_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private,
+					   void *per_buffer_data)
+{
+	BTVacState *vstate = (BTVacState *) pgsr_private;
+
+	if (vstate->nextblock == vstate->num_pages)
+		return InvalidBlockNumber;
+
+	/*
+	 * We can't use _bt_getbuf() here because it always applies
+	 * _bt_checkpage(), which will barf on an all-zero page. We want to
+	 * recycle all-zero pages, not fail.  Also, we want to use a nondefault
+	 * buffer access strategy. Also want to use AIO.
+	 */
+	return vstate->nextblock++;
+}
+
 /*
  * btvacuumscan --- scan the index for VACUUMing purposes
  *
@@ -986,6 +1005,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	scanblkno = BTREE_METAPAGE + 1;
 	for (;;)
 	{
+		PgStreamingRead *pgsr;
+
 		/* Get the current relation length */
 		if (needLock)
 			LockRelationForExtension(rel, ExclusiveLock);
@@ -1000,14 +1021,40 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		/* Quit if we've scanned the whole relation */
 		if (scanblkno >= num_pages)
 			break;
+
+		vstate.nextblock = scanblkno;
+		vstate.num_pages = num_pages;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_MAINTENANCE,
+											  &vstate,
+											  0,
+											  vstate.info->strategy,
+											  BMR_REL(vstate.info->index),
+											  MAIN_FORKNUM,
+											  btvacuumscan_pgsr_next);
+
 		/* Iterate over pages, then loop back to recheck length */
 		for (; scanblkno < num_pages; scanblkno++)
 		{
-			btvacuumpage(&vstate, scanblkno);
+			Buffer		buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
+
+			Assert(BufferIsValid(buf));
+			Assert(BufferGetBlockNumber(buf) == scanblkno);
+
+			btvacuumpage(&vstate, buf, scanblkno);
+
 			if (info->report_progress)
 				pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
 											 scanblkno);
+
+			/* call vacuum_delay_point while not holding any buffer lock */
+			vacuum_delay_point();
 		}
+
+		if (pg_streaming_read_buffer_get_next(pgsr, NULL) != 0)
+			elog(ERROR, "aio and plain pos out of sync");
+
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Set statistics num_pages field to final size of index */
@@ -1040,7 +1087,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
  * recycled (i.e. before the page split).
  */
 static void
-btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
+btvacuumpage(BTVacState *vstate, Buffer buf, BlockNumber scanblkno)
 {
 	IndexVacuumInfo *info = vstate->info;
 	IndexBulkDeleteResult *stats = vstate->stats;
@@ -1051,28 +1098,28 @@ btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
 	bool		attempt_pagedel;
 	BlockNumber blkno,
 				backtrack_to;
-	Buffer		buf;
 	Page		page;
 	BTPageOpaque opaque;
 
 	blkno = scanblkno;
 
+	Assert(BufferIsValid(buf));
 backtrack:
 
 	attempt_pagedel = false;
 	backtrack_to = P_NONE;
 
-	/* call vacuum_delay_point while not holding any buffer lock */
-	vacuum_delay_point();
-
 	/*
-	 * We can't use _bt_getbuf() here because it always applies
-	 * _bt_checkpage(), which will barf on an all-zero page. We want to
-	 * recycle all-zero pages, not fail.  Also, we want to use a nondefault
-	 * buffer access strategy.
+	 * If we're not backtracking we already read the page asynchronously. Not
+	 * using AIO when backtracking isn't too tragic - it's relatively rare,
+	 * and the buffer quite possibly is still in shared buffers.
 	 */
-	buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-							 info->strategy);
+	if (!BufferIsValid(buf))
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								 info->strategy);
+
+	Assert(BufferGetBlockNumber(buf) == blkno);
+
 	_bt_lockbuf(rel, buf, BT_READ);
 	page = BufferGetPage(buf);
 	opaque = NULL;
@@ -1359,6 +1406,7 @@ backtrack:
 	if (backtrack_to != P_NONE)
 	{
 		blkno = backtrack_to;
+		buf = InvalidBuffer;
 		goto backtrack;
 	}
 }
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 6eb162052e..2dbbbd36e7 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -343,6 +343,9 @@ typedef struct BTVacState
 	int			maxbufsize;		/* max bufsize that respects work_mem */
 	BTPendingFSM *pendingpages; /* One entry per newly deleted page */
 	int			npendingpages;	/* current # valid pendingpages */
+
+	BlockNumber nextblock;
+	BlockNumber num_pages;
 } BTVacState;
 
 /*
-- 
2.39.2

v5-0010-WIP-Use-streaming-reads-in-bitmap-heapscan.patchtext/x-patch; charset=US-ASCII; name=v5-0010-WIP-Use-streaming-reads-in-bitmap-heapscan.patchDownload

From c2c532ae8073187549261438e15f80ee37e9076f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 22 Aug 2023 16:48:30 +1200
Subject: [PATCH v5 10/10] WIP: Use streaming reads in bitmap heapscan.

XXX Cherry-picked from
https://github.com/melanieplageman/postgres/tree/bhs_aio and lightly
modified by TM, for demonstration purposes.

Author: Melanie Plageman <melanieplageman@gmail.com>
---
 src/backend/access/gin/ginget.c           |  17 +-
 src/backend/access/gin/ginscan.c          |   7 +
 src/backend/access/heap/heapam.c          |  71 +++
 src/backend/access/heap/heapam_handler.c  |  78 ++--
 src/backend/commands/explain.c            |  21 +-
 src/backend/executor/nodeBitmapHeapscan.c | 521 ++--------------------
 src/backend/nodes/tidbitmap.c             |  76 ++--
 src/include/access/heapam.h               |   7 +
 src/include/access/relscan.h              |   8 +
 src/include/access/tableam.h              |  71 ++-
 src/include/executor/nodeBitmapHeapscan.h |   1 +
 src/include/nodes/execnodes.h             |  39 +-
 src/include/nodes/tidbitmap.h             |   4 +-
 13 files changed, 284 insertions(+), 637 deletions(-)

diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 0b4f2ebadb..f238a3b071 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -373,7 +373,10 @@ restartScanEntry:
 			if (entry->matchBitmap)
 			{
 				if (entry->matchIterator)
+				{
 					tbm_end_iterate(entry->matchIterator);
+					pfree(entry->matchResult);
+				}
 				entry->matchIterator = NULL;
 				tbm_free(entry->matchBitmap);
 				entry->matchBitmap = NULL;
@@ -386,6 +389,9 @@ restartScanEntry:
 		if (entry->matchBitmap && !tbm_is_empty(entry->matchBitmap))
 		{
 			entry->matchIterator = tbm_begin_iterate(entry->matchBitmap);
+			entry->matchResult = (TBMIterateResult *)
+				palloc0(sizeof(TBMIterateResult) + MaxHeapTuplesPerPage *
+						sizeof(OffsetNumber));
 			entry->isFinished = false;
 		}
 	}
@@ -823,21 +829,24 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
 		{
 			/*
 			 * If we've exhausted all items on this block, move to next block
-			 * in the bitmap.
+			 * in the bitmap. tbm_iterate() sets matchResult->blockno to
+			 * InvalidBlockNumber when the bitmap is exhausted.
 			 */
-			while (entry->matchResult == NULL ||
+			while ((!BlockNumberIsValid(entry->matchResult->blockno)) ||
 				   (entry->matchResult->ntuples >= 0 &&
 					entry->offset >= entry->matchResult->ntuples) ||
 				   entry->matchResult->blockno < advancePastBlk ||
 				   (ItemPointerIsLossyPage(&advancePast) &&
 					entry->matchResult->blockno == advancePastBlk))
 			{
-				entry->matchResult = tbm_iterate(entry->matchIterator);
 
-				if (entry->matchResult == NULL)
+				tbm_iterate(entry->matchIterator, entry->matchResult);
+				if (!BlockNumberIsValid(entry->matchResult->blockno))
 				{
 					ItemPointerSetInvalid(&entry->curItem);
 					tbm_end_iterate(entry->matchIterator);
+					pfree(entry->matchResult);
+					entry->matchResult = NULL;
 					entry->matchIterator = NULL;
 					entry->isFinished = true;
 					break;
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index af24d38544..be27f9fe07 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -246,7 +246,14 @@ ginFreeScanKeys(GinScanOpaque so)
 		if (entry->list)
 			pfree(entry->list);
 		if (entry->matchIterator)
+		{
 			tbm_end_iterate(entry->matchIterator);
+			if (entry->matchResult)
+			{
+				pfree(entry->matchResult);
+				entry->matchResult = NULL;
+			}
+		}
 		if (entry->matchBitmap)
 			tbm_free(entry->matchBitmap);
 	}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 76a53f68b6..b174b5c675 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -226,6 +226,53 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+bitmapheap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+							void *per_buffer_data)
+{
+	TBMIterateResult *tbmres;
+	HeapScanDesc hdesc = (HeapScanDesc) pgsr_private;
+
+	tbmres = (TBMIterateResult *) per_buffer_data;
+
+	for (;;)
+	{
+		if (hdesc->rs_base.shared_tbmiterator)
+			tbm_shared_iterate(hdesc->rs_base.shared_tbmiterator, tbmres);
+		else
+			tbm_iterate(hdesc->rs_base.tbmiterator, tbmres);
+
+		if (!BlockNumberIsValid(tbmres->blockno))
+			return InvalidBlockNumber;
+
+		/*
+		 * Ignore any claimed entries past what we think is the end of the
+		 * relation. It may have been extended after the start of our scan (we
+		 * only hold an AccessShareLock, and it could be inserts from this
+		 * backend).  We don't take this optimization in SERIALIZABLE
+		 * isolation though, as we need to examine all invisible tuples
+		 * reachable by the index.
+		 */
+		if (!IsolationIsSerializable() && tbmres->blockno >= hdesc->rs_nblocks)
+			continue;
+
+		if (tbmres->ntuples >= 0)
+			hdesc->rs_base.exact_pages++;
+		else
+			hdesc->rs_base.lossy_pages++;
+
+		if (hdesc->rs_base.rs_flags & SO_CAN_SKIP_FETCH &&
+			!tbmres->recheck &&
+			VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno, &hdesc->vmbuffer))
+		{
+			hdesc->empty_tuples += tbmres->ntuples;
+			continue;
+		}
+
+		return tbmres->blockno;
+	}
+}
+
 static BlockNumber
 heap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
 					  void *per_buffer_data)
@@ -433,6 +480,11 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 		pg_streaming_read_free(scan->pgsr);
 		scan->pgsr = NULL;
 	}
+	if (BufferIsValid(scan->vmbuffer))
+	{
+		ReleaseBuffer(scan->vmbuffer);
+		scan->vmbuffer = InvalidBuffer;
+	}
 
 	/*
 	 * FIXME: This probably should be done in the !rs_inited blocks instead.
@@ -446,6 +498,19 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 		else
 			scan->pgsr = heap_pgsr_single_alloc(scan);
 	}
+	else if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
+	{
+		scan->pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+													scan,
+													offsetof(TBMIterateResult, offsets) + MaxHeapTuplesPerPage * sizeof(OffsetNumber),
+													scan->rs_strategy,
+													BMR_REL(scan->rs_base.rs_rd),
+													MAIN_FORKNUM,
+													bitmapheap_pgsr_next_single);
+
+
+		scan->rs_inited = true;
+	}
 }
 
 /*
@@ -1118,6 +1183,12 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_strategy = NULL;	/* set in initscan */
 
 	scan->pgsr = NULL;
+	scan->vmbuffer = InvalidBuffer;
+	scan->empty_tuples = 0;
+	scan->rs_base.exact_pages = 0;
+	scan->rs_base.lossy_pages = 0;
+	scan->rs_base.tbmiterator = NULL;
+	scan->rs_base.shared_tbmiterator = NULL;
 
 	/*
 	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6127f9d75a..d5e267f45f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "storage/streaming_read.h"
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
@@ -2110,38 +2111,55 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
  * Executor related callbacks for the heap AM
  * ------------------------------------------------------------------------
  */
-
 static bool
-heapam_scan_bitmap_next_block(TableScanDesc scan,
-							  TBMIterateResult *tbmres)
+heapam_scan_bitmap_next_block(TableScanDesc scan, bool *recheck)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
-	BlockNumber block = tbmres->blockno;
+	TBMIterateResult *tbmres;
+	void	   *io_private;
 	Buffer		buffer;
 	Snapshot	snapshot;
 	int			ntup;
 
-	hscan->rs_cindex = 0;
-	hscan->rs_ntuples = 0;
+	Assert(hscan->pgsr);
 
 	/*
-	 * Ignore any claimed entries past what we think is the end of the
-	 * relation. It may have been extended after the start of our scan (we
-	 * only hold an AccessShareLock, and it could be inserts from this
-	 * backend).  We don't take this optimization in SERIALIZABLE isolation
-	 * though, as we need to examine all invisible tuples reachable by the
-	 * index.
+	 * Release the buffer containing the previous block and reset the per-page
+	 * counters. Reset hscan->rs_ntuples here just to be safe.
 	 */
-	if (!IsolationIsSerializable() && block >= hscan->rs_nblocks)
-		return false;
+	if (BufferIsValid(hscan->rs_cbuf))
+	{
+		ReleaseBuffer(hscan->rs_cbuf);
+		hscan->rs_cbuf = InvalidBuffer;
+	}
+
+	hscan->rs_cindex = 0;
+	hscan->rs_ntuples = 0;
+
+	hscan->rs_cbuf = pg_streaming_read_buffer_get_next(hscan->pgsr, &io_private);
+
+	if (BufferIsInvalid(hscan->rs_cbuf))
+	{
+		if (BufferIsValid(hscan->vmbuffer))
+		{
+			ReleaseBuffer(hscan->vmbuffer);
+			hscan->vmbuffer = InvalidBuffer;
+		}
+
+		return hscan->empty_tuples > 0;
+	}
+
+	Assert(io_private);
+
+	tbmres = (TBMIterateResult *) io_private;
+
+	Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);
+
+	*recheck = tbmres->recheck;
+
+	hscan->rs_cblock = tbmres->blockno;
+	hscan->rs_ntuples = tbmres->ntuples;
 
-	/*
-	 * Acquire pin on the target heap page, trading in any pin we held before.
-	 */
-	hscan->rs_cbuf = ReleaseAndReadBuffer(hscan->rs_cbuf,
-										  scan->rs_rd,
-										  block);
-	hscan->rs_cblock = block;
 	buffer = hscan->rs_cbuf;
 	snapshot = scan->rs_snapshot;
 
@@ -2177,7 +2195,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 			ItemPointerData tid;
 			HeapTupleData heapTuple;
 
-			ItemPointerSet(&tid, block, offnum);
+			ItemPointerSet(&tid, tbmres->blockno, offnum);
 			if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
 									   &heapTuple, NULL, true))
 				hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
@@ -2205,7 +2223,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 			loctup.t_data = (HeapTupleHeader) PageGetItem(page, lp);
 			loctup.t_len = ItemIdGetLength(lp);
 			loctup.t_tableOid = scan->rs_rd->rd_id;
-			ItemPointerSet(&loctup.t_self, block, offnum);
+			ItemPointerSet(&loctup.t_self, tbmres->blockno, offnum);
 			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
 			if (valid)
 			{
@@ -2223,25 +2241,31 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 	Assert(ntup <= MaxHeapTuplesPerPage);
 	hscan->rs_ntuples = ntup;
 
-	return ntup > 0;
+	return true;
 }
 
 static bool
-heapam_scan_bitmap_next_tuple(TableScanDesc scan,
-							  TBMIterateResult *tbmres,
-							  TupleTableSlot *slot)
+heapam_scan_bitmap_next_tuple(TableScanDesc scan, TupleTableSlot *slot)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 	OffsetNumber targoffset;
 	Page		page;
 	ItemId		lp;
 
+	if (hscan->empty_tuples)
+	{
+		ExecStoreAllNullTuple(slot);
+		hscan->empty_tuples--;
+		return true;
+	}
+
 	/*
 	 * Out of range?  If so, nothing more to look at on this page
 	 */
 	if (hscan->rs_cindex < 0 || hscan->rs_cindex >= hscan->rs_ntuples)
 		return false;
 
+	Assert(BufferIsValid(hscan->rs_cbuf));
 	targoffset = hscan->rs_vistuples[hscan->rs_cindex];
 	page = BufferGetPage(hscan->rs_cbuf);
 	lp = PageGetItemId(page, targoffset);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 3d590a6b9f..7257308195 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/relscan.h"
 #include "access/xact.h"
 #include "catalog/pg_type.h"
 #include "commands/createas.h"
@@ -112,7 +113,7 @@ static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_memoize_info(MemoizeState *mstate, List *ancestors,
 							  ExplainState *es);
 static void show_hashagg_info(AggState *aggstate, ExplainState *es);
-static void show_tidbitmap_info(BitmapHeapScanState *planstate,
+static void show_tidbitmap_info(TableScanDesc tdesc,
 								ExplainState *es);
 static void show_instrumentation_count(const char *qlabel, int which,
 									   PlanState *planstate, ExplainState *es);
@@ -1827,7 +1828,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			if (es->analyze)
-				show_tidbitmap_info((BitmapHeapScanState *) planstate, es);
+				show_tidbitmap_info(((BitmapHeapScanState *) planstate)->ss.ss_currentScanDesc, es);
 			break;
 		case T_SampleScan:
 			show_tablesample(((SampleScan *) plan)->tablesample,
@@ -3416,25 +3417,25 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
  * If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
  */
 static void
-show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
+show_tidbitmap_info(TableScanDesc tdesc, ExplainState *es)
 {
 	if (es->format != EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainPropertyInteger("Exact Heap Blocks", NULL,
-							   planstate->exact_pages, es);
+							   tdesc->exact_pages, es);
 		ExplainPropertyInteger("Lossy Heap Blocks", NULL,
-							   planstate->lossy_pages, es);
+							   tdesc->lossy_pages, es);
 	}
 	else
 	{
-		if (planstate->exact_pages > 0 || planstate->lossy_pages > 0)
+		if (tdesc->exact_pages > 0 || tdesc->lossy_pages > 0)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "Heap Blocks:");
-			if (planstate->exact_pages > 0)
-				appendStringInfo(es->str, " exact=%ld", planstate->exact_pages);
-			if (planstate->lossy_pages > 0)
-				appendStringInfo(es->str, " lossy=%ld", planstate->lossy_pages);
+			if (tdesc->exact_pages > 0)
+				appendStringInfo(es->str, " exact=%ld", tdesc->exact_pages);
+			if (tdesc->lossy_pages > 0)
+				appendStringInfo(es->str, " lossy=%ld", tdesc->lossy_pages);
 			appendStringInfoChar(es->str, '\n');
 		}
 	}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c1e81ebed6..d94ded30f2 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -54,11 +54,6 @@
 
 static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
 static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
-static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
-												TBMIterateResult *tbmres);
-static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
-static inline void BitmapPrefetch(BitmapHeapScanState *node,
-								  TableScanDesc scan);
 static bool BitmapShouldInitializeSharedState(ParallelBitmapHeapState *pstate);
 
 
@@ -74,10 +69,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	ExprContext *econtext;
 	TableScanDesc scan;
 	TIDBitmap  *tbm;
-	TBMIterator *tbmiterator = NULL;
-	TBMSharedIterator *shared_tbmiterator = NULL;
-	TBMIterateResult *tbmres;
-	TupleTableSlot *slot;
+	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 	ParallelBitmapHeapState *pstate = node->pstate;
 	dsa_area   *dsa = node->ss.ps.state->es_query_dsa;
 
@@ -88,23 +80,10 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	slot = node->ss.ss_ScanTupleSlot;
 	scan = node->ss.ss_currentScanDesc;
 	tbm = node->tbm;
-	if (pstate == NULL)
-		tbmiterator = node->tbmiterator;
-	else
-		shared_tbmiterator = node->shared_tbmiterator;
-	tbmres = node->tbmres;
 
 	/*
 	 * If we haven't yet performed the underlying index scan, do it, and begin
 	 * the iteration over the bitmap.
-	 *
-	 * For prefetching, we use *two* iterators, one for the pages we are
-	 * actually scanning and another that runs ahead of the first for
-	 * prefetching.  node->prefetch_pages tracks exactly how many pages ahead
-	 * the prefetch iterator is.  Also, node->prefetch_target tracks the
-	 * desired prefetch distance, which starts small and increases up to the
-	 * node->prefetch_maximum.  This is to avoid doing a lot of prefetching in
-	 * a scan that stops after a few tuples because of a LIMIT.
 	 */
 	if (!node->initialized)
 	{
@@ -116,17 +95,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 				elog(ERROR, "unrecognized result from subplan");
 
 			node->tbm = tbm;
-			node->tbmiterator = tbmiterator = tbm_begin_iterate(tbm);
-			node->tbmres = tbmres = NULL;
-
-#ifdef USE_PREFETCH
-			if (node->prefetch_maximum > 0)
-			{
-				node->prefetch_iterator = tbm_begin_iterate(tbm);
-				node->prefetch_pages = 0;
-				node->prefetch_target = -1;
-			}
-#endif							/* USE_PREFETCH */
+			scan->tbmiterator = tbm_begin_iterate(tbm);
 		}
 		else
 		{
@@ -148,194 +117,59 @@ BitmapHeapNext(BitmapHeapScanState *node)
 				 * dsa_pointer of the iterator state which will be used by
 				 * multiple processes to iterate jointly.
 				 */
-				pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
-#ifdef USE_PREFETCH
-				if (node->prefetch_maximum > 0)
-				{
-					pstate->prefetch_iterator =
-						tbm_prepare_shared_iterate(tbm);
-
-					/*
-					 * We don't need the mutex here as we haven't yet woke up
-					 * others.
-					 */
-					pstate->prefetch_pages = 0;
-					pstate->prefetch_target = -1;
-				}
-#endif
+				pstate->tbmiterator =
+					tbm_prepare_shared_iterate(tbm);
 
 				/* We have initialized the shared state so wake up others. */
 				BitmapDoneInitializingSharedState(pstate);
 			}
 
 			/* Allocate a private iterator and attach the shared state to it */
-			node->shared_tbmiterator = shared_tbmiterator =
+			scan->shared_tbmiterator =
 				tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
-			node->tbmres = tbmres = NULL;
-
-#ifdef USE_PREFETCH
-			if (node->prefetch_maximum > 0)
-			{
-				node->shared_prefetch_iterator =
-					tbm_attach_shared_iterate(dsa, pstate->prefetch_iterator);
-			}
-#endif							/* USE_PREFETCH */
 		}
 		node->initialized = true;
+
+		/* Get the first block. if none, end of scan */
+		if (!table_scan_bitmap_next_block(scan, &node->recheck))
+			goto exit;
 	}
 
-	for (;;)
+	do
 	{
-		bool		skip_fetch;
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * Get next page of results if needed
-		 */
-		if (tbmres == NULL)
-		{
-			if (!pstate)
-				node->tbmres = tbmres = tbm_iterate(tbmiterator);
-			else
-				node->tbmres = tbmres = tbm_shared_iterate(shared_tbmiterator);
-			if (tbmres == NULL)
-			{
-				/* no more entries in the bitmap */
-				break;
-			}
-
-			BitmapAdjustPrefetchIterator(node, tbmres);
-
-			/*
-			 * We can skip fetching the heap page if we don't need any fields
-			 * from the heap, and the bitmap entries don't need rechecking,
-			 * and all tuples on the page are visible to our transaction.
-			 *
-			 * XXX: It's a layering violation that we do these checks above
-			 * tableam, they should probably moved below it at some point.
-			 */
-			skip_fetch = (node->can_skip_fetch &&
-						  !tbmres->recheck &&
-						  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-										 tbmres->blockno,
-										 &node->vmbuffer));
-
-			if (skip_fetch)
-			{
-				/* can't be lossy in the skip_fetch case */
-				Assert(tbmres->ntuples >= 0);
-
-				/*
-				 * The number of tuples on this page is put into
-				 * node->return_empty_tuples.
-				 */
-				node->return_empty_tuples = tbmres->ntuples;
-			}
-			else if (!table_scan_bitmap_next_block(scan, tbmres))
-			{
-				/* AM doesn't think this block is valid, skip */
-				continue;
-			}
-
-			if (tbmres->ntuples >= 0)
-				node->exact_pages++;
-			else
-				node->lossy_pages++;
-
-			/* Adjust the prefetch target */
-			BitmapAdjustPrefetchTarget(node);
-		}
-		else
+		while (table_scan_bitmap_next_tuple(scan, slot))
 		{
-			/*
-			 * Continuing in previously obtained page.
-			 */
-
-#ifdef USE_PREFETCH
-
-			/*
-			 * Try to prefetch at least a few pages even before we get to the
-			 * second page if we don't stop reading after the first tuple.
-			 */
-			if (!pstate)
-			{
-				if (node->prefetch_target < node->prefetch_maximum)
-					node->prefetch_target++;
-			}
-			else if (pstate->prefetch_target < node->prefetch_maximum)
-			{
-				/* take spinlock while updating shared state */
-				SpinLockAcquire(&pstate->mutex);
-				if (pstate->prefetch_target < node->prefetch_maximum)
-					pstate->prefetch_target++;
-				SpinLockRelease(&pstate->mutex);
-			}
-#endif							/* USE_PREFETCH */
-		}
+			CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * We issue prefetch requests *after* fetching the current page to try
-		 * to avoid having prefetching interfere with the main I/O. Also, this
-		 * should happen only when we have determined there is still something
-		 * to do on the current page, else we may uselessly prefetch the same
-		 * page we are just about to request for real.
-		 *
-		 * XXX: It's a layering violation that we do these checks above
-		 * tableam, they should probably moved below it at some point.
-		 */
-		BitmapPrefetch(node, scan);
-
-		if (node->return_empty_tuples > 0)
-		{
-			/*
-			 * If we don't have to fetch the tuple, just return nulls.
-			 */
-			ExecStoreAllNullTuple(slot);
+			if (!node->recheck)
+				return slot;
 
-			if (--node->return_empty_tuples == 0)
-			{
-				/* no more tuples to return in the next round */
-				node->tbmres = tbmres = NULL;
-			}
-		}
-		else
-		{
-			/*
-			 * Attempt to fetch tuple from AM.
-			 */
-			if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
-			{
-				/* nothing more to look at on this page */
-				node->tbmres = tbmres = NULL;
-				continue;
-			}
+			econtext->ecxt_scantuple = slot;
+			if (ExecQualAndReset(node->bitmapqualorig, econtext))
+				return slot;
 
-			/*
-			 * If we are using lossy info, we have to recheck the qual
-			 * conditions at every tuple.
-			 */
-			if (tbmres->recheck)
-			{
-				econtext->ecxt_scantuple = slot;
-				if (!ExecQualAndReset(node->bitmapqualorig, econtext))
-				{
-					/* Fails recheck, so drop it and loop back for another */
-					InstrCountFiltered2(node, 1);
-					ExecClearTuple(slot);
-					continue;
-				}
-			}
+			/* Fails recheck, so drop it and loop back for another */
+			InstrCountFiltered2(node, 1);
+			ExecClearTuple(slot);
 		}
+	} while (table_scan_bitmap_next_block(scan, &node->recheck));
 
-		/* OK to return this tuple */
-		return slot;
-	}
+exit:
 
 	/*
-	 * if we get here it means we are at the end of the scan..
+	 * Release iterator
 	 */
-	return ExecClearTuple(slot);
+	if (scan->tbmiterator)
+	{
+		tbm_end_iterate(scan->tbmiterator);
+		scan->tbmiterator = NULL;
+	}
+	else if (scan->shared_tbmiterator)
+	{
+		tbm_end_shared_iterate(scan->shared_tbmiterator);
+		scan->shared_tbmiterator = NULL;
+	}
+	return NULL;
 }
 
 /*
@@ -353,215 +187,6 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
 	ConditionVariableBroadcast(&pstate->cv);
 }
 
-/*
- *	BitmapAdjustPrefetchIterator - Adjust the prefetch iterator
- */
-static inline void
-BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
-							 TBMIterateResult *tbmres)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
-		if (node->prefetch_pages > 0)
-		{
-			/* The main iterator has closed the distance by one page */
-			node->prefetch_pages--;
-		}
-		else if (prefetch_iterator)
-		{
-			/* Do not let the prefetch iterator get behind the main one */
-			TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
-
-			if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
-				elog(ERROR, "prefetch and main iterators are out of sync");
-		}
-		return;
-	}
-
-	if (node->prefetch_maximum > 0)
-	{
-		TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
-		SpinLockAcquire(&pstate->mutex);
-		if (pstate->prefetch_pages > 0)
-		{
-			pstate->prefetch_pages--;
-			SpinLockRelease(&pstate->mutex);
-		}
-		else
-		{
-			/* Release the mutex before iterating */
-			SpinLockRelease(&pstate->mutex);
-
-			/*
-			 * In case of shared mode, we can not ensure that the current
-			 * blockno of the main iterator and that of the prefetch iterator
-			 * are same.  It's possible that whatever blockno we are
-			 * prefetching will be processed by another process.  Therefore,
-			 * we don't validate the blockno here as we do in non-parallel
-			 * case.
-			 */
-			if (prefetch_iterator)
-				tbm_shared_iterate(prefetch_iterator);
-		}
-	}
-#endif							/* USE_PREFETCH */
-}
-
-/*
- * BitmapAdjustPrefetchTarget - Adjust the prefetch target
- *
- * Increase prefetch target if it's not yet at the max.  Note that
- * we will increase it to zero after fetching the very first
- * page/tuple, then to one after the second tuple is fetched, then
- * it doubles as later pages are fetched.
- */
-static inline void
-BitmapAdjustPrefetchTarget(BitmapHeapScanState *node)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		if (node->prefetch_target >= node->prefetch_maximum)
-			 /* don't increase any further */ ;
-		else if (node->prefetch_target >= node->prefetch_maximum / 2)
-			node->prefetch_target = node->prefetch_maximum;
-		else if (node->prefetch_target > 0)
-			node->prefetch_target *= 2;
-		else
-			node->prefetch_target++;
-		return;
-	}
-
-	/* Do an unlocked check first to save spinlock acquisitions. */
-	if (pstate->prefetch_target < node->prefetch_maximum)
-	{
-		SpinLockAcquire(&pstate->mutex);
-		if (pstate->prefetch_target >= node->prefetch_maximum)
-			 /* don't increase any further */ ;
-		else if (pstate->prefetch_target >= node->prefetch_maximum / 2)
-			pstate->prefetch_target = node->prefetch_maximum;
-		else if (pstate->prefetch_target > 0)
-			pstate->prefetch_target *= 2;
-		else
-			pstate->prefetch_target++;
-		SpinLockRelease(&pstate->mutex);
-	}
-#endif							/* USE_PREFETCH */
-}
-
-/*
- * BitmapPrefetch - Prefetch, if prefetch_pages are behind prefetch_target
- */
-static inline void
-BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
-		if (prefetch_iterator)
-		{
-			while (node->prefetch_pages < node->prefetch_target)
-			{
-				TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
-				bool		skip_fetch;
-
-				if (tbmpre == NULL)
-				{
-					/* No more pages to prefetch */
-					tbm_end_iterate(prefetch_iterator);
-					node->prefetch_iterator = NULL;
-					break;
-				}
-				node->prefetch_pages++;
-
-				/*
-				 * If we expect not to have to actually read this heap page,
-				 * skip this prefetch call, but continue to run the prefetch
-				 * logic normally.  (Would it be better not to increment
-				 * prefetch_pages?)
-				 *
-				 * This depends on the assumption that the index AM will
-				 * report the same recheck flag for this future heap page as
-				 * it did for the current heap page; which is not a certainty
-				 * but is true in many cases.
-				 */
-				skip_fetch = (node->can_skip_fetch &&
-							  (node->tbmres ? !node->tbmres->recheck : false) &&
-							  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-											 tbmpre->blockno,
-											 &node->pvmbuffer));
-
-				if (!skip_fetch)
-					PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
-			}
-		}
-
-		return;
-	}
-
-	if (pstate->prefetch_pages < pstate->prefetch_target)
-	{
-		TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
-		if (prefetch_iterator)
-		{
-			while (1)
-			{
-				TBMIterateResult *tbmpre;
-				bool		do_prefetch = false;
-				bool		skip_fetch;
-
-				/*
-				 * Recheck under the mutex. If some other process has already
-				 * done enough prefetching then we need not to do anything.
-				 */
-				SpinLockAcquire(&pstate->mutex);
-				if (pstate->prefetch_pages < pstate->prefetch_target)
-				{
-					pstate->prefetch_pages++;
-					do_prefetch = true;
-				}
-				SpinLockRelease(&pstate->mutex);
-
-				if (!do_prefetch)
-					return;
-
-				tbmpre = tbm_shared_iterate(prefetch_iterator);
-				if (tbmpre == NULL)
-				{
-					/* No more pages to prefetch */
-					tbm_end_shared_iterate(prefetch_iterator);
-					node->shared_prefetch_iterator = NULL;
-					break;
-				}
-
-				/* As above, skip prefetch if we expect not to need page */
-				skip_fetch = (node->can_skip_fetch &&
-							  (node->tbmres ? !node->tbmres->recheck : false) &&
-							  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-											 tbmpre->blockno,
-											 &node->pvmbuffer));
-
-				if (!skip_fetch)
-					PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
-			}
-		}
-	}
-#endif							/* USE_PREFETCH */
-}
-
 /*
  * BitmapHeapRecheck -- access method routine to recheck a tuple in EvalPlanQual
  */
@@ -607,29 +232,10 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
 	table_rescan(node->ss.ss_currentScanDesc, NULL);
 
 	/* release bitmaps and buffers if any */
-	if (node->tbmiterator)
-		tbm_end_iterate(node->tbmiterator);
-	if (node->prefetch_iterator)
-		tbm_end_iterate(node->prefetch_iterator);
-	if (node->shared_tbmiterator)
-		tbm_end_shared_iterate(node->shared_tbmiterator);
-	if (node->shared_prefetch_iterator)
-		tbm_end_shared_iterate(node->shared_prefetch_iterator);
 	if (node->tbm)
 		tbm_free(node->tbm);
-	if (node->vmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->vmbuffer);
-	if (node->pvmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->pvmbuffer);
 	node->tbm = NULL;
-	node->tbmiterator = NULL;
-	node->tbmres = NULL;
-	node->prefetch_iterator = NULL;
 	node->initialized = false;
-	node->shared_tbmiterator = NULL;
-	node->shared_prefetch_iterator = NULL;
-	node->vmbuffer = InvalidBuffer;
-	node->pvmbuffer = InvalidBuffer;
 
 	ExecScanReScan(&node->ss);
 
@@ -663,20 +269,8 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
 	/*
 	 * release bitmaps and buffers if any
 	 */
-	if (node->tbmiterator)
-		tbm_end_iterate(node->tbmiterator);
-	if (node->prefetch_iterator)
-		tbm_end_iterate(node->prefetch_iterator);
 	if (node->tbm)
 		tbm_free(node->tbm);
-	if (node->shared_tbmiterator)
-		tbm_end_shared_iterate(node->shared_tbmiterator);
-	if (node->shared_prefetch_iterator)
-		tbm_end_shared_iterate(node->shared_prefetch_iterator);
-	if (node->vmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->vmbuffer);
-	if (node->pvmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->pvmbuffer);
 
 	/*
 	 * close heap scan
@@ -695,6 +289,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 {
 	BitmapHeapScanState *scanstate;
 	Relation	currentRelation;
+	bool		can_skip_fetch;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -714,32 +309,10 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->ss.ps.ExecProcNode = ExecBitmapHeapScan;
 
 	scanstate->tbm = NULL;
-	scanstate->tbmiterator = NULL;
-	scanstate->tbmres = NULL;
-	scanstate->return_empty_tuples = 0;
-	scanstate->vmbuffer = InvalidBuffer;
-	scanstate->pvmbuffer = InvalidBuffer;
-	scanstate->exact_pages = 0;
-	scanstate->lossy_pages = 0;
-	scanstate->prefetch_iterator = NULL;
-	scanstate->prefetch_pages = 0;
-	scanstate->prefetch_target = 0;
 	scanstate->pscan_len = 0;
 	scanstate->initialized = false;
-	scanstate->shared_tbmiterator = NULL;
-	scanstate->shared_prefetch_iterator = NULL;
 	scanstate->pstate = NULL;
 
-	/*
-	 * We can potentially skip fetching heap pages if we do not need any
-	 * columns of the table, either for checking non-indexable quals or for
-	 * returning data.  This test is a bit simplistic, as it checks the
-	 * stronger condition that there's no qual or return tlist at all.  But in
-	 * most cases it's probably not worth working harder than that.
-	 */
-	scanstate->can_skip_fetch = (node->scan.plan.qual == NIL &&
-								 node->scan.plan.targetlist == NIL);
-
 	/*
 	 * Miscellaneous initialization
 	 *
@@ -778,19 +351,22 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->bitmapqualorig =
 		ExecInitQual(node->bitmapqualorig, (PlanState *) scanstate);
 
+	scanstate->ss.ss_currentRelation = currentRelation;
+
 	/*
-	 * Maximum number of prefetches for the tablespace if configured,
-	 * otherwise the current value of the effective_io_concurrency GUC.
+	 * We can potentially skip fetching heap pages if we do not need any
+	 * columns of the table, either for checking non-indexable quals or for
+	 * returning data.  This test is a bit simplistic, as it checks the
+	 * stronger condition that there's no qual or return tlist at all.  But in
+	 * most cases it's probably not worth working harder than that.
 	 */
-	scanstate->prefetch_maximum =
-		get_tablespace_io_concurrency(currentRelation->rd_rel->reltablespace);
-
-	scanstate->ss.ss_currentRelation = currentRelation;
+	can_skip_fetch = (node->scan.plan.qual == NIL &&
+					  node->scan.plan.targetlist == NIL);
 
 	scanstate->ss.ss_currentScanDesc = table_beginscan_bm(currentRelation,
 														  estate->es_snapshot,
 														  0,
-														  NULL);
+														  NULL, can_skip_fetch);
 
 	/*
 	 * all done.
@@ -875,12 +451,9 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
 	pstate = shm_toc_allocate(pcxt->toc, node->pscan_len);
 
 	pstate->tbmiterator = 0;
-	pstate->prefetch_iterator = 0;
 
 	/* Initialize the mutex */
 	SpinLockInit(&pstate->mutex);
-	pstate->prefetch_pages = 0;
-	pstate->prefetch_target = 0;
 	pstate->state = BM_INITIAL;
 
 	ConditionVariableInit(&pstate->cv);
@@ -912,11 +485,7 @@ ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
 	if (DsaPointerIsValid(pstate->tbmiterator))
 		tbm_free_shared_area(dsa, pstate->tbmiterator);
 
-	if (DsaPointerIsValid(pstate->prefetch_iterator))
-		tbm_free_shared_area(dsa, pstate->prefetch_iterator);
-
 	pstate->tbmiterator = InvalidDsaPointer;
-	pstate->prefetch_iterator = InvalidDsaPointer;
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index 0f4850065f..c0f337c265 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -180,7 +180,6 @@ struct TBMIterator
 	int			spageptr;		/* next spages index */
 	int			schunkptr;		/* next schunks index */
 	int			schunkbit;		/* next bit to check in current schunk */
-	TBMIterateResult output;	/* MUST BE LAST (because variable-size) */
 };
 
 /*
@@ -221,7 +220,6 @@ struct TBMSharedIterator
 	PTEntryArray *ptbase;		/* pagetable element array */
 	PTIterationArray *ptpages;	/* sorted exact page index list */
 	PTIterationArray *ptchunks; /* sorted lossy page index list */
-	TBMIterateResult output;	/* MUST BE LAST (because variable-size) */
 };
 
 /* Local function prototypes */
@@ -695,8 +693,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
 	 * Create the TBMIterator struct, with enough trailing space to serve the
 	 * needs of the TBMIterateResult sub-struct.
 	 */
-	iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
-									  MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+	iterator = (TBMIterator *) palloc(sizeof(TBMIterator));
 	iterator->tbm = tbm;
 
 	/*
@@ -957,20 +954,21 @@ tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
 /*
  * tbm_iterate - scan through next page of a TIDBitmap
  *
- * Returns a TBMIterateResult representing one page, or NULL if there are
- * no more pages to scan.  Pages are guaranteed to be delivered in numerical
- * order.  If result->ntuples < 0, then the bitmap is "lossy" and failed to
- * remember the exact tuples to look at on this page --- the caller must
- * examine all tuples on the page and check if they meet the intended
- * condition.  If result->recheck is true, only the indicated tuples need
- * be examined, but the condition must be rechecked anyway.  (For ease of
- * testing, recheck is always set true when ntuples < 0.)
+ * Caller must pass in a TBMIterateResult to be filled.
+ *
+ * Pages are guaranteed to be delivered in numerical order.  tbmres->blockno is
+ * set to InvalidBlockNumber when there are no more pages to scan. If
+ * tbmres->ntuples < 0, then the bitmap is "lossy" and failed to remember the
+ * exact tuples to look at on this page --- the caller must examine all tuples
+ * on the page and check if they meet the intended condition.  If
+ * tbmres->recheck is true, only the indicated tuples need be examined, but the
+ * condition must be rechecked anyway.  (For ease of testing, recheck is always
+ * set true when ntuples < 0.)
  */
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)
 {
 	TIDBitmap  *tbm = iterator->tbm;
-	TBMIterateResult *output = &(iterator->output);
 
 	Assert(tbm->iterating == TBM_ITERATING_PRIVATE);
 
@@ -998,6 +996,7 @@ tbm_iterate(TBMIterator *iterator)
 	 * If both chunk and per-page data remain, must output the numerically
 	 * earlier page.
 	 */
+	Assert(tbmres);
 	if (iterator->schunkptr < tbm->nchunks)
 	{
 		PagetableEntry *chunk = tbm->schunks[iterator->schunkptr];
@@ -1008,11 +1007,11 @@ tbm_iterate(TBMIterator *iterator)
 			chunk_blockno < tbm->spages[iterator->spageptr]->blockno)
 		{
 			/* Return a lossy page indicator from the chunk */
-			output->blockno = chunk_blockno;
-			output->ntuples = -1;
-			output->recheck = true;
+			tbmres->blockno = chunk_blockno;
+			tbmres->ntuples = -1;
+			tbmres->recheck = true;
 			iterator->schunkbit++;
-			return output;
+			return;
 		}
 	}
 
@@ -1028,16 +1027,17 @@ tbm_iterate(TBMIterator *iterator)
 			page = tbm->spages[iterator->spageptr];
 
 		/* scan bitmap to extract individual offset numbers */
-		ntuples = tbm_extract_page_tuple(page, output);
-		output->blockno = page->blockno;
-		output->ntuples = ntuples;
-		output->recheck = page->recheck;
+		ntuples = tbm_extract_page_tuple(page, tbmres);
+		tbmres->blockno = page->blockno;
+		tbmres->ntuples = ntuples;
+		tbmres->recheck = page->recheck;
 		iterator->spageptr++;
-		return output;
+		return;
 	}
 
 	/* Nothing more in the bitmap */
-	return NULL;
+	tbmres->blockno = InvalidBlockNumber;
+	return;
 }
 
 /*
@@ -1047,10 +1047,9 @@ tbm_iterate(TBMIterator *iterator)
  *	across multiple processes.  We need to acquire the iterator LWLock,
  *	before accessing the shared members.
  */
-TBMIterateResult *
-tbm_shared_iterate(TBMSharedIterator *iterator)
+void
+tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres)
 {
-	TBMIterateResult *output = &iterator->output;
 	TBMSharedIteratorState *istate = iterator->state;
 	PagetableEntry *ptbase = NULL;
 	int		   *idxpages = NULL;
@@ -1101,13 +1100,13 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
 			chunk_blockno < ptbase[idxpages[istate->spageptr]].blockno)
 		{
 			/* Return a lossy page indicator from the chunk */
-			output->blockno = chunk_blockno;
-			output->ntuples = -1;
-			output->recheck = true;
+			tbmres->blockno = chunk_blockno;
+			tbmres->ntuples = -1;
+			tbmres->recheck = true;
 			istate->schunkbit++;
 
 			LWLockRelease(&istate->lock);
-			return output;
+			return;
 		}
 	}
 
@@ -1117,21 +1116,22 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
 		int			ntuples;
 
 		/* scan bitmap to extract individual offset numbers */
-		ntuples = tbm_extract_page_tuple(page, output);
-		output->blockno = page->blockno;
-		output->ntuples = ntuples;
-		output->recheck = page->recheck;
+		ntuples = tbm_extract_page_tuple(page, tbmres);
+		tbmres->blockno = page->blockno;
+		tbmres->ntuples = ntuples;
+		tbmres->recheck = page->recheck;
 		istate->spageptr++;
 
 		LWLockRelease(&istate->lock);
 
-		return output;
+		return;
 	}
 
 	LWLockRelease(&istate->lock);
 
 	/* Nothing more in the bitmap */
-	return NULL;
+	tbmres->blockno = InvalidBlockNumber;
+	return;
 }
 
 /*
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bc53f9f4d2..fdbbeb17c7 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -20,6 +20,7 @@
 #include "access/skey.h"
 #include "access/table.h"		/* for backward compatibility */
 #include "access/tableam.h"
+#include "nodes/execnodes.h"
 #include "nodes/lockoptions.h"
 #include "nodes/primnodes.h"
 #include "storage/bufpage.h"
@@ -76,9 +77,15 @@ typedef struct HeapScanDescData
 	struct PgStreamingRead *pgsr;
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
+
+	/*
+	 * MFIXME: not sure if vmbuffer is being released in the right place.
+	 */
+	Buffer		vmbuffer;		/* current VM buffer for BHS */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
 	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
+	int			empty_tuples;
 }			HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
 
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304a..cdb44316a3 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -16,6 +16,7 @@
 
 #include "access/htup_details.h"
 #include "access/itup.h"
+#include "nodes/tidbitmap.h"
 #include "port/atomics.h"
 #include "storage/buf.h"
 #include "storage/spin.h"
@@ -40,6 +41,13 @@ typedef struct TableScanDescData
 	ItemPointerData rs_mintid;
 	ItemPointerData rs_maxtid;
 
+	/* Only used for Bitmap table scans */
+	TBMIterator *tbmiterator;
+	TBMSharedIterator *shared_tbmiterator;
+	long		exact_pages;
+	long		lossy_pages;
+	int			empty_tuples;
+
 	/*
 	 * Information about type and behaviour of the scan, a bitmask of members
 	 * of the ScanOptions enum (see tableam.h).
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5f8474871d..b5626805a6 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -32,6 +32,7 @@ extern PGDLLIMPORT char *default_table_access_method;
 extern PGDLLIMPORT bool synchronize_seqscans;
 
 
+struct BitmapHeapScanState;
 struct BulkInsertStateData;
 struct IndexInfo;
 struct SampleScanState;
@@ -62,7 +63,8 @@ typedef enum ScanOptions
 
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
-}			ScanOptions;
+	SO_CAN_SKIP_FETCH = 1 << 10
+} ScanOptions;
 
 /*
  * Result codes for table_{update,delete,lock_tuple}, and for visibility
@@ -773,53 +775,36 @@ typedef struct TableAmRoutine
 	 */
 
 	/*
-	 * Prepare to fetch / check / return tuples from `tbmres->blockno` as part
-	 * of a bitmap table scan. `scan` was started via table_beginscan_bm().
-	 * Return false if there are no tuples to be found on the page, true
-	 * otherwise.
+	 * Prepare to fetch / check / return tuples obtained from the streaming
+	 * read helper as part of a bitmap table scan. `scan` was started via
+	 * table_beginscan_bm(). Return false if the relation is exhausted and
+	 * true otherwise.
 	 *
 	 * This will typically read and pin the target block, and do the necessary
 	 * work to allow scan_bitmap_next_tuple() to return tuples (e.g. it might
 	 * make sense to perform tuple visibility checks at this time). For some
-	 * AMs it will make more sense to do all the work referencing `tbmres`
-	 * contents here, for others it might be better to defer more work to
-	 * scan_bitmap_next_tuple.
+	 * AMs, it might be better to defer more work to scan_bitmap_next_tuple().
 	 *
-	 * If `tbmres->blockno` is -1, this is a lossy scan and all visible tuples
-	 * on the page have to be returned, otherwise the tuples at offsets in
-	 * `tbmres->offsets` need to be returned.
+	 * After examining the page from the TBMIterateResult returned by the
+	 * streaming read helper, most AMs will set relevant fields in the `scan`
+	 * for the benefit of scan_bitmap_next_tuple(). All AMs must set
+	 * `recheck`, passed in by the caller, based on the value in the
+	 * TBMIterateResult so that BitmapHeapNext() can determine whether or not
+	 * to yield tuples.
 	 *
-	 * XXX: Currently this may only be implemented if the AM uses md.c as its
-	 * storage manager, and uses ItemPointer->ip_blkid in a manner that maps
-	 * blockids directly to the underlying storage. nodeBitmapHeapscan.c
-	 * performs prefetching directly using that interface.  This probably
-	 * needs to be rectified at a later point.
-	 *
-	 * XXX: Currently this may only be implemented if the AM uses the
-	 * visibilitymap, as nodeBitmapHeapscan.c unconditionally accesses it to
-	 * perform prefetching.  This probably needs to be rectified at a later
-	 * point.
-	 *
-	 * Optional callback, but either both scan_bitmap_next_block and
-	 * scan_bitmap_next_tuple need to exist, or neither.
+	 * Optional callback, but scan_bitmap_setup, scan_bitmap_next_block, and
+	 * scan_bitmap_next_tuple need to exist, or none of them.
 	 */
-	bool		(*scan_bitmap_next_block) (TableScanDesc scan,
-										   struct TBMIterateResult *tbmres);
+	bool		(*scan_bitmap_next_block) (TableScanDesc scan, bool *recheck);
 
 	/*
 	 * Fetch the next tuple of a bitmap table scan into `slot` and return true
 	 * if a visible tuple was found, false otherwise.
 	 *
-	 * For some AMs it will make more sense to do all the work referencing
-	 * `tbmres` contents in scan_bitmap_next_block, for others it might be
-	 * better to defer more work to this callback.
-	 *
-	 * Optional callback, but either both scan_bitmap_next_block and
-	 * scan_bitmap_next_tuple need to exist, or neither.
+	 * Optional callback, but scan_bitmap_setup, scan_bitmap_next_block, and
+	 * scan_bitmap_next_tuple need to exist, or none of them.
 	 */
-	bool		(*scan_bitmap_next_tuple) (TableScanDesc scan,
-										   struct TBMIterateResult *tbmres,
-										   TupleTableSlot *slot);
+	bool		(*scan_bitmap_next_tuple) (TableScanDesc scan, TupleTableSlot *slot);
 
 	/*
 	 * Prepare to fetch tuples from the next block in a sample scan. Return
@@ -944,10 +929,13 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
  */
 static inline TableScanDesc
 table_beginscan_bm(Relation rel, Snapshot snapshot,
-				   int nkeys, struct ScanKeyData *key)
+				   int nkeys, struct ScanKeyData *key, bool can_skip_fetch)
 {
 	uint32		flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
 
+	if (can_skip_fetch)
+		flags |= SO_CAN_SKIP_FETCH;
+
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
 
@@ -1955,8 +1943,7 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
  * used after verifying the presence (at plan time or such).
  */
 static inline bool
-table_scan_bitmap_next_block(TableScanDesc scan,
-							 struct TBMIterateResult *tbmres)
+table_scan_bitmap_next_block(TableScanDesc scan, bool *recheck)
 {
 	/*
 	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1966,8 +1953,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
 	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
 		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
 
-	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
-														   tbmres);
+	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck);
 }
 
 /*
@@ -1979,9 +1965,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
  * returned false.
  */
 static inline bool
-table_scan_bitmap_next_tuple(TableScanDesc scan,
-							 struct TBMIterateResult *tbmres,
-							 TupleTableSlot *slot)
+table_scan_bitmap_next_tuple(TableScanDesc scan, TupleTableSlot *slot)
 {
 	/*
 	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
@@ -1992,7 +1976,6 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
 
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
-														   tbmres,
 														   slot);
 }
 
diff --git a/src/include/executor/nodeBitmapHeapscan.h b/src/include/executor/nodeBitmapHeapscan.h
index ea003a9caa..39ae7b29ef 100644
--- a/src/include/executor/nodeBitmapHeapscan.h
+++ b/src/include/executor/nodeBitmapHeapscan.h
@@ -28,5 +28,6 @@ extern void ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
 										  ParallelContext *pcxt);
 extern void ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
 										   ParallelWorkerContext *pwcxt);
+extern void table_bitmap_scan_setup(BitmapHeapScanState *scanstate);
 
 #endif							/* NODEBITMAPHEAPSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 561fdd98f1..535dfef6f0 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1675,11 +1675,7 @@ typedef enum
 /* ----------------
  *	 ParallelBitmapHeapState information
  *		tbmiterator				iterator for scanning current pages
- *		prefetch_iterator		iterator for prefetching ahead of current page
- *		mutex					mutual exclusion for the prefetching variable
- *								and state
- *		prefetch_pages			# pages prefetch iterator is ahead of current
- *		prefetch_target			current target prefetch distance
+ *		mutex					mutual exclusion for the state
  *		state					current state of the TIDBitmap
  *		cv						conditional wait variable
  *		phs_snapshot_data		snapshot data shared to workers
@@ -1688,10 +1684,7 @@ typedef enum
 typedef struct ParallelBitmapHeapState
 {
 	dsa_pointer tbmiterator;
-	dsa_pointer prefetch_iterator;
 	slock_t		mutex;
-	int			prefetch_pages;
-	int			prefetch_target;
 	SharedBitmapState state;
 	ConditionVariable cv;
 	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
@@ -1702,22 +1695,9 @@ typedef struct ParallelBitmapHeapState
  *
  *		bitmapqualorig	   execution state for bitmapqualorig expressions
  *		tbm				   bitmap obtained from child index scan(s)
- *		tbmiterator		   iterator for scanning current pages
- *		tbmres			   current-page data
- *		can_skip_fetch	   can we potentially skip tuple fetches in this scan?
- *		return_empty_tuples number of empty tuples to return
- *		vmbuffer		   buffer for visibility-map lookups
- *		pvmbuffer		   ditto, for prefetched pages
- *		exact_pages		   total number of exact pages retrieved
- *		lossy_pages		   total number of lossy pages retrieved
- *		prefetch_iterator  iterator for prefetching ahead of current page
- *		prefetch_pages	   # pages prefetch iterator is ahead of current
- *		prefetch_target    current target prefetch distance
- *		prefetch_maximum   maximum value for prefetch_target
+ *		recheck			   whether or not the tuple needs to be rechecked before returning
  *		pscan_len		   size of the shared memory for parallel bitmap
  *		initialized		   is node is ready to iterate
- *		shared_tbmiterator	   shared iterator
- *		shared_prefetch_iterator shared iterator for prefetching
  *		pstate			   shared state for parallel bitmap scan
  * ----------------
  */
@@ -1726,22 +1706,9 @@ typedef struct BitmapHeapScanState
 	ScanState	ss;				/* its first field is NodeTag */
 	ExprState  *bitmapqualorig;
 	TIDBitmap  *tbm;
-	TBMIterator *tbmiterator;
-	TBMIterateResult *tbmres;
-	bool		can_skip_fetch;
-	int			return_empty_tuples;
-	Buffer		vmbuffer;
-	Buffer		pvmbuffer;
-	long		exact_pages;
-	long		lossy_pages;
-	TBMIterator *prefetch_iterator;
-	int			prefetch_pages;
-	int			prefetch_target;
-	int			prefetch_maximum;
+	bool		recheck;
 	Size		pscan_len;
 	bool		initialized;
-	TBMSharedIterator *shared_tbmiterator;
-	TBMSharedIterator *shared_prefetch_iterator;
 	ParallelBitmapHeapState *pstate;
 } BitmapHeapScanState;
 
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index 1945f0639b..97e40097fd 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -64,8 +64,8 @@ extern bool tbm_is_empty(const TIDBitmap *tbm);
 
 extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
 extern dsa_pointer tbm_prepare_shared_iterate(TIDBitmap *tbm);
-extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
-extern TBMIterateResult *tbm_shared_iterate(TBMSharedIterator *iterator);
+extern void tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres);
+extern void tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres);
 extern void tbm_end_iterate(TBMIterator *iterator);
 extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
 extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
-- 
2.39.2

#21

Heikki Linnakangas

hlinnaka@iki.fi

about 2 years ago

In reply to: Thomas Munro (#20)

Re: Streaming I/O, vectored I/O (WIP)

On 10/01/2024 06:13, Thomas Munro wrote:

Bikeshedding call: I am open to better suggestions for the names
PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
grammatically clumsy.

How will these functions work in the brave new async I/O world? I assume
PrepareReadBuffer() will initiate the async I/O, and
CompleteReadBuffers() will wait for it to complete. How about
StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and
WaitBufferRead()?

About the signature of those functions: Does it make sense for
CompleteReadBuffers() (or WaitReadBuffers()) function to take a vector
of buffers? If StartReadBuffer() initiates the async I/O immediately, is
there any benefit to batching the waiting?

If StartReadBuffer() starts the async I/O, the idea that you can call
ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think
StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for
you in RBM_ZERO_* modes.

Putting all that together, I propose:

/*
* Initiate reading a block from disk to the buffer cache.
*
* XXX: Until we have async I/O, this just allocates the buffer in the
buffer
* cache. The actual I/O happens in WaitReadBuffer().
*/
Buffer
StartReadBuffer(BufferManagerRelation bmr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
ReadBufferMode mode,
bool *foundPtr);

/*
* Wait for a read that was started earlier with StartReadBuffer() to
finish.
*
* XXX: Until we have async I/O, this is the function that actually
performs
* the I/O. If multiple I/Os have been started with StartReadBuffer(), this
* will try to perform all of them in one syscall. Subsequent calls to
* WaitReadBuffer(), for those other buffers, will finish quickly.
*/
void
WaitReadBuffer(Buffer buf);

I'm not sure how well this fits with the streaming read API. The
streaming read code performs grouping of adjacent blocks to one
CompleteReadBuffers() call. If WaitReadBuffer() does the batching,
that's not really required. But does that make sense with async I/O?
With async I/O, will you need a vectorized version of StartReadBuffer() too?

--
Heikki Linnakangas
Neon (https://neon.tech)

#22

Thomas Munro

thomas.munro@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#21)

1 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

On Thu, Jan 11, 2024 at 8:58 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 10/01/2024 06:13, Thomas Munro wrote:

Bikeshedding call: I am open to better suggestions for the names
PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
grammatically clumsy.

How will these functions work in the brave new async I/O world? I assume
PrepareReadBuffer() will initiate the async I/O, and
CompleteReadBuffers() will wait for it to complete. How about
StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and
WaitBufferRead()?

What we have imagined so far is that the asynchronous version would
probably have three steps, like this:

* PrepareReadBuffer() -> pins one buffer, reports if found or IO/zeroing needed
* StartReadBuffers() -> starts the I/O for n contiguous !found buffers
* CompleteReadBuffers() -> waits, completes as necessary

In the proposed synchronous version, the middle step is missing, but
streaming_read.c directly calls smgrprefetch() instead. I thought
about shoving that inside a prosthetic StartReadBuffers() function,
but I backed out of simulating asynchronous I/O too fancifully. The
streaming read API is where we really want to stabilise a nice API, so
we can moves things around behind it if required.

A bit of analysis of the one block -> nblocks change and the
synchronous -> asynchronous change:

Two things are new in a world with nblocks > 1. (1) We needed to be
able to set BM_IO_IN_PROGRESS on more than one block at a time, but
commit 12f3867f already provided that, and (2) someone else might come
along and read a block in the middle of our range, effectively
chopping our range into subranges. That's true also in master but
when nblocks === 1 that's all-or-nothing, and now we have partial
cases. In the proposed synchronous code, CompleteReadBuffers() claims
as many contiguous BM_IO_IN_PROGRESS flags as it in the range, and
then loops process the rest, skipping over any blocks that are already
done. Further down in md.c, you might also cross a segment boundary.
So that's two different reasons while a single call to
CompleteReadBuffers() might finish up generating zero or more than one
I/O system call, though very often it's one.

Hmm, while spelling that out, I noticed an obvious problem and
improvement to make to that part of v5. If backend #1 is trying to
read blocks 101..103 and acquires BM_IO_IN_PROGRESS for 101, but
backend #2 comes along and starts reading block 102 first, backend
#1's StartBufferIO() call would wait for 102's I/O CV while it still
holds BM_IO_IN_PROGRESS for block 101, potentially blocking a third
backend #3 that wants to read block 101 even though no I/O is in
progress for that block yet! At least that's deadlock free (because
always in block order), but it seems like undesirable lock chaining.
Here is my proposed improvement: StartBufferIO() gains a nowait flag.
For the head block we wait, but while trying to build a larger range
we don't. We'll try 102 again in the next loop, with a wait. Here is
a small fixup for that.

In an asynchronous version, that BM_IO_IN_PROGRESS negotiation would
take place in StartReadBuffers() instead, which would be responsible
for kicking off asynchronous I/Os (= asking a background worker to
call pread(), or equivalent fancy kernel async APIs). One question is
what it does if it finds a block in the middle that chops the read up,
or for that matter a segment boundary. I don't think we have to
decide now, but the two options seem to be that it starts one single
I/O and reports its size, making it the client's problem to call again
with the rest, or that it starts more than one I/O and they are
somehow chained together so that the caller doesn't know about that
and can later wait for all of them to finish using just one <handle
thing>.

(The reason why I'm not 100% certain how it will look is that the
real, working code in the aio branch right now doesn't actually expose
a vector/nblocks bufmgr interface at all, yet. Andres's original
prototype had a single-block Start*(), Complete*() design, but a lower
level of the AIO system notices if pending read operations are
adjacent and could be merged. While discussing all this we decided it
was a bit strange to have lower code deal with allocating, chaining
and processing lots of separate I/O objects in shared memory, when
higher level code could often work in bigger ranges up front, and then
interact with the AIO subsystem with many fewer objects and steps.
Also, the present simple and lightweight synchronous proposal that
lacks the whole subsystem that could do that by magic.)

About the signature of those functions: Does it make sense for
CompleteReadBuffers() (or WaitReadBuffers()) function to take a vector
of buffers? If StartReadBuffer() initiates the async I/O immediately, is
there any benefit to batching the waiting?

If StartReadBuffer() starts the async I/O, the idea that you can call
ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think
StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for
you in RBM_ZERO_* modes.

Yeah, good thoughts, and topics that have occupied me for some time
now. I also thought that StartReadBuffer() should take
ReadBufferMode, but I came to the idea that it probably shouldn't like
this:

I started working on all this by trying to implement the most
complicated case I could imagine, streaming recovery, and then working
back to the easy cases that just do scans with RBM_NORMAL. In
recovery, we can predict that a block will be zeroed using WAL flags,
and pre-existing cross-checks at redo time that enforce that the flags
and redo code definitely agree on that, but we can't predict which
exact ReadBufferMode the redo code will use, RBM_ZERO_AND_LOCK or
RBM_ZERO_AND_CLEANUP_LOCK (or mode=RBM_NORMAL and
get_cleanup_lock=true, as the comment warns them not to, but I
digress).

That's OK, because we can't take locks while looking ahead in recovery
anyway (the redo routine carefully controls lock order/protocol), so
the code to actually do the locking needs to be somewhere near the
output end of the stream when the redo code calls
XLogReadBufferForRedoExtended(). But if you try to use RBM_XXX in
these interfaces, it begins to look pretty funny: the streaming
callback needs to be able to say which ReadBufferMode, but anywhere
near Prepare*(), Start*() or even Complete*() is too soon, so maybe we
need to invent a new value RBM_WILL_ZERO that doesn't yet say which of
the zero modes to use, and then the redo routine needs to pass in the
RBM_ZERO_AND_{LOCK,CLEANUP_LOCK} value to
XLogReadBufferForRedoExtended() and it does it in a separate step
anyway, so we are ignoring ReadBufferMode. But that feels just wrong
-- we'd be using RBM but implementing them only partially.

Another way to put it is that ReadBufferMode actually conflates a
bunch of different behaviour that applies at different times that we
are now separating, and recovery reveals this most clearly because it
doesn't have all the information needed while looking ahead. It might
be possible to shove more information in the WAL to fix the
information problem, but it seemed more natural to me to separate the
aspects of ReadBufferMode, because that isn't the only problem:
groupwise-processing of lock doesn't even make sense.

So I teased a couple of those aspects out into separate flags, for example:

The streaming read interface has two variants: the "simple" implicit
RBM_NORMAL, single relation, single fork version that is used in most
client examples and probably everything involving the executor, and
the "extended" version (like the earlier versions in this thread,
removed for now based on complaints that most early uses don't use it,
will bring it back separately later with streaming recovery patches).
In the extended version, the streaming callback can set *will_zero =
true, which is about all the info that recovery can figure out from
the WAL anyway, and then XLogReadBufferForRedoExtended() will later
call ZeroBuffer() because at that time we have the ReadBufferMode.

The _ZERO_ON_ERROR aspect is a case where CompleteReadBuffers() is the
right time and makes sense to process as a batch, so it becomes a
flag.

Putting all that together, I propose:

/*
* Initiate reading a block from disk to the buffer cache.
*
* XXX: Until we have async I/O, this just allocates the buffer in the
buffer
* cache. The actual I/O happens in WaitReadBuffer().
*/
Buffer
StartReadBuffer(BufferManagerRelation bmr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
ReadBufferMode mode,
bool *foundPtr);

/*
* Wait for a read that was started earlier with StartReadBuffer() to
finish.
*
* XXX: Until we have async I/O, this is the function that actually
performs
* the I/O. If multiple I/Os have been started with StartReadBuffer(), this
* will try to perform all of them in one syscall. Subsequent calls to
* WaitReadBuffer(), for those other buffers, will finish quickly.
*/
void
WaitReadBuffer(Buffer buf);

I'm confused about where the extra state lives that would allow the
communication required to build a larger I/O. In the AIO branch, it
does look a little more like that, but there is more magic state and
machinery hiding behind the curtain: the backend's pending I/O list
builds up a chain of I/Os, and when you try to wait, if it hasn't
already been submitted to the kernel/bgworkers yet it will be, and
before that merging will happen. So you get bigger I/Os without
having to say so explicitly.

For this synchronous version (and hopefully soon a more efficient
improved version in the AIO branch), we want to take advantage of the
client's pre-existing and more cheaply obtained knowledge of ranges.

I'm not sure how well this fits with the streaming read API. The
streaming read code performs grouping of adjacent blocks to one
CompleteReadBuffers() call. If WaitReadBuffer() does the batching,
that's not really required. But does that make sense with async I/O?

With async I/O, will you need a vectorized version of StartReadBuffer() too?

I think so, yes.

Attachments:

0001-fixup-Provide-vectored-variant-of-ReadBuffer.txttext/plain; charset=US-ASCII; name=0001-fixup-Provide-vectored-variant-of-ReadBuffer.txtDownload

From d7b09cab9faf93ce288dd8006c2202cc08897b67 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 11 Jan 2024 16:00:57 +1300
Subject: [PATCH] fixup! Provide vectored variant of ReadBuffer().

Avoid wait-chaining for BM_IO_IN_PROGRESS.
---
 src/backend/storage/buffer/bufmgr.c | 38 ++++++++++++++++++++---------
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d462ef7460..af23d3dda4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -501,7 +501,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -1160,7 +1160,7 @@ PrepareReadBuffer(BufferManagerRelation bmr,
 }
 
 static inline bool
-CompleteReadBuffersCanStartIO(Buffer buffer)
+CompleteReadBuffersCanStartIO(Buffer buffer, bool nowait)
 {
 	if (BufferIsLocal(buffer))
 	{
@@ -1169,7 +1169,7 @@ CompleteReadBuffersCanStartIO(Buffer buffer)
 		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
 	else
-		return StartBufferIO(GetBufferDescriptor(buffer - 1), true);
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
 }
 
 /*
@@ -1252,8 +1252,13 @@ CompleteReadBuffers(BufferManagerRelation bmr,
 		}
 #endif
 
-		/* Skip this block if someone else has already completed it. */
-		if (!CompleteReadBuffersCanStartIO(buffers[i]))
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!CompleteReadBuffersCanStartIO(buffers[i], false))
 		{
 			/*
 			 * Report this as a 'hit' for this backend, even though it must
@@ -1276,10 +1281,13 @@ CompleteReadBuffers(BufferManagerRelation bmr,
 
 		/*
 		 * How many neighboring-on-disk blocks can we can scatter-read into
-		 * other buffers at the same time?
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
 		 */
 		while ((i + 1) < nblocks &&
-			   CompleteReadBuffersCanStartIO(buffers[i + 1]))
+			   CompleteReadBuffersCanStartIO(buffers[i + 1], true))
 		{
 			/* Must be consecutive block numbers. */
 			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
@@ -2160,7 +2168,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2183,7 +2191,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -3580,7 +3588,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5359,9 +5367,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5374,6 +5388,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
-- 
2.39.2

#23

Heikki Linnakangas

hlinnaka@iki.fi

about 2 years ago

In reply to: Thomas Munro (#22)

Re: Streaming I/O, vectored I/O (WIP)

On 11/01/2024 05:19, Thomas Munro wrote:

On Thu, Jan 11, 2024 at 8:58 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 10/01/2024 06:13, Thomas Munro wrote:

Bikeshedding call: I am open to better suggestions for the names
PrepareReadBuffer() and CompleteReadBuffers(), they seem a little
grammatically clumsy.

How will these functions work in the brave new async I/O world? I assume
PrepareReadBuffer() will initiate the async I/O, and
CompleteReadBuffers() will wait for it to complete. How about
StartReadBuffer() and WaitReadBuffer()? Or StartBufferRead() and
WaitBufferRead()?

What we have imagined so far is that the asynchronous version would
probably have three steps, like this:

* PrepareReadBuffer() -> pins one buffer, reports if found or IO/zeroing needed
* StartReadBuffers() -> starts the I/O for n contiguous !found buffers
* CompleteReadBuffers() -> waits, completes as necessary

Ok. It feels surprising to have three steps. I understand that you need
two steps, one to start the I/O and another to wait for them to finish,
but why do you need separate Prepare and Start steps? What can you do in
between them? (You explained that. I'm just saying that that's my
initial reaction when seeing that API. It is surprising.)

If StartReadBuffer() starts the async I/O, the idea that you can call
ZeroBuffer() instead of WaitReadBuffer() doesn't work. I think
StartReadBuffer() needs to take ReadBufferMode, and do the zeroing for
you in RBM_ZERO_* modes.

Yeah, good thoughts, and topics that have occupied me for some time
now. I also thought that StartReadBuffer() should take
ReadBufferMode, but I came to the idea that it probably shouldn't like
this:

I started working on all this by trying to implement the most
complicated case I could imagine, streaming recovery, and then working
back to the easy cases that just do scans with RBM_NORMAL. In
recovery, we can predict that a block will be zeroed using WAL flags,
and pre-existing cross-checks at redo time that enforce that the flags
and redo code definitely agree on that, but we can't predict which
exact ReadBufferMode the redo code will use, RBM_ZERO_AND_LOCK or
RBM_ZERO_AND_CLEANUP_LOCK (or mode=RBM_NORMAL and
get_cleanup_lock=true, as the comment warns them not to, but I
digress).

That's OK, because we can't take locks while looking ahead in recovery
anyway (the redo routine carefully controls lock order/protocol), so
the code to actually do the locking needs to be somewhere near the
output end of the stream when the redo code calls
XLogReadBufferForRedoExtended(). But if you try to use RBM_XXX in
these interfaces, it begins to look pretty funny: the streaming
callback needs to be able to say which ReadBufferMode, but anywhere
near Prepare*(), Start*() or even Complete*() is too soon, so maybe we
need to invent a new value RBM_WILL_ZERO that doesn't yet say which of
the zero modes to use, and then the redo routine needs to pass in the
RBM_ZERO_AND_{LOCK,CLEANUP_LOCK} value to
XLogReadBufferForRedoExtended() and it does it in a separate step
anyway, so we are ignoring ReadBufferMode. But that feels just wrong
-- we'd be using RBM but implementing them only partially.

I see. When you're about to zero the page, there's not much point in
splitting the operation into Prepare/Start/Complete stages anyway.
You're not actually doing any I/O. Perhaps it's best to have a separate
"Buffer ZeroBuffer(Relation, ForkNumber, BlockNumber, lockmode)"
function that does the same as
ReadBuffer(RBM_ZERO_AND_[LOCK|CLEANUP_LOCK]) today.

The _ZERO_ON_ERROR aspect is a case where CompleteReadBuffers() is the
right time and makes sense to process as a batch, so it becomes a
flag.

Putting all that together, I propose:

/*
* Initiate reading a block from disk to the buffer cache.
*
* XXX: Until we have async I/O, this just allocates the buffer in the
buffer
* cache. The actual I/O happens in WaitReadBuffer().
*/
Buffer
StartReadBuffer(BufferManagerRelation bmr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
ReadBufferMode mode,
bool *foundPtr);

/*
* Wait for a read that was started earlier with StartReadBuffer() to
finish.
*
* XXX: Until we have async I/O, this is the function that actually
performs
* the I/O. If multiple I/Os have been started with StartReadBuffer(), this
* will try to perform all of them in one syscall. Subsequent calls to
* WaitReadBuffer(), for those other buffers, will finish quickly.
*/
void
WaitReadBuffer(Buffer buf);

I'm confused about where the extra state lives that would allow the
communication required to build a larger I/O. In the AIO branch, it
does look a little more like that, but there is more magic state and
machinery hiding behind the curtain: the backend's pending I/O list
builds up a chain of I/Os, and when you try to wait, if it hasn't
already been submitted to the kernel/bgworkers yet it will be, and
before that merging will happen. So you get bigger I/Os without
having to say so explicitly.

Yeah, I was thinking that there would be a global variable that holds a
list of operations started with StartReadBuffer().

For this synchronous version (and hopefully soon a more efficient
improved version in the AIO branch), we want to take advantage of the
client's pre-existing and more cheaply obtained knowledge of ranges.

Ok.

In an asynchronous version, that BM_IO_IN_PROGRESS negotiation would
take place in StartReadBuffers() instead, which would be responsible
for kicking off asynchronous I/Os (= asking a background worker to
call pread(), or equivalent fancy kernel async APIs). One question is
what it does if it finds a block in the middle that chops the read up,
or for that matter a segment boundary. I don't think we have to
decide now, but the two options seem to be that it starts one single
I/O and reports its size, making it the client's problem to call again
with the rest, or that it starts more than one I/O and they are
somehow chained together so that the caller doesn't know about that
and can later wait for all of them to finish using just one <handle
thing>.

(The reason why I'm not 100% certain how it will look is that the
real, working code in the aio branch right now doesn't actually expose
a vector/nblocks bufmgr interface at all, yet. Andres's original
prototype had a single-block Start*(), Complete*() design, but a lower
level of the AIO system notices if pending read operations are
adjacent and could be merged. While discussing all this we decided it
was a bit strange to have lower code deal with allocating, chaining
and processing lots of separate I/O objects in shared memory, when
higher level code could often work in bigger ranges up front, and then
interact with the AIO subsystem with many fewer objects and steps.
Also, the present simple and lightweight synchronous proposal that
lacks the whole subsystem that could do that by magic.)

Hmm, let's sketch out what this would look like with the approach that
you always start one I/O and report the size:

/*
* Initiate reading a range of blocks from disk to the buffer cache.
*
* If the pages were already found in the buffer cache, returns true.
* Otherwise false, and the caller must call WaitReadBufferRange() to
* wait for the I/O to finish, before accessing the buffers.
*
* 'buffers' is a caller-supplied array large enough to hold (endBlk -
* startBlk) buffers. It is filled with the buffers that the pages are
* read into.
*
* This always starts a read of at least one block, and tries to
* initiate one read I/O for the whole range if possible. But if the
* read cannot be performed as a single I/O, a partial read is started,
* and *endBlk is updated to reflect the range for which the read was
* started. The caller can make another call to read the rest of the
* range. A partial read can occur if some, but not all, of the pages
* are already in the buffer cache, or because the range crosses a
* segment boundary.
*
* XXX: Until we have async I/O, this just allocates the buffers in the
* buffer cache. And perhaps calls smgrprefetch(). The actual I/O
* happens in WaitReadBufferRange().
*/
bool
StartReadBufferRange(BufferManagerRelation bmr,
ForkNumber forkNum,
BlockNumber startBlk,
BlockNumber *endBlk,
BufferAccessStrategy strategy,
Buffer *buffers);

/*
* Wait for a read that was started earlier with StartReadBufferRange()
* to finish.
*
* XXX: Until we have async I/O, this is the function that actually
* performs
* the I/O. StartReadBufferRange already checked that the pages can be
* read in one preadv() syscall. However, it's possible that another
* backend performed the read for some of the pages in between. In that
* case this will perform multiple syscalls, after all.
*/
void
WaitReadBufferRange(Buffer *buffers, int nbuffers, bool zero_on_error);

/*
* Allocates a buffer for the given page in the buffer cache, and locks
* the page. No I/O is initiated. The caller must initialize it and mark
* the buffer dirty before releasing the lock.
*
* This is equivalent to ReadBuffer(RBM_ZERO_AND_LOCK) or
* ReadBuffer(RBM_ZERO_AND_CLEANUP_LOCK).
*/
Buffer
ZeroBuffer(BufferManagerRelation bmr,
ForkNumber forkNum,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool cleanup_lock);

This range-oriented API fits the callers pretty well: the streaming read
API works with block ranges already. There is no need for separate
Prepare and Start steps.

One weakness is that if StartReadBufferRange() finds that the range is
"chopped up", it needs to return and throw away the work it had to do to
look up the next buffer. So in the extreme case that every other block
in the range is in the buffer cache, each call would look up two buffers
in the buffer cache, startBlk and startBlk + 1, but only return one
buffer to the caller.

--
Heikki Linnakangas
Neon (https://neon.tech)

#24

Thomas Munro

thomas.munro@gmail.com

about 2 years ago

In reply to: Heikki Linnakangas (#23)

Re: Streaming I/O, vectored I/O (WIP)

On Fri, Jan 12, 2024 at 3:31 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Ok. It feels surprising to have three steps. I understand that you need
two steps, one to start the I/O and another to wait for them to finish,
but why do you need separate Prepare and Start steps? What can you do in
between them? (You explained that. I'm just saying that that's my
initial reaction when seeing that API. It is surprising.)

Actually I don't think I explained that very well. First, some more
detail about how a two-step version would work:

* we only want to start one I/O in an StartReadBuffers() call, because
otherwise it is hard/impossible for the caller to cap concurrency I/O
* therefore StartReadBuffers() can handle sequences matching /^H*I*H*/
("H" = hit, "I" = miss, I/O) in one call
* in the asynchronous version, "I" in that pattern means we got
BM_IO_IN_PROGRESS
* in the synchronous version, "I" means that it's not valid, not
BM_IN_IN_PROGRESS, but we won't actually try to get BM_IO_IN_PROGRESS
until the later Complete/Wait call (and then the answer might chagne,
but we'll just deal with that by looping in the synchronous version)
* streaming_read.c has to deal with buffering up work that
StartReadBuffers() didn't accept
* that's actually quite easy, you just use the rest to create a new
range in the next slot

Previously I thought the requirement to deal with buffering future
stuff that StartReadBuffers() couldn't accept yet was a pain, and life
became so much simpler once I deleted all that and exposed
PrepareReadBuffer() to the calling code. Perhaps I just hadn't done a
good enough job of that.

The other disadvantage you reminded me of was the duplicate buffer
lookup in certain unlucky patterns, which I had forgotten about in my
previous email. But I guess it's not fatal to the idea and there is a
potential partial mitigation. (See below).

A third thing was the requirement for communication between
StartReadBuffers() and CompleteReadBuffers() which I originally had an
"opaque" object that the caller has to keep around that held private
state. It seemed nice to go back to talking just about buffer
numbers, but that's not really an argument for anything...

OK, I'm going to try the two-step version (again) with interfaces
along the lines you sketched out... more soon.

I see. When you're about to zero the page, there's not much point in
splitting the operation into Prepare/Start/Complete stages anyway.
You're not actually doing any I/O. Perhaps it's best to have a separate
"Buffer ZeroBuffer(Relation, ForkNumber, BlockNumber, lockmode)"
function that does the same as
ReadBuffer(RBM_ZERO_AND_[LOCK|CLEANUP_LOCK]) today.

That makes sense, but... hmm, sometimes just allocating a page
generates I/O if it has to evict a dirty buffer. Nothing in this code
does anything fancy about that, but imagine some hypothetical future
thing that manages to do that asynchronously -- then we might want to
take advantage of the ability to stream even a zeroed page, ie doing
something ahead of time? Just a thought for another day, and perhaps
that is just an argument for including it in the streaming read API,
but it doesn't mean that the bufmgr.c API can't be as you say.

One weakness is that if StartReadBufferRange() finds that the range is
"chopped up", it needs to return and throw away the work it had to do to
look up the next buffer. So in the extreme case that every other block
in the range is in the buffer cache, each call would look up two buffers
in the buffer cache, startBlk and startBlk + 1, but only return one
buffer to the caller.

Yeah, right. This was one of the observations that influenced my
PrepareReadBuffer() three-step thing that I'd forgotten. To spell
that out with an example, suppose the buffer pool contains every odd
numbered block. Successive StartReadBuffers() calls would process
"HMHm", "MHm", "MHm"... where "m" represents a miss that we can't do
anything with for a block we'll look up in the buffer pool again in
the next call. With the PrepareReadBuffer() design, that miss just
starts a new range and we don't have to look it up again. Hmm, I
suppose that could be mitigated somewhat with ReadRecentBuffer() if we
can find somewhere decent to store it.

BTW it was while thinking about and testing cases like that that I
found Palak Chaturvedi's https://commitfest.postgresql.org/46/4426/
extremely useful. It can kick out every second page or any other
range-chopping scenario you can express in a WHERE clause. I would
quite like to get that tool into the tree...

#25

Nazir Bilal Yavuz

byavuz81@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#20)

Re: Streaming I/O, vectored I/O (WIP)

Hi,

Thanks for working on this!

On Wed, 10 Jan 2024 at 07:14, Thomas Munro <thomas.munro@gmail.com> wrote:

Thanks! I committed the patches up as far as smgr.c before the
holidays. The next thing to commit would be the bufmgr.c one for
vectored ReadBuffer(), v5-0001. Here's my response to your review of
that, which triggered quite a few changes.

See also new version of the streaming_read.c patch, with change list
at end. (I'll talk about v5-0002, the SMgrRelation lifetime one, over
on the separate thread about that where Heikki posted a better
version.)

I have a couple of comments / questions.

0001-Provide-vectored-variant-of-ReadBuffer:

- Do we need to pass the hit variable to ReadBuffer_common()? I think
it can be just declared in the ReadBuffer_common() now.

0003-Provide-API-for-streaming-reads-of-relations:

- Do we need to re-think about getting a victim buffer logic?
StrategyGetBuffer() function errors if it can not find any unpinned
buffers, this can be more common in the async world since we pin
buffers before completing the read (while looking ahead).

- If the returned block from the callback is an invalid block,
pg_streaming_read_look_ahead() sets pgsr->finished = true. Could there
be cases like the returned block being an invalid block but we should
continue to read after this invalid block?

- max_pinned_buffers and pinned_buffers_trigger variables are set in
the initialization part (in the
pg_streaming_read_buffer_alloc_internal() function) then they do not
change. In some cases there could be no acquirable buffers to pin
while initializing the pgsr (LimitAdditionalPins() set
max_pinned_buffers to 1) but while the read is continuing there could
be chances to create larger reads (other consecutive reads are
finished while this read is continuing). Do you think that trying to
reset max_pinned_buffers and pinned_buffers_trigger to have higher
values after the initialization to have larger reads make sense?

+        /* Is there a head range that we can't extend? */
+        head_range = &pgsr->ranges[pgsr->head];
+        if (head_range->nblocks > 0 &&
+            (!need_complete ||
+             !head_range->need_complete ||
+             head_range->blocknum + head_range->nblocks != blocknum))
+        {
+            /* Yes, time to start building a new one. */
+            head_range = pg_streaming_read_new_range(pgsr);

- I think if both need_complete and head_range->need_complete are
false, we can extend the head range regardless of the consecutiveness
of the blocks.

0006-Allow-streaming-reads-to-ramp-up-in-size:

- ramp_up_pin_limit variable is declared as an int but we do not check
the overflow while doubling it. This could be a problem in longer
reads.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#26

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Nazir Bilal Yavuz (#25)

Re: Streaming I/O, vectored I/O (WIP)

On Wed, Feb 7, 2024 at 11:54 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

0001-Provide-vectored-variant-of-ReadBuffer:

- Do we need to pass the hit variable to ReadBuffer_common()? I think
it can be just declared in the ReadBuffer_common() now.

Right, thanks! Done, in the version I'll post shortly.

0003-Provide-API-for-streaming-reads-of-relations:

- Do we need to re-think about getting a victim buffer logic?
StrategyGetBuffer() function errors if it can not find any unpinned
buffers, this can be more common in the async world since we pin
buffers before completing the read (while looking ahead).

Hmm, well that is what the pin limit machinery is supposed to be the
solution to. It has always been possible to see that error if
shared_buffers is too small, so that your backends can't pin what they
need to make progress because there are too many other backends, the
whole buffer pool is pinned and there is nothing available to
steal/evict. Here, sure, we pin more stuff per backend, but not more
than a "fair share", that is, Buffers / max backends, so it's not any
worse, is it? Well maybe it's marginally worse in some case, for
example if a query that uses many streams has one pinned buffer per
stream (which we always allow) where before we'd have acquired and
released pins in a slightly different sequence or whatever, but there
is always going to be a minimum shared_buffers that will work at all
for a given workload and we aren't changing it by much if at all here.
If you're anywhere near that limit, your performance must be so bad
that it'd only be a toy setting anyway. Does that sound reasonable?

Note that this isn't the first to use multi-pin logic or that limit
mechanism: that was the new extension code that shipped in 16. This
will do that more often, though.

- If the returned block from the callback is an invalid block,
pg_streaming_read_look_ahead() sets pgsr->finished = true. Could there
be cases like the returned block being an invalid block but we should
continue to read after this invalid block?

Yeah, I think there will be, and I think we should do it with some
kind of reset/restart function. I don't think we need it for the
current users so I haven't included it yet (there is a maybe-related
discussion about reset for another reasons, I think Melanie has an
idea about that), but I think something like that will useful for
future stuff like streaming recovery, where you can run out of WAL to
read but more will come via the network soon.

- max_pinned_buffers and pinned_buffers_trigger variables are set in
the initialization part (in the
pg_streaming_read_buffer_alloc_internal() function) then they do not
change. In some cases there could be no acquirable buffers to pin
while initializing the pgsr (LimitAdditionalPins() set
max_pinned_buffers to 1) but while the read is continuing there could
be chances to create larger reads (other consecutive reads are
finished while this read is continuing). Do you think that trying to
reset max_pinned_buffers and pinned_buffers_trigger to have higher
values after the initialization to have larger reads make sense?

That sounds hard! You're right that in the execution of a query there
might well be cases like that (inner and outer scan of a hash join
don't actually run at the same time, likewise for various other plan
shapes), and something that would magically and dynamically balance
resource usage might be ideal, but I don't know where to begin.
Concretely, as long as your buffer pool is measured in gigabytes and
your max backends is measured in hundreds, the per backend pin limit
should actually be fairly hard to hit anyway, as it would be in the
thousands. So I don't think it is as important as other resource
usage balance problems that we also don't attempt (memory, CPU, I/O
bandwidth).

+        /* Is there a head range that we can't extend? */
+        head_range = &pgsr->ranges[pgsr->head];
+        if (head_range->nblocks > 0 &&
+            (!need_complete ||
+             !head_range->need_complete ||
+             head_range->blocknum + head_range->nblocks != blocknum))
+        {
+            /* Yes, time to start building a new one. */
+            head_range = pg_streaming_read_new_range(pgsr);

- I think if both need_complete and head_range->need_complete are
false, we can extend the head range regardless of the consecutiveness
of the blocks.

Yeah, I think we can experiment with ideas like that. Not done yet
but I'm thinking about it -- more shortly.

0006-Allow-streaming-reads-to-ramp-up-in-size:

- ramp_up_pin_limit variable is declared as an int but we do not check
the overflow while doubling it. This could be a problem in longer
reads.

But it can't get above very high, because eventually it exceeds
max_pinned_buffers, which is anchored to the ground by various small
limits.

#27

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#24)

6 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

On Fri, Jan 12, 2024 at 12:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Jan 12, 2024 at 3:31 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Ok. It feels surprising to have three steps. I understand that you need
two steps, one to start the I/O and another to wait for them to finish,
but why do you need separate Prepare and Start steps? What can you do in
between them? (You explained that. I'm just saying that that's my
initial reaction when seeing that API. It is surprising.)

[...]

OK, I'm going to try the two-step version (again) with interfaces
along the lines you sketched out... more soon.

Here's the 2 step version. The streaming_read.c API is unchanged, but
the bugmfr.c API now has only the following extra functions:

bool StartReadBuffers(..., int *nblocks, ..., ReadBuffersOperation *op)
WaitReadBuffers(ReadBuffersOperation *op)

That is, the PrepareReadBuffer() step is gone.

StartReadBuffers() updates *nblocks to the number actually processed,
which is always at least one. If it returns true, then you must call
WaitReadBuffers(). When it finds a 'hit' (doesn't need I/O), then
that one final (or only) buffer is processed, but no more.
StartReadBuffers() always conceptually starts 0 or 1 I/Os. Example:
if you ask for 16 blocks, and it finds two misses followed by a hit,
it'll set *nblocks = 3, smgrprefetch(2 blocks), and smgrreadv(2
blocks) them in WaitReadBuffer(). The caller can't really tell that
the third block was a hit. The only case it can distinguish is if the
first one was a hit, and then it returns false and sets *nblocks = 1.

This arrangement, where the results include the 'boundary' block that
ends the readable range, avoids the double-lookup problem we discussed
upthread. I think it should probably also be able to handle multiple
consecutive 'hits' at the start of a sequence, but in this version I
kept it simpler. It couldn't ever handle more than one after an I/O
range though, because it can't guess if the one after will be a hit or
a miss. If it turned out to be a miss, we don't want to start a
second I/O, so unless we decide that we're happy unpinning and
re-looking-up next time, it's better to give up then. Hence the idea
of including the hit as a bonus block on the end.

It took me a long time but I eventually worked my way around to
preferring this way over the 3 step version. streaming_read.c now has
to do a bit more work including sometimes 'ungetting' a block (ie
deferring one that the callback has requested until next time), to
resolve some circularities that come up with flow control. But I
suspect you'd probably finish up having to deal with 'short' writes
anyway, because in the asynchronous future, in a three-step version,
the StartReadBuffers() (as 2nd step) might also be short when it fails
to get enough BM_IO_IN_PROGRESS flags, so you have to deal with some
version of these problems anyway. Thoughts?

I am still thinking about how to improve the coding in
streaming_read.c, ie to simplify and beautify the main control loop
and improve the flow control logic. And looking for interesting test
cases to hit various conditions in it and try to break it. And trying
to figure out how this read-coalescing and parallel seq scan's block
allocator might interfere with each other to produce non-idea patterns
of system calls.

Here are some example strace results generated by a couple of simple
queries. See CF #4426 for pg_buffercache_invalidate().

=== Sequential scan example ===

create table big as select generate_series(1, 10000000);

select count(*) from big;

pread64(81, ...) = 8192 <-- starts small
pread64(81, ...) = 16384
pread64(81, ...) = 32768
pread64(81, ...) = 65536
pread64(81, ...) = 131072 <-- fully ramped up size reached
preadv(81, ...) = 131072 <-- more Vs seen as buffers fill up/fragments
preadv(81, ...) = 131072
...repeating...
preadv(81, ...) = 131072
preadv(81, ...) = 131072
pread64(81, ...) = 8192 <-- end fragment

create table small as select generate_series(1, 100000);

select bool_and(pg_buffercache_invalidate(bufferid))
from pg_buffercache
where relfilenode = pg_relation_filenode('small')
and relblocknumber % 3 != 0; -- <-- kick out every 3rd block

select count(*) from small;

preadv(88, ...) = 16384 <-- just the 2-block fragments we need to load
preadv(88, ...) = 16384
preadv(88, ...) = 16384

=== Bitmap heapscan example ===

create table heap (i int primary key);
insert into heap select generate_series(1, 1000000);

select bool_and(pg_buffercache_invalidate(bufferid))
from pg_buffercache
where relfilenode = pg_relation_filenode('heap');

select count(i) from heap where i in (10, 1000, 10000, 100000) or i in
(20, 200, 2000, 200000);

pread64(75, ..., 8192, 0) = 8192
fadvise64(75, 32768, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(75, 65536, 8192, POSIX_FADV_WILLNEED) = 0
pread64(75, ..., 8192, 32768) = 8192
pread64(75, ..., 8192, 65536) = 8192
fadvise64(75, 360448, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(75, 3620864, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(75, 7241728, 8192, POSIX_FADV_WILLNEED) = 0
pread64(75, ..., 8192, 360448) = 8192
pread64(75, ..., 8192, 3620864) = 8192
pread64(75, ..., 8192, 7241728) = 8192

Attachments:

v5-0001-Provide-vectored-variant-of-ReadBuffer.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Provide-vectored-variant-of-ReadBuffer.patchDownload

From b5b766176d1d4a993af545a11b9e0ef848e9abcb Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v5 1/7] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

With some more infrastructure in later work, StartReadBuffers() should
be extended to start real asynchronous I/O instead of advice.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reivewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 641 ++++++++++++++++++--------
 src/backend/storage/buffer/localbuf.c |  14 +-
 src/include/storage/bufmgr.h          |  45 ++
 src/tools/pgindent/typedefs.list      |   1 +
 4 files changed, 492 insertions(+), 209 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bdf89bbc4dc..3b1b0ad99df 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -472,10 +477,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -501,7 +505,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -782,7 +786,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -795,15 +798,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -823,13 +820,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -995,35 +991,68 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
+/*
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
+ */
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * ReadBuffer_common -- common logic for all ReadBuffer variants
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+				  BufferAccessStrategy strategy)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			nblocks;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1042,181 +1071,404 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	nblocks = 1;
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
+	else
+		flags = 0;
+	if (StartReadBuffers(bmr,
+						 &buffer,
+						 forkNum,
+						 blockNum,
+						 &nblocks,
+						 strategy,
+						 flags,
+						 &operation))
+		WaitReadBuffers(&operation);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+		ZeroBuffer(buffer, mode);
+
+	return buffer;
+}
+
+static Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
 
+	Assert(bmr.smgr);
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+	return BufferDescriptorGetBuffer(bufHdr);
+}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.
+ *
+ * If false is returned, no I/O is necessary and WaitReadBuffers() is not
+ * necessary.  If true is returned, one I/O has been started, and
+ * WaitReadBuffers() must be called with the same operation object before the
+ * buffers are accessed.  Along with the operation object, the caller-supplied
+ * array of buffers must remain valid until WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+bool
+StartReadBuffers(BufferManagerRelation bmr,
+				 Buffer *buffers,
+				 ForkNumber forkNum,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 BufferAccessStrategy strategy,
+				 int flags,
+				 ReadBuffersOperation *operation)
+{
+	int			actual_nblocks = *nblocks;
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
 	}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	operation->bmr = bmr;
+	operation->forknum = forkNum;
+	operation->blocknum = blockNum;
+	operation->buffers = buffers;
+	operation->nblocks = actual_nblocks;
+	operation->strategy = strategy;
+	operation->flags = flags;
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	operation->io_buffers_len = 0;
 
-	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
-	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
-	else
+	for (int i = 0; i < actual_nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		bool		found;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		buffers[i] = PrepareReadBuffer(bmr,
+									   forkNum,
+									   blockNum + i,
+									   strategy,
+									   &found);
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = operation->nblocks = *nblocks = i + 1;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			operation->io_buffers_len++;
+		}
+	}
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+	if (operation->io_buffers_len > 0)
+	{
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
-			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
-			}
-			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
+			/*
+			 * In theory we should only do this if PrepareReadBuffers() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
 		}
+
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
 	}
+	else
+	{
+		return false;
+	}
+}
 
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
 	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	/*
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
+	 */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+	bmr = operation->bmr;
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
-
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
 	}
 	else
 	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
 	}
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+	/*
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
+	 */
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
+	else
+		pgBufferUsage.shared_blks_read += nblocks;
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+	for (int i = 0; i < nblocks; ++i)
+	{
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
+
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
+		{
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
+			{
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
+			}
+			else
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
+
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
+
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
+
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
+
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for StartReadBuffers.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1224,11 +1476,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1287,19 +1535,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1364,19 +1603,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1408,15 +1638,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1770,7 +1994,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2035,7 +2259,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2058,7 +2282,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2373,7 +2597,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2382,7 +2611,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3450,7 +3679,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5185,9 +5414,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5200,6 +5435,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1f02fed250e..6956d4e5b49 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..b57f71f97e3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers().  Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int			nblocks;
+	BufferAccessStrategy strategy;
+	int			flags;
+
+	/* Range of buffers, if we need to perform a read. */
+	int			io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffers(BufferManagerRelation bmr,
+							 Buffer *buffers,
+							 ForkNumber forknum,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 BufferAccessStrategy strategy,
+							 int flags,
+							 ReadBuffersOperation *operation);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +292,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fc8b15d0cf2..6deb18234ac 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2267,6 +2267,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.39.2

v5-0002-Provide-API-for-streaming-reads-of-relations.patchtext/x-patch; charset=US-ASCII; name=v5-0002-Provide-API-for-streaming-reads-of-relations.patchDownload

From 64c22a27f1aa9dc2d6143612e0b9b19c5afb2b37 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Feb 2024 00:01:42 +1300
Subject: [PATCH v5 2/7] Provide API for "streaming" reads of relations.

"Streaming reads" can be used as a more efficient replacement for
a series of ReadBuffer() calls.

The client code supplies a callback that can say which block to read
next, and then consumes individual buffers one at a time.  This division
allows streaming_read.c to build up large calls to StartReadBuffers(),
and issue fadvise() advice about future random reads in a systematic
way.

This API is based on an idea proposed by Andres Freund, to pave the way
for asynchronous I/O in future work as required to support direct I/O.
The main aim is to create an abstraction that insulates client code from
future improvements to the I/O subsystem.

An extended API may be necessary in future for more complicated cases
(for example recovery), but this basic API is thought to be sufficient
for many common usage patterns involving predictable access to a single
relation fork.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 612 +++++++++++++++++++++++
 src/backend/storage/meson.build          |   1 +
 src/include/storage/streaming_read.h     |  52 ++
 src/tools/pgindent/typedefs.list         |   2 +
 7 files changed, 687 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..71f2c4a70b6
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,612 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ */
+typedef struct PgStreamingReadRange
+{
+	bool		need_wait;
+	bool		advice_issued;
+	BlockNumber blocknum;
+	int			nblocks;
+	int			per_buffer_data_index;
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+	ReadBuffersOperation operation;
+} PgStreamingReadRange;
+
+/*
+ * Streaming read object.
+ */
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	int			ramp_up_pin_limit;
+	int			ramp_up_pin_stall;
+	bool		finished;
+	bool		advice_enabled;
+	void	   *pgsr_private;
+	PgStreamingReadBufferCB callback;
+
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	/* Sometimes we need to buffer one block for flow control. */
+	BlockNumber unget_blocknum;
+	void	   *unget_per_buffer_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Circular buffer of ranges. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+										void *pgsr_private,
+										size_t per_buffer_data_size,
+										BufferAccessStrategy strategy)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_ios;
+	uint32		max_pinned_buffers;
+
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice to the kernel to tell it that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	if (flags & PGSR_FLAG_MAINTENANCE)
+		max_ios = maintenance_io_concurrency;
+	else
+		max_ios = effective_io_concurrency;
+
+	/*
+	 * The desired level of I/O concurrency controls how far ahead we are
+	 * willing to look ahead.  We also clamp it to at least
+	 * MAX_BUFFER_PER_TRANFER so that we can have a chance to build up a full
+	 * sized read, even when max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+	/*
+	 * The *_io_concurrency GUCs might be set to 0, but we want to allow at
+	 * least one, to keep our gating logic simple.
+	 */
+	max_ios = Max(max_ios, 1);
+
+	/*
+	 * Don't allow this backend to pin too many buffers.  For now we'll apply
+	 * the limit for the shared buffer pool and the local buffer pool, without
+	 * worrying which it is.
+	 */
+	LimitAdditionalPins(&max_pinned_buffers);
+	LimitAdditionalLocalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0), therefore we need two extra elements
+	 * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+	 * maxmimum possible number of occupied ranges of the smallest possible
+	 * size of one.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, ranges) +
+				sizeof(pgsr->ranges[0]) * size);
+
+	pgsr->max_ios = max_ios;
+	pgsr->per_buffer_data_size = per_buffer_data_size;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+	pgsr->unget_blocknum = InvalidBlockNumber;
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  As long as direct I/O isn't
+	 * enabled, and the caller hasn't promised sequential access, we can use
+	 * it.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & PGSR_FLAG_SEQUENTIAL) == 0)
+		pgsr->advice_enabled = true;
+#endif
+
+	/*
+	 * We start off building small ranges, but double that quickly, for the
+	 * benefit of users that don't know how far ahead they'll read.  This can
+	 * be disabled by users that already know they'll read all the way.
+	 */
+	if (flags & PGSR_FLAG_FULL)
+		pgsr->ramp_up_pin_limit = INT_MAX;
+	else
+		pgsr->ramp_up_pin_limit = 1;
+
+	/*
+	 * We want to avoid creating ranges that are smaller than they could be
+	 * just because we hit max_pinned_buffers.  We only look ahead when the
+	 * number of pinned buffers falls below this trigger number, or put
+	 * another way, we stop looking ahead when we wouldn't be able to build a
+	 * "full sized" range.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space for the callback to store extra data along with each block. */
+	if (per_buffer_data_size)
+		pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+							   void *pgsr_private,
+							   size_t per_buffer_data_size,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *result;
+
+	result = pg_streaming_read_buffer_alloc_internal(flags,
+													 pgsr_private,
+													 per_buffer_data_size,
+													 strategy);
+	result->callback = next_block_cb;
+	result->bmr = bmr;
+	result->forknum = forknum;
+
+	return result;
+}
+
+/*
+ * Find the per-buffer data index for the Nth block of a range.
+ */
+static int
+get_per_buffer_data_index(PgStreamingRead *pgsr, PgStreamingReadRange *range, int n)
+{
+	int			result;
+
+	/*
+	 * Find slot in the circular buffer of per-buffer data, without using the
+	 * expensive % operator.
+	 */
+	result = range->per_buffer_data_index + n;
+	if (result >= pgsr->max_pinned_buffers)
+		result -= pgsr->max_pinned_buffers;
+	Assert(result == (range->per_buffer_data_index + n) % pgsr->max_pinned_buffers);
+
+	return result;
+}
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static void *
+get_per_buffer_data_by_index(PgStreamingRead *pgsr, int per_buffer_data_index)
+{
+	return (char *) pgsr->per_buffer_data +
+		pgsr->per_buffer_data_size * per_buffer_data_index;
+}
+
+/*
+ * Return a pointer to the per-buffer data for the Nth block of a range.
+ */
+static void *
+get_per_buffer_data(PgStreamingRead *pgsr, PgStreamingReadRange *range, int n)
+{
+	return get_per_buffer_data_by_index(pgsr,
+										get_per_buffer_data_index(pgsr,
+																  range,
+																  n));
+}
+
+/*
+ * Start reading the head range, and create a new head range.  The new head
+ * range is returned.  It may not be empty, if StartReadBuffers() couldn't
+ * start the entire range; in that case the returned range contains the
+ * remaining portion of the range.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_start_head_range(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *head_range;
+	PgStreamingReadRange *new_head_range;
+	int			nblocks_pinned;
+	int			flags;
+
+	/* Caller should make sure we never exceed max_ios. */
+	Assert(pgsr->ios_in_progress < pgsr->max_ios);
+
+	/* Should only call if the head range has some blocks to read. */
+	head_range = &pgsr->ranges[pgsr->head];
+	Assert(head_range->nblocks > 0);
+
+	/*
+	 * If advice hasn't been suppressed, and this system supports it, this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (pgsr->advice_enabled && head_range->blocknum != pgsr->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+
+	/* Start reading as many blocks as we can from the head range. */
+	nblocks_pinned = head_range->nblocks;
+	head_range->need_wait =
+		StartReadBuffers(pgsr->bmr,
+						 head_range->buffers,
+						 pgsr->forknum,
+						 head_range->blocknum,
+						 &nblocks_pinned,
+						 pgsr->strategy,
+						 flags,
+						 &head_range->operation);
+
+	/* Did that start an I/O? */
+	if (head_range->need_wait && (flags & READ_BUFFERS_ISSUE_ADVICE))
+	{
+		head_range->advice_issued = true;
+		pgsr->ios_in_progress++;
+		Assert(pgsr->ios_in_progress <= pgsr->max_ios);
+	}
+
+	/*
+	 * StartReadBuffers() might have pinned fewer blocks than we asked it to,
+	 * but always at least one.
+	 */
+	Assert(nblocks_pinned <= head_range->nblocks);
+	Assert(nblocks_pinned >= 1);
+	pgsr->pinned_buffers += nblocks_pinned;
+
+	/*
+	 * Remember where the next block would be after that, so we can detect
+	 * sequential access next time.
+	 */
+	pgsr->seq_blocknum = head_range->blocknum + nblocks_pinned;
+
+	/*
+	 * Create a new head range.  There must be space, because we have enough
+	 * elements for every range to hold just one block, up to the pin limit.
+	 */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	new_head_range = &pgsr->ranges[pgsr->head];
+	new_head_range->nblocks = 0;
+	new_head_range->advice_issued = false;
+
+	/*
+	 * If we didn't manage to start the whole read above, we split the range,
+	 * moving the remainder into the new head range.
+	 */
+	if (nblocks_pinned < head_range->nblocks)
+	{
+		int			nblocks_remaining = head_range->nblocks - nblocks_pinned;
+
+		head_range->nblocks = nblocks_pinned;
+
+		new_head_range->blocknum = head_range->blocknum + nblocks_pinned;
+		new_head_range->nblocks = nblocks_remaining;
+	}
+
+	/* The new range has per-buffer data starting after the previous range. */
+	new_head_range->per_buffer_data_index =
+		get_per_buffer_data_index(pgsr, head_range, nblocks_pinned);
+
+	return new_head_range;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow pg_streaming_unget_block() to work.
+ */
+static BlockNumber
+pg_streaming_get_block(PgStreamingRead *pgsr, void *per_buffer_data)
+{
+	BlockNumber result;
+
+	if (unlikely(pgsr->unget_blocknum != InvalidBlockNumber))
+	{
+		/*
+		 * If we had to unget a block, now it is time to return that one
+		 * again.
+		 */
+		result = pgsr->unget_blocknum;
+		pgsr->unget_blocknum = InvalidBlockNumber;
+
+		/*
+		 * The same per_buffer_data element must have been used, and still
+		 * contains whatever data the callback wrote into it.  So we just
+		 * sanity-check that we were called with the value that
+		 * pg_streaming_unget_block() pushed back.
+		 */
+		Assert(per_buffer_data == pgsr->unget_per_buffer_data);
+	}
+	else
+	{
+		/* Use the installed callback directly. */
+		result = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+	}
+
+	return result;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.  This *must* be called with the
+ * last value returned by pg_streaming_get_block().
+ */
+static void
+pg_streaming_unget_block(PgStreamingRead *pgsr, BlockNumber blocknum, void *per_buffer_data)
+{
+	Assert(pgsr->unget_blocknum == InvalidBlockNumber);
+	pgsr->unget_blocknum = blocknum;
+	pgsr->unget_per_buffer_data = per_buffer_data;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *range;
+
+	/*
+	 * If we're still ramping up, we may have to stall to wait for buffers to
+	 * be consumed first before we do any more prefetching.
+	 */
+	if (pgsr->ramp_up_pin_stall > 0)
+	{
+		Assert(pgsr->pinned_buffers > 0);
+		return;
+	}
+
+	/*
+	 * If we're finished or can't start more I/O, then don't look ahead.
+	 */
+	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a full range.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BlockNumber blocknum;
+		void	   *per_buffer_data;
+
+		/* Do we have a full-sized range? */
+		range = &pgsr->ranges[pgsr->head];
+		if (range->nblocks == lengthof(range->buffers))
+		{
+			/* Start as much of it as we can. */
+			range = pg_streaming_read_start_head_range(pgsr);
+
+			/* If we're now at the I/O limit, stop here. */
+			if (pgsr->ios_in_progress == pgsr->max_ios)
+				return;
+
+			/*
+			 * If we couldn't form a full range, then stop here to avoid
+			 * creating small I/O.
+			 */
+			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+				return;
+
+			/*
+			 * That might have only been partially started, but always
+			 * processes at least one so that'll do for now.
+			 */
+			Assert(range->nblocks < lengthof(range->buffers));
+		}
+
+		/* Find per-buffer data slot for the next block. */
+		per_buffer_data = get_per_buffer_data(pgsr, range, range->nblocks);
+
+		/* Find out which block the callback wants to read next. */
+		blocknum = pg_streaming_get_block(pgsr, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			/* End of stream. */
+			pgsr->finished = true;
+			break;
+		}
+
+		/*
+		 * Is there a head range that we cannot extend, because the requested
+		 * block is not consecutive?
+		 */
+		if (range->nblocks > 0 &&
+			range->blocknum + range->nblocks != blocknum)
+		{
+			/* Yes.  Start it, so we can begin building a new one. */
+			range = pg_streaming_read_start_head_range(pgsr);
+
+			/*
+			 * It's possible that it was only partially started, and we have a
+			 * new range with the remainder.  Keep starting I/Os until we get
+			 * it all out of the way, or we hit the I/O limit.
+			 */
+			while (range->nblocks > 0 && pgsr->ios_in_progress < pgsr->max_ios)
+				range = pg_streaming_read_start_head_range(pgsr);
+
+			/*
+			 * We have to 'unget' the block returned by the callback if we
+			 * don't have enough I/O capacity left to start something.
+			 */
+			if (pgsr->ios_in_progress == pgsr->max_ios)
+			{
+				pg_streaming_unget_block(pgsr, blocknum, per_buffer_data);
+				return;
+			}
+		}
+
+		/* If we have a new, empty range, initialize the start block. */
+		if (range->nblocks == 0)
+		{
+			range->blocknum = blocknum;
+		}
+
+		/* This block extends the range by one. */
+		Assert(range->blocknum + range->nblocks == blocknum);
+		range->nblocks++;
+
+	} while (pgsr->pinned_buffers + range->nblocks < pgsr->max_pinned_buffers &&
+			 pgsr->pinned_buffers + range->nblocks < pgsr->ramp_up_pin_limit);
+
+	/* If we've hit the ramp-up limit, insert a stall. */
+	if (pgsr->pinned_buffers + range->nblocks >= pgsr->ramp_up_pin_limit)
+	{
+		/* Can't get here if an earlier stall hasn't finished. */
+		Assert(pgsr->ramp_up_pin_stall == 0);
+		/* Don't do any more prefetching until these buffers are consumed. */
+		pgsr->ramp_up_pin_stall = pgsr->ramp_up_pin_limit;
+		/* Double it.  It will soon be out of the way. */
+		pgsr->ramp_up_pin_limit *= 2;
+	}
+
+	/* Start as much as we can. */
+	while (range->nblocks > 0)
+	{
+		range = pg_streaming_read_start_head_range(pgsr);
+		if (pgsr->ios_in_progress == pgsr->max_ios)
+			break;
+	}
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+	pg_streaming_read_look_ahead(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadRange *tail_range;
+
+		tail_range = &pgsr->ranges[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * range?
+		 */
+		if (tail_range->need_wait)
+		{
+			WaitReadBuffers(&tail_range->operation);
+			tail_range->need_wait = false;
+
+			/*
+			 * We don't really know if the kernel generated a physical I/O
+			 * when we issued advice, let alone when it finished, but it has
+			 * certainly finished now because we've performed the read.
+			 */
+			if (tail_range->advice_issued)
+			{
+				Assert(pgsr->ios_in_progress > 0);
+				pgsr->ios_in_progress--;
+			}
+		}
+
+		/* Are there more buffers available in this range? */
+		if (pgsr->next_tail_buffer < tail_range->nblocks)
+		{
+			int			buffer_index;
+			Buffer		buffer;
+
+			buffer_index = pgsr->next_tail_buffer++;
+			buffer = tail_range->buffers[buffer_index];
+
+			Assert(BufferIsValid(buffer));
+
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (pgsr->ramp_up_pin_stall > 0)
+				pgsr->ramp_up_pin_stall--;
+
+			if (per_buffer_data)
+				*per_buffer_data = get_per_buffer_data(pgsr, tail_range, buffer_index);
+
+			return buffer;
+		}
+
+		/* Advance tail to next range, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+
+		/*
+		 * If tail crashed into head, and head is not empty, then it is time
+		 * to start that range.
+		 */
+		if (pgsr->tail == pgsr->head &&
+			pgsr->ranges[pgsr->head].nblocks > 0)
+			pg_streaming_read_start_head_range(pgsr);
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	pgsr->finished = true;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(pgsr->pinned_buffers == 0);
+	Assert(pgsr->ios_in_progress == 0);
+
+	/* Release memory. */
+	if (pgsr->per_buffer_data)
+		pfree(pgsr->per_buffer_data);
+
+	pfree(pgsr);
+}
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..c4d3892bb26
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,52 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define PGSR_FLAG_FULL 0x04
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												void *pgsr_private,
+												void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+													   void *pgsr_private,
+													   size_t per_buffer_private_size,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6deb18234ac..cfb58cf4836 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2097,6 +2097,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
-- 
2.39.2

v5-0003-Use-streaming-reads-in-pg_prewarm.patchtext/x-patch; charset=US-ASCII; name=v5-0003-Use-streaming-reads-in-pg_prewarm.patchDownload

From b4cefb4cea2573617e68a3d773e2902779d75ab3 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 09:28:42 +1200
Subject: [PATCH v5 3/7] Use streaming reads in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use streaming reads.  This
provides a simple example of such a transformation, and generates fewer
system calls.

Reviewed-by:
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 contrib/pg_prewarm/pg_prewarm.c | 40 ++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..1cc84bcb0c2 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+							   void *pgsr_private,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		PgStreamingRead *pgsr;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_FULL,
+											  &p,
+											  0,
+											  NULL,
+											  BMR_REL(rel),
+											  forkNumber,
+											  pg_prewarm_streaming_read_next);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Close relation, release lock. */
-- 
2.39.2

v5-0004-WIP-Use-streaming-reads-in-heapam-scans.patchtext/x-patch; charset=US-ASCII; name=v5-0004-WIP-Use-streaming-reads-in-heapam-scans.patchDownload

From c2bc35daa98ef1e4516f9fe5a047013116d6f49e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 12 Apr 2023 18:16:06 -0700
Subject: [PATCH v5 4/7] WIP: Use streaming reads in heapam scans.

XXX Cherry-picked from https://github.com/anarazel/postgres/tree/aio and
lightly modified by TM, for demonstration purposes.

Author: Andres Freund <andres@anarazel.de>
---
 src/backend/access/heap/heapam.c         | 205 +++++++++++++++++++++--
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/include/access/heapam.h              |   5 +-
 3 files changed, 192 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 707460a5364..76a53f68b66 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -65,6 +65,7 @@
 #include "storage/smgr.h"
 #include "storage/spin.h"
 #include "storage/standby.h"
+#include "storage/streaming_read.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
@@ -225,6 +226,89 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+					  void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) pgsr_private;
+	BlockNumber blockno;
+
+	Assert(!scan->rs_base.rs_parallel);
+	Assert(scan->rs_nblocks > 0);
+
+	if (scan->rs_prefetch_block == InvalidBlockNumber)
+	{
+		scan->rs_prefetch_block = blockno = scan->rs_startblock;
+	}
+	else
+	{
+		blockno = ++scan->rs_prefetch_block;
+
+		/* wrap back to the start of the heap */
+		if (blockno >= scan->rs_nblocks)
+			scan->rs_prefetch_block = blockno = 0;
+
+		/* we're done if we're back at where we started */
+		if (blockno == scan->rs_startblock)
+			return InvalidBlockNumber;
+
+		/* check if the limit imposed by heap_setscanlimits() is met */
+		if (scan->rs_numblocks != InvalidBlockNumber)
+		{
+			if (--scan->rs_numblocks == 0)
+				return InvalidBlockNumber;
+		}
+	}
+
+	return blockno;
+}
+
+static BlockNumber
+heap_pgsr_next_parallel(PgStreamingRead *pgsr, void *pgsr_private,
+						void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) pgsr_private;
+	ParallelBlockTableScanDesc pbscan =
+		(ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+	ParallelBlockTableScanWorker pbscanwork =
+		scan->rs_parallelworkerdata;
+	BlockNumber blockno;
+
+	Assert(scan->rs_base.rs_parallel);
+	Assert(scan->rs_nblocks > 0);
+
+	/* Note that other processes might have already finished the scan */
+	blockno = table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
+												pbscanwork, pbscan);
+
+	return blockno;
+}
+
+static PgStreamingRead *
+heap_pgsr_single_alloc(HeapScanDesc scan)
+{
+	return pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+										  scan,
+										  0,
+										  scan->rs_strategy,
+										  BMR_REL(scan->rs_base.rs_rd),
+										  MAIN_FORKNUM,
+										  heap_pgsr_next_single);
+}
+
+static PgStreamingRead *
+heap_pgsr_parallel_alloc(HeapScanDesc scan)
+{
+	return pg_streaming_read_buffer_alloc(PGSR_FLAG_SEQUENTIAL,
+										  scan,
+										  0,
+										  scan->rs_strategy,
+										  BMR_REL(scan->rs_base.rs_rd),
+										  MAIN_FORKNUM,
+										  heap_pgsr_next_parallel);
+}
+
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -342,6 +426,26 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	 */
 	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN)
 		pgstat_count_heap_scan(scan->rs_base.rs_rd);
+
+	scan->rs_prefetch_block = InvalidBlockNumber;
+	if (scan->pgsr)
+	{
+		pg_streaming_read_free(scan->pgsr);
+		scan->pgsr = NULL;
+	}
+
+	/*
+	 * FIXME: This probably should be done in the !rs_inited blocks instead.
+	 */
+	scan->pgsr = NULL;
+	if (!RelationUsesLocalBuffers(scan->rs_base.rs_rd) &&
+		(scan->rs_base.rs_flags & SO_TYPE_SEQSCAN))
+	{
+		if (scan->rs_base.rs_parallel)
+			scan->pgsr = heap_pgsr_parallel_alloc(scan);
+		else
+			scan->pgsr = heap_pgsr_single_alloc(scan);
+	}
 }
 
 /*
@@ -374,7 +478,7 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
  * which tuples on the page are visible.
  */
 void
-heapgetpage(TableScanDesc sscan, BlockNumber block)
+heapgetpage(TableScanDesc sscan, BlockNumber block, Buffer pgsr_buffer)
 {
 	HeapScanDesc scan = (HeapScanDesc) sscan;
 	Buffer		buffer;
@@ -401,9 +505,20 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
+	if (BufferIsValid(pgsr_buffer))
+	{
+		Assert(scan->pgsr);
+		Assert(BufferGetBlockNumber(pgsr_buffer) == block);
+		scan->rs_cbuf = pgsr_buffer;
+	}
+	else
+	{
+		Assert(!scan->pgsr);
+
+		/* read page using selected strategy */
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
+										   RBM_NORMAL, scan->rs_strategy);
+	}
 	scan->rs_cblock = block;
 
 	if (!(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE))
@@ -490,7 +605,7 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
  * of the pages before we can get a chance to get our first page.
  */
 static BlockNumber
-heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
+heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir, Buffer *pgsr_buf)
 {
 	Assert(!scan->rs_inited);
 
@@ -500,16 +615,25 @@ heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 
 	if (ScanDirectionIsForward(dir))
 	{
+		if (scan->rs_base.rs_parallel != NULL)
+			table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
+													 scan->rs_parallelworkerdata,
+													 (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+
+		/* FIXME: Integrate more neatly */
+		if (scan->pgsr)
+		{
+			*pgsr_buf = pg_streaming_read_buffer_get_next(scan->pgsr, NULL);
+			if (*pgsr_buf == InvalidBuffer)
+				return InvalidBlockNumber;
+			return BufferGetBlockNumber(*pgsr_buf);
+		}
+
 		/* serial scan */
 		if (scan->rs_base.rs_parallel == NULL)
 			return scan->rs_startblock;
 		else
 		{
-			/* parallel scan */
-			table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
-													 scan->rs_parallelworkerdata,
-													 (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
-
 			/* may return InvalidBlockNumber if there are no more blocks */
 			return table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
 													 scan->rs_parallelworkerdata,
@@ -529,6 +653,12 @@ heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 		 */
 		scan->rs_base.rs_flags &= ~SO_ALLOW_SYNC;
 
+		if (scan->pgsr)
+		{
+			pg_streaming_read_free(scan->pgsr);
+			scan->pgsr = NULL;
+		}
+
 		/*
 		 * Start from last page of the scan.  Ensure we take into account
 		 * rs_numblocks if it's been adjusted by heap_setscanlimits().
@@ -630,11 +760,33 @@ heapgettup_continue_page(HeapScanDesc scan, ScanDirection dir, int *linesleft,
  * heap_setscanlimits().
  */
 static inline BlockNumber
-heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir)
+heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir,
+						 Buffer *pgsr_buf)
 {
 	if (ScanDirectionIsForward(dir))
 	{
-		if (scan->rs_base.rs_parallel == NULL)
+		if (scan->pgsr)
+		{
+#ifdef USE_ASSERT_CHECKING
+			block++;
+
+			/* wrap back to the start of the heap */
+			if (block >= scan->rs_nblocks)
+				block = 0;
+
+			/* we're done if we're back at where we started */
+			if (block == scan->rs_startblock)
+				block = InvalidBlockNumber;
+#endif
+			*pgsr_buf = pg_streaming_read_buffer_get_next(scan->pgsr, NULL);
+			if (*pgsr_buf == InvalidBuffer)
+				return InvalidBlockNumber;
+
+			Assert(scan->rs_base.rs_parallel ||
+				   block == BufferGetBlockNumber(*pgsr_buf));
+			return BufferGetBlockNumber(*pgsr_buf);
+		}
+		else if (scan->rs_base.rs_parallel == NULL)
 		{
 			block++;
 
@@ -679,6 +831,12 @@ heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir
 	}
 	else
 	{
+		if (scan->pgsr)
+		{
+			pg_streaming_read_free(scan->pgsr);
+			scan->pgsr = NULL;
+		}
+
 		/* we're done if the last block is the start position */
 		if (block == scan->rs_startblock)
 			return InvalidBlockNumber;
@@ -728,13 +886,14 @@ heapgettup(HeapScanDesc scan,
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
 	BlockNumber block;
+	Buffer		pgsr_buf = InvalidBuffer;
 	Page		page;
 	OffsetNumber lineoff;
 	int			linesleft;
 
 	if (unlikely(!scan->rs_inited))
 	{
-		block = heapgettup_initial_block(scan, dir);
+		block = heapgettup_initial_block(scan, dir, &pgsr_buf);
 		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
 		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
 		scan->rs_inited = true;
@@ -755,7 +914,7 @@ heapgettup(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapgetpage((TableScanDesc) scan, block, pgsr_buf);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -807,9 +966,10 @@ continue_page:
 		 * it's time to move to the next.
 		 */
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
+		pgsr_buf = InvalidBuffer;
 
 		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
+		block = heapgettup_advance_block(scan, block, dir, &pgsr_buf);
 	}
 
 	/* end of scan */
@@ -843,13 +1003,14 @@ heapgettup_pagemode(HeapScanDesc scan,
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
 	BlockNumber block;
+	Buffer		pgsr_buf = InvalidBuffer;
 	Page		page;
 	int			lineindex;
 	int			linesleft;
 
 	if (unlikely(!scan->rs_inited))
 	{
-		block = heapgettup_initial_block(scan, dir);
+		block = heapgettup_initial_block(scan, dir, &pgsr_buf);
 		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
 		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
 		scan->rs_inited = true;
@@ -876,7 +1037,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapgetpage((TableScanDesc) scan, block, pgsr_buf);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
@@ -908,7 +1069,7 @@ continue_page:
 		}
 
 		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
+		block = heapgettup_advance_block(scan, block, dir, &pgsr_buf);
 	}
 
 	/* end of scan */
@@ -956,6 +1117,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_base.rs_parallel = parallel_scan;
 	scan->rs_strategy = NULL;	/* set in initscan */
 
+	scan->pgsr = NULL;
+
 	/*
 	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
 	 */
@@ -1062,6 +1225,12 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	if (scan->pgsr)
+	{
+		pg_streaming_read_free(scan->pgsr);
+		scan->pgsr = NULL;
+	}
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 680a50bf8b1..20e83f48e35 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2333,7 +2333,7 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 		return false;
 	}
 
-	heapgetpage(scan, blockno);
+	heapgetpage(scan, blockno, InvalidBuffer);
 	hscan->rs_inited = true;
 
 	return true;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4b133f68593..21e71d60912 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -59,6 +59,7 @@ typedef struct HeapScanDescData
 	bool		rs_inited;		/* false = scan not init'd yet */
 	OffsetNumber rs_coffset;	/* current offset # in non-page-at-a-time mode */
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
+	BlockNumber rs_prefetch_block;	/* block being prefetched */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
 
@@ -72,6 +73,8 @@ typedef struct HeapScanDescData
 	 */
 	ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
 
+	struct PgStreamingRead *pgsr;
+
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
@@ -246,7 +249,7 @@ extern TableScanDesc heap_beginscan(Relation relation, Snapshot snapshot,
 									uint32 flags);
 extern void heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk,
 							   BlockNumber numBlks);
-extern void heapgetpage(TableScanDesc sscan, BlockNumber block);
+extern void heapgetpage(TableScanDesc sscan, BlockNumber block, Buffer buffer);
 extern void heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 						bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(TableScanDesc sscan);
-- 
2.39.2

v5-0005-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patchtext/x-patch; charset=US-ASCII; name=v5-0005-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patchDownload

From bea6cf16b5ba918d3c7203d187d2aa226a16e2fc Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 12 Feb 2021 14:19:10 -0800
Subject: [PATCH v5 5/7] WIP: Use streaming reads in nbtree vacuum scan.

XXX Cherry-picked from https://github.com/anarazel/postgres/tree/aio and
lightly modified by TM, for demonstration purposes.

Author: Andres Freund <andres@anarazel.de>
---
 src/backend/access/nbtree/nbtree.c | 74 ++++++++++++++++++++++++------
 src/include/access/nbtree.h        |  3 ++
 2 files changed, 64 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 21d879a3bdf..773da3800f1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -34,6 +34,7 @@
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/streaming_read.h"
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
@@ -81,7 +82,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
 static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 IndexBulkDeleteCallback callback, void *callback_state,
 						 BTCycleId cycleid);
-static void btvacuumpage(BTVacState *vstate, BlockNumber scanblkno);
+static void btvacuumpage(BTVacState *vstate, Buffer buf, BlockNumber scanblkno);
 static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
 										  IndexTuple posting,
 										  OffsetNumber updatedoffset,
@@ -878,6 +879,24 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+static BlockNumber
+btvacuumscan_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private,
+					   void *per_buffer_data)
+{
+	BTVacState *vstate = (BTVacState *) pgsr_private;
+
+	if (vstate->nextblock == vstate->num_pages)
+		return InvalidBlockNumber;
+
+	/*
+	 * We can't use _bt_getbuf() here because it always applies
+	 * _bt_checkpage(), which will barf on an all-zero page. We want to
+	 * recycle all-zero pages, not fail.  Also, we want to use a nondefault
+	 * buffer access strategy. Also want to use AIO.
+	 */
+	return vstate->nextblock++;
+}
+
 /*
  * btvacuumscan --- scan the index for VACUUMing purposes
  *
@@ -971,6 +990,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	scanblkno = BTREE_METAPAGE + 1;
 	for (;;)
 	{
+		PgStreamingRead *pgsr;
+
 		/* Get the current relation length */
 		if (needLock)
 			LockRelationForExtension(rel, ExclusiveLock);
@@ -985,14 +1006,40 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		/* Quit if we've scanned the whole relation */
 		if (scanblkno >= num_pages)
 			break;
+
+		vstate.nextblock = scanblkno;
+		vstate.num_pages = num_pages;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_MAINTENANCE,
+											  &vstate,
+											  0,
+											  vstate.info->strategy,
+											  BMR_REL(vstate.info->index),
+											  MAIN_FORKNUM,
+											  btvacuumscan_pgsr_next);
+
 		/* Iterate over pages, then loop back to recheck length */
 		for (; scanblkno < num_pages; scanblkno++)
 		{
-			btvacuumpage(&vstate, scanblkno);
+			Buffer		buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
+
+			Assert(BufferIsValid(buf));
+			Assert(BufferGetBlockNumber(buf) == scanblkno);
+
+			btvacuumpage(&vstate, buf, scanblkno);
+
 			if (info->report_progress)
 				pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
 											 scanblkno);
+
+			/* call vacuum_delay_point while not holding any buffer lock */
+			vacuum_delay_point();
 		}
+
+		if (pg_streaming_read_buffer_get_next(pgsr, NULL) != 0)
+			elog(ERROR, "aio and plain pos out of sync");
+
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Set statistics num_pages field to final size of index */
@@ -1025,7 +1072,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
  * recycled (i.e. before the page split).
  */
 static void
-btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
+btvacuumpage(BTVacState *vstate, Buffer buf, BlockNumber scanblkno)
 {
 	IndexVacuumInfo *info = vstate->info;
 	IndexBulkDeleteResult *stats = vstate->stats;
@@ -1036,28 +1083,28 @@ btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
 	bool		attempt_pagedel;
 	BlockNumber blkno,
 				backtrack_to;
-	Buffer		buf;
 	Page		page;
 	BTPageOpaque opaque;
 
 	blkno = scanblkno;
 
+	Assert(BufferIsValid(buf));
 backtrack:
 
 	attempt_pagedel = false;
 	backtrack_to = P_NONE;
 
-	/* call vacuum_delay_point while not holding any buffer lock */
-	vacuum_delay_point();
-
 	/*
-	 * We can't use _bt_getbuf() here because it always applies
-	 * _bt_checkpage(), which will barf on an all-zero page. We want to
-	 * recycle all-zero pages, not fail.  Also, we want to use a nondefault
-	 * buffer access strategy.
+	 * If we're not backtracking we already read the page asynchronously. Not
+	 * using AIO when backtracking isn't too tragic - it's relatively rare,
+	 * and the buffer quite possibly is still in shared buffers.
 	 */
-	buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-							 info->strategy);
+	if (!BufferIsValid(buf))
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								 info->strategy);
+
+	Assert(BufferGetBlockNumber(buf) == blkno);
+
 	_bt_lockbuf(rel, buf, BT_READ);
 	page = BufferGetPage(buf);
 	opaque = NULL;
@@ -1344,6 +1391,7 @@ backtrack:
 	if (backtrack_to != P_NONE)
 	{
 		blkno = backtrack_to;
+		buf = InvalidBuffer;
 		goto backtrack;
 	}
 }
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 6eb162052e9..2dbbbd36e73 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -343,6 +343,9 @@ typedef struct BTVacState
 	int			maxbufsize;		/* max bufsize that respects work_mem */
 	BTPendingFSM *pendingpages; /* One entry per newly deleted page */
 	int			npendingpages;	/* current # valid pendingpages */
+
+	BlockNumber nextblock;
+	BlockNumber num_pages;
 } BTVacState;
 
 /*
-- 
2.39.2

v5-0006-WIP-Use-streaming-reads-in-bitmap-heapscan.patchtext/x-patch; charset=US-ASCII; name=v5-0006-WIP-Use-streaming-reads-in-bitmap-heapscan.patchDownload

From a94df02198c007aa8a39cb9b9f7906ea11fbc60a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 22 Aug 2023 16:48:30 +1200
Subject: [PATCH v5 6/7] WIP: Use streaming reads in bitmap heapscan.

XXX Cherry-picked from
https://github.com/melanieplageman/postgres/tree/bhs_aio and lightly
modified by TM, for demonstration purposes.

Author: Melanie Plageman <melanieplageman@gmail.com>
---
 src/backend/access/gin/ginget.c           |  17 +-
 src/backend/access/gin/ginscan.c          |   7 +
 src/backend/access/heap/heapam.c          |  71 +++
 src/backend/access/heap/heapam_handler.c  |  78 ++--
 src/backend/commands/explain.c            |  21 +-
 src/backend/executor/nodeBitmapHeapscan.c | 521 ++--------------------
 src/backend/nodes/tidbitmap.c             |  76 ++--
 src/include/access/heapam.h               |   7 +
 src/include/access/relscan.h              |   8 +
 src/include/access/tableam.h              |  71 ++-
 src/include/executor/nodeBitmapHeapscan.h |   1 +
 src/include/nodes/execnodes.h             |  39 +-
 src/include/nodes/tidbitmap.h             |   4 +-
 13 files changed, 284 insertions(+), 637 deletions(-)

diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 0b4f2ebadb6..f238a3b0711 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -373,7 +373,10 @@ restartScanEntry:
 			if (entry->matchBitmap)
 			{
 				if (entry->matchIterator)
+				{
 					tbm_end_iterate(entry->matchIterator);
+					pfree(entry->matchResult);
+				}
 				entry->matchIterator = NULL;
 				tbm_free(entry->matchBitmap);
 				entry->matchBitmap = NULL;
@@ -386,6 +389,9 @@ restartScanEntry:
 		if (entry->matchBitmap && !tbm_is_empty(entry->matchBitmap))
 		{
 			entry->matchIterator = tbm_begin_iterate(entry->matchBitmap);
+			entry->matchResult = (TBMIterateResult *)
+				palloc0(sizeof(TBMIterateResult) + MaxHeapTuplesPerPage *
+						sizeof(OffsetNumber));
 			entry->isFinished = false;
 		}
 	}
@@ -823,21 +829,24 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
 		{
 			/*
 			 * If we've exhausted all items on this block, move to next block
-			 * in the bitmap.
+			 * in the bitmap. tbm_iterate() sets matchResult->blockno to
+			 * InvalidBlockNumber when the bitmap is exhausted.
 			 */
-			while (entry->matchResult == NULL ||
+			while ((!BlockNumberIsValid(entry->matchResult->blockno)) ||
 				   (entry->matchResult->ntuples >= 0 &&
 					entry->offset >= entry->matchResult->ntuples) ||
 				   entry->matchResult->blockno < advancePastBlk ||
 				   (ItemPointerIsLossyPage(&advancePast) &&
 					entry->matchResult->blockno == advancePastBlk))
 			{
-				entry->matchResult = tbm_iterate(entry->matchIterator);
 
-				if (entry->matchResult == NULL)
+				tbm_iterate(entry->matchIterator, entry->matchResult);
+				if (!BlockNumberIsValid(entry->matchResult->blockno))
 				{
 					ItemPointerSetInvalid(&entry->curItem);
 					tbm_end_iterate(entry->matchIterator);
+					pfree(entry->matchResult);
+					entry->matchResult = NULL;
 					entry->matchIterator = NULL;
 					entry->isFinished = true;
 					break;
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index af24d38544e..be27f9fe07e 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -246,7 +246,14 @@ ginFreeScanKeys(GinScanOpaque so)
 		if (entry->list)
 			pfree(entry->list);
 		if (entry->matchIterator)
+		{
 			tbm_end_iterate(entry->matchIterator);
+			if (entry->matchResult)
+			{
+				pfree(entry->matchResult);
+				entry->matchResult = NULL;
+			}
+		}
 		if (entry->matchBitmap)
 			tbm_free(entry->matchBitmap);
 	}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 76a53f68b66..b174b5c675d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -226,6 +226,53 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+bitmapheap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+							void *per_buffer_data)
+{
+	TBMIterateResult *tbmres;
+	HeapScanDesc hdesc = (HeapScanDesc) pgsr_private;
+
+	tbmres = (TBMIterateResult *) per_buffer_data;
+
+	for (;;)
+	{
+		if (hdesc->rs_base.shared_tbmiterator)
+			tbm_shared_iterate(hdesc->rs_base.shared_tbmiterator, tbmres);
+		else
+			tbm_iterate(hdesc->rs_base.tbmiterator, tbmres);
+
+		if (!BlockNumberIsValid(tbmres->blockno))
+			return InvalidBlockNumber;
+
+		/*
+		 * Ignore any claimed entries past what we think is the end of the
+		 * relation. It may have been extended after the start of our scan (we
+		 * only hold an AccessShareLock, and it could be inserts from this
+		 * backend).  We don't take this optimization in SERIALIZABLE
+		 * isolation though, as we need to examine all invisible tuples
+		 * reachable by the index.
+		 */
+		if (!IsolationIsSerializable() && tbmres->blockno >= hdesc->rs_nblocks)
+			continue;
+
+		if (tbmres->ntuples >= 0)
+			hdesc->rs_base.exact_pages++;
+		else
+			hdesc->rs_base.lossy_pages++;
+
+		if (hdesc->rs_base.rs_flags & SO_CAN_SKIP_FETCH &&
+			!tbmres->recheck &&
+			VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno, &hdesc->vmbuffer))
+		{
+			hdesc->empty_tuples += tbmres->ntuples;
+			continue;
+		}
+
+		return tbmres->blockno;
+	}
+}
+
 static BlockNumber
 heap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
 					  void *per_buffer_data)
@@ -433,6 +480,11 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 		pg_streaming_read_free(scan->pgsr);
 		scan->pgsr = NULL;
 	}
+	if (BufferIsValid(scan->vmbuffer))
+	{
+		ReleaseBuffer(scan->vmbuffer);
+		scan->vmbuffer = InvalidBuffer;
+	}
 
 	/*
 	 * FIXME: This probably should be done in the !rs_inited blocks instead.
@@ -446,6 +498,19 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 		else
 			scan->pgsr = heap_pgsr_single_alloc(scan);
 	}
+	else if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
+	{
+		scan->pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+													scan,
+													offsetof(TBMIterateResult, offsets) + MaxHeapTuplesPerPage * sizeof(OffsetNumber),
+													scan->rs_strategy,
+													BMR_REL(scan->rs_base.rs_rd),
+													MAIN_FORKNUM,
+													bitmapheap_pgsr_next_single);
+
+
+		scan->rs_inited = true;
+	}
 }
 
 /*
@@ -1118,6 +1183,12 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_strategy = NULL;	/* set in initscan */
 
 	scan->pgsr = NULL;
+	scan->vmbuffer = InvalidBuffer;
+	scan->empty_tuples = 0;
+	scan->rs_base.exact_pages = 0;
+	scan->rs_base.lossy_pages = 0;
+	scan->rs_base.tbmiterator = NULL;
+	scan->rs_base.shared_tbmiterator = NULL;
 
 	/*
 	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 20e83f48e35..946ecf92bb5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "storage/streaming_read.h"
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
@@ -2108,38 +2109,55 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
  * Executor related callbacks for the heap AM
  * ------------------------------------------------------------------------
  */
-
 static bool
-heapam_scan_bitmap_next_block(TableScanDesc scan,
-							  TBMIterateResult *tbmres)
+heapam_scan_bitmap_next_block(TableScanDesc scan, bool *recheck)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
-	BlockNumber block = tbmres->blockno;
+	TBMIterateResult *tbmres;
+	void	   *io_private;
 	Buffer		buffer;
 	Snapshot	snapshot;
 	int			ntup;
 
-	hscan->rs_cindex = 0;
-	hscan->rs_ntuples = 0;
+	Assert(hscan->pgsr);
 
 	/*
-	 * Ignore any claimed entries past what we think is the end of the
-	 * relation. It may have been extended after the start of our scan (we
-	 * only hold an AccessShareLock, and it could be inserts from this
-	 * backend).  We don't take this optimization in SERIALIZABLE isolation
-	 * though, as we need to examine all invisible tuples reachable by the
-	 * index.
+	 * Release the buffer containing the previous block and reset the per-page
+	 * counters. Reset hscan->rs_ntuples here just to be safe.
 	 */
-	if (!IsolationIsSerializable() && block >= hscan->rs_nblocks)
-		return false;
+	if (BufferIsValid(hscan->rs_cbuf))
+	{
+		ReleaseBuffer(hscan->rs_cbuf);
+		hscan->rs_cbuf = InvalidBuffer;
+	}
+
+	hscan->rs_cindex = 0;
+	hscan->rs_ntuples = 0;
+
+	hscan->rs_cbuf = pg_streaming_read_buffer_get_next(hscan->pgsr, &io_private);
+
+	if (BufferIsInvalid(hscan->rs_cbuf))
+	{
+		if (BufferIsValid(hscan->vmbuffer))
+		{
+			ReleaseBuffer(hscan->vmbuffer);
+			hscan->vmbuffer = InvalidBuffer;
+		}
+
+		return hscan->empty_tuples > 0;
+	}
+
+	Assert(io_private);
+
+	tbmres = (TBMIterateResult *) io_private;
+
+	Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);
+
+	*recheck = tbmres->recheck;
+
+	hscan->rs_cblock = tbmres->blockno;
+	hscan->rs_ntuples = tbmres->ntuples;
 
-	/*
-	 * Acquire pin on the target heap page, trading in any pin we held before.
-	 */
-	hscan->rs_cbuf = ReleaseAndReadBuffer(hscan->rs_cbuf,
-										  scan->rs_rd,
-										  block);
-	hscan->rs_cblock = block;
 	buffer = hscan->rs_cbuf;
 	snapshot = scan->rs_snapshot;
 
@@ -2175,7 +2193,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 			ItemPointerData tid;
 			HeapTupleData heapTuple;
 
-			ItemPointerSet(&tid, block, offnum);
+			ItemPointerSet(&tid, tbmres->blockno, offnum);
 			if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
 									   &heapTuple, NULL, true))
 				hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
@@ -2203,7 +2221,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 			loctup.t_data = (HeapTupleHeader) PageGetItem(page, lp);
 			loctup.t_len = ItemIdGetLength(lp);
 			loctup.t_tableOid = scan->rs_rd->rd_id;
-			ItemPointerSet(&loctup.t_self, block, offnum);
+			ItemPointerSet(&loctup.t_self, tbmres->blockno, offnum);
 			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
 			if (valid)
 			{
@@ -2221,25 +2239,31 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 	Assert(ntup <= MaxHeapTuplesPerPage);
 	hscan->rs_ntuples = ntup;
 
-	return ntup > 0;
+	return true;
 }
 
 static bool
-heapam_scan_bitmap_next_tuple(TableScanDesc scan,
-							  TBMIterateResult *tbmres,
-							  TupleTableSlot *slot)
+heapam_scan_bitmap_next_tuple(TableScanDesc scan, TupleTableSlot *slot)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 	OffsetNumber targoffset;
 	Page		page;
 	ItemId		lp;
 
+	if (hscan->empty_tuples)
+	{
+		ExecStoreAllNullTuple(slot);
+		hscan->empty_tuples--;
+		return true;
+	}
+
 	/*
 	 * Out of range?  If so, nothing more to look at on this page
 	 */
 	if (hscan->rs_cindex < 0 || hscan->rs_cindex >= hscan->rs_ntuples)
 		return false;
 
+	Assert(BufferIsValid(hscan->rs_cbuf));
 	targoffset = hscan->rs_vistuples[hscan->rs_cindex];
 	page = BufferGetPage(hscan->rs_cbuf);
 	lp = PageGetItemId(page, targoffset);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 83d00a46638..7c6bea232d7 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/relscan.h"
 #include "access/xact.h"
 #include "catalog/pg_type.h"
 #include "commands/createas.h"
@@ -112,7 +113,7 @@ static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_memoize_info(MemoizeState *mstate, List *ancestors,
 							  ExplainState *es);
 static void show_hashagg_info(AggState *aggstate, ExplainState *es);
-static void show_tidbitmap_info(BitmapHeapScanState *planstate,
+static void show_tidbitmap_info(TableScanDesc tdesc,
 								ExplainState *es);
 static void show_instrumentation_count(const char *qlabel, int which,
 									   PlanState *planstate, ExplainState *es);
@@ -1874,7 +1875,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			if (es->analyze)
-				show_tidbitmap_info((BitmapHeapScanState *) planstate, es);
+				show_tidbitmap_info(((BitmapHeapScanState *) planstate)->ss.ss_currentScanDesc, es);
 			break;
 		case T_SampleScan:
 			show_tablesample(((SampleScan *) plan)->tablesample,
@@ -3463,25 +3464,25 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
  * If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
  */
 static void
-show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
+show_tidbitmap_info(TableScanDesc tdesc, ExplainState *es)
 {
 	if (es->format != EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainPropertyInteger("Exact Heap Blocks", NULL,
-							   planstate->exact_pages, es);
+							   tdesc->exact_pages, es);
 		ExplainPropertyInteger("Lossy Heap Blocks", NULL,
-							   planstate->lossy_pages, es);
+							   tdesc->lossy_pages, es);
 	}
 	else
 	{
-		if (planstate->exact_pages > 0 || planstate->lossy_pages > 0)
+		if (tdesc->exact_pages > 0 || tdesc->lossy_pages > 0)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "Heap Blocks:");
-			if (planstate->exact_pages > 0)
-				appendStringInfo(es->str, " exact=%ld", planstate->exact_pages);
-			if (planstate->lossy_pages > 0)
-				appendStringInfo(es->str, " lossy=%ld", planstate->lossy_pages);
+			if (tdesc->exact_pages > 0)
+				appendStringInfo(es->str, " exact=%ld", tdesc->exact_pages);
+			if (tdesc->lossy_pages > 0)
+				appendStringInfo(es->str, " lossy=%ld", tdesc->lossy_pages);
 			appendStringInfoChar(es->str, '\n');
 		}
 	}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index c1e81ebed63..d94ded30f24 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -54,11 +54,6 @@
 
 static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
 static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
-static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
-												TBMIterateResult *tbmres);
-static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
-static inline void BitmapPrefetch(BitmapHeapScanState *node,
-								  TableScanDesc scan);
 static bool BitmapShouldInitializeSharedState(ParallelBitmapHeapState *pstate);
 
 
@@ -74,10 +69,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	ExprContext *econtext;
 	TableScanDesc scan;
 	TIDBitmap  *tbm;
-	TBMIterator *tbmiterator = NULL;
-	TBMSharedIterator *shared_tbmiterator = NULL;
-	TBMIterateResult *tbmres;
-	TupleTableSlot *slot;
+	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 	ParallelBitmapHeapState *pstate = node->pstate;
 	dsa_area   *dsa = node->ss.ps.state->es_query_dsa;
 
@@ -88,23 +80,10 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	slot = node->ss.ss_ScanTupleSlot;
 	scan = node->ss.ss_currentScanDesc;
 	tbm = node->tbm;
-	if (pstate == NULL)
-		tbmiterator = node->tbmiterator;
-	else
-		shared_tbmiterator = node->shared_tbmiterator;
-	tbmres = node->tbmres;
 
 	/*
 	 * If we haven't yet performed the underlying index scan, do it, and begin
 	 * the iteration over the bitmap.
-	 *
-	 * For prefetching, we use *two* iterators, one for the pages we are
-	 * actually scanning and another that runs ahead of the first for
-	 * prefetching.  node->prefetch_pages tracks exactly how many pages ahead
-	 * the prefetch iterator is.  Also, node->prefetch_target tracks the
-	 * desired prefetch distance, which starts small and increases up to the
-	 * node->prefetch_maximum.  This is to avoid doing a lot of prefetching in
-	 * a scan that stops after a few tuples because of a LIMIT.
 	 */
 	if (!node->initialized)
 	{
@@ -116,17 +95,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 				elog(ERROR, "unrecognized result from subplan");
 
 			node->tbm = tbm;
-			node->tbmiterator = tbmiterator = tbm_begin_iterate(tbm);
-			node->tbmres = tbmres = NULL;
-
-#ifdef USE_PREFETCH
-			if (node->prefetch_maximum > 0)
-			{
-				node->prefetch_iterator = tbm_begin_iterate(tbm);
-				node->prefetch_pages = 0;
-				node->prefetch_target = -1;
-			}
-#endif							/* USE_PREFETCH */
+			scan->tbmiterator = tbm_begin_iterate(tbm);
 		}
 		else
 		{
@@ -148,194 +117,59 @@ BitmapHeapNext(BitmapHeapScanState *node)
 				 * dsa_pointer of the iterator state which will be used by
 				 * multiple processes to iterate jointly.
 				 */
-				pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
-#ifdef USE_PREFETCH
-				if (node->prefetch_maximum > 0)
-				{
-					pstate->prefetch_iterator =
-						tbm_prepare_shared_iterate(tbm);
-
-					/*
-					 * We don't need the mutex here as we haven't yet woke up
-					 * others.
-					 */
-					pstate->prefetch_pages = 0;
-					pstate->prefetch_target = -1;
-				}
-#endif
+				pstate->tbmiterator =
+					tbm_prepare_shared_iterate(tbm);
 
 				/* We have initialized the shared state so wake up others. */
 				BitmapDoneInitializingSharedState(pstate);
 			}
 
 			/* Allocate a private iterator and attach the shared state to it */
-			node->shared_tbmiterator = shared_tbmiterator =
+			scan->shared_tbmiterator =
 				tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
-			node->tbmres = tbmres = NULL;
-
-#ifdef USE_PREFETCH
-			if (node->prefetch_maximum > 0)
-			{
-				node->shared_prefetch_iterator =
-					tbm_attach_shared_iterate(dsa, pstate->prefetch_iterator);
-			}
-#endif							/* USE_PREFETCH */
 		}
 		node->initialized = true;
+
+		/* Get the first block. if none, end of scan */
+		if (!table_scan_bitmap_next_block(scan, &node->recheck))
+			goto exit;
 	}
 
-	for (;;)
+	do
 	{
-		bool		skip_fetch;
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * Get next page of results if needed
-		 */
-		if (tbmres == NULL)
-		{
-			if (!pstate)
-				node->tbmres = tbmres = tbm_iterate(tbmiterator);
-			else
-				node->tbmres = tbmres = tbm_shared_iterate(shared_tbmiterator);
-			if (tbmres == NULL)
-			{
-				/* no more entries in the bitmap */
-				break;
-			}
-
-			BitmapAdjustPrefetchIterator(node, tbmres);
-
-			/*
-			 * We can skip fetching the heap page if we don't need any fields
-			 * from the heap, and the bitmap entries don't need rechecking,
-			 * and all tuples on the page are visible to our transaction.
-			 *
-			 * XXX: It's a layering violation that we do these checks above
-			 * tableam, they should probably moved below it at some point.
-			 */
-			skip_fetch = (node->can_skip_fetch &&
-						  !tbmres->recheck &&
-						  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-										 tbmres->blockno,
-										 &node->vmbuffer));
-
-			if (skip_fetch)
-			{
-				/* can't be lossy in the skip_fetch case */
-				Assert(tbmres->ntuples >= 0);
-
-				/*
-				 * The number of tuples on this page is put into
-				 * node->return_empty_tuples.
-				 */
-				node->return_empty_tuples = tbmres->ntuples;
-			}
-			else if (!table_scan_bitmap_next_block(scan, tbmres))
-			{
-				/* AM doesn't think this block is valid, skip */
-				continue;
-			}
-
-			if (tbmres->ntuples >= 0)
-				node->exact_pages++;
-			else
-				node->lossy_pages++;
-
-			/* Adjust the prefetch target */
-			BitmapAdjustPrefetchTarget(node);
-		}
-		else
+		while (table_scan_bitmap_next_tuple(scan, slot))
 		{
-			/*
-			 * Continuing in previously obtained page.
-			 */
-
-#ifdef USE_PREFETCH
-
-			/*
-			 * Try to prefetch at least a few pages even before we get to the
-			 * second page if we don't stop reading after the first tuple.
-			 */
-			if (!pstate)
-			{
-				if (node->prefetch_target < node->prefetch_maximum)
-					node->prefetch_target++;
-			}
-			else if (pstate->prefetch_target < node->prefetch_maximum)
-			{
-				/* take spinlock while updating shared state */
-				SpinLockAcquire(&pstate->mutex);
-				if (pstate->prefetch_target < node->prefetch_maximum)
-					pstate->prefetch_target++;
-				SpinLockRelease(&pstate->mutex);
-			}
-#endif							/* USE_PREFETCH */
-		}
+			CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * We issue prefetch requests *after* fetching the current page to try
-		 * to avoid having prefetching interfere with the main I/O. Also, this
-		 * should happen only when we have determined there is still something
-		 * to do on the current page, else we may uselessly prefetch the same
-		 * page we are just about to request for real.
-		 *
-		 * XXX: It's a layering violation that we do these checks above
-		 * tableam, they should probably moved below it at some point.
-		 */
-		BitmapPrefetch(node, scan);
-
-		if (node->return_empty_tuples > 0)
-		{
-			/*
-			 * If we don't have to fetch the tuple, just return nulls.
-			 */
-			ExecStoreAllNullTuple(slot);
+			if (!node->recheck)
+				return slot;
 
-			if (--node->return_empty_tuples == 0)
-			{
-				/* no more tuples to return in the next round */
-				node->tbmres = tbmres = NULL;
-			}
-		}
-		else
-		{
-			/*
-			 * Attempt to fetch tuple from AM.
-			 */
-			if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
-			{
-				/* nothing more to look at on this page */
-				node->tbmres = tbmres = NULL;
-				continue;
-			}
+			econtext->ecxt_scantuple = slot;
+			if (ExecQualAndReset(node->bitmapqualorig, econtext))
+				return slot;
 
-			/*
-			 * If we are using lossy info, we have to recheck the qual
-			 * conditions at every tuple.
-			 */
-			if (tbmres->recheck)
-			{
-				econtext->ecxt_scantuple = slot;
-				if (!ExecQualAndReset(node->bitmapqualorig, econtext))
-				{
-					/* Fails recheck, so drop it and loop back for another */
-					InstrCountFiltered2(node, 1);
-					ExecClearTuple(slot);
-					continue;
-				}
-			}
+			/* Fails recheck, so drop it and loop back for another */
+			InstrCountFiltered2(node, 1);
+			ExecClearTuple(slot);
 		}
+	} while (table_scan_bitmap_next_block(scan, &node->recheck));
 
-		/* OK to return this tuple */
-		return slot;
-	}
+exit:
 
 	/*
-	 * if we get here it means we are at the end of the scan..
+	 * Release iterator
 	 */
-	return ExecClearTuple(slot);
+	if (scan->tbmiterator)
+	{
+		tbm_end_iterate(scan->tbmiterator);
+		scan->tbmiterator = NULL;
+	}
+	else if (scan->shared_tbmiterator)
+	{
+		tbm_end_shared_iterate(scan->shared_tbmiterator);
+		scan->shared_tbmiterator = NULL;
+	}
+	return NULL;
 }
 
 /*
@@ -353,215 +187,6 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
 	ConditionVariableBroadcast(&pstate->cv);
 }
 
-/*
- *	BitmapAdjustPrefetchIterator - Adjust the prefetch iterator
- */
-static inline void
-BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
-							 TBMIterateResult *tbmres)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
-		if (node->prefetch_pages > 0)
-		{
-			/* The main iterator has closed the distance by one page */
-			node->prefetch_pages--;
-		}
-		else if (prefetch_iterator)
-		{
-			/* Do not let the prefetch iterator get behind the main one */
-			TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
-
-			if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
-				elog(ERROR, "prefetch and main iterators are out of sync");
-		}
-		return;
-	}
-
-	if (node->prefetch_maximum > 0)
-	{
-		TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
-		SpinLockAcquire(&pstate->mutex);
-		if (pstate->prefetch_pages > 0)
-		{
-			pstate->prefetch_pages--;
-			SpinLockRelease(&pstate->mutex);
-		}
-		else
-		{
-			/* Release the mutex before iterating */
-			SpinLockRelease(&pstate->mutex);
-
-			/*
-			 * In case of shared mode, we can not ensure that the current
-			 * blockno of the main iterator and that of the prefetch iterator
-			 * are same.  It's possible that whatever blockno we are
-			 * prefetching will be processed by another process.  Therefore,
-			 * we don't validate the blockno here as we do in non-parallel
-			 * case.
-			 */
-			if (prefetch_iterator)
-				tbm_shared_iterate(prefetch_iterator);
-		}
-	}
-#endif							/* USE_PREFETCH */
-}
-
-/*
- * BitmapAdjustPrefetchTarget - Adjust the prefetch target
- *
- * Increase prefetch target if it's not yet at the max.  Note that
- * we will increase it to zero after fetching the very first
- * page/tuple, then to one after the second tuple is fetched, then
- * it doubles as later pages are fetched.
- */
-static inline void
-BitmapAdjustPrefetchTarget(BitmapHeapScanState *node)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		if (node->prefetch_target >= node->prefetch_maximum)
-			 /* don't increase any further */ ;
-		else if (node->prefetch_target >= node->prefetch_maximum / 2)
-			node->prefetch_target = node->prefetch_maximum;
-		else if (node->prefetch_target > 0)
-			node->prefetch_target *= 2;
-		else
-			node->prefetch_target++;
-		return;
-	}
-
-	/* Do an unlocked check first to save spinlock acquisitions. */
-	if (pstate->prefetch_target < node->prefetch_maximum)
-	{
-		SpinLockAcquire(&pstate->mutex);
-		if (pstate->prefetch_target >= node->prefetch_maximum)
-			 /* don't increase any further */ ;
-		else if (pstate->prefetch_target >= node->prefetch_maximum / 2)
-			pstate->prefetch_target = node->prefetch_maximum;
-		else if (pstate->prefetch_target > 0)
-			pstate->prefetch_target *= 2;
-		else
-			pstate->prefetch_target++;
-		SpinLockRelease(&pstate->mutex);
-	}
-#endif							/* USE_PREFETCH */
-}
-
-/*
- * BitmapPrefetch - Prefetch, if prefetch_pages are behind prefetch_target
- */
-static inline void
-BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
-		if (prefetch_iterator)
-		{
-			while (node->prefetch_pages < node->prefetch_target)
-			{
-				TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
-				bool		skip_fetch;
-
-				if (tbmpre == NULL)
-				{
-					/* No more pages to prefetch */
-					tbm_end_iterate(prefetch_iterator);
-					node->prefetch_iterator = NULL;
-					break;
-				}
-				node->prefetch_pages++;
-
-				/*
-				 * If we expect not to have to actually read this heap page,
-				 * skip this prefetch call, but continue to run the prefetch
-				 * logic normally.  (Would it be better not to increment
-				 * prefetch_pages?)
-				 *
-				 * This depends on the assumption that the index AM will
-				 * report the same recheck flag for this future heap page as
-				 * it did for the current heap page; which is not a certainty
-				 * but is true in many cases.
-				 */
-				skip_fetch = (node->can_skip_fetch &&
-							  (node->tbmres ? !node->tbmres->recheck : false) &&
-							  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-											 tbmpre->blockno,
-											 &node->pvmbuffer));
-
-				if (!skip_fetch)
-					PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
-			}
-		}
-
-		return;
-	}
-
-	if (pstate->prefetch_pages < pstate->prefetch_target)
-	{
-		TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
-		if (prefetch_iterator)
-		{
-			while (1)
-			{
-				TBMIterateResult *tbmpre;
-				bool		do_prefetch = false;
-				bool		skip_fetch;
-
-				/*
-				 * Recheck under the mutex. If some other process has already
-				 * done enough prefetching then we need not to do anything.
-				 */
-				SpinLockAcquire(&pstate->mutex);
-				if (pstate->prefetch_pages < pstate->prefetch_target)
-				{
-					pstate->prefetch_pages++;
-					do_prefetch = true;
-				}
-				SpinLockRelease(&pstate->mutex);
-
-				if (!do_prefetch)
-					return;
-
-				tbmpre = tbm_shared_iterate(prefetch_iterator);
-				if (tbmpre == NULL)
-				{
-					/* No more pages to prefetch */
-					tbm_end_shared_iterate(prefetch_iterator);
-					node->shared_prefetch_iterator = NULL;
-					break;
-				}
-
-				/* As above, skip prefetch if we expect not to need page */
-				skip_fetch = (node->can_skip_fetch &&
-							  (node->tbmres ? !node->tbmres->recheck : false) &&
-							  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-											 tbmpre->blockno,
-											 &node->pvmbuffer));
-
-				if (!skip_fetch)
-					PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
-			}
-		}
-	}
-#endif							/* USE_PREFETCH */
-}
-
 /*
  * BitmapHeapRecheck -- access method routine to recheck a tuple in EvalPlanQual
  */
@@ -607,29 +232,10 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
 	table_rescan(node->ss.ss_currentScanDesc, NULL);
 
 	/* release bitmaps and buffers if any */
-	if (node->tbmiterator)
-		tbm_end_iterate(node->tbmiterator);
-	if (node->prefetch_iterator)
-		tbm_end_iterate(node->prefetch_iterator);
-	if (node->shared_tbmiterator)
-		tbm_end_shared_iterate(node->shared_tbmiterator);
-	if (node->shared_prefetch_iterator)
-		tbm_end_shared_iterate(node->shared_prefetch_iterator);
 	if (node->tbm)
 		tbm_free(node->tbm);
-	if (node->vmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->vmbuffer);
-	if (node->pvmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->pvmbuffer);
 	node->tbm = NULL;
-	node->tbmiterator = NULL;
-	node->tbmres = NULL;
-	node->prefetch_iterator = NULL;
 	node->initialized = false;
-	node->shared_tbmiterator = NULL;
-	node->shared_prefetch_iterator = NULL;
-	node->vmbuffer = InvalidBuffer;
-	node->pvmbuffer = InvalidBuffer;
 
 	ExecScanReScan(&node->ss);
 
@@ -663,20 +269,8 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
 	/*
 	 * release bitmaps and buffers if any
 	 */
-	if (node->tbmiterator)
-		tbm_end_iterate(node->tbmiterator);
-	if (node->prefetch_iterator)
-		tbm_end_iterate(node->prefetch_iterator);
 	if (node->tbm)
 		tbm_free(node->tbm);
-	if (node->shared_tbmiterator)
-		tbm_end_shared_iterate(node->shared_tbmiterator);
-	if (node->shared_prefetch_iterator)
-		tbm_end_shared_iterate(node->shared_prefetch_iterator);
-	if (node->vmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->vmbuffer);
-	if (node->pvmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->pvmbuffer);
 
 	/*
 	 * close heap scan
@@ -695,6 +289,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 {
 	BitmapHeapScanState *scanstate;
 	Relation	currentRelation;
+	bool		can_skip_fetch;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -714,32 +309,10 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->ss.ps.ExecProcNode = ExecBitmapHeapScan;
 
 	scanstate->tbm = NULL;
-	scanstate->tbmiterator = NULL;
-	scanstate->tbmres = NULL;
-	scanstate->return_empty_tuples = 0;
-	scanstate->vmbuffer = InvalidBuffer;
-	scanstate->pvmbuffer = InvalidBuffer;
-	scanstate->exact_pages = 0;
-	scanstate->lossy_pages = 0;
-	scanstate->prefetch_iterator = NULL;
-	scanstate->prefetch_pages = 0;
-	scanstate->prefetch_target = 0;
 	scanstate->pscan_len = 0;
 	scanstate->initialized = false;
-	scanstate->shared_tbmiterator = NULL;
-	scanstate->shared_prefetch_iterator = NULL;
 	scanstate->pstate = NULL;
 
-	/*
-	 * We can potentially skip fetching heap pages if we do not need any
-	 * columns of the table, either for checking non-indexable quals or for
-	 * returning data.  This test is a bit simplistic, as it checks the
-	 * stronger condition that there's no qual or return tlist at all.  But in
-	 * most cases it's probably not worth working harder than that.
-	 */
-	scanstate->can_skip_fetch = (node->scan.plan.qual == NIL &&
-								 node->scan.plan.targetlist == NIL);
-
 	/*
 	 * Miscellaneous initialization
 	 *
@@ -778,19 +351,22 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->bitmapqualorig =
 		ExecInitQual(node->bitmapqualorig, (PlanState *) scanstate);
 
+	scanstate->ss.ss_currentRelation = currentRelation;
+
 	/*
-	 * Maximum number of prefetches for the tablespace if configured,
-	 * otherwise the current value of the effective_io_concurrency GUC.
+	 * We can potentially skip fetching heap pages if we do not need any
+	 * columns of the table, either for checking non-indexable quals or for
+	 * returning data.  This test is a bit simplistic, as it checks the
+	 * stronger condition that there's no qual or return tlist at all.  But in
+	 * most cases it's probably not worth working harder than that.
 	 */
-	scanstate->prefetch_maximum =
-		get_tablespace_io_concurrency(currentRelation->rd_rel->reltablespace);
-
-	scanstate->ss.ss_currentRelation = currentRelation;
+	can_skip_fetch = (node->scan.plan.qual == NIL &&
+					  node->scan.plan.targetlist == NIL);
 
 	scanstate->ss.ss_currentScanDesc = table_beginscan_bm(currentRelation,
 														  estate->es_snapshot,
 														  0,
-														  NULL);
+														  NULL, can_skip_fetch);
 
 	/*
 	 * all done.
@@ -875,12 +451,9 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
 	pstate = shm_toc_allocate(pcxt->toc, node->pscan_len);
 
 	pstate->tbmiterator = 0;
-	pstate->prefetch_iterator = 0;
 
 	/* Initialize the mutex */
 	SpinLockInit(&pstate->mutex);
-	pstate->prefetch_pages = 0;
-	pstate->prefetch_target = 0;
 	pstate->state = BM_INITIAL;
 
 	ConditionVariableInit(&pstate->cv);
@@ -912,11 +485,7 @@ ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
 	if (DsaPointerIsValid(pstate->tbmiterator))
 		tbm_free_shared_area(dsa, pstate->tbmiterator);
 
-	if (DsaPointerIsValid(pstate->prefetch_iterator))
-		tbm_free_shared_area(dsa, pstate->prefetch_iterator);
-
 	pstate->tbmiterator = InvalidDsaPointer;
-	pstate->prefetch_iterator = InvalidDsaPointer;
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index e8ab5d78fcc..3f84abeb907 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -181,7 +181,6 @@ struct TBMIterator
 	int			spageptr;		/* next spages index */
 	int			schunkptr;		/* next schunks index */
 	int			schunkbit;		/* next bit to check in current schunk */
-	TBMIterateResult output;	/* MUST BE LAST (because variable-size) */
 };
 
 /*
@@ -222,7 +221,6 @@ struct TBMSharedIterator
 	PTEntryArray *ptbase;		/* pagetable element array */
 	PTIterationArray *ptpages;	/* sorted exact page index list */
 	PTIterationArray *ptchunks; /* sorted lossy page index list */
-	TBMIterateResult output;	/* MUST BE LAST (because variable-size) */
 };
 
 /* Local function prototypes */
@@ -696,8 +694,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
 	 * Create the TBMIterator struct, with enough trailing space to serve the
 	 * needs of the TBMIterateResult sub-struct.
 	 */
-	iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
-									  MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+	iterator = (TBMIterator *) palloc(sizeof(TBMIterator));
 	iterator->tbm = tbm;
 
 	/*
@@ -958,20 +955,21 @@ tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
 /*
  * tbm_iterate - scan through next page of a TIDBitmap
  *
- * Returns a TBMIterateResult representing one page, or NULL if there are
- * no more pages to scan.  Pages are guaranteed to be delivered in numerical
- * order.  If result->ntuples < 0, then the bitmap is "lossy" and failed to
- * remember the exact tuples to look at on this page --- the caller must
- * examine all tuples on the page and check if they meet the intended
- * condition.  If result->recheck is true, only the indicated tuples need
- * be examined, but the condition must be rechecked anyway.  (For ease of
- * testing, recheck is always set true when ntuples < 0.)
+ * Caller must pass in a TBMIterateResult to be filled.
+ *
+ * Pages are guaranteed to be delivered in numerical order.  tbmres->blockno is
+ * set to InvalidBlockNumber when there are no more pages to scan. If
+ * tbmres->ntuples < 0, then the bitmap is "lossy" and failed to remember the
+ * exact tuples to look at on this page --- the caller must examine all tuples
+ * on the page and check if they meet the intended condition.  If
+ * tbmres->recheck is true, only the indicated tuples need be examined, but the
+ * condition must be rechecked anyway.  (For ease of testing, recheck is always
+ * set true when ntuples < 0.)
  */
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)
 {
 	TIDBitmap  *tbm = iterator->tbm;
-	TBMIterateResult *output = &(iterator->output);
 
 	Assert(tbm->iterating == TBM_ITERATING_PRIVATE);
 
@@ -999,6 +997,7 @@ tbm_iterate(TBMIterator *iterator)
 	 * If both chunk and per-page data remain, must output the numerically
 	 * earlier page.
 	 */
+	Assert(tbmres);
 	if (iterator->schunkptr < tbm->nchunks)
 	{
 		PagetableEntry *chunk = tbm->schunks[iterator->schunkptr];
@@ -1009,11 +1008,11 @@ tbm_iterate(TBMIterator *iterator)
 			chunk_blockno < tbm->spages[iterator->spageptr]->blockno)
 		{
 			/* Return a lossy page indicator from the chunk */
-			output->blockno = chunk_blockno;
-			output->ntuples = -1;
-			output->recheck = true;
+			tbmres->blockno = chunk_blockno;
+			tbmres->ntuples = -1;
+			tbmres->recheck = true;
 			iterator->schunkbit++;
-			return output;
+			return;
 		}
 	}
 
@@ -1029,16 +1028,17 @@ tbm_iterate(TBMIterator *iterator)
 			page = tbm->spages[iterator->spageptr];
 
 		/* scan bitmap to extract individual offset numbers */
-		ntuples = tbm_extract_page_tuple(page, output);
-		output->blockno = page->blockno;
-		output->ntuples = ntuples;
-		output->recheck = page->recheck;
+		ntuples = tbm_extract_page_tuple(page, tbmres);
+		tbmres->blockno = page->blockno;
+		tbmres->ntuples = ntuples;
+		tbmres->recheck = page->recheck;
 		iterator->spageptr++;
-		return output;
+		return;
 	}
 
 	/* Nothing more in the bitmap */
-	return NULL;
+	tbmres->blockno = InvalidBlockNumber;
+	return;
 }
 
 /*
@@ -1048,10 +1048,9 @@ tbm_iterate(TBMIterator *iterator)
  *	across multiple processes.  We need to acquire the iterator LWLock,
  *	before accessing the shared members.
  */
-TBMIterateResult *
-tbm_shared_iterate(TBMSharedIterator *iterator)
+void
+tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres)
 {
-	TBMIterateResult *output = &iterator->output;
 	TBMSharedIteratorState *istate = iterator->state;
 	PagetableEntry *ptbase = NULL;
 	int		   *idxpages = NULL;
@@ -1102,13 +1101,13 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
 			chunk_blockno < ptbase[idxpages[istate->spageptr]].blockno)
 		{
 			/* Return a lossy page indicator from the chunk */
-			output->blockno = chunk_blockno;
-			output->ntuples = -1;
-			output->recheck = true;
+			tbmres->blockno = chunk_blockno;
+			tbmres->ntuples = -1;
+			tbmres->recheck = true;
 			istate->schunkbit++;
 
 			LWLockRelease(&istate->lock);
-			return output;
+			return;
 		}
 	}
 
@@ -1118,21 +1117,22 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
 		int			ntuples;
 
 		/* scan bitmap to extract individual offset numbers */
-		ntuples = tbm_extract_page_tuple(page, output);
-		output->blockno = page->blockno;
-		output->ntuples = ntuples;
-		output->recheck = page->recheck;
+		ntuples = tbm_extract_page_tuple(page, tbmres);
+		tbmres->blockno = page->blockno;
+		tbmres->ntuples = ntuples;
+		tbmres->recheck = page->recheck;
 		istate->spageptr++;
 
 		LWLockRelease(&istate->lock);
 
-		return output;
+		return;
 	}
 
 	LWLockRelease(&istate->lock);
 
 	/* Nothing more in the bitmap */
-	return NULL;
+	tbmres->blockno = InvalidBlockNumber;
+	return;
 }
 
 /*
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 21e71d60912..e28fb71aaf1 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -20,6 +20,7 @@
 #include "access/skey.h"
 #include "access/table.h"		/* for backward compatibility */
 #include "access/tableam.h"
+#include "nodes/execnodes.h"
 #include "nodes/lockoptions.h"
 #include "nodes/primnodes.h"
 #include "storage/bufpage.h"
@@ -76,9 +77,15 @@ typedef struct HeapScanDescData
 	struct PgStreamingRead *pgsr;
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
+
+	/*
+	 * MFIXME: not sure if vmbuffer is being released in the right place.
+	 */
+	Buffer		vmbuffer;		/* current VM buffer for BHS */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
 	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
+	int			empty_tuples;
 }			HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
 
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304ab..cdb44316a34 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -16,6 +16,7 @@
 
 #include "access/htup_details.h"
 #include "access/itup.h"
+#include "nodes/tidbitmap.h"
 #include "port/atomics.h"
 #include "storage/buf.h"
 #include "storage/spin.h"
@@ -40,6 +41,13 @@ typedef struct TableScanDescData
 	ItemPointerData rs_mintid;
 	ItemPointerData rs_maxtid;
 
+	/* Only used for Bitmap table scans */
+	TBMIterator *tbmiterator;
+	TBMSharedIterator *shared_tbmiterator;
+	long		exact_pages;
+	long		lossy_pages;
+	int			empty_tuples;
+
 	/*
 	 * Information about type and behaviour of the scan, a bitmask of members
 	 * of the ScanOptions enum (see tableam.h).
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5f8474871d2..b5626805a62 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -32,6 +32,7 @@ extern PGDLLIMPORT char *default_table_access_method;
 extern PGDLLIMPORT bool synchronize_seqscans;
 
 
+struct BitmapHeapScanState;
 struct BulkInsertStateData;
 struct IndexInfo;
 struct SampleScanState;
@@ -62,7 +63,8 @@ typedef enum ScanOptions
 
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
-}			ScanOptions;
+	SO_CAN_SKIP_FETCH = 1 << 10
+} ScanOptions;
 
 /*
  * Result codes for table_{update,delete,lock_tuple}, and for visibility
@@ -773,53 +775,36 @@ typedef struct TableAmRoutine
 	 */
 
 	/*
-	 * Prepare to fetch / check / return tuples from `tbmres->blockno` as part
-	 * of a bitmap table scan. `scan` was started via table_beginscan_bm().
-	 * Return false if there are no tuples to be found on the page, true
-	 * otherwise.
+	 * Prepare to fetch / check / return tuples obtained from the streaming
+	 * read helper as part of a bitmap table scan. `scan` was started via
+	 * table_beginscan_bm(). Return false if the relation is exhausted and
+	 * true otherwise.
 	 *
 	 * This will typically read and pin the target block, and do the necessary
 	 * work to allow scan_bitmap_next_tuple() to return tuples (e.g. it might
 	 * make sense to perform tuple visibility checks at this time). For some
-	 * AMs it will make more sense to do all the work referencing `tbmres`
-	 * contents here, for others it might be better to defer more work to
-	 * scan_bitmap_next_tuple.
+	 * AMs, it might be better to defer more work to scan_bitmap_next_tuple().
 	 *
-	 * If `tbmres->blockno` is -1, this is a lossy scan and all visible tuples
-	 * on the page have to be returned, otherwise the tuples at offsets in
-	 * `tbmres->offsets` need to be returned.
+	 * After examining the page from the TBMIterateResult returned by the
+	 * streaming read helper, most AMs will set relevant fields in the `scan`
+	 * for the benefit of scan_bitmap_next_tuple(). All AMs must set
+	 * `recheck`, passed in by the caller, based on the value in the
+	 * TBMIterateResult so that BitmapHeapNext() can determine whether or not
+	 * to yield tuples.
 	 *
-	 * XXX: Currently this may only be implemented if the AM uses md.c as its
-	 * storage manager, and uses ItemPointer->ip_blkid in a manner that maps
-	 * blockids directly to the underlying storage. nodeBitmapHeapscan.c
-	 * performs prefetching directly using that interface.  This probably
-	 * needs to be rectified at a later point.
-	 *
-	 * XXX: Currently this may only be implemented if the AM uses the
-	 * visibilitymap, as nodeBitmapHeapscan.c unconditionally accesses it to
-	 * perform prefetching.  This probably needs to be rectified at a later
-	 * point.
-	 *
-	 * Optional callback, but either both scan_bitmap_next_block and
-	 * scan_bitmap_next_tuple need to exist, or neither.
+	 * Optional callback, but scan_bitmap_setup, scan_bitmap_next_block, and
+	 * scan_bitmap_next_tuple need to exist, or none of them.
 	 */
-	bool		(*scan_bitmap_next_block) (TableScanDesc scan,
-										   struct TBMIterateResult *tbmres);
+	bool		(*scan_bitmap_next_block) (TableScanDesc scan, bool *recheck);
 
 	/*
 	 * Fetch the next tuple of a bitmap table scan into `slot` and return true
 	 * if a visible tuple was found, false otherwise.
 	 *
-	 * For some AMs it will make more sense to do all the work referencing
-	 * `tbmres` contents in scan_bitmap_next_block, for others it might be
-	 * better to defer more work to this callback.
-	 *
-	 * Optional callback, but either both scan_bitmap_next_block and
-	 * scan_bitmap_next_tuple need to exist, or neither.
+	 * Optional callback, but scan_bitmap_setup, scan_bitmap_next_block, and
+	 * scan_bitmap_next_tuple need to exist, or none of them.
 	 */
-	bool		(*scan_bitmap_next_tuple) (TableScanDesc scan,
-										   struct TBMIterateResult *tbmres,
-										   TupleTableSlot *slot);
+	bool		(*scan_bitmap_next_tuple) (TableScanDesc scan, TupleTableSlot *slot);
 
 	/*
 	 * Prepare to fetch tuples from the next block in a sample scan. Return
@@ -944,10 +929,13 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
  */
 static inline TableScanDesc
 table_beginscan_bm(Relation rel, Snapshot snapshot,
-				   int nkeys, struct ScanKeyData *key)
+				   int nkeys, struct ScanKeyData *key, bool can_skip_fetch)
 {
 	uint32		flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
 
+	if (can_skip_fetch)
+		flags |= SO_CAN_SKIP_FETCH;
+
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
 
@@ -1955,8 +1943,7 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
  * used after verifying the presence (at plan time or such).
  */
 static inline bool
-table_scan_bitmap_next_block(TableScanDesc scan,
-							 struct TBMIterateResult *tbmres)
+table_scan_bitmap_next_block(TableScanDesc scan, bool *recheck)
 {
 	/*
 	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1966,8 +1953,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
 	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
 		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
 
-	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
-														   tbmres);
+	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck);
 }
 
 /*
@@ -1979,9 +1965,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
  * returned false.
  */
 static inline bool
-table_scan_bitmap_next_tuple(TableScanDesc scan,
-							 struct TBMIterateResult *tbmres,
-							 TupleTableSlot *slot)
+table_scan_bitmap_next_tuple(TableScanDesc scan, TupleTableSlot *slot)
 {
 	/*
 	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
@@ -1992,7 +1976,6 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
 
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
-														   tbmres,
 														   slot);
 }
 
diff --git a/src/include/executor/nodeBitmapHeapscan.h b/src/include/executor/nodeBitmapHeapscan.h
index ea003a9caae..39ae7b29efb 100644
--- a/src/include/executor/nodeBitmapHeapscan.h
+++ b/src/include/executor/nodeBitmapHeapscan.h
@@ -28,5 +28,6 @@ extern void ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
 										  ParallelContext *pcxt);
 extern void ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
 										   ParallelWorkerContext *pwcxt);
+extern void table_bitmap_scan_setup(BitmapHeapScanState *scanstate);
 
 #endif							/* NODEBITMAPHEAPSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 444a5f0fd57..eab4ae0f54c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1682,11 +1682,7 @@ typedef enum
 /* ----------------
  *	 ParallelBitmapHeapState information
  *		tbmiterator				iterator for scanning current pages
- *		prefetch_iterator		iterator for prefetching ahead of current page
- *		mutex					mutual exclusion for the prefetching variable
- *								and state
- *		prefetch_pages			# pages prefetch iterator is ahead of current
- *		prefetch_target			current target prefetch distance
+ *		mutex					mutual exclusion for the state
  *		state					current state of the TIDBitmap
  *		cv						conditional wait variable
  *		phs_snapshot_data		snapshot data shared to workers
@@ -1695,10 +1691,7 @@ typedef enum
 typedef struct ParallelBitmapHeapState
 {
 	dsa_pointer tbmiterator;
-	dsa_pointer prefetch_iterator;
 	slock_t		mutex;
-	int			prefetch_pages;
-	int			prefetch_target;
 	SharedBitmapState state;
 	ConditionVariable cv;
 	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
@@ -1709,22 +1702,9 @@ typedef struct ParallelBitmapHeapState
  *
  *		bitmapqualorig	   execution state for bitmapqualorig expressions
  *		tbm				   bitmap obtained from child index scan(s)
- *		tbmiterator		   iterator for scanning current pages
- *		tbmres			   current-page data
- *		can_skip_fetch	   can we potentially skip tuple fetches in this scan?
- *		return_empty_tuples number of empty tuples to return
- *		vmbuffer		   buffer for visibility-map lookups
- *		pvmbuffer		   ditto, for prefetched pages
- *		exact_pages		   total number of exact pages retrieved
- *		lossy_pages		   total number of lossy pages retrieved
- *		prefetch_iterator  iterator for prefetching ahead of current page
- *		prefetch_pages	   # pages prefetch iterator is ahead of current
- *		prefetch_target    current target prefetch distance
- *		prefetch_maximum   maximum value for prefetch_target
+ *		recheck			   whether or not the tuple needs to be rechecked before returning
  *		pscan_len		   size of the shared memory for parallel bitmap
  *		initialized		   is node is ready to iterate
- *		shared_tbmiterator	   shared iterator
- *		shared_prefetch_iterator shared iterator for prefetching
  *		pstate			   shared state for parallel bitmap scan
  * ----------------
  */
@@ -1733,22 +1713,9 @@ typedef struct BitmapHeapScanState
 	ScanState	ss;				/* its first field is NodeTag */
 	ExprState  *bitmapqualorig;
 	TIDBitmap  *tbm;
-	TBMIterator *tbmiterator;
-	TBMIterateResult *tbmres;
-	bool		can_skip_fetch;
-	int			return_empty_tuples;
-	Buffer		vmbuffer;
-	Buffer		pvmbuffer;
-	long		exact_pages;
-	long		lossy_pages;
-	TBMIterator *prefetch_iterator;
-	int			prefetch_pages;
-	int			prefetch_target;
-	int			prefetch_maximum;
+	bool		recheck;
 	Size		pscan_len;
 	bool		initialized;
-	TBMSharedIterator *shared_tbmiterator;
-	TBMSharedIterator *shared_prefetch_iterator;
 	ParallelBitmapHeapState *pstate;
 } BitmapHeapScanState;
 
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index 1945f0639bf..97e40097fd2 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -64,8 +64,8 @@ extern bool tbm_is_empty(const TIDBitmap *tbm);
 
 extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
 extern dsa_pointer tbm_prepare_shared_iterate(TIDBitmap *tbm);
-extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
-extern TBMIterateResult *tbm_shared_iterate(TBMSharedIterator *iterator);
+extern void tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres);
+extern void tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres);
 extern void tbm_end_iterate(TBMIterator *iterator);
 extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
 extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
-- 
2.39.2

#28

Robert Haas

robertmhaas@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#27)

Re: Streaming I/O, vectored I/O (WIP)

On Tue, Feb 27, 2024 at 9:25 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Here's the 2 step version. The streaming_read.c API is unchanged, but
the bugmfr.c API now has only the following extra functions:

bool StartReadBuffers(..., int *nblocks, ..., ReadBuffersOperation *op)
WaitReadBuffers(ReadBuffersOperation *op)

I wonder if there are semi-standard names that people use for this
kind of API. Somehow I like "start" and "finish" or "start" and
"complete" better than "start" and "wait". But I don't really know
what's best. If there's a usual practice, it'd be good to adhere to
it.

--
Robert Haas
EDB: http://www.enterprisedb.com

#29

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Robert Haas (#28)

7 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

On Tue, Feb 27, 2024 at 5:03 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Feb 27, 2024 at 9:25 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Here's the 2 step version. The streaming_read.c API is unchanged, but
the bugmfr.c API now has only the following extra functions:

bool StartReadBuffers(..., int *nblocks, ..., ReadBuffersOperation *op)
WaitReadBuffers(ReadBuffersOperation *op)

I wonder if there are semi-standard names that people use for this
kind of API. Somehow I like "start" and "finish" or "start" and
"complete" better than "start" and "wait". But I don't really know
what's best. If there's a usual practice, it'd be good to adhere to
it.

I think "complete" has a subtly different meaning, which is why I
liked Heikki's suggestion. One way to look at it is that only one
backend can "complete" a read, but any number of backends can "wait".
But I don't have strong views on this. It feels like this API is
relatively difficult to use directly anyway, so almost all users will
go through the streaming read API.

Here's a new version with two main improvements. (Note that I'm only
talking about 0001-0003 here, the rest are useful for testing but are
just outdated versions of patches that have their own threads.)

1. Melanie discovered small regressions in all-cached simple scans.
Here's a better look-ahead distance control algorithm that addresses
that. First let me state the updated goals of the algorithm:

A) for all-hit scans, pin very few buffers, since that can't help if
we're not doing any I/O
B) for all-miss sequential scans, pin only as many buffers as it
takes to build full-sized I/Os, since fully sequential scans are left
to the OS to optimise for now (ie no "advice")
C) for all-miss random scans, pin as many buffers as it takes to
reach our I/O concurrency level

In all cases, respect the per-backend pin limit as a last resort limit
(roughly NBuffers / max_connections), but we're now actively trying
*not* to use so many.

For patterns in between the A, B, C extremes, do something in between.
The way the new algorithm attempts to classify the scan adaptively
over time is as follows:

* look ahead distance starts out at one block (behaviour A)
* every time we start an I/O, we double the distance until we reach
the max pin limit (behaviour C), or if we're not issuing "advice"
because sequential access is detected, until we reach the
MAX_TRANSFER_BUFFERS (behaviour B)
* every time we get a hit, we decrement the distance by one (we move
slowly back to behaviour A)

Query to observe a system transitioning A->B->A->B, when doing a full
scan that has ~50 contiguous blocks already in shared buffers
somewhere in the middle:

create extension pg_prewarm;
create extension pg_buffercache;
set max_parallel_workers_per_gather = 0;

create table t (i int);
insert into t select generate_series(1, 100000);
select pg_prewarm('t');
select bool_and(pg_buffercache_invalidate(bufferid))
from pg_buffercache
where relfilenode = pg_relation_filenode('t')
and (relblocknumber between 0 and 100 or relblocknumber > 150);

select count(*) from t;

pread(31,...) = 8192 (0x2000) <--- start small (A)
preadv(31,...) = 16384 (0x4000) <--- ramp up...
preadv(31,...) = 32768 (0x8000)
preadv(31,...) = 65536 (0x10000)
preadv(31,...) = 131072 (0x20000) <--- full size (B)
preadv(31,...) = 131072 (0x20000)
...
preadv(31,...) = 131072 (0x20000)
preadv(31,...) = 49152 (0xc000) <--- end of misses, decay to A
pread(31,...) = 8192 (0x2000) <--- start small again (A)
preadv(31,...) = 16384 (0x4000)
preadv(31,...) = 32768 (0x8000)
preadv(31,...) = 65536 (0x10000)
preadv(31,...) = 131072 (0x20000) <--- full size (B)
preadv(31,...) = 131072 (0x20000)
...
preadv(31,...) = 131072 (0x20000)
preadv(31,...) = 122880 (0x1e000) <-- end of relation

Note that this never pins more than 16 buffers, because it's case B,
not case C in the description above. There is no benefit in looking
further ahead if you're relying on the kernel's sequential
prefetching.

The fully cached regression Melanie reported now stays entirely in A.
The previous coding would ramp up to high look-ahead distance for no
good reason and never ramp down, so it was wasteful and slightly
slower than master.

2. Melanie and I noticed while discussing the pre-existing ad hoc
bitmap heap scan that this thing should respect the
{effective,maintenance}_io_concurrency setting of the containing
tablespace. Note special cases to avoid problems while stream-reading
pg_tablespace itself and pg_database in backend initialisation.

There is a third problem that I'm still working on: the behaviour for
very small values of effective_io_concurrency isn't quite right, as
discussed in detail by Tomas and Melanie on the bitmap heapscan
thread. The attached makes 0 do approximately the right thing (though
I'm hoping to make it less special), but other small numbers aren't
quite right yet -- 1 is still issuing a useless fadvise at the wrong
time, and 2 is working in groups of N at a time instead of
interleaving as you might perhaps expect. These are probably
side-effects of my focusing on coalescing large reads and losing sight
of the small cases. I need a little more adaptivity and generality in
the algorithm at the small end, not least because 1 is the default
value. I'll share a patch to improve that very soon.

Attachments:

v6-0001-Provide-vectored-variant-of-ReadBuffer.patchtext/x-patch; charset=US-ASCII; name=v6-0001-Provide-vectored-variant-of-ReadBuffer.patchDownload

From af0178dd30013cdad1ef0d002792f52302b015bf Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v6 1/7] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

With some more infrastructure in later work, StartReadBuffers() should
be extended to start real asynchronous I/O instead of advice.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 641 ++++++++++++++++++--------
 src/backend/storage/buffer/localbuf.c |  14 +-
 src/include/storage/bufmgr.h          |  45 ++
 src/tools/pgindent/typedefs.list      |   1 +
 4 files changed, 492 insertions(+), 209 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c..729d1f9172 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -471,10 +476,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,7 +504,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -781,7 +785,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -794,15 +797,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +819,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -994,35 +990,68 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
+/*
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
+ */
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * ReadBuffer_common -- common logic for all ReadBuffer variants
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+				  BufferAccessStrategy strategy)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			nblocks;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1041,181 +1070,404 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	nblocks = 1;
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
+	else
+		flags = 0;
+	if (StartReadBuffers(bmr,
+						 &buffer,
+						 forkNum,
+						 blockNum,
+						 &nblocks,
+						 strategy,
+						 flags,
+						 &operation))
+		WaitReadBuffers(&operation);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+		ZeroBuffer(buffer, mode);
+
+	return buffer;
+}
+
+static Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
 
+	Assert(bmr.smgr);
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+	return BufferDescriptorGetBuffer(bufHdr);
+}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.
+ *
+ * If false is returned, no I/O is necessary and WaitReadBuffers() is not
+ * necessary.  If true is returned, one I/O has been started, and
+ * WaitReadBuffers() must be called with the same operation object before the
+ * buffers are accessed.  Along with the operation object, the caller-supplied
+ * array of buffers must remain valid until WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+bool
+StartReadBuffers(BufferManagerRelation bmr,
+				 Buffer *buffers,
+				 ForkNumber forkNum,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 BufferAccessStrategy strategy,
+				 int flags,
+				 ReadBuffersOperation *operation)
+{
+	int			actual_nblocks = *nblocks;
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
 	}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	operation->bmr = bmr;
+	operation->forknum = forkNum;
+	operation->blocknum = blockNum;
+	operation->buffers = buffers;
+	operation->nblocks = actual_nblocks;
+	operation->strategy = strategy;
+	operation->flags = flags;
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	operation->io_buffers_len = 0;
 
-	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
-	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
-	else
+	for (int i = 0; i < actual_nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		bool		found;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		buffers[i] = PrepareReadBuffer(bmr,
+									   forkNum,
+									   blockNum + i,
+									   strategy,
+									   &found);
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = operation->nblocks = *nblocks = i + 1;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			operation->io_buffers_len++;
+		}
+	}
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+	if (operation->io_buffers_len > 0)
+	{
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
-			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
-			}
-			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
+			/*
+			 * In theory we should only do this if PrepareReadBuffers() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
 		}
+
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
 	}
+	else
+	{
+		return false;
+	}
+}
 
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
 	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	/*
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
+	 */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+	bmr = operation->bmr;
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
-
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
 	}
 	else
 	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
 	}
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+	/*
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
+	 */
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
+	else
+		pgBufferUsage.shared_blks_read += nblocks;
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+	for (int i = 0; i < nblocks; ++i)
+	{
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
+
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
+		{
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
+			{
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
+			}
+			else
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
+
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
+
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
+
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
+
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for StartReadBuffers.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1475,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1286,19 +1534,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1602,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1637,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +1993,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2258,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2281,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2596,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2610,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3678,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5413,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5434,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a5..985a2c7049 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d335..b57f71f97e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers().  Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int			nblocks;
+	BufferAccessStrategy strategy;
+	int			flags;
+
+	/* Range of buffers, if we need to perform a read. */
+	int			io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffers(BufferManagerRelation bmr,
+							 Buffer *buffers,
+							 ForkNumber forknum,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 BufferAccessStrategy strategy,
+							 int flags,
+							 ReadBuffersOperation *operation);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +292,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d3a7f75b08..26129b8e22 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2267,6 +2267,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.43.2

v6-0002-Provide-API-for-streaming-reads-of-relations.patchtext/x-patch; charset=US-ASCII; name=v6-0002-Provide-API-for-streaming-reads-of-relations.patchDownload

From b60fcf1b691c8db06e012d7dcb5af69ce26f7bd2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Feb 2024 00:01:42 +1300
Subject: [PATCH v6 2/7] Provide API for "streaming" reads of relations.

"Streaming reads" can be used as a more efficient replacement for
a series of ReadBuffer() calls.

The client code supplies a callback that can say which block to read
next, and then consumes individual buffers one at a time.  This division
allows streaming_read.c to build up large calls to StartReadBuffers(),
and issue fadvise() advice about future random reads in a systematic
way.

This API is based on an idea proposed by Andres Freund, to pave the way
for asynchronous I/O in future work as required to support direct I/O.
The main aim is to create an abstraction that insulates client code from
future improvements to the I/O subsystem.

An extended API may be necessary in future for more complicated cases
(for example recovery), but this basic API is thought to be sufficient
for many common usage patterns involving predictable access to a single
relation fork.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 648 +++++++++++++++++++++++
 src/backend/storage/meson.build          |   1 +
 src/include/storage/streaming_read.h     |  52 ++
 src/tools/pgindent/typedefs.list         |   2 +
 7 files changed, 723 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..eec03f6f2b 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 0000000000..bcab44c802
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 0000000000..39aef2a84a
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 0000000000..404c95b7d3
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,648 @@
+#include "postgres.h"
+
+#include "catalog/pg_tablespace.h"
+#include "miscadmin.h"
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+#include "utils/spccache.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ */
+typedef struct PgStreamingReadRange
+{
+	bool		need_wait;
+	bool		advice_issued;
+	BlockNumber blocknum;
+	int			nblocks;
+	int			per_buffer_data_index;
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+	ReadBuffersOperation operation;
+} PgStreamingReadRange;
+
+/*
+ * Streaming read object.
+ */
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	int			distance;
+	bool		finished;
+	bool		advice_enabled;
+	void	   *pgsr_private;
+	PgStreamingReadBufferCB callback;
+
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	/* Sometimes we need to buffer one block for flow control. */
+	BlockNumber unget_blocknum;
+	void	   *unget_per_buffer_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Circular buffer of ranges. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+							   void *pgsr_private,
+							   size_t per_buffer_data_size,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_ios;
+	uint32		max_pinned_buffers;
+	Oid			tablespace_id;
+
+	/*
+	 * Make sure our bmr's smgr and persistent are populated.  The caller
+	 * asserts that the storage manager will remain valid.
+	 */
+	if (!bmr.smgr)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice to the kernel to tell it that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	tablespace_id = bmr.smgr->smgr_rlocator.locator.spcOid;
+	if (!OidIsValid(MyDatabaseId) ||
+		(bmr.rel && IsCatalogRelation(bmr.rel)) ||
+		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
+	{
+		/*
+		 * Avoid circularity while trying to look up tablespace settings or
+		 * before spccache.c is ready.
+		 */
+		max_ios = effective_io_concurrency;
+	}
+	else if (flags & PGSR_FLAG_MAINTENANCE)
+		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+	else
+		max_ios = get_tablespace_io_concurrency(tablespace_id);
+
+	/*
+	 * The desired level of I/O concurrency controls how far ahead we are
+	 * willing to look ahead.  We also clamp it to at least
+	 * MAX_BUFFER_PER_TRANSFER so that we can have a chance to build up a full
+	 * sized read, even when max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+	/* Don't allow this backend to pin more than its share of buffers. */
+	if (SmgrIsTemp(bmr.smgr))
+		LimitAdditionalLocalPins(&max_pinned_buffers);
+	else
+		LimitAdditionalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0), therefore we need two extra elements
+	 * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+	 * maxmimum possible number of occupied ranges of the smallest possible
+	 * size of one.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, ranges) +
+				sizeof(pgsr->ranges[0]) * size);
+
+	pgsr->max_ios = max_ios;
+	pgsr->per_buffer_data_size = per_buffer_data_size;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+	pgsr->callback = next_block_cb;
+	pgsr->bmr = bmr;
+	pgsr->forknum = forknum;
+
+	pgsr->unget_blocknum = InvalidBlockNumber;
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  As long as direct I/O isn't
+	 * enabled, and the caller hasn't promised sequential access, we can use
+	 * it.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & PGSR_FLAG_SEQUENTIAL) == 0)
+		pgsr->advice_enabled = true;
+#endif
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out doing full-sized
+	 * reads.
+	 */
+	if (flags & PGSR_FLAG_FULL)
+		pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers);
+	else
+		pgsr->distance = 1;
+
+	/*
+	 * We want to avoid creating ranges that are smaller than they could be
+	 * just because we hit max_pinned_buffers.  We only look ahead when the
+	 * number of pinned buffers falls below this trigger number, or put
+	 * another way, we stop looking ahead when we wouldn't be able to build a
+	 * "full sized" range.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space for the callback to store extra data along with each block. */
+	if (per_buffer_data_size)
+		pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+/*
+ * Find the per-buffer data index for the Nth block of a range.
+ */
+static int
+get_per_buffer_data_index(PgStreamingRead *pgsr, PgStreamingReadRange *range, int n)
+{
+	int			result;
+
+	/*
+	 * Find slot in the circular buffer of per-buffer data, without using the
+	 * expensive % operator.
+	 */
+	result = range->per_buffer_data_index + n;
+	while (result >= pgsr->max_pinned_buffers)
+		result -= pgsr->max_pinned_buffers;
+	Assert(result == (range->per_buffer_data_index + n) % pgsr->max_pinned_buffers);
+
+	return result;
+}
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static void *
+get_per_buffer_data_by_index(PgStreamingRead *pgsr, int per_buffer_data_index)
+{
+	return (char *) pgsr->per_buffer_data +
+		pgsr->per_buffer_data_size * per_buffer_data_index;
+}
+
+/*
+ * Return a pointer to the per-buffer data for the Nth block of a range.
+ */
+static void *
+get_per_buffer_data(PgStreamingRead *pgsr, PgStreamingReadRange *range, int n)
+{
+	return get_per_buffer_data_by_index(pgsr,
+										get_per_buffer_data_index(pgsr,
+																  range,
+																  n));
+}
+
+/*
+ * Start reading the head range, and create a new head range.  The new head
+ * range is returned.  It may not be empty, if StartReadBuffers() couldn't
+ * start the entire range; in that case the returned range contains the
+ * remaining portion of the range.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_start_head_range(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *head_range;
+	PgStreamingReadRange *new_head_range;
+	int			nblocks_pinned;
+	int			flags;
+
+	/* Caller should make sure we never exceed max_ios. */
+	Assert((pgsr->ios_in_progress < pgsr->max_ios) ||
+		   (pgsr->ios_in_progress == 0 && pgsr->max_ios == 0));
+
+	/* Should only call if the head range has some blocks to read. */
+	head_range = &pgsr->ranges[pgsr->head];
+	Assert(head_range->nblocks > 0);
+
+	/*
+	 * If advice hasn't been suppressed, and this system supports it, this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (pgsr->advice_enabled &&
+		pgsr->max_ios > 0 &&
+		head_range->blocknum != pgsr->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+	/* We shouldn't be trying to pin more buffers that we're allowed to. */
+	Assert(pgsr->pinned_buffers + head_range->nblocks <= pgsr->max_pinned_buffers);
+
+	/* Start reading as many blocks as we can from the head range. */
+	nblocks_pinned = head_range->nblocks;
+	head_range->need_wait =
+		StartReadBuffers(pgsr->bmr,
+						 head_range->buffers,
+						 pgsr->forknum,
+						 head_range->blocknum,
+						 &nblocks_pinned,
+						 pgsr->strategy,
+						 flags,
+						 &head_range->operation);
+
+	if (head_range->need_wait)
+	{
+		int		distance;
+
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/*
+			 * Since we've issued advice, we count an I/O in progress until we
+			 * call WaitReadBuffers().
+			 */
+			head_range->advice_issued = true;
+			pgsr->ios_in_progress++;
+			Assert(pgsr->ios_in_progress <= pgsr->max_ios);
+
+			/*
+			 * Look-ahead distance ramps up rapidly, so we can search for more
+			 * I/Os to start.
+			 */
+			distance = pgsr->distance * 2;
+			distance = Min(distance, pgsr->max_pinned_buffers);
+			pgsr->distance = distance;
+		}
+		else
+		{
+			/*
+			 * There is no point in increasing look-ahead distance if we've
+			 * already reached the full I/O size, since we're not issuing
+			 * advice.  Extra distance would only pin more buffers for no
+			 * benefit.
+			 */
+			if (pgsr->distance > MAX_BUFFERS_PER_TRANSFER)
+			{
+				/* Look-ahead distance gradually decays. */
+				pgsr->distance--;
+			}
+			else
+			{
+				/*
+				 * Look-ahead distance ramps up rapidly, but not more that the
+				 * full I/O size.
+				 */
+				distance = pgsr->distance * 2;
+				distance = Min(distance, MAX_BUFFERS_PER_TRANSFER);
+				distance = Min(distance, pgsr->max_pinned_buffers);
+				pgsr->distance = distance;
+			}
+		}
+	}
+	else
+	{
+		/* No I/O necessary. Look-ahead distance gradually decays. */
+		if (pgsr->distance > 1)
+			pgsr->distance--;
+	}
+
+	/*
+	 * StartReadBuffers() might have pinned fewer blocks than we asked it to,
+	 * but always at least one.
+	 */
+	Assert(nblocks_pinned <= head_range->nblocks);
+	Assert(nblocks_pinned >= 1);
+	pgsr->pinned_buffers += nblocks_pinned;
+
+	/*
+	 * Remember where the next block would be after that, so we can detect
+	 * sequential access next time.
+	 */
+	pgsr->seq_blocknum = head_range->blocknum + nblocks_pinned;
+
+	/*
+	 * Create a new head range.  There must be space, because we have enough
+	 * elements for every range to hold just one block, up to the pin limit.
+	 */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	new_head_range = &pgsr->ranges[pgsr->head];
+	new_head_range->nblocks = 0;
+	new_head_range->advice_issued = false;
+
+	/*
+	 * If we didn't manage to start the whole read above, we split the range,
+	 * moving the remainder into the new head range.
+	 */
+	if (nblocks_pinned < head_range->nblocks)
+	{
+		int			nblocks_remaining = head_range->nblocks - nblocks_pinned;
+
+		head_range->nblocks = nblocks_pinned;
+
+		new_head_range->blocknum = head_range->blocknum + nblocks_pinned;
+		new_head_range->nblocks = nblocks_remaining;
+	}
+
+	/* The new range has per-buffer data starting after the previous range. */
+	new_head_range->per_buffer_data_index =
+		get_per_buffer_data_index(pgsr, head_range, nblocks_pinned);
+
+	return new_head_range;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow pg_streaming_unget_block() to work.
+ */
+static BlockNumber
+pg_streaming_get_block(PgStreamingRead *pgsr, void *per_buffer_data)
+{
+	BlockNumber result;
+
+	if (unlikely(pgsr->unget_blocknum != InvalidBlockNumber))
+	{
+		/*
+		 * If we had to unget a block, now it is time to return that one
+		 * again.
+		 */
+		result = pgsr->unget_blocknum;
+		pgsr->unget_blocknum = InvalidBlockNumber;
+
+		/*
+		 * The same per_buffer_data element must have been used, and still
+		 * contains whatever data the callback wrote into it.  So we just
+		 * sanity-check that we were called with the value that
+		 * pg_streaming_unget_block() pushed back.
+		 */
+		Assert(per_buffer_data == pgsr->unget_per_buffer_data);
+	}
+	else
+	{
+		/* Use the installed callback directly. */
+		result = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+	}
+
+	return result;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.  This *must* be called with the
+ * last value returned by pg_streaming_get_block().
+ */
+static void
+pg_streaming_unget_block(PgStreamingRead *pgsr, BlockNumber blocknum, void *per_buffer_data)
+{
+	Assert(pgsr->unget_blocknum == InvalidBlockNumber);
+	pgsr->unget_blocknum = blocknum;
+	pgsr->unget_per_buffer_data = per_buffer_data;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *range;
+
+	/* If we're finished, don't look ahead. */
+	if (pgsr->finished)
+		return;
+
+	/*
+	 * We we've already started the maximum allowed number of I/Os, don't look
+	 * ahead.  (The special case for max_ios == 0 is handle higher up.)
+	 */
+	if (pgsr->max_ios > 0 && pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a full range.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BlockNumber blocknum;
+		void	   *per_buffer_data;
+
+		/* Do we have a full-sized range? */
+		range = &pgsr->ranges[pgsr->head];
+		if (range->nblocks == lengthof(range->buffers))
+		{
+			/* Start as much of it as we can. */
+			range = pg_streaming_read_start_head_range(pgsr);
+
+			/* If we're now at the I/O limit, stop here. */
+			if (pgsr->ios_in_progress == pgsr->max_ios)
+				return;
+
+			/*
+			 * If we couldn't form a full range, then stop here to avoid
+			 * creating small I/O.
+			 */
+			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+				return;
+
+			/*
+			 * That might have only been partially started, but always
+			 * processes at least one so that'll do for now.
+			 */
+			Assert(range->nblocks < lengthof(range->buffers));
+		}
+
+		/* Find per-buffer data slot for the next block. */
+		per_buffer_data = get_per_buffer_data(pgsr, range, range->nblocks);
+
+		/* Find out which block the callback wants to read next. */
+		blocknum = pg_streaming_get_block(pgsr, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			/* End of stream. */
+			pgsr->finished = true;
+			break;
+		}
+
+		/*
+		 * Is there a head range that we cannot extend, because the requested
+		 * block is not consecutive?
+		 */
+		if (range->nblocks > 0 &&
+			range->blocknum + range->nblocks != blocknum)
+		{
+			/* Yes.  Start it, so we can begin building a new one. */
+			range = pg_streaming_read_start_head_range(pgsr);
+
+			/*
+			 * It's possible that it was only partially started, and we have a
+			 * new range with the remainder.  Keep starting I/Os until we get
+			 * it all out of the way, or we hit the I/O limit.
+			 */
+			while (range->nblocks > 0 && pgsr->ios_in_progress < pgsr->max_ios)
+				range = pg_streaming_read_start_head_range(pgsr);
+
+			/*
+			 * We have to 'unget' the block returned by the callback if we
+			 * don't have enough I/O capacity left to start something.
+			 */
+			if (pgsr->ios_in_progress == pgsr->max_ios)
+			{
+				pg_streaming_unget_block(pgsr, blocknum, per_buffer_data);
+				return;
+			}
+		}
+
+		/* If we have a new, empty range, initialize the start block. */
+		if (range->nblocks == 0)
+		{
+			range->blocknum = blocknum;
+		}
+
+		/* This block extends the range by one. */
+		Assert(range->blocknum + range->nblocks == blocknum);
+		range->nblocks++;
+
+	} while (pgsr->pinned_buffers + range->nblocks < pgsr->distance);
+
+	/* Start as much as we can. */
+	while (range->nblocks > 0)
+	{
+		range = pg_streaming_read_start_head_range(pgsr);
+		if (pgsr->ios_in_progress == pgsr->max_ios)
+			break;
+	}
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+	/*
+	 * The setting max_ios == 0 requires special t...
+	 */
+	if (pgsr->max_ios > 0 || pgsr->pinned_buffers == 0)
+		pg_streaming_read_look_ahead(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadRange *tail_range;
+
+		tail_range = &pgsr->ranges[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * range?
+		 */
+		if (tail_range->need_wait)
+		{
+			WaitReadBuffers(&tail_range->operation);
+			tail_range->need_wait = false;
+
+			/*
+			 * We don't really know if the kernel generated a physical I/O
+			 * when we issued advice, let alone when it finished, but it has
+			 * certainly finished now because we've performed the read.
+			 */
+			if (tail_range->advice_issued)
+			{
+				Assert(pgsr->ios_in_progress > 0);
+				pgsr->ios_in_progress--;
+			}
+		}
+
+		/* Are there more buffers available in this range? */
+		if (pgsr->next_tail_buffer < tail_range->nblocks)
+		{
+			int			buffer_index;
+			Buffer		buffer;
+
+			buffer_index = pgsr->next_tail_buffer++;
+			buffer = tail_range->buffers[buffer_index];
+
+			Assert(BufferIsValid(buffer));
+
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (per_buffer_data)
+				*per_buffer_data = get_per_buffer_data(pgsr, tail_range, buffer_index);
+
+			return buffer;
+		}
+
+		/* Advance tail to next range, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+
+		/*
+		 * If tail crashed into head, and head is not empty, then it is time
+		 * to start that range.
+		 */
+		if (pgsr->tail == pgsr->head &&
+			pgsr->ranges[pgsr->head].nblocks > 0)
+			pg_streaming_read_start_head_range(pgsr);
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	pgsr->finished = true;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(pgsr->pinned_buffers == 0);
+	Assert(pgsr->ios_in_progress == 0);
+
+	/* Release memory. */
+	if (pgsr->per_buffer_data)
+		pfree(pgsr->per_buffer_data);
+
+	pfree(pgsr);
+}
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca2..739d13293f 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 0000000000..c4d3892bb2
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,52 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define PGSR_FLAG_FULL 0x04
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												void *pgsr_private,
+												void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+													   void *pgsr_private,
+													   size_t per_buffer_private_size,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 26129b8e22..299c77ea69 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2097,6 +2097,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
-- 
2.43.2

v6-0003-Use-streaming-reads-in-pg_prewarm.patchtext/x-patch; charset=US-ASCII; name=v6-0003-Use-streaming-reads-in-pg_prewarm.patchDownload

From 95d6ed95bdd4ec8f04723be625e010f040900804 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 09:28:42 +1200
Subject: [PATCH v6 3/7] Use streaming reads in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use streaming reads.  This
provides a simple example of such a transformation, and generates fewer
system calls.

Reviewed-by:
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 contrib/pg_prewarm/pg_prewarm.c          | 40 +++++++++++++++++++++++-
 src/backend/storage/aio/streaming_read.c |  2 +-
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e4..1cc84bcb0c 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+							   void *pgsr_private,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		PgStreamingRead *pgsr;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_FULL,
+											  &p,
+											  0,
+											  NULL,
+											  BMR_REL(rel),
+											  forkNumber,
+											  pg_prewarm_streaming_read_next);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Close relation, release lock. */
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index 404c95b7d3..0529bfc505 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -285,7 +285,7 @@ pg_streaming_read_start_head_range(PgStreamingRead *pgsr)
 
 	if (head_range->need_wait)
 	{
-		int		distance;
+		int			distance;
 
 		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-- 
2.43.2

v6-0004-WIP-Use-streaming-reads-in-heapam-scans.patchtext/x-patch; charset=US-ASCII; name=v6-0004-WIP-Use-streaming-reads-in-heapam-scans.patchDownload

From ab2bbd086680bdcd22e01c5e744d12210bbac1a6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 12 Apr 2023 18:16:06 -0700
Subject: [PATCH v6 4/7] WIP: Use streaming reads in heapam scans.

XXX Cherry-picked from https://github.com/anarazel/postgres/tree/aio and
lightly modified by TM, for demonstration purposes.

Author: Andres Freund <andres@anarazel.de>
---
 src/backend/access/heap/heapam.c         | 205 +++++++++++++++++++++--
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/include/access/heapam.h              |   5 +-
 3 files changed, 192 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 34bc60f625..ab88eff296 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -62,6 +62,7 @@
 #include "storage/predicate.h"
 #include "storage/procarray.h"
 #include "storage/standby.h"
+#include "storage/streaming_read.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/relcache.h"
@@ -221,6 +222,89 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+					  void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) pgsr_private;
+	BlockNumber blockno;
+
+	Assert(!scan->rs_base.rs_parallel);
+	Assert(scan->rs_nblocks > 0);
+
+	if (scan->rs_prefetch_block == InvalidBlockNumber)
+	{
+		scan->rs_prefetch_block = blockno = scan->rs_startblock;
+	}
+	else
+	{
+		blockno = ++scan->rs_prefetch_block;
+
+		/* wrap back to the start of the heap */
+		if (blockno >= scan->rs_nblocks)
+			scan->rs_prefetch_block = blockno = 0;
+
+		/* we're done if we're back at where we started */
+		if (blockno == scan->rs_startblock)
+			return InvalidBlockNumber;
+
+		/* check if the limit imposed by heap_setscanlimits() is met */
+		if (scan->rs_numblocks != InvalidBlockNumber)
+		{
+			if (--scan->rs_numblocks == 0)
+				return InvalidBlockNumber;
+		}
+	}
+
+	return blockno;
+}
+
+static BlockNumber
+heap_pgsr_next_parallel(PgStreamingRead *pgsr, void *pgsr_private,
+						void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) pgsr_private;
+	ParallelBlockTableScanDesc pbscan =
+		(ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+	ParallelBlockTableScanWorker pbscanwork =
+		scan->rs_parallelworkerdata;
+	BlockNumber blockno;
+
+	Assert(scan->rs_base.rs_parallel);
+	Assert(scan->rs_nblocks > 0);
+
+	/* Note that other processes might have already finished the scan */
+	blockno = table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
+												pbscanwork, pbscan);
+
+	return blockno;
+}
+
+static PgStreamingRead *
+heap_pgsr_single_alloc(HeapScanDesc scan)
+{
+	return pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+										  scan,
+										  0,
+										  scan->rs_strategy,
+										  BMR_REL(scan->rs_base.rs_rd),
+										  MAIN_FORKNUM,
+										  heap_pgsr_next_single);
+}
+
+static PgStreamingRead *
+heap_pgsr_parallel_alloc(HeapScanDesc scan)
+{
+	return pg_streaming_read_buffer_alloc(PGSR_FLAG_SEQUENTIAL,
+										  scan,
+										  0,
+										  scan->rs_strategy,
+										  BMR_REL(scan->rs_base.rs_rd),
+										  MAIN_FORKNUM,
+										  heap_pgsr_next_parallel);
+}
+
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -338,6 +422,26 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	 */
 	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN)
 		pgstat_count_heap_scan(scan->rs_base.rs_rd);
+
+	scan->rs_prefetch_block = InvalidBlockNumber;
+	if (scan->pgsr)
+	{
+		pg_streaming_read_free(scan->pgsr);
+		scan->pgsr = NULL;
+	}
+
+	/*
+	 * FIXME: This probably should be done in the !rs_inited blocks instead.
+	 */
+	scan->pgsr = NULL;
+	if (!RelationUsesLocalBuffers(scan->rs_base.rs_rd) &&
+		(scan->rs_base.rs_flags & SO_TYPE_SEQSCAN))
+	{
+		if (scan->rs_base.rs_parallel)
+			scan->pgsr = heap_pgsr_parallel_alloc(scan);
+		else
+			scan->pgsr = heap_pgsr_single_alloc(scan);
+	}
 }
 
 /*
@@ -370,7 +474,7 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
  * which tuples on the page are visible.
  */
 void
-heapgetpage(TableScanDesc sscan, BlockNumber block)
+heapgetpage(TableScanDesc sscan, BlockNumber block, Buffer pgsr_buffer)
 {
 	HeapScanDesc scan = (HeapScanDesc) sscan;
 	Buffer		buffer;
@@ -397,9 +501,20 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
+	if (BufferIsValid(pgsr_buffer))
+	{
+		Assert(scan->pgsr);
+		Assert(BufferGetBlockNumber(pgsr_buffer) == block);
+		scan->rs_cbuf = pgsr_buffer;
+	}
+	else
+	{
+		Assert(!scan->pgsr);
+
+		/* read page using selected strategy */
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
+										   RBM_NORMAL, scan->rs_strategy);
+	}
 	scan->rs_cblock = block;
 
 	if (!(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE))
@@ -486,7 +601,7 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
  * of the pages before we can get a chance to get our first page.
  */
 static BlockNumber
-heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
+heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir, Buffer *pgsr_buf)
 {
 	Assert(!scan->rs_inited);
 
@@ -496,16 +611,25 @@ heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 
 	if (ScanDirectionIsForward(dir))
 	{
+		if (scan->rs_base.rs_parallel != NULL)
+			table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
+													 scan->rs_parallelworkerdata,
+													 (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+
+		/* FIXME: Integrate more neatly */
+		if (scan->pgsr)
+		{
+			*pgsr_buf = pg_streaming_read_buffer_get_next(scan->pgsr, NULL);
+			if (*pgsr_buf == InvalidBuffer)
+				return InvalidBlockNumber;
+			return BufferGetBlockNumber(*pgsr_buf);
+		}
+
 		/* serial scan */
 		if (scan->rs_base.rs_parallel == NULL)
 			return scan->rs_startblock;
 		else
 		{
-			/* parallel scan */
-			table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
-													 scan->rs_parallelworkerdata,
-													 (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
-
 			/* may return InvalidBlockNumber if there are no more blocks */
 			return table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
 													 scan->rs_parallelworkerdata,
@@ -525,6 +649,12 @@ heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 		 */
 		scan->rs_base.rs_flags &= ~SO_ALLOW_SYNC;
 
+		if (scan->pgsr)
+		{
+			pg_streaming_read_free(scan->pgsr);
+			scan->pgsr = NULL;
+		}
+
 		/*
 		 * Start from last page of the scan.  Ensure we take into account
 		 * rs_numblocks if it's been adjusted by heap_setscanlimits().
@@ -626,11 +756,33 @@ heapgettup_continue_page(HeapScanDesc scan, ScanDirection dir, int *linesleft,
  * heap_setscanlimits().
  */
 static inline BlockNumber
-heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir)
+heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir,
+						 Buffer *pgsr_buf)
 {
 	if (ScanDirectionIsForward(dir))
 	{
-		if (scan->rs_base.rs_parallel == NULL)
+		if (scan->pgsr)
+		{
+#ifdef USE_ASSERT_CHECKING
+			block++;
+
+			/* wrap back to the start of the heap */
+			if (block >= scan->rs_nblocks)
+				block = 0;
+
+			/* we're done if we're back at where we started */
+			if (block == scan->rs_startblock)
+				block = InvalidBlockNumber;
+#endif
+			*pgsr_buf = pg_streaming_read_buffer_get_next(scan->pgsr, NULL);
+			if (*pgsr_buf == InvalidBuffer)
+				return InvalidBlockNumber;
+
+			Assert(scan->rs_base.rs_parallel ||
+				   block == BufferGetBlockNumber(*pgsr_buf));
+			return BufferGetBlockNumber(*pgsr_buf);
+		}
+		else if (scan->rs_base.rs_parallel == NULL)
 		{
 			block++;
 
@@ -675,6 +827,12 @@ heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir
 	}
 	else
 	{
+		if (scan->pgsr)
+		{
+			pg_streaming_read_free(scan->pgsr);
+			scan->pgsr = NULL;
+		}
+
 		/* we're done if the last block is the start position */
 		if (block == scan->rs_startblock)
 			return InvalidBlockNumber;
@@ -724,13 +882,14 @@ heapgettup(HeapScanDesc scan,
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
 	BlockNumber block;
+	Buffer		pgsr_buf = InvalidBuffer;
 	Page		page;
 	OffsetNumber lineoff;
 	int			linesleft;
 
 	if (unlikely(!scan->rs_inited))
 	{
-		block = heapgettup_initial_block(scan, dir);
+		block = heapgettup_initial_block(scan, dir, &pgsr_buf);
 		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
 		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
 		scan->rs_inited = true;
@@ -751,7 +910,7 @@ heapgettup(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapgetpage((TableScanDesc) scan, block, pgsr_buf);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -803,9 +962,10 @@ continue_page:
 		 * it's time to move to the next.
 		 */
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
+		pgsr_buf = InvalidBuffer;
 
 		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
+		block = heapgettup_advance_block(scan, block, dir, &pgsr_buf);
 	}
 
 	/* end of scan */
@@ -839,13 +999,14 @@ heapgettup_pagemode(HeapScanDesc scan,
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
 	BlockNumber block;
+	Buffer		pgsr_buf = InvalidBuffer;
 	Page		page;
 	int			lineindex;
 	int			linesleft;
 
 	if (unlikely(!scan->rs_inited))
 	{
-		block = heapgettup_initial_block(scan, dir);
+		block = heapgettup_initial_block(scan, dir, &pgsr_buf);
 		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
 		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
 		scan->rs_inited = true;
@@ -872,7 +1033,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapgetpage((TableScanDesc) scan, block, pgsr_buf);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
@@ -904,7 +1065,7 @@ continue_page:
 		}
 
 		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
+		block = heapgettup_advance_block(scan, block, dir, &pgsr_buf);
 	}
 
 	/* end of scan */
@@ -952,6 +1113,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_base.rs_parallel = parallel_scan;
 	scan->rs_strategy = NULL;	/* set in initscan */
 
+	scan->pgsr = NULL;
+
 	/*
 	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
 	 */
@@ -1058,6 +1221,12 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	if (scan->pgsr)
+	{
+		pg_streaming_read_free(scan->pgsr);
+		scan->pgsr = NULL;
+	}
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 680a50bf8b..20e83f48e3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2333,7 +2333,7 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 		return false;
 	}
 
-	heapgetpage(scan, blockno);
+	heapgetpage(scan, blockno, InvalidBuffer);
 	hscan->rs_inited = true;
 
 	return true;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4b133f6859..21e71d6091 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -59,6 +59,7 @@ typedef struct HeapScanDescData
 	bool		rs_inited;		/* false = scan not init'd yet */
 	OffsetNumber rs_coffset;	/* current offset # in non-page-at-a-time mode */
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
+	BlockNumber rs_prefetch_block;	/* block being prefetched */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
 
@@ -72,6 +73,8 @@ typedef struct HeapScanDescData
 	 */
 	ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
 
+	struct PgStreamingRead *pgsr;
+
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
@@ -246,7 +249,7 @@ extern TableScanDesc heap_beginscan(Relation relation, Snapshot snapshot,
 									uint32 flags);
 extern void heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk,
 							   BlockNumber numBlks);
-extern void heapgetpage(TableScanDesc sscan, BlockNumber block);
+extern void heapgetpage(TableScanDesc sscan, BlockNumber block, Buffer buffer);
 extern void heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 						bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(TableScanDesc sscan);
-- 
2.43.2

v6-0005-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patchtext/x-patch; charset=US-ASCII; name=v6-0005-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patchDownload

From 6102cb82f4d452e2362b85e73b6b053cf0e675a1 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 12 Feb 2021 14:19:10 -0800
Subject: [PATCH v6 5/7] WIP: Use streaming reads in nbtree vacuum scan.

XXX Cherry-picked from https://github.com/anarazel/postgres/tree/aio and
lightly modified by TM, for demonstration purposes.

Author: Andres Freund <andres@anarazel.de>
---
 src/backend/access/nbtree/nbtree.c | 75 ++++++++++++++++++++++++------
 src/include/access/nbtree.h        |  3 ++
 2 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 41df1027d2..13fb6537e4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -32,6 +32,8 @@
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
+#include "utils/builtins.h"
 #include "utils/fmgrprotos.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
@@ -79,7 +81,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
 static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 IndexBulkDeleteCallback callback, void *callback_state,
 						 BTCycleId cycleid);
-static void btvacuumpage(BTVacState *vstate, BlockNumber scanblkno);
+static void btvacuumpage(BTVacState *vstate, Buffer buf, BlockNumber scanblkno);
 static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
 										  IndexTuple posting,
 										  OffsetNumber updatedoffset,
@@ -876,6 +878,24 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+static BlockNumber
+btvacuumscan_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private,
+					   void *per_buffer_data)
+{
+	BTVacState *vstate = (BTVacState *) pgsr_private;
+
+	if (vstate->nextblock == vstate->num_pages)
+		return InvalidBlockNumber;
+
+	/*
+	 * We can't use _bt_getbuf() here because it always applies
+	 * _bt_checkpage(), which will barf on an all-zero page. We want to
+	 * recycle all-zero pages, not fail.  Also, we want to use a nondefault
+	 * buffer access strategy. Also want to use AIO.
+	 */
+	return vstate->nextblock++;
+}
+
 /*
  * btvacuumscan --- scan the index for VACUUMing purposes
  *
@@ -969,6 +989,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	scanblkno = BTREE_METAPAGE + 1;
 	for (;;)
 	{
+		PgStreamingRead *pgsr;
+
 		/* Get the current relation length */
 		if (needLock)
 			LockRelationForExtension(rel, ExclusiveLock);
@@ -983,14 +1005,40 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		/* Quit if we've scanned the whole relation */
 		if (scanblkno >= num_pages)
 			break;
+
+		vstate.nextblock = scanblkno;
+		vstate.num_pages = num_pages;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_MAINTENANCE,
+											  &vstate,
+											  0,
+											  vstate.info->strategy,
+											  BMR_REL(vstate.info->index),
+											  MAIN_FORKNUM,
+											  btvacuumscan_pgsr_next);
+
 		/* Iterate over pages, then loop back to recheck length */
 		for (; scanblkno < num_pages; scanblkno++)
 		{
-			btvacuumpage(&vstate, scanblkno);
+			Buffer		buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
+
+			Assert(BufferIsValid(buf));
+			Assert(BufferGetBlockNumber(buf) == scanblkno);
+
+			btvacuumpage(&vstate, buf, scanblkno);
+
 			if (info->report_progress)
 				pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
 											 scanblkno);
+
+			/* call vacuum_delay_point while not holding any buffer lock */
+			vacuum_delay_point();
 		}
+
+		if (pg_streaming_read_buffer_get_next(pgsr, NULL) != 0)
+			elog(ERROR, "aio and plain pos out of sync");
+
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Set statistics num_pages field to final size of index */
@@ -1023,7 +1071,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
  * recycled (i.e. before the page split).
  */
 static void
-btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
+btvacuumpage(BTVacState *vstate, Buffer buf, BlockNumber scanblkno)
 {
 	IndexVacuumInfo *info = vstate->info;
 	IndexBulkDeleteResult *stats = vstate->stats;
@@ -1034,28 +1082,28 @@ btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
 	bool		attempt_pagedel;
 	BlockNumber blkno,
 				backtrack_to;
-	Buffer		buf;
 	Page		page;
 	BTPageOpaque opaque;
 
 	blkno = scanblkno;
 
+	Assert(BufferIsValid(buf));
 backtrack:
 
 	attempt_pagedel = false;
 	backtrack_to = P_NONE;
 
-	/* call vacuum_delay_point while not holding any buffer lock */
-	vacuum_delay_point();
-
 	/*
-	 * We can't use _bt_getbuf() here because it always applies
-	 * _bt_checkpage(), which will barf on an all-zero page. We want to
-	 * recycle all-zero pages, not fail.  Also, we want to use a nondefault
-	 * buffer access strategy.
+	 * If we're not backtracking we already read the page asynchronously. Not
+	 * using AIO when backtracking isn't too tragic - it's relatively rare,
+	 * and the buffer quite possibly is still in shared buffers.
 	 */
-	buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-							 info->strategy);
+	if (!BufferIsValid(buf))
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								 info->strategy);
+
+	Assert(BufferGetBlockNumber(buf) == blkno);
+
 	_bt_lockbuf(rel, buf, BT_READ);
 	page = BufferGetPage(buf);
 	opaque = NULL;
@@ -1342,6 +1390,7 @@ backtrack:
 	if (backtrack_to != P_NONE)
 	{
 		blkno = backtrack_to;
+		buf = InvalidBuffer;
 		goto backtrack;
 	}
 }
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 6eb162052e..2dbbbd36e7 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -343,6 +343,9 @@ typedef struct BTVacState
 	int			maxbufsize;		/* max bufsize that respects work_mem */
 	BTPendingFSM *pendingpages; /* One entry per newly deleted page */
 	int			npendingpages;	/* current # valid pendingpages */
+
+	BlockNumber nextblock;
+	BlockNumber num_pages;
 } BTVacState;
 
 /*
-- 
2.43.2

v6-0006-WIP-Use-streaming-reads-in-bitmap-heapscan.patchtext/x-patch; charset=US-ASCII; name=v6-0006-WIP-Use-streaming-reads-in-bitmap-heapscan.patchDownload

From 6ae8f5644b74ff552c8987c7f13daf28e6392729 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 22 Aug 2023 16:48:30 +1200
Subject: [PATCH v6 6/7] WIP: Use streaming reads in bitmap heapscan.

XXX Cherry-picked from
https://github.com/melanieplageman/postgres/tree/bhs_aio and lightly
modified by TM, for demonstration purposes.

Author: Melanie Plageman <melanieplageman@gmail.com>
---
 src/backend/access/gin/ginget.c           |  17 +-
 src/backend/access/gin/ginscan.c          |   7 +
 src/backend/access/heap/heapam.c          |  71 +++
 src/backend/access/heap/heapam_handler.c  |  78 ++--
 src/backend/commands/explain.c            |  21 +-
 src/backend/executor/nodeBitmapHeapscan.c | 521 ++--------------------
 src/backend/nodes/tidbitmap.c             |  76 ++--
 src/include/access/heapam.h               |   7 +
 src/include/access/relscan.h              |   8 +
 src/include/access/tableam.h              |  71 ++-
 src/include/executor/nodeBitmapHeapscan.h |   1 +
 src/include/nodes/execnodes.h             |  39 +-
 src/include/nodes/tidbitmap.h             |   4 +-
 13 files changed, 284 insertions(+), 637 deletions(-)

diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 0b4f2ebadb..f238a3b071 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -373,7 +373,10 @@ restartScanEntry:
 			if (entry->matchBitmap)
 			{
 				if (entry->matchIterator)
+				{
 					tbm_end_iterate(entry->matchIterator);
+					pfree(entry->matchResult);
+				}
 				entry->matchIterator = NULL;
 				tbm_free(entry->matchBitmap);
 				entry->matchBitmap = NULL;
@@ -386,6 +389,9 @@ restartScanEntry:
 		if (entry->matchBitmap && !tbm_is_empty(entry->matchBitmap))
 		{
 			entry->matchIterator = tbm_begin_iterate(entry->matchBitmap);
+			entry->matchResult = (TBMIterateResult *)
+				palloc0(sizeof(TBMIterateResult) + MaxHeapTuplesPerPage *
+						sizeof(OffsetNumber));
 			entry->isFinished = false;
 		}
 	}
@@ -823,21 +829,24 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
 		{
 			/*
 			 * If we've exhausted all items on this block, move to next block
-			 * in the bitmap.
+			 * in the bitmap. tbm_iterate() sets matchResult->blockno to
+			 * InvalidBlockNumber when the bitmap is exhausted.
 			 */
-			while (entry->matchResult == NULL ||
+			while ((!BlockNumberIsValid(entry->matchResult->blockno)) ||
 				   (entry->matchResult->ntuples >= 0 &&
 					entry->offset >= entry->matchResult->ntuples) ||
 				   entry->matchResult->blockno < advancePastBlk ||
 				   (ItemPointerIsLossyPage(&advancePast) &&
 					entry->matchResult->blockno == advancePastBlk))
 			{
-				entry->matchResult = tbm_iterate(entry->matchIterator);
 
-				if (entry->matchResult == NULL)
+				tbm_iterate(entry->matchIterator, entry->matchResult);
+				if (!BlockNumberIsValid(entry->matchResult->blockno))
 				{
 					ItemPointerSetInvalid(&entry->curItem);
 					tbm_end_iterate(entry->matchIterator);
+					pfree(entry->matchResult);
+					entry->matchResult = NULL;
 					entry->matchIterator = NULL;
 					entry->isFinished = true;
 					break;
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index af24d38544..be27f9fe07 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -246,7 +246,14 @@ ginFreeScanKeys(GinScanOpaque so)
 		if (entry->list)
 			pfree(entry->list);
 		if (entry->matchIterator)
+		{
 			tbm_end_iterate(entry->matchIterator);
+			if (entry->matchResult)
+			{
+				pfree(entry->matchResult);
+				entry->matchResult = NULL;
+			}
+		}
 		if (entry->matchBitmap)
 			tbm_free(entry->matchBitmap);
 	}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ab88eff296..e02716328f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -222,6 +222,53 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+bitmapheap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+							void *per_buffer_data)
+{
+	TBMIterateResult *tbmres;
+	HeapScanDesc hdesc = (HeapScanDesc) pgsr_private;
+
+	tbmres = (TBMIterateResult *) per_buffer_data;
+
+	for (;;)
+	{
+		if (hdesc->rs_base.shared_tbmiterator)
+			tbm_shared_iterate(hdesc->rs_base.shared_tbmiterator, tbmres);
+		else
+			tbm_iterate(hdesc->rs_base.tbmiterator, tbmres);
+
+		if (!BlockNumberIsValid(tbmres->blockno))
+			return InvalidBlockNumber;
+
+		/*
+		 * Ignore any claimed entries past what we think is the end of the
+		 * relation. It may have been extended after the start of our scan (we
+		 * only hold an AccessShareLock, and it could be inserts from this
+		 * backend).  We don't take this optimization in SERIALIZABLE
+		 * isolation though, as we need to examine all invisible tuples
+		 * reachable by the index.
+		 */
+		if (!IsolationIsSerializable() && tbmres->blockno >= hdesc->rs_nblocks)
+			continue;
+
+		if (tbmres->ntuples >= 0)
+			hdesc->rs_base.exact_pages++;
+		else
+			hdesc->rs_base.lossy_pages++;
+
+		if (hdesc->rs_base.rs_flags & SO_CAN_SKIP_FETCH &&
+			!tbmres->recheck &&
+			VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno, &hdesc->vmbuffer))
+		{
+			hdesc->empty_tuples += tbmres->ntuples;
+			continue;
+		}
+
+		return tbmres->blockno;
+	}
+}
+
 static BlockNumber
 heap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
 					  void *per_buffer_data)
@@ -429,6 +476,11 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 		pg_streaming_read_free(scan->pgsr);
 		scan->pgsr = NULL;
 	}
+	if (BufferIsValid(scan->vmbuffer))
+	{
+		ReleaseBuffer(scan->vmbuffer);
+		scan->vmbuffer = InvalidBuffer;
+	}
 
 	/*
 	 * FIXME: This probably should be done in the !rs_inited blocks instead.
@@ -442,6 +494,19 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 		else
 			scan->pgsr = heap_pgsr_single_alloc(scan);
 	}
+	else if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
+	{
+		scan->pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+													scan,
+													offsetof(TBMIterateResult, offsets) + MaxHeapTuplesPerPage * sizeof(OffsetNumber),
+													scan->rs_strategy,
+													BMR_REL(scan->rs_base.rs_rd),
+													MAIN_FORKNUM,
+													bitmapheap_pgsr_next_single);
+
+
+		scan->rs_inited = true;
+	}
 }
 
 /*
@@ -1114,6 +1179,12 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_strategy = NULL;	/* set in initscan */
 
 	scan->pgsr = NULL;
+	scan->vmbuffer = InvalidBuffer;
+	scan->empty_tuples = 0;
+	scan->rs_base.exact_pages = 0;
+	scan->rs_base.lossy_pages = 0;
+	scan->rs_base.tbmiterator = NULL;
+	scan->rs_base.shared_tbmiterator = NULL;
 
 	/*
 	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 20e83f48e3..946ecf92bb 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "storage/streaming_read.h"
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
@@ -2108,38 +2109,55 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
  * Executor related callbacks for the heap AM
  * ------------------------------------------------------------------------
  */
-
 static bool
-heapam_scan_bitmap_next_block(TableScanDesc scan,
-							  TBMIterateResult *tbmres)
+heapam_scan_bitmap_next_block(TableScanDesc scan, bool *recheck)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
-	BlockNumber block = tbmres->blockno;
+	TBMIterateResult *tbmres;
+	void	   *io_private;
 	Buffer		buffer;
 	Snapshot	snapshot;
 	int			ntup;
 
-	hscan->rs_cindex = 0;
-	hscan->rs_ntuples = 0;
+	Assert(hscan->pgsr);
 
 	/*
-	 * Ignore any claimed entries past what we think is the end of the
-	 * relation. It may have been extended after the start of our scan (we
-	 * only hold an AccessShareLock, and it could be inserts from this
-	 * backend).  We don't take this optimization in SERIALIZABLE isolation
-	 * though, as we need to examine all invisible tuples reachable by the
-	 * index.
+	 * Release the buffer containing the previous block and reset the per-page
+	 * counters. Reset hscan->rs_ntuples here just to be safe.
 	 */
-	if (!IsolationIsSerializable() && block >= hscan->rs_nblocks)
-		return false;
+	if (BufferIsValid(hscan->rs_cbuf))
+	{
+		ReleaseBuffer(hscan->rs_cbuf);
+		hscan->rs_cbuf = InvalidBuffer;
+	}
+
+	hscan->rs_cindex = 0;
+	hscan->rs_ntuples = 0;
+
+	hscan->rs_cbuf = pg_streaming_read_buffer_get_next(hscan->pgsr, &io_private);
+
+	if (BufferIsInvalid(hscan->rs_cbuf))
+	{
+		if (BufferIsValid(hscan->vmbuffer))
+		{
+			ReleaseBuffer(hscan->vmbuffer);
+			hscan->vmbuffer = InvalidBuffer;
+		}
+
+		return hscan->empty_tuples > 0;
+	}
+
+	Assert(io_private);
+
+	tbmres = (TBMIterateResult *) io_private;
+
+	Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);
+
+	*recheck = tbmres->recheck;
+
+	hscan->rs_cblock = tbmres->blockno;
+	hscan->rs_ntuples = tbmres->ntuples;
 
-	/*
-	 * Acquire pin on the target heap page, trading in any pin we held before.
-	 */
-	hscan->rs_cbuf = ReleaseAndReadBuffer(hscan->rs_cbuf,
-										  scan->rs_rd,
-										  block);
-	hscan->rs_cblock = block;
 	buffer = hscan->rs_cbuf;
 	snapshot = scan->rs_snapshot;
 
@@ -2175,7 +2193,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 			ItemPointerData tid;
 			HeapTupleData heapTuple;
 
-			ItemPointerSet(&tid, block, offnum);
+			ItemPointerSet(&tid, tbmres->blockno, offnum);
 			if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
 									   &heapTuple, NULL, true))
 				hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
@@ -2203,7 +2221,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 			loctup.t_data = (HeapTupleHeader) PageGetItem(page, lp);
 			loctup.t_len = ItemIdGetLength(lp);
 			loctup.t_tableOid = scan->rs_rd->rd_id;
-			ItemPointerSet(&loctup.t_self, block, offnum);
+			ItemPointerSet(&loctup.t_self, tbmres->blockno, offnum);
 			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
 			if (valid)
 			{
@@ -2221,25 +2239,31 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 	Assert(ntup <= MaxHeapTuplesPerPage);
 	hscan->rs_ntuples = ntup;
 
-	return ntup > 0;
+	return true;
 }
 
 static bool
-heapam_scan_bitmap_next_tuple(TableScanDesc scan,
-							  TBMIterateResult *tbmres,
-							  TupleTableSlot *slot)
+heapam_scan_bitmap_next_tuple(TableScanDesc scan, TupleTableSlot *slot)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 	OffsetNumber targoffset;
 	Page		page;
 	ItemId		lp;
 
+	if (hscan->empty_tuples)
+	{
+		ExecStoreAllNullTuple(slot);
+		hscan->empty_tuples--;
+		return true;
+	}
+
 	/*
 	 * Out of range?  If so, nothing more to look at on this page
 	 */
 	if (hscan->rs_cindex < 0 || hscan->rs_cindex >= hscan->rs_ntuples)
 		return false;
 
+	Assert(BufferIsValid(hscan->rs_cbuf));
 	targoffset = hscan->rs_vistuples[hscan->rs_cindex];
 	page = BufferGetPage(hscan->rs_cbuf);
 	lp = PageGetItemId(page, targoffset);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 78754bc6ba..d5688de3a3 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/relscan.h"
 #include "access/xact.h"
 #include "catalog/pg_type.h"
 #include "commands/createas.h"
@@ -111,7 +112,7 @@ static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_memoize_info(MemoizeState *mstate, List *ancestors,
 							  ExplainState *es);
 static void show_hashagg_info(AggState *aggstate, ExplainState *es);
-static void show_tidbitmap_info(BitmapHeapScanState *planstate,
+static void show_tidbitmap_info(TableScanDesc tdesc,
 								ExplainState *es);
 static void show_instrumentation_count(const char *qlabel, int which,
 									   PlanState *planstate, ExplainState *es);
@@ -1873,7 +1874,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			if (es->analyze)
-				show_tidbitmap_info((BitmapHeapScanState *) planstate, es);
+				show_tidbitmap_info(((BitmapHeapScanState *) planstate)->ss.ss_currentScanDesc, es);
 			break;
 		case T_SampleScan:
 			show_tablesample(((SampleScan *) plan)->tablesample,
@@ -3462,25 +3463,25 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
  * If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
  */
 static void
-show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
+show_tidbitmap_info(TableScanDesc tdesc, ExplainState *es)
 {
 	if (es->format != EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainPropertyInteger("Exact Heap Blocks", NULL,
-							   planstate->exact_pages, es);
+							   tdesc->exact_pages, es);
 		ExplainPropertyInteger("Lossy Heap Blocks", NULL,
-							   planstate->lossy_pages, es);
+							   tdesc->lossy_pages, es);
 	}
 	else
 	{
-		if (planstate->exact_pages > 0 || planstate->lossy_pages > 0)
+		if (tdesc->exact_pages > 0 || tdesc->lossy_pages > 0)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "Heap Blocks:");
-			if (planstate->exact_pages > 0)
-				appendStringInfo(es->str, " exact=%ld", planstate->exact_pages);
-			if (planstate->lossy_pages > 0)
-				appendStringInfo(es->str, " lossy=%ld", planstate->lossy_pages);
+			if (tdesc->exact_pages > 0)
+				appendStringInfo(es->str, " exact=%ld", tdesc->exact_pages);
+			if (tdesc->lossy_pages > 0)
+				appendStringInfo(es->str, " lossy=%ld", tdesc->lossy_pages);
 			appendStringInfoChar(es->str, '\n');
 		}
 	}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 345b67649e..9482f4ffbd 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -51,11 +51,6 @@
 
 static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
 static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
-static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
-												TBMIterateResult *tbmres);
-static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
-static inline void BitmapPrefetch(BitmapHeapScanState *node,
-								  TableScanDesc scan);
 static bool BitmapShouldInitializeSharedState(ParallelBitmapHeapState *pstate);
 
 
@@ -71,10 +66,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	ExprContext *econtext;
 	TableScanDesc scan;
 	TIDBitmap  *tbm;
-	TBMIterator *tbmiterator = NULL;
-	TBMSharedIterator *shared_tbmiterator = NULL;
-	TBMIterateResult *tbmres;
-	TupleTableSlot *slot;
+	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 	ParallelBitmapHeapState *pstate = node->pstate;
 	dsa_area   *dsa = node->ss.ps.state->es_query_dsa;
 
@@ -85,23 +77,10 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	slot = node->ss.ss_ScanTupleSlot;
 	scan = node->ss.ss_currentScanDesc;
 	tbm = node->tbm;
-	if (pstate == NULL)
-		tbmiterator = node->tbmiterator;
-	else
-		shared_tbmiterator = node->shared_tbmiterator;
-	tbmres = node->tbmres;
 
 	/*
 	 * If we haven't yet performed the underlying index scan, do it, and begin
 	 * the iteration over the bitmap.
-	 *
-	 * For prefetching, we use *two* iterators, one for the pages we are
-	 * actually scanning and another that runs ahead of the first for
-	 * prefetching.  node->prefetch_pages tracks exactly how many pages ahead
-	 * the prefetch iterator is.  Also, node->prefetch_target tracks the
-	 * desired prefetch distance, which starts small and increases up to the
-	 * node->prefetch_maximum.  This is to avoid doing a lot of prefetching in
-	 * a scan that stops after a few tuples because of a LIMIT.
 	 */
 	if (!node->initialized)
 	{
@@ -113,17 +92,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 				elog(ERROR, "unrecognized result from subplan");
 
 			node->tbm = tbm;
-			node->tbmiterator = tbmiterator = tbm_begin_iterate(tbm);
-			node->tbmres = tbmres = NULL;
-
-#ifdef USE_PREFETCH
-			if (node->prefetch_maximum > 0)
-			{
-				node->prefetch_iterator = tbm_begin_iterate(tbm);
-				node->prefetch_pages = 0;
-				node->prefetch_target = -1;
-			}
-#endif							/* USE_PREFETCH */
+			scan->tbmiterator = tbm_begin_iterate(tbm);
 		}
 		else
 		{
@@ -145,194 +114,59 @@ BitmapHeapNext(BitmapHeapScanState *node)
 				 * dsa_pointer of the iterator state which will be used by
 				 * multiple processes to iterate jointly.
 				 */
-				pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
-#ifdef USE_PREFETCH
-				if (node->prefetch_maximum > 0)
-				{
-					pstate->prefetch_iterator =
-						tbm_prepare_shared_iterate(tbm);
-
-					/*
-					 * We don't need the mutex here as we haven't yet woke up
-					 * others.
-					 */
-					pstate->prefetch_pages = 0;
-					pstate->prefetch_target = -1;
-				}
-#endif
+				pstate->tbmiterator =
+					tbm_prepare_shared_iterate(tbm);
 
 				/* We have initialized the shared state so wake up others. */
 				BitmapDoneInitializingSharedState(pstate);
 			}
 
 			/* Allocate a private iterator and attach the shared state to it */
-			node->shared_tbmiterator = shared_tbmiterator =
+			scan->shared_tbmiterator =
 				tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
-			node->tbmres = tbmres = NULL;
-
-#ifdef USE_PREFETCH
-			if (node->prefetch_maximum > 0)
-			{
-				node->shared_prefetch_iterator =
-					tbm_attach_shared_iterate(dsa, pstate->prefetch_iterator);
-			}
-#endif							/* USE_PREFETCH */
 		}
 		node->initialized = true;
+
+		/* Get the first block. if none, end of scan */
+		if (!table_scan_bitmap_next_block(scan, &node->recheck))
+			goto exit;
 	}
 
-	for (;;)
+	do
 	{
-		bool		skip_fetch;
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * Get next page of results if needed
-		 */
-		if (tbmres == NULL)
-		{
-			if (!pstate)
-				node->tbmres = tbmres = tbm_iterate(tbmiterator);
-			else
-				node->tbmres = tbmres = tbm_shared_iterate(shared_tbmiterator);
-			if (tbmres == NULL)
-			{
-				/* no more entries in the bitmap */
-				break;
-			}
-
-			BitmapAdjustPrefetchIterator(node, tbmres);
-
-			/*
-			 * We can skip fetching the heap page if we don't need any fields
-			 * from the heap, and the bitmap entries don't need rechecking,
-			 * and all tuples on the page are visible to our transaction.
-			 *
-			 * XXX: It's a layering violation that we do these checks above
-			 * tableam, they should probably moved below it at some point.
-			 */
-			skip_fetch = (node->can_skip_fetch &&
-						  !tbmres->recheck &&
-						  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-										 tbmres->blockno,
-										 &node->vmbuffer));
-
-			if (skip_fetch)
-			{
-				/* can't be lossy in the skip_fetch case */
-				Assert(tbmres->ntuples >= 0);
-
-				/*
-				 * The number of tuples on this page is put into
-				 * node->return_empty_tuples.
-				 */
-				node->return_empty_tuples = tbmres->ntuples;
-			}
-			else if (!table_scan_bitmap_next_block(scan, tbmres))
-			{
-				/* AM doesn't think this block is valid, skip */
-				continue;
-			}
-
-			if (tbmres->ntuples >= 0)
-				node->exact_pages++;
-			else
-				node->lossy_pages++;
-
-			/* Adjust the prefetch target */
-			BitmapAdjustPrefetchTarget(node);
-		}
-		else
+		while (table_scan_bitmap_next_tuple(scan, slot))
 		{
-			/*
-			 * Continuing in previously obtained page.
-			 */
-
-#ifdef USE_PREFETCH
-
-			/*
-			 * Try to prefetch at least a few pages even before we get to the
-			 * second page if we don't stop reading after the first tuple.
-			 */
-			if (!pstate)
-			{
-				if (node->prefetch_target < node->prefetch_maximum)
-					node->prefetch_target++;
-			}
-			else if (pstate->prefetch_target < node->prefetch_maximum)
-			{
-				/* take spinlock while updating shared state */
-				SpinLockAcquire(&pstate->mutex);
-				if (pstate->prefetch_target < node->prefetch_maximum)
-					pstate->prefetch_target++;
-				SpinLockRelease(&pstate->mutex);
-			}
-#endif							/* USE_PREFETCH */
-		}
+			CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * We issue prefetch requests *after* fetching the current page to try
-		 * to avoid having prefetching interfere with the main I/O. Also, this
-		 * should happen only when we have determined there is still something
-		 * to do on the current page, else we may uselessly prefetch the same
-		 * page we are just about to request for real.
-		 *
-		 * XXX: It's a layering violation that we do these checks above
-		 * tableam, they should probably moved below it at some point.
-		 */
-		BitmapPrefetch(node, scan);
-
-		if (node->return_empty_tuples > 0)
-		{
-			/*
-			 * If we don't have to fetch the tuple, just return nulls.
-			 */
-			ExecStoreAllNullTuple(slot);
+			if (!node->recheck)
+				return slot;
 
-			if (--node->return_empty_tuples == 0)
-			{
-				/* no more tuples to return in the next round */
-				node->tbmres = tbmres = NULL;
-			}
-		}
-		else
-		{
-			/*
-			 * Attempt to fetch tuple from AM.
-			 */
-			if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
-			{
-				/* nothing more to look at on this page */
-				node->tbmres = tbmres = NULL;
-				continue;
-			}
+			econtext->ecxt_scantuple = slot;
+			if (ExecQualAndReset(node->bitmapqualorig, econtext))
+				return slot;
 
-			/*
-			 * If we are using lossy info, we have to recheck the qual
-			 * conditions at every tuple.
-			 */
-			if (tbmres->recheck)
-			{
-				econtext->ecxt_scantuple = slot;
-				if (!ExecQualAndReset(node->bitmapqualorig, econtext))
-				{
-					/* Fails recheck, so drop it and loop back for another */
-					InstrCountFiltered2(node, 1);
-					ExecClearTuple(slot);
-					continue;
-				}
-			}
+			/* Fails recheck, so drop it and loop back for another */
+			InstrCountFiltered2(node, 1);
+			ExecClearTuple(slot);
 		}
+	} while (table_scan_bitmap_next_block(scan, &node->recheck));
 
-		/* OK to return this tuple */
-		return slot;
-	}
+exit:
 
 	/*
-	 * if we get here it means we are at the end of the scan..
+	 * Release iterator
 	 */
-	return ExecClearTuple(slot);
+	if (scan->tbmiterator)
+	{
+		tbm_end_iterate(scan->tbmiterator);
+		scan->tbmiterator = NULL;
+	}
+	else if (scan->shared_tbmiterator)
+	{
+		tbm_end_shared_iterate(scan->shared_tbmiterator);
+		scan->shared_tbmiterator = NULL;
+	}
+	return NULL;
 }
 
 /*
@@ -350,215 +184,6 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
 	ConditionVariableBroadcast(&pstate->cv);
 }
 
-/*
- *	BitmapAdjustPrefetchIterator - Adjust the prefetch iterator
- */
-static inline void
-BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
-							 TBMIterateResult *tbmres)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
-		if (node->prefetch_pages > 0)
-		{
-			/* The main iterator has closed the distance by one page */
-			node->prefetch_pages--;
-		}
-		else if (prefetch_iterator)
-		{
-			/* Do not let the prefetch iterator get behind the main one */
-			TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
-
-			if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
-				elog(ERROR, "prefetch and main iterators are out of sync");
-		}
-		return;
-	}
-
-	if (node->prefetch_maximum > 0)
-	{
-		TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
-		SpinLockAcquire(&pstate->mutex);
-		if (pstate->prefetch_pages > 0)
-		{
-			pstate->prefetch_pages--;
-			SpinLockRelease(&pstate->mutex);
-		}
-		else
-		{
-			/* Release the mutex before iterating */
-			SpinLockRelease(&pstate->mutex);
-
-			/*
-			 * In case of shared mode, we can not ensure that the current
-			 * blockno of the main iterator and that of the prefetch iterator
-			 * are same.  It's possible that whatever blockno we are
-			 * prefetching will be processed by another process.  Therefore,
-			 * we don't validate the blockno here as we do in non-parallel
-			 * case.
-			 */
-			if (prefetch_iterator)
-				tbm_shared_iterate(prefetch_iterator);
-		}
-	}
-#endif							/* USE_PREFETCH */
-}
-
-/*
- * BitmapAdjustPrefetchTarget - Adjust the prefetch target
- *
- * Increase prefetch target if it's not yet at the max.  Note that
- * we will increase it to zero after fetching the very first
- * page/tuple, then to one after the second tuple is fetched, then
- * it doubles as later pages are fetched.
- */
-static inline void
-BitmapAdjustPrefetchTarget(BitmapHeapScanState *node)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		if (node->prefetch_target >= node->prefetch_maximum)
-			 /* don't increase any further */ ;
-		else if (node->prefetch_target >= node->prefetch_maximum / 2)
-			node->prefetch_target = node->prefetch_maximum;
-		else if (node->prefetch_target > 0)
-			node->prefetch_target *= 2;
-		else
-			node->prefetch_target++;
-		return;
-	}
-
-	/* Do an unlocked check first to save spinlock acquisitions. */
-	if (pstate->prefetch_target < node->prefetch_maximum)
-	{
-		SpinLockAcquire(&pstate->mutex);
-		if (pstate->prefetch_target >= node->prefetch_maximum)
-			 /* don't increase any further */ ;
-		else if (pstate->prefetch_target >= node->prefetch_maximum / 2)
-			pstate->prefetch_target = node->prefetch_maximum;
-		else if (pstate->prefetch_target > 0)
-			pstate->prefetch_target *= 2;
-		else
-			pstate->prefetch_target++;
-		SpinLockRelease(&pstate->mutex);
-	}
-#endif							/* USE_PREFETCH */
-}
-
-/*
- * BitmapPrefetch - Prefetch, if prefetch_pages are behind prefetch_target
- */
-static inline void
-BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
-		if (prefetch_iterator)
-		{
-			while (node->prefetch_pages < node->prefetch_target)
-			{
-				TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
-				bool		skip_fetch;
-
-				if (tbmpre == NULL)
-				{
-					/* No more pages to prefetch */
-					tbm_end_iterate(prefetch_iterator);
-					node->prefetch_iterator = NULL;
-					break;
-				}
-				node->prefetch_pages++;
-
-				/*
-				 * If we expect not to have to actually read this heap page,
-				 * skip this prefetch call, but continue to run the prefetch
-				 * logic normally.  (Would it be better not to increment
-				 * prefetch_pages?)
-				 *
-				 * This depends on the assumption that the index AM will
-				 * report the same recheck flag for this future heap page as
-				 * it did for the current heap page; which is not a certainty
-				 * but is true in many cases.
-				 */
-				skip_fetch = (node->can_skip_fetch &&
-							  (node->tbmres ? !node->tbmres->recheck : false) &&
-							  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-											 tbmpre->blockno,
-											 &node->pvmbuffer));
-
-				if (!skip_fetch)
-					PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
-			}
-		}
-
-		return;
-	}
-
-	if (pstate->prefetch_pages < pstate->prefetch_target)
-	{
-		TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
-		if (prefetch_iterator)
-		{
-			while (1)
-			{
-				TBMIterateResult *tbmpre;
-				bool		do_prefetch = false;
-				bool		skip_fetch;
-
-				/*
-				 * Recheck under the mutex. If some other process has already
-				 * done enough prefetching then we need not to do anything.
-				 */
-				SpinLockAcquire(&pstate->mutex);
-				if (pstate->prefetch_pages < pstate->prefetch_target)
-				{
-					pstate->prefetch_pages++;
-					do_prefetch = true;
-				}
-				SpinLockRelease(&pstate->mutex);
-
-				if (!do_prefetch)
-					return;
-
-				tbmpre = tbm_shared_iterate(prefetch_iterator);
-				if (tbmpre == NULL)
-				{
-					/* No more pages to prefetch */
-					tbm_end_shared_iterate(prefetch_iterator);
-					node->shared_prefetch_iterator = NULL;
-					break;
-				}
-
-				/* As above, skip prefetch if we expect not to need page */
-				skip_fetch = (node->can_skip_fetch &&
-							  (node->tbmres ? !node->tbmres->recheck : false) &&
-							  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-											 tbmpre->blockno,
-											 &node->pvmbuffer));
-
-				if (!skip_fetch)
-					PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
-			}
-		}
-	}
-#endif							/* USE_PREFETCH */
-}
-
 /*
  * BitmapHeapRecheck -- access method routine to recheck a tuple in EvalPlanQual
  */
@@ -604,29 +229,10 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
 	table_rescan(node->ss.ss_currentScanDesc, NULL);
 
 	/* release bitmaps and buffers if any */
-	if (node->tbmiterator)
-		tbm_end_iterate(node->tbmiterator);
-	if (node->prefetch_iterator)
-		tbm_end_iterate(node->prefetch_iterator);
-	if (node->shared_tbmiterator)
-		tbm_end_shared_iterate(node->shared_tbmiterator);
-	if (node->shared_prefetch_iterator)
-		tbm_end_shared_iterate(node->shared_prefetch_iterator);
 	if (node->tbm)
 		tbm_free(node->tbm);
-	if (node->vmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->vmbuffer);
-	if (node->pvmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->pvmbuffer);
 	node->tbm = NULL;
-	node->tbmiterator = NULL;
-	node->tbmres = NULL;
-	node->prefetch_iterator = NULL;
 	node->initialized = false;
-	node->shared_tbmiterator = NULL;
-	node->shared_prefetch_iterator = NULL;
-	node->vmbuffer = InvalidBuffer;
-	node->pvmbuffer = InvalidBuffer;
 
 	ExecScanReScan(&node->ss);
 
@@ -660,20 +266,8 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
 	/*
 	 * release bitmaps and buffers if any
 	 */
-	if (node->tbmiterator)
-		tbm_end_iterate(node->tbmiterator);
-	if (node->prefetch_iterator)
-		tbm_end_iterate(node->prefetch_iterator);
 	if (node->tbm)
 		tbm_free(node->tbm);
-	if (node->shared_tbmiterator)
-		tbm_end_shared_iterate(node->shared_tbmiterator);
-	if (node->shared_prefetch_iterator)
-		tbm_end_shared_iterate(node->shared_prefetch_iterator);
-	if (node->vmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->vmbuffer);
-	if (node->pvmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->pvmbuffer);
 
 	/*
 	 * close heap scan
@@ -692,6 +286,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 {
 	BitmapHeapScanState *scanstate;
 	Relation	currentRelation;
+	bool		can_skip_fetch;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -711,32 +306,10 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->ss.ps.ExecProcNode = ExecBitmapHeapScan;
 
 	scanstate->tbm = NULL;
-	scanstate->tbmiterator = NULL;
-	scanstate->tbmres = NULL;
-	scanstate->return_empty_tuples = 0;
-	scanstate->vmbuffer = InvalidBuffer;
-	scanstate->pvmbuffer = InvalidBuffer;
-	scanstate->exact_pages = 0;
-	scanstate->lossy_pages = 0;
-	scanstate->prefetch_iterator = NULL;
-	scanstate->prefetch_pages = 0;
-	scanstate->prefetch_target = 0;
 	scanstate->pscan_len = 0;
 	scanstate->initialized = false;
-	scanstate->shared_tbmiterator = NULL;
-	scanstate->shared_prefetch_iterator = NULL;
 	scanstate->pstate = NULL;
 
-	/*
-	 * We can potentially skip fetching heap pages if we do not need any
-	 * columns of the table, either for checking non-indexable quals or for
-	 * returning data.  This test is a bit simplistic, as it checks the
-	 * stronger condition that there's no qual or return tlist at all.  But in
-	 * most cases it's probably not worth working harder than that.
-	 */
-	scanstate->can_skip_fetch = (node->scan.plan.qual == NIL &&
-								 node->scan.plan.targetlist == NIL);
-
 	/*
 	 * Miscellaneous initialization
 	 *
@@ -775,19 +348,22 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->bitmapqualorig =
 		ExecInitQual(node->bitmapqualorig, (PlanState *) scanstate);
 
+	scanstate->ss.ss_currentRelation = currentRelation;
+
 	/*
-	 * Maximum number of prefetches for the tablespace if configured,
-	 * otherwise the current value of the effective_io_concurrency GUC.
+	 * We can potentially skip fetching heap pages if we do not need any
+	 * columns of the table, either for checking non-indexable quals or for
+	 * returning data.  This test is a bit simplistic, as it checks the
+	 * stronger condition that there's no qual or return tlist at all.  But in
+	 * most cases it's probably not worth working harder than that.
 	 */
-	scanstate->prefetch_maximum =
-		get_tablespace_io_concurrency(currentRelation->rd_rel->reltablespace);
-
-	scanstate->ss.ss_currentRelation = currentRelation;
+	can_skip_fetch = (node->scan.plan.qual == NIL &&
+					  node->scan.plan.targetlist == NIL);
 
 	scanstate->ss.ss_currentScanDesc = table_beginscan_bm(currentRelation,
 														  estate->es_snapshot,
 														  0,
-														  NULL);
+														  NULL, can_skip_fetch);
 
 	/*
 	 * all done.
@@ -872,12 +448,9 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
 	pstate = shm_toc_allocate(pcxt->toc, node->pscan_len);
 
 	pstate->tbmiterator = 0;
-	pstate->prefetch_iterator = 0;
 
 	/* Initialize the mutex */
 	SpinLockInit(&pstate->mutex);
-	pstate->prefetch_pages = 0;
-	pstate->prefetch_target = 0;
 	pstate->state = BM_INITIAL;
 
 	ConditionVariableInit(&pstate->cv);
@@ -909,11 +482,7 @@ ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
 	if (DsaPointerIsValid(pstate->tbmiterator))
 		tbm_free_shared_area(dsa, pstate->tbmiterator);
 
-	if (DsaPointerIsValid(pstate->prefetch_iterator))
-		tbm_free_shared_area(dsa, pstate->prefetch_iterator);
-
 	pstate->tbmiterator = InvalidDsaPointer;
-	pstate->prefetch_iterator = InvalidDsaPointer;
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index e8ab5d78fc..3f84abeb90 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -181,7 +181,6 @@ struct TBMIterator
 	int			spageptr;		/* next spages index */
 	int			schunkptr;		/* next schunks index */
 	int			schunkbit;		/* next bit to check in current schunk */
-	TBMIterateResult output;	/* MUST BE LAST (because variable-size) */
 };
 
 /*
@@ -222,7 +221,6 @@ struct TBMSharedIterator
 	PTEntryArray *ptbase;		/* pagetable element array */
 	PTIterationArray *ptpages;	/* sorted exact page index list */
 	PTIterationArray *ptchunks; /* sorted lossy page index list */
-	TBMIterateResult output;	/* MUST BE LAST (because variable-size) */
 };
 
 /* Local function prototypes */
@@ -696,8 +694,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
 	 * Create the TBMIterator struct, with enough trailing space to serve the
 	 * needs of the TBMIterateResult sub-struct.
 	 */
-	iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
-									  MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+	iterator = (TBMIterator *) palloc(sizeof(TBMIterator));
 	iterator->tbm = tbm;
 
 	/*
@@ -958,20 +955,21 @@ tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
 /*
  * tbm_iterate - scan through next page of a TIDBitmap
  *
- * Returns a TBMIterateResult representing one page, or NULL if there are
- * no more pages to scan.  Pages are guaranteed to be delivered in numerical
- * order.  If result->ntuples < 0, then the bitmap is "lossy" and failed to
- * remember the exact tuples to look at on this page --- the caller must
- * examine all tuples on the page and check if they meet the intended
- * condition.  If result->recheck is true, only the indicated tuples need
- * be examined, but the condition must be rechecked anyway.  (For ease of
- * testing, recheck is always set true when ntuples < 0.)
+ * Caller must pass in a TBMIterateResult to be filled.
+ *
+ * Pages are guaranteed to be delivered in numerical order.  tbmres->blockno is
+ * set to InvalidBlockNumber when there are no more pages to scan. If
+ * tbmres->ntuples < 0, then the bitmap is "lossy" and failed to remember the
+ * exact tuples to look at on this page --- the caller must examine all tuples
+ * on the page and check if they meet the intended condition.  If
+ * tbmres->recheck is true, only the indicated tuples need be examined, but the
+ * condition must be rechecked anyway.  (For ease of testing, recheck is always
+ * set true when ntuples < 0.)
  */
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)
 {
 	TIDBitmap  *tbm = iterator->tbm;
-	TBMIterateResult *output = &(iterator->output);
 
 	Assert(tbm->iterating == TBM_ITERATING_PRIVATE);
 
@@ -999,6 +997,7 @@ tbm_iterate(TBMIterator *iterator)
 	 * If both chunk and per-page data remain, must output the numerically
 	 * earlier page.
 	 */
+	Assert(tbmres);
 	if (iterator->schunkptr < tbm->nchunks)
 	{
 		PagetableEntry *chunk = tbm->schunks[iterator->schunkptr];
@@ -1009,11 +1008,11 @@ tbm_iterate(TBMIterator *iterator)
 			chunk_blockno < tbm->spages[iterator->spageptr]->blockno)
 		{
 			/* Return a lossy page indicator from the chunk */
-			output->blockno = chunk_blockno;
-			output->ntuples = -1;
-			output->recheck = true;
+			tbmres->blockno = chunk_blockno;
+			tbmres->ntuples = -1;
+			tbmres->recheck = true;
 			iterator->schunkbit++;
-			return output;
+			return;
 		}
 	}
 
@@ -1029,16 +1028,17 @@ tbm_iterate(TBMIterator *iterator)
 			page = tbm->spages[iterator->spageptr];
 
 		/* scan bitmap to extract individual offset numbers */
-		ntuples = tbm_extract_page_tuple(page, output);
-		output->blockno = page->blockno;
-		output->ntuples = ntuples;
-		output->recheck = page->recheck;
+		ntuples = tbm_extract_page_tuple(page, tbmres);
+		tbmres->blockno = page->blockno;
+		tbmres->ntuples = ntuples;
+		tbmres->recheck = page->recheck;
 		iterator->spageptr++;
-		return output;
+		return;
 	}
 
 	/* Nothing more in the bitmap */
-	return NULL;
+	tbmres->blockno = InvalidBlockNumber;
+	return;
 }
 
 /*
@@ -1048,10 +1048,9 @@ tbm_iterate(TBMIterator *iterator)
  *	across multiple processes.  We need to acquire the iterator LWLock,
  *	before accessing the shared members.
  */
-TBMIterateResult *
-tbm_shared_iterate(TBMSharedIterator *iterator)
+void
+tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres)
 {
-	TBMIterateResult *output = &iterator->output;
 	TBMSharedIteratorState *istate = iterator->state;
 	PagetableEntry *ptbase = NULL;
 	int		   *idxpages = NULL;
@@ -1102,13 +1101,13 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
 			chunk_blockno < ptbase[idxpages[istate->spageptr]].blockno)
 		{
 			/* Return a lossy page indicator from the chunk */
-			output->blockno = chunk_blockno;
-			output->ntuples = -1;
-			output->recheck = true;
+			tbmres->blockno = chunk_blockno;
+			tbmres->ntuples = -1;
+			tbmres->recheck = true;
 			istate->schunkbit++;
 
 			LWLockRelease(&istate->lock);
-			return output;
+			return;
 		}
 	}
 
@@ -1118,21 +1117,22 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
 		int			ntuples;
 
 		/* scan bitmap to extract individual offset numbers */
-		ntuples = tbm_extract_page_tuple(page, output);
-		output->blockno = page->blockno;
-		output->ntuples = ntuples;
-		output->recheck = page->recheck;
+		ntuples = tbm_extract_page_tuple(page, tbmres);
+		tbmres->blockno = page->blockno;
+		tbmres->ntuples = ntuples;
+		tbmres->recheck = page->recheck;
 		istate->spageptr++;
 
 		LWLockRelease(&istate->lock);
 
-		return output;
+		return;
 	}
 
 	LWLockRelease(&istate->lock);
 
 	/* Nothing more in the bitmap */
-	return NULL;
+	tbmres->blockno = InvalidBlockNumber;
+	return;
 }
 
 /*
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 21e71d6091..e28fb71aaf 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -20,6 +20,7 @@
 #include "access/skey.h"
 #include "access/table.h"		/* for backward compatibility */
 #include "access/tableam.h"
+#include "nodes/execnodes.h"
 #include "nodes/lockoptions.h"
 #include "nodes/primnodes.h"
 #include "storage/bufpage.h"
@@ -76,9 +77,15 @@ typedef struct HeapScanDescData
 	struct PgStreamingRead *pgsr;
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
+
+	/*
+	 * MFIXME: not sure if vmbuffer is being released in the right place.
+	 */
+	Buffer		vmbuffer;		/* current VM buffer for BHS */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
 	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
+	int			empty_tuples;
 }			HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
 
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304a..cdb44316a3 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -16,6 +16,7 @@
 
 #include "access/htup_details.h"
 #include "access/itup.h"
+#include "nodes/tidbitmap.h"
 #include "port/atomics.h"
 #include "storage/buf.h"
 #include "storage/spin.h"
@@ -40,6 +41,13 @@ typedef struct TableScanDescData
 	ItemPointerData rs_mintid;
 	ItemPointerData rs_maxtid;
 
+	/* Only used for Bitmap table scans */
+	TBMIterator *tbmiterator;
+	TBMSharedIterator *shared_tbmiterator;
+	long		exact_pages;
+	long		lossy_pages;
+	int			empty_tuples;
+
 	/*
 	 * Information about type and behaviour of the scan, a bitmask of members
 	 * of the ScanOptions enum (see tableam.h).
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5f8474871d..b5626805a6 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -32,6 +32,7 @@ extern PGDLLIMPORT char *default_table_access_method;
 extern PGDLLIMPORT bool synchronize_seqscans;
 
 
+struct BitmapHeapScanState;
 struct BulkInsertStateData;
 struct IndexInfo;
 struct SampleScanState;
@@ -62,7 +63,8 @@ typedef enum ScanOptions
 
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
-}			ScanOptions;
+	SO_CAN_SKIP_FETCH = 1 << 10
+} ScanOptions;
 
 /*
  * Result codes for table_{update,delete,lock_tuple}, and for visibility
@@ -773,53 +775,36 @@ typedef struct TableAmRoutine
 	 */
 
 	/*
-	 * Prepare to fetch / check / return tuples from `tbmres->blockno` as part
-	 * of a bitmap table scan. `scan` was started via table_beginscan_bm().
-	 * Return false if there are no tuples to be found on the page, true
-	 * otherwise.
+	 * Prepare to fetch / check / return tuples obtained from the streaming
+	 * read helper as part of a bitmap table scan. `scan` was started via
+	 * table_beginscan_bm(). Return false if the relation is exhausted and
+	 * true otherwise.
 	 *
 	 * This will typically read and pin the target block, and do the necessary
 	 * work to allow scan_bitmap_next_tuple() to return tuples (e.g. it might
 	 * make sense to perform tuple visibility checks at this time). For some
-	 * AMs it will make more sense to do all the work referencing `tbmres`
-	 * contents here, for others it might be better to defer more work to
-	 * scan_bitmap_next_tuple.
+	 * AMs, it might be better to defer more work to scan_bitmap_next_tuple().
 	 *
-	 * If `tbmres->blockno` is -1, this is a lossy scan and all visible tuples
-	 * on the page have to be returned, otherwise the tuples at offsets in
-	 * `tbmres->offsets` need to be returned.
+	 * After examining the page from the TBMIterateResult returned by the
+	 * streaming read helper, most AMs will set relevant fields in the `scan`
+	 * for the benefit of scan_bitmap_next_tuple(). All AMs must set
+	 * `recheck`, passed in by the caller, based on the value in the
+	 * TBMIterateResult so that BitmapHeapNext() can determine whether or not
+	 * to yield tuples.
 	 *
-	 * XXX: Currently this may only be implemented if the AM uses md.c as its
-	 * storage manager, and uses ItemPointer->ip_blkid in a manner that maps
-	 * blockids directly to the underlying storage. nodeBitmapHeapscan.c
-	 * performs prefetching directly using that interface.  This probably
-	 * needs to be rectified at a later point.
-	 *
-	 * XXX: Currently this may only be implemented if the AM uses the
-	 * visibilitymap, as nodeBitmapHeapscan.c unconditionally accesses it to
-	 * perform prefetching.  This probably needs to be rectified at a later
-	 * point.
-	 *
-	 * Optional callback, but either both scan_bitmap_next_block and
-	 * scan_bitmap_next_tuple need to exist, or neither.
+	 * Optional callback, but scan_bitmap_setup, scan_bitmap_next_block, and
+	 * scan_bitmap_next_tuple need to exist, or none of them.
 	 */
-	bool		(*scan_bitmap_next_block) (TableScanDesc scan,
-										   struct TBMIterateResult *tbmres);
+	bool		(*scan_bitmap_next_block) (TableScanDesc scan, bool *recheck);
 
 	/*
 	 * Fetch the next tuple of a bitmap table scan into `slot` and return true
 	 * if a visible tuple was found, false otherwise.
 	 *
-	 * For some AMs it will make more sense to do all the work referencing
-	 * `tbmres` contents in scan_bitmap_next_block, for others it might be
-	 * better to defer more work to this callback.
-	 *
-	 * Optional callback, but either both scan_bitmap_next_block and
-	 * scan_bitmap_next_tuple need to exist, or neither.
+	 * Optional callback, but scan_bitmap_setup, scan_bitmap_next_block, and
+	 * scan_bitmap_next_tuple need to exist, or none of them.
 	 */
-	bool		(*scan_bitmap_next_tuple) (TableScanDesc scan,
-										   struct TBMIterateResult *tbmres,
-										   TupleTableSlot *slot);
+	bool		(*scan_bitmap_next_tuple) (TableScanDesc scan, TupleTableSlot *slot);
 
 	/*
 	 * Prepare to fetch tuples from the next block in a sample scan. Return
@@ -944,10 +929,13 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
  */
 static inline TableScanDesc
 table_beginscan_bm(Relation rel, Snapshot snapshot,
-				   int nkeys, struct ScanKeyData *key)
+				   int nkeys, struct ScanKeyData *key, bool can_skip_fetch)
 {
 	uint32		flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
 
+	if (can_skip_fetch)
+		flags |= SO_CAN_SKIP_FETCH;
+
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
 
@@ -1955,8 +1943,7 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
  * used after verifying the presence (at plan time or such).
  */
 static inline bool
-table_scan_bitmap_next_block(TableScanDesc scan,
-							 struct TBMIterateResult *tbmres)
+table_scan_bitmap_next_block(TableScanDesc scan, bool *recheck)
 {
 	/*
 	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1966,8 +1953,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
 	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
 		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
 
-	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
-														   tbmres);
+	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck);
 }
 
 /*
@@ -1979,9 +1965,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
  * returned false.
  */
 static inline bool
-table_scan_bitmap_next_tuple(TableScanDesc scan,
-							 struct TBMIterateResult *tbmres,
-							 TupleTableSlot *slot)
+table_scan_bitmap_next_tuple(TableScanDesc scan, TupleTableSlot *slot)
 {
 	/*
 	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
@@ -1992,7 +1976,6 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
 
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
-														   tbmres,
 														   slot);
 }
 
diff --git a/src/include/executor/nodeBitmapHeapscan.h b/src/include/executor/nodeBitmapHeapscan.h
index ea003a9caa..39ae7b29ef 100644
--- a/src/include/executor/nodeBitmapHeapscan.h
+++ b/src/include/executor/nodeBitmapHeapscan.h
@@ -28,5 +28,6 @@ extern void ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
 										  ParallelContext *pcxt);
 extern void ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
 										   ParallelWorkerContext *pwcxt);
+extern void table_bitmap_scan_setup(BitmapHeapScanState *scanstate);
 
 #endif							/* NODEBITMAPHEAPSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 444a5f0fd5..eab4ae0f54 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1682,11 +1682,7 @@ typedef enum
 /* ----------------
  *	 ParallelBitmapHeapState information
  *		tbmiterator				iterator for scanning current pages
- *		prefetch_iterator		iterator for prefetching ahead of current page
- *		mutex					mutual exclusion for the prefetching variable
- *								and state
- *		prefetch_pages			# pages prefetch iterator is ahead of current
- *		prefetch_target			current target prefetch distance
+ *		mutex					mutual exclusion for the state
  *		state					current state of the TIDBitmap
  *		cv						conditional wait variable
  *		phs_snapshot_data		snapshot data shared to workers
@@ -1695,10 +1691,7 @@ typedef enum
 typedef struct ParallelBitmapHeapState
 {
 	dsa_pointer tbmiterator;
-	dsa_pointer prefetch_iterator;
 	slock_t		mutex;
-	int			prefetch_pages;
-	int			prefetch_target;
 	SharedBitmapState state;
 	ConditionVariable cv;
 	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
@@ -1709,22 +1702,9 @@ typedef struct ParallelBitmapHeapState
  *
  *		bitmapqualorig	   execution state for bitmapqualorig expressions
  *		tbm				   bitmap obtained from child index scan(s)
- *		tbmiterator		   iterator for scanning current pages
- *		tbmres			   current-page data
- *		can_skip_fetch	   can we potentially skip tuple fetches in this scan?
- *		return_empty_tuples number of empty tuples to return
- *		vmbuffer		   buffer for visibility-map lookups
- *		pvmbuffer		   ditto, for prefetched pages
- *		exact_pages		   total number of exact pages retrieved
- *		lossy_pages		   total number of lossy pages retrieved
- *		prefetch_iterator  iterator for prefetching ahead of current page
- *		prefetch_pages	   # pages prefetch iterator is ahead of current
- *		prefetch_target    current target prefetch distance
- *		prefetch_maximum   maximum value for prefetch_target
+ *		recheck			   whether or not the tuple needs to be rechecked before returning
  *		pscan_len		   size of the shared memory for parallel bitmap
  *		initialized		   is node is ready to iterate
- *		shared_tbmiterator	   shared iterator
- *		shared_prefetch_iterator shared iterator for prefetching
  *		pstate			   shared state for parallel bitmap scan
  * ----------------
  */
@@ -1733,22 +1713,9 @@ typedef struct BitmapHeapScanState
 	ScanState	ss;				/* its first field is NodeTag */
 	ExprState  *bitmapqualorig;
 	TIDBitmap  *tbm;
-	TBMIterator *tbmiterator;
-	TBMIterateResult *tbmres;
-	bool		can_skip_fetch;
-	int			return_empty_tuples;
-	Buffer		vmbuffer;
-	Buffer		pvmbuffer;
-	long		exact_pages;
-	long		lossy_pages;
-	TBMIterator *prefetch_iterator;
-	int			prefetch_pages;
-	int			prefetch_target;
-	int			prefetch_maximum;
+	bool		recheck;
 	Size		pscan_len;
 	bool		initialized;
-	TBMSharedIterator *shared_tbmiterator;
-	TBMSharedIterator *shared_prefetch_iterator;
 	ParallelBitmapHeapState *pstate;
 } BitmapHeapScanState;
 
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index 1945f0639b..97e40097fd 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -64,8 +64,8 @@ extern bool tbm_is_empty(const TIDBitmap *tbm);
 
 extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
 extern dsa_pointer tbm_prepare_shared_iterate(TIDBitmap *tbm);
-extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
-extern TBMIterateResult *tbm_shared_iterate(TBMSharedIterator *iterator);
+extern void tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres);
+extern void tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres);
 extern void tbm_end_iterate(TBMIterator *iterator);
 extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
 extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
-- 
2.43.2

v6-0007-WIP-pg_buffercache_invalidate.patchtext/x-patch; charset=US-ASCII; name=v6-0007-WIP-pg_buffercache_invalidate.patchDownload

From 9b049491c79332f8780ef4d3973856d558cf3859 Mon Sep 17 00:00:00 2001
From: Palak <palakchaturvedi2843@gmail.com>
Date: Fri, 30 Jun 2023 08:21:06 +0000
Subject: [PATCH v6 7/7] WIP: pg_buffercache_invalidate

See https://commitfest.postgresql.org/47/4426/
---
 contrib/pg_buffercache/Makefile               |  2 +-
 contrib/pg_buffercache/meson.build            |  1 +
 .../pg_buffercache--1.4--1.5.sql              |  6 ++
 contrib/pg_buffercache/pg_buffercache.control |  2 +-
 contrib/pg_buffercache/pg_buffercache_pages.c | 34 +++++++++
 src/backend/storage/buffer/bufmgr.c           | 71 +++++++++++++++++++
 src/include/storage/bufmgr.h                  |  3 +
 7 files changed, 117 insertions(+), 2 deletions(-)
 create mode 100644 contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql

diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index d6b58d4da9..eae65ead9e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,7 @@ OBJS = \
 EXTENSION = pg_buffercache
 DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
 	pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
-	pg_buffercache--1.3--1.4.sql
+	pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
 PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
 
 REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index c86e33cc95..1ca3452918 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -22,6 +22,7 @@ install_data(
   'pg_buffercache--1.2--1.3.sql',
   'pg_buffercache--1.2.sql',
   'pg_buffercache--1.3--1.4.sql',
+  'pg_buffercache--1.4--1.5.sql',
   'pg_buffercache.control',
   kwargs: contrib_data_args,
 )
diff --git a/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql b/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql
new file mode 100644
index 0000000000..c96848c77d
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql
@@ -0,0 +1,6 @@
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.5'" to load this file. \quit
+
+CREATE FUNCTION pg_buffercache_invalidate(IN int, IN bool default true)
+RETURNS bool
+AS 'MODULE_PATHNAME', 'pg_buffercache_invalidate'
+LANGUAGE C PARALLEL SAFE;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index a82ae5f9bb..5ee875f77d 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
 # pg_buffercache extension
 comment = 'examine the shared buffer cache'
-default_version = '1.4'
+default_version = '1.5'
 module_pathname = '$libdir/pg_buffercache'
 relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3316732365..93ee4ef724 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -64,6 +64,40 @@ PG_FUNCTION_INFO_V1(pg_buffercache_pages);
 PG_FUNCTION_INFO_V1(pg_buffercache_summary);
 PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
 
+PG_FUNCTION_INFO_V1(pg_buffercache_invalidate);
+Datum
+pg_buffercache_invalidate(PG_FUNCTION_ARGS)
+{
+	Buffer		bufnum;
+	bool		result;
+	bool		force;
+
+	if (PG_ARGISNULL(0))
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("buffernum cannot be NULL")));
+	}
+
+	bufnum = PG_GETARG_INT32(0);
+	if (bufnum <= 0 || bufnum > NBuffers)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("buffernum is not valid")));
+
+	}
+
+	/*
+	 * Check whether to force invalidate the dirty buffer. The default value of force is true.
+	 */
+
+	force = PG_GETARG_BOOL(1);
+
+	result = TryInvalidateBuffer(bufnum, force);
+	PG_RETURN_BOOL(result);
+}
+
 Datum
 pg_buffercache_pages(PG_FUNCTION_ARGS)
 {
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 729d1f9172..fd35aa6495 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5941,3 +5941,74 @@ ResOwnerPrintBufferPin(Datum res)
 {
 	return DebugPrintBufferRefcount(DatumGetInt32(res));
 }
+
+/*
+ * Try Invalidating a buffer using bufnum.
+ * If the buffer is invalid, the function returns false.
+ * The function checks for dirty buffer and flushes the dirty buffer before invalidating.
+ * If the buffer is still dirty it returns false.
+ */
+bool
+TryInvalidateBuffer(Buffer bufnum, bool force)
+{
+	BufferDesc *bufHdr = GetBufferDescriptor(bufnum - 1);
+	uint32		buf_state;
+	ReservePrivateRefCountEntry();
+
+	buf_state = LockBufHdr(bufHdr);
+	if ((buf_state & BM_VALID) == BM_VALID)
+	{
+		/*
+		 * The buffer is pinned therefore cannot invalidate.
+		 */
+		if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
+		{
+			UnlockBufHdr(bufHdr, buf_state);
+			return false;
+		}
+		if ((buf_state & BM_DIRTY) == BM_DIRTY)
+		{
+			/*
+			 * If the buffer is dirty and the user has not asked to clear the dirty buffer return false.
+			 * Otherwise clear the dirty buffer.
+			 */
+			if(!force){
+				return false;
+			}
+			/*
+			 * Try once to flush the dirty buffer.
+			 */
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+			PinBuffer_Locked(bufHdr);
+			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+			FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+			LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+			UnpinBuffer(bufHdr);
+			buf_state = LockBufHdr(bufHdr);
+			if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
+			{
+				UnlockBufHdr(bufHdr, buf_state);
+				return false;
+			}
+
+			/*
+			 * If its dirty again or not valid anymore give up.
+			 */
+
+			if ((buf_state & (BM_DIRTY | BM_VALID)) != (BM_VALID))
+			{
+				UnlockBufHdr(bufHdr, buf_state);
+				return false;
+			}
+
+		}
+
+		InvalidateBuffer(bufHdr);
+		return true;
+	}
+	else
+	{
+		UnlockBufHdr(bufHdr, buf_state);
+		return false;
+	}
+}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index b57f71f97e..f57d76ae35 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -295,6 +295,9 @@ extern bool BgBufferSync(struct WritebackContext *wb_context);
 extern void LimitAdditionalPins(uint32 *additional_pins);
 extern void LimitAdditionalLocalPins(uint32 *additional_pins);
 
+
+extern bool TryInvalidateBuffer(Buffer bufnum, bool force);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
-- 
2.43.2

#30

Dilip Kumar

dilipbalaut@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#29)

Re: Streaming I/O, vectored I/O (WIP)

On Sat, Mar 9, 2024 at 3:55 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Hi Thomas,

I am planning to review this patch set, so started going through 0001,
I have a question related to how we are issuing smgrprefetch in
StartReadBuffers()

+ if (operation->io_buffers_len > 0)
+ {
+ if (flags & READ_BUFFERS_ISSUE_ADVICE)
  {
- if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- MemSet((char *) bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
+ /*
+ * In theory we should only do this if PrepareReadBuffers() had to
+ * allocate new buffers above.  That way, if two calls to
+ * StartReadBuffers() were made for the same blocks before
+ * WaitReadBuffers(), only the first would issue the advice.
+ * That'd be a better simulation of true asynchronous I/O, which
+ * would only start the I/O once, but isn't done here for
+ * simplicity.  Note also that the following call might actually
+ * issue two advice calls if we cross a segment boundary; in a
+ * true asynchronous version we might choose to process only one
+ * real I/O at a time in that case.
+ */
+ smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
  }

This is always issuing smgrprefetch starting with the input blockNum,
shouldn't we pass the first blockNum which we did not find in the
Buffer pool? So basically in the loop above this call where we are
doing PrepareReadBuffer() we should track the first blockNum for which
the found is not true and pass that blockNum into the smgrprefetch()
as a first block right?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#31

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Dilip Kumar (#30)

Re: Streaming I/O, vectored I/O (WIP)

On Tue, Mar 12, 2024 at 7:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I am planning to review this patch set, so started going through 0001,
I have a question related to how we are issuing smgrprefetch in
StartReadBuffers()

Thanks!

+ /*
+ * In theory we should only do this if PrepareReadBuffers() had to
+ * allocate new buffers above.  That way, if two calls to
+ * StartReadBuffers() were made for the same blocks before
+ * WaitReadBuffers(), only the first would issue the advice.
+ * That'd be a better simulation of true asynchronous I/O, which
+ * would only start the I/O once, but isn't done here for
+ * simplicity.  Note also that the following call might actually
+ * issue two advice calls if we cross a segment boundary; in a
+ * true asynchronous version we might choose to process only one
+ * real I/O at a time in that case.
+ */
+ smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
}
This is always issuing smgrprefetch starting with the input blockNum,
shouldn't we pass the first blockNum which we did not find in the
Buffer pool? So basically in the loop above this call where we are
doing PrepareReadBuffer() we should track the first blockNum for which
the found is not true and pass that blockNum into the smgrprefetch()
as a first block right?

I think you'd be right if StartReadBuffers() were capable of
processing a sequence consisting of a hit followed by misses, but
currently it always gives up after the first hit. That is, it always
processes some number of misses (0-16) and then at most one hit. So
for now the variable would always turn out to be the same as blockNum.

The reason is that I wanted to allows "full sized" read system calls
to form. If you said "hey please read these 16 blocks" (I'm calling
that "full sized", AKA MAX_BUFFERS_PER_TRANSFER), and it found 2 hits,
then it could only form a read of 14 blocks, but there might be more
blocks that could be read after those. We would have some arbitrary
shorter read system calls, when we wanted to make them all as big as
possible. So in the current patch you say "hey please read these 16
blocks" and it returns saying "only read 1", you call again with 15
and it says "only read 1", and you call again and says "read 16!"
(assuming 2 more were readable after the original range we started
with). Then physical reads are maximised. Maybe there is some nice
way to solve that, but I thought this way was the simplest (and if
there is some instruction-cache-locality/tight-loop/perf reason why we
should work harder to find ranges of hits, it could be for later).
Does that make sense?

#32

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#31)

Re: Streaming I/O, vectored I/O (WIP)

On Tue, Mar 12, 2024 at 7:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:

possible. So in the current patch you say "hey please read these 16
blocks" and it returns saying "only read 1", you call again with 15

Oops, typo worth correcting: s/15/16/. Point being that the caller is
interested in more blocks after the original 16, so it uses 16 again
when it calls back (because that's the size of the Buffer array it
provides).

#33

Dilip Kumar

dilipbalaut@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#31)

Re: Streaming I/O, vectored I/O (WIP)

On Tue, Mar 12, 2024 at 12:10 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I think you'd be right if StartReadBuffers() were capable of
processing a sequence consisting of a hit followed by misses, but
currently it always gives up after the first hit. That is, it always
processes some number of misses (0-16) and then at most one hit. So
for now the variable would always turn out to be the same as blockNum.

Okay, then shouldn't this "if (found)" block immediately break the
loop so that when we hit the block we just return that block? So it
makes sense what you explained but with the current code if there are
the first few hits followed by misses then we will issue the
smgrprefetch() for the initial hit blocks as well.

+ if (found)
+ {
+ /*
+ * Terminate the read as soon as we get a hit.  It could be a
+ * single buffer hit, or it could be a hit that follows a readable
+ * range.  We don't want to create more than one readable range,
+ * so we stop here.
+ */
+ actual_nblocks = operation->nblocks = *nblocks = i + 1;    (Dilip: I
think we should break after this?)
+ }
+ else
+ {
+ /* Extend the readable range to cover this block. */
+ operation->io_buffers_len++;
+ }
+ }

The reason is that I wanted to allows "full sized" read system calls
to form. If you said "hey please read these 16 blocks" (I'm calling
that "full sized", AKA MAX_BUFFERS_PER_TRANSFER), and it found 2 hits,
then it could only form a read of 14 blocks, but there might be more
blocks that could be read after those. We would have some arbitrary
shorter read system calls, when we wanted to make them all as big as
possible. So in the current patch you say "hey please read these 16
blocks" and it returns saying "only read 1", you call again with 15
and it says "only read 1", and you call again and says "read 16!"
(assuming 2 more were readable after the original range we started
with). Then physical reads are maximised. Maybe there is some nice
way to solve that, but I thought this way was the simplest (and if
there is some instruction-cache-locality/tight-loop/perf reason why we
should work harder to find ranges of hits, it could be for later).
Does that make sense?

Understood, I think this makes sense.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#34

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Dilip Kumar (#33)

9 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

On Tue, Mar 12, 2024 at 11:39 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

+ actual_nblocks = operation->nblocks = *nblocks = i + 1;
(Dilip: I think we should break after this?)

In the next loop, i < actual_nblocks is false so the loop terminates.
But yeah that was a bit obscure, so I have added an explicit break.

Here also is a new version also of the streaming_read.c patch. This
is based on feedback from the bitmap heap scan thread, where Tomas and
Melanie noticed some problems when comparing effective_io_concurrency
= 0 and 1 against master. The attached random.sql exercises a bunch
of random scans with different settings, and random.txt shows the
resulting system calls, unpatched vs patched. Looking at it that way,
I was able to tweak the coding until I had the same behaviour as
master, except with fewer system calls wherever possible due to
coalescing or suppressing sequential advice.

Attachments:

random.txttext/plain; charset=US-ASCII; name=random.txtDownload

v7-0001-Provide-vectored-variant-of-ReadBuffer.patchapplication/octet-stream; name=v7-0001-Provide-vectored-variant-of-ReadBuffer.patchDownload

From d19e5554afc6ef5eb4455406e54740853a35140a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v7 1/7] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

With some more infrastructure in later work, StartReadBuffers() should
be extended to start real asynchronous I/O instead of advice.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 642 ++++++++++++++++++--------
 src/backend/storage/buffer/localbuf.c |  14 +-
 src/include/storage/bufmgr.h          |  45 ++
 src/tools/pgindent/typedefs.list      |   1 +
 4 files changed, 493 insertions(+), 209 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c..d0e9c7deff 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -471,10 +476,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,7 +504,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -781,7 +785,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -794,15 +797,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +819,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -994,35 +990,68 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
+/*
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
+ */
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * ReadBuffer_common -- common logic for all ReadBuffer variants
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+				  BufferAccessStrategy strategy)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			nblocks;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1041,181 +1070,405 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	nblocks = 1;
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
+	else
+		flags = 0;
+	if (StartReadBuffers(bmr,
+						 &buffer,
+						 forkNum,
+						 blockNum,
+						 &nblocks,
+						 strategy,
+						 flags,
+						 &operation))
+		WaitReadBuffers(&operation);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+		ZeroBuffer(buffer, mode);
+
+	return buffer;
+}
 
+static Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
+
+	Assert(bmr.smgr);
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+	return BufferDescriptorGetBuffer(bufHdr);
+}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.
+ *
+ * If false is returned, no I/O is necessary and WaitReadBuffers() is not
+ * necessary.  If true is returned, one I/O has been started, and
+ * WaitReadBuffers() must be called with the same operation object before the
+ * buffers are accessed.  Along with the operation object, the caller-supplied
+ * array of buffers must remain valid until WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+bool
+StartReadBuffers(BufferManagerRelation bmr,
+				 Buffer *buffers,
+				 ForkNumber forkNum,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 BufferAccessStrategy strategy,
+				 int flags,
+				 ReadBuffersOperation *operation)
+{
+	int			actual_nblocks = *nblocks;
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
 	}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	operation->bmr = bmr;
+	operation->forknum = forkNum;
+	operation->blocknum = blockNum;
+	operation->buffers = buffers;
+	operation->nblocks = actual_nblocks;
+	operation->strategy = strategy;
+	operation->flags = flags;
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	operation->io_buffers_len = 0;
 
-	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
-	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
-	else
+	for (int i = 0; i < actual_nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		bool		found;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		buffers[i] = PrepareReadBuffer(bmr,
+									   forkNum,
+									   blockNum + i,
+									   strategy,
+									   &found);
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = operation->nblocks = *nblocks = i + 1;
+			break;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			operation->io_buffers_len++;
+		}
+	}
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+	if (operation->io_buffers_len > 0)
+	{
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
-			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
-			}
-			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
+			/*
+			 * In theory we should only do this if PrepareReadBuffers() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
 		}
+
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
 	}
+	else
+	{
+		return false;
+	}
+}
 
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
 	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	/*
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
+	 */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+	bmr = operation->bmr;
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
-
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
 	}
 	else
 	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
 	}
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+	/*
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
+	 */
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
+	else
+		pgBufferUsage.shared_blks_read += nblocks;
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+	for (int i = 0; i < nblocks; ++i)
+	{
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
+
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
+		{
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
+			{
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
+			}
+			else
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
+
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
+
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
+
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
+
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for StartReadBuffers.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1476,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1286,19 +1535,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1603,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1638,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +1994,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2259,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2282,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2597,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2611,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3679,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5414,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5435,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a5..985a2c7049 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d335..b57f71f97e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers().  Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int			nblocks;
+	BufferAccessStrategy strategy;
+	int			flags;
+
+	/* Range of buffers, if we need to perform a read. */
+	int			io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffers(BufferManagerRelation bmr,
+							 Buffer *buffers,
+							 ForkNumber forknum,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 BufferAccessStrategy strategy,
+							 int flags,
+							 ReadBuffersOperation *operation);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +292,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d3a7f75b08..26129b8e22 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2267,6 +2267,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.39.3 (Apple Git-146)

v7-0002-Provide-API-for-streaming-reads-of-relations.patchapplication/octet-stream; name=v7-0002-Provide-API-for-streaming-reads-of-relations.patchDownload

From 71ebe6eb85b292753cd4f784ab8bf60fa549d655 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Feb 2024 00:01:42 +1300
Subject: [PATCH v7 2/7] Provide API for "streaming" reads of relations.

"Streaming reads" can be used as a more efficient replacement for
a series of ReadBuffer() calls.

The client code supplies a callback that can say which block to read
next, and then consumes individual buffers one at a time.  This division
allows streaming_read.c to build up large calls to StartReadBuffers(),
and issue fadvise() advice about future random reads in a systematic
way.

This API is based on an idea proposed by Andres Freund, to pave the way
for asynchronous I/O in future work as required to support direct I/O.
The main aim is to create an abstraction that insulates client code from
future improvements to the I/O subsystem.

An extended API may be necessary in future for more complicated cases
(for example recovery), but this basic API is thought to be sufficient
for many common usage patterns involving predictable access to a single
relation fork.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 659 +++++++++++++++++++++++
 src/backend/storage/meson.build          |   1 +
 src/include/storage/streaming_read.h     |  52 ++
 src/tools/pgindent/typedefs.list         |   2 +
 7 files changed, 734 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..eec03f6f2b 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 0000000000..bcab44c802
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 0000000000..39aef2a84a
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 0000000000..d5c29b750d
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,659 @@
+#include "postgres.h"
+
+#include "catalog/pg_tablespace.h"
+#include "miscadmin.h"
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+#include "utils/spccache.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ */
+typedef struct PgStreamingReadRange
+{
+	bool		need_wait;
+	bool		advice_issued;
+	BlockNumber blocknum;
+	int			nblocks;
+	int			per_buffer_data_index;
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+	ReadBuffersOperation operation;
+} PgStreamingReadRange;
+
+/*
+ * Streaming read object.
+ */
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			next_tail_buffer;
+	int			distance;
+	bool		started;
+	bool		finished;
+	bool		advice_enabled;
+	void	   *pgsr_private;
+	PgStreamingReadBufferCB callback;
+
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	/* Sometimes we need to buffer one block for flow control. */
+	BlockNumber unget_blocknum;
+	void	   *unget_per_buffer_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Circular buffer of ranges. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+							   void *pgsr_private,
+							   size_t per_buffer_data_size,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_ios;
+	uint32		max_pinned_buffers;
+	Oid			tablespace_id;
+
+	/*
+	 * Make sure our bmr's smgr and persistent are populated.  The caller
+	 * asserts that the storage manager will remain valid.
+	 */
+	if (!bmr.smgr)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice to the kernel to tell it that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	tablespace_id = bmr.smgr->smgr_rlocator.locator.spcOid;
+	if (!OidIsValid(MyDatabaseId) ||
+		(bmr.rel && IsCatalogRelation(bmr.rel)) ||
+		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
+	{
+		/*
+		 * Avoid circularity while trying to look up tablespace settings or
+		 * before spccache.c is ready.
+		 */
+		max_ios = effective_io_concurrency;
+	}
+	else if (flags & PGSR_FLAG_MAINTENANCE)
+		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+	else
+		max_ios = get_tablespace_io_concurrency(tablespace_id);
+
+	/*
+	 * Choose a maximum number of buffers we're prepared to pin.  We try to
+	 * pin fewer if we can, though.  We clamp it to at least
+	 * MAX_BUFFER_PER_TRANSFER so that we can have a chance to build up a full
+	 * sized read, even when max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+	/* Don't allow this backend to pin more than its share of buffers. */
+	if (SmgrIsTemp(bmr.smgr))
+		LimitAdditionalLocalPins(&max_pinned_buffers);
+	else
+		LimitAdditionalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0), therefore we need two extra elements
+	 * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+	 * maxmimum possible number of occupied ranges of the smallest possible
+	 * size of one.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, ranges) +
+				sizeof(pgsr->ranges[0]) * size);
+
+	pgsr->max_ios = max_ios;
+	pgsr->per_buffer_data_size = per_buffer_data_size;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+	pgsr->callback = next_block_cb;
+	pgsr->bmr = bmr;
+	pgsr->forknum = forknum;
+
+	pgsr->unget_blocknum = InvalidBlockNumber;
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  As long as direct I/O isn't
+	 * enabled, and the caller hasn't promised sequential access, we can use
+	 * it.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & PGSR_FLAG_SEQUENTIAL) == 0)
+		pgsr->advice_enabled = true;
+#endif
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out doing full-sized
+	 * reads.
+	 */
+	if (flags & PGSR_FLAG_FULL)
+		pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers);
+	else
+		pgsr->distance = 1;
+
+	/*
+	 * Space for the callback to store extra data along with each block.  Note
+	 * that we need one more than max_pinned_buffers, so we can return a
+	 * pointer to a slot that can't be overwritten until the next call.
+	 */
+	if (per_buffer_data_size)
+		pgsr->per_buffer_data = palloc(per_buffer_data_size * size);
+
+	return pgsr;
+}
+
+/*
+ * Find the per-buffer data index for the Nth block of a range.
+ */
+static int
+get_per_buffer_data_index(PgStreamingRead *pgsr, PgStreamingReadRange *range, int n)
+{
+	int			result;
+
+	/*
+	 * Find slot in the circular buffer of per-buffer data, without using the
+	 * expensive % operator.
+	 */
+	result = range->per_buffer_data_index + n;
+	while (result >= pgsr->size)
+		result -= pgsr->size;
+	Assert(result == (range->per_buffer_data_index + n) % pgsr->size);
+
+	return result;
+}
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static void *
+get_per_buffer_data_by_index(PgStreamingRead *pgsr, int per_buffer_data_index)
+{
+	return (char *) pgsr->per_buffer_data +
+		pgsr->per_buffer_data_size * per_buffer_data_index;
+}
+
+/*
+ * Return a pointer to the per-buffer data for the Nth block of a range.
+ */
+static void *
+get_per_buffer_data(PgStreamingRead *pgsr, PgStreamingReadRange *range, int n)
+{
+	return get_per_buffer_data_by_index(pgsr,
+										get_per_buffer_data_index(pgsr,
+																  range,
+																  n));
+}
+
+/*
+ * Start reading the head range, and create a new head range.  The new head
+ * range is returned.  It may not be empty, if StartReadBuffers() couldn't
+ * start the entire range; in that case the returned range contains the
+ * remaining portion of the range.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_start_head_range(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *head_range;
+	PgStreamingReadRange *new_head_range;
+	int			nblocks_pinned;
+	int			flags;
+
+	/* Caller should make sure we never exceed max_ios. */
+	Assert((pgsr->ios_in_progress < pgsr->max_ios) ||
+		   (pgsr->ios_in_progress == 0 && pgsr->max_ios == 0));
+
+	/* Should only call if the head range has some blocks to read. */
+	head_range = &pgsr->ranges[pgsr->head];
+	Assert(head_range->nblocks > 0);
+
+	/*
+	 * If advice hasn't been suppressed, and this system supports it, this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (pgsr->advice_enabled &&
+		pgsr->max_ios > 0 &&
+		pgsr->started &&
+		head_range->blocknum != pgsr->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+	/* Suppress advice on the first call, because it's too late to benefit. */
+	if (!pgsr->started)
+		pgsr->started = true;
+
+	/* We shouldn't be trying to pin more buffers that we're allowed to. */
+	Assert(pgsr->pinned_buffers + head_range->nblocks <= pgsr->max_pinned_buffers);
+
+	/* Start reading as many blocks as we can from the head range. */
+	nblocks_pinned = head_range->nblocks;
+	head_range->need_wait =
+		StartReadBuffers(pgsr->bmr,
+						 head_range->buffers,
+						 pgsr->forknum,
+						 head_range->blocknum,
+						 &nblocks_pinned,
+						 pgsr->strategy,
+						 flags,
+						 &head_range->operation);
+
+	Assert(pgsr->pinned_buffers <= pgsr->max_pinned_buffers);
+
+	if (head_range->need_wait && (flags & READ_BUFFERS_ISSUE_ADVICE))
+	{
+		/*
+		 * Since we've issued advice, we count an I/O in progress until we
+		 * call WaitReadBuffers().
+		 */
+		head_range->advice_issued = true;
+		pgsr->ios_in_progress++;
+		Assert(pgsr->ios_in_progress <= pgsr->max_ios);
+	}
+
+	/*
+	 * StartReadBuffers() might have pinned fewer blocks than we asked it to,
+	 * but always at least one.
+	 */
+	Assert(nblocks_pinned <= head_range->nblocks);
+	Assert(nblocks_pinned >= 1);
+	pgsr->pinned_buffers += nblocks_pinned;
+
+	/*
+	 * Remember where the next block would be after that, so we can detect
+	 * sequential access next time.
+	 */
+	pgsr->seq_blocknum = head_range->blocknum + nblocks_pinned;
+
+	/*
+	 * Create a new head range.  There must be space, because we have enough
+	 * elements for every range to hold just one block, up to the pin limit.
+	 */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	new_head_range = &pgsr->ranges[pgsr->head];
+	new_head_range->nblocks = 0;
+	new_head_range->advice_issued = false;
+
+	/*
+	 * If we didn't manage to start the whole read above, we split the range,
+	 * moving the remainder into the new head range.
+	 */
+	if (nblocks_pinned < head_range->nblocks)
+	{
+		int			nblocks_remaining = head_range->nblocks - nblocks_pinned;
+
+		head_range->nblocks = nblocks_pinned;
+
+		new_head_range->blocknum = head_range->blocknum + nblocks_pinned;
+		new_head_range->nblocks = nblocks_remaining;
+	}
+
+	/* The new range has per-buffer data starting after the previous range. */
+	new_head_range->per_buffer_data_index =
+		get_per_buffer_data_index(pgsr, head_range, nblocks_pinned);
+
+	return new_head_range;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow pg_streaming_unget_block() to work.
+ */
+static BlockNumber
+pg_streaming_get_block(PgStreamingRead *pgsr, void *per_buffer_data)
+{
+	BlockNumber result;
+
+	if (unlikely(pgsr->unget_blocknum != InvalidBlockNumber))
+	{
+		/*
+		 * If we had to unget a block, now it is time to return that one
+		 * again.
+		 */
+		result = pgsr->unget_blocknum;
+		pgsr->unget_blocknum = InvalidBlockNumber;
+
+		/*
+		 * The same per_buffer_data element must have been used, and still
+		 * contains whatever data the callback wrote into it.  So we just
+		 * sanity-check that we were called with the value that
+		 * pg_streaming_unget_block() pushed back.
+		 */
+		Assert(per_buffer_data == pgsr->unget_per_buffer_data);
+	}
+	else
+	{
+		/* Use the installed callback directly. */
+		result = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+	}
+
+	return result;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.  This *must* be called with the
+ * last value returned by pg_streaming_get_block().
+ */
+static void
+pg_streaming_unget_block(PgStreamingRead *pgsr, BlockNumber blocknum, void *per_buffer_data)
+{
+	Assert(pgsr->unget_blocknum == InvalidBlockNumber);
+	pgsr->unget_blocknum = blocknum;
+	pgsr->unget_per_buffer_data = per_buffer_data;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *range;
+
+	/* If we're finished, don't look ahead. */
+	if (pgsr->finished)
+		return;
+
+	/*
+	 * We we've already started the maximum allowed number of I/Os, don't look
+	 * ahead.  There is a special case for max_ios == 0.
+	 */
+	if (pgsr->max_ios > 0 && pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/* Can't pin any more buffers. */
+	if (pgsr->pinned_buffers == pgsr->distance)
+		return;
+
+	/*
+	 * Keep trying to add new blocks to the end of the head range while doing
+	 * so wouldn't exceed the distance limit.
+	 */
+	range = &pgsr->ranges[pgsr->head];
+	while (pgsr->pinned_buffers + range->nblocks < pgsr->distance)
+	{
+		BlockNumber blocknum;
+		void	   *per_buffer_data;
+
+		/* Do we have a full-sized range? */
+		if (range->nblocks == lengthof(range->buffers))
+		{
+			/* Start as much of it as we can. */
+			range = pg_streaming_read_start_head_range(pgsr);
+
+			/* If we're now at the I/O limit, stop here. */
+			if (pgsr->ios_in_progress == pgsr->max_ios)
+				return;
+
+			/*
+			 * That might have only been partially started, but always
+			 * processes at least one so that'll do for now.
+			 */
+			Assert(range->nblocks < lengthof(range->buffers));
+		}
+
+		/* Find per-buffer data slot for the next block. */
+		per_buffer_data = get_per_buffer_data(pgsr, range, range->nblocks);
+
+		/* Find out which block the callback wants to read next. */
+		blocknum = pg_streaming_get_block(pgsr, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			/* End of stream. */
+			pgsr->finished = true;
+			break;
+		}
+
+		/*
+		 * Is there a head range that we cannot extend, because the requested
+		 * block is not consecutive?
+		 */
+		if (range->nblocks > 0 &&
+			range->blocknum + range->nblocks != blocknum)
+		{
+			/* Yes.  Start it, so we can begin building a new one. */
+			range = pg_streaming_read_start_head_range(pgsr);
+
+			/*
+			 * It's possible that it was only partially started, and we have a
+			 * new range with the remainder.  Keep starting I/Os until we get
+			 * it all out of the way, or we hit the I/O limit.
+			 */
+			while (range->nblocks > 0 && pgsr->ios_in_progress < pgsr->max_ios)
+				range = pg_streaming_read_start_head_range(pgsr);
+
+			/*
+			 * We do have to worry about I/O capacity running out if the head
+			 * range was split.  In that case we have to 'unget' the block
+			 * returned by the callback.
+			 */
+			if (pgsr->ios_in_progress == pgsr->max_ios)
+			{
+				pg_streaming_unget_block(pgsr, blocknum, per_buffer_data);
+				return;
+			}
+		}
+
+		/* If we have a new, empty range, initialize the start block. */
+		if (range->nblocks == 0)
+			range->blocknum = blocknum;
+
+		/* This block extends the range by one. */
+		Assert(range->blocknum + range->nblocks == blocknum);
+		range->nblocks++;
+	};
+
+	/*
+	 * Normally we don't start the head range, preferring to give it a chance
+	 * to grow to full size once more buffers have been consumed.  In cases
+	 * where that can't possibly happen, we might as well start the read
+	 * immediately.
+	 */
+	if ((range->nblocks > 0 && pgsr->finished) ||
+		(range->nblocks == pgsr->distance))
+		pg_streaming_read_start_head_range(pgsr);
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+	PgStreamingReadRange *tail_range;
+
+	for (;;)
+	{
+		if (pgsr->tail != pgsr->head)
+		{
+			tail_range = &pgsr->ranges[pgsr->tail];
+
+			/*
+			 * Do we need to wait for a ReadBuffers operation to finish before
+			 * returning the buffers in this range?
+			 */
+			if (tail_range->need_wait)
+			{
+				int			distance;
+
+				Assert(pgsr->next_tail_buffer == 0);
+				WaitReadBuffers(&tail_range->operation);
+				tail_range->need_wait = false;
+
+				/*
+				 * We don't really know if the kernel generated a physical I/O
+				 * when we issued advice, let alone when it finished, but it
+				 * has certainly finished now because we've performed the
+				 * read.
+				 */
+				if (tail_range->advice_issued)
+				{
+
+					Assert(pgsr->ios_in_progress > 0);
+					pgsr->ios_in_progress--;
+
+					/*
+					 * Look-ahead distance ramps up rapidly if we're issuing
+					 * advice, so we can search for new more I/Os to start.
+					 */
+					distance = pgsr->distance * 2;
+					distance = Min(distance, pgsr->max_pinned_buffers);
+					pgsr->distance = distance;
+				}
+				else
+				{
+					/*
+					 * There is no point in increasing look-ahead distance if
+					 * we've already reached the full I/O size, since we're
+					 * not issuing advice.  Extra distance would only pin more
+					 * buffers for no benefit.
+					 */
+					if (pgsr->distance > MAX_BUFFERS_PER_TRANSFER)
+					{
+						/*
+						 * Look-ahead distance gradually decays to full I/O
+						 * size.
+						 */
+						pgsr->distance--;
+					}
+					else
+					{
+						/*
+						 * Look-ahead distance ramps up rapidly, but not more
+						 * that the full I/O size.
+						 */
+						distance = pgsr->distance * 2;
+						distance = Min(distance, MAX_BUFFERS_PER_TRANSFER);
+						distance = Min(distance, pgsr->max_pinned_buffers);
+						pgsr->distance = distance;
+					}
+				}
+			}
+			else if (pgsr->next_tail_buffer == 0)
+			{
+				/* No I/O necessary. Look-ahead distance gradually decays. */
+				if (pgsr->distance > 1)
+					pgsr->distance--;
+			}
+
+			/* Are there more buffers available in this range? */
+			if (pgsr->next_tail_buffer < tail_range->nblocks)
+			{
+				int			buffer_index;
+				Buffer		buffer;
+
+				buffer_index = pgsr->next_tail_buffer++;
+				buffer = tail_range->buffers[buffer_index];
+
+				Assert(BufferIsValid(buffer));
+
+				/* We are giving away ownership of this pinned buffer. */
+				Assert(pgsr->pinned_buffers > 0);
+				pgsr->pinned_buffers--;
+
+				if (per_buffer_data)
+					*per_buffer_data = get_per_buffer_data(pgsr, tail_range, buffer_index);
+
+				/* We may be able to get another I/O started. */
+				pg_streaming_read_look_ahead(pgsr);
+
+				return buffer;
+			}
+
+			/* Advance tail to next range. */
+			if (++pgsr->tail == pgsr->size)
+				pgsr->tail = 0;
+			pgsr->next_tail_buffer = 0;
+		}
+		else
+		{
+			/*
+			 * If tail crashed into head, and head is not empty, then it is
+			 * time to start that range.  Otherwise, force a look-ahead, to
+			 * kick start the stream.
+			 */
+			Assert(pgsr->tail == pgsr->head);
+			if (pgsr->ranges[pgsr->head].nblocks > 0)
+			{
+				pg_streaming_read_start_head_range(pgsr);
+			}
+			else
+			{
+				pg_streaming_read_look_ahead(pgsr);
+
+				/* Finished? */
+				if (pgsr->tail == pgsr->head &&
+					pgsr->ranges[pgsr->head].nblocks == 0)
+					break;
+			}
+		}
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	pgsr->finished = true;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(pgsr->pinned_buffers == 0);
+	Assert(pgsr->ios_in_progress == 0);
+
+	/* Release memory. */
+	if (pgsr->per_buffer_data)
+		pfree(pgsr->per_buffer_data);
+
+	pfree(pgsr);
+}
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca2..739d13293f 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 0000000000..c4d3892bb2
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,52 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define PGSR_FLAG_FULL 0x04
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												void *pgsr_private,
+												void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+													   void *pgsr_private,
+													   size_t per_buffer_private_size,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 26129b8e22..299c77ea69 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2097,6 +2097,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
-- 
2.39.3 (Apple Git-146)

v7-0003-Use-streaming-reads-in-pg_prewarm.patchapplication/octet-stream; name=v7-0003-Use-streaming-reads-in-pg_prewarm.patchDownload

From fcd68808264b44c5417f5cb59c0b51794ca7ede0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 09:28:42 +1200
Subject: [PATCH v7 3/7] Use streaming reads in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use streaming reads.  This
provides a simple example of such a transformation, and generates fewer
system calls.

Reviewed-by:
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 contrib/pg_prewarm/pg_prewarm.c | 40 ++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e4..1cc84bcb0c 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+							   void *pgsr_private,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		PgStreamingRead *pgsr;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_FULL,
+											  &p,
+											  0,
+											  NULL,
+											  BMR_REL(rel),
+											  forkNumber,
+											  pg_prewarm_streaming_read_next);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Close relation, release lock. */
-- 
2.39.3 (Apple Git-146)

v7-0004-WIP-Use-streaming-reads-in-heapam-scans.patchapplication/octet-stream; name=v7-0004-WIP-Use-streaming-reads-in-heapam-scans.patchDownload

From 719f20602d157431ef69c7da122da9ac31a8339c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 12 Apr 2023 18:16:06 -0700
Subject: [PATCH v7 4/7] WIP: Use streaming reads in heapam scans.

XXX Cherry-picked from https://github.com/anarazel/postgres/tree/aio and
lightly modified by TM, for demonstration purposes.

Author: Andres Freund <andres@anarazel.de>
---
 src/backend/access/heap/heapam.c         | 205 +++++++++++++++++++++--
 src/backend/access/heap/heapam_handler.c |   2 +-
 src/include/access/heapam.h              |   5 +-
 3 files changed, 192 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 34bc60f625..ab88eff296 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -62,6 +62,7 @@
 #include "storage/predicate.h"
 #include "storage/procarray.h"
 #include "storage/standby.h"
+#include "storage/streaming_read.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/relcache.h"
@@ -221,6 +222,89 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+					  void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) pgsr_private;
+	BlockNumber blockno;
+
+	Assert(!scan->rs_base.rs_parallel);
+	Assert(scan->rs_nblocks > 0);
+
+	if (scan->rs_prefetch_block == InvalidBlockNumber)
+	{
+		scan->rs_prefetch_block = blockno = scan->rs_startblock;
+	}
+	else
+	{
+		blockno = ++scan->rs_prefetch_block;
+
+		/* wrap back to the start of the heap */
+		if (blockno >= scan->rs_nblocks)
+			scan->rs_prefetch_block = blockno = 0;
+
+		/* we're done if we're back at where we started */
+		if (blockno == scan->rs_startblock)
+			return InvalidBlockNumber;
+
+		/* check if the limit imposed by heap_setscanlimits() is met */
+		if (scan->rs_numblocks != InvalidBlockNumber)
+		{
+			if (--scan->rs_numblocks == 0)
+				return InvalidBlockNumber;
+		}
+	}
+
+	return blockno;
+}
+
+static BlockNumber
+heap_pgsr_next_parallel(PgStreamingRead *pgsr, void *pgsr_private,
+						void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) pgsr_private;
+	ParallelBlockTableScanDesc pbscan =
+		(ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
+	ParallelBlockTableScanWorker pbscanwork =
+		scan->rs_parallelworkerdata;
+	BlockNumber blockno;
+
+	Assert(scan->rs_base.rs_parallel);
+	Assert(scan->rs_nblocks > 0);
+
+	/* Note that other processes might have already finished the scan */
+	blockno = table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
+												pbscanwork, pbscan);
+
+	return blockno;
+}
+
+static PgStreamingRead *
+heap_pgsr_single_alloc(HeapScanDesc scan)
+{
+	return pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+										  scan,
+										  0,
+										  scan->rs_strategy,
+										  BMR_REL(scan->rs_base.rs_rd),
+										  MAIN_FORKNUM,
+										  heap_pgsr_next_single);
+}
+
+static PgStreamingRead *
+heap_pgsr_parallel_alloc(HeapScanDesc scan)
+{
+	return pg_streaming_read_buffer_alloc(PGSR_FLAG_SEQUENTIAL,
+										  scan,
+										  0,
+										  scan->rs_strategy,
+										  BMR_REL(scan->rs_base.rs_rd),
+										  MAIN_FORKNUM,
+										  heap_pgsr_next_parallel);
+}
+
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -338,6 +422,26 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	 */
 	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN)
 		pgstat_count_heap_scan(scan->rs_base.rs_rd);
+
+	scan->rs_prefetch_block = InvalidBlockNumber;
+	if (scan->pgsr)
+	{
+		pg_streaming_read_free(scan->pgsr);
+		scan->pgsr = NULL;
+	}
+
+	/*
+	 * FIXME: This probably should be done in the !rs_inited blocks instead.
+	 */
+	scan->pgsr = NULL;
+	if (!RelationUsesLocalBuffers(scan->rs_base.rs_rd) &&
+		(scan->rs_base.rs_flags & SO_TYPE_SEQSCAN))
+	{
+		if (scan->rs_base.rs_parallel)
+			scan->pgsr = heap_pgsr_parallel_alloc(scan);
+		else
+			scan->pgsr = heap_pgsr_single_alloc(scan);
+	}
 }
 
 /*
@@ -370,7 +474,7 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
  * which tuples on the page are visible.
  */
 void
-heapgetpage(TableScanDesc sscan, BlockNumber block)
+heapgetpage(TableScanDesc sscan, BlockNumber block, Buffer pgsr_buffer)
 {
 	HeapScanDesc scan = (HeapScanDesc) sscan;
 	Buffer		buffer;
@@ -397,9 +501,20 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
+	if (BufferIsValid(pgsr_buffer))
+	{
+		Assert(scan->pgsr);
+		Assert(BufferGetBlockNumber(pgsr_buffer) == block);
+		scan->rs_cbuf = pgsr_buffer;
+	}
+	else
+	{
+		Assert(!scan->pgsr);
+
+		/* read page using selected strategy */
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
+										   RBM_NORMAL, scan->rs_strategy);
+	}
 	scan->rs_cblock = block;
 
 	if (!(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE))
@@ -486,7 +601,7 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
  * of the pages before we can get a chance to get our first page.
  */
 static BlockNumber
-heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
+heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir, Buffer *pgsr_buf)
 {
 	Assert(!scan->rs_inited);
 
@@ -496,16 +611,25 @@ heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 
 	if (ScanDirectionIsForward(dir))
 	{
+		if (scan->rs_base.rs_parallel != NULL)
+			table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
+													 scan->rs_parallelworkerdata,
+													 (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+
+		/* FIXME: Integrate more neatly */
+		if (scan->pgsr)
+		{
+			*pgsr_buf = pg_streaming_read_buffer_get_next(scan->pgsr, NULL);
+			if (*pgsr_buf == InvalidBuffer)
+				return InvalidBlockNumber;
+			return BufferGetBlockNumber(*pgsr_buf);
+		}
+
 		/* serial scan */
 		if (scan->rs_base.rs_parallel == NULL)
 			return scan->rs_startblock;
 		else
 		{
-			/* parallel scan */
-			table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
-													 scan->rs_parallelworkerdata,
-													 (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
-
 			/* may return InvalidBlockNumber if there are no more blocks */
 			return table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
 													 scan->rs_parallelworkerdata,
@@ -525,6 +649,12 @@ heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 		 */
 		scan->rs_base.rs_flags &= ~SO_ALLOW_SYNC;
 
+		if (scan->pgsr)
+		{
+			pg_streaming_read_free(scan->pgsr);
+			scan->pgsr = NULL;
+		}
+
 		/*
 		 * Start from last page of the scan.  Ensure we take into account
 		 * rs_numblocks if it's been adjusted by heap_setscanlimits().
@@ -626,11 +756,33 @@ heapgettup_continue_page(HeapScanDesc scan, ScanDirection dir, int *linesleft,
  * heap_setscanlimits().
  */
 static inline BlockNumber
-heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir)
+heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir,
+						 Buffer *pgsr_buf)
 {
 	if (ScanDirectionIsForward(dir))
 	{
-		if (scan->rs_base.rs_parallel == NULL)
+		if (scan->pgsr)
+		{
+#ifdef USE_ASSERT_CHECKING
+			block++;
+
+			/* wrap back to the start of the heap */
+			if (block >= scan->rs_nblocks)
+				block = 0;
+
+			/* we're done if we're back at where we started */
+			if (block == scan->rs_startblock)
+				block = InvalidBlockNumber;
+#endif
+			*pgsr_buf = pg_streaming_read_buffer_get_next(scan->pgsr, NULL);
+			if (*pgsr_buf == InvalidBuffer)
+				return InvalidBlockNumber;
+
+			Assert(scan->rs_base.rs_parallel ||
+				   block == BufferGetBlockNumber(*pgsr_buf));
+			return BufferGetBlockNumber(*pgsr_buf);
+		}
+		else if (scan->rs_base.rs_parallel == NULL)
 		{
 			block++;
 
@@ -675,6 +827,12 @@ heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir
 	}
 	else
 	{
+		if (scan->pgsr)
+		{
+			pg_streaming_read_free(scan->pgsr);
+			scan->pgsr = NULL;
+		}
+
 		/* we're done if the last block is the start position */
 		if (block == scan->rs_startblock)
 			return InvalidBlockNumber;
@@ -724,13 +882,14 @@ heapgettup(HeapScanDesc scan,
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
 	BlockNumber block;
+	Buffer		pgsr_buf = InvalidBuffer;
 	Page		page;
 	OffsetNumber lineoff;
 	int			linesleft;
 
 	if (unlikely(!scan->rs_inited))
 	{
-		block = heapgettup_initial_block(scan, dir);
+		block = heapgettup_initial_block(scan, dir, &pgsr_buf);
 		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
 		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
 		scan->rs_inited = true;
@@ -751,7 +910,7 @@ heapgettup(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapgetpage((TableScanDesc) scan, block, pgsr_buf);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -803,9 +962,10 @@ continue_page:
 		 * it's time to move to the next.
 		 */
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
+		pgsr_buf = InvalidBuffer;
 
 		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
+		block = heapgettup_advance_block(scan, block, dir, &pgsr_buf);
 	}
 
 	/* end of scan */
@@ -839,13 +999,14 @@ heapgettup_pagemode(HeapScanDesc scan,
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
 	BlockNumber block;
+	Buffer		pgsr_buf = InvalidBuffer;
 	Page		page;
 	int			lineindex;
 	int			linesleft;
 
 	if (unlikely(!scan->rs_inited))
 	{
-		block = heapgettup_initial_block(scan, dir);
+		block = heapgettup_initial_block(scan, dir, &pgsr_buf);
 		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
 		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
 		scan->rs_inited = true;
@@ -872,7 +1033,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapgetpage((TableScanDesc) scan, block, pgsr_buf);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
@@ -904,7 +1065,7 @@ continue_page:
 		}
 
 		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
+		block = heapgettup_advance_block(scan, block, dir, &pgsr_buf);
 	}
 
 	/* end of scan */
@@ -952,6 +1113,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_base.rs_parallel = parallel_scan;
 	scan->rs_strategy = NULL;	/* set in initscan */
 
+	scan->pgsr = NULL;
+
 	/*
 	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
 	 */
@@ -1058,6 +1221,12 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	if (scan->pgsr)
+	{
+		pg_streaming_read_free(scan->pgsr);
+		scan->pgsr = NULL;
+	}
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 680a50bf8b..20e83f48e3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2333,7 +2333,7 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 		return false;
 	}
 
-	heapgetpage(scan, blockno);
+	heapgetpage(scan, blockno, InvalidBuffer);
 	hscan->rs_inited = true;
 
 	return true;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4b133f6859..21e71d6091 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -59,6 +59,7 @@ typedef struct HeapScanDescData
 	bool		rs_inited;		/* false = scan not init'd yet */
 	OffsetNumber rs_coffset;	/* current offset # in non-page-at-a-time mode */
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
+	BlockNumber rs_prefetch_block;	/* block being prefetched */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
 
@@ -72,6 +73,8 @@ typedef struct HeapScanDescData
 	 */
 	ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
 
+	struct PgStreamingRead *pgsr;
+
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
@@ -246,7 +249,7 @@ extern TableScanDesc heap_beginscan(Relation relation, Snapshot snapshot,
 									uint32 flags);
 extern void heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk,
 							   BlockNumber numBlks);
-extern void heapgetpage(TableScanDesc sscan, BlockNumber block);
+extern void heapgetpage(TableScanDesc sscan, BlockNumber block, Buffer buffer);
 extern void heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 						bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(TableScanDesc sscan);
-- 
2.39.3 (Apple Git-146)

v7-0005-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patchapplication/octet-stream; name=v7-0005-WIP-Use-streaming-reads-in-nbtree-vacuum-scan.patchDownload

From f02b03f7940a0f19f596b711e16d9931f77ef888 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 12 Feb 2021 14:19:10 -0800
Subject: [PATCH v7 5/7] WIP: Use streaming reads in nbtree vacuum scan.

XXX Cherry-picked from https://github.com/anarazel/postgres/tree/aio and
lightly modified by TM, for demonstration purposes.

Author: Andres Freund <andres@anarazel.de>
---
 src/backend/access/nbtree/nbtree.c | 75 ++++++++++++++++++++++++------
 src/include/access/nbtree.h        |  3 ++
 2 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 41df1027d2..13fb6537e4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -32,6 +32,8 @@
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
+#include "utils/builtins.h"
 #include "utils/fmgrprotos.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
@@ -79,7 +81,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
 static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 IndexBulkDeleteCallback callback, void *callback_state,
 						 BTCycleId cycleid);
-static void btvacuumpage(BTVacState *vstate, BlockNumber scanblkno);
+static void btvacuumpage(BTVacState *vstate, Buffer buf, BlockNumber scanblkno);
 static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
 										  IndexTuple posting,
 										  OffsetNumber updatedoffset,
@@ -876,6 +878,24 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+static BlockNumber
+btvacuumscan_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private,
+					   void *per_buffer_data)
+{
+	BTVacState *vstate = (BTVacState *) pgsr_private;
+
+	if (vstate->nextblock == vstate->num_pages)
+		return InvalidBlockNumber;
+
+	/*
+	 * We can't use _bt_getbuf() here because it always applies
+	 * _bt_checkpage(), which will barf on an all-zero page. We want to
+	 * recycle all-zero pages, not fail.  Also, we want to use a nondefault
+	 * buffer access strategy. Also want to use AIO.
+	 */
+	return vstate->nextblock++;
+}
+
 /*
  * btvacuumscan --- scan the index for VACUUMing purposes
  *
@@ -969,6 +989,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	scanblkno = BTREE_METAPAGE + 1;
 	for (;;)
 	{
+		PgStreamingRead *pgsr;
+
 		/* Get the current relation length */
 		if (needLock)
 			LockRelationForExtension(rel, ExclusiveLock);
@@ -983,14 +1005,40 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		/* Quit if we've scanned the whole relation */
 		if (scanblkno >= num_pages)
 			break;
+
+		vstate.nextblock = scanblkno;
+		vstate.num_pages = num_pages;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_MAINTENANCE,
+											  &vstate,
+											  0,
+											  vstate.info->strategy,
+											  BMR_REL(vstate.info->index),
+											  MAIN_FORKNUM,
+											  btvacuumscan_pgsr_next);
+
 		/* Iterate over pages, then loop back to recheck length */
 		for (; scanblkno < num_pages; scanblkno++)
 		{
-			btvacuumpage(&vstate, scanblkno);
+			Buffer		buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
+
+			Assert(BufferIsValid(buf));
+			Assert(BufferGetBlockNumber(buf) == scanblkno);
+
+			btvacuumpage(&vstate, buf, scanblkno);
+
 			if (info->report_progress)
 				pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
 											 scanblkno);
+
+			/* call vacuum_delay_point while not holding any buffer lock */
+			vacuum_delay_point();
 		}
+
+		if (pg_streaming_read_buffer_get_next(pgsr, NULL) != 0)
+			elog(ERROR, "aio and plain pos out of sync");
+
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Set statistics num_pages field to final size of index */
@@ -1023,7 +1071,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
  * recycled (i.e. before the page split).
  */
 static void
-btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
+btvacuumpage(BTVacState *vstate, Buffer buf, BlockNumber scanblkno)
 {
 	IndexVacuumInfo *info = vstate->info;
 	IndexBulkDeleteResult *stats = vstate->stats;
@@ -1034,28 +1082,28 @@ btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
 	bool		attempt_pagedel;
 	BlockNumber blkno,
 				backtrack_to;
-	Buffer		buf;
 	Page		page;
 	BTPageOpaque opaque;
 
 	blkno = scanblkno;
 
+	Assert(BufferIsValid(buf));
 backtrack:
 
 	attempt_pagedel = false;
 	backtrack_to = P_NONE;
 
-	/* call vacuum_delay_point while not holding any buffer lock */
-	vacuum_delay_point();
-
 	/*
-	 * We can't use _bt_getbuf() here because it always applies
-	 * _bt_checkpage(), which will barf on an all-zero page. We want to
-	 * recycle all-zero pages, not fail.  Also, we want to use a nondefault
-	 * buffer access strategy.
+	 * If we're not backtracking we already read the page asynchronously. Not
+	 * using AIO when backtracking isn't too tragic - it's relatively rare,
+	 * and the buffer quite possibly is still in shared buffers.
 	 */
-	buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-							 info->strategy);
+	if (!BufferIsValid(buf))
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								 info->strategy);
+
+	Assert(BufferGetBlockNumber(buf) == blkno);
+
 	_bt_lockbuf(rel, buf, BT_READ);
 	page = BufferGetPage(buf);
 	opaque = NULL;
@@ -1342,6 +1390,7 @@ backtrack:
 	if (backtrack_to != P_NONE)
 	{
 		blkno = backtrack_to;
+		buf = InvalidBuffer;
 		goto backtrack;
 	}
 }
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 6eb162052e..2dbbbd36e7 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -343,6 +343,9 @@ typedef struct BTVacState
 	int			maxbufsize;		/* max bufsize that respects work_mem */
 	BTPendingFSM *pendingpages; /* One entry per newly deleted page */
 	int			npendingpages;	/* current # valid pendingpages */
+
+	BlockNumber nextblock;
+	BlockNumber num_pages;
 } BTVacState;
 
 /*
-- 
2.39.3 (Apple Git-146)

v7-0006-WIP-Use-streaming-reads-in-bitmap-heapscan.patchapplication/octet-stream; name=v7-0006-WIP-Use-streaming-reads-in-bitmap-heapscan.patchDownload

From 681e6af81c1ba91f880b22b679a2970052e44799 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 22 Aug 2023 16:48:30 +1200
Subject: [PATCH v7 6/7] WIP: Use streaming reads in bitmap heapscan.

XXX Cherry-picked from
https://github.com/melanieplageman/postgres/tree/bhs_aio and lightly
modified by TM, for demonstration purposes.

Author: Melanie Plageman <melanieplageman@gmail.com>
---
 src/backend/access/gin/ginget.c           |  17 +-
 src/backend/access/gin/ginscan.c          |   7 +
 src/backend/access/heap/heapam.c          |  71 +++
 src/backend/access/heap/heapam_handler.c  |  78 ++--
 src/backend/commands/explain.c            |  21 +-
 src/backend/executor/nodeBitmapHeapscan.c | 521 ++--------------------
 src/backend/nodes/tidbitmap.c             |  76 ++--
 src/include/access/heapam.h               |   7 +
 src/include/access/relscan.h              |   8 +
 src/include/access/tableam.h              |  71 ++-
 src/include/executor/nodeBitmapHeapscan.h |   1 +
 src/include/nodes/execnodes.h             |  39 +-
 src/include/nodes/tidbitmap.h             |   4 +-
 13 files changed, 284 insertions(+), 637 deletions(-)

diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 0b4f2ebadb..f238a3b071 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -373,7 +373,10 @@ restartScanEntry:
 			if (entry->matchBitmap)
 			{
 				if (entry->matchIterator)
+				{
 					tbm_end_iterate(entry->matchIterator);
+					pfree(entry->matchResult);
+				}
 				entry->matchIterator = NULL;
 				tbm_free(entry->matchBitmap);
 				entry->matchBitmap = NULL;
@@ -386,6 +389,9 @@ restartScanEntry:
 		if (entry->matchBitmap && !tbm_is_empty(entry->matchBitmap))
 		{
 			entry->matchIterator = tbm_begin_iterate(entry->matchBitmap);
+			entry->matchResult = (TBMIterateResult *)
+				palloc0(sizeof(TBMIterateResult) + MaxHeapTuplesPerPage *
+						sizeof(OffsetNumber));
 			entry->isFinished = false;
 		}
 	}
@@ -823,21 +829,24 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
 		{
 			/*
 			 * If we've exhausted all items on this block, move to next block
-			 * in the bitmap.
+			 * in the bitmap. tbm_iterate() sets matchResult->blockno to
+			 * InvalidBlockNumber when the bitmap is exhausted.
 			 */
-			while (entry->matchResult == NULL ||
+			while ((!BlockNumberIsValid(entry->matchResult->blockno)) ||
 				   (entry->matchResult->ntuples >= 0 &&
 					entry->offset >= entry->matchResult->ntuples) ||
 				   entry->matchResult->blockno < advancePastBlk ||
 				   (ItemPointerIsLossyPage(&advancePast) &&
 					entry->matchResult->blockno == advancePastBlk))
 			{
-				entry->matchResult = tbm_iterate(entry->matchIterator);
 
-				if (entry->matchResult == NULL)
+				tbm_iterate(entry->matchIterator, entry->matchResult);
+				if (!BlockNumberIsValid(entry->matchResult->blockno))
 				{
 					ItemPointerSetInvalid(&entry->curItem);
 					tbm_end_iterate(entry->matchIterator);
+					pfree(entry->matchResult);
+					entry->matchResult = NULL;
 					entry->matchIterator = NULL;
 					entry->isFinished = true;
 					break;
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index af24d38544..be27f9fe07 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -246,7 +246,14 @@ ginFreeScanKeys(GinScanOpaque so)
 		if (entry->list)
 			pfree(entry->list);
 		if (entry->matchIterator)
+		{
 			tbm_end_iterate(entry->matchIterator);
+			if (entry->matchResult)
+			{
+				pfree(entry->matchResult);
+				entry->matchResult = NULL;
+			}
+		}
 		if (entry->matchBitmap)
 			tbm_free(entry->matchBitmap);
 	}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ab88eff296..e02716328f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -222,6 +222,53 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+bitmapheap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+							void *per_buffer_data)
+{
+	TBMIterateResult *tbmres;
+	HeapScanDesc hdesc = (HeapScanDesc) pgsr_private;
+
+	tbmres = (TBMIterateResult *) per_buffer_data;
+
+	for (;;)
+	{
+		if (hdesc->rs_base.shared_tbmiterator)
+			tbm_shared_iterate(hdesc->rs_base.shared_tbmiterator, tbmres);
+		else
+			tbm_iterate(hdesc->rs_base.tbmiterator, tbmres);
+
+		if (!BlockNumberIsValid(tbmres->blockno))
+			return InvalidBlockNumber;
+
+		/*
+		 * Ignore any claimed entries past what we think is the end of the
+		 * relation. It may have been extended after the start of our scan (we
+		 * only hold an AccessShareLock, and it could be inserts from this
+		 * backend).  We don't take this optimization in SERIALIZABLE
+		 * isolation though, as we need to examine all invisible tuples
+		 * reachable by the index.
+		 */
+		if (!IsolationIsSerializable() && tbmres->blockno >= hdesc->rs_nblocks)
+			continue;
+
+		if (tbmres->ntuples >= 0)
+			hdesc->rs_base.exact_pages++;
+		else
+			hdesc->rs_base.lossy_pages++;
+
+		if (hdesc->rs_base.rs_flags & SO_CAN_SKIP_FETCH &&
+			!tbmres->recheck &&
+			VM_ALL_VISIBLE(hdesc->rs_base.rs_rd, tbmres->blockno, &hdesc->vmbuffer))
+		{
+			hdesc->empty_tuples += tbmres->ntuples;
+			continue;
+		}
+
+		return tbmres->blockno;
+	}
+}
+
 static BlockNumber
 heap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
 					  void *per_buffer_data)
@@ -429,6 +476,11 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 		pg_streaming_read_free(scan->pgsr);
 		scan->pgsr = NULL;
 	}
+	if (BufferIsValid(scan->vmbuffer))
+	{
+		ReleaseBuffer(scan->vmbuffer);
+		scan->vmbuffer = InvalidBuffer;
+	}
 
 	/*
 	 * FIXME: This probably should be done in the !rs_inited blocks instead.
@@ -442,6 +494,19 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 		else
 			scan->pgsr = heap_pgsr_single_alloc(scan);
 	}
+	else if (scan->rs_base.rs_flags & SO_TYPE_BITMAPSCAN)
+	{
+		scan->pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+													scan,
+													offsetof(TBMIterateResult, offsets) + MaxHeapTuplesPerPage * sizeof(OffsetNumber),
+													scan->rs_strategy,
+													BMR_REL(scan->rs_base.rs_rd),
+													MAIN_FORKNUM,
+													bitmapheap_pgsr_next_single);
+
+
+		scan->rs_inited = true;
+	}
 }
 
 /*
@@ -1114,6 +1179,12 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_strategy = NULL;	/* set in initscan */
 
 	scan->pgsr = NULL;
+	scan->vmbuffer = InvalidBuffer;
+	scan->empty_tuples = 0;
+	scan->rs_base.exact_pages = 0;
+	scan->rs_base.lossy_pages = 0;
+	scan->rs_base.tbmiterator = NULL;
+	scan->rs_base.shared_tbmiterator = NULL;
 
 	/*
 	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 20e83f48e3..946ecf92bb 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "storage/streaming_read.h"
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
@@ -2108,38 +2109,55 @@ heapam_estimate_rel_size(Relation rel, int32 *attr_widths,
  * Executor related callbacks for the heap AM
  * ------------------------------------------------------------------------
  */
-
 static bool
-heapam_scan_bitmap_next_block(TableScanDesc scan,
-							  TBMIterateResult *tbmres)
+heapam_scan_bitmap_next_block(TableScanDesc scan, bool *recheck)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
-	BlockNumber block = tbmres->blockno;
+	TBMIterateResult *tbmres;
+	void	   *io_private;
 	Buffer		buffer;
 	Snapshot	snapshot;
 	int			ntup;
 
-	hscan->rs_cindex = 0;
-	hscan->rs_ntuples = 0;
+	Assert(hscan->pgsr);
 
 	/*
-	 * Ignore any claimed entries past what we think is the end of the
-	 * relation. It may have been extended after the start of our scan (we
-	 * only hold an AccessShareLock, and it could be inserts from this
-	 * backend).  We don't take this optimization in SERIALIZABLE isolation
-	 * though, as we need to examine all invisible tuples reachable by the
-	 * index.
+	 * Release the buffer containing the previous block and reset the per-page
+	 * counters. Reset hscan->rs_ntuples here just to be safe.
 	 */
-	if (!IsolationIsSerializable() && block >= hscan->rs_nblocks)
-		return false;
+	if (BufferIsValid(hscan->rs_cbuf))
+	{
+		ReleaseBuffer(hscan->rs_cbuf);
+		hscan->rs_cbuf = InvalidBuffer;
+	}
+
+	hscan->rs_cindex = 0;
+	hscan->rs_ntuples = 0;
+
+	hscan->rs_cbuf = pg_streaming_read_buffer_get_next(hscan->pgsr, &io_private);
+
+	if (BufferIsInvalid(hscan->rs_cbuf))
+	{
+		if (BufferIsValid(hscan->vmbuffer))
+		{
+			ReleaseBuffer(hscan->vmbuffer);
+			hscan->vmbuffer = InvalidBuffer;
+		}
+
+		return hscan->empty_tuples > 0;
+	}
+
+	Assert(io_private);
+
+	tbmres = (TBMIterateResult *) io_private;
+
+	Assert(BufferGetBlockNumber(hscan->rs_cbuf) == tbmres->blockno);
+
+	*recheck = tbmres->recheck;
+
+	hscan->rs_cblock = tbmres->blockno;
+	hscan->rs_ntuples = tbmres->ntuples;
 
-	/*
-	 * Acquire pin on the target heap page, trading in any pin we held before.
-	 */
-	hscan->rs_cbuf = ReleaseAndReadBuffer(hscan->rs_cbuf,
-										  scan->rs_rd,
-										  block);
-	hscan->rs_cblock = block;
 	buffer = hscan->rs_cbuf;
 	snapshot = scan->rs_snapshot;
 
@@ -2175,7 +2193,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 			ItemPointerData tid;
 			HeapTupleData heapTuple;
 
-			ItemPointerSet(&tid, block, offnum);
+			ItemPointerSet(&tid, tbmres->blockno, offnum);
 			if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
 									   &heapTuple, NULL, true))
 				hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
@@ -2203,7 +2221,7 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 			loctup.t_data = (HeapTupleHeader) PageGetItem(page, lp);
 			loctup.t_len = ItemIdGetLength(lp);
 			loctup.t_tableOid = scan->rs_rd->rd_id;
-			ItemPointerSet(&loctup.t_self, block, offnum);
+			ItemPointerSet(&loctup.t_self, tbmres->blockno, offnum);
 			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
 			if (valid)
 			{
@@ -2221,25 +2239,31 @@ heapam_scan_bitmap_next_block(TableScanDesc scan,
 	Assert(ntup <= MaxHeapTuplesPerPage);
 	hscan->rs_ntuples = ntup;
 
-	return ntup > 0;
+	return true;
 }
 
 static bool
-heapam_scan_bitmap_next_tuple(TableScanDesc scan,
-							  TBMIterateResult *tbmres,
-							  TupleTableSlot *slot)
+heapam_scan_bitmap_next_tuple(TableScanDesc scan, TupleTableSlot *slot)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 	OffsetNumber targoffset;
 	Page		page;
 	ItemId		lp;
 
+	if (hscan->empty_tuples)
+	{
+		ExecStoreAllNullTuple(slot);
+		hscan->empty_tuples--;
+		return true;
+	}
+
 	/*
 	 * Out of range?  If so, nothing more to look at on this page
 	 */
 	if (hscan->rs_cindex < 0 || hscan->rs_cindex >= hscan->rs_ntuples)
 		return false;
 
+	Assert(BufferIsValid(hscan->rs_cbuf));
 	targoffset = hscan->rs_vistuples[hscan->rs_cindex];
 	page = BufferGetPage(hscan->rs_cbuf);
 	lp = PageGetItemId(page, targoffset);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index a9d5056af4..8edf728ae2 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/relscan.h"
 #include "access/xact.h"
 #include "catalog/pg_type.h"
 #include "commands/createas.h"
@@ -111,7 +112,7 @@ static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_memoize_info(MemoizeState *mstate, List *ancestors,
 							  ExplainState *es);
 static void show_hashagg_info(AggState *aggstate, ExplainState *es);
-static void show_tidbitmap_info(BitmapHeapScanState *planstate,
+static void show_tidbitmap_info(TableScanDesc tdesc,
 								ExplainState *es);
 static void show_instrumentation_count(const char *qlabel, int which,
 									   PlanState *planstate, ExplainState *es);
@@ -1885,7 +1886,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			if (es->analyze)
-				show_tidbitmap_info((BitmapHeapScanState *) planstate, es);
+				show_tidbitmap_info(((BitmapHeapScanState *) planstate)->ss.ss_currentScanDesc, es);
 			break;
 		case T_SampleScan:
 			show_tablesample(((SampleScan *) plan)->tablesample,
@@ -3474,25 +3475,25 @@ show_hashagg_info(AggState *aggstate, ExplainState *es)
  * If it's EXPLAIN ANALYZE, show exact/lossy pages for a BitmapHeapScan node
  */
 static void
-show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
+show_tidbitmap_info(TableScanDesc tdesc, ExplainState *es)
 {
 	if (es->format != EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainPropertyInteger("Exact Heap Blocks", NULL,
-							   planstate->exact_pages, es);
+							   tdesc->exact_pages, es);
 		ExplainPropertyInteger("Lossy Heap Blocks", NULL,
-							   planstate->lossy_pages, es);
+							   tdesc->lossy_pages, es);
 	}
 	else
 	{
-		if (planstate->exact_pages > 0 || planstate->lossy_pages > 0)
+		if (tdesc->exact_pages > 0 || tdesc->lossy_pages > 0)
 		{
 			ExplainIndentText(es);
 			appendStringInfoString(es->str, "Heap Blocks:");
-			if (planstate->exact_pages > 0)
-				appendStringInfo(es->str, " exact=%ld", planstate->exact_pages);
-			if (planstate->lossy_pages > 0)
-				appendStringInfo(es->str, " lossy=%ld", planstate->lossy_pages);
+			if (tdesc->exact_pages > 0)
+				appendStringInfo(es->str, " exact=%ld", tdesc->exact_pages);
+			if (tdesc->lossy_pages > 0)
+				appendStringInfo(es->str, " lossy=%ld", tdesc->lossy_pages);
 			appendStringInfoChar(es->str, '\n');
 		}
 	}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 345b67649e..9482f4ffbd 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -51,11 +51,6 @@
 
 static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
 static inline void BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate);
-static inline void BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
-												TBMIterateResult *tbmres);
-static inline void BitmapAdjustPrefetchTarget(BitmapHeapScanState *node);
-static inline void BitmapPrefetch(BitmapHeapScanState *node,
-								  TableScanDesc scan);
 static bool BitmapShouldInitializeSharedState(ParallelBitmapHeapState *pstate);
 
 
@@ -71,10 +66,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	ExprContext *econtext;
 	TableScanDesc scan;
 	TIDBitmap  *tbm;
-	TBMIterator *tbmiterator = NULL;
-	TBMSharedIterator *shared_tbmiterator = NULL;
-	TBMIterateResult *tbmres;
-	TupleTableSlot *slot;
+	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 	ParallelBitmapHeapState *pstate = node->pstate;
 	dsa_area   *dsa = node->ss.ps.state->es_query_dsa;
 
@@ -85,23 +77,10 @@ BitmapHeapNext(BitmapHeapScanState *node)
 	slot = node->ss.ss_ScanTupleSlot;
 	scan = node->ss.ss_currentScanDesc;
 	tbm = node->tbm;
-	if (pstate == NULL)
-		tbmiterator = node->tbmiterator;
-	else
-		shared_tbmiterator = node->shared_tbmiterator;
-	tbmres = node->tbmres;
 
 	/*
 	 * If we haven't yet performed the underlying index scan, do it, and begin
 	 * the iteration over the bitmap.
-	 *
-	 * For prefetching, we use *two* iterators, one for the pages we are
-	 * actually scanning and another that runs ahead of the first for
-	 * prefetching.  node->prefetch_pages tracks exactly how many pages ahead
-	 * the prefetch iterator is.  Also, node->prefetch_target tracks the
-	 * desired prefetch distance, which starts small and increases up to the
-	 * node->prefetch_maximum.  This is to avoid doing a lot of prefetching in
-	 * a scan that stops after a few tuples because of a LIMIT.
 	 */
 	if (!node->initialized)
 	{
@@ -113,17 +92,7 @@ BitmapHeapNext(BitmapHeapScanState *node)
 				elog(ERROR, "unrecognized result from subplan");
 
 			node->tbm = tbm;
-			node->tbmiterator = tbmiterator = tbm_begin_iterate(tbm);
-			node->tbmres = tbmres = NULL;
-
-#ifdef USE_PREFETCH
-			if (node->prefetch_maximum > 0)
-			{
-				node->prefetch_iterator = tbm_begin_iterate(tbm);
-				node->prefetch_pages = 0;
-				node->prefetch_target = -1;
-			}
-#endif							/* USE_PREFETCH */
+			scan->tbmiterator = tbm_begin_iterate(tbm);
 		}
 		else
 		{
@@ -145,194 +114,59 @@ BitmapHeapNext(BitmapHeapScanState *node)
 				 * dsa_pointer of the iterator state which will be used by
 				 * multiple processes to iterate jointly.
 				 */
-				pstate->tbmiterator = tbm_prepare_shared_iterate(tbm);
-#ifdef USE_PREFETCH
-				if (node->prefetch_maximum > 0)
-				{
-					pstate->prefetch_iterator =
-						tbm_prepare_shared_iterate(tbm);
-
-					/*
-					 * We don't need the mutex here as we haven't yet woke up
-					 * others.
-					 */
-					pstate->prefetch_pages = 0;
-					pstate->prefetch_target = -1;
-				}
-#endif
+				pstate->tbmiterator =
+					tbm_prepare_shared_iterate(tbm);
 
 				/* We have initialized the shared state so wake up others. */
 				BitmapDoneInitializingSharedState(pstate);
 			}
 
 			/* Allocate a private iterator and attach the shared state to it */
-			node->shared_tbmiterator = shared_tbmiterator =
+			scan->shared_tbmiterator =
 				tbm_attach_shared_iterate(dsa, pstate->tbmiterator);
-			node->tbmres = tbmres = NULL;
-
-#ifdef USE_PREFETCH
-			if (node->prefetch_maximum > 0)
-			{
-				node->shared_prefetch_iterator =
-					tbm_attach_shared_iterate(dsa, pstate->prefetch_iterator);
-			}
-#endif							/* USE_PREFETCH */
 		}
 		node->initialized = true;
+
+		/* Get the first block. if none, end of scan */
+		if (!table_scan_bitmap_next_block(scan, &node->recheck))
+			goto exit;
 	}
 
-	for (;;)
+	do
 	{
-		bool		skip_fetch;
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * Get next page of results if needed
-		 */
-		if (tbmres == NULL)
-		{
-			if (!pstate)
-				node->tbmres = tbmres = tbm_iterate(tbmiterator);
-			else
-				node->tbmres = tbmres = tbm_shared_iterate(shared_tbmiterator);
-			if (tbmres == NULL)
-			{
-				/* no more entries in the bitmap */
-				break;
-			}
-
-			BitmapAdjustPrefetchIterator(node, tbmres);
-
-			/*
-			 * We can skip fetching the heap page if we don't need any fields
-			 * from the heap, and the bitmap entries don't need rechecking,
-			 * and all tuples on the page are visible to our transaction.
-			 *
-			 * XXX: It's a layering violation that we do these checks above
-			 * tableam, they should probably moved below it at some point.
-			 */
-			skip_fetch = (node->can_skip_fetch &&
-						  !tbmres->recheck &&
-						  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-										 tbmres->blockno,
-										 &node->vmbuffer));
-
-			if (skip_fetch)
-			{
-				/* can't be lossy in the skip_fetch case */
-				Assert(tbmres->ntuples >= 0);
-
-				/*
-				 * The number of tuples on this page is put into
-				 * node->return_empty_tuples.
-				 */
-				node->return_empty_tuples = tbmres->ntuples;
-			}
-			else if (!table_scan_bitmap_next_block(scan, tbmres))
-			{
-				/* AM doesn't think this block is valid, skip */
-				continue;
-			}
-
-			if (tbmres->ntuples >= 0)
-				node->exact_pages++;
-			else
-				node->lossy_pages++;
-
-			/* Adjust the prefetch target */
-			BitmapAdjustPrefetchTarget(node);
-		}
-		else
+		while (table_scan_bitmap_next_tuple(scan, slot))
 		{
-			/*
-			 * Continuing in previously obtained page.
-			 */
-
-#ifdef USE_PREFETCH
-
-			/*
-			 * Try to prefetch at least a few pages even before we get to the
-			 * second page if we don't stop reading after the first tuple.
-			 */
-			if (!pstate)
-			{
-				if (node->prefetch_target < node->prefetch_maximum)
-					node->prefetch_target++;
-			}
-			else if (pstate->prefetch_target < node->prefetch_maximum)
-			{
-				/* take spinlock while updating shared state */
-				SpinLockAcquire(&pstate->mutex);
-				if (pstate->prefetch_target < node->prefetch_maximum)
-					pstate->prefetch_target++;
-				SpinLockRelease(&pstate->mutex);
-			}
-#endif							/* USE_PREFETCH */
-		}
+			CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * We issue prefetch requests *after* fetching the current page to try
-		 * to avoid having prefetching interfere with the main I/O. Also, this
-		 * should happen only when we have determined there is still something
-		 * to do on the current page, else we may uselessly prefetch the same
-		 * page we are just about to request for real.
-		 *
-		 * XXX: It's a layering violation that we do these checks above
-		 * tableam, they should probably moved below it at some point.
-		 */
-		BitmapPrefetch(node, scan);
-
-		if (node->return_empty_tuples > 0)
-		{
-			/*
-			 * If we don't have to fetch the tuple, just return nulls.
-			 */
-			ExecStoreAllNullTuple(slot);
+			if (!node->recheck)
+				return slot;
 
-			if (--node->return_empty_tuples == 0)
-			{
-				/* no more tuples to return in the next round */
-				node->tbmres = tbmres = NULL;
-			}
-		}
-		else
-		{
-			/*
-			 * Attempt to fetch tuple from AM.
-			 */
-			if (!table_scan_bitmap_next_tuple(scan, tbmres, slot))
-			{
-				/* nothing more to look at on this page */
-				node->tbmres = tbmres = NULL;
-				continue;
-			}
+			econtext->ecxt_scantuple = slot;
+			if (ExecQualAndReset(node->bitmapqualorig, econtext))
+				return slot;
 
-			/*
-			 * If we are using lossy info, we have to recheck the qual
-			 * conditions at every tuple.
-			 */
-			if (tbmres->recheck)
-			{
-				econtext->ecxt_scantuple = slot;
-				if (!ExecQualAndReset(node->bitmapqualorig, econtext))
-				{
-					/* Fails recheck, so drop it and loop back for another */
-					InstrCountFiltered2(node, 1);
-					ExecClearTuple(slot);
-					continue;
-				}
-			}
+			/* Fails recheck, so drop it and loop back for another */
+			InstrCountFiltered2(node, 1);
+			ExecClearTuple(slot);
 		}
+	} while (table_scan_bitmap_next_block(scan, &node->recheck));
 
-		/* OK to return this tuple */
-		return slot;
-	}
+exit:
 
 	/*
-	 * if we get here it means we are at the end of the scan..
+	 * Release iterator
 	 */
-	return ExecClearTuple(slot);
+	if (scan->tbmiterator)
+	{
+		tbm_end_iterate(scan->tbmiterator);
+		scan->tbmiterator = NULL;
+	}
+	else if (scan->shared_tbmiterator)
+	{
+		tbm_end_shared_iterate(scan->shared_tbmiterator);
+		scan->shared_tbmiterator = NULL;
+	}
+	return NULL;
 }
 
 /*
@@ -350,215 +184,6 @@ BitmapDoneInitializingSharedState(ParallelBitmapHeapState *pstate)
 	ConditionVariableBroadcast(&pstate->cv);
 }
 
-/*
- *	BitmapAdjustPrefetchIterator - Adjust the prefetch iterator
- */
-static inline void
-BitmapAdjustPrefetchIterator(BitmapHeapScanState *node,
-							 TBMIterateResult *tbmres)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
-		if (node->prefetch_pages > 0)
-		{
-			/* The main iterator has closed the distance by one page */
-			node->prefetch_pages--;
-		}
-		else if (prefetch_iterator)
-		{
-			/* Do not let the prefetch iterator get behind the main one */
-			TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
-
-			if (tbmpre == NULL || tbmpre->blockno != tbmres->blockno)
-				elog(ERROR, "prefetch and main iterators are out of sync");
-		}
-		return;
-	}
-
-	if (node->prefetch_maximum > 0)
-	{
-		TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
-		SpinLockAcquire(&pstate->mutex);
-		if (pstate->prefetch_pages > 0)
-		{
-			pstate->prefetch_pages--;
-			SpinLockRelease(&pstate->mutex);
-		}
-		else
-		{
-			/* Release the mutex before iterating */
-			SpinLockRelease(&pstate->mutex);
-
-			/*
-			 * In case of shared mode, we can not ensure that the current
-			 * blockno of the main iterator and that of the prefetch iterator
-			 * are same.  It's possible that whatever blockno we are
-			 * prefetching will be processed by another process.  Therefore,
-			 * we don't validate the blockno here as we do in non-parallel
-			 * case.
-			 */
-			if (prefetch_iterator)
-				tbm_shared_iterate(prefetch_iterator);
-		}
-	}
-#endif							/* USE_PREFETCH */
-}
-
-/*
- * BitmapAdjustPrefetchTarget - Adjust the prefetch target
- *
- * Increase prefetch target if it's not yet at the max.  Note that
- * we will increase it to zero after fetching the very first
- * page/tuple, then to one after the second tuple is fetched, then
- * it doubles as later pages are fetched.
- */
-static inline void
-BitmapAdjustPrefetchTarget(BitmapHeapScanState *node)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		if (node->prefetch_target >= node->prefetch_maximum)
-			 /* don't increase any further */ ;
-		else if (node->prefetch_target >= node->prefetch_maximum / 2)
-			node->prefetch_target = node->prefetch_maximum;
-		else if (node->prefetch_target > 0)
-			node->prefetch_target *= 2;
-		else
-			node->prefetch_target++;
-		return;
-	}
-
-	/* Do an unlocked check first to save spinlock acquisitions. */
-	if (pstate->prefetch_target < node->prefetch_maximum)
-	{
-		SpinLockAcquire(&pstate->mutex);
-		if (pstate->prefetch_target >= node->prefetch_maximum)
-			 /* don't increase any further */ ;
-		else if (pstate->prefetch_target >= node->prefetch_maximum / 2)
-			pstate->prefetch_target = node->prefetch_maximum;
-		else if (pstate->prefetch_target > 0)
-			pstate->prefetch_target *= 2;
-		else
-			pstate->prefetch_target++;
-		SpinLockRelease(&pstate->mutex);
-	}
-#endif							/* USE_PREFETCH */
-}
-
-/*
- * BitmapPrefetch - Prefetch, if prefetch_pages are behind prefetch_target
- */
-static inline void
-BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan)
-{
-#ifdef USE_PREFETCH
-	ParallelBitmapHeapState *pstate = node->pstate;
-
-	if (pstate == NULL)
-	{
-		TBMIterator *prefetch_iterator = node->prefetch_iterator;
-
-		if (prefetch_iterator)
-		{
-			while (node->prefetch_pages < node->prefetch_target)
-			{
-				TBMIterateResult *tbmpre = tbm_iterate(prefetch_iterator);
-				bool		skip_fetch;
-
-				if (tbmpre == NULL)
-				{
-					/* No more pages to prefetch */
-					tbm_end_iterate(prefetch_iterator);
-					node->prefetch_iterator = NULL;
-					break;
-				}
-				node->prefetch_pages++;
-
-				/*
-				 * If we expect not to have to actually read this heap page,
-				 * skip this prefetch call, but continue to run the prefetch
-				 * logic normally.  (Would it be better not to increment
-				 * prefetch_pages?)
-				 *
-				 * This depends on the assumption that the index AM will
-				 * report the same recheck flag for this future heap page as
-				 * it did for the current heap page; which is not a certainty
-				 * but is true in many cases.
-				 */
-				skip_fetch = (node->can_skip_fetch &&
-							  (node->tbmres ? !node->tbmres->recheck : false) &&
-							  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-											 tbmpre->blockno,
-											 &node->pvmbuffer));
-
-				if (!skip_fetch)
-					PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
-			}
-		}
-
-		return;
-	}
-
-	if (pstate->prefetch_pages < pstate->prefetch_target)
-	{
-		TBMSharedIterator *prefetch_iterator = node->shared_prefetch_iterator;
-
-		if (prefetch_iterator)
-		{
-			while (1)
-			{
-				TBMIterateResult *tbmpre;
-				bool		do_prefetch = false;
-				bool		skip_fetch;
-
-				/*
-				 * Recheck under the mutex. If some other process has already
-				 * done enough prefetching then we need not to do anything.
-				 */
-				SpinLockAcquire(&pstate->mutex);
-				if (pstate->prefetch_pages < pstate->prefetch_target)
-				{
-					pstate->prefetch_pages++;
-					do_prefetch = true;
-				}
-				SpinLockRelease(&pstate->mutex);
-
-				if (!do_prefetch)
-					return;
-
-				tbmpre = tbm_shared_iterate(prefetch_iterator);
-				if (tbmpre == NULL)
-				{
-					/* No more pages to prefetch */
-					tbm_end_shared_iterate(prefetch_iterator);
-					node->shared_prefetch_iterator = NULL;
-					break;
-				}
-
-				/* As above, skip prefetch if we expect not to need page */
-				skip_fetch = (node->can_skip_fetch &&
-							  (node->tbmres ? !node->tbmres->recheck : false) &&
-							  VM_ALL_VISIBLE(node->ss.ss_currentRelation,
-											 tbmpre->blockno,
-											 &node->pvmbuffer));
-
-				if (!skip_fetch)
-					PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);
-			}
-		}
-	}
-#endif							/* USE_PREFETCH */
-}
-
 /*
  * BitmapHeapRecheck -- access method routine to recheck a tuple in EvalPlanQual
  */
@@ -604,29 +229,10 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
 	table_rescan(node->ss.ss_currentScanDesc, NULL);
 
 	/* release bitmaps and buffers if any */
-	if (node->tbmiterator)
-		tbm_end_iterate(node->tbmiterator);
-	if (node->prefetch_iterator)
-		tbm_end_iterate(node->prefetch_iterator);
-	if (node->shared_tbmiterator)
-		tbm_end_shared_iterate(node->shared_tbmiterator);
-	if (node->shared_prefetch_iterator)
-		tbm_end_shared_iterate(node->shared_prefetch_iterator);
 	if (node->tbm)
 		tbm_free(node->tbm);
-	if (node->vmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->vmbuffer);
-	if (node->pvmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->pvmbuffer);
 	node->tbm = NULL;
-	node->tbmiterator = NULL;
-	node->tbmres = NULL;
-	node->prefetch_iterator = NULL;
 	node->initialized = false;
-	node->shared_tbmiterator = NULL;
-	node->shared_prefetch_iterator = NULL;
-	node->vmbuffer = InvalidBuffer;
-	node->pvmbuffer = InvalidBuffer;
 
 	ExecScanReScan(&node->ss);
 
@@ -660,20 +266,8 @@ ExecEndBitmapHeapScan(BitmapHeapScanState *node)
 	/*
 	 * release bitmaps and buffers if any
 	 */
-	if (node->tbmiterator)
-		tbm_end_iterate(node->tbmiterator);
-	if (node->prefetch_iterator)
-		tbm_end_iterate(node->prefetch_iterator);
 	if (node->tbm)
 		tbm_free(node->tbm);
-	if (node->shared_tbmiterator)
-		tbm_end_shared_iterate(node->shared_tbmiterator);
-	if (node->shared_prefetch_iterator)
-		tbm_end_shared_iterate(node->shared_prefetch_iterator);
-	if (node->vmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->vmbuffer);
-	if (node->pvmbuffer != InvalidBuffer)
-		ReleaseBuffer(node->pvmbuffer);
 
 	/*
 	 * close heap scan
@@ -692,6 +286,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 {
 	BitmapHeapScanState *scanstate;
 	Relation	currentRelation;
+	bool		can_skip_fetch;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -711,32 +306,10 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->ss.ps.ExecProcNode = ExecBitmapHeapScan;
 
 	scanstate->tbm = NULL;
-	scanstate->tbmiterator = NULL;
-	scanstate->tbmres = NULL;
-	scanstate->return_empty_tuples = 0;
-	scanstate->vmbuffer = InvalidBuffer;
-	scanstate->pvmbuffer = InvalidBuffer;
-	scanstate->exact_pages = 0;
-	scanstate->lossy_pages = 0;
-	scanstate->prefetch_iterator = NULL;
-	scanstate->prefetch_pages = 0;
-	scanstate->prefetch_target = 0;
 	scanstate->pscan_len = 0;
 	scanstate->initialized = false;
-	scanstate->shared_tbmiterator = NULL;
-	scanstate->shared_prefetch_iterator = NULL;
 	scanstate->pstate = NULL;
 
-	/*
-	 * We can potentially skip fetching heap pages if we do not need any
-	 * columns of the table, either for checking non-indexable quals or for
-	 * returning data.  This test is a bit simplistic, as it checks the
-	 * stronger condition that there's no qual or return tlist at all.  But in
-	 * most cases it's probably not worth working harder than that.
-	 */
-	scanstate->can_skip_fetch = (node->scan.plan.qual == NIL &&
-								 node->scan.plan.targetlist == NIL);
-
 	/*
 	 * Miscellaneous initialization
 	 *
@@ -775,19 +348,22 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->bitmapqualorig =
 		ExecInitQual(node->bitmapqualorig, (PlanState *) scanstate);
 
+	scanstate->ss.ss_currentRelation = currentRelation;
+
 	/*
-	 * Maximum number of prefetches for the tablespace if configured,
-	 * otherwise the current value of the effective_io_concurrency GUC.
+	 * We can potentially skip fetching heap pages if we do not need any
+	 * columns of the table, either for checking non-indexable quals or for
+	 * returning data.  This test is a bit simplistic, as it checks the
+	 * stronger condition that there's no qual or return tlist at all.  But in
+	 * most cases it's probably not worth working harder than that.
 	 */
-	scanstate->prefetch_maximum =
-		get_tablespace_io_concurrency(currentRelation->rd_rel->reltablespace);
-
-	scanstate->ss.ss_currentRelation = currentRelation;
+	can_skip_fetch = (node->scan.plan.qual == NIL &&
+					  node->scan.plan.targetlist == NIL);
 
 	scanstate->ss.ss_currentScanDesc = table_beginscan_bm(currentRelation,
 														  estate->es_snapshot,
 														  0,
-														  NULL);
+														  NULL, can_skip_fetch);
 
 	/*
 	 * all done.
@@ -872,12 +448,9 @@ ExecBitmapHeapInitializeDSM(BitmapHeapScanState *node,
 	pstate = shm_toc_allocate(pcxt->toc, node->pscan_len);
 
 	pstate->tbmiterator = 0;
-	pstate->prefetch_iterator = 0;
 
 	/* Initialize the mutex */
 	SpinLockInit(&pstate->mutex);
-	pstate->prefetch_pages = 0;
-	pstate->prefetch_target = 0;
 	pstate->state = BM_INITIAL;
 
 	ConditionVariableInit(&pstate->cv);
@@ -909,11 +482,7 @@ ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
 	if (DsaPointerIsValid(pstate->tbmiterator))
 		tbm_free_shared_area(dsa, pstate->tbmiterator);
 
-	if (DsaPointerIsValid(pstate->prefetch_iterator))
-		tbm_free_shared_area(dsa, pstate->prefetch_iterator);
-
 	pstate->tbmiterator = InvalidDsaPointer;
-	pstate->prefetch_iterator = InvalidDsaPointer;
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index e8ab5d78fc..3f84abeb90 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -181,7 +181,6 @@ struct TBMIterator
 	int			spageptr;		/* next spages index */
 	int			schunkptr;		/* next schunks index */
 	int			schunkbit;		/* next bit to check in current schunk */
-	TBMIterateResult output;	/* MUST BE LAST (because variable-size) */
 };
 
 /*
@@ -222,7 +221,6 @@ struct TBMSharedIterator
 	PTEntryArray *ptbase;		/* pagetable element array */
 	PTIterationArray *ptpages;	/* sorted exact page index list */
 	PTIterationArray *ptchunks; /* sorted lossy page index list */
-	TBMIterateResult output;	/* MUST BE LAST (because variable-size) */
 };
 
 /* Local function prototypes */
@@ -696,8 +694,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
 	 * Create the TBMIterator struct, with enough trailing space to serve the
 	 * needs of the TBMIterateResult sub-struct.
 	 */
-	iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
-									  MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
+	iterator = (TBMIterator *) palloc(sizeof(TBMIterator));
 	iterator->tbm = tbm;
 
 	/*
@@ -958,20 +955,21 @@ tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
 /*
  * tbm_iterate - scan through next page of a TIDBitmap
  *
- * Returns a TBMIterateResult representing one page, or NULL if there are
- * no more pages to scan.  Pages are guaranteed to be delivered in numerical
- * order.  If result->ntuples < 0, then the bitmap is "lossy" and failed to
- * remember the exact tuples to look at on this page --- the caller must
- * examine all tuples on the page and check if they meet the intended
- * condition.  If result->recheck is true, only the indicated tuples need
- * be examined, but the condition must be rechecked anyway.  (For ease of
- * testing, recheck is always set true when ntuples < 0.)
+ * Caller must pass in a TBMIterateResult to be filled.
+ *
+ * Pages are guaranteed to be delivered in numerical order.  tbmres->blockno is
+ * set to InvalidBlockNumber when there are no more pages to scan. If
+ * tbmres->ntuples < 0, then the bitmap is "lossy" and failed to remember the
+ * exact tuples to look at on this page --- the caller must examine all tuples
+ * on the page and check if they meet the intended condition.  If
+ * tbmres->recheck is true, only the indicated tuples need be examined, but the
+ * condition must be rechecked anyway.  (For ease of testing, recheck is always
+ * set true when ntuples < 0.)
  */
-TBMIterateResult *
-tbm_iterate(TBMIterator *iterator)
+void
+tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres)
 {
 	TIDBitmap  *tbm = iterator->tbm;
-	TBMIterateResult *output = &(iterator->output);
 
 	Assert(tbm->iterating == TBM_ITERATING_PRIVATE);
 
@@ -999,6 +997,7 @@ tbm_iterate(TBMIterator *iterator)
 	 * If both chunk and per-page data remain, must output the numerically
 	 * earlier page.
 	 */
+	Assert(tbmres);
 	if (iterator->schunkptr < tbm->nchunks)
 	{
 		PagetableEntry *chunk = tbm->schunks[iterator->schunkptr];
@@ -1009,11 +1008,11 @@ tbm_iterate(TBMIterator *iterator)
 			chunk_blockno < tbm->spages[iterator->spageptr]->blockno)
 		{
 			/* Return a lossy page indicator from the chunk */
-			output->blockno = chunk_blockno;
-			output->ntuples = -1;
-			output->recheck = true;
+			tbmres->blockno = chunk_blockno;
+			tbmres->ntuples = -1;
+			tbmres->recheck = true;
 			iterator->schunkbit++;
-			return output;
+			return;
 		}
 	}
 
@@ -1029,16 +1028,17 @@ tbm_iterate(TBMIterator *iterator)
 			page = tbm->spages[iterator->spageptr];
 
 		/* scan bitmap to extract individual offset numbers */
-		ntuples = tbm_extract_page_tuple(page, output);
-		output->blockno = page->blockno;
-		output->ntuples = ntuples;
-		output->recheck = page->recheck;
+		ntuples = tbm_extract_page_tuple(page, tbmres);
+		tbmres->blockno = page->blockno;
+		tbmres->ntuples = ntuples;
+		tbmres->recheck = page->recheck;
 		iterator->spageptr++;
-		return output;
+		return;
 	}
 
 	/* Nothing more in the bitmap */
-	return NULL;
+	tbmres->blockno = InvalidBlockNumber;
+	return;
 }
 
 /*
@@ -1048,10 +1048,9 @@ tbm_iterate(TBMIterator *iterator)
  *	across multiple processes.  We need to acquire the iterator LWLock,
  *	before accessing the shared members.
  */
-TBMIterateResult *
-tbm_shared_iterate(TBMSharedIterator *iterator)
+void
+tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres)
 {
-	TBMIterateResult *output = &iterator->output;
 	TBMSharedIteratorState *istate = iterator->state;
 	PagetableEntry *ptbase = NULL;
 	int		   *idxpages = NULL;
@@ -1102,13 +1101,13 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
 			chunk_blockno < ptbase[idxpages[istate->spageptr]].blockno)
 		{
 			/* Return a lossy page indicator from the chunk */
-			output->blockno = chunk_blockno;
-			output->ntuples = -1;
-			output->recheck = true;
+			tbmres->blockno = chunk_blockno;
+			tbmres->ntuples = -1;
+			tbmres->recheck = true;
 			istate->schunkbit++;
 
 			LWLockRelease(&istate->lock);
-			return output;
+			return;
 		}
 	}
 
@@ -1118,21 +1117,22 @@ tbm_shared_iterate(TBMSharedIterator *iterator)
 		int			ntuples;
 
 		/* scan bitmap to extract individual offset numbers */
-		ntuples = tbm_extract_page_tuple(page, output);
-		output->blockno = page->blockno;
-		output->ntuples = ntuples;
-		output->recheck = page->recheck;
+		ntuples = tbm_extract_page_tuple(page, tbmres);
+		tbmres->blockno = page->blockno;
+		tbmres->ntuples = ntuples;
+		tbmres->recheck = page->recheck;
 		istate->spageptr++;
 
 		LWLockRelease(&istate->lock);
 
-		return output;
+		return;
 	}
 
 	LWLockRelease(&istate->lock);
 
 	/* Nothing more in the bitmap */
-	return NULL;
+	tbmres->blockno = InvalidBlockNumber;
+	return;
 }
 
 /*
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 21e71d6091..e28fb71aaf 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -20,6 +20,7 @@
 #include "access/skey.h"
 #include "access/table.h"		/* for backward compatibility */
 #include "access/tableam.h"
+#include "nodes/execnodes.h"
 #include "nodes/lockoptions.h"
 #include "nodes/primnodes.h"
 #include "storage/bufpage.h"
@@ -76,9 +77,15 @@ typedef struct HeapScanDescData
 	struct PgStreamingRead *pgsr;
 
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
+
+	/*
+	 * MFIXME: not sure if vmbuffer is being released in the right place.
+	 */
+	Buffer		vmbuffer;		/* current VM buffer for BHS */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
 	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
+	int			empty_tuples;
 }			HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
 
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 521043304a..cdb44316a3 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -16,6 +16,7 @@
 
 #include "access/htup_details.h"
 #include "access/itup.h"
+#include "nodes/tidbitmap.h"
 #include "port/atomics.h"
 #include "storage/buf.h"
 #include "storage/spin.h"
@@ -40,6 +41,13 @@ typedef struct TableScanDescData
 	ItemPointerData rs_mintid;
 	ItemPointerData rs_maxtid;
 
+	/* Only used for Bitmap table scans */
+	TBMIterator *tbmiterator;
+	TBMSharedIterator *shared_tbmiterator;
+	long		exact_pages;
+	long		lossy_pages;
+	int			empty_tuples;
+
 	/*
 	 * Information about type and behaviour of the scan, a bitmask of members
 	 * of the ScanOptions enum (see tableam.h).
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5f8474871d..b5626805a6 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -32,6 +32,7 @@ extern PGDLLIMPORT char *default_table_access_method;
 extern PGDLLIMPORT bool synchronize_seqscans;
 
 
+struct BitmapHeapScanState;
 struct BulkInsertStateData;
 struct IndexInfo;
 struct SampleScanState;
@@ -62,7 +63,8 @@ typedef enum ScanOptions
 
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
-}			ScanOptions;
+	SO_CAN_SKIP_FETCH = 1 << 10
+} ScanOptions;
 
 /*
  * Result codes for table_{update,delete,lock_tuple}, and for visibility
@@ -773,53 +775,36 @@ typedef struct TableAmRoutine
 	 */
 
 	/*
-	 * Prepare to fetch / check / return tuples from `tbmres->blockno` as part
-	 * of a bitmap table scan. `scan` was started via table_beginscan_bm().
-	 * Return false if there are no tuples to be found on the page, true
-	 * otherwise.
+	 * Prepare to fetch / check / return tuples obtained from the streaming
+	 * read helper as part of a bitmap table scan. `scan` was started via
+	 * table_beginscan_bm(). Return false if the relation is exhausted and
+	 * true otherwise.
 	 *
 	 * This will typically read and pin the target block, and do the necessary
 	 * work to allow scan_bitmap_next_tuple() to return tuples (e.g. it might
 	 * make sense to perform tuple visibility checks at this time). For some
-	 * AMs it will make more sense to do all the work referencing `tbmres`
-	 * contents here, for others it might be better to defer more work to
-	 * scan_bitmap_next_tuple.
+	 * AMs, it might be better to defer more work to scan_bitmap_next_tuple().
 	 *
-	 * If `tbmres->blockno` is -1, this is a lossy scan and all visible tuples
-	 * on the page have to be returned, otherwise the tuples at offsets in
-	 * `tbmres->offsets` need to be returned.
+	 * After examining the page from the TBMIterateResult returned by the
+	 * streaming read helper, most AMs will set relevant fields in the `scan`
+	 * for the benefit of scan_bitmap_next_tuple(). All AMs must set
+	 * `recheck`, passed in by the caller, based on the value in the
+	 * TBMIterateResult so that BitmapHeapNext() can determine whether or not
+	 * to yield tuples.
 	 *
-	 * XXX: Currently this may only be implemented if the AM uses md.c as its
-	 * storage manager, and uses ItemPointer->ip_blkid in a manner that maps
-	 * blockids directly to the underlying storage. nodeBitmapHeapscan.c
-	 * performs prefetching directly using that interface.  This probably
-	 * needs to be rectified at a later point.
-	 *
-	 * XXX: Currently this may only be implemented if the AM uses the
-	 * visibilitymap, as nodeBitmapHeapscan.c unconditionally accesses it to
-	 * perform prefetching.  This probably needs to be rectified at a later
-	 * point.
-	 *
-	 * Optional callback, but either both scan_bitmap_next_block and
-	 * scan_bitmap_next_tuple need to exist, or neither.
+	 * Optional callback, but scan_bitmap_setup, scan_bitmap_next_block, and
+	 * scan_bitmap_next_tuple need to exist, or none of them.
 	 */
-	bool		(*scan_bitmap_next_block) (TableScanDesc scan,
-										   struct TBMIterateResult *tbmres);
+	bool		(*scan_bitmap_next_block) (TableScanDesc scan, bool *recheck);
 
 	/*
 	 * Fetch the next tuple of a bitmap table scan into `slot` and return true
 	 * if a visible tuple was found, false otherwise.
 	 *
-	 * For some AMs it will make more sense to do all the work referencing
-	 * `tbmres` contents in scan_bitmap_next_block, for others it might be
-	 * better to defer more work to this callback.
-	 *
-	 * Optional callback, but either both scan_bitmap_next_block and
-	 * scan_bitmap_next_tuple need to exist, or neither.
+	 * Optional callback, but scan_bitmap_setup, scan_bitmap_next_block, and
+	 * scan_bitmap_next_tuple need to exist, or none of them.
 	 */
-	bool		(*scan_bitmap_next_tuple) (TableScanDesc scan,
-										   struct TBMIterateResult *tbmres,
-										   TupleTableSlot *slot);
+	bool		(*scan_bitmap_next_tuple) (TableScanDesc scan, TupleTableSlot *slot);
 
 	/*
 	 * Prepare to fetch tuples from the next block in a sample scan. Return
@@ -944,10 +929,13 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
  */
 static inline TableScanDesc
 table_beginscan_bm(Relation rel, Snapshot snapshot,
-				   int nkeys, struct ScanKeyData *key)
+				   int nkeys, struct ScanKeyData *key, bool can_skip_fetch)
 {
 	uint32		flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE;
 
+	if (can_skip_fetch)
+		flags |= SO_CAN_SKIP_FETCH;
+
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
 
@@ -1955,8 +1943,7 @@ table_relation_estimate_size(Relation rel, int32 *attr_widths,
  * used after verifying the presence (at plan time or such).
  */
 static inline bool
-table_scan_bitmap_next_block(TableScanDesc scan,
-							 struct TBMIterateResult *tbmres)
+table_scan_bitmap_next_block(TableScanDesc scan, bool *recheck)
 {
 	/*
 	 * We don't expect direct calls to table_scan_bitmap_next_block with valid
@@ -1966,8 +1953,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
 	if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
 		elog(ERROR, "unexpected table_scan_bitmap_next_block call during logical decoding");
 
-	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan,
-														   tbmres);
+	return scan->rs_rd->rd_tableam->scan_bitmap_next_block(scan, recheck);
 }
 
 /*
@@ -1979,9 +1965,7 @@ table_scan_bitmap_next_block(TableScanDesc scan,
  * returned false.
  */
 static inline bool
-table_scan_bitmap_next_tuple(TableScanDesc scan,
-							 struct TBMIterateResult *tbmres,
-							 TupleTableSlot *slot)
+table_scan_bitmap_next_tuple(TableScanDesc scan, TupleTableSlot *slot)
 {
 	/*
 	 * We don't expect direct calls to table_scan_bitmap_next_tuple with valid
@@ -1992,7 +1976,6 @@ table_scan_bitmap_next_tuple(TableScanDesc scan,
 		elog(ERROR, "unexpected table_scan_bitmap_next_tuple call during logical decoding");
 
 	return scan->rs_rd->rd_tableam->scan_bitmap_next_tuple(scan,
-														   tbmres,
 														   slot);
 }
 
diff --git a/src/include/executor/nodeBitmapHeapscan.h b/src/include/executor/nodeBitmapHeapscan.h
index ea003a9caa..39ae7b29ef 100644
--- a/src/include/executor/nodeBitmapHeapscan.h
+++ b/src/include/executor/nodeBitmapHeapscan.h
@@ -28,5 +28,6 @@ extern void ExecBitmapHeapReInitializeDSM(BitmapHeapScanState *node,
 										  ParallelContext *pcxt);
 extern void ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
 										   ParallelWorkerContext *pwcxt);
+extern void table_bitmap_scan_setup(BitmapHeapScanState *scanstate);
 
 #endif							/* NODEBITMAPHEAPSCAN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 444a5f0fd5..eab4ae0f54 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1682,11 +1682,7 @@ typedef enum
 /* ----------------
  *	 ParallelBitmapHeapState information
  *		tbmiterator				iterator for scanning current pages
- *		prefetch_iterator		iterator for prefetching ahead of current page
- *		mutex					mutual exclusion for the prefetching variable
- *								and state
- *		prefetch_pages			# pages prefetch iterator is ahead of current
- *		prefetch_target			current target prefetch distance
+ *		mutex					mutual exclusion for the state
  *		state					current state of the TIDBitmap
  *		cv						conditional wait variable
  *		phs_snapshot_data		snapshot data shared to workers
@@ -1695,10 +1691,7 @@ typedef enum
 typedef struct ParallelBitmapHeapState
 {
 	dsa_pointer tbmiterator;
-	dsa_pointer prefetch_iterator;
 	slock_t		mutex;
-	int			prefetch_pages;
-	int			prefetch_target;
 	SharedBitmapState state;
 	ConditionVariable cv;
 	char		phs_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
@@ -1709,22 +1702,9 @@ typedef struct ParallelBitmapHeapState
  *
  *		bitmapqualorig	   execution state for bitmapqualorig expressions
  *		tbm				   bitmap obtained from child index scan(s)
- *		tbmiterator		   iterator for scanning current pages
- *		tbmres			   current-page data
- *		can_skip_fetch	   can we potentially skip tuple fetches in this scan?
- *		return_empty_tuples number of empty tuples to return
- *		vmbuffer		   buffer for visibility-map lookups
- *		pvmbuffer		   ditto, for prefetched pages
- *		exact_pages		   total number of exact pages retrieved
- *		lossy_pages		   total number of lossy pages retrieved
- *		prefetch_iterator  iterator for prefetching ahead of current page
- *		prefetch_pages	   # pages prefetch iterator is ahead of current
- *		prefetch_target    current target prefetch distance
- *		prefetch_maximum   maximum value for prefetch_target
+ *		recheck			   whether or not the tuple needs to be rechecked before returning
  *		pscan_len		   size of the shared memory for parallel bitmap
  *		initialized		   is node is ready to iterate
- *		shared_tbmiterator	   shared iterator
- *		shared_prefetch_iterator shared iterator for prefetching
  *		pstate			   shared state for parallel bitmap scan
  * ----------------
  */
@@ -1733,22 +1713,9 @@ typedef struct BitmapHeapScanState
 	ScanState	ss;				/* its first field is NodeTag */
 	ExprState  *bitmapqualorig;
 	TIDBitmap  *tbm;
-	TBMIterator *tbmiterator;
-	TBMIterateResult *tbmres;
-	bool		can_skip_fetch;
-	int			return_empty_tuples;
-	Buffer		vmbuffer;
-	Buffer		pvmbuffer;
-	long		exact_pages;
-	long		lossy_pages;
-	TBMIterator *prefetch_iterator;
-	int			prefetch_pages;
-	int			prefetch_target;
-	int			prefetch_maximum;
+	bool		recheck;
 	Size		pscan_len;
 	bool		initialized;
-	TBMSharedIterator *shared_tbmiterator;
-	TBMSharedIterator *shared_prefetch_iterator;
 	ParallelBitmapHeapState *pstate;
 } BitmapHeapScanState;
 
diff --git a/src/include/nodes/tidbitmap.h b/src/include/nodes/tidbitmap.h
index 1945f0639b..97e40097fd 100644
--- a/src/include/nodes/tidbitmap.h
+++ b/src/include/nodes/tidbitmap.h
@@ -64,8 +64,8 @@ extern bool tbm_is_empty(const TIDBitmap *tbm);
 
 extern TBMIterator *tbm_begin_iterate(TIDBitmap *tbm);
 extern dsa_pointer tbm_prepare_shared_iterate(TIDBitmap *tbm);
-extern TBMIterateResult *tbm_iterate(TBMIterator *iterator);
-extern TBMIterateResult *tbm_shared_iterate(TBMSharedIterator *iterator);
+extern void tbm_iterate(TBMIterator *iterator, TBMIterateResult *tbmres);
+extern void tbm_shared_iterate(TBMSharedIterator *iterator, TBMIterateResult *tbmres);
 extern void tbm_end_iterate(TBMIterator *iterator);
 extern void tbm_end_shared_iterate(TBMSharedIterator *iterator);
 extern TBMSharedIterator *tbm_attach_shared_iterate(dsa_area *dsa,
-- 
2.39.3 (Apple Git-146)

v7-0007-WIP-pg_buffercache_invalidate.patchapplication/octet-stream; name=v7-0007-WIP-pg_buffercache_invalidate.patchDownload

From 666551917f2ec13201f1965b73d62864dbe1f1e9 Mon Sep 17 00:00:00 2001
From: Palak <palakchaturvedi2843@gmail.com>
Date: Fri, 30 Jun 2023 08:21:06 +0000
Subject: [PATCH v7 7/7] WIP: pg_buffercache_invalidate

See https://commitfest.postgresql.org/47/4426/
---
 contrib/pg_buffercache/Makefile               |  2 +-
 contrib/pg_buffercache/meson.build            |  1 +
 .../pg_buffercache--1.4--1.5.sql              |  6 ++
 contrib/pg_buffercache/pg_buffercache.control |  2 +-
 contrib/pg_buffercache/pg_buffercache_pages.c | 34 +++++++++
 src/backend/storage/buffer/bufmgr.c           | 71 +++++++++++++++++++
 src/include/storage/bufmgr.h                  |  3 +
 7 files changed, 117 insertions(+), 2 deletions(-)
 create mode 100644 contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql

diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index d6b58d4da9..eae65ead9e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,7 @@ OBJS = \
 EXTENSION = pg_buffercache
 DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
 	pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
-	pg_buffercache--1.3--1.4.sql
+	pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
 PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
 
 REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index c86e33cc95..1ca3452918 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -22,6 +22,7 @@ install_data(
   'pg_buffercache--1.2--1.3.sql',
   'pg_buffercache--1.2.sql',
   'pg_buffercache--1.3--1.4.sql',
+  'pg_buffercache--1.4--1.5.sql',
   'pg_buffercache.control',
   kwargs: contrib_data_args,
 )
diff --git a/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql b/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql
new file mode 100644
index 0000000000..c96848c77d
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql
@@ -0,0 +1,6 @@
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.5'" to load this file. \quit
+
+CREATE FUNCTION pg_buffercache_invalidate(IN int, IN bool default true)
+RETURNS bool
+AS 'MODULE_PATHNAME', 'pg_buffercache_invalidate'
+LANGUAGE C PARALLEL SAFE;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index a82ae5f9bb..5ee875f77d 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
 # pg_buffercache extension
 comment = 'examine the shared buffer cache'
-default_version = '1.4'
+default_version = '1.5'
 module_pathname = '$libdir/pg_buffercache'
 relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3316732365..93ee4ef724 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -64,6 +64,40 @@ PG_FUNCTION_INFO_V1(pg_buffercache_pages);
 PG_FUNCTION_INFO_V1(pg_buffercache_summary);
 PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
 
+PG_FUNCTION_INFO_V1(pg_buffercache_invalidate);
+Datum
+pg_buffercache_invalidate(PG_FUNCTION_ARGS)
+{
+	Buffer		bufnum;
+	bool		result;
+	bool		force;
+
+	if (PG_ARGISNULL(0))
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("buffernum cannot be NULL")));
+	}
+
+	bufnum = PG_GETARG_INT32(0);
+	if (bufnum <= 0 || bufnum > NBuffers)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("buffernum is not valid")));
+
+	}
+
+	/*
+	 * Check whether to force invalidate the dirty buffer. The default value of force is true.
+	 */
+
+	force = PG_GETARG_BOOL(1);
+
+	result = TryInvalidateBuffer(bufnum, force);
+	PG_RETURN_BOOL(result);
+}
+
 Datum
 pg_buffercache_pages(PG_FUNCTION_ARGS)
 {
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d0e9c7deff..0f81d256cc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5942,3 +5942,74 @@ ResOwnerPrintBufferPin(Datum res)
 {
 	return DebugPrintBufferRefcount(DatumGetInt32(res));
 }
+
+/*
+ * Try Invalidating a buffer using bufnum.
+ * If the buffer is invalid, the function returns false.
+ * The function checks for dirty buffer and flushes the dirty buffer before invalidating.
+ * If the buffer is still dirty it returns false.
+ */
+bool
+TryInvalidateBuffer(Buffer bufnum, bool force)
+{
+	BufferDesc *bufHdr = GetBufferDescriptor(bufnum - 1);
+	uint32		buf_state;
+	ReservePrivateRefCountEntry();
+
+	buf_state = LockBufHdr(bufHdr);
+	if ((buf_state & BM_VALID) == BM_VALID)
+	{
+		/*
+		 * The buffer is pinned therefore cannot invalidate.
+		 */
+		if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
+		{
+			UnlockBufHdr(bufHdr, buf_state);
+			return false;
+		}
+		if ((buf_state & BM_DIRTY) == BM_DIRTY)
+		{
+			/*
+			 * If the buffer is dirty and the user has not asked to clear the dirty buffer return false.
+			 * Otherwise clear the dirty buffer.
+			 */
+			if(!force){
+				return false;
+			}
+			/*
+			 * Try once to flush the dirty buffer.
+			 */
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+			PinBuffer_Locked(bufHdr);
+			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
+			FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+			LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+			UnpinBuffer(bufHdr);
+			buf_state = LockBufHdr(bufHdr);
+			if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
+			{
+				UnlockBufHdr(bufHdr, buf_state);
+				return false;
+			}
+
+			/*
+			 * If its dirty again or not valid anymore give up.
+			 */
+
+			if ((buf_state & (BM_DIRTY | BM_VALID)) != (BM_VALID))
+			{
+				UnlockBufHdr(bufHdr, buf_state);
+				return false;
+			}
+
+		}
+
+		InvalidateBuffer(bufHdr);
+		return true;
+	}
+	else
+	{
+		UnlockBufHdr(bufHdr, buf_state);
+		return false;
+	}
+}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index b57f71f97e..f57d76ae35 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -295,6 +295,9 @@ extern bool BgBufferSync(struct WritebackContext *wb_context);
 extern void LimitAdditionalPins(uint32 *additional_pins);
 extern void LimitAdditionalLocalPins(uint32 *additional_pins);
 
+
+extern bool TryInvalidateBuffer(Buffer bufnum, bool force);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
-- 
2.39.3 (Apple Git-146)

random.sqlapplication/octet-stream; name=random.sqlDownload

#35

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#34)

Re: Streaming I/O, vectored I/O (WIP)

I am planning to push the bufmgr.c patch soon. At that point the new
API won't have any direct callers yet, but the traditional
ReadBuffer() family of functions will internally reach
StartReadBuffers(nblocks=1) followed by WaitReadBuffers(),
ZeroBuffer() or nothing as appropriate. Any more thoughts or
objections? Naming, semantics, correctness of buffer protocol,
sufficiency of comments, something else?

#36

Nazir Bilal Yavuz

byavuz81@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#35)

Re: Streaming I/O, vectored I/O (WIP)

Hi,

On Sat, 16 Mar 2024 at 02:53, Thomas Munro <thomas.munro@gmail.com> wrote:

I am planning to push the bufmgr.c patch soon. At that point the new
API won't have any direct callers yet, but the traditional
ReadBuffer() family of functions will internally reach
StartReadBuffers(nblocks=1) followed by WaitReadBuffers(),
ZeroBuffer() or nothing as appropriate. Any more thoughts or
objections? Naming, semantics, correctness of buffer protocol,
sufficiency of comments, something else?

+    if (StartReadBuffers(bmr,
+                         &buffer,
+                         forkNum,
+                         blockNum,
+                         &nblocks,
+                         strategy,
+                         flags,
+                         &operation))
+        WaitReadBuffers(&operation);

I think we need to call WaitReadBuffers when 'mode !=
RBM_ZERO_AND_CLEANUP_LOCK && mode != RBM_ZERO_AND_LOCK' or am I
missing something?

Couple of nitpicks:

It would be nice to explain what the PrepareReadBuffer function does
with a comment.

+    if (nblocks == 0)
+        return;                    /* nothing to do */
It is guaranteed that nblocks will be bigger than 0. Can't we just use
Assert(operation->io_buffers_len > 0);?

--
Regards,
Nazir Bilal Yavuz
Microsoft

#37

Heikki Linnakangas

hlinnaka@iki.fi

almost 2 years ago

In reply to: Thomas Munro (#34)

Re: Streaming I/O, vectored I/O (WIP)

Some quick comments:

On 12/03/2024 15:02, Thomas Munro wrote:

src/backend/storage/aio/streaming_read.c
src/include/storage/streaming_read.h

Standard file header comments missing.

It would be nice to have a comment at the top of streaming_read.c,
explaining at a high level how the circular buffer, lookahead and all
that works. Maybe even some diagrams.

For example, what is head and what is tail? Before reading the code, I
assumed that 'head' was the next block range to return in
pg_streaming_read_buffer_get_next(). But I think it's actually the other
way round?

/*
* Create a new streaming read object that can be used to perform the
* equivalent of a series of ReadBuffer() calls for one fork of one relation.
* Internally, it generates larger vectored reads where possible by looking
* ahead.
*/
PgStreamingRead *
pg_streaming_read_buffer_alloc(int flags,
void *pgsr_private,
size_t per_buffer_data_size,
BufferAccessStrategy strategy,
BufferManagerRelation bmr,
ForkNumber forknum,
PgStreamingReadBufferCB next_block_cb)

I'm not a fan of the name, especially the 'alloc' part. Yeah, most of
the work it does is memory allocation. But I'd suggest something like
'pg_streaming_read_begin' instead.

Do we really need the pg_ prefix in these?

Buffer
pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)

Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next',
for a shorter name.

/*
* pgsr->ranges is a circular buffer. When it is empty, head == tail.
* When it is full, there is an empty element between head and tail. Head
* can also be empty (nblocks == 0), therefore we need two extra elements
* for non-occupied ranges, on top of max_pinned_buffers to allow for the
* maxmimum possible number of occupied ranges of the smallest possible
* size of one.
*/
size = max_pinned_buffers + 2;

I didn't understand this explanation for why it's + 2.

/*
* Skip the initial ramp-up phase if the caller says we're going to be
* reading the whole relation. This way we start out doing full-sized
* reads.
*/
if (flags & PGSR_FLAG_FULL)
pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers);
else
pgsr->distance = 1;

Should this be "Max(MAX_BUFFERS_PER_TRANSFER,
pgsr->max_pinned_buffers)"? max_pinned_buffers cannot be smaller than
MAX_BUFFERS_PER_TRANSFER though, given how it's initialized earlier. So
perhaps just 'pgsr->distance = pgsr->max_pinned_buffers' ?

--
Heikki Linnakangas
Neon (https://neon.tech)

#38

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Heikki Linnakangas (#37)

3 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 12/03/2024 15:02, Thomas Munro wrote:

src/backend/storage/aio/streaming_read.c
src/include/storage/streaming_read.h

Standard file header comments missing.

Fixed.

It would be nice to have a comment at the top of streaming_read.c,
explaining at a high level how the circular buffer, lookahead and all
that works. Maybe even some diagrams.

Done.

For example, what is head and what is tail? Before reading the code, I
assumed that 'head' was the next block range to return in
pg_streaming_read_buffer_get_next(). But I think it's actually the other
way round?

Yeah. People seem to have different natural intuitions about head vs
tail in this sort of thing, so I've switched to descriptive names:
stream->{oldest,next}_buffer_index (see also below).

/*
* Create a new streaming read object that can be used to perform the
* equivalent of a series of ReadBuffer() calls for one fork of one relation.
* Internally, it generates larger vectored reads where possible by looking
* ahead.
*/
PgStreamingRead *
pg_streaming_read_buffer_alloc(int flags,
void *pgsr_private,
size_t per_buffer_data_size,
BufferAccessStrategy strategy,
BufferManagerRelation bmr,
ForkNumber forknum,
PgStreamingReadBufferCB next_block_cb)

I'm not a fan of the name, especially the 'alloc' part. Yeah, most of
the work it does is memory allocation. But I'd suggest something like
'pg_streaming_read_begin' instead.

I like it. Done.

Do we really need the pg_ prefix in these?

Good question. My understanding of our convention is that pg_ is
needed for local replacements/variants/extensions of things with well
known names (pg_locale_t, pg_strdup(), yada yada), and perhaps also in
a few places where the word is very common/short and we want to avoid
collisions and make sure it's obviously ours (pg_popcount?), and I
guess places that reflect the name of a SQL identifier with a prefix,
but this doesn't seem to qualify for any of those things. It's a new
thing, our own thing entirely, and sufficiently distinctive and
unconfusable with standard stuff. So, prefix removed.

Lots of other patches on top of this one are using "pgsr" as a
variable name, ie containing that prefix; perhaps they would use "sr"
or "streaming_read" or "stream". I used "stream" in a few places in
this version.

Other names improved in this version IMHO: pgsr_private ->
callback_private. I find it clearer, as a way to indicate that the
provider of the callback "owns" it. I also reordered the arguments:
now it's streaming_read_buffer_begin(..., callback, callback_private,
per_buffer_data_size), to keep those three things together.

Buffer
pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)

Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next',
for a shorter name.

Hmm. The idea of 'buffer' appearing in a couple of names is that
there are conceptually other kinds of I/O that we might want to
stream, like raw files or buffers other than the buffer pool, maybe
even sockets, so this would be part of a family of similar interfaces.
I think it needs to be clear that this variant gives you buffers. I'm
OK with removing "get" but I guess it would be better to keep the
words in the same order across the three functions? What about these?

streaming_read_buffer_begin();
streaming_read_buffer_next();
streaming_read_buffer_end();

Tried like that in this version. Other ideas would be to make
"stream" the main noun, buffered_read_stream_begin() or something.
Ideas welcome.

It's also a bit grammatically weird to say StartReadBuffers() and
WaitReadBuffers() in the bufmgr API... Hmm. Perhaps we should just
call it ReadBuffers() and WaitForBufferIO()? Maybe surprising because
the former isn't just like ReadBuffer() ... but on the other hand no
one said it has to be, and sometimes it even is (when it gets a hit).
I suppose there could even be a flag READ_BUFFERS_WAIT or the opposite
to make the asynchrony optional or explicit if someone has a problem
with that.

(Hmm, that'd be a bit like the Windows native file API, where
ReadFile() is synchronous or asynchronous depending on flags.)

/*
* pgsr->ranges is a circular buffer. When it is empty, head == tail.
* When it is full, there is an empty element between head and tail. Head
* can also be empty (nblocks == 0), therefore we need two extra elements
* for non-occupied ranges, on top of max_pinned_buffers to allow for the
* maxmimum possible number of occupied ranges of the smallest possible
* size of one.
*/
size = max_pinned_buffers + 2;

I didn't understand this explanation for why it's + 2.

I think the logic was right but the explanation was incomplete, sorry.
It needed one gap between head and tail because head == tail means
empty, and another because the tail item was still 'live' for one
extra call until you examined it one more time to notice that all the
buffers had been extracted. In any case I have now deleted all that.
In this new version I don't need extra space between head and tail at
all, because empty is detected with stream->pinned_buffers == 0, and
tail moves forwards immediately when you consume buffers.

/*
* Skip the initial ramp-up phase if the caller says we're going to be
* reading the whole relation. This way we start out doing full-sized
* reads.
*/
if (flags & PGSR_FLAG_FULL)
pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers);
else
pgsr->distance = 1;

Should this be "Max(MAX_BUFFERS_PER_TRANSFER,
pgsr->max_pinned_buffers)"? max_pinned_buffers cannot be smaller than
MAX_BUFFERS_PER_TRANSFER though, given how it's initialized earlier. So
perhaps just 'pgsr->distance = pgsr->max_pinned_buffers' ?

Right, done.

Here are some other changes:

* I'm fairly happy with the ABC adaptive distance algorithm so far, I
think, but I spent more time tidying up the way it is implemented. I
didn't like the way each 'range' had buffer[MAX_BUFFERS_PER_TRANSFER],
so I created a new dense array stream->buffers that behaved as a
second circular queue.

* The above also made it trivial for MAX_BUFFERS_PER_TRANSFER to
become the GUC that it always wanted to be: buffer_io_size defaulting
to 128kB. Seems like a reasonable thing to have? Could also
influence things like bulk write? (The main problem I have with the
GUC currently is choosing a category, async resources is wrong....)

* By analogy, it started to look a bit funny that each range had room
for a ReadBuffersOperation, and we had enough ranges for
max_pinned_buffers * 1 block range. So I booted that out to another
dense array, of size max_ios.

* At the same time, Bilal and Andres had been complaining privately
about 'range' management overheads showing up in perf and creating a
regression against master on fully cached scans that do nothing else
(eg pg_prewarm, where we lookup, pin, unpin every page and do no I/O
and no CPU work with the page, a somewhat extreme case but a
reasonable way to isolate the management costs); having made the above
change, it suddenly seemed obvious that I should make the buffers
array the 'main' circular queue, pointing off to another place for
information required for dealing with misses. In this version, there
are no more range objects. This feels better and occupies and touches
less memory. See pictures below.

* The 'head range' is replaced by the pending_read_{blocknum,nblocks}
variables, which seems easier to understand. Essentially the
callback's block numbers are run-length compressed into there until
they can't be, at which point we start a read and start forming a new
pending read.

* A micro-optimisation arranges for the zero slots to be reused over
and over again if we're doing distance = 1, to avoid rotating through
memory for no benefit; I don't yet know if that pays for itself, it's
just a couple of lines...

* Various indexes and sizes that couldn't quite fit in uint8_t but
couldn't possibly exceed a few thousand because they are bounded by
numbers deriving from range-limited GUCs are now int16_t (while I was
looking for low hanging opportunities to reduce memory usage...)

In pictures, the circular queue arrangement changed from
max_pinned_buffers * fat range objects like this, where only the
per-buffer data was outside the range object (because its size is
variable), and in the worst case all-hit case there were a lot of
ranges with only one buffer in them:

ranges buf/data

                            +--------+--------------+----+      +-----+
                            |        |              |    |      |     |
                            +--------+--------------+----+      +-----+
                    tail -> | 10..10 | buffers[MAX] | op +----->|  ?  |
                            +--------+--------------+----+      +-----+
                            | 42..44 | buffers[MAX] | op +----->|  ?  |
                            +--------+--------------+----+      +-----+
                            | 60..60 | buffers[MAX] | op +--+   |  ?  |
                            +--------+--------------+----+  |   +-----+
                    head -> |        |              |    |  |   |  ?  |
                            +--------+--------------+----+  |   +-----+
                            |        |              |    |  +-->|  ?  |
                            +--------+--------------+----+      +-----+
                            |        |              |    |      |     |
                            +--------+--------------+----+      +-----+

... to something that you might call a "columnar" layout, where ops
are kicked out to their own array of size max_ios (a much smaller
number), and the buffers, per-buffer data and indexes pointing to
optional I/O objects are in parallel arrays of size
max_pinned_buffers, like this:

buffers buf/data buf/io ios (= ops)

+----+ +-----+ +---+ +--------+
| | | | | | +---->| 42..44 |
+----+ +-----+ +---+ | +--------+
oldest_buffer_index -> | 10 | | ? | | | | +-->| 60..60 |
+----+ +-----+ +---+ | | +--------+
| 42 | | ? | | 0 +--+ | | |
+----+ +-----+ +---+ | +--------+
| 43 | | ? | | | | | |
+----+ +-----+ +---+ | +--------+
| 44 | | ? | | | | | |
+----+ +-----+ +---+ | +--------+
| 60 | | ? | | 1 +----+
+----+ +-----+ +---+
next_buffer_index -> | | | | | |
+----+ +-----+ +---+

In other words, there is essentially no waste/padding now, and in the
all-hit case we stay in the zero'th elements of those arrays so they
can stay red hot. Still working on validating this refactoring with
other patches and test scenarios. I hope it's easier to understand,
and does a better job of explaining itself.

I'm also still processing a bunch of performance-related fixups mostly
for bufmgr.c sent by Andres off-list (things like: StartReadBuffer()
argument list is too wide, some things need inline, we should only
initialise the op if it will be needed, oh I squashed that last one
into the patch already), after he and Bilal studied some regressions
in cases with no I/O. And thinking about Bilal's earlier message
(extra read even when we're going to zero, oops, he's quite right
about that) and a patch he sent me for that. More on those soon.

Attachments:

v8-0001-Provide-vectored-variant-of-ReadBuffer.patchapplication/octet-stream; name=v8-0001-Provide-vectored-variant-of-ReadBuffer.patchDownload

From dc1e328ea85f9ffd636f06cf632a5fc2a0eb681a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v8 1/4] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Callers should respect the new GUC buffer_io_size, and the limit on
per-backend pins which is now exposed as a public interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of advice.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  14 +
 src/backend/storage/buffer/bufmgr.c           | 658 ++++++++++++------
 src/backend/storage/buffer/localbuf.c         |  14 +-
 src/backend/utils/misc/guc_tables.c           |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h                  |  45 +-
 src/tools/pgindent/typedefs.list              |   1 +
 7 files changed, 535 insertions(+), 212 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 65a6e6c408..3af86c5938 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2719,6 +2719,20 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-buffer-io-size" xreflabel="buffer_io_size">
+       <term><varname>buffer_io_size</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>buffer_io_size</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the target I/O size in operations that coalesce buffer I/O.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
        <term><varname>max_worker_processes</varname> (<type>integer</type>)
        <indexterm>
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c..b534767872 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -160,6 +165,12 @@ int			checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
 int			bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
 int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 
+/*
+ * How many buffers should be coalesced into single I/O operations where
+ * possible.
+ */
+int			buffer_io_size = DEFAULT_BUFFER_IO_SIZE;
+
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
@@ -471,10 +482,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,7 +510,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -781,7 +791,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -794,15 +803,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +825,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -994,35 +996,66 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
+/*
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
+ */
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * ReadBuffer_common -- common logic for all ReadBuffer variants
- *
- * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+				  BufferAccessStrategy strategy)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			nblocks;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1041,181 +1074,413 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	nblocks = 1;
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
+	else
+		flags = 0;
+	if (StartReadBuffers(bmr,
+						 &buffer,
+						 forkNum,
+						 blockNum,
+						 &nblocks,
+						 strategy,
+						 flags,
+						 &operation))
+		WaitReadBuffers(&operation);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+		ZeroBuffer(buffer, mode);
+
+	return buffer;
+}
+
+/*
+ * Pin a buffer for a given block.  *foundPtr is set to true if the block was
+ * already present, or false if more work is required to either read it in or
+ * zero it.
+ */
+static inline Buffer
+PinBufferForBlock(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
+	Assert(blockNum != P_NEW);
+
+	Assert(bmr.smgr);
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
+	return BufferDescriptorGetBuffer(bufHdr);
+}
+
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.
+ *
+ * If false is returned, no I/O is necessary and WaitReadBuffers() is not
+ * necessary.  If true is returned, one I/O has been started, and
+ * WaitReadBuffers() must be called with the same operation object before the
+ * buffers are accessed.  Along with the operation object, the caller-supplied
+ * array of buffers must remain valid until WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+bool
+StartReadBuffers(BufferManagerRelation bmr,
+				 Buffer *buffers,
+				 ForkNumber forkNum,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 BufferAccessStrategy strategy,
+				 int flags,
+				 ReadBuffersOperation *operation)
+{
+	int			actual_nblocks = *nblocks;
+	int			io_buffers_len = 0;
+
+	Assert(*nblocks > 0);
+	Assert(*nblocks <= MAX_BUFFER_IO_SIZE);
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	for (int i = 0; i < actual_nblocks; ++i)
+	{
+		bool		found;
+
+		buffers[i] = PinBufferForBlock(bmr,
+									   forkNum,
+									   blockNum + i,
+									   strategy,
+									   &found);
+
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = operation->nblocks = *nblocks = i + 1;
+			break;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			io_buffers_len++;
+		}
+	}
+
+	if (io_buffers_len > 0)
+	{
+		/* Populate extra information needed for I/O. */
+		operation->io_buffers_len = io_buffers_len;
+		operation->blocknum = blockNum;
+		operation->buffers = buffers;
+		operation->nblocks = actual_nblocks;
+		operation->bmr = bmr;
+		operation->forknum = forkNum;
+		operation->strategy = strategy;
+		operation->flags = flags;
+
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
+			/*
+			 * In theory we should only do this if PinBufferForBlock() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
 		}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
+	}
+	else
+	{
+		return false;
+	}
+}
+
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
 	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
 	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+	bmr = operation->bmr;
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFER_IO_SIZE];
+		void	   *io_pages[MAX_BUFFER_IO_SIZE];
+		instr_time	io_start;
+		BlockNumber io_first_block;
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PinBufferForBlock().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for PinBufferForBlock.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1488,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1286,19 +1547,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1615,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1650,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +2006,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2271,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2294,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2609,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2623,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3691,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5426,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5447,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a5..985a2c7049 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1e71e7db4a..7188947126 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3112,6 +3112,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"buffer_io_size",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Target size for coalescing reads and writes of buffered data blocks."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&buffer_io_size,
+		DEFAULT_BUFFER_IO_SIZE,
+		1, MAX_BUFFER_IO_SIZE,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
 			gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2244ee52f7..b7a4143df2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -203,6 +203,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#buffer_io_size = 128kB
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d335..1cc198bde2 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_BUFFER_IO_SIZE PG_IOV_MAX
+#define DEFAULT_BUFFER_IO_SIZE Min(MAX_BUFFER_IO_SIZE, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int buffer_io_size;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers().  Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int16		nblocks;
+	BufferAccessStrategy strategy;
+	int			flags;
+
+	/* Range of buffers, if we need to perform a read. */
+	int16		io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffers(BufferManagerRelation bmr,
+							 Buffer *buffers,
+							 ForkNumber forknum,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 BufferAccessStrategy strategy,
+							 int flags,
+							 ReadBuffersOperation *operation);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +290,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e2a0525dd4..9933c8f4b1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2286,6 +2286,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.39.3 (Apple Git-146)

v8-0002-Provide-API-for-streaming-reads-of-relations.patchapplication/octet-stream; name=v8-0002-Provide-API-for-streaming-reads-of-relations.patchDownload

From 41310926bbf8f49b2199471aa2b1da1942679706 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Feb 2024 00:01:42 +1300
Subject: [PATCH v8 2/4] Provide API for "streaming" reads of relations.

"Streaming reads" can be used as a more efficient replacement for
a series of ReadBuffer() calls.

The client code supplies a callback that can say which block to read
next, and then consumes individual buffers one at a time.  This division
allows streaming_read.c to build up large calls to StartReadBuffers(),
and issue fadvise() advice about future random reads in a systematic
way.

This API is based on an idea proposed by Andres Freund, to pave the way
for asynchronous I/O in future work as required to support direct I/O.
The main aim is to create an abstraction that insulates client code from
future improvements to the I/O subsystem.

An extended API may be necessary in future for more complicated cases
(for example recovery), but this basic API is thought to be sufficient
for many common usage patterns involving predictable access to a single
relation fork.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 655 +++++++++++++++++++++++
 src/backend/storage/meson.build          |   1 +
 src/include/storage/streaming_read.h     |  50 ++
 src/tools/pgindent/typedefs.list         |   1 +
 7 files changed, 727 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..eec03f6f2b 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 0000000000..bcab44c802
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 0000000000..39aef2a84a
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 0000000000..bf2759cd72
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,655 @@
+/*-------------------------------------------------------------------------
+ *
+ * streaming_read.c
+ *	  Mechanism for buffer access with look-ahead
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * Code that needs to access relation data typically pins blocks one at a
+ * time, often in a predictable order that might be sequential or data-driven.
+ * Calling the simple ReadBuffer() function for each block is inefficient,
+ * because blocks that are not yet in the buffer pool require I/O operations
+ * that are small and might stall waiting for storage.  This mechanism looks
+ * into the future and calls StartReadBuffers() and WaitReadBuffers() to read
+ * neighboring blocks together and ahead of time, with an adaptive look-ahead
+ * distance.
+ *
+ * A user-provided callback generates a stream of block numbers that is used
+ * to form reads of up to size buffer_io_size, by attempting to merge them
+ * with a pending read.  When that isn't possible, the existing pending read
+ * is sent to StartReadBuffers() so that a new one can begin to form.
+ *
+ * The algorithm for controlling the look-ahead distance tries to classify the
+ * stream into three ideal behaviors:
+ *
+ * A) No I/O is necessary, because the requested blocks are fully cached
+ * already.  There is no benefit to looking ahead more than one block, so
+ * distance is 1.  This is the default initial assumption.
+ *
+ * B) I/O is necessary, but fadvise is undesirable because the access is
+ * sequential, or impossible because direct I/O is enabled or the system
+ * doesn't support advice.  There is no benefit in looking ahead more than
+ * buffer_io_size (the GUC controlling physical read size), because in this
+ * case only goal is larger read system calls.  Looking further ahead would
+ * pin many buffers and perform speculative work looking ahead for no benefit.
+ *
+ * C) I/O is necesssary, it appears random, and this system supports fadvise.
+ * We'll look further ahead in order to reach the configured level of I/O
+ * concurrency.
+ *
+ * The distance increases rapidly and decays slowly, so that it moves towards
+ * those levels as different I/O patterns are discovered.  For example, a
+ * sequential scan of fully cached data doesn't bother looking ahead, but a
+ * sequential scan that hits a region of uncached blocks will start issuing
+ * increasingly wide read calls until it plateaus at buffer_io_size.
+ *
+ * The main data structure is a circular queue of buffers of size
+ * max_pinned_buffers, ready to be returned by streaming_read_buffer_next().
+ * Each buffer also has an optional variable sized object that is passed from
+ * the callback to the consumer of buffers.  A third array records whether
+ * WaitReadBuffers() must be called before returning the buffer, and if so,
+ * points to the relevant ReadBuffersOperation object.
+ *
+ * For example, if the callback return block numbers 10, 42, 43, 60 in
+ * successive calls, then these data structures might appear as follows:
+ *
+ *                          buffers buf/data buf/io       ios
+ *
+ *                          +----+  +-----+  +---+        +--------+
+ *                          |    |  |     |  |   |  +---->| 42..44 |
+ *                          +----+  +-----+  +---+  |     +--------+
+ *   oldest_buffer_index -> | 10 |  |  ?  |  |   |  | +-->| 60..60 |
+ *                          +----+  +-----+  +---+  | |   +--------+
+ *                          | 42 |  |  ?  |  | 0 +--+ |   |        |
+ *                          +----+  +-----+  +---+    |   +--------+
+ *                          | 43 |  |  ?  |  |   |    |   |        |
+ *                          +----+  +-----+  +---+    |   +--------+
+ *                          | 44 |  |  ?  |  |   |    |   |        |
+ *                          +----+  +-----+  +---+    |   +--------+
+ *                          | 60 |  |  ?  |  | 1 +----+
+ *                          +----+  +-----+  +---+
+ *     next_buffer_index -> |    |  |     |  |   |
+ *                          +----+  +-----+  +---+
+ *
+ * In the example, 5 buffers are pinned, and the next buffer to be streamed to
+ * the client is block 10.  Block 10 was a hit and has no associated I/O, but
+ * the range 42..44 requires an I/O wait before its buffers are returned, as
+ * does block 60.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/storage/aio/streaming_read.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "catalog/pg_tablespace.h"
+#include "miscadmin.h"
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+#include "utils/spccache.h"
+
+/*
+ * Streaming read object.
+ */
+struct StreamingRead
+{
+	int16		max_ios;
+	int16		ios_in_progress;
+	int16		max_pinned_buffers;
+	int16		pinned_buffers;
+	int16		distance;
+	bool		started;
+	bool		finished;
+	bool		advice_enabled;
+
+	StreamingReadBufferCB callback;
+	void	   *callback_private;
+
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	/* Sometimes we need to buffer one block for flow control. */
+	BlockNumber unget_blocknum;
+	void	   *unget_per_buffer_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* The read operation we are currently preparing. */
+	BlockNumber pending_read_blocknum;
+	int16		pending_read_nblocks;
+
+	/* Space for buffers and optional per-buffer private data. */
+	Buffer	   *buffers;
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+	int16	   *buffer_io_indexes;
+
+	/* Read operations that have been started by not waited for yet. */
+	ReadBuffersOperation *ios;
+	int16		next_io_index;
+
+	int16		oldest_buffer_index;	/* Next pinned buffer to return */
+	int16		next_buffer_index;	/* Slot for next buffer to pin */
+};
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+StreamingRead *
+streaming_read_buffer_begin(int flags,
+							BufferAccessStrategy strategy,
+							BufferManagerRelation bmr,
+							ForkNumber forknum,
+							StreamingReadBufferCB callback,
+							void *callback_private,
+							size_t per_buffer_data_size)
+{
+	StreamingRead *stream;
+	int16		max_ios;
+	uint32		max_pinned_buffers;
+	Oid			tablespace_id;
+
+	/*
+	 * Make sure our bmr's smgr and persistent are populated.  The caller
+	 * asserts that the storage manager will remain valid.
+	 */
+	if (!bmr.smgr)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice to the kernel to tell it that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	tablespace_id = bmr.smgr->smgr_rlocator.locator.spcOid;
+	if (!OidIsValid(MyDatabaseId) ||
+		(bmr.rel && IsCatalogRelation(bmr.rel)) ||
+		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
+	{
+		/*
+		 * Avoid circularity while trying to look up tablespace settings or
+		 * before spccache.c is ready.
+		 */
+		max_ios = effective_io_concurrency;
+	}
+	else if (flags & STREAMING_READ_MAINTENANCE)
+		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+	else
+		max_ios = get_tablespace_io_concurrency(tablespace_id);
+
+	/*
+	 * Choose a maximum number of buffers we're prepared to pin.  We try to
+	 * pin fewer if we can, though.  We clamp it to at least buffer_io_size so
+	 * that we can have a chance to build up a full sized read, even when
+	 * max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, buffer_io_size);
+
+	/* Don't allow this backend to pin more than its share of buffers. */
+	if (SmgrIsTemp(bmr.smgr))
+		LimitAdditionalLocalPins(&max_pinned_buffers);
+	else
+		LimitAdditionalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	stream = (StreamingRead *) palloc0(sizeof(StreamingRead));
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  We can use it as long as
+	 * direct I/O isn't enabled, the caller hasn't promised sequential access
+	 * (overriding our detection heuristics), and max_ios hasn't been set to
+	 * zero.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & STREAMING_READ_SEQUENTIAL) == 0 &&
+		max_ios > 0)
+		stream->advice_enabled = true;
+#endif
+
+	/*
+	 * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
+	 * above.  If we had real asynchronous I/O we might need a slightly
+	 * different definition.
+	 */
+	if (max_ios == 0)
+		max_ios = 1;
+
+	stream->max_ios = max_ios;
+	stream->per_buffer_data_size = per_buffer_data_size;
+	stream->max_pinned_buffers = max_pinned_buffers;
+	stream->strategy = strategy;
+
+	stream->bmr = bmr;
+	stream->forknum = forknum;
+	stream->callback = callback;
+	stream->callback_private = callback_private;
+
+	stream->unget_blocknum = InvalidBlockNumber;
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out doing full-sized
+	 * reads.
+	 */
+	if (flags & STREAMING_READ_FULL)
+		stream->distance = stream->max_pinned_buffers;
+	else
+		stream->distance = 1;
+
+	/*
+	 * Space for the buffers we pin.  Though we never pin more than
+	 * max_pinned_buffers, we want to be able to assume that all the buffers
+	 * for a single read are contiguous (i.e. don't wrap around halfway
+	 * through), so we let the final one run past that position temporarily by
+	 * allocating an extra buffer_io_size - 1 elements.
+	 */
+	stream->buffers = palloc((max_pinned_buffers + buffer_io_size - 1) *
+							 sizeof(stream->buffers[0]));
+
+	/* Space for per-buffer data, if configured. */
+	if (per_buffer_data_size)
+		stream->per_buffer_data =
+			palloc(per_buffer_data_size * (max_pinned_buffers +
+										   buffer_io_size - 1));
+
+	/* Space for the IOs that we might run. */
+	stream->buffer_io_indexes = palloc(max_pinned_buffers * sizeof(stream->buffer_io_indexes[0]));
+	stream->ios = palloc(max_ios * sizeof(ReadBuffersOperation));
+
+	return stream;
+}
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static void *
+get_per_buffer_data(StreamingRead *stream, int16 buffer_index)
+{
+	return (char *) stream->per_buffer_data +
+		stream->per_buffer_data_size * buffer_index;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow streaming_unget_block() to work.
+ */
+static BlockNumber
+streaming_read_get_block(StreamingRead *stream, void *per_buffer_data)
+{
+	BlockNumber result;
+
+	if (unlikely(stream->unget_blocknum != InvalidBlockNumber))
+	{
+		/*
+		 * If we had to unget a block, now it is time to return that one
+		 * again.
+		 */
+		result = stream->unget_blocknum;
+		stream->unget_blocknum = InvalidBlockNumber;
+
+		/*
+		 * The same per_buffer_data element must have been used, and still
+		 * contains whatever data the callback wrote into it.  So we just
+		 * sanity-check that we were called with the value that
+		 * streaming_unget_block() pushed back.
+		 */
+		Assert(per_buffer_data == stream->unget_per_buffer_data);
+	}
+	else
+	{
+		/* Use the installed callback directly. */
+		result = stream->callback(stream, stream->callback_private, per_buffer_data);
+	}
+
+	return result;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.  This *must* be called with the
+ * last value returned by streaming_get_block().
+ */
+static void
+streaming_read_unget_block(StreamingRead *stream, BlockNumber blocknum, void *per_buffer_data)
+{
+	Assert(stream->unget_blocknum == InvalidBlockNumber);
+	stream->unget_blocknum = blocknum;
+	stream->unget_per_buffer_data = per_buffer_data;
+}
+
+static void
+streaming_read_start_pending_read(StreamingRead *stream)
+{
+	bool		need_wait;
+	int			nblocks;
+	int16		io_index;
+	int16		overflow;
+	int			flags;
+
+	/* This should only be called with a pending read. */
+	Assert(stream->pending_read_nblocks > 0);
+	Assert(stream->pending_read_nblocks <= buffer_io_size);
+
+	/* We had better not exceed the pin limit by starting this read. */
+	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+		   stream->max_pinned_buffers);
+
+	/* We had better not be overwriting an existing pinned buffer. */
+	if (stream->pinned_buffers > 0)
+		Assert(stream->next_buffer_index != stream->oldest_buffer_index);
+	else
+		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
+
+	/*
+	 * If advice hasn't been suppressed, and this system supports it, this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (stream->advice_enabled &&
+		stream->started &&
+		stream->pending_read_blocknum != stream->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+	/* Suppress advice on the first call, because it's too late to benefit. */
+	if (!stream->started)
+		stream->started = true;
+
+	/* We say how many blocks we want to read, but may be smaller on return. */
+	nblocks = stream->pending_read_nblocks;
+	need_wait =
+		StartReadBuffers(stream->bmr,
+						 &stream->buffers[stream->next_buffer_index],
+						 stream->forknum,
+						 stream->pending_read_blocknum,
+						 &nblocks,
+						 stream->strategy,
+						 flags,
+						 &stream->ios[stream->next_io_index]);
+	stream->pinned_buffers += nblocks;
+
+	/* Remember whether we need to wait before returning this buffer. */
+	if (!need_wait)
+	{
+		io_index = -1;
+
+		/* Look-ahead distance decays, no I/O necessary (behavior A). */
+		if (stream->distance > 1)
+			stream->distance--;
+	}
+	else
+	{
+		/*
+		 * Remember to call WaitReadBuffers() before returning head buffer.
+		 * Look-ahead distance will be adjusted after waiting.
+		 */
+		io_index = stream->next_io_index;
+		if (++stream->next_io_index == stream->max_ios)
+			stream->next_io_index = 0;
+
+		Assert(stream->ios_in_progress < stream->max_ios);
+		stream->ios_in_progress++;
+	}
+
+	/* Set up the pointer to the I/O for the head buffer, if there is one. */
+	stream->buffer_io_indexes[stream->next_buffer_index] = io_index;
+
+	/*
+	 * We gave a contiguous range of buffer space to StartReadBuffers(), but
+	 * we want it to wrap around at max_pinned_buffers.  Move values that
+	 * overflowed into the extra space.  At the same time, put -1 in the I/O
+	 * slots for the rest of the buffers to indicate no I/O.  They are covered
+	 * by the head buffer's I/O, if there is one.  We avoid a % operator.
+	 */
+	overflow = (stream->next_buffer_index + nblocks) - stream->max_pinned_buffers;
+	if (overflow > 0)
+	{
+		memmove(&stream->buffers[0],
+				&stream->buffers[stream->max_pinned_buffers],
+				sizeof(stream->buffers[0]) * overflow);
+		for (int i = 0; i < overflow; ++i)
+			stream->buffer_io_indexes[i] = -1;
+		for (int i = 1; i < nblocks - overflow; ++i)
+			stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
+	}
+	else
+	{
+		for (int i = 1; i < nblocks; ++i)
+			stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
+	}
+
+	/*
+	 * Remember where the next block would be after that, so we can detect
+	 * sequential access next time and suppress advice.
+	 */
+	stream->seq_blocknum = stream->pending_read_blocknum + nblocks;
+
+	stream->next_buffer_index += nblocks;
+	stream->next_buffer_index %= stream->max_pinned_buffers;
+
+	/* Adjust the pending read to cover the remaining portion, if any. */
+	stream->pending_read_blocknum += nblocks;
+	stream->pending_read_nblocks -= nblocks;
+}
+
+static void
+streaming_read_look_ahead(StreamingRead *stream)
+{
+	while (!stream->finished &&
+		   stream->ios_in_progress < stream->max_ios &&
+		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+	{
+		BlockNumber blocknum;
+		int16		buffer_index;
+		void	   *per_buffer_data;
+
+		if (stream->pending_read_nblocks == buffer_io_size)
+		{
+			streaming_read_start_pending_read(stream);
+			continue;
+		}
+
+		/*
+		 * See which block the callback wants next in the stream.  We need to
+		 * compute the index of the Nth block of the pending read including
+		 * wrap-around, but we don't want to use the expensive % operator.
+		 */
+		buffer_index = stream->next_buffer_index + stream->pending_read_nblocks;
+		if (buffer_index > stream->max_pinned_buffers)
+			buffer_index -= stream->max_pinned_buffers;
+		per_buffer_data = get_per_buffer_data(stream, buffer_index);
+		blocknum = streaming_read_get_block(stream, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			stream->finished = true;
+			continue;
+		}
+
+		/* Can we merge it with the pending read? */
+		if (stream->pending_read_nblocks > 0 &&
+			stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
+		{
+			stream->pending_read_nblocks++;
+			continue;
+		}
+
+		/* We have to start the pending read before we can build another. */
+		if (stream->pending_read_nblocks > 0)
+		{
+			streaming_read_start_pending_read(stream);
+			if (stream->ios_in_progress == stream->max_ios)
+			{
+				/* And we've hit the limit.  Rewind, and stop here. */
+				streaming_read_unget_block(stream, blocknum, per_buffer_data);
+				return;
+			}
+		}
+
+		/* This is the start of a new pending read. */
+		stream->pending_read_blocknum = blocknum;
+		stream->pending_read_nblocks = 1;
+	}
+
+	/*
+	 * Normally we don't start the pending read just because we've hit a
+	 * limit, preferring to give it another chance to grow to a larger size
+	 * once more buffers have been consumed.  However, in cases where that
+	 * can't possibly happen, we might as well start the read immediately.
+	 */
+	if (((stream->pending_read_nblocks > 0 && stream->finished) ||
+		 (stream->pending_read_nblocks == stream->distance)) &&
+		stream->ios_in_progress < stream->max_ios)
+		streaming_read_start_pending_read(stream);
+}
+
+/*
+ * Pull one pinned buffer out of the stream.  Buffer holds the block numbers
+ * in the order specified by the callback.
+ */
+Buffer
+streaming_read_buffer_next(StreamingRead *stream, void **per_buffer_data)
+{
+	Buffer		buffer;
+	int16		io_index;
+	int16		oldest_buffer_index;
+
+	if (unlikely(stream->pinned_buffers == 0))
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+
+		if (stream->finished)
+			return InvalidBuffer;
+
+		/*
+		 * The usual order of operations is that we look ahead at the bottom
+		 * of this function after potentially finishing an I/O and making
+		 * space for more, but we need a special case to prime the stream when
+		 * we're getting started.
+		 */
+		Assert(!stream->started);
+		streaming_read_look_ahead(stream);
+		if (stream->pinned_buffers == 0)
+			return InvalidBuffer;
+	}
+
+	/* Grab the oldest pinned buffer and associated per-buffer data. */
+	oldest_buffer_index = stream->oldest_buffer_index;
+	Assert(oldest_buffer_index >= 0 &&
+		   oldest_buffer_index < stream->max_pinned_buffers);
+	buffer = stream->buffers[oldest_buffer_index];
+	if (per_buffer_data)
+		*per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
+
+	Assert(BufferIsValid(buffer));
+
+	/* Do we have to wait for an associated I/O first? */
+	io_index = stream->buffer_io_indexes[oldest_buffer_index];
+	Assert(io_index >= -1 && io_index < stream->max_ios);
+	if (io_index >= 0)
+	{
+		int			distance;
+
+		/* Sanity check that we still agree on the buffers. */
+		Assert(stream->ios[io_index].buffers == &stream->buffers[oldest_buffer_index]);
+
+		WaitReadBuffers(&stream->ios[io_index]);
+
+		Assert(stream->ios_in_progress > 0);
+		stream->ios_in_progress--;
+
+		if (stream->ios[io_index].flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/* Distance ramps up fast (behavior C). */
+			distance = stream->distance * 2;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
+		else
+		{
+			/* No advice; move towards full I/O size (behavior B). */
+			if (stream->distance > buffer_io_size)
+			{
+				stream->distance--;
+			}
+			else
+			{
+				distance = stream->distance * 2;
+				distance = Min(distance, buffer_io_size);
+				distance = Min(distance, stream->max_pinned_buffers);
+				stream->distance = distance;
+			}
+		}
+	}
+
+	/* Advance the oldest buffer, but clobber it first for debugging. */
+#ifdef USE_ASSERT_CHECKING
+	stream->buffers[oldest_buffer_index] = InvalidBuffer;
+	stream->buffer_io_indexes[oldest_buffer_index] = -1;
+	if (stream->per_buffer_data)
+		memset(get_per_buffer_data(stream, oldest_buffer_index),
+			   0xff,
+			   stream->per_buffer_data_size);
+#endif
+	if (++stream->oldest_buffer_index == stream->max_pinned_buffers)
+		stream->oldest_buffer_index = 0;
+
+	/* We are transferring ownership of the pin to the caller. */
+	Assert(stream->pinned_buffers > 0);
+	stream->pinned_buffers--;
+
+	/*
+	 * When distance is minimal, we finish up with no queued buffers.  As a
+	 * micro-optimization, we can then reset our circular queues, so that
+	 * all-cached streams re-use the same elements instead of rotating through
+	 * memory.
+	 */
+	if (stream->pinned_buffers == 0)
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+		stream->oldest_buffer_index = 0;
+		stream->next_buffer_index = 0;
+		stream->next_io_index = 0;
+	}
+
+	/* Prepare for the next call. */
+	streaming_read_look_ahead(stream);
+
+	return buffer;
+}
+
+void
+streaming_read_buffer_end(StreamingRead *stream)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	stream->finished = true;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = streaming_read_buffer_next(stream, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(stream->pinned_buffers == 0);
+	Assert(stream->ios_in_progress == 0);
+
+	/* Release memory. */
+	pfree(stream->buffers);
+	if (stream->per_buffer_data)
+		pfree(stream->per_buffer_data);
+	pfree(stream->ios);
+
+	pfree(stream);
+}
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca2..739d13293f 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 0000000000..9e4700d561
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,50 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define STREAMING_READ_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define STREAMING_READ_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define STREAMING_READ_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define STREAMING_READ_FULL 0x04
+
+struct StreamingRead;
+typedef struct StreamingRead StreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*StreamingReadBufferCB) (StreamingRead *stream,
+											  void *callback_private,
+											  void *per_buffer_data);
+
+extern StreamingRead *streaming_read_buffer_begin(int flags,
+												  BufferAccessStrategy strategy,
+												  BufferManagerRelation bmr,
+												  ForkNumber forknum,
+												  StreamingReadBufferCB callback,
+												  void *callback_private,
+												  size_t per_buffer_data_size);
+extern Buffer streaming_read_buffer_next(StreamingRead *stream, void **per_buffer_private);
+extern void streaming_read_buffer_end(StreamingRead *stream);
+
+#endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9933c8f4b1..967d9ca159 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2693,6 +2693,7 @@ StopList
 StrategyNumber
 StreamCtl
 StreamStopReason
+StreamingRead
 String
 StringInfo
 StringInfoData
-- 
2.39.3 (Apple Git-146)

v8-0003-Use-streaming-reads-in-pg_prewarm.patchapplication/octet-stream; name=v8-0003-Use-streaming-reads-in-pg_prewarm.patchDownload

From c12341647bc7f9aba4aafdadec0cccfa9cf78033 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 09:28:42 +1200
Subject: [PATCH v8 3/4] Use streaming reads in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use streaming reads.  This
provides a simple example of such a transformation, and generates fewer
system calls.

Reviewed-by:
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 contrib/pg_prewarm/pg_prewarm.c | 40 ++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e4..cd2e15aa3d 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(StreamingRead *stream,
+							   void *user_data,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = user_data;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		StreamingRead *stream;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		stream = streaming_read_buffer_begin(STREAMING_READ_FULL,
+											 NULL,
+											 BMR_REL(rel),
+											 forkNumber,
+											 pg_prewarm_streaming_read_next,
+											 &p,
+											 0);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = streaming_read_buffer_next(stream, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(streaming_read_buffer_next(stream, NULL) == InvalidBuffer);
+		streaming_read_buffer_end(stream);
 	}
 
 	/* Close relation, release lock. */
-- 
2.39.3 (Apple Git-146)

#39

Melanie Plageman

melanieplageman@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#38)

Re: Streaming I/O, vectored I/O (WIP)

On Sun, Mar 24, 2024 at 9:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 12/03/2024 15:02, Thomas Munro wrote:

src/backend/storage/aio/streaming_read.c
src/include/storage/streaming_read.h

Standard file header comments missing.

Fixed.

It would be nice to have a comment at the top of streaming_read.c,
explaining at a high level how the circular buffer, lookahead and all
that works. Maybe even some diagrams.

Done.

For example, what is head and what is tail? Before reading the code, I
assumed that 'head' was the next block range to return in
pg_streaming_read_buffer_get_next(). But I think it's actually the other
way round?

Yeah. People seem to have different natural intuitions about head vs
tail in this sort of thing, so I've switched to descriptive names:
stream->{oldest,next}_buffer_index (see also below).

/*
* Create a new streaming read object that can be used to perform the
* equivalent of a series of ReadBuffer() calls for one fork of one relation.
* Internally, it generates larger vectored reads where possible by looking
* ahead.
*/
PgStreamingRead *
pg_streaming_read_buffer_alloc(int flags,
void *pgsr_private,
size_t per_buffer_data_size,
BufferAccessStrategy strategy,
BufferManagerRelation bmr,
ForkNumber forknum,
PgStreamingReadBufferCB next_block_cb)

I'm not a fan of the name, especially the 'alloc' part. Yeah, most of
the work it does is memory allocation. But I'd suggest something like
'pg_streaming_read_begin' instead.

I like it. Done.

Do we really need the pg_ prefix in these?

Good question. My understanding of our convention is that pg_ is
needed for local replacements/variants/extensions of things with well
known names (pg_locale_t, pg_strdup(), yada yada), and perhaps also in
a few places where the word is very common/short and we want to avoid
collisions and make sure it's obviously ours (pg_popcount?), and I
guess places that reflect the name of a SQL identifier with a prefix,
but this doesn't seem to qualify for any of those things. It's a new
thing, our own thing entirely, and sufficiently distinctive and
unconfusable with standard stuff. So, prefix removed.

Lots of other patches on top of this one are using "pgsr" as a
variable name, ie containing that prefix; perhaps they would use "sr"
or "streaming_read" or "stream". I used "stream" in a few places in
this version.

Other names improved in this version IMHO: pgsr_private ->
callback_private. I find it clearer, as a way to indicate that the
provider of the callback "owns" it. I also reordered the arguments:
now it's streaming_read_buffer_begin(..., callback, callback_private,
per_buffer_data_size), to keep those three things together.

I haven't reviewed the whole patch, but as I was rebasing
bitmapheapscan streaming read user, I found callback_private confusing
because it seems like it is a private callback, not private data
belonging to the callback. Perhaps call it callback_private_data? Also
maybe mention what it is for in the comment above
streaming_read_buffer_begin() and in the StreamingRead structure
itself.

- Melanie

#40

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Melanie Plageman (#39)

3 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

On Mon, Mar 25, 2024 at 6:30 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I haven't reviewed the whole patch, but as I was rebasing
bitmapheapscan streaming read user, I found callback_private confusing
because it seems like it is a private callback, not private data
belonging to the callback. Perhaps call it callback_private_data? Also

WFM.

maybe mention what it is for in the comment above
streaming_read_buffer_begin() and in the StreamingRead structure
itself.

Yeah. I've tried to improve the comments on all three public
functions. I also moved the three public functions _begin(), _next(),
_end() to be next to each other after the static helper functions.

Working on perf regression/tuning reports today, more soon...

Attachments:

v9-0001-Provide-vectored-variant-of-ReadBuffer.patchtext/x-patch; charset=US-ASCII; name=v9-0001-Provide-vectored-variant-of-ReadBuffer.patchDownload

From edd3d078cf8d4b0c2f08df82295825f7107ec62b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v9 1/4] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Callers should respect the new GUC buffer_io_size, and the limit on
per-backend pins which is now exposed as a public interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of advice.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  14 +
 src/backend/storage/buffer/bufmgr.c           | 658 ++++++++++++------
 src/backend/storage/buffer/localbuf.c         |  14 +-
 src/backend/utils/misc/guc_tables.c           |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h                  |  45 +-
 src/tools/pgindent/typedefs.list              |   1 +
 7 files changed, 535 insertions(+), 212 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 65a6e6c4086..3af86c59384 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2719,6 +2719,20 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-buffer-io-size" xreflabel="buffer_io_size">
+       <term><varname>buffer_io_size</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>buffer_io_size</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the target I/O size in operations that coalesce buffer I/O.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
        <term><varname>max_worker_processes</varname> (<type>integer</type>)
        <indexterm>
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..b5347678726 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -160,6 +165,12 @@ int			checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
 int			bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
 int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 
+/*
+ * How many buffers should be coalesced into single I/O operations where
+ * possible.
+ */
+int			buffer_io_size = DEFAULT_BUFFER_IO_SIZE;
+
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
@@ -471,10 +482,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,7 +510,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -781,7 +791,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -794,15 +803,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +825,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -994,35 +996,66 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
+/*
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
+ */
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * ReadBuffer_common -- common logic for all ReadBuffer variants
- *
- * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+				  BufferAccessStrategy strategy)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			nblocks;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1041,181 +1074,413 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	nblocks = 1;
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
+	else
+		flags = 0;
+	if (StartReadBuffers(bmr,
+						 &buffer,
+						 forkNum,
+						 blockNum,
+						 &nblocks,
+						 strategy,
+						 flags,
+						 &operation))
+		WaitReadBuffers(&operation);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+		ZeroBuffer(buffer, mode);
+
+	return buffer;
+}
+
+/*
+ * Pin a buffer for a given block.  *foundPtr is set to true if the block was
+ * already present, or false if more work is required to either read it in or
+ * zero it.
+ */
+static inline Buffer
+PinBufferForBlock(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
+	Assert(blockNum != P_NEW);
+
+	Assert(bmr.smgr);
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
+	return BufferDescriptorGetBuffer(bufHdr);
+}
+
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.
+ *
+ * If false is returned, no I/O is necessary and WaitReadBuffers() is not
+ * necessary.  If true is returned, one I/O has been started, and
+ * WaitReadBuffers() must be called with the same operation object before the
+ * buffers are accessed.  Along with the operation object, the caller-supplied
+ * array of buffers must remain valid until WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+bool
+StartReadBuffers(BufferManagerRelation bmr,
+				 Buffer *buffers,
+				 ForkNumber forkNum,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 BufferAccessStrategy strategy,
+				 int flags,
+				 ReadBuffersOperation *operation)
+{
+	int			actual_nblocks = *nblocks;
+	int			io_buffers_len = 0;
+
+	Assert(*nblocks > 0);
+	Assert(*nblocks <= MAX_BUFFER_IO_SIZE);
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	for (int i = 0; i < actual_nblocks; ++i)
+	{
+		bool		found;
+
+		buffers[i] = PinBufferForBlock(bmr,
+									   forkNum,
+									   blockNum + i,
+									   strategy,
+									   &found);
+
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = operation->nblocks = *nblocks = i + 1;
+			break;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			io_buffers_len++;
+		}
+	}
+
+	if (io_buffers_len > 0)
+	{
+		/* Populate extra information needed for I/O. */
+		operation->io_buffers_len = io_buffers_len;
+		operation->blocknum = blockNum;
+		operation->buffers = buffers;
+		operation->nblocks = actual_nblocks;
+		operation->bmr = bmr;
+		operation->forknum = forkNum;
+		operation->strategy = strategy;
+		operation->flags = flags;
+
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
+			/*
+			 * In theory we should only do this if PinBufferForBlock() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
 		}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
+	}
+	else
+	{
+		return false;
+	}
+}
+
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
 	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
 	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+	bmr = operation->bmr;
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFER_IO_SIZE];
+		void	   *io_pages[MAX_BUFFER_IO_SIZE];
+		instr_time	io_start;
+		BlockNumber io_first_block;
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PinBufferForBlock().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for PinBufferForBlock.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1488,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1286,19 +1547,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1615,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1650,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +2006,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2271,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2294,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2609,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2623,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3691,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5426,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5447,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a57..985a2c7049c 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1e71e7db4a0..71889471266 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3112,6 +3112,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"buffer_io_size",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Target size for coalescing reads and writes of buffered data blocks."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&buffer_io_size,
+		DEFAULT_BUFFER_IO_SIZE,
+		1, MAX_BUFFER_IO_SIZE,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
 			gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2244ee52f79..b7a4143df21 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -203,6 +203,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#buffer_io_size = 128kB
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..1cc198bde21 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_BUFFER_IO_SIZE PG_IOV_MAX
+#define DEFAULT_BUFFER_IO_SIZE Min(MAX_BUFFER_IO_SIZE, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int buffer_io_size;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers().  Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int16		nblocks;
+	BufferAccessStrategy strategy;
+	int			flags;
+
+	/* Range of buffers, if we need to perform a read. */
+	int16		io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffers(BufferManagerRelation bmr,
+							 Buffer *buffers,
+							 ForkNumber forknum,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 BufferAccessStrategy strategy,
+							 int flags,
+							 ReadBuffersOperation *operation);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +290,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e2a0525dd4a..9933c8f4b1d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2286,6 +2286,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.39.2

v9-0002-Provide-API-for-streaming-reads-of-relations.patchtext/x-patch; charset=US-ASCII; name=v9-0002-Provide-API-for-streaming-reads-of-relations.patchDownload

From 6ec902e3b94d69df936d95376b93b73eca92f1e1 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Feb 2024 00:01:42 +1300
Subject: [PATCH v9 2/4] Provide API for "streaming" reads of relations.

"Streaming reads" can be used as a more efficient replacement for
a series of ReadBuffer() calls.

The client code supplies a callback that can say which block to read
next, and then consumes individual buffers one at a time.  This division
allows streaming_read.c to build up large calls to StartReadBuffers(),
and issue fadvise() advice about future random reads in a systematic
way.

This API is based on an idea proposed by Andres Freund, to pave the way
for asynchronous I/O in future work as required to support direct I/O.
The longer term aim is to create an abstraction that insulates client
code from future improvements to the I/O subsystem.  In the short term,
this mechanism allows improvements and simplification even with
traditional synchronous I/O.

An extended API may be necessary in future for more complicated cases
(for example recovery, which has a related mechanism LsnReadQueue in
xlogprefetcher.c that could eventually be replaced by this), but this
basic API is thought to be sufficient for many common usage patterns
involving predictable access to a single relation fork.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 678 +++++++++++++++++++++++
 src/backend/storage/meson.build          |   1 +
 src/include/storage/streaming_read.h     |  50 ++
 src/tools/pgindent/typedefs.list         |   1 +
 7 files changed, 750 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..760a231500a
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,678 @@
+/*-------------------------------------------------------------------------
+ *
+ * streaming_read.c
+ *	  Mechanism for buffer access with look-ahead
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * Code that needs to access relation data typically pins blocks one at a
+ * time, often in a predictable order that might be sequential or data-driven.
+ * Calling the simple ReadBuffer() function for each block is inefficient,
+ * because blocks that are not yet in the buffer pool require I/O operations
+ * that are small and might stall waiting for storage.  This mechanism looks
+ * into the future and calls StartReadBuffers() and WaitReadBuffers() to read
+ * neighboring blocks together and ahead of time, with an adaptive look-ahead
+ * distance.
+ *
+ * A user-provided callback generates a stream of block numbers that is used
+ * to form reads of up to size buffer_io_size, by attempting to merge them
+ * with a pending read.  When that isn't possible, the existing pending read
+ * is sent to StartReadBuffers() so that a new one can begin to form.
+ *
+ * The algorithm for controlling the look-ahead distance tries to classify the
+ * stream into three ideal behaviors:
+ *
+ * A) No I/O is necessary, because the requested blocks are fully cached
+ * already.  There is no benefit to looking ahead more than one block, so
+ * distance is 1.  This is the default initial assumption.
+ *
+ * B) I/O is necessary, but fadvise is undesirable because the access is
+ * sequential, or impossible because direct I/O is enabled or the system
+ * doesn't support advice.  There is no benefit in looking ahead more than
+ * buffer_io_size (the GUC controlling physical read size), because in this
+ * case only goal is larger read system calls.  Looking further ahead would
+ * pin many buffers and perform speculative work looking ahead for no benefit.
+ *
+ * C) I/O is necesssary, it appears random, and this system supports fadvise.
+ * We'll look further ahead in order to reach the configured level of I/O
+ * concurrency.
+ *
+ * The distance increases rapidly and decays slowly, so that it moves towards
+ * those levels as different I/O patterns are discovered.  For example, a
+ * sequential scan of fully cached data doesn't bother looking ahead, but a
+ * sequential scan that hits a region of uncached blocks will start issuing
+ * increasingly wide read calls until it plateaus at buffer_io_size.
+ *
+ * The main data structure is a circular queue of buffers of size
+ * max_pinned_buffers, ready to be returned by streaming_read_buffer_next().
+ * Each buffer also has an optional variable sized object that is passed from
+ * the callback to the consumer of buffers.  A third array records whether
+ * WaitReadBuffers() must be called before returning the buffer, and if so,
+ * points to the relevant ReadBuffersOperation object.
+ *
+ * For example, if the callback return block numbers 10, 42, 43, 60 in
+ * successive calls, then these data structures might appear as follows:
+ *
+ *                          buffers buf/data buf/io       ios
+ *
+ *                          +----+  +-----+  +---+        +--------+
+ *                          |    |  |     |  |   |  +---->| 42..44 |
+ *                          +----+  +-----+  +---+  |     +--------+
+ *   oldest_buffer_index -> | 10 |  |  ?  |  |   |  | +-->| 60..60 |
+ *                          +----+  +-----+  +---+  | |   +--------+
+ *                          | 42 |  |  ?  |  | 0 +--+ |   |        |
+ *                          +----+  +-----+  +---+    |   +--------+
+ *                          | 43 |  |  ?  |  |   |    |   |        |
+ *                          +----+  +-----+  +---+    |   +--------+
+ *                          | 44 |  |  ?  |  |   |    |   |        |
+ *                          +----+  +-----+  +---+    |   +--------+
+ *                          | 60 |  |  ?  |  | 1 +----+
+ *                          +----+  +-----+  +---+
+ *     next_buffer_index -> |    |  |     |  |   |
+ *                          +----+  +-----+  +---+
+ *
+ * In the example, 5 buffers are pinned, and the next buffer to be streamed to
+ * the client is block 10.  Block 10 was a hit and has no associated I/O, but
+ * the range 42..44 requires an I/O wait before its buffers are returned, as
+ * does block 60.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/storage/aio/streaming_read.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "catalog/pg_tablespace.h"
+#include "miscadmin.h"
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+#include "utils/spccache.h"
+
+/*
+ * Streaming read object.
+ */
+struct StreamingRead
+{
+	int16		max_ios;
+	int16		ios_in_progress;
+	int16		max_pinned_buffers;
+	int16		pinned_buffers;
+	int16		distance;
+	bool		started;
+	bool		finished;
+	bool		advice_enabled;
+
+	/*
+	 * The callback that will tell us which block numbers to read, and an
+	 * opaque pointer that will be pass to it for its own purposes.
+	 */
+	StreamingReadBufferCB callback;
+	void	   *callback_private_data;
+
+	/* The relation we will read. */
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	/* Sometimes we need to buffer one block for flow control. */
+	BlockNumber unget_blocknum;
+	void	   *unget_per_buffer_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* The read operation we are currently preparing. */
+	BlockNumber pending_read_blocknum;
+	int16		pending_read_nblocks;
+
+	/* Space for buffers and optional per-buffer private data. */
+	Buffer	   *buffers;
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+	int16	   *buffer_io_indexes;
+
+	/* Read operations that have been started by not waited for yet. */
+	ReadBuffersOperation *ios;
+	int16		next_io_index;
+
+	/* Head and tail of the circular queue of buffers. */
+	int16		oldest_buffer_index;	/* Next pinned buffer to return */
+	int16		next_buffer_index;	/* Index of next buffer to pin */
+};
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static void *
+get_per_buffer_data(StreamingRead *stream, int16 buffer_index)
+{
+	return (char *) stream->per_buffer_data +
+		stream->per_buffer_data_size * buffer_index;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow streaming_unget_block() to work.
+ */
+static BlockNumber
+streaming_read_get_block(StreamingRead *stream, void *per_buffer_data)
+{
+	BlockNumber result;
+
+	if (unlikely(stream->unget_blocknum != InvalidBlockNumber))
+	{
+		/*
+		 * If we had to unget a block, now it is time to return that one
+		 * again.
+		 */
+		result = stream->unget_blocknum;
+		stream->unget_blocknum = InvalidBlockNumber;
+
+		/*
+		 * The same per_buffer_data element must have been used, and still
+		 * contains whatever data the callback wrote into it.  So we just
+		 * sanity-check that we were called with the value that
+		 * streaming_unget_block() pushed back.
+		 */
+		Assert(per_buffer_data == stream->unget_per_buffer_data);
+	}
+	else
+	{
+		/* Use the installed callback directly. */
+		result = stream->callback(stream,
+								  stream->callback_private_data,
+								  per_buffer_data);
+	}
+
+	return result;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.  This *must* be called with the
+ * last value returned by streaming_get_block().
+ */
+static void
+streaming_read_unget_block(StreamingRead *stream, BlockNumber blocknum, void *per_buffer_data)
+{
+	Assert(stream->unget_blocknum == InvalidBlockNumber);
+	stream->unget_blocknum = blocknum;
+	stream->unget_per_buffer_data = per_buffer_data;
+}
+
+static void
+streaming_read_start_pending_read(StreamingRead *stream)
+{
+	bool		need_wait;
+	int			nblocks;
+	int16		io_index;
+	int16		overflow;
+	int			flags;
+
+	/* This should only be called with a pending read. */
+	Assert(stream->pending_read_nblocks > 0);
+	Assert(stream->pending_read_nblocks <= buffer_io_size);
+
+	/* We had better not exceed the pin limit by starting this read. */
+	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+		   stream->max_pinned_buffers);
+
+	/* We had better not be overwriting an existing pinned buffer. */
+	if (stream->pinned_buffers > 0)
+		Assert(stream->next_buffer_index != stream->oldest_buffer_index);
+	else
+		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
+
+	/*
+	 * If advice hasn't been suppressed, and this system supports it, this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (stream->advice_enabled &&
+		stream->started &&
+		stream->pending_read_blocknum != stream->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+	/* Suppress advice on the first call, because it's too late to benefit. */
+	if (!stream->started)
+		stream->started = true;
+
+	/* We say how many blocks we want to read, but may be smaller on return. */
+	nblocks = stream->pending_read_nblocks;
+	need_wait =
+		StartReadBuffers(stream->bmr,
+						 &stream->buffers[stream->next_buffer_index],
+						 stream->forknum,
+						 stream->pending_read_blocknum,
+						 &nblocks,
+						 stream->strategy,
+						 flags,
+						 &stream->ios[stream->next_io_index]);
+	stream->pinned_buffers += nblocks;
+
+	/* Remember whether we need to wait before returning this buffer. */
+	if (!need_wait)
+	{
+		io_index = -1;
+
+		/* Look-ahead distance decays, no I/O necessary (behavior A). */
+		if (stream->distance > 1)
+			stream->distance--;
+	}
+	else
+	{
+		/*
+		 * Remember to call WaitReadBuffers() before returning head buffer.
+		 * Look-ahead distance will be adjusted after waiting.
+		 */
+		io_index = stream->next_io_index;
+		if (++stream->next_io_index == stream->max_ios)
+			stream->next_io_index = 0;
+
+		Assert(stream->ios_in_progress < stream->max_ios);
+		stream->ios_in_progress++;
+	}
+
+	/* Set up the pointer to the I/O for the head buffer, if there is one. */
+	stream->buffer_io_indexes[stream->next_buffer_index] = io_index;
+
+	/*
+	 * We gave a contiguous range of buffer space to StartReadBuffers(), but
+	 * we want it to wrap around at max_pinned_buffers.  Move values that
+	 * overflowed into the extra space.  At the same time, put -1 in the I/O
+	 * slots for the rest of the buffers to indicate no I/O.  They are covered
+	 * by the head buffer's I/O, if there is one.  We avoid a % operator.
+	 */
+	overflow = (stream->next_buffer_index + nblocks) - stream->max_pinned_buffers;
+	if (overflow > 0)
+	{
+		memmove(&stream->buffers[0],
+				&stream->buffers[stream->max_pinned_buffers],
+				sizeof(stream->buffers[0]) * overflow);
+		for (int i = 0; i < overflow; ++i)
+			stream->buffer_io_indexes[i] = -1;
+		for (int i = 1; i < nblocks - overflow; ++i)
+			stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
+	}
+	else
+	{
+		for (int i = 1; i < nblocks; ++i)
+			stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
+	}
+
+	/*
+	 * Remember where the next block would be after that, so we can detect
+	 * sequential access next time and suppress advice.
+	 */
+	stream->seq_blocknum = stream->pending_read_blocknum + nblocks;
+
+	/* Compute location of start of next read, without using % operator. */
+	stream->next_buffer_index += nblocks;
+	if (stream->next_buffer_index >= stream->max_pinned_buffers)
+		stream->next_buffer_index -= stream->max_pinned_buffers;
+	Assert(stream->next_buffer_index >= 0);
+	Assert(stream->next_buffer_index < stream->max_pinned_buffers);
+
+	/* Adjust the pending read to cover the remaining portion, if any. */
+	stream->pending_read_blocknum += nblocks;
+	stream->pending_read_nblocks -= nblocks;
+}
+
+static void
+streaming_read_look_ahead(StreamingRead *stream)
+{
+	while (!stream->finished &&
+		   stream->ios_in_progress < stream->max_ios &&
+		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+	{
+		BlockNumber blocknum;
+		int16		buffer_index;
+		void	   *per_buffer_data;
+
+		if (stream->pending_read_nblocks == buffer_io_size)
+		{
+			streaming_read_start_pending_read(stream);
+			continue;
+		}
+
+		/*
+		 * See which block the callback wants next in the stream.  We need to
+		 * compute the index of the Nth block of the pending read including
+		 * wrap-around, but we don't want to use the expensive % operator.
+		 */
+		buffer_index = stream->next_buffer_index + stream->pending_read_nblocks;
+		if (buffer_index > stream->max_pinned_buffers)
+			buffer_index -= stream->max_pinned_buffers;
+		per_buffer_data = get_per_buffer_data(stream, buffer_index);
+		blocknum = streaming_read_get_block(stream, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			stream->finished = true;
+			continue;
+		}
+
+		/* Can we merge it with the pending read? */
+		if (stream->pending_read_nblocks > 0 &&
+			stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
+		{
+			stream->pending_read_nblocks++;
+			continue;
+		}
+
+		/* We have to start the pending read before we can build another. */
+		if (stream->pending_read_nblocks > 0)
+		{
+			streaming_read_start_pending_read(stream);
+			if (stream->ios_in_progress == stream->max_ios)
+			{
+				/* And we've hit the limit.  Rewind, and stop here. */
+				streaming_read_unget_block(stream, blocknum, per_buffer_data);
+				return;
+			}
+		}
+
+		/* This is the start of a new pending read. */
+		stream->pending_read_blocknum = blocknum;
+		stream->pending_read_nblocks = 1;
+	}
+
+	/*
+	 * Normally we don't start the pending read just because we've hit a
+	 * limit, preferring to give it another chance to grow to a larger size
+	 * once more buffers have been consumed.  However, in cases where that
+	 * can't possibly happen, we might as well start the read immediately.
+	 */
+	if (((stream->pending_read_nblocks > 0 && stream->finished) ||
+		 (stream->pending_read_nblocks == stream->distance)) &&
+		stream->ios_in_progress < stream->max_ios)
+		streaming_read_start_pending_read(stream);
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.  The callback should return block numbers or InvalidBlockNumber to
+ * signal end-of-stream, and if per_buffer_data_size is non-zero, it may also
+ * write extra data for each block into the space provided to it.  It will
+ * also receive callback_private_data for its own purposes.
+ */
+StreamingRead *
+streaming_read_buffer_begin(int flags,
+							BufferAccessStrategy strategy,
+							BufferManagerRelation bmr,
+							ForkNumber forknum,
+							StreamingReadBufferCB callback,
+							void *callback_private_data,
+							size_t per_buffer_data_size)
+{
+	StreamingRead *stream;
+	int16		max_ios;
+	uint32		max_pinned_buffers;
+	Oid			tablespace_id;
+
+	/*
+	 * Make sure our bmr's smgr and persistent are populated.  The caller
+	 * asserts that the storage manager will remain valid.
+	 */
+	if (!bmr.smgr)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice to the kernel to tell it that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	tablespace_id = bmr.smgr->smgr_rlocator.locator.spcOid;
+	if (!OidIsValid(MyDatabaseId) ||
+		(bmr.rel && IsCatalogRelation(bmr.rel)) ||
+		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
+	{
+		/*
+		 * Avoid circularity while trying to look up tablespace settings or
+		 * before spccache.c is ready.
+		 */
+		max_ios = effective_io_concurrency;
+	}
+	else if (flags & STREAMING_READ_MAINTENANCE)
+		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+	else
+		max_ios = get_tablespace_io_concurrency(tablespace_id);
+
+	/*
+	 * Choose a maximum number of buffers we're prepared to pin.  We try to
+	 * pin fewer if we can, though.  We clamp it to at least buffer_io_size so
+	 * that we can have a chance to build up a full sized read, even when
+	 * max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, buffer_io_size);
+
+	/* Don't allow this backend to pin more than its share of buffers. */
+	if (SmgrIsTemp(bmr.smgr))
+		LimitAdditionalLocalPins(&max_pinned_buffers);
+	else
+		LimitAdditionalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	stream = (StreamingRead *) palloc0(sizeof(StreamingRead));
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  We can use it as long as
+	 * direct I/O isn't enabled, the caller hasn't promised sequential access
+	 * (overriding our detection heuristics), and max_ios hasn't been set to
+	 * zero.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & STREAMING_READ_SEQUENTIAL) == 0 &&
+		max_ios > 0)
+		stream->advice_enabled = true;
+#endif
+
+	/*
+	 * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
+	 * above.  If we had real asynchronous I/O we might need a slightly
+	 * different definition.
+	 */
+	if (max_ios == 0)
+		max_ios = 1;
+
+	stream->max_ios = max_ios;
+	stream->per_buffer_data_size = per_buffer_data_size;
+	stream->max_pinned_buffers = max_pinned_buffers;
+	stream->strategy = strategy;
+
+	stream->bmr = bmr;
+	stream->forknum = forknum;
+	stream->callback = callback;
+	stream->callback_private_data = callback_private_data;
+
+	stream->unget_blocknum = InvalidBlockNumber;
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out doing full-sized
+	 * reads.
+	 */
+	if (flags & STREAMING_READ_FULL)
+		stream->distance = stream->max_pinned_buffers;
+	else
+		stream->distance = 1;
+
+	/*
+	 * Space for the buffers we pin.  Though we never pin more than
+	 * max_pinned_buffers, we want to be able to assume that all the buffers
+	 * for a single read are contiguous (i.e. don't wrap around halfway
+	 * through), so we let the final one run past that position temporarily by
+	 * allocating an extra buffer_io_size - 1 elements.
+	 */
+	stream->buffers = palloc((max_pinned_buffers + buffer_io_size - 1) *
+							 sizeof(stream->buffers[0]));
+
+	/* Space for per-buffer data, if configured. */
+	if (per_buffer_data_size)
+		stream->per_buffer_data =
+			palloc(per_buffer_data_size * (max_pinned_buffers +
+										   buffer_io_size - 1));
+
+	/* Space for the IOs that we might run. */
+	stream->buffer_io_indexes = palloc(max_pinned_buffers * sizeof(stream->buffer_io_indexes[0]));
+	stream->ios = palloc(max_ios * sizeof(ReadBuffersOperation));
+
+	return stream;
+}
+
+/*
+ * Pull one pinned buffer out of a stream created with
+ * streaming_read_buffer_begin().  Each call returns successive blocks in the
+ * order specified by the callback.  If per_buffer_data_size was set to a
+ * non-zero size, *per_buffer_data receives a pointer to the extra per-buffer
+ * data that the callback had a chance to populate.  When the stream runs out
+ * of data, InvalidBuffer is returned.  The caller may decide to end the
+ * stream early at any time by calling streaming_read_end().
+ */
+Buffer
+streaming_read_buffer_next(StreamingRead *stream, void **per_buffer_data)
+{
+	Buffer		buffer;
+	int16		io_index;
+	int16		oldest_buffer_index;
+
+	if (unlikely(stream->pinned_buffers == 0))
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+
+		if (stream->finished)
+			return InvalidBuffer;
+
+		/*
+		 * The usual order of operations is that we look ahead at the bottom
+		 * of this function after potentially finishing an I/O and making
+		 * space for more, but we need a special case to prime the stream when
+		 * we're getting started.
+		 */
+		Assert(!stream->started);
+		streaming_read_look_ahead(stream);
+		if (stream->pinned_buffers == 0)
+			return InvalidBuffer;
+	}
+
+	/* Grab the oldest pinned buffer and associated per-buffer data. */
+	oldest_buffer_index = stream->oldest_buffer_index;
+	Assert(oldest_buffer_index >= 0 &&
+		   oldest_buffer_index < stream->max_pinned_buffers);
+	buffer = stream->buffers[oldest_buffer_index];
+	if (per_buffer_data)
+		*per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
+
+	Assert(BufferIsValid(buffer));
+
+	/* Do we have to wait for an associated I/O first? */
+	io_index = stream->buffer_io_indexes[oldest_buffer_index];
+	Assert(io_index >= -1 && io_index < stream->max_ios);
+	if (io_index >= 0)
+	{
+		int			distance;
+
+		/* Sanity check that we still agree on the buffers. */
+		Assert(stream->ios[io_index].buffers == &stream->buffers[oldest_buffer_index]);
+
+		WaitReadBuffers(&stream->ios[io_index]);
+
+		Assert(stream->ios_in_progress > 0);
+		stream->ios_in_progress--;
+
+		if (stream->ios[io_index].flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/* Distance ramps up fast (behavior C). */
+			distance = stream->distance * 2;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
+		else
+		{
+			/* No advice; move towards full I/O size (behavior B). */
+			if (stream->distance > buffer_io_size)
+			{
+				stream->distance--;
+			}
+			else
+			{
+				distance = stream->distance * 2;
+				distance = Min(distance, buffer_io_size);
+				distance = Min(distance, stream->max_pinned_buffers);
+				stream->distance = distance;
+			}
+		}
+	}
+
+	/* Advance the oldest buffer, but clobber it first for debugging. */
+#ifdef USE_ASSERT_CHECKING
+	stream->buffers[oldest_buffer_index] = InvalidBuffer;
+	stream->buffer_io_indexes[oldest_buffer_index] = -1;
+	if (stream->per_buffer_data)
+		memset(get_per_buffer_data(stream, oldest_buffer_index),
+			   0xff,
+			   stream->per_buffer_data_size);
+#endif
+	if (++stream->oldest_buffer_index == stream->max_pinned_buffers)
+		stream->oldest_buffer_index = 0;
+
+	/* We are transferring ownership of the pin to the caller. */
+	Assert(stream->pinned_buffers > 0);
+	stream->pinned_buffers--;
+
+	/*
+	 * When distance is minimal, we finish up with no queued buffers.  As a
+	 * micro-optimization, we can then reset our circular queues, so that
+	 * all-cached streams re-use the same elements instead of rotating through
+	 * memory.
+	 */
+	if (stream->pinned_buffers == 0)
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+		stream->oldest_buffer_index = 0;
+		stream->next_buffer_index = 0;
+		stream->next_io_index = 0;
+	}
+
+	/* Prepare for the next call. */
+	streaming_read_look_ahead(stream);
+
+	return buffer;
+}
+
+/*
+ * Finish streaming blocks and release all resources.
+ */
+void
+streaming_read_buffer_end(StreamingRead *stream)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	stream->finished = true;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = streaming_read_buffer_next(stream, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(stream->pinned_buffers == 0);
+	Assert(stream->ios_in_progress == 0);
+
+	/* Release memory. */
+	pfree(stream->buffers);
+	if (stream->per_buffer_data)
+		pfree(stream->per_buffer_data);
+	pfree(stream->ios);
+
+	pfree(stream);
+}
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..7991402631a
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,50 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define STREAMING_READ_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define STREAMING_READ_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define STREAMING_READ_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define STREAMING_READ_FULL 0x04
+
+struct StreamingRead;
+typedef struct StreamingRead StreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*StreamingReadBufferCB) (StreamingRead *stream,
+											  void *callback_private_data,
+											  void *per_buffer_data);
+
+extern StreamingRead *streaming_read_buffer_begin(int flags,
+												  BufferAccessStrategy strategy,
+												  BufferManagerRelation bmr,
+												  ForkNumber forknum,
+												  StreamingReadBufferCB callback,
+												  void *callback_private_data,
+												  size_t per_buffer_data_size);
+extern Buffer streaming_read_buffer_next(StreamingRead *stream, void **per_buffer_private);
+extern void streaming_read_buffer_end(StreamingRead *stream);
+
+#endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9933c8f4b1d..967d9ca1591 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2693,6 +2693,7 @@ StopList
 StrategyNumber
 StreamCtl
 StreamStopReason
+StreamingRead
 String
 StringInfo
 StringInfoData
-- 
2.39.2

v9-0003-Use-streaming-reads-in-pg_prewarm.patchtext/x-patch; charset=US-ASCII; name=v9-0003-Use-streaming-reads-in-pg_prewarm.patchDownload

From d77d5fdff31a4333a8a52e7146a2474bb42bad9c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 09:28:42 +1200
Subject: [PATCH v9 3/4] Use streaming reads in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use streaming reads.  This
provides a simple example of such a transformation, and generates fewer
system calls.

Reviewed-by:
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 contrib/pg_prewarm/pg_prewarm.c | 40 ++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..cd2e15aa3d8 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(StreamingRead *stream,
+							   void *user_data,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = user_data;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		StreamingRead *stream;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		stream = streaming_read_buffer_begin(STREAMING_READ_FULL,
+											 NULL,
+											 BMR_REL(rel),
+											 forkNumber,
+											 pg_prewarm_streaming_read_next,
+											 &p,
+											 0);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = streaming_read_buffer_next(stream, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(streaming_read_buffer_next(stream, NULL) == InvalidBuffer);
+		streaming_read_buffer_end(stream);
 	}
 
 	/* Close relation, release lock. */
-- 
2.39.2

#41

Heikki Linnakangas

hlinnaka@iki.fi

almost 2 years ago

In reply to: Thomas Munro (#38)

9 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

On 24/03/2024 15:02, Thomas Munro wrote:

On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Maybe 'pg_streaming_read_next_buffer' or just 'pg_streaming_read_next',
for a shorter name.

Hmm. The idea of 'buffer' appearing in a couple of names is that
there are conceptually other kinds of I/O that we might want to
stream, like raw files or buffers other than the buffer pool, maybe
even sockets, so this would be part of a family of similar interfaces.
I think it needs to be clear that this variant gives you buffers. I'm
OK with removing "get" but I guess it would be better to keep the
words in the same order across the three functions? What about these?

streaming_read_buffer_begin();
streaming_read_buffer_next();
streaming_read_buffer_end();

Tried like that in this version. Other ideas would be to make
"stream" the main noun, buffered_read_stream_begin() or something.
Ideas welcome.

Works for me, although "streaming_read_buffer" is a pretty long prefix.
The flags like "STREAMING_READ_MAINTENANCE" probably ought to be
"STREAMING_READ_BUFFER_MAINTENANCE" as well.

Maybe "buffer_stream_next()"?

Here are some other changes:

* I'm fairly happy with the ABC adaptive distance algorithm so far, I
think, but I spent more time tidying up the way it is implemented. I
didn't like the way each 'range' had buffer[MAX_BUFFERS_PER_TRANSFER],
so I created a new dense array stream->buffers that behaved as a
second circular queue.

* The above also made it trivial for MAX_BUFFERS_PER_TRANSFER to
become the GUC that it always wanted to be: buffer_io_size defaulting
to 128kB. Seems like a reasonable thing to have? Could also
influence things like bulk write? (The main problem I have with the
GUC currently is choosing a category, async resources is wrong....)

* By analogy, it started to look a bit funny that each range had room
for a ReadBuffersOperation, and we had enough ranges for
max_pinned_buffers * 1 block range. So I booted that out to another
dense array, of size max_ios.

* At the same time, Bilal and Andres had been complaining privately
about 'range' management overheads showing up in perf and creating a
regression against master on fully cached scans that do nothing else
(eg pg_prewarm, where we lookup, pin, unpin every page and do no I/O
and no CPU work with the page, a somewhat extreme case but a
reasonable way to isolate the management costs); having made the above
change, it suddenly seemed obvious that I should make the buffers
array the 'main' circular queue, pointing off to another place for
information required for dealing with misses. In this version, there
are no more range objects. This feels better and occupies and touches
less memory. See pictures below.

+1 for all that. Much better!

* Various indexes and sizes that couldn't quite fit in uint8_t but
couldn't possibly exceed a few thousand because they are bounded by
numbers deriving from range-limited GUCs are now int16_t (while I was
looking for low hanging opportunities to reduce memory usage...)

Is int16 enough though? It seems so, because:

max_pinned_buffers = Max(max_ios * 4, buffer_io_size);

and max_ios is constrained by the GUC's maximum MAX_IO_CONCURRENCY, and
buffer_io_size is constrained by MAX_BUFFER_IO_SIZE == PG_IOV_MAX == 32.

If someone changes those constants though, int16 might overflow and fail
in weird ways. I'd suggest being more careful here and explicitly clamp
max_pinned_buffers at PG_INT16_MAX or have a static assertion or
something. (I think it needs to be somewhat less than PG_INT16_MAX,
because of the extra "overflow buffers" stuff and some other places
where you do arithmetic.)

/*
* We gave a contiguous range of buffer space to StartReadBuffers(), but
* we want it to wrap around at max_pinned_buffers. Move values that
* overflowed into the extra space. At the same time, put -1 in the I/O
* slots for the rest of the buffers to indicate no I/O. They are covered
* by the head buffer's I/O, if there is one. We avoid a % operator.
*/
overflow = (stream->next_buffer_index + nblocks) - stream->max_pinned_buffers;
if (overflow > 0)
{
memmove(&stream->buffers[0],
&stream->buffers[stream->max_pinned_buffers],
sizeof(stream->buffers[0]) * overflow);
for (int i = 0; i < overflow; ++i)
stream->buffer_io_indexes[i] = -1;
for (int i = 1; i < nblocks - overflow; ++i)
stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
}
else
{
for (int i = 1; i < nblocks; ++i)
stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
}

Instead of clearing buffer_io_indexes here, it might be cheaper/simpler
to initialize the array to -1 in streaming_read_buffer_begin(), and
reset buffer_io_indexes[io_index] = -1 in streaming_read_buffer_next(),
after the WaitReadBuffers() call. In other words, except when an I/O is
in progress, keep all the elements at -1, even the elements that are not
currently in use.

Alternatively, you could remember the first buffer that the I/O applies
to in the 'ios' array. In other words, instead of pointing from buffer
to the I/O that it depends on, point from the I/O to the buffer that
depends on it. The last attached patch implements that approach. I'm not
wedded to it, but it feels a little simpler.

if (stream->ios[io_index].flags & READ_BUFFERS_ISSUE_ADVICE)
{
/* Distance ramps up fast (behavior C). */
...
}
else
{
/* No advice; move towards full I/O size (behavior B). */
...
}

The comment on ReadBuffersOperation says "Declared in public header only
to allow inclusion in other structs, but contents should not be
accessed", but here you access the 'flags' field.

You also mentioned that the StartReadBuffers() argument list is too
long. Perhaps the solution is to redefine ReadBuffersOperation so that
it consists of two parts: 1st part is filled in by the caller, and
contains the arguments, and 2nd part is private to bufmgr.c. The
signature for StartReadBuffers() would then be just:

bool StartReadBuffers(ReadBuffersOperation *operation);

That would make it OK to read the 'flags' field. It would also allow
reusing the same ReadBuffersOperation struct for multiple I/Os for the
same relation; you only need to change the changing parts of the struct
on each operation.

In the attached patch set, the first three patches are your v9 with no
changes. The last patch refactors away 'buffer_io_indexes' like I
mentioned above. The others are fixes for some other trivial things that
caught my eye.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

v9.heikki-0001-Provide-vectored-variant-of-ReadBuffer.patchtext/x-patch; charset=UTF-8; name=v9.heikki-0001-Provide-vectored-variant-of-ReadBuffer.patchDownload

From 359dcac7725f5bf35f5b135fe1b7fe9d8e050436 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v9.heikki 1/9] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Callers should respect the new GUC buffer_io_size, and the limit on
per-backend pins which is now exposed as a public interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of advice.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  14 +
 src/backend/storage/buffer/bufmgr.c           | 658 ++++++++++++------
 src/backend/storage/buffer/localbuf.c         |  14 +-
 src/backend/utils/misc/guc_tables.c           |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h                  |  45 +-
 src/tools/pgindent/typedefs.list              |   1 +
 7 files changed, 535 insertions(+), 212 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 65a6e6c4086..3af86c59384 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2719,6 +2719,20 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-buffer-io-size" xreflabel="buffer_io_size">
+       <term><varname>buffer_io_size</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>buffer_io_size</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the target I/O size in operations that coalesce buffer I/O.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
        <term><varname>max_worker_processes</varname> (<type>integer</type>)
        <indexterm>
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..b5347678726 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -160,6 +165,12 @@ int			checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
 int			bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
 int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 
+/*
+ * How many buffers should be coalesced into single I/O operations where
+ * possible.
+ */
+int			buffer_io_size = DEFAULT_BUFFER_IO_SIZE;
+
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
@@ -471,10 +482,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,7 +510,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -781,7 +791,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -794,15 +803,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +825,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -994,35 +996,66 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
+/*
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
+ */
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * ReadBuffer_common -- common logic for all ReadBuffer variants
- *
- * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+				  BufferAccessStrategy strategy)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			nblocks;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1041,181 +1074,413 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	nblocks = 1;
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
+	else
+		flags = 0;
+	if (StartReadBuffers(bmr,
+						 &buffer,
+						 forkNum,
+						 blockNum,
+						 &nblocks,
+						 strategy,
+						 flags,
+						 &operation))
+		WaitReadBuffers(&operation);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+		ZeroBuffer(buffer, mode);
+
+	return buffer;
+}
+
+/*
+ * Pin a buffer for a given block.  *foundPtr is set to true if the block was
+ * already present, or false if more work is required to either read it in or
+ * zero it.
+ */
+static inline Buffer
+PinBufferForBlock(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
+	Assert(blockNum != P_NEW);
+
+	Assert(bmr.smgr);
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
+	return BufferDescriptorGetBuffer(bufHdr);
+}
+
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.
+ *
+ * If false is returned, no I/O is necessary and WaitReadBuffers() is not
+ * necessary.  If true is returned, one I/O has been started, and
+ * WaitReadBuffers() must be called with the same operation object before the
+ * buffers are accessed.  Along with the operation object, the caller-supplied
+ * array of buffers must remain valid until WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+bool
+StartReadBuffers(BufferManagerRelation bmr,
+				 Buffer *buffers,
+				 ForkNumber forkNum,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 BufferAccessStrategy strategy,
+				 int flags,
+				 ReadBuffersOperation *operation)
+{
+	int			actual_nblocks = *nblocks;
+	int			io_buffers_len = 0;
+
+	Assert(*nblocks > 0);
+	Assert(*nblocks <= MAX_BUFFER_IO_SIZE);
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	for (int i = 0; i < actual_nblocks; ++i)
+	{
+		bool		found;
+
+		buffers[i] = PinBufferForBlock(bmr,
+									   forkNum,
+									   blockNum + i,
+									   strategy,
+									   &found);
+
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = operation->nblocks = *nblocks = i + 1;
+			break;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			io_buffers_len++;
+		}
+	}
+
+	if (io_buffers_len > 0)
+	{
+		/* Populate extra information needed for I/O. */
+		operation->io_buffers_len = io_buffers_len;
+		operation->blocknum = blockNum;
+		operation->buffers = buffers;
+		operation->nblocks = actual_nblocks;
+		operation->bmr = bmr;
+		operation->forknum = forkNum;
+		operation->strategy = strategy;
+		operation->flags = flags;
+
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
+			/*
+			 * In theory we should only do this if PinBufferForBlock() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
 		}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
+	}
+	else
+	{
+		return false;
+	}
+}
+
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
 	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
 	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+	bmr = operation->bmr;
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFER_IO_SIZE];
+		void	   *io_pages[MAX_BUFFER_IO_SIZE];
+		instr_time	io_start;
+		BlockNumber io_first_block;
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PinBufferForBlock().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for PinBufferForBlock.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1488,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1286,19 +1547,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1615,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1650,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +2006,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2271,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2294,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2609,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2623,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3691,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5426,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5447,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a57..985a2c7049c 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1e71e7db4a0..71889471266 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3112,6 +3112,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"buffer_io_size",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Target size for coalescing reads and writes of buffered data blocks."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&buffer_io_size,
+		DEFAULT_BUFFER_IO_SIZE,
+		1, MAX_BUFFER_IO_SIZE,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
 			gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2244ee52f79..b7a4143df21 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -203,6 +203,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#buffer_io_size = 128kB
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..1cc198bde21 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_BUFFER_IO_SIZE PG_IOV_MAX
+#define DEFAULT_BUFFER_IO_SIZE Min(MAX_BUFFER_IO_SIZE, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int buffer_io_size;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers().  Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int16		nblocks;
+	BufferAccessStrategy strategy;
+	int			flags;
+
+	/* Range of buffers, if we need to perform a read. */
+	int16		io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffers(BufferManagerRelation bmr,
+							 Buffer *buffers,
+							 ForkNumber forknum,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 BufferAccessStrategy strategy,
+							 int flags,
+							 ReadBuffersOperation *operation);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +290,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cfa9d5aaeac..97edd1388e9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2286,6 +2286,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.39.2

v9.heikki-0002-Provide-API-for-streaming-reads-of-relatio.patchtext/x-patch; charset=UTF-8; name=v9.heikki-0002-Provide-API-for-streaming-reads-of-relatio.patchDownload

From 8c6d754275af5fcf2f0a638a2770e2ad1a60659b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Feb 2024 00:01:42 +1300
Subject: [PATCH v9.heikki 2/9] Provide API for "streaming" reads of relations.

"Streaming reads" can be used as a more efficient replacement for
a series of ReadBuffer() calls.

The client code supplies a callback that can say which block to read
next, and then consumes individual buffers one at a time.  This division
allows streaming_read.c to build up large calls to StartReadBuffers(),
and issue fadvise() advice about future random reads in a systematic
way.

This API is based on an idea proposed by Andres Freund, to pave the way
for asynchronous I/O in future work as required to support direct I/O.
The longer term aim is to create an abstraction that insulates client
code from future improvements to the I/O subsystem.  In the short term,
this mechanism allows improvements and simplification even with
traditional synchronous I/O.

An extended API may be necessary in future for more complicated cases
(for example recovery, which has a related mechanism LsnReadQueue in
xlogprefetcher.c that could eventually be replaced by this), but this
basic API is thought to be sufficient for many common usage patterns
involving predictable access to a single relation fork.

Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 678 +++++++++++++++++++++++
 src/backend/storage/meson.build          |   1 +
 src/include/storage/streaming_read.h     |  50 ++
 src/tools/pgindent/typedefs.list         |   1 +
 7 files changed, 750 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..760a231500a
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,678 @@
+/*-------------------------------------------------------------------------
+ *
+ * streaming_read.c
+ *	  Mechanism for buffer access with look-ahead
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * Code that needs to access relation data typically pins blocks one at a
+ * time, often in a predictable order that might be sequential or data-driven.
+ * Calling the simple ReadBuffer() function for each block is inefficient,
+ * because blocks that are not yet in the buffer pool require I/O operations
+ * that are small and might stall waiting for storage.  This mechanism looks
+ * into the future and calls StartReadBuffers() and WaitReadBuffers() to read
+ * neighboring blocks together and ahead of time, with an adaptive look-ahead
+ * distance.
+ *
+ * A user-provided callback generates a stream of block numbers that is used
+ * to form reads of up to size buffer_io_size, by attempting to merge them
+ * with a pending read.  When that isn't possible, the existing pending read
+ * is sent to StartReadBuffers() so that a new one can begin to form.
+ *
+ * The algorithm for controlling the look-ahead distance tries to classify the
+ * stream into three ideal behaviors:
+ *
+ * A) No I/O is necessary, because the requested blocks are fully cached
+ * already.  There is no benefit to looking ahead more than one block, so
+ * distance is 1.  This is the default initial assumption.
+ *
+ * B) I/O is necessary, but fadvise is undesirable because the access is
+ * sequential, or impossible because direct I/O is enabled or the system
+ * doesn't support advice.  There is no benefit in looking ahead more than
+ * buffer_io_size (the GUC controlling physical read size), because in this
+ * case only goal is larger read system calls.  Looking further ahead would
+ * pin many buffers and perform speculative work looking ahead for no benefit.
+ *
+ * C) I/O is necesssary, it appears random, and this system supports fadvise.
+ * We'll look further ahead in order to reach the configured level of I/O
+ * concurrency.
+ *
+ * The distance increases rapidly and decays slowly, so that it moves towards
+ * those levels as different I/O patterns are discovered.  For example, a
+ * sequential scan of fully cached data doesn't bother looking ahead, but a
+ * sequential scan that hits a region of uncached blocks will start issuing
+ * increasingly wide read calls until it plateaus at buffer_io_size.
+ *
+ * The main data structure is a circular queue of buffers of size
+ * max_pinned_buffers, ready to be returned by streaming_read_buffer_next().
+ * Each buffer also has an optional variable sized object that is passed from
+ * the callback to the consumer of buffers.  A third array records whether
+ * WaitReadBuffers() must be called before returning the buffer, and if so,
+ * points to the relevant ReadBuffersOperation object.
+ *
+ * For example, if the callback return block numbers 10, 42, 43, 60 in
+ * successive calls, then these data structures might appear as follows:
+ *
+ *                          buffers buf/data buf/io       ios
+ *
+ *                          +----+  +-----+  +---+        +--------+
+ *                          |    |  |     |  |   |  +---->| 42..44 |
+ *                          +----+  +-----+  +---+  |     +--------+
+ *   oldest_buffer_index -> | 10 |  |  ?  |  |   |  | +-->| 60..60 |
+ *                          +----+  +-----+  +---+  | |   +--------+
+ *                          | 42 |  |  ?  |  | 0 +--+ |   |        |
+ *                          +----+  +-----+  +---+    |   +--------+
+ *                          | 43 |  |  ?  |  |   |    |   |        |
+ *                          +----+  +-----+  +---+    |   +--------+
+ *                          | 44 |  |  ?  |  |   |    |   |        |
+ *                          +----+  +-----+  +---+    |   +--------+
+ *                          | 60 |  |  ?  |  | 1 +----+
+ *                          +----+  +-----+  +---+
+ *     next_buffer_index -> |    |  |     |  |   |
+ *                          +----+  +-----+  +---+
+ *
+ * In the example, 5 buffers are pinned, and the next buffer to be streamed to
+ * the client is block 10.  Block 10 was a hit and has no associated I/O, but
+ * the range 42..44 requires an I/O wait before its buffers are returned, as
+ * does block 60.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/storage/aio/streaming_read.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "catalog/pg_tablespace.h"
+#include "miscadmin.h"
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+#include "utils/spccache.h"
+
+/*
+ * Streaming read object.
+ */
+struct StreamingRead
+{
+	int16		max_ios;
+	int16		ios_in_progress;
+	int16		max_pinned_buffers;
+	int16		pinned_buffers;
+	int16		distance;
+	bool		started;
+	bool		finished;
+	bool		advice_enabled;
+
+	/*
+	 * The callback that will tell us which block numbers to read, and an
+	 * opaque pointer that will be pass to it for its own purposes.
+	 */
+	StreamingReadBufferCB callback;
+	void	   *callback_private_data;
+
+	/* The relation we will read. */
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	/* Sometimes we need to buffer one block for flow control. */
+	BlockNumber unget_blocknum;
+	void	   *unget_per_buffer_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* The read operation we are currently preparing. */
+	BlockNumber pending_read_blocknum;
+	int16		pending_read_nblocks;
+
+	/* Space for buffers and optional per-buffer private data. */
+	Buffer	   *buffers;
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+	int16	   *buffer_io_indexes;
+
+	/* Read operations that have been started by not waited for yet. */
+	ReadBuffersOperation *ios;
+	int16		next_io_index;
+
+	/* Head and tail of the circular queue of buffers. */
+	int16		oldest_buffer_index;	/* Next pinned buffer to return */
+	int16		next_buffer_index;	/* Index of next buffer to pin */
+};
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static void *
+get_per_buffer_data(StreamingRead *stream, int16 buffer_index)
+{
+	return (char *) stream->per_buffer_data +
+		stream->per_buffer_data_size * buffer_index;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow streaming_unget_block() to work.
+ */
+static BlockNumber
+streaming_read_get_block(StreamingRead *stream, void *per_buffer_data)
+{
+	BlockNumber result;
+
+	if (unlikely(stream->unget_blocknum != InvalidBlockNumber))
+	{
+		/*
+		 * If we had to unget a block, now it is time to return that one
+		 * again.
+		 */
+		result = stream->unget_blocknum;
+		stream->unget_blocknum = InvalidBlockNumber;
+
+		/*
+		 * The same per_buffer_data element must have been used, and still
+		 * contains whatever data the callback wrote into it.  So we just
+		 * sanity-check that we were called with the value that
+		 * streaming_unget_block() pushed back.
+		 */
+		Assert(per_buffer_data == stream->unget_per_buffer_data);
+	}
+	else
+	{
+		/* Use the installed callback directly. */
+		result = stream->callback(stream,
+								  stream->callback_private_data,
+								  per_buffer_data);
+	}
+
+	return result;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.  This *must* be called with the
+ * last value returned by streaming_get_block().
+ */
+static void
+streaming_read_unget_block(StreamingRead *stream, BlockNumber blocknum, void *per_buffer_data)
+{
+	Assert(stream->unget_blocknum == InvalidBlockNumber);
+	stream->unget_blocknum = blocknum;
+	stream->unget_per_buffer_data = per_buffer_data;
+}
+
+static void
+streaming_read_start_pending_read(StreamingRead *stream)
+{
+	bool		need_wait;
+	int			nblocks;
+	int16		io_index;
+	int16		overflow;
+	int			flags;
+
+	/* This should only be called with a pending read. */
+	Assert(stream->pending_read_nblocks > 0);
+	Assert(stream->pending_read_nblocks <= buffer_io_size);
+
+	/* We had better not exceed the pin limit by starting this read. */
+	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+		   stream->max_pinned_buffers);
+
+	/* We had better not be overwriting an existing pinned buffer. */
+	if (stream->pinned_buffers > 0)
+		Assert(stream->next_buffer_index != stream->oldest_buffer_index);
+	else
+		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
+
+	/*
+	 * If advice hasn't been suppressed, and this system supports it, this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (stream->advice_enabled &&
+		stream->started &&
+		stream->pending_read_blocknum != stream->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+	/* Suppress advice on the first call, because it's too late to benefit. */
+	if (!stream->started)
+		stream->started = true;
+
+	/* We say how many blocks we want to read, but may be smaller on return. */
+	nblocks = stream->pending_read_nblocks;
+	need_wait =
+		StartReadBuffers(stream->bmr,
+						 &stream->buffers[stream->next_buffer_index],
+						 stream->forknum,
+						 stream->pending_read_blocknum,
+						 &nblocks,
+						 stream->strategy,
+						 flags,
+						 &stream->ios[stream->next_io_index]);
+	stream->pinned_buffers += nblocks;
+
+	/* Remember whether we need to wait before returning this buffer. */
+	if (!need_wait)
+	{
+		io_index = -1;
+
+		/* Look-ahead distance decays, no I/O necessary (behavior A). */
+		if (stream->distance > 1)
+			stream->distance--;
+	}
+	else
+	{
+		/*
+		 * Remember to call WaitReadBuffers() before returning head buffer.
+		 * Look-ahead distance will be adjusted after waiting.
+		 */
+		io_index = stream->next_io_index;
+		if (++stream->next_io_index == stream->max_ios)
+			stream->next_io_index = 0;
+
+		Assert(stream->ios_in_progress < stream->max_ios);
+		stream->ios_in_progress++;
+	}
+
+	/* Set up the pointer to the I/O for the head buffer, if there is one. */
+	stream->buffer_io_indexes[stream->next_buffer_index] = io_index;
+
+	/*
+	 * We gave a contiguous range of buffer space to StartReadBuffers(), but
+	 * we want it to wrap around at max_pinned_buffers.  Move values that
+	 * overflowed into the extra space.  At the same time, put -1 in the I/O
+	 * slots for the rest of the buffers to indicate no I/O.  They are covered
+	 * by the head buffer's I/O, if there is one.  We avoid a % operator.
+	 */
+	overflow = (stream->next_buffer_index + nblocks) - stream->max_pinned_buffers;
+	if (overflow > 0)
+	{
+		memmove(&stream->buffers[0],
+				&stream->buffers[stream->max_pinned_buffers],
+				sizeof(stream->buffers[0]) * overflow);
+		for (int i = 0; i < overflow; ++i)
+			stream->buffer_io_indexes[i] = -1;
+		for (int i = 1; i < nblocks - overflow; ++i)
+			stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
+	}
+	else
+	{
+		for (int i = 1; i < nblocks; ++i)
+			stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
+	}
+
+	/*
+	 * Remember where the next block would be after that, so we can detect
+	 * sequential access next time and suppress advice.
+	 */
+	stream->seq_blocknum = stream->pending_read_blocknum + nblocks;
+
+	/* Compute location of start of next read, without using % operator. */
+	stream->next_buffer_index += nblocks;
+	if (stream->next_buffer_index >= stream->max_pinned_buffers)
+		stream->next_buffer_index -= stream->max_pinned_buffers;
+	Assert(stream->next_buffer_index >= 0);
+	Assert(stream->next_buffer_index < stream->max_pinned_buffers);
+
+	/* Adjust the pending read to cover the remaining portion, if any. */
+	stream->pending_read_blocknum += nblocks;
+	stream->pending_read_nblocks -= nblocks;
+}
+
+static void
+streaming_read_look_ahead(StreamingRead *stream)
+{
+	while (!stream->finished &&
+		   stream->ios_in_progress < stream->max_ios &&
+		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+	{
+		BlockNumber blocknum;
+		int16		buffer_index;
+		void	   *per_buffer_data;
+
+		if (stream->pending_read_nblocks == buffer_io_size)
+		{
+			streaming_read_start_pending_read(stream);
+			continue;
+		}
+
+		/*
+		 * See which block the callback wants next in the stream.  We need to
+		 * compute the index of the Nth block of the pending read including
+		 * wrap-around, but we don't want to use the expensive % operator.
+		 */
+		buffer_index = stream->next_buffer_index + stream->pending_read_nblocks;
+		if (buffer_index > stream->max_pinned_buffers)
+			buffer_index -= stream->max_pinned_buffers;
+		per_buffer_data = get_per_buffer_data(stream, buffer_index);
+		blocknum = streaming_read_get_block(stream, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			stream->finished = true;
+			continue;
+		}
+
+		/* Can we merge it with the pending read? */
+		if (stream->pending_read_nblocks > 0 &&
+			stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
+		{
+			stream->pending_read_nblocks++;
+			continue;
+		}
+
+		/* We have to start the pending read before we can build another. */
+		if (stream->pending_read_nblocks > 0)
+		{
+			streaming_read_start_pending_read(stream);
+			if (stream->ios_in_progress == stream->max_ios)
+			{
+				/* And we've hit the limit.  Rewind, and stop here. */
+				streaming_read_unget_block(stream, blocknum, per_buffer_data);
+				return;
+			}
+		}
+
+		/* This is the start of a new pending read. */
+		stream->pending_read_blocknum = blocknum;
+		stream->pending_read_nblocks = 1;
+	}
+
+	/*
+	 * Normally we don't start the pending read just because we've hit a
+	 * limit, preferring to give it another chance to grow to a larger size
+	 * once more buffers have been consumed.  However, in cases where that
+	 * can't possibly happen, we might as well start the read immediately.
+	 */
+	if (((stream->pending_read_nblocks > 0 && stream->finished) ||
+		 (stream->pending_read_nblocks == stream->distance)) &&
+		stream->ios_in_progress < stream->max_ios)
+		streaming_read_start_pending_read(stream);
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.  The callback should return block numbers or InvalidBlockNumber to
+ * signal end-of-stream, and if per_buffer_data_size is non-zero, it may also
+ * write extra data for each block into the space provided to it.  It will
+ * also receive callback_private_data for its own purposes.
+ */
+StreamingRead *
+streaming_read_buffer_begin(int flags,
+							BufferAccessStrategy strategy,
+							BufferManagerRelation bmr,
+							ForkNumber forknum,
+							StreamingReadBufferCB callback,
+							void *callback_private_data,
+							size_t per_buffer_data_size)
+{
+	StreamingRead *stream;
+	int16		max_ios;
+	uint32		max_pinned_buffers;
+	Oid			tablespace_id;
+
+	/*
+	 * Make sure our bmr's smgr and persistent are populated.  The caller
+	 * asserts that the storage manager will remain valid.
+	 */
+	if (!bmr.smgr)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice to the kernel to tell it that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	tablespace_id = bmr.smgr->smgr_rlocator.locator.spcOid;
+	if (!OidIsValid(MyDatabaseId) ||
+		(bmr.rel && IsCatalogRelation(bmr.rel)) ||
+		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
+	{
+		/*
+		 * Avoid circularity while trying to look up tablespace settings or
+		 * before spccache.c is ready.
+		 */
+		max_ios = effective_io_concurrency;
+	}
+	else if (flags & STREAMING_READ_MAINTENANCE)
+		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+	else
+		max_ios = get_tablespace_io_concurrency(tablespace_id);
+
+	/*
+	 * Choose a maximum number of buffers we're prepared to pin.  We try to
+	 * pin fewer if we can, though.  We clamp it to at least buffer_io_size so
+	 * that we can have a chance to build up a full sized read, even when
+	 * max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, buffer_io_size);
+
+	/* Don't allow this backend to pin more than its share of buffers. */
+	if (SmgrIsTemp(bmr.smgr))
+		LimitAdditionalLocalPins(&max_pinned_buffers);
+	else
+		LimitAdditionalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	stream = (StreamingRead *) palloc0(sizeof(StreamingRead));
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  We can use it as long as
+	 * direct I/O isn't enabled, the caller hasn't promised sequential access
+	 * (overriding our detection heuristics), and max_ios hasn't been set to
+	 * zero.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & STREAMING_READ_SEQUENTIAL) == 0 &&
+		max_ios > 0)
+		stream->advice_enabled = true;
+#endif
+
+	/*
+	 * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
+	 * above.  If we had real asynchronous I/O we might need a slightly
+	 * different definition.
+	 */
+	if (max_ios == 0)
+		max_ios = 1;
+
+	stream->max_ios = max_ios;
+	stream->per_buffer_data_size = per_buffer_data_size;
+	stream->max_pinned_buffers = max_pinned_buffers;
+	stream->strategy = strategy;
+
+	stream->bmr = bmr;
+	stream->forknum = forknum;
+	stream->callback = callback;
+	stream->callback_private_data = callback_private_data;
+
+	stream->unget_blocknum = InvalidBlockNumber;
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out doing full-sized
+	 * reads.
+	 */
+	if (flags & STREAMING_READ_FULL)
+		stream->distance = stream->max_pinned_buffers;
+	else
+		stream->distance = 1;
+
+	/*
+	 * Space for the buffers we pin.  Though we never pin more than
+	 * max_pinned_buffers, we want to be able to assume that all the buffers
+	 * for a single read are contiguous (i.e. don't wrap around halfway
+	 * through), so we let the final one run past that position temporarily by
+	 * allocating an extra buffer_io_size - 1 elements.
+	 */
+	stream->buffers = palloc((max_pinned_buffers + buffer_io_size - 1) *
+							 sizeof(stream->buffers[0]));
+
+	/* Space for per-buffer data, if configured. */
+	if (per_buffer_data_size)
+		stream->per_buffer_data =
+			palloc(per_buffer_data_size * (max_pinned_buffers +
+										   buffer_io_size - 1));
+
+	/* Space for the IOs that we might run. */
+	stream->buffer_io_indexes = palloc(max_pinned_buffers * sizeof(stream->buffer_io_indexes[0]));
+	stream->ios = palloc(max_ios * sizeof(ReadBuffersOperation));
+
+	return stream;
+}
+
+/*
+ * Pull one pinned buffer out of a stream created with
+ * streaming_read_buffer_begin().  Each call returns successive blocks in the
+ * order specified by the callback.  If per_buffer_data_size was set to a
+ * non-zero size, *per_buffer_data receives a pointer to the extra per-buffer
+ * data that the callback had a chance to populate.  When the stream runs out
+ * of data, InvalidBuffer is returned.  The caller may decide to end the
+ * stream early at any time by calling streaming_read_end().
+ */
+Buffer
+streaming_read_buffer_next(StreamingRead *stream, void **per_buffer_data)
+{
+	Buffer		buffer;
+	int16		io_index;
+	int16		oldest_buffer_index;
+
+	if (unlikely(stream->pinned_buffers == 0))
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+
+		if (stream->finished)
+			return InvalidBuffer;
+
+		/*
+		 * The usual order of operations is that we look ahead at the bottom
+		 * of this function after potentially finishing an I/O and making
+		 * space for more, but we need a special case to prime the stream when
+		 * we're getting started.
+		 */
+		Assert(!stream->started);
+		streaming_read_look_ahead(stream);
+		if (stream->pinned_buffers == 0)
+			return InvalidBuffer;
+	}
+
+	/* Grab the oldest pinned buffer and associated per-buffer data. */
+	oldest_buffer_index = stream->oldest_buffer_index;
+	Assert(oldest_buffer_index >= 0 &&
+		   oldest_buffer_index < stream->max_pinned_buffers);
+	buffer = stream->buffers[oldest_buffer_index];
+	if (per_buffer_data)
+		*per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
+
+	Assert(BufferIsValid(buffer));
+
+	/* Do we have to wait for an associated I/O first? */
+	io_index = stream->buffer_io_indexes[oldest_buffer_index];
+	Assert(io_index >= -1 && io_index < stream->max_ios);
+	if (io_index >= 0)
+	{
+		int			distance;
+
+		/* Sanity check that we still agree on the buffers. */
+		Assert(stream->ios[io_index].buffers == &stream->buffers[oldest_buffer_index]);
+
+		WaitReadBuffers(&stream->ios[io_index]);
+
+		Assert(stream->ios_in_progress > 0);
+		stream->ios_in_progress--;
+
+		if (stream->ios[io_index].flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/* Distance ramps up fast (behavior C). */
+			distance = stream->distance * 2;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
+		else
+		{
+			/* No advice; move towards full I/O size (behavior B). */
+			if (stream->distance > buffer_io_size)
+			{
+				stream->distance--;
+			}
+			else
+			{
+				distance = stream->distance * 2;
+				distance = Min(distance, buffer_io_size);
+				distance = Min(distance, stream->max_pinned_buffers);
+				stream->distance = distance;
+			}
+		}
+	}
+
+	/* Advance the oldest buffer, but clobber it first for debugging. */
+#ifdef USE_ASSERT_CHECKING
+	stream->buffers[oldest_buffer_index] = InvalidBuffer;
+	stream->buffer_io_indexes[oldest_buffer_index] = -1;
+	if (stream->per_buffer_data)
+		memset(get_per_buffer_data(stream, oldest_buffer_index),
+			   0xff,
+			   stream->per_buffer_data_size);
+#endif
+	if (++stream->oldest_buffer_index == stream->max_pinned_buffers)
+		stream->oldest_buffer_index = 0;
+
+	/* We are transferring ownership of the pin to the caller. */
+	Assert(stream->pinned_buffers > 0);
+	stream->pinned_buffers--;
+
+	/*
+	 * When distance is minimal, we finish up with no queued buffers.  As a
+	 * micro-optimization, we can then reset our circular queues, so that
+	 * all-cached streams re-use the same elements instead of rotating through
+	 * memory.
+	 */
+	if (stream->pinned_buffers == 0)
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+		stream->oldest_buffer_index = 0;
+		stream->next_buffer_index = 0;
+		stream->next_io_index = 0;
+	}
+
+	/* Prepare for the next call. */
+	streaming_read_look_ahead(stream);
+
+	return buffer;
+}
+
+/*
+ * Finish streaming blocks and release all resources.
+ */
+void
+streaming_read_buffer_end(StreamingRead *stream)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	stream->finished = true;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = streaming_read_buffer_next(stream, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(stream->pinned_buffers == 0);
+	Assert(stream->ios_in_progress == 0);
+
+	/* Release memory. */
+	pfree(stream->buffers);
+	if (stream->per_buffer_data)
+		pfree(stream->per_buffer_data);
+	pfree(stream->ios);
+
+	pfree(stream);
+}
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..7991402631a
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,50 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define STREAMING_READ_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define STREAMING_READ_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define STREAMING_READ_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define STREAMING_READ_FULL 0x04
+
+struct StreamingRead;
+typedef struct StreamingRead StreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*StreamingReadBufferCB) (StreamingRead *stream,
+											  void *callback_private_data,
+											  void *per_buffer_data);
+
+extern StreamingRead *streaming_read_buffer_begin(int flags,
+												  BufferAccessStrategy strategy,
+												  BufferManagerRelation bmr,
+												  ForkNumber forknum,
+												  StreamingReadBufferCB callback,
+												  void *callback_private_data,
+												  size_t per_buffer_data_size);
+extern Buffer streaming_read_buffer_next(StreamingRead *stream, void **per_buffer_private);
+extern void streaming_read_buffer_end(StreamingRead *stream);
+
+#endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 97edd1388e9..783d20ba058 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2693,6 +2693,7 @@ StopList
 StrategyNumber
 StreamCtl
 StreamStopReason
+StreamingRead
 String
 StringInfo
 StringInfoData
-- 
2.39.2

v9.heikki-0003-Use-streaming-reads-in-pg_prewarm.patchtext/x-patch; charset=UTF-8; name=v9.heikki-0003-Use-streaming-reads-in-pg_prewarm.patchDownload

From 464367523da22f615607c11231c7c56871b4da9b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 09:28:42 +1200
Subject: [PATCH v9.heikki 3/9] Use streaming reads in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use streaming reads.  This
provides a simple example of such a transformation, and generates fewer
system calls.

Reviewed-by:
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 contrib/pg_prewarm/pg_prewarm.c | 40 ++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..cd2e15aa3d8 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(StreamingRead *stream,
+							   void *user_data,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = user_data;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		StreamingRead *stream;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		stream = streaming_read_buffer_begin(STREAMING_READ_FULL,
+											 NULL,
+											 BMR_REL(rel),
+											 forkNumber,
+											 pg_prewarm_streaming_read_next,
+											 &p,
+											 0);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = streaming_read_buffer_next(stream, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(streaming_read_buffer_next(stream, NULL) == InvalidBuffer);
+		streaming_read_buffer_end(stream);
 	}
 
 	/* Close relation, release lock. */
-- 
2.39.2

v9.heikki-0004-Cleanup-boilerplate-file-header-comments.patchtext/x-patch; charset=UTF-8; name=v9.heikki-0004-Cleanup-boilerplate-file-header-comments.patchDownload

From 7f3a341dcd13a0271016922da899123fc5eb8892 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 26 Mar 2024 11:35:22 +0200
Subject: [PATCH v9.heikki 4/9] Cleanup boilerplate file header comments

---
 src/backend/storage/aio/streaming_read.c |  9 +++++----
 src/include/storage/streaming_read.h     | 15 ++++++++++++++-
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index 760a231500a..e530a6da0ed 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -3,9 +3,6 @@
  * streaming_read.c
  *	  Mechanism for buffer access with look-ahead
  *
- * Portions Copyright (c) 2024, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
  * Code that needs to access relation data typically pins blocks one at a
  * time, often in a predictable order that might be sequential or data-driven.
  * Calling the simple ReadBuffer() function for each block is inefficient,
@@ -77,8 +74,12 @@
  * the range 42..44 requires an I/O wait before its buffers are returned, as
  * does block 60.
  *
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
  * IDENTIFICATION
- *	  src/backend/storage/storage/aio/streaming_read.c
+ *	  src/backend/storage/aio/streaming_read.c
  *
  *-------------------------------------------------------------------------
  */
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
index 7991402631a..f0ce3e45956 100644
--- a/src/include/storage/streaming_read.h
+++ b/src/include/storage/streaming_read.h
@@ -1,3 +1,16 @@
+/*-------------------------------------------------------------------------
+ *
+ * streaming_read.h
+ *	  Mechanism for buffer access with look-ahead
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/streaming_read.h
+ *
+ *-------------------------------------------------------------------------
+ */
 #ifndef STREAMING_READ_H
 #define STREAMING_READ_H
 
@@ -47,4 +60,4 @@ extern StreamingRead *streaming_read_buffer_begin(int flags,
 extern Buffer streaming_read_buffer_next(StreamingRead *stream, void **per_buffer_private);
 extern void streaming_read_buffer_end(StreamingRead *stream);
 
-#endif
+#endif							/* STREAMING_READ_H */
-- 
2.39.2

v9.heikki-0005-Tidy-up-includes.patchtext/x-patch; charset=UTF-8; name=v9.heikki-0005-Tidy-up-includes.patchDownload

From 0526b6c98afaec357ebac2b8f9ac94a8a30378fc Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 26 Mar 2024 12:54:25 +0200
Subject: [PATCH v9.heikki 5/9] Tidy up #includes

Nothing in the prototypes defined in streaming_read.h depend on smgr.h
or fd.h.
---
 src/backend/storage/aio/streaming_read.c | 2 ++
 src/include/storage/streaming_read.h     | 2 --
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index e530a6da0ed..82a2227d4d5 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -87,6 +87,8 @@
 
 #include "catalog/pg_tablespace.h"
 #include "miscadmin.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
 #include "storage/streaming_read.h"
 #include "utils/rel.h"
 #include "utils/spccache.h"
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
index f0ce3e45956..74053d704ff 100644
--- a/src/include/storage/streaming_read.h
+++ b/src/include/storage/streaming_read.h
@@ -15,8 +15,6 @@
 #define STREAMING_READ_H
 
 #include "storage/bufmgr.h"
-#include "storage/fd.h"
-#include "storage/smgr.h"
 
 /* Default tuning, reasonable for many users. */
 #define STREAMING_READ_DEFAULT 0x00
-- 
2.39.2

v9.heikki-0006-Mention-that-STREAMING_READ_MAINTENANCE-is.patchtext/x-patch; charset=UTF-8; name=v9.heikki-0006-Mention-that-STREAMING_READ_MAINTENANCE-is.patchDownload

From 5a4841a008b4ebaa52f0d42a85a76e45e50bbdea Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 26 Mar 2024 11:35:32 +0200
Subject: [PATCH v9.heikki 6/9] Mention that STREAMING_READ_MAINTENANCE is used
 by VACUUM and CREATE INDEX

It took me a while to understand what a maintenance operation is. An
example helps.

Maybe we should even explicitly mention here that it means that
maintenance_io_concurrency is used instead of
effective_io_concurrency?
---
 src/include/storage/streaming_read.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
index 74053d704ff..b48be02d4bf 100644
--- a/src/include/storage/streaming_read.h
+++ b/src/include/storage/streaming_read.h
@@ -21,7 +21,7 @@
 
 /*
  * I/O streams that are performing maintenance work on behalf of potentially
- * many users.
+ * many users.  For example, VACUUM or CREATE INDEX.
  */
 #define STREAMING_READ_MAINTENANCE 0x01
 
-- 
2.39.2

v9.heikki-0007-Trivial-comment-fixes.patchtext/x-patch; charset=UTF-8; name=v9.heikki-0007-Trivial-comment-fixes.patchDownload

From 55ccdfc0273b32eac850b5cc170578b5306dff08 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 26 Mar 2024 12:46:48 +0200
Subject: [PATCH v9.heikki 7/9] Trivial comment fixes

---
 src/backend/storage/aio/streaming_read.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index 82a2227d4d5..76ce9cef53e 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -48,7 +48,7 @@
  * WaitReadBuffers() must be called before returning the buffer, and if so,
  * points to the relevant ReadBuffersOperation object.
  *
- * For example, if the callback return block numbers 10, 42, 43, 60 in
+ * For example, if the callback returns block numbers 10, 42, 43, 44, 60 in
  * successive calls, then these data structures might appear as follows:
  *
  *                          buffers buf/data buf/io       ios
@@ -136,7 +136,7 @@ struct StreamingRead
 	void	   *per_buffer_data;
 	int16	   *buffer_io_indexes;
 
-	/* Read operations that have been started by not waited for yet. */
+	/* Read operations that have been started but not waited for yet. */
 	ReadBuffersOperation *ios;
 	int16		next_io_index;
 
-- 
2.39.2

v9.heikki-0008-Use-wipe_mem-for-callback-data-that-should.patchtext/x-patch; charset=UTF-8; name=v9.heikki-0008-Use-wipe_mem-for-callback-data-that-should.patchDownload

From 8be3ae3eddce73846692943912e57520e7df032a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 26 Mar 2024 12:47:41 +0200
Subject: [PATCH v9.heikki 8/9] Use wipe_mem() for callback data that should be
 considered invalid

wipe_mem() includes Valgrind hints, which is nice.
---
 src/backend/storage/aio/streaming_read.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index 76ce9cef53e..0ed81017b78 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -90,6 +90,7 @@
 #include "storage/fd.h"
 #include "storage/smgr.h"
 #include "storage/streaming_read.h"
+#include "utils/memdebug.h"
 #include "utils/rel.h"
 #include "utils/spccache.h"
 
@@ -618,13 +619,12 @@ streaming_read_buffer_next(StreamingRead *stream, void **per_buffer_data)
 	}
 
 	/* Advance the oldest buffer, but clobber it first for debugging. */
-#ifdef USE_ASSERT_CHECKING
+#ifdef CLOBBER_FREED_MEMORY
 	stream->buffers[oldest_buffer_index] = InvalidBuffer;
 	stream->buffer_io_indexes[oldest_buffer_index] = -1;
 	if (stream->per_buffer_data)
-		memset(get_per_buffer_data(stream, oldest_buffer_index),
-			   0xff,
-			   stream->per_buffer_data_size);
+		wipe_mem(get_per_buffer_data(stream, oldest_buffer_index),
+				 stream->per_buffer_data_size);
 #endif
 	if (++stream->oldest_buffer_index == stream->max_pinned_buffers)
 		stream->oldest_buffer_index = 0;
-- 
2.39.2

v9.heikki-0009-Replace-buffer_io_indexes-array-with-ref-f.patchtext/x-patch; charset=UTF-8; name=v9.heikki-0009-Replace-buffer_io_indexes-array-with-ref-f.patchDownload

From 76f60b29bae4cd90515b7b942272858df35acaca Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 26 Mar 2024 13:03:06 +0200
Subject: [PATCH v9.heikki 9/9] Replace 'buffer_io_indexes' array with ref from
 IO to buffer.

When the block pattern is sequential so that we manage to build large
I/Os, this avoids having to repeatedly reset the buffer_io_indexes
array elements to -1, when most of them already are -1. That could be
slightly more efficient, although I didn't try to measure that and it
probably doesn't make any difference in practice either way.
---
 src/backend/storage/aio/streaming_read.c | 91 +++++++++++-------------
 1 file changed, 43 insertions(+), 48 deletions(-)

diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index 0ed81017b78..00b92284a47 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -44,30 +44,32 @@
  * The main data structure is a circular queue of buffers of size
  * max_pinned_buffers, ready to be returned by streaming_read_buffer_next().
  * Each buffer also has an optional variable sized object that is passed from
- * the callback to the consumer of buffers.  A third array records whether
- * WaitReadBuffers() must be called before returning the buffer, and if so,
- * points to the relevant ReadBuffersOperation object.
+ * the callback to the consumer of buffers.
+ *
+ * Parallel to the queue of buffers, there is a circular queue of in-progress
+ * I/Os that have been started with StartReadBuffers(), and for which
+ * WaitReadBuffers() must be called before returning the buffer.
  *
  * For example, if the callback returns block numbers 10, 42, 43, 44, 60 in
  * successive calls, then these data structures might appear as follows:
  *
- *                          buffers buf/data buf/io       ios
+ *                          buffers buf/data              ios
  *
- *                          +----+  +-----+  +---+        +--------+
- *                          |    |  |     |  |   |  +---->| 42..44 |
- *                          +----+  +-----+  +---+  |     +--------+
- *   oldest_buffer_index -> | 10 |  |  ?  |  |   |  | +-->| 60..60 |
- *                          +----+  +-----+  +---+  | |   +--------+
- *                          | 42 |  |  ?  |  | 0 +--+ |   |        |
- *                          +----+  +-----+  +---+    |   +--------+
- *                          | 43 |  |  ?  |  |   |    |   |        |
- *                          +----+  +-----+  +---+    |   +--------+
- *                          | 44 |  |  ?  |  |   |    |   |        |
- *                          +----+  +-----+  +---+    |   +--------+
- *                          | 60 |  |  ?  |  | 1 +----+
- *                          +----+  +-----+  +---+
- *     next_buffer_index -> |    |  |     |  |   |
- *                          +----+  +-----+  +---+
+ *                          +----+  +-----+               +--------+
+ *                          |    |  |     |         +-----+ 42..44 |  <- oldest_io_index
+ *                          +----+  +-----+         |     +--------+
+ *   oldest_buffer_index -> | 10 |  |  ?  |         | +-->| 60..60 |
+ *                          +----+  +-----+         | |   +--------+
+ *                          | 42 |  |  ?  |  <------+ |   |        |  <- next_io_index
+ *                          +----+  +-----+           |   +--------+
+ *                          | 43 |  |  ?  |           |   |        |
+ *                          +----+  +-----+           |   +--------+
+ *                          | 44 |  |  ?  |           |   |        |
+ *                          +----+  +-----+           |   +--------+
+ *                          | 60 |  |  ?  |  <--------+
+ *                          +----+  +-----+
+ *     next_buffer_index -> |    |  |     |
+ *                          +----+  +-----+
  *
  * In the example, 5 buffers are pinned, and the next buffer to be streamed to
  * the client is block 10.  Block 10 was a hit and has no associated I/O, but
@@ -94,6 +96,12 @@
 #include "utils/rel.h"
 #include "utils/spccache.h"
 
+typedef struct
+{
+	int16		buffer_index;
+	ReadBuffersOperation op;
+} InProgressIO;
+
 /*
  * Streaming read object.
  */
@@ -135,10 +143,10 @@ struct StreamingRead
 	Buffer	   *buffers;
 	size_t		per_buffer_data_size;
 	void	   *per_buffer_data;
-	int16	   *buffer_io_indexes;
 
 	/* Read operations that have been started but not waited for yet. */
-	ReadBuffersOperation *ios;
+	InProgressIO *ios;
+	int16		oldest_io_index;
 	int16		next_io_index;
 
 	/* Head and tail of the circular queue of buffers. */
@@ -211,7 +219,6 @@ streaming_read_start_pending_read(StreamingRead *stream)
 {
 	bool		need_wait;
 	int			nblocks;
-	int16		io_index;
 	int16		overflow;
 	int			flags;
 
@@ -254,14 +261,12 @@ streaming_read_start_pending_read(StreamingRead *stream)
 						 &nblocks,
 						 stream->strategy,
 						 flags,
-						 &stream->ios[stream->next_io_index]);
+						 &stream->ios[stream->next_io_index].op);
 	stream->pinned_buffers += nblocks;
 
 	/* Remember whether we need to wait before returning this buffer. */
 	if (!need_wait)
 	{
-		io_index = -1;
-
 		/* Look-ahead distance decays, no I/O necessary (behavior A). */
 		if (stream->distance > 1)
 			stream->distance--;
@@ -272,7 +277,7 @@ streaming_read_start_pending_read(StreamingRead *stream)
 		 * Remember to call WaitReadBuffers() before returning head buffer.
 		 * Look-ahead distance will be adjusted after waiting.
 		 */
-		io_index = stream->next_io_index;
+		stream->ios[stream->next_io_index].buffer_index = stream->next_buffer_index;
 		if (++stream->next_io_index == stream->max_ios)
 			stream->next_io_index = 0;
 
@@ -280,9 +285,6 @@ streaming_read_start_pending_read(StreamingRead *stream)
 		stream->ios_in_progress++;
 	}
 
-	/* Set up the pointer to the I/O for the head buffer, if there is one. */
-	stream->buffer_io_indexes[stream->next_buffer_index] = io_index;
-
 	/*
 	 * We gave a contiguous range of buffer space to StartReadBuffers(), but
 	 * we want it to wrap around at max_pinned_buffers.  Move values that
@@ -296,15 +298,6 @@ streaming_read_start_pending_read(StreamingRead *stream)
 		memmove(&stream->buffers[0],
 				&stream->buffers[stream->max_pinned_buffers],
 				sizeof(stream->buffers[0]) * overflow);
-		for (int i = 0; i < overflow; ++i)
-			stream->buffer_io_indexes[i] = -1;
-		for (int i = 1; i < nblocks - overflow; ++i)
-			stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
-	}
-	else
-	{
-		for (int i = 1; i < nblocks; ++i)
-			stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
 	}
 
 	/*
@@ -528,8 +521,7 @@ streaming_read_buffer_begin(int flags,
 										   buffer_io_size - 1));
 
 	/* Space for the IOs that we might run. */
-	stream->buffer_io_indexes = palloc(max_pinned_buffers * sizeof(stream->buffer_io_indexes[0]));
-	stream->ios = palloc(max_ios * sizeof(ReadBuffersOperation));
+	stream->ios = palloc(max_ios * sizeof(InProgressIO));
 
 	return stream;
 }
@@ -547,7 +539,6 @@ Buffer
 streaming_read_buffer_next(StreamingRead *stream, void **per_buffer_data)
 {
 	Buffer		buffer;
-	int16		io_index;
 	int16		oldest_buffer_index;
 
 	if (unlikely(stream->pinned_buffers == 0))
@@ -580,21 +571,23 @@ streaming_read_buffer_next(StreamingRead *stream, void **per_buffer_data)
 	Assert(BufferIsValid(buffer));
 
 	/* Do we have to wait for an associated I/O first? */
-	io_index = stream->buffer_io_indexes[oldest_buffer_index];
-	Assert(io_index >= -1 && io_index < stream->max_ios);
-	if (io_index >= 0)
+	if (stream->ios_in_progress > 0 &&
+		stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
 	{
+		int16		io_index = stream->oldest_io_index;
 		int			distance;
 
 		/* Sanity check that we still agree on the buffers. */
-		Assert(stream->ios[io_index].buffers == &stream->buffers[oldest_buffer_index]);
+		Assert(stream->ios[io_index].op.buffers == &stream->buffers[oldest_buffer_index]);
 
-		WaitReadBuffers(&stream->ios[io_index]);
+		WaitReadBuffers(&stream->ios[io_index].op);
 
 		Assert(stream->ios_in_progress > 0);
 		stream->ios_in_progress--;
+		if (++stream->oldest_io_index == stream->max_ios)
+			stream->oldest_io_index = 0;
 
-		if (stream->ios[io_index].flags & READ_BUFFERS_ISSUE_ADVICE)
+		if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
 			/* Distance ramps up fast (behavior C). */
 			distance = stream->distance * 2;
@@ -621,7 +614,6 @@ streaming_read_buffer_next(StreamingRead *stream, void **per_buffer_data)
 	/* Advance the oldest buffer, but clobber it first for debugging. */
 #ifdef CLOBBER_FREED_MEMORY
 	stream->buffers[oldest_buffer_index] = InvalidBuffer;
-	stream->buffer_io_indexes[oldest_buffer_index] = -1;
 	if (stream->per_buffer_data)
 		wipe_mem(get_per_buffer_data(stream, oldest_buffer_index),
 				 stream->per_buffer_data_size);
@@ -644,7 +636,10 @@ streaming_read_buffer_next(StreamingRead *stream, void **per_buffer_data)
 		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
 		stream->oldest_buffer_index = 0;
 		stream->next_buffer_index = 0;
+		Assert(stream->next_io_index == stream->oldest_io_index);
+		Assert(stream->ios_in_progress == 0);
 		stream->next_io_index = 0;
+		stream->oldest_io_index = 0;
 	}
 
 	/* Prepare for the next call. */
-- 
2.39.2

#42

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Heikki Linnakangas (#41)

3 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

On Wed, Mar 27, 2024 at 1:40 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Is int16 enough though? It seems so, because:

max_pinned_buffers = Max(max_ios * 4, buffer_io_size);

and max_ios is constrained by the GUC's maximum MAX_IO_CONCURRENCY, and
buffer_io_size is constrained by MAX_BUFFER_IO_SIZE == PG_IOV_MAX == 32.

If someone changes those constants though, int16 might overflow and fail
in weird ways. I'd suggest being more careful here and explicitly clamp
max_pinned_buffers at PG_INT16_MAX or have a static assertion or
something. (I think it needs to be somewhat less than PG_INT16_MAX,
because of the extra "overflow buffers" stuff and some other places
where you do arithmetic.)

Clamp added.

/*
* We gave a contiguous range of buffer space to StartReadBuffers(), but
* we want it to wrap around at max_pinned_buffers. Move values that
* overflowed into the extra space. At the same time, put -1 in the I/O
* slots for the rest of the buffers to indicate no I/O. They are covered
* by the head buffer's I/O, if there is one. We avoid a % operator.
*/
overflow = (stream->next_buffer_index + nblocks) - stream->max_pinned_buffers;
if (overflow > 0)
{
memmove(&stream->buffers[0],
&stream->buffers[stream->max_pinned_buffers],
sizeof(stream->buffers[0]) * overflow);
for (int i = 0; i < overflow; ++i)
stream->buffer_io_indexes[i] = -1;
for (int i = 1; i < nblocks - overflow; ++i)
stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
}
else
{
for (int i = 1; i < nblocks; ++i)
stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
}

Instead of clearing buffer_io_indexes here, it might be cheaper/simpler
to initialize the array to -1 in streaming_read_buffer_begin(), and
reset buffer_io_indexes[io_index] = -1 in streaming_read_buffer_next(),
after the WaitReadBuffers() call. In other words, except when an I/O is
in progress, keep all the elements at -1, even the elements that are not
currently in use.

Yeah that wasn't nice and I had already got as far as doing exactly
that ↑ on my own, but your second idea ↓ is better!

Alternatively, you could remember the first buffer that the I/O applies
to in the 'ios' array. In other words, instead of pointing from buffer
to the I/O that it depends on, point from the I/O to the buffer that
depends on it. The last attached patch implements that approach. I'm not
wedded to it, but it feels a little simpler.

Yeah, nice improvement.

if (stream->ios[io_index].flags & READ_BUFFERS_ISSUE_ADVICE)
{
/* Distance ramps up fast (behavior C). */
...
}
else
{
/* No advice; move towards full I/O size (behavior B). */
...
}

The comment on ReadBuffersOperation says "Declared in public header only
to allow inclusion in other structs, but contents should not be
accessed", but here you access the 'flags' field.

You also mentioned that the StartReadBuffers() argument list is too
long. Perhaps the solution is to redefine ReadBuffersOperation so that
it consists of two parts: 1st part is filled in by the caller, and
contains the arguments, and 2nd part is private to bufmgr.c. The
signature for StartReadBuffers() would then be just:

bool StartReadBuffers(ReadBuffersOperation *operation);

Yeah. I had already got as far as doing this on the regression
hunting expedition, but I kept some arguments for frequently changing
things, eg blocknum. It means that the stuff that never changes is in
there, and the stuff that changes each time doesn't have to be written
to memory at all.

That would make it OK to read the 'flags' field. It would also allow
reusing the same ReadBuffersOperation struct for multiple I/Os for the
same relation; you only need to change the changing parts of the struct
on each operation.

Right. Done.

In the attached patch set, the first three patches are your v9 with no
changes. The last patch refactors away 'buffer_io_indexes' like I
mentioned above. The others are fixes for some other trivial things that
caught my eye.

Thanks, all squashed into the patch.

In an offline chat with Robert and Andres, we searched for a better
name for the GUC. We came up with "io_combine_limit". It's easier to
document a general purpose limit than to explain what "buffer_io_size"
does (particularly since the set of affected features will grow over
time but starts so small). I'll feel better about using it to control
bulk writes too, with that name.

I collapsed the various memory allocations into one palloc. The
buffer array is now a buffers[FLEXIBLE_ARRAY_MEMBER].

I got rid of "finished" (now represented by distance == 0, I was
removing branches and variables). I got rid of "started", which can
now be deduced (used for suppressing advice when you're calling
_next() because you need a block and we need to read it immediately),
see the function argument suppress_advice.

Here is a new proposal for the names, updated in v10:

read_stream_begin_relation()
read_stream_next_buffer()
void read_stream_end()

I think we'll finish up with different 'constructor' functions for
different kinds of streams. For example I already want one that can
provide a multi-relation callback for use by recovery (shown in v1).
Others might exist for raw file access, etc. The defining
characteristic of this one is that it accesses one specific
relation/fork. Well, _relfork() might be more accurate but less easy
on the eye. I won't be surprised if people not following this thread
have ideas after commit; it's certainly happened before something gets
renamed in beta and I won't mind a bit if that happens...

I fixed a thinko in the new ReadBuffer() implementation mentioned
before (thanks to Bilal for pointing this out): it didn't handle the
RBM_ZERO_XXX flags properly. Well, it did work and the tests passed,
but it performed a useless read first. I probably got mixed up when I
removed the extended interface which was capable of streaming zeroed
buffers but this simple one isn't and it wasn't suppressing the read
as it would need to. Later I'll propose to add that back in for
recovery.

I fixed a recently added thinko in the circular queue: I mistakenly
thought I didn't need a spare gap between head and tail anymore
because now we never compare them, since we track the number of pinned
buffers instead, but after read_stream_next_buffer() returns, users of
per-buffer data need that data to remain valid until the next call.
So the recent refactoring work didn't survive contact with Melanie's
BHS work, which uses that. We need to "wipe" the previous (spare)
one, which can't possibly be in use yet, and if anyone ever sees 0x7f
in their per-buffer data, it will mean that they illegally accessed
the older value after a new call to read_stream_next_buffer(). Fixed,
with comments to clarify.

Retesting with Melanie's latest BHS patch set and my random.sql
(upthread) gives the same system call trace as before. The intention
of that was to demonstrate what exact sequences
effective_io_concurrency values give. Now you can also run that with
different values of the new io_combine_limit. If you run it with
io_combine_limit = '8kB', it looks almost exactly like master ie no
big reads allowed; the only difference is that read_stream.c refuses
to issue advice for strictly sequential reads:

effective_io_concurrency = 1, range size = 2
unpatched patched
==============================================================================
pread(93,...,8192,0x58000) = 8192 pread(84,...,8192,0x58000) = 8192
posix_fadvise(93,0x5a000,0x2000,...) pread(84,...,8192,0x5a000) = 8192
pread(93,...,8192,0x5a000) = 8192 posix_fadvise(84,0xb0000,0x2000,...)
posix_fadvise(93,0xb0000,0x2000,...) pread(84,...,8192,0xb0000) = 8192
pread(93,...,8192,0xb0000) = 8192 pread(84,...,8192,0xb2000) = 8192
posix_fadvise(93,0xb2000,0x2000,...) posix_fadvise(84,0x108000,0x2000,...)
pread(93,...,8192,0xb2000) = 8192 pread(84,...,8192,0x108000) = 8192
posix_fadvise(93,0x108000,0x2000,...) pread(84,...,8192,0x10a000) = 8192

You wouldn't normally see that though as the default io_combine_limit
would just merge those adjacent reads after the first one triggers
ramp-up towards behaviour C:

effective_io_concurrency = 1, range size = 2
unpatched patched
==============================================================================
pread(93,...,8192,0x58000) = 8192 pread(80,...,8192,0x58000) = 8192
posix_fadvise(93,0x5a000,0x2000,...) pread(80,...,8192,0x5a000) = 8192
pread(93,...,8192,0x5a000) = 8192 posix_fadvise(80,0xb0000,0x4000,...)
posix_fadvise(93,0xb0000,0x2000,...) preadv(80,...,2,0xb0000) = 16384
pread(93,...,8192,0xb0000) = 8192 posix_fadvise(80,0x108000,0x4000,...)
posix_fadvise(93,0xb2000,0x2000,...) preadv(80,...,2,0x108000) = 16384
pread(93,...,8192,0xb2000) = 8192 posix_fadvise(80,0x160000,0x4000,...)

I spent most of the past few days trying to regain some lost
performance. Thanks to Andres for some key observations and help!
That began with reports from Bilal and Melanie (possibly related to
things Tomas had seen too, not sure) of regressions in all-cached
workloads, which I already improved a bit with the ABC algorithm that
minimised pinning for this case. That is, if there's no recent I/O so
we reach what I call behaviour A, it should try to do as little magic
as possible. But it turns out that wasn't enough! It is very hard to
beat a tight loop that just does ReadBuffer(), ReleaseBuffer() over
millions of already-cached blocks, if you have to do exactly the same
work AND extra instructions for management.

There were two layers to the solution in this new version: First, I
now have a special case in read_stream_next_buffer(), a sort of
open-coded specialisation for behaviour A with no per-buffer data, and
that got rid of most of the regression, but some remained. Next,
Andres pointed out that ReadBuffer() itself, even though it is now
implemented on top of StartReadBuffers(nblocks = 1), was still beating
my special case code that calls StartReadBuffers(nblocks = 1), even
though it looks about the same, because bufmgr.c was able to inline
and specialise the latter for one block. To give streaming_read.c
that power from its home inside another translation units, we needed
to export a special case singular StartReadBuffer() (investigation and
patch by Andres, added as co-author). It just calls the plural
function with nblocks = 1, but it gets inlined. So now the special
case for behaviour A drills through both layers, and hopefully now
there is no measurable regression.. need to test a bitt more. Of
course we can't *beat* the old code in this case, yet, but...

(We speculate that a future tree-based buffer mapping table might
allow efficient lookup for a range of block numbers in one go, and
then it could be worth paying the book-keeping costs to find ranges.
Perhaps behaviour A and the associated special case code could then be
deleted, as you'd probably want to use multi-block magic all the time,
for both for I/O and mapping table lookups. Or something like that?)

Attachments:

v10-0001-Provide-vectored-variant-of-ReadBuffer.patchapplication/octet-stream; name=v10-0001-Provide-vectored-variant-of-ReadBuffer.patchDownload

From 52d39699e38cb36cfec6de49e95349448b59d9b6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v10 1/4] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Callers should respect the new GUC io_combine_limit, and the limit on
per-backend pins which is now exposed as a public interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of advice.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de> (optimization tweaks)
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  14 +
 src/backend/storage/buffer/bufmgr.c           | 700 ++++++++++++------
 src/backend/storage/buffer/localbuf.c         |  14 +-
 src/backend/utils/misc/guc_tables.c           |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h                  |  41 +-
 src/tools/pgindent/typedefs.list              |   1 +
 7 files changed, 563 insertions(+), 222 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 65a6e6c408..1480561134 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2719,6 +2719,20 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-io-combine-limit" xreflabel="io_combine_limit">
+       <term><varname>io_combine_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_combine_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the largest I/O size in operations that combine I/O.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
        <term><varname>max_worker_processes</varname> (<type>integer</type>)
        <indexterm>
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c..8df5f3b43d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -160,6 +165,9 @@ int			checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
 int			bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
 int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 
+/* Limit on how many blocks should be handled in single I/O operations. */
+int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
+
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
@@ -471,10 +479,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,7 +507,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -781,7 +788,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -794,15 +800,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +822,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -994,35 +993,146 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
 /*
- * ReadBuffer_common -- common logic for all ReadBuffer variants
- *
- * *hit is set to true if the request was satisfied from shared buffer cache.
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
  */
-static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
-				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
 {
 	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
+/*
+ * Pin a buffer for a given block.  *foundPtr is set to true if the block was
+ * already present, or false if more work is required to either read it in or
+ * zero it.
+ */
+static inline Buffer
+PinBufferForBlock(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
 	IOContext	io_context;
 	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
 
-	*hit = false;
+	Assert(blockNum != P_NEW);
+
+	Assert(bmr.smgr);
+
+	isLocalBuf = bmr.relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
+
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
+	{
+		VacuumPageHit++;
+		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageHit;
+
+		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
+
+	return BufferDescriptorGetBuffer(bufHdr);
+}
+
+/*
+ * ReadBuffer_common -- common logic for all ReadBuffer variants
+ */
+static inline Buffer
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
+				  BlockNumber blockNum, ReadBufferMode mode,
+				  BufferAccessStrategy strategy)
+{
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1041,181 +1151,358 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
-
-	if (isLocalBuf)
+	if (unlikely(mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+				 mode == RBM_ZERO_AND_LOCK))
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
+		bool		found;
+
+		if (bmr.smgr == NULL)
+		{
+			bmr.smgr = RelationGetSmgr(bmr.rel);
+			bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+		}
+
+		buffer = PinBufferForBlock(bmr, forkNum, blockNum, strategy, &found);
+		ZeroBuffer(buffer, mode);
+		return buffer;
 	}
+
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
 	else
+		flags = 0;
+	operation.bmr = bmr;
+	operation.forknum = forkNum;
+	operation.strategy = strategy;
+	if (StartReadBuffer(&operation,
+						&buffer,
+						blockNum,
+						flags))
+		WaitReadBuffers(&operation);
+
+	return buffer;
+}
+
+/*
+ * Single block version of the StartReadBuffers().  This might save a few
+ * instructions when called from another translation unit, if the compiler
+ * inlines the code and specializes for nblocks == 1.
+ */
+bool
+StartReadBuffer(ReadBuffersOperation *operation,
+				Buffer *buffer,
+				BlockNumber blocknum,
+				int flags)
+{
+	int			nblocks = 1;
+	bool		result;
+
+	result = StartReadBuffers(operation, buffer, blocknum, &nblocks, flags);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	return result;
+}
+
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.
+ *
+ * If false is returned, no I/O is necessary and WaitReadBuffers() is not
+ * necessary.  If true is returned, one I/O has been started, and
+ * WaitReadBuffers() must be called with the same operation object before the
+ * buffers are accessed.  Along with the operation object, the caller-supplied
+ * array of buffers must remain valid until WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+inline bool
+StartReadBuffers(ReadBuffersOperation *operation,
+				 Buffer *buffers,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 int flags)
+{
+	int			actual_nblocks = *nblocks;
+	int			io_buffers_len = 0;
+
+	Assert(*nblocks > 0);
+	Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
+
+	if (!operation->bmr.smgr)
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
+		operation->bmr.smgr = RelationGetSmgr(operation->bmr.rel);
+		operation->bmr.relpersistence = operation->bmr.rel->rd_rel->relpersistence;
 	}
 
-	/* At this point we do NOT hold any locks. */
-
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	for (int i = 0; i < actual_nblocks; ++i)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
-		VacuumPageHit++;
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		bool		found;
 
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
+		buffers[i] = PinBufferForBlock(operation->bmr,
+									   operation->forknum,
+									   blockNum + i,
+									   operation->strategy,
+									   &found);
 
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = i + 1;
+			break;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			io_buffers_len++;
+		}
+	}
+	*nblocks = actual_nblocks;
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
+	if (io_buffers_len > 0)
+	{
+		/* Populate information needed for I/O. */
+		operation->buffers = buffers;
+		operation->blocknum = blockNum;
+		operation->flags = flags;
+		operation->nblocks = actual_nblocks;
+		operation->io_buffers_len = io_buffers_len;
+
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
+			/*
+			 * In theory we should only do this if PinBufferForBlock() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(operation->bmr.smgr,
+						 operation->forknum,
+						 blockNum,
+						 operation->io_buffers_len);
 		}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
+	}
+	else
+	{
+		return false;
 	}
+}
+
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
+	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
 	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
 	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+
+	isLocalBuf = operation->bmr.relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_IO_COMBINE_LIMIT];
+		void	   *io_pages[MAX_IO_COMBINE_LIMIT];
+		instr_time	io_start;
+		BlockNumber io_first_block;
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PinBufferForBlock().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(operation->bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for PinBufferForBlock.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1510,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1286,19 +1569,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1637,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1672,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +2028,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2293,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2316,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2631,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2645,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3713,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5448,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5469,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a5..985a2c7049 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1e71e7db4a..28fc52c8a9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3112,6 +3112,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"io_combine_limit",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Limit on the size of data reads and writes."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&io_combine_limit,
+		DEFAULT_IO_COMBINE_LIMIT,
+		1, MAX_IO_COMBINE_LIMIT,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
 			gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2244ee52f7..7fa6d5a64c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -203,6 +203,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d335..241f68c45e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int io_combine_limit;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,38 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+struct ReadBuffersOperation
+{
+	/* The following members should be set by the caller. */
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+	BufferAccessStrategy strategy;
+
+	/* The following private members should not be accessed directly. */
+	Buffer	   *buffers;
+	BlockNumber blocknum;
+	int			flags;
+	int16		nblocks;
+	int16		io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffer(ReadBuffersOperation *operation,
+							Buffer *buffer,
+							BlockNumber blocknum,
+							int flags);
+extern bool StartReadBuffers(ReadBuffersOperation *operation,
+							 Buffer *buffers,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 int flags);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +286,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4679660837..a813ec1701 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2287,6 +2287,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.39.3 (Apple Git-146)

v10-0002-Provide-API-for-streaming-relation-data.patchapplication/octet-stream; name=v10-0002-Provide-API-for-streaming-relation-data.patchDownload

From bd46950956038953ebdc96fa9d6b01ef72ca8153 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Feb 2024 00:01:42 +1300
Subject: [PATCH v10 2/4] Provide API for streaming relation data.

Introduce an abstraction where relation data can be accessed as a
stream of buffers, with an implementation that is more efficient than
the equivalent sequence of ReadBuffer() calls.

Client code supplies a callback that can say which block number is
wanted next, and then consumes individual buffers one at a time from the
stream.  This division allows read_stream.c to build up large calls to
StartReadBuffers() up to io_combine_limit, and issue fadvise() advice
ahead of time in a systematic way when random access is detected.

This API is based on an idea from Andres Freund to pave the way for
asynchronous I/O in future work as required to support direct I/O.  The
goal is to have an abstraction that insulates client code from future
changes to the I/O subsystem.

An extended API may be necessary in future for more complicated cases
(for example recovery, whose LsnReadQueue device in xlogprefetcher.c is
a distant cousin of this code that should eventually be replaced by it),
but this basic API is sufficient for many common usage patterns
involving predictable access to a single relation fork.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (contributions)
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/Makefile          |   2 +-
 src/backend/storage/aio/Makefile      |  14 +
 src/backend/storage/aio/meson.build   |   5 +
 src/backend/storage/aio/read_stream.c | 733 ++++++++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c   |  21 +-
 src/backend/storage/meson.build       |   1 +
 src/include/storage/read_stream.h     |  62 +++
 src/tools/pgindent/typedefs.list      |   2 +
 8 files changed, 829 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/read_stream.c
 create mode 100644 src/include/storage/read_stream.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..eec03f6f2b 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 0000000000..2f29a9ec4d
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	read_stream.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 0000000000..10e1aa3b20
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'read_stream.c',
+)
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
new file mode 100644
index 0000000000..4e293e0df6
--- /dev/null
+++ b/src/backend/storage/aio/read_stream.c
@@ -0,0 +1,733 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.c
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ * Code that needs to access relation data typically pins blocks one at a
+ * time, often in a predictable order that might be sequential or data-driven.
+ * Calling the simple ReadBuffer() function for each block is inefficient,
+ * because blocks that are not yet in the buffer pool require I/O operations
+ * that are small and might stall waiting for storage.  This mechanism looks
+ * into the future and calls StartReadBuffers() and WaitReadBuffers() to read
+ * neighboring blocks together and ahead of time, with an adaptive look-ahead
+ * distance.
+ *
+ * A user-provided callback generates a stream of block numbers that is used
+ * to form reads of up to io_combine_limit, by attempting to merge them with a
+ * pending read.  When that isn't possible, the existing pending read is sent
+ * to StartReadBuffers() so that a new one can begin to form.
+ *
+ * The algorithm for controlling the look-ahead distance tries to classify the
+ * stream into three ideal behaviors:
+ *
+ * A) No I/O is necessary, because the requested blocks are fully cached
+ * already.  There is no benefit to looking ahead more than one block, so
+ * distance is 1.  This is the default initial assumption.
+ *
+ * B) I/O is necessary, but fadvise is undesirable because the access is
+ * sequential, or impossible because direct I/O is enabled or the system
+ * doesn't support advice.  There is no benefit in looking ahead more than
+ * io_combine_limit, because in this case only goal is larger read system
+ * calls.  Looking further ahead would pin many buffers and perform
+ * speculative work looking ahead for no benefit.
+ *
+ * C) I/O is necesssary, it appears random, and this system supports fadvise.
+ * We'll look further ahead in order to reach the configured level of I/O
+ * concurrency.
+ *
+ * The distance increases rapidly and decays slowly, so that it moves towards
+ * those levels as different I/O patterns are discovered.  For example, a
+ * sequential scan of fully cached data doesn't bother looking ahead, but a
+ * sequential scan that hits a region of uncached blocks will start issuing
+ * increasingly wide read calls until it plateaus at io_combine_limit.
+ *
+ * The main data structure is a circular queue of buffers of size
+ * max_pinned_buffers plus some extra space for technical reasons, ready to be
+ * returned by read_stream_next_buffer().  Each buffer also has an optional
+ * variable sized object that is passed from the callback to the consumer of
+ * buffers.
+ *
+ * Parallel to the queue of buffers, there is a circular queue of in-progress
+ * I/Os that have been started with StartReadBuffers(), and for which
+ * WaitReadBuffers() must be called before returning the buffer.
+ *
+ * For example, if the callback return block numbers 10, 42, 43, 60 in
+ * successive calls, then these data structures might appear as follows:
+ *
+ *                          buffers buf/data       ios
+ *
+ *                          +----+  +-----+       +--------+
+ *                          |    |  |     |  +----+ 42..44 | <- oldest_io_index
+ *                          +----+  +-----+  |    +--------+
+ *   oldest_buffer_index -> | 10 |  |  ?  |  | +--+ 60..60 |
+ *                          +----+  +-----+  | |  +--------+
+ *                          | 42 |  |  ?  |<-+ |  |        | <- next_io_index
+ *                          +----+  +-----+    |  +--------+
+ *                          | 43 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 44 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 60 |  |  ?  |<---+
+ *                          +----+  +-----+
+ *     next_buffer_index -> |    |  |     |
+ *                          +----+  +-----+
+ *
+ * In the example, 5 buffers are pinned, and the next buffer to be streamed to
+ * the client is block 10.  Block 10 was a hit and has no associated I/O, but
+ * the range 42..44 requires an I/O wait before its buffers are returned, as
+ * does block 60.
+ *
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/aio/read_stream.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "catalog/pg_tablespace.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+#include "storage/read_stream.h"
+#include "utils/memdebug.h"
+#include "utils/rel.h"
+#include "utils/spccache.h"
+
+typedef struct InProgressIO
+{
+	int16		buffer_index;
+	ReadBuffersOperation op;
+} InProgressIO;
+
+/*
+ * State for managing a stream of reads.
+ */
+struct ReadStream
+{
+	int16		max_ios;
+	int16		ios_in_progress;
+	int16		queue_size;
+	int16		max_pinned_buffers;
+	int16		pinned_buffers;
+	int16		distance;
+	bool		advice_enabled;
+
+	/*
+	 * Sometimes we need to be able to 'unget' a block number to resolve a
+	 * flow control problem when I/Os are split.
+	 */
+	BlockNumber unget_blocknum;
+	bool		have_unget_blocknum;
+
+	/*
+	 * The callback that will tell us which block numbers to read, and an
+	 * opaque pointer that will be pass to it for its own purposes.
+	 */
+	ReadStreamBlockNumberCB callback;
+	void	   *callback_private_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* The read operation we are currently preparing. */
+	BlockNumber pending_read_blocknum;
+	int16		pending_read_nblocks;
+
+	/* Space for buffers and optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Read operations that have been started but not waited for yet. */
+	InProgressIO *ios;
+	int16		oldest_io_index;
+	int16		next_io_index;
+
+	/* Circular queue of buffers. */
+	int16		oldest_buffer_index;	/* Next pinned buffer to return */
+	int16		next_buffer_index;	/* Index of next buffer to pin */
+	Buffer		buffers[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static inline void *
+get_per_buffer_data(ReadStream *stream, int16 buffer_index)
+{
+	return (char *) stream->per_buffer_data +
+		stream->per_buffer_data_size * buffer_index;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow streaming_unget_block() to work.
+ */
+static inline BlockNumber
+read_stream_get_block(ReadStream *stream, void *per_buffer_data)
+{
+	if (!stream->have_unget_blocknum)
+		return stream->callback(stream,
+								stream->callback_private_data,
+								per_buffer_data);
+
+	/*
+	 * You can only unget one block, and next_buffer_index can't change across
+	 * a get, unget, get sequence, so the callback's per_buffer_data, if any,
+	 * is still present in the correct slot.  We just have to return the
+	 * previous block number.
+	 */
+	stream->have_unget_blocknum = false;
+	return stream->unget_blocknum;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.
+ */
+static inline void
+read_stream_unget_block(ReadStream *stream, BlockNumber blocknum)
+{
+	Assert(!stream->have_unget_blocknum);
+	stream->have_unget_blocknum = true;
+	stream->unget_blocknum = blocknum;
+}
+
+static void
+read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
+{
+	bool		need_wait;
+	int			nblocks;
+	int			flags;
+	int16		io_index;
+	int16		overflow;
+	int16		buffer_index;
+
+	/* This should only be called with a pending read. */
+	Assert(stream->pending_read_nblocks > 0);
+	Assert(stream->pending_read_nblocks <= io_combine_limit);
+
+	/* We had better not exceed the pin limit by starting this read. */
+	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+		   stream->max_pinned_buffers);
+
+	/* We had better not be overwriting an existing pinned buffer. */
+	if (stream->pinned_buffers > 0)
+		Assert(stream->next_buffer_index != stream->oldest_buffer_index);
+	else
+		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
+
+	/*
+	 * If advice hasn't been suppressed, this system supports it, and this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (!suppress_advice &&
+		stream->advice_enabled &&
+		stream->pending_read_blocknum != stream->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+	/* We say how many blocks we want to read, but may be smaller on return. */
+	buffer_index = stream->next_buffer_index;
+	io_index = stream->next_io_index;
+	nblocks = stream->pending_read_nblocks;
+	need_wait = StartReadBuffers(&stream->ios[io_index].op,
+								 &stream->buffers[buffer_index],
+								 stream->pending_read_blocknum,
+								 &nblocks,
+								 flags);
+	stream->pinned_buffers += nblocks;
+
+	/* Remember whether we need to wait before returning this buffer. */
+	if (!need_wait)
+	{
+		/* Look-ahead distance decays, no I/O necessary (behavior A). */
+		if (stream->distance > 1)
+			stream->distance--;
+	}
+	else
+	{
+		/*
+		 * Remember to call WaitReadBuffers() before returning head buffer.
+		 * Look-ahead distance will be adjusted after waiting.
+		 */
+		stream->ios[io_index].buffer_index = buffer_index;
+		if (++stream->next_io_index == stream->max_ios)
+			stream->next_io_index = 0;
+		Assert(stream->ios_in_progress < stream->max_ios);
+		stream->ios_in_progress++;
+		stream->seq_blocknum = stream->pending_read_blocknum + nblocks;
+	}
+
+	/*
+	 * We gave a contiguous range of buffer space to StartReadBuffers(), but
+	 * we want it to wrap around at queue_size.  Slide overflowing buffers to
+	 * the front of the array.
+	 */
+	overflow = (buffer_index + nblocks) - stream->queue_size;
+	if (overflow > 0)
+		memmove(&stream->buffers[0],
+				&stream->buffers[stream->queue_size],
+				sizeof(stream->buffers[0]) * overflow);
+
+	/* Compute location of start of next read, without using % operator. */
+	buffer_index += nblocks;
+	if (buffer_index >= stream->queue_size)
+		buffer_index -= stream->queue_size;
+	Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+	stream->next_buffer_index = buffer_index;
+
+	/* Adjust the pending read to cover the remaining portion, if any. */
+	stream->pending_read_blocknum += nblocks;
+	stream->pending_read_nblocks -= nblocks;
+}
+
+static void
+read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
+{
+	while (stream->ios_in_progress < stream->max_ios &&
+		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+	{
+		BlockNumber blocknum;
+		int16		buffer_index;
+		void	   *per_buffer_data;
+
+		if (stream->pending_read_nblocks == io_combine_limit)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			continue;
+		}
+
+		/*
+		 * See which block the callback wants next in the stream.  We need to
+		 * compute the index of the Nth block of the pending read including
+		 * wrap-around, but we don't want to use the expensive % operator.
+		 */
+		buffer_index = stream->next_buffer_index + stream->pending_read_nblocks;
+		if (buffer_index >= stream->queue_size)
+			buffer_index -= stream->queue_size;
+		Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+		per_buffer_data = get_per_buffer_data(stream, buffer_index);
+		blocknum = read_stream_get_block(stream, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			stream->distance = 0;
+			break;
+		}
+
+		/* Can we merge it with the pending read? */
+		if (stream->pending_read_nblocks > 0 &&
+			stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
+		{
+			stream->pending_read_nblocks++;
+			continue;
+		}
+
+		/* We have to start the pending read before we can build another. */
+		if (stream->pending_read_nblocks > 0)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			if (stream->ios_in_progress == stream->max_ios)
+			{
+				/* And we've hit the limit.  Rewind, and stop here. */
+				read_stream_unget_block(stream, blocknum);
+				return;
+			}
+		}
+
+		/* This is the start of a new pending read. */
+		stream->pending_read_blocknum = blocknum;
+		stream->pending_read_nblocks = 1;
+	}
+
+	/*
+	 * Normally we don't start the pending read just because we've hit a
+	 * limit, preferring to give it another chance to grow to a larger size
+	 * once more buffers have been consumed.  However, in cases where that
+	 * can't possibly happen, we might as well start the read immediately.
+	 */
+	if (stream->pending_read_nblocks > 0 &&
+		(stream->distance == stream->pending_read_nblocks ||
+		 stream->distance == 0) &&
+		stream->ios_in_progress < stream->max_ios)
+		read_stream_start_pending_read(stream, suppress_advice);
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.  The callback should return block numbers or InvalidBlockNumber to
+ * signal end-of-stream, and if per_buffer_data_size is non-zero, it may also
+ * write extra data for each block into the space provided to it.  It will
+ * also receive callback_private_data for its own purposes.
+ */
+ReadStream *
+read_stream_begin_relation(int flags,
+						   BufferAccessStrategy strategy,
+						   BufferManagerRelation bmr,
+						   ForkNumber forknum,
+						   ReadStreamBlockNumberCB callback,
+						   void *callback_private_data,
+						   size_t per_buffer_data_size)
+{
+	ReadStream *stream;
+	size_t		size;
+	int16		queue_size;
+	int16		max_ios;
+	uint32		max_pinned_buffers;
+	Oid			tablespace_id;
+
+	/* Make sure our bmr's smgr and persistent are populated. */
+	if (bmr.smgr == NULL)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	/*
+	 * Decide how many I/Os we will allow to run at the same time.  That
+	 * currently means advice to the kernel to tell it that we will soon read.
+	 * This number also affects how far we look ahead for opportunities to
+	 * start more I/Os.
+	 */
+	tablespace_id = bmr.smgr->smgr_rlocator.locator.spcOid;
+	if (!OidIsValid(MyDatabaseId) ||
+		(bmr.rel && IsCatalogRelation(bmr.rel)) ||
+		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
+	{
+		/*
+		 * Avoid circularity while trying to look up tablespace settings or
+		 * before spccache.c is ready.
+		 */
+		max_ios = effective_io_concurrency;
+	}
+	else if (flags & READ_STREAM_MAINTENANCE)
+		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+	else
+		max_ios = get_tablespace_io_concurrency(tablespace_id);
+	max_ios = Min(max_ios, PG_INT16_MAX);
+
+	/*
+	 * Choose the maximum number of buffers we're prepared to pin.  We try to
+	 * pin fewer if we can, though.  We clamp it to at least io_combine_limit
+	 * so that we can have a chance to build up a full io_combine_limit sized
+	 * read, even when max_ios is zero.  Be careful not to allow int16 to
+	 * overflow (even though that's not possible with the current GUC range
+	 * limits), allowing also for the spare entry and the overflow space.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+	max_pinned_buffers = Min(max_pinned_buffers,
+							 PG_INT16_MAX - io_combine_limit - 1);
+
+	/* Don't allow this backend to pin more than its share of buffers. */
+	if (SmgrIsTemp(bmr.smgr))
+		LimitAdditionalLocalPins(&max_pinned_buffers);
+	else
+		LimitAdditionalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * We need one extra entry for buffers and per-buffer data, because users
+	 * of per-buffer data have access to the object until the next call to
+	 * read_stream_next_buffer(), so we need a gap between the head and tail
+	 * of the queue so that we don't clobber it.
+	 */
+	queue_size = max_pinned_buffers + 1;
+
+	/*
+	 * Allocate the object, the buffers, the ios and per_data_data space in
+	 * one big chunk.  Though we have queue_size buffers, we want to be able
+	 * to assume that all the buffers for a single read are contiguous (i.e.
+	 * don't wrap around halfway through), so we allow temporary overflows of
+	 * up to the maximum possible read size by allocating an extra
+	 * io_combine_limit - 1 elements.
+	 */
+	size = offsetof(ReadStream, buffers);
+	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(InProgressIO) * Max(1, max_ios);
+	size += per_buffer_data_size * queue_size;
+	size += MAXIMUM_ALIGNOF * 2;
+	stream = (ReadStream *) palloc(size);
+	memset(stream, 0, offsetof(ReadStream, buffers));
+	stream->ios = (InProgressIO *)
+		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+	if (per_buffer_data_size > 0)
+		stream->per_buffer_data = (void *)
+			MAXALIGN(&stream->ios[Max(1, max_ios)]);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  We can use it as long as
+	 * direct I/O isn't enabled, the caller hasn't promised sequential access
+	 * (overriding our detection heuristics), and max_ios hasn't been set to
+	 * zero.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & READ_STREAM_SEQUENTIAL) == 0 &&
+		max_ios > 0)
+		stream->advice_enabled = true;
+#endif
+
+	/*
+	 * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
+	 * above.  If we had real asynchronous I/O we might need a slightly
+	 * different definition.
+	 */
+	if (max_ios == 0)
+		max_ios = 1;
+
+	stream->max_ios = max_ios;
+	stream->per_buffer_data_size = per_buffer_data_size;
+	stream->max_pinned_buffers = max_pinned_buffers;
+	stream->queue_size = queue_size;
+
+	if (!bmr.smgr)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+	stream->callback = callback;
+	stream->callback_private_data = callback_private_data;
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out assuming we'll be
+	 * doing full io_combine_limit sized reads (behavior B).
+	 */
+	if (flags & READ_STREAM_FULL)
+		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+	else
+		stream->distance = 1;
+
+	/*
+	 * Since we always currently always access the same relation, we can
+	 * initialize parts of the ReadBuffersOperation objects and leave them
+	 * that way, to avoid wasting CPU cycles writing to them for each read.
+	 */
+	for (int i = 0; i < max_ios; ++i)
+	{
+		stream->ios[i].op.bmr = bmr;
+		stream->ios[i].op.forknum = forknum;
+		stream->ios[i].op.strategy = strategy;
+	}
+
+	return stream;
+}
+
+/*
+ * Pull one pinned buffer out of a stream created with
+ * read_stream_begin_buffered().  Each call returns successive blocks in the
+ * order specified by the callback.  If per_buffer_data_size was set to a
+ * non-zero size, *per_buffer_data receives a pointer to the extra per-buffer
+ * data that the callback had a chance to populate, which remains valid until
+ * the next call to read_stream_next_buffer().  When the stream runs out of
+ * data, InvalidBuffer is returned.  The caller may decide to end the stream
+ * early at any time by calling read_stream_end_buffered().
+ */
+Buffer
+read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
+{
+	Buffer		buffer;
+	int16		oldest_buffer_index;
+
+	/*
+	 * A fast path for all-cached scans (behavior A).  This is the same as the
+	 * usual algorithm, but it is specialized for no I/O and no per-buffer
+	 * data, so we can skip the queue management code, stay in the same buffer
+	 * slot and use singular StartReadBuffer().
+	 */
+	if (likely(per_buffer_data == NULL &&
+			   stream->ios_in_progress == 0 &&
+			   stream->pinned_buffers == 1 &&
+			   stream->distance == 1))
+	{
+		BlockNumber next_blocknum;
+
+		/*
+		 * We have a pinned buffer that we need to serve up, but we also want
+		 * to probe the next one before we return, just in case we need to
+		 * start an I/O.  We can re-use the same buffer slot, and an arbitrary
+		 * I/O slot since they're all free.
+		 */
+		oldest_buffer_index = stream->oldest_buffer_index;
+		Assert((oldest_buffer_index + 1) % stream->queue_size ==
+			   stream->next_buffer_index);
+		buffer = stream->buffers[oldest_buffer_index];
+		Assert(buffer != InvalidBuffer);
+		Assert(stream->pending_read_nblocks <= 1);
+		if (unlikely(stream->pending_read_nblocks == 1))
+		{
+			next_blocknum = stream->pending_read_blocknum;
+			stream->pending_read_nblocks = 0;
+		}
+		else
+			next_blocknum = read_stream_get_block(stream, NULL);
+		if (unlikely(next_blocknum == InvalidBlockNumber))
+		{
+			/* End of stream. */
+			stream->distance = 0;
+			stream->next_buffer_index = oldest_buffer_index;
+			/* Pin transferred to caller. */
+			stream->pinned_buffers = 0;
+			return buffer;
+		}
+		/* Call the special single block version, which is marginally faster. */
+		if (unlikely(StartReadBuffer(&stream->ios[0].op,
+									 &stream->buffers[oldest_buffer_index],
+									 next_blocknum,
+									 stream->advice_enabled ?
+									 READ_BUFFERS_ISSUE_ADVICE : 0)))
+		{
+			/* I/O needed.  We'll take the general path next time. */
+			stream->oldest_io_index = 0;
+			stream->next_io_index = stream->max_ios > 1 ? 1 : 0;
+			stream->ios_in_progress = 1;
+			stream->ios[0].buffer_index = oldest_buffer_index;
+			stream->seq_blocknum = next_blocknum + 1;
+			/* Increase look ahead distance (move towards behavior B/C). */
+			stream->distance = Min(2, stream->max_pinned_buffers);
+		}
+		/* Pin transferred to caller, got another one, no net change. */
+		Assert(stream->pinned_buffers == 1);
+		return buffer;
+	}
+
+	if (stream->pinned_buffers == 0)
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+
+		/* End of stream reached?  */
+		if (stream->distance == 0)
+			return InvalidBuffer;
+
+		/*
+		 * The usual order of operations is that we look ahead at the bottom
+		 * of this function after potentially finishing an I/O and making
+		 * space for more, but if we're just starting up we'll need to crank
+		 * the handle to get started.
+		 */
+		read_stream_look_ahead(stream, true);
+
+		/* End of stream reached? */
+		if (stream->pinned_buffers == 0)
+		{
+			Assert(stream->distance == 0);
+			return InvalidBuffer;
+		}
+	}
+
+	/* Grab the oldest pinned buffer and associated per-buffer data. */
+	Assert(stream->pinned_buffers > 0);
+	oldest_buffer_index = stream->oldest_buffer_index;
+	Assert(oldest_buffer_index >= 0 &&
+		   oldest_buffer_index < stream->queue_size);
+	buffer = stream->buffers[oldest_buffer_index];
+	if (per_buffer_data)
+		*per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
+
+	Assert(BufferIsValid(buffer));
+
+	/* Do we have to wait for an associated I/O first? */
+	if (stream->ios_in_progress > 0 &&
+		stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
+	{
+		int16		io_index = stream->oldest_io_index;
+		int16		distance;
+
+		/* Sanity check that we still agree on the buffers. */
+		Assert(stream->ios[io_index].op.buffers ==
+			   &stream->buffers[oldest_buffer_index]);
+
+		WaitReadBuffers(&stream->ios[io_index].op);
+
+		Assert(stream->ios_in_progress > 0);
+		stream->ios_in_progress--;
+		if (++stream->oldest_io_index == stream->max_ios)
+			stream->oldest_io_index = 0;
+
+		if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/* Distance ramps up fast (behavior C). */
+			distance = stream->distance * 2;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
+		else
+		{
+			/* No advice; move towards io_combine_limit (behavior B). */
+			if (stream->distance > io_combine_limit)
+			{
+				stream->distance--;
+			}
+			else
+			{
+				distance = stream->distance * 2;
+				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, stream->max_pinned_buffers);
+				stream->distance = distance;
+			}
+		}
+	}
+
+#ifdef CLOBBER_FREED_MEMORY
+	/* Clobber old buffer and per-buffer data for debugging purposes. */
+	stream->buffers[oldest_buffer_index] = InvalidBuffer;
+
+	/*
+	 * The caller will get access to the per-buffer data, until the next call.
+	 * We wipe the one before, which is never occupied because queue_size
+	 * allowed one extra element.  This will hopefully trip up client code
+	 * that is holding a dangling pointer to it.
+	 */
+	if (stream->per_buffer_data)
+		wipe_mem(get_per_buffer_data(stream,
+									 oldest_buffer_index == 0 ?
+									 stream->queue_size - 1 :
+									 oldest_buffer_index - 1),
+				 stream->per_buffer_data_size);
+#endif
+
+	/* Pin transferred to caller. */
+	Assert(stream->pinned_buffers > 0);
+	stream->pinned_buffers--;
+
+	/* Advance oldest buffer, with wrap-around. */
+	stream->oldest_buffer_index++;
+	if (stream->oldest_buffer_index == stream->queue_size)
+		stream->oldest_buffer_index = 0;
+
+	/* Prepare for the next call. */
+	read_stream_look_ahead(stream, false);
+
+	return buffer;
+}
+
+/*
+ * Release stream resources.
+ */
+void
+read_stream_end(ReadStream *stream)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	stream->distance = 0;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(stream->pinned_buffers == 0);
+	Assert(stream->ios_in_progress == 0);
+
+	/* Release memory. */
+	pfree(stream);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8df5f3b43d..577bcf6e5d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1210,13 +1210,14 @@ StartReadBuffer(ReadBuffersOperation *operation,
  * Begin reading a range of blocks beginning at blockNum and extending for
  * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
  * are written into the buffers array, and *nblocks is updated to contain the
- * actual number, which may be fewer than requested.
+ * actual number, which may be fewer than requested.  Caller sets some of the
+ * members of operation; see struct definition.
  *
- * If false is returned, no I/O is necessary and WaitReadBuffers() is not
- * necessary.  If true is returned, one I/O has been started, and
- * WaitReadBuffers() must be called with the same operation object before the
- * buffers are accessed.  Along with the operation object, the caller-supplied
- * array of buffers must remain valid until WaitReadBuffers() is called.
+ * If false is returned, no I/O is necessary.  If true is returned, one I/O
+ * has been started, and WaitReadBuffers() must be called with the same
+ * operation object before the buffers are accessed.  Along with the operation
+ * object, the caller-supplied array of buffers must remain valid until
+ * WaitReadBuffers() is called.
  *
  * Currently the I/O is only started with optional operating system advice,
  * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
@@ -2452,7 +2453,7 @@ MarkBufferDirty(Buffer buffer)
 	uint32		old_buf_state;
 
 	if (!BufferIsValid(buffer))
-		elog(ERROR, "bad buffer ID: %d", buffer);
+		elog(PANIC, "bad buffer ID: %d", buffer);
 
 	if (BufferIsLocal(buffer))
 	{
@@ -4824,7 +4825,7 @@ void
 ReleaseBuffer(Buffer buffer)
 {
 	if (!BufferIsValid(buffer))
-		elog(ERROR, "bad buffer ID: %d", buffer);
+		elog(PANIC, "bad buffer ID: %d", buffer);
 
 	if (BufferIsLocal(buffer))
 		UnpinLocalBuffer(buffer);
@@ -4891,7 +4892,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 	Page		page = BufferGetPage(buffer);
 
 	if (!BufferIsValid(buffer))
-		elog(ERROR, "bad buffer ID: %d", buffer);
+		elog(PANIC, "bad buffer ID: %d", buffer);
 
 	if (BufferIsLocal(buffer))
 	{
@@ -5963,7 +5964,7 @@ ResOwnerReleaseBufferPin(Datum res)
 
 	/* Like ReleaseBuffer, but don't call ResourceOwnerForgetBuffer */
 	if (!BufferIsValid(buffer))
-		elog(ERROR, "bad buffer ID: %d", buffer);
+		elog(PANIC, "bad buffer ID: %d", buffer);
 
 	if (BufferIsLocal(buffer))
 		UnpinLocalBufferNoOwner(buffer);
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca2..739d13293f 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
new file mode 100644
index 0000000000..9e5fa2acf1
--- /dev/null
+++ b/src/include/storage/read_stream.h
@@ -0,0 +1,62 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.h
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/read_stream.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef READ_STREAM_H
+#define READ_STREAM_H
+
+#include "storage/bufmgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define READ_STREAM_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users, and thus should be governed by maintenance_io_concurrency
+ * instead of effective_io_concurrency.  For example, VACUUM or CREATE INDEX.
+ */
+#define READ_STREAM_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define READ_STREAM_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define READ_STREAM_FULL 0x04
+
+struct ReadStream;
+typedef struct ReadStream ReadStream;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
+												void *callback_private_data,
+												void *per_buffer_data);
+
+extern ReadStream *read_stream_begin_relation(int flags,
+											  BufferAccessStrategy strategy,
+											  BufferManagerRelation bmr,
+											  ForkNumber forknum,
+											  ReadStreamBlockNumberCB callback,
+											  void *callback_private_data,
+											  size_t per_buffer_data_size);
+extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
+extern void read_stream_end(ReadStream *stream);
+
+#endif							/* READ_STREAM_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a813ec1701..56ce4002be 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1214,6 +1214,7 @@ InjectionPointCacheEntry
 InjectionPointEntry
 InjectionPointSharedState
 InlineCodeBlock
+InProgressIO
 InsertStmt
 Instrumentation
 Int128AggState
@@ -2293,6 +2294,7 @@ ReadExtraTocPtrType
 ReadFunc
 ReadLocalXLogPageNoWaitPrivate
 ReadReplicationSlotCmd
+ReadStream
 ReassignOwnedStmt
 RecheckForeignScan_function
 RecordCacheArrayEntry
-- 
2.39.3 (Apple Git-146)

v10-0003-Use-streaming-I-O-in-pg_prewarm.patchapplication/octet-stream; name=v10-0003-Use-streaming-I-O-in-pg_prewarm.patchDownload

From e0151f88b0ce254ae2349731e6786c617c120032 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 09:28:42 +1200
Subject: [PATCH v10 3/4] Use streaming I/O in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use a ReadStream.  This
commit provides a simple example of such a transformation, and generates
fewer system calls.

Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 contrib/pg_prewarm/pg_prewarm.c | 40 ++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e4..f6269562b5 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -19,6 +19,7 @@
 #include "fmgr.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_read_stream_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_read_stream_next_block(ReadStream *stream,
+								  void *callback_private_data,
+								  void *per_buffer_data)
+{
+	struct pg_prewarm_read_stream_private *p = callback_private_data;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_read_stream_private p;
+		ReadStream *stream;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		stream = read_stream_begin_relation(READ_STREAM_FULL,
+											NULL,
+											BMR_REL(rel),
+											forkNumber,
+											pg_prewarm_read_stream_next_block,
+											&p,
+											0);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = read_stream_next_buffer(stream, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
 	}
 
 	/* Close relation, release lock. */
-- 
2.39.3 (Apple Git-146)

#43

Melanie Plageman

melanieplageman@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#42)

Re: Streaming I/O, vectored I/O (WIP)

On Wed, Mar 27, 2024 at 10:11 AM Thomas Munro <thomas.munro@gmail.com> wrote:

I got rid of "finished" (now represented by distance == 0, I was
removing branches and variables). I got rid of "started", which can
now be deduced (used for suppressing advice when you're calling
_next() because you need a block and we need to read it immediately),
see the function argument suppress_advice.

I started rebasing the sequential scan streaming read user over this
new version, and this change (finished now represented with distance
== 0) made me realize that I'm not sure what to set distance to on
rescan.

For sequential scan, I added a little reset function to the streaming
read API (read_stream_reset()) that just releases all the buffers.
Previously, it set finished to true before releasing the buffers (to
indicate it was done) and then set it back to false after. Now, I'll
set distance to 0 before releasing the buffers and !0 after. I could
just restore whatever value distance had before I set it to 0. Or I
could set it to 1. But, thinking about it, are we sure we want to ramp
up in the same way on rescans? Maybe we want to use some information
from the previous scan to determine what to set distance to? Maybe I'm
overcomplicating it...

Here is a new proposal for the names, updated in v10:

read_stream_begin_relation()
read_stream_next_buffer()
void read_stream_end()

Personally, I'm happy with these.

- Melanie

#44

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Melanie Plageman (#43)

Re: Streaming I/O, vectored I/O (WIP)

On Thu, Mar 28, 2024 at 9:43 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

For sequential scan, I added a little reset function to the streaming
read API (read_stream_reset()) that just releases all the buffers.
Previously, it set finished to true before releasing the buffers (to
indicate it was done) and then set it back to false after. Now, I'll
set distance to 0 before releasing the buffers and !0 after. I could
just restore whatever value distance had before I set it to 0. Or I
could set it to 1. But, thinking about it, are we sure we want to ramp
up in the same way on rescans? Maybe we want to use some information
from the previous scan to determine what to set distance to? Maybe I'm
overcomplicating it...

I think 1 is good, as a rescan is even more likely to find the pages
in cache, and if that turns out to be wrong it'll very soon adjust.

#45

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#44)

Re: Streaming I/O, vectored I/O (WIP)

On Thu, Mar 28, 2024 at 10:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:

I think 1 is good, as a rescan is even more likely to find the pages
in cache, and if that turns out to be wrong it'll very soon adjust.

Hmm, no I take that back, it probably won't be due to the
strategy/ring... I see your point now... when I had a separate flag,
the old distance was remembered across but now I'm zapping it. I was
trying to minimise the number of variables that have to be tested in
the fast path by consolidating. Hmm, it is signed -- would it be too
weird if we used a negative number for "finished", so we can just flip
it on reset?

#46

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#38)

Re: Streaming I/O, vectored I/O (WIP)

On Mon, Mar 25, 2024 at 2:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Mar 20, 2024 at 4:04 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

/*
* Skip the initial ramp-up phase if the caller says we're going to be
* reading the whole relation. This way we start out doing full-sized
* reads.
*/
if (flags & PGSR_FLAG_FULL)
pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers);
else
pgsr->distance = 1;

Should this be "Max(MAX_BUFFERS_PER_TRANSFER,
pgsr->max_pinned_buffers)"? max_pinned_buffers cannot be smaller than
MAX_BUFFERS_PER_TRANSFER though, given how it's initialized earlier. So
perhaps just 'pgsr->distance = pgsr->max_pinned_buffers' ?

Right, done.

BTW I forgot to mention that in v10 I changed my mind and debugged my
way back to the original coding, which now looks like this:

/*
* Skip the initial ramp-up phase if the caller says we're going to be
* reading the whole relation. This way we start out assuming we'll be
* doing full io_combine_limit sized reads (behavior B).
*/
if (flags & READ_STREAM_FULL)
stream->distance = Min(max_pinned_buffers, io_combine_limit);
else
stream->distance = 1;

It's not OK for distance to exceed max_pinned_buffers. But if
max_pinned_buffers is huge, remember that the goal here is to access
'behavior B' meaning wide read calls but no unnecessary extra
look-ahead beyond what is needed for that, so we also don't want to
exceed io_combine_limit. Therefore we want the minimum of those two
numbers. In practice on a non-toy system, that's always going to be
io_combine_limit. But I'm not sure how many users of READ_STREAM_FULL
there will be, and I am starting to wonder if it's a good name for the
flag, or even generally very useful. It's sort of saying "I expect to
do I/O, and it'll be sequential, and I won't give up until the end".
But how many users can really make those claims? pg_prewarm is unsual
in that it contains an explicit assumption that the cache is cold and
we want to warm it up. But maybe we should just let the adaptive
algorithm do its thing. It only takes a few reads to go from 1 ->
io_combine_limit.

Thinking harder, if we're going to keep this and not just be fully
adaptive, perhaps there should be a flag READ_STREAM_COLD, where you
hint that the data is not expected to be cached, and you'd combine
that with the _SEQUENTIAL hint. pg_prewarm hints _COLD | _SEQUENTIAL.
Then the initial distance would be something uses the flag
combinations to select initial behavior A, B, C (and we'll quickly
adjust if you're wrong):

if (!(flags & READ_STREAM_COLD))
stream->distance = 1;
else if (flags & READ_STREAM_SEQUENTIAL)
stream->distance = Min(max_pinned_buffers, io_combine_limit);
else
stream->distance = max_pinned_buffers;

But probably almost all users especially in the executor haven't
really got much of a clue what they're going to do so they'd use the
initial starting position of 1 (A) and we'd soo figure it out. Maybe
overengineering for pg_prewarm is a waste of time and we should just
delete the flag instead and hard code 1.

#47

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#46)

Re: Streaming I/O, vectored I/O (WIP)

On Thu, Mar 28, 2024 at 2:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:

... In practice on a non-toy system, that's always going to be
io_combine_limit. ...

And to be more explicit about that: you're right that we initialise
max_pinned_buffers such that it's usually at least io_combine_limit,
but then if you have a very small buffer pool it gets clobbered back
down again by LimitAdditionalBins() and may finish up as low as 1.
You're not allowed to pin more than 1/Nth of the whole buffer pool,
where N is approximately max connections (well it's not exactly that
but that's the general idea). So it's a degenerate case, but it can
happen that max_pinned_buffers is lower than io_combine_limit and then
it's important not to set distance higher or you'd exceed the allowed
limits (or more likely the circular data structure would implode).

#48

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#45)

3 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

New version with some cosmetic/comment changes, and Melanie's
read_stream_reset() function merged, as required by her sequential
scan user patch. I tweaked it slightly: it might as well share code
with read_stream_end(). I think setting distance = 1 is fine for now,
and we might later want to adjust that as we learn more about more
interesting users of _reset().

Attachments:

v11-0001-Provide-vectored-variant-of-ReadBuffer.patchtext/x-patch; charset=US-ASCII; name=v11-0001-Provide-vectored-variant-of-ReadBuffer.patchDownload

From 6b66a6412c90c8f696a8b5890596ba1ab7477191 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v11 1/4] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Callers should respect the new GUC io_combine_limit, and the limit on
per-backend pins which is now exposed as a public interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of advice.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de> (optimization tweaks)
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  14 +
 src/backend/storage/buffer/bufmgr.c           | 701 ++++++++++++------
 src/backend/storage/buffer/localbuf.c         |  14 +-
 src/backend/utils/misc/guc_tables.c           |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h                  |  41 +-
 src/tools/pgindent/typedefs.list              |   1 +
 7 files changed, 564 insertions(+), 222 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5468637e2ef..f3736000ad2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2719,6 +2719,20 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-io-combine-limit" xreflabel="io_combine_limit">
+       <term><varname>io_combine_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_combine_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the largest I/O size in operations that combine I/O.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
        <term><varname>max_worker_processes</varname> (<type>integer</type>)
        <indexterm>
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..7123cbbaa2a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -160,6 +165,9 @@ int			checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
 int			bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
 int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 
+/* Limit on how many blocks should be handled in single I/O operations. */
+int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
+
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
@@ -471,10 +479,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,7 +507,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -781,7 +788,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -794,15 +800,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +822,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -994,35 +993,146 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
 /*
- * ReadBuffer_common -- common logic for all ReadBuffer variants
- *
- * *hit is set to true if the request was satisfied from shared buffer cache.
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
  */
-static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
-				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
 {
 	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
+/*
+ * Pin a buffer for a given block.  *foundPtr is set to true if the block was
+ * already present, or false if more work is required to either read it in or
+ * zero it.
+ */
+static inline Buffer
+PinBufferForBlock(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
 	IOContext	io_context;
 	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
 
-	*hit = false;
+	Assert(blockNum != P_NEW);
+
+	Assert(bmr.smgr);
+
+	isLocalBuf = bmr.relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
+
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
+	{
+		VacuumPageHit++;
+		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageHit;
+
+		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
+
+	return BufferDescriptorGetBuffer(bufHdr);
+}
+
+/*
+ * ReadBuffer_common -- common logic for all ReadBuffer variants
+ */
+static inline Buffer
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
+				  BlockNumber blockNum, ReadBufferMode mode,
+				  BufferAccessStrategy strategy)
+{
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1041,181 +1151,359 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
-
-	if (isLocalBuf)
+	if (unlikely(mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+				 mode == RBM_ZERO_AND_LOCK))
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
+		bool		found;
+
+		if (bmr.smgr == NULL)
+		{
+			bmr.smgr = RelationGetSmgr(bmr.rel);
+			bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+		}
+
+		buffer = PinBufferForBlock(bmr, forkNum, blockNum, strategy, &found);
+		ZeroBuffer(buffer, mode);
+		return buffer;
 	}
+
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
 	else
+		flags = 0;
+	operation.bmr = bmr;
+	operation.forknum = forkNum;
+	operation.strategy = strategy;
+	if (StartReadBuffer(&operation,
+						&buffer,
+						blockNum,
+						flags))
+		WaitReadBuffers(&operation);
+
+	return buffer;
+}
+
+/*
+ * Single block version of the StartReadBuffers().  This might save a few
+ * instructions when called from another translation unit, if the compiler
+ * inlines the code and specializes for nblocks == 1.
+ */
+bool
+StartReadBuffer(ReadBuffersOperation *operation,
+				Buffer *buffer,
+				BlockNumber blocknum,
+				int flags)
+{
+	int			nblocks = 1;
+	bool		result;
+
+	result = StartReadBuffers(operation, buffer, blocknum, &nblocks, flags);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	return result;
+}
+
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.  Caller sets some of the
+ * members of operation; see struct definition.
+ *
+ * If false is returned, no I/O is necessary.  If true is returned, one I/O
+ * has been started, and WaitReadBuffers() must be called with the same
+ * operation object before the buffers are accessed.  Along with the operation
+ * object, the caller-supplied array of buffers must remain valid until
+ * WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+inline bool
+StartReadBuffers(ReadBuffersOperation *operation,
+				 Buffer *buffers,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 int flags)
+{
+	int			actual_nblocks = *nblocks;
+	int			io_buffers_len = 0;
+
+	Assert(*nblocks > 0);
+	Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
+
+	if (!operation->bmr.smgr)
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
+		operation->bmr.smgr = RelationGetSmgr(operation->bmr.rel);
+		operation->bmr.relpersistence = operation->bmr.rel->rd_rel->relpersistence;
 	}
 
-	/* At this point we do NOT hold any locks. */
-
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	for (int i = 0; i < actual_nblocks; ++i)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
-		VacuumPageHit++;
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		bool		found;
 
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
+		buffers[i] = PinBufferForBlock(operation->bmr,
+									   operation->forknum,
+									   blockNum + i,
+									   operation->strategy,
+									   &found);
 
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = i + 1;
+			break;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			io_buffers_len++;
+		}
+	}
+	*nblocks = actual_nblocks;
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
+	if (io_buffers_len > 0)
+	{
+		/* Populate information needed for I/O. */
+		operation->buffers = buffers;
+		operation->blocknum = blockNum;
+		operation->flags = flags;
+		operation->nblocks = actual_nblocks;
+		operation->io_buffers_len = io_buffers_len;
+
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
+			/*
+			 * In theory we should only do this if PinBufferForBlock() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(operation->bmr.smgr,
+						 operation->forknum,
+						 blockNum,
+						 operation->io_buffers_len);
 		}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
+	}
+	else
+	{
+		return false;
 	}
+}
+
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
+	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
 	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
 	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+
+	isLocalBuf = operation->bmr.relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_IO_COMBINE_LIMIT];
+		void	   *io_pages[MAX_IO_COMBINE_LIMIT];
+		instr_time	io_start;
+		BlockNumber io_first_block;
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PinBufferForBlock().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(operation->bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for PinBufferForBlock.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1511,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1286,19 +1570,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1638,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1673,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +2029,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2294,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2317,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2632,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2646,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3714,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5449,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5470,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a57..985a2c7049c 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index abd9029451f..313e393262f 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3112,6 +3112,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"io_combine_limit",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Limit on the size of data reads and writes."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&io_combine_limit,
+		DEFAULT_IO_COMBINE_LIMIT,
+		1, MAX_IO_COMBINE_LIMIT,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
 			gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2244ee52f79..7fa6d5a64c8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -203,6 +203,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..241f68c45e1 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int io_combine_limit;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,38 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+struct ReadBuffersOperation
+{
+	/* The following members should be set by the caller. */
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+	BufferAccessStrategy strategy;
+
+	/* The following private members should not be accessed directly. */
+	Buffer	   *buffers;
+	BlockNumber blocknum;
+	int			flags;
+	int16		nblocks;
+	int16		io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffer(ReadBuffersOperation *operation,
+							Buffer *buffer,
+							BlockNumber blocknum,
+							int flags);
+extern bool StartReadBuffers(ReadBuffersOperation *operation,
+							 Buffer *buffers,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 int flags);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +286,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cfa9d5aaeac..97edd1388e9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2286,6 +2286,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.39.2

v11-0002-Provide-API-for-streaming-relation-data.patchtext/x-patch; charset=US-ASCII; name=v11-0002-Provide-API-for-streaming-relation-data.patchDownload

From 9d334fba6d7d5508d426bb0b066ef5facb088ada Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Feb 2024 00:01:42 +1300
Subject: [PATCH v11 2/4] Provide API for streaming relation data.

Introduce an abstraction where relation data can be accessed as a
stream of buffers, with an implementation that is more efficient than
the equivalent sequence of ReadBuffer() calls.

Client code supplies a callback that can say which block number is
wanted next, and then consumes individual buffers one at a time from the
stream.  This division allows read_stream.c to build up large calls to
StartReadBuffers() up to io_combine_limit, and issue posix_fadvise()
advice ahead of time in a systematic way when random access is detected.

This API is based on an idea from Andres Freund to pave the way for
asynchronous I/O in future work as required to support direct I/O.  The
goal is to have an abstraction that insulates client code from future
changes to the I/O subsystem that would benefit from information about
future needs.

An extended API may be necessary in future for more complicated cases
(for example recovery, whose LsnReadQueue device in xlogprefetcher.c is
a distant cousin of this code and should eventually be replaced by
this), but this basic API is sufficient for many common usage patterns
involving predictable access to a single relation fork.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (contributions)
Author: Melanie Plageman <melanieplageman@gmail.com> (contributions)
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/Makefile          |   2 +-
 src/backend/storage/aio/Makefile      |  14 +
 src/backend/storage/aio/meson.build   |   5 +
 src/backend/storage/aio/read_stream.c | 746 ++++++++++++++++++++++++++
 src/backend/storage/meson.build       |   1 +
 src/include/storage/read_stream.h     |  63 +++
 src/tools/pgindent/typedefs.list      |   2 +
 7 files changed, 832 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/read_stream.c
 create mode 100644 src/include/storage/read_stream.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..2f29a9ec4d1
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	read_stream.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..10e1aa3b20b
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'read_stream.c',
+)
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
new file mode 100644
index 00000000000..69079ed5485
--- /dev/null
+++ b/src/backend/storage/aio/read_stream.c
@@ -0,0 +1,746 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.c
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ * Code that needs to access relation data typically pins blocks one at a
+ * time, often in a predictable order that might be sequential or data-driven.
+ * Calling the simple ReadBuffer() function for each block is inefficient,
+ * because blocks that are not yet in the buffer pool require I/O operations
+ * that are small and might stall waiting for storage.  This mechanism looks
+ * into the future and calls StartReadBuffers() and WaitReadBuffers() to read
+ * neighboring blocks together and ahead of time, with an adaptive look-ahead
+ * distance.
+ *
+ * A user-provided callback generates a stream of block numbers that is used
+ * to form reads of up to io_combine_limit, by attempting to merge them with a
+ * pending read.  When that isn't possible, the existing pending read is sent
+ * to StartReadBuffers() so that a new one can begin to form.
+ *
+ * The algorithm for controlling the look-ahead distance tries to classify the
+ * stream into three ideal behaviors:
+ *
+ * A) No I/O is necessary, because the requested blocks are fully cached
+ * already.  There is no benefit to looking ahead more than one block, so
+ * distance is 1.  This is the default initial assumption.
+ *
+ * B) I/O is necessary, but fadvise is undesirable because the access is
+ * sequential, or impossible because direct I/O is enabled or the system
+ * doesn't support advice.  There is no benefit in looking ahead more than
+ * io_combine_limit, because in this case only goal is larger read system
+ * calls.  Looking further ahead would pin many buffers and perform
+ * speculative work looking ahead for no benefit.
+ *
+ * C) I/O is necesssary, it appears random, and this system supports fadvise.
+ * We'll look further ahead in order to reach the configured level of I/O
+ * concurrency.
+ *
+ * The distance increases rapidly and decays slowly, so that it moves towards
+ * those levels as different I/O patterns are discovered.  For example, a
+ * sequential scan of fully cached data doesn't bother looking ahead, but a
+ * sequential scan that hits a region of uncached blocks will start issuing
+ * increasingly wide read calls until it plateaus at io_combine_limit.
+ *
+ * The main data structure is a circular queue of buffers of size
+ * max_pinned_buffers plus some extra space for technical reasons, ready to be
+ * returned by read_stream_next_buffer().  Each buffer also has an optional
+ * variable sized object that is passed from the callback to the consumer of
+ * buffers.
+ *
+ * Parallel to the queue of buffers, there is a circular queue of in-progress
+ * I/Os that have been started with StartReadBuffers(), and for which
+ * WaitReadBuffers() must be called before returning the buffer.
+ *
+ * For example, if the callback return block numbers 10, 42, 43, 60 in
+ * successive calls, then these data structures might appear as follows:
+ *
+ *                          buffers buf/data       ios
+ *
+ *                          +----+  +-----+       +--------+
+ *                          |    |  |     |  +----+ 42..44 | <- oldest_io_index
+ *                          +----+  +-----+  |    +--------+
+ *   oldest_buffer_index -> | 10 |  |  ?  |  | +--+ 60..60 |
+ *                          +----+  +-----+  | |  +--------+
+ *                          | 42 |  |  ?  |<-+ |  |        | <- next_io_index
+ *                          +----+  +-----+    |  +--------+
+ *                          | 43 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 44 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 60 |  |  ?  |<---+
+ *                          +----+  +-----+
+ *     next_buffer_index -> |    |  |     |
+ *                          +----+  +-----+
+ *
+ * In the example, 5 buffers are pinned, and the next buffer to be streamed to
+ * the client is block 10.  Block 10 was a hit and has no associated I/O, but
+ * the range 42..44 requires an I/O wait before its buffers are returned, as
+ * does block 60.
+ *
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/aio/read_stream.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "catalog/pg_tablespace.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+#include "storage/read_stream.h"
+#include "utils/memdebug.h"
+#include "utils/rel.h"
+#include "utils/spccache.h"
+
+typedef struct InProgressIO
+{
+	int16		buffer_index;
+	ReadBuffersOperation op;
+} InProgressIO;
+
+/*
+ * State for managing a stream of reads.
+ */
+struct ReadStream
+{
+	int16		max_ios;
+	int16		ios_in_progress;
+	int16		queue_size;
+	int16		max_pinned_buffers;
+	int16		pinned_buffers;
+	int16		distance;
+	bool		advice_enabled;
+
+	/*
+	 * Sometimes we need to be able to 'unget' a block number to resolve a
+	 * flow control problem when I/Os are split.
+	 */
+	BlockNumber unget_blocknum;
+	bool		have_unget_blocknum;
+
+	/*
+	 * The callback that will tell us which block numbers to read, and an
+	 * opaque pointer that will be pass to it for its own purposes.
+	 */
+	ReadStreamBlockNumberCB callback;
+	void	   *callback_private_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* The read operation we are currently preparing. */
+	BlockNumber pending_read_blocknum;
+	int16		pending_read_nblocks;
+
+	/* Space for buffers and optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Read operations that have been started but not waited for yet. */
+	InProgressIO *ios;
+	int16		oldest_io_index;
+	int16		next_io_index;
+
+	/* Circular queue of buffers. */
+	int16		oldest_buffer_index;	/* Next pinned buffer to return */
+	int16		next_buffer_index;	/* Index of next buffer to pin */
+	Buffer		buffers[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static inline void *
+get_per_buffer_data(ReadStream *stream, int16 buffer_index)
+{
+	return (char *) stream->per_buffer_data +
+		stream->per_buffer_data_size * buffer_index;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow streaming_unget_block() to work.
+ */
+static inline BlockNumber
+read_stream_get_block(ReadStream *stream, void *per_buffer_data)
+{
+	if (!stream->have_unget_blocknum)
+		return stream->callback(stream,
+								stream->callback_private_data,
+								per_buffer_data);
+
+	/*
+	 * You can only unget one block, and next_buffer_index can't change across
+	 * a get, unget, get sequence, so the callback's per_buffer_data, if any,
+	 * is still present in the correct slot.  We just have to return the
+	 * previous block number.
+	 */
+	stream->have_unget_blocknum = false;
+	return stream->unget_blocknum;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.
+ */
+static inline void
+read_stream_unget_block(ReadStream *stream, BlockNumber blocknum)
+{
+	Assert(!stream->have_unget_blocknum);
+	stream->have_unget_blocknum = true;
+	stream->unget_blocknum = blocknum;
+}
+
+static void
+read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
+{
+	bool		need_wait;
+	int			nblocks;
+	int			flags;
+	int16		io_index;
+	int16		overflow;
+	int16		buffer_index;
+
+	/* This should only be called with a pending read. */
+	Assert(stream->pending_read_nblocks > 0);
+	Assert(stream->pending_read_nblocks <= io_combine_limit);
+
+	/* We had better not exceed the pin limit by starting this read. */
+	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+		   stream->max_pinned_buffers);
+
+	/* We had better not be overwriting an existing pinned buffer. */
+	if (stream->pinned_buffers > 0)
+		Assert(stream->next_buffer_index != stream->oldest_buffer_index);
+	else
+		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
+
+	/*
+	 * If advice hasn't been suppressed, this system supports it, and this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (!suppress_advice &&
+		stream->advice_enabled &&
+		stream->pending_read_blocknum != stream->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+	/* We say how many blocks we want to read, but may be smaller on return. */
+	buffer_index = stream->next_buffer_index;
+	io_index = stream->next_io_index;
+	nblocks = stream->pending_read_nblocks;
+	need_wait = StartReadBuffers(&stream->ios[io_index].op,
+								 &stream->buffers[buffer_index],
+								 stream->pending_read_blocknum,
+								 &nblocks,
+								 flags);
+	stream->pinned_buffers += nblocks;
+
+	/* Remember whether we need to wait before returning this buffer. */
+	if (!need_wait)
+	{
+		/* Look-ahead distance decays, no I/O necessary (behavior A). */
+		if (stream->distance > 1)
+			stream->distance--;
+	}
+	else
+	{
+		/*
+		 * Remember to call WaitReadBuffers() before returning head buffer.
+		 * Look-ahead distance will be adjusted after waiting.
+		 */
+		stream->ios[io_index].buffer_index = buffer_index;
+		if (++stream->next_io_index == stream->max_ios)
+			stream->next_io_index = 0;
+		Assert(stream->ios_in_progress < stream->max_ios);
+		stream->ios_in_progress++;
+		stream->seq_blocknum = stream->pending_read_blocknum + nblocks;
+	}
+
+	/*
+	 * We gave a contiguous range of buffer space to StartReadBuffers(), but
+	 * we want it to wrap around at queue_size.  Slide overflowing buffers to
+	 * the front of the array.
+	 */
+	overflow = (buffer_index + nblocks) - stream->queue_size;
+	if (overflow > 0)
+		memmove(&stream->buffers[0],
+				&stream->buffers[stream->queue_size],
+				sizeof(stream->buffers[0]) * overflow);
+
+	/* Compute location of start of next read, without using % operator. */
+	buffer_index += nblocks;
+	if (buffer_index >= stream->queue_size)
+		buffer_index -= stream->queue_size;
+	Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+	stream->next_buffer_index = buffer_index;
+
+	/* Adjust the pending read to cover the remaining portion, if any. */
+	stream->pending_read_blocknum += nblocks;
+	stream->pending_read_nblocks -= nblocks;
+}
+
+static void
+read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
+{
+	while (stream->ios_in_progress < stream->max_ios &&
+		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+	{
+		BlockNumber blocknum;
+		int16		buffer_index;
+		void	   *per_buffer_data;
+
+		if (stream->pending_read_nblocks == io_combine_limit)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			continue;
+		}
+
+		/*
+		 * See which block the callback wants next in the stream.  We need to
+		 * compute the index of the Nth block of the pending read including
+		 * wrap-around, but we don't want to use the expensive % operator.
+		 */
+		buffer_index = stream->next_buffer_index + stream->pending_read_nblocks;
+		if (buffer_index >= stream->queue_size)
+			buffer_index -= stream->queue_size;
+		Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+		per_buffer_data = get_per_buffer_data(stream, buffer_index);
+		blocknum = read_stream_get_block(stream, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			stream->distance = 0;
+			break;
+		}
+
+		/* Can we merge it with the pending read? */
+		if (stream->pending_read_nblocks > 0 &&
+			stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
+		{
+			stream->pending_read_nblocks++;
+			continue;
+		}
+
+		/* We have to start the pending read before we can build another. */
+		if (stream->pending_read_nblocks > 0)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			if (stream->ios_in_progress == stream->max_ios)
+			{
+				/* And we've hit the limit.  Rewind, and stop here. */
+				read_stream_unget_block(stream, blocknum);
+				return;
+			}
+		}
+
+		/* This is the start of a new pending read. */
+		stream->pending_read_blocknum = blocknum;
+		stream->pending_read_nblocks = 1;
+	}
+
+	/*
+	 * We don't start the pending read just because we've hit the distance
+	 * limit, preferring to give it another chance to grow to full
+	 * io_combine_limit size once more buffers have been consumed.  However,
+	 * if we've already reached that size or we're at the end of the stream of
+	 * blocks, we might as well start the read immediately.
+	 */
+	if (stream->pending_read_nblocks > 0 &&
+		(stream->pending_read_nblocks == io_combine_limit ||
+		 stream->distance == 0) &&
+		stream->ios_in_progress < stream->max_ios)
+		read_stream_start_pending_read(stream, suppress_advice);
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.  The callback should return block numbers or InvalidBlockNumber to
+ * signal end-of-stream, and if per_buffer_data_size is non-zero, it may also
+ * write extra data for each block into the space provided to it.  It will
+ * also receive callback_private_data for its own purposes.
+ */
+ReadStream *
+read_stream_begin_relation(int flags,
+						   BufferAccessStrategy strategy,
+						   BufferManagerRelation bmr,
+						   ForkNumber forknum,
+						   ReadStreamBlockNumberCB callback,
+						   void *callback_private_data,
+						   size_t per_buffer_data_size)
+{
+	ReadStream *stream;
+	size_t		size;
+	int16		queue_size;
+	int16		max_ios;
+	uint32		max_pinned_buffers;
+	Oid			tablespace_id;
+
+	/* Make sure our bmr's smgr and persistent are populated. */
+	if (bmr.smgr == NULL)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	/*
+	 * Decide how many I/Os we will allow to run at the same time.  That
+	 * currently means advice to the kernel to tell it that we will soon read.
+	 * This number also affects how far we look ahead for opportunities to
+	 * start more I/Os.
+	 */
+	tablespace_id = bmr.smgr->smgr_rlocator.locator.spcOid;
+	if (!OidIsValid(MyDatabaseId) ||
+		(bmr.rel && IsCatalogRelation(bmr.rel)) ||
+		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
+	{
+		/*
+		 * Avoid circularity while trying to look up tablespace settings or
+		 * before spccache.c is ready.
+		 */
+		max_ios = effective_io_concurrency;
+	}
+	else if (flags & READ_STREAM_MAINTENANCE)
+		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+	else
+		max_ios = get_tablespace_io_concurrency(tablespace_id);
+	max_ios = Min(max_ios, PG_INT16_MAX);
+
+	/*
+	 * Choose the maximum number of buffers we're prepared to pin.  We try to
+	 * pin fewer if we can, though.  We clamp it to at least io_combine_limit
+	 * so that we can have a chance to build up a full io_combine_limit sized
+	 * read, even when max_ios is zero.  Be careful not to allow int16 to
+	 * overflow (even though that's not possible with the current GUC range
+	 * limits), allowing also for the spare entry and the overflow space.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+	max_pinned_buffers = Min(max_pinned_buffers,
+							 PG_INT16_MAX - io_combine_limit - 1);
+
+	/* Don't allow this backend to pin more than its share of buffers. */
+	if (SmgrIsTemp(bmr.smgr))
+		LimitAdditionalLocalPins(&max_pinned_buffers);
+	else
+		LimitAdditionalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * We need one extra entry for buffers and per-buffer data, because users
+	 * of per-buffer data have access to the object until the next call to
+	 * read_stream_next_buffer(), so we need a gap between the head and tail
+	 * of the queue so that we don't clobber it.
+	 */
+	queue_size = max_pinned_buffers + 1;
+
+	/*
+	 * Allocate the object, the buffers, the ios and per_data_data space in
+	 * one big chunk.  Though we have queue_size buffers, we want to be able
+	 * to assume that all the buffers for a single read are contiguous (i.e.
+	 * don't wrap around halfway through), so we allow temporary overflows of
+	 * up to the maximum possible read size by allocating an extra
+	 * io_combine_limit - 1 elements.
+	 */
+	size = offsetof(ReadStream, buffers);
+	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(InProgressIO) * Max(1, max_ios);
+	size += per_buffer_data_size * queue_size;
+	size += MAXIMUM_ALIGNOF * 2;
+	stream = (ReadStream *) palloc(size);
+	memset(stream, 0, offsetof(ReadStream, buffers));
+	stream->ios = (InProgressIO *)
+		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+	if (per_buffer_data_size > 0)
+		stream->per_buffer_data = (void *)
+			MAXALIGN(&stream->ios[Max(1, max_ios)]);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  We can use it as long as
+	 * direct I/O isn't enabled, the caller hasn't promised sequential access
+	 * (overriding our detection heuristics), and max_ios hasn't been set to
+	 * zero.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & READ_STREAM_SEQUENTIAL) == 0 &&
+		max_ios > 0)
+		stream->advice_enabled = true;
+#endif
+
+	/*
+	 * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
+	 * above.  If we had real asynchronous I/O we might need a slightly
+	 * different definition.
+	 */
+	if (max_ios == 0)
+		max_ios = 1;
+
+	stream->max_ios = max_ios;
+	stream->per_buffer_data_size = per_buffer_data_size;
+	stream->max_pinned_buffers = max_pinned_buffers;
+	stream->queue_size = queue_size;
+
+	if (!bmr.smgr)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+	stream->callback = callback;
+	stream->callback_private_data = callback_private_data;
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out assuming we'll be
+	 * doing full io_combine_limit sized reads (behavior B).
+	 */
+	if (flags & READ_STREAM_FULL)
+		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+	else
+		stream->distance = 1;
+
+	/*
+	 * Since we always currently always access the same relation, we can
+	 * initialize parts of the ReadBuffersOperation objects and leave them
+	 * that way, to avoid wasting CPU cycles writing to them for each read.
+	 */
+	for (int i = 0; i < max_ios; ++i)
+	{
+		stream->ios[i].op.bmr = bmr;
+		stream->ios[i].op.forknum = forknum;
+		stream->ios[i].op.strategy = strategy;
+	}
+
+	return stream;
+}
+
+/*
+ * Pull one pinned buffer out of a stream.  Each call returns successive
+ * blocks in the order specified by the callback.  If per_buffer_data_size was
+ * set to a non-zero size, *per_buffer_data receives a pointer to the extra
+ * per-buffer data that the callback had a chance to populate, which remains
+ * valid until the next call to read_stream_next_buffer().  When the stream
+ * runs out of data, InvalidBuffer is returned.  The caller may decide to end
+ * the stream early at any time by calling read_stream_end().
+ */
+Buffer
+read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
+{
+	Buffer		buffer;
+	int16		oldest_buffer_index;
+
+	/*
+	 * A fast path for all-cached scans (behavior A).  This is the same as the
+	 * usual algorithm, but it is specialized for no I/O and no per-buffer
+	 * data, so we can skip the queue management code, stay in the same buffer
+	 * slot and use singular StartReadBuffer().
+	 */
+	if (likely(per_buffer_data == NULL &&
+			   stream->ios_in_progress == 0 &&
+			   stream->pinned_buffers == 1 &&
+			   stream->distance == 1))
+	{
+		BlockNumber next_blocknum;
+
+		/*
+		 * We have a pinned buffer that we need to serve up, but we also want
+		 * to probe the next one before we return, just in case we need to
+		 * start an I/O.  We can re-use the same buffer slot, and an arbitrary
+		 * I/O slot since they're all free.
+		 */
+		oldest_buffer_index = stream->oldest_buffer_index;
+		Assert((oldest_buffer_index + 1) % stream->queue_size ==
+			   stream->next_buffer_index);
+		buffer = stream->buffers[oldest_buffer_index];
+		Assert(buffer != InvalidBuffer);
+		Assert(stream->pending_read_nblocks <= 1);
+		if (unlikely(stream->pending_read_nblocks == 1))
+		{
+			next_blocknum = stream->pending_read_blocknum;
+			stream->pending_read_nblocks = 0;
+		}
+		else
+			next_blocknum = read_stream_get_block(stream, NULL);
+		if (unlikely(next_blocknum == InvalidBlockNumber))
+		{
+			/* End of stream. */
+			stream->distance = 0;
+			stream->next_buffer_index = oldest_buffer_index;
+			/* Pin transferred to caller. */
+			stream->pinned_buffers = 0;
+			return buffer;
+		}
+		/* Call the special single block version, which is marginally faster. */
+		if (unlikely(StartReadBuffer(&stream->ios[0].op,
+									 &stream->buffers[oldest_buffer_index],
+									 next_blocknum,
+									 stream->advice_enabled ?
+									 READ_BUFFERS_ISSUE_ADVICE : 0)))
+		{
+			/* I/O needed.  We'll take the general path next time. */
+			stream->oldest_io_index = 0;
+			stream->next_io_index = stream->max_ios > 1 ? 1 : 0;
+			stream->ios_in_progress = 1;
+			stream->ios[0].buffer_index = oldest_buffer_index;
+			stream->seq_blocknum = next_blocknum + 1;
+			/* Increase look ahead distance (move towards behavior B/C). */
+			stream->distance = Min(2, stream->max_pinned_buffers);
+		}
+		/* Pin transferred to caller, got another one, no net change. */
+		Assert(stream->pinned_buffers == 1);
+		return buffer;
+	}
+
+	if (stream->pinned_buffers == 0)
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+
+		/* End of stream reached?  */
+		if (stream->distance == 0)
+			return InvalidBuffer;
+
+		/*
+		 * The usual order of operations is that we look ahead at the bottom
+		 * of this function after potentially finishing an I/O and making
+		 * space for more, but if we're just starting up we'll need to crank
+		 * the handle to get started.
+		 */
+		read_stream_look_ahead(stream, true);
+
+		/* End of stream reached? */
+		if (stream->pinned_buffers == 0)
+		{
+			Assert(stream->distance == 0);
+			return InvalidBuffer;
+		}
+	}
+
+	/* Grab the oldest pinned buffer and associated per-buffer data. */
+	Assert(stream->pinned_buffers > 0);
+	oldest_buffer_index = stream->oldest_buffer_index;
+	Assert(oldest_buffer_index >= 0 &&
+		   oldest_buffer_index < stream->queue_size);
+	buffer = stream->buffers[oldest_buffer_index];
+	if (per_buffer_data)
+		*per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
+
+	Assert(BufferIsValid(buffer));
+
+	/* Do we have to wait for an associated I/O first? */
+	if (stream->ios_in_progress > 0 &&
+		stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
+	{
+		int16		io_index = stream->oldest_io_index;
+		int16		distance;
+
+		/* Sanity check that we still agree on the buffers. */
+		Assert(stream->ios[io_index].op.buffers ==
+			   &stream->buffers[oldest_buffer_index]);
+
+		WaitReadBuffers(&stream->ios[io_index].op);
+
+		Assert(stream->ios_in_progress > 0);
+		stream->ios_in_progress--;
+		if (++stream->oldest_io_index == stream->max_ios)
+			stream->oldest_io_index = 0;
+
+		if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/* Distance ramps up fast (behavior C). */
+			distance = stream->distance * 2;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
+		else
+		{
+			/* No advice; move towards io_combine_limit (behavior B). */
+			if (stream->distance > io_combine_limit)
+			{
+				stream->distance--;
+			}
+			else
+			{
+				distance = stream->distance * 2;
+				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, stream->max_pinned_buffers);
+				stream->distance = distance;
+			}
+		}
+	}
+
+#ifdef CLOBBER_FREED_MEMORY
+	/* Clobber old buffer and per-buffer data for debugging purposes. */
+	stream->buffers[oldest_buffer_index] = InvalidBuffer;
+
+	/*
+	 * The caller will get access to the per-buffer data, until the next call.
+	 * We wipe the one before, which is never occupied because queue_size
+	 * allowed one extra element.  This will hopefully trip up client code
+	 * that is holding a dangling pointer to it.
+	 */
+	if (stream->per_buffer_data)
+		wipe_mem(get_per_buffer_data(stream,
+									 oldest_buffer_index == 0 ?
+									 stream->queue_size - 1 :
+									 oldest_buffer_index - 1),
+				 stream->per_buffer_data_size);
+#endif
+
+	/* Pin transferred to caller. */
+	Assert(stream->pinned_buffers > 0);
+	stream->pinned_buffers--;
+
+	/* Advance oldest buffer, with wrap-around. */
+	stream->oldest_buffer_index++;
+	if (stream->oldest_buffer_index == stream->queue_size)
+		stream->oldest_buffer_index = 0;
+
+	/* Prepare for the next call. */
+	read_stream_look_ahead(stream, false);
+
+	return buffer;
+}
+
+/*
+ * Reset a read stream by releasing any queued up buffers, allowing the stream
+ * to be used again for different blocks.  This can be used to clear an
+ * end-of-stream condition and start again, or to throw away blocks that were
+ * speculatively read and read some different blocks instead.
+ */
+void
+read_stream_reset(ReadStream *stream)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	stream->distance = 0;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(stream->pinned_buffers == 0);
+	Assert(stream->ios_in_progress == 0);
+
+	/* Start off assuming data is cached. */
+	stream->distance = 1;
+}
+
+/*
+ * Release and free a read stream.
+ */
+void
+read_stream_end(ReadStream *stream)
+{
+	read_stream_reset(stream);
+	pfree(stream);
+}
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
new file mode 100644
index 00000000000..f5dbc087b0b
--- /dev/null
+++ b/src/include/storage/read_stream.h
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.h
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/read_stream.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef READ_STREAM_H
+#define READ_STREAM_H
+
+#include "storage/bufmgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define READ_STREAM_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users, and thus should be governed by maintenance_io_concurrency
+ * instead of effective_io_concurrency.  For example, VACUUM or CREATE INDEX.
+ */
+#define READ_STREAM_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define READ_STREAM_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define READ_STREAM_FULL 0x04
+
+struct ReadStream;
+typedef struct ReadStream ReadStream;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
+												void *callback_private_data,
+												void *per_buffer_data);
+
+extern ReadStream *read_stream_begin_relation(int flags,
+											  BufferAccessStrategy strategy,
+											  BufferManagerRelation bmr,
+											  ForkNumber forknum,
+											  ReadStreamBlockNumberCB callback,
+											  void *callback_private_data,
+											  size_t per_buffer_data_size);
+extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
+extern void read_stream_reset(ReadStream *stream);
+extern void read_stream_end(ReadStream *stream);
+
+#endif							/* READ_STREAM_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 97edd1388e9..82fa6c12970 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1214,6 +1214,7 @@ InjectionPointCacheEntry
 InjectionPointEntry
 InjectionPointSharedState
 InlineCodeBlock
+InProgressIO
 InsertStmt
 Instrumentation
 Int128AggState
@@ -2292,6 +2293,7 @@ ReadExtraTocPtrType
 ReadFunc
 ReadLocalXLogPageNoWaitPrivate
 ReadReplicationSlotCmd
+ReadStream
 ReassignOwnedStmt
 RecheckForeignScan_function
 RecordCacheArrayEntry
-- 
2.39.2

v11-0003-Use-streaming-I-O-in-pg_prewarm.patchtext/x-patch; charset=US-ASCII; name=v11-0003-Use-streaming-I-O-in-pg_prewarm.patchDownload

From bda8898b476c5bfff34f43510227080d004ff2ea Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 09:28:42 +1200
Subject: [PATCH v11 3/4] Use streaming I/O in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use a ReadStream.  This
commit provides a simple example of such a transformation, and generates
fewer system calls.

Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 contrib/pg_prewarm/pg_prewarm.c | 40 ++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..f6269562b54 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -19,6 +19,7 @@
 #include "fmgr.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_read_stream_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_read_stream_next_block(ReadStream *stream,
+								  void *callback_private_data,
+								  void *per_buffer_data)
+{
+	struct pg_prewarm_read_stream_private *p = callback_private_data;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_read_stream_private p;
+		ReadStream *stream;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		stream = read_stream_begin_relation(READ_STREAM_FULL,
+											NULL,
+											BMR_REL(rel),
+											forkNumber,
+											pg_prewarm_read_stream_next_block,
+											&p,
+											0);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = read_stream_next_buffer(stream, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
 	}
 
 	/* Close relation, release lock. */
-- 
2.39.2

#49

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#48)

3 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

Small bug fix: the condition in the final test at the end of
read_stream_look_ahead() wasn't quite right. In general when looking
ahead, we don't need to start a read just because the pending read
would bring us up to stream->distance if submitted now (we'd prefer to
build it all the way up to size io_combine_limit if we can), but if
that condition is met AND we have nothing pinned yet, then there is no
chance for the read to grow bigger by a pinned buffer being consumed.
Fixed, comment updated.

Attachments:

v12-0001-Provide-vectored-variant-of-ReadBuffer.patchapplication/octet-stream; name=v12-0001-Provide-vectored-variant-of-ReadBuffer.patchDownload

From 6b66a6412c90c8f696a8b5890596ba1ab7477191 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v12 1/4] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Callers should respect the new GUC io_combine_limit, and the limit on
per-backend pins which is now exposed as a public interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of advice.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de> (optimization tweaks)
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  14 +
 src/backend/storage/buffer/bufmgr.c           | 701 ++++++++++++------
 src/backend/storage/buffer/localbuf.c         |  14 +-
 src/backend/utils/misc/guc_tables.c           |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h                  |  41 +-
 src/tools/pgindent/typedefs.list              |   1 +
 7 files changed, 564 insertions(+), 222 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5468637e2e..f3736000ad 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2719,6 +2719,20 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-io-combine-limit" xreflabel="io_combine_limit">
+       <term><varname>io_combine_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_combine_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the largest I/O size in operations that combine I/O.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
        <term><varname>max_worker_processes</varname> (<type>integer</type>)
        <indexterm>
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c..7123cbbaa2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -160,6 +165,9 @@ int			checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
 int			bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
 int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 
+/* Limit on how many blocks should be handled in single I/O operations. */
+int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
+
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
@@ -471,10 +479,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,7 +507,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -781,7 +788,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -794,15 +800,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +822,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -994,35 +993,146 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
 /*
- * ReadBuffer_common -- common logic for all ReadBuffer variants
- *
- * *hit is set to true if the request was satisfied from shared buffer cache.
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
  */
-static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
-				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
 {
 	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
+/*
+ * Pin a buffer for a given block.  *foundPtr is set to true if the block was
+ * already present, or false if more work is required to either read it in or
+ * zero it.
+ */
+static inline Buffer
+PinBufferForBlock(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
 	IOContext	io_context;
 	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
 
-	*hit = false;
+	Assert(blockNum != P_NEW);
+
+	Assert(bmr.smgr);
+
+	isLocalBuf = bmr.relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
+
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
+	{
+		VacuumPageHit++;
+		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageHit;
+
+		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
+
+	return BufferDescriptorGetBuffer(bufHdr);
+}
+
+/*
+ * ReadBuffer_common -- common logic for all ReadBuffer variants
+ */
+static inline Buffer
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
+				  BlockNumber blockNum, ReadBufferMode mode,
+				  BufferAccessStrategy strategy)
+{
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1041,181 +1151,359 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
-
-	if (isLocalBuf)
+	if (unlikely(mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+				 mode == RBM_ZERO_AND_LOCK))
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
+		bool		found;
+
+		if (bmr.smgr == NULL)
+		{
+			bmr.smgr = RelationGetSmgr(bmr.rel);
+			bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+		}
+
+		buffer = PinBufferForBlock(bmr, forkNum, blockNum, strategy, &found);
+		ZeroBuffer(buffer, mode);
+		return buffer;
 	}
+
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
 	else
+		flags = 0;
+	operation.bmr = bmr;
+	operation.forknum = forkNum;
+	operation.strategy = strategy;
+	if (StartReadBuffer(&operation,
+						&buffer,
+						blockNum,
+						flags))
+		WaitReadBuffers(&operation);
+
+	return buffer;
+}
+
+/*
+ * Single block version of the StartReadBuffers().  This might save a few
+ * instructions when called from another translation unit, if the compiler
+ * inlines the code and specializes for nblocks == 1.
+ */
+bool
+StartReadBuffer(ReadBuffersOperation *operation,
+				Buffer *buffer,
+				BlockNumber blocknum,
+				int flags)
+{
+	int			nblocks = 1;
+	bool		result;
+
+	result = StartReadBuffers(operation, buffer, blocknum, &nblocks, flags);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	return result;
+}
+
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.  Caller sets some of the
+ * members of operation; see struct definition.
+ *
+ * If false is returned, no I/O is necessary.  If true is returned, one I/O
+ * has been started, and WaitReadBuffers() must be called with the same
+ * operation object before the buffers are accessed.  Along with the operation
+ * object, the caller-supplied array of buffers must remain valid until
+ * WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+inline bool
+StartReadBuffers(ReadBuffersOperation *operation,
+				 Buffer *buffers,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 int flags)
+{
+	int			actual_nblocks = *nblocks;
+	int			io_buffers_len = 0;
+
+	Assert(*nblocks > 0);
+	Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
+
+	if (!operation->bmr.smgr)
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
+		operation->bmr.smgr = RelationGetSmgr(operation->bmr.rel);
+		operation->bmr.relpersistence = operation->bmr.rel->rd_rel->relpersistence;
 	}
 
-	/* At this point we do NOT hold any locks. */
-
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	for (int i = 0; i < actual_nblocks; ++i)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
-		VacuumPageHit++;
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		bool		found;
 
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
+		buffers[i] = PinBufferForBlock(operation->bmr,
+									   operation->forknum,
+									   blockNum + i,
+									   operation->strategy,
+									   &found);
 
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = i + 1;
+			break;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			io_buffers_len++;
+		}
+	}
+	*nblocks = actual_nblocks;
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
+	if (io_buffers_len > 0)
+	{
+		/* Populate information needed for I/O. */
+		operation->buffers = buffers;
+		operation->blocknum = blockNum;
+		operation->flags = flags;
+		operation->nblocks = actual_nblocks;
+		operation->io_buffers_len = io_buffers_len;
+
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
+			/*
+			 * In theory we should only do this if PinBufferForBlock() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(operation->bmr.smgr,
+						 operation->forknum,
+						 blockNum,
+						 operation->io_buffers_len);
 		}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
+	}
+	else
+	{
+		return false;
 	}
+}
+
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
+	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
 	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
 	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+
+	isLocalBuf = operation->bmr.relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_IO_COMBINE_LIMIT];
+		void	   *io_pages[MAX_IO_COMBINE_LIMIT];
+		instr_time	io_start;
+		BlockNumber io_first_block;
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PinBufferForBlock().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(operation->bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for PinBufferForBlock.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1511,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1286,19 +1570,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1638,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1673,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +2029,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2294,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2317,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2632,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2646,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3714,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5449,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5470,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a5..985a2c7049 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index abd9029451..313e393262 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3112,6 +3112,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"io_combine_limit",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Limit on the size of data reads and writes."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&io_combine_limit,
+		DEFAULT_IO_COMBINE_LIMIT,
+		1, MAX_IO_COMBINE_LIMIT,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
 			gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2244ee52f7..7fa6d5a64c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -203,6 +203,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d335..241f68c45e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int io_combine_limit;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,38 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+struct ReadBuffersOperation
+{
+	/* The following members should be set by the caller. */
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+	BufferAccessStrategy strategy;
+
+	/* The following private members should not be accessed directly. */
+	Buffer	   *buffers;
+	BlockNumber blocknum;
+	int			flags;
+	int16		nblocks;
+	int16		io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffer(ReadBuffersOperation *operation,
+							Buffer *buffer,
+							BlockNumber blocknum,
+							int flags);
+extern bool StartReadBuffers(ReadBuffersOperation *operation,
+							 Buffer *buffers,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 int flags);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +286,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cfa9d5aaea..97edd1388e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2286,6 +2286,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.39.3 (Apple Git-146)

v12-0002-Provide-API-for-streaming-relation-data.patchapplication/octet-stream; name=v12-0002-Provide-API-for-streaming-relation-data.patchDownload

From 9d334fba6d7d5508d426bb0b066ef5facb088ada Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Feb 2024 00:01:42 +1300
Subject: [PATCH v12 2/4] Provide API for streaming relation data.

Introduce an abstraction where relation data can be accessed as a
stream of buffers, with an implementation that is more efficient than
the equivalent sequence of ReadBuffer() calls.

Client code supplies a callback that can say which block number is
wanted next, and then consumes individual buffers one at a time from the
stream.  This division allows read_stream.c to build up large calls to
StartReadBuffers() up to io_combine_limit, and issue posix_fadvise()
advice ahead of time in a systematic way when random access is detected.

This API is based on an idea from Andres Freund to pave the way for
asynchronous I/O in future work as required to support direct I/O.  The
goal is to have an abstraction that insulates client code from future
changes to the I/O subsystem that would benefit from information about
future needs.

An extended API may be necessary in future for more complicated cases
(for example recovery, whose LsnReadQueue device in xlogprefetcher.c is
a distant cousin of this code and should eventually be replaced by
this), but this basic API is sufficient for many common usage patterns
involving predictable access to a single relation fork.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (contributions)
Author: Melanie Plageman <melanieplageman@gmail.com> (contributions)
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/Makefile          |   2 +-
 src/backend/storage/aio/Makefile      |  14 +
 src/backend/storage/aio/meson.build   |   5 +
 src/backend/storage/aio/read_stream.c | 746 ++++++++++++++++++++++++++
 src/backend/storage/meson.build       |   1 +
 src/include/storage/read_stream.h     |  63 +++
 src/tools/pgindent/typedefs.list      |   2 +
 7 files changed, 832 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/read_stream.c
 create mode 100644 src/include/storage/read_stream.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..eec03f6f2b 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 0000000000..2f29a9ec4d
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	read_stream.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 0000000000..10e1aa3b20
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'read_stream.c',
+)
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
new file mode 100644
index 0000000000..69079ed548
--- /dev/null
+++ b/src/backend/storage/aio/read_stream.c
@@ -0,0 +1,746 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.c
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ * Code that needs to access relation data typically pins blocks one at a
+ * time, often in a predictable order that might be sequential or data-driven.
+ * Calling the simple ReadBuffer() function for each block is inefficient,
+ * because blocks that are not yet in the buffer pool require I/O operations
+ * that are small and might stall waiting for storage.  This mechanism looks
+ * into the future and calls StartReadBuffers() and WaitReadBuffers() to read
+ * neighboring blocks together and ahead of time, with an adaptive look-ahead
+ * distance.
+ *
+ * A user-provided callback generates a stream of block numbers that is used
+ * to form reads of up to io_combine_limit, by attempting to merge them with a
+ * pending read.  When that isn't possible, the existing pending read is sent
+ * to StartReadBuffers() so that a new one can begin to form.
+ *
+ * The algorithm for controlling the look-ahead distance tries to classify the
+ * stream into three ideal behaviors:
+ *
+ * A) No I/O is necessary, because the requested blocks are fully cached
+ * already.  There is no benefit to looking ahead more than one block, so
+ * distance is 1.  This is the default initial assumption.
+ *
+ * B) I/O is necessary, but fadvise is undesirable because the access is
+ * sequential, or impossible because direct I/O is enabled or the system
+ * doesn't support advice.  There is no benefit in looking ahead more than
+ * io_combine_limit, because in this case only goal is larger read system
+ * calls.  Looking further ahead would pin many buffers and perform
+ * speculative work looking ahead for no benefit.
+ *
+ * C) I/O is necesssary, it appears random, and this system supports fadvise.
+ * We'll look further ahead in order to reach the configured level of I/O
+ * concurrency.
+ *
+ * The distance increases rapidly and decays slowly, so that it moves towards
+ * those levels as different I/O patterns are discovered.  For example, a
+ * sequential scan of fully cached data doesn't bother looking ahead, but a
+ * sequential scan that hits a region of uncached blocks will start issuing
+ * increasingly wide read calls until it plateaus at io_combine_limit.
+ *
+ * The main data structure is a circular queue of buffers of size
+ * max_pinned_buffers plus some extra space for technical reasons, ready to be
+ * returned by read_stream_next_buffer().  Each buffer also has an optional
+ * variable sized object that is passed from the callback to the consumer of
+ * buffers.
+ *
+ * Parallel to the queue of buffers, there is a circular queue of in-progress
+ * I/Os that have been started with StartReadBuffers(), and for which
+ * WaitReadBuffers() must be called before returning the buffer.
+ *
+ * For example, if the callback return block numbers 10, 42, 43, 60 in
+ * successive calls, then these data structures might appear as follows:
+ *
+ *                          buffers buf/data       ios
+ *
+ *                          +----+  +-----+       +--------+
+ *                          |    |  |     |  +----+ 42..44 | <- oldest_io_index
+ *                          +----+  +-----+  |    +--------+
+ *   oldest_buffer_index -> | 10 |  |  ?  |  | +--+ 60..60 |
+ *                          +----+  +-----+  | |  +--------+
+ *                          | 42 |  |  ?  |<-+ |  |        | <- next_io_index
+ *                          +----+  +-----+    |  +--------+
+ *                          | 43 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 44 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 60 |  |  ?  |<---+
+ *                          +----+  +-----+
+ *     next_buffer_index -> |    |  |     |
+ *                          +----+  +-----+
+ *
+ * In the example, 5 buffers are pinned, and the next buffer to be streamed to
+ * the client is block 10.  Block 10 was a hit and has no associated I/O, but
+ * the range 42..44 requires an I/O wait before its buffers are returned, as
+ * does block 60.
+ *
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/aio/read_stream.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "catalog/pg_tablespace.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+#include "storage/read_stream.h"
+#include "utils/memdebug.h"
+#include "utils/rel.h"
+#include "utils/spccache.h"
+
+typedef struct InProgressIO
+{
+	int16		buffer_index;
+	ReadBuffersOperation op;
+} InProgressIO;
+
+/*
+ * State for managing a stream of reads.
+ */
+struct ReadStream
+{
+	int16		max_ios;
+	int16		ios_in_progress;
+	int16		queue_size;
+	int16		max_pinned_buffers;
+	int16		pinned_buffers;
+	int16		distance;
+	bool		advice_enabled;
+
+	/*
+	 * Sometimes we need to be able to 'unget' a block number to resolve a
+	 * flow control problem when I/Os are split.
+	 */
+	BlockNumber unget_blocknum;
+	bool		have_unget_blocknum;
+
+	/*
+	 * The callback that will tell us which block numbers to read, and an
+	 * opaque pointer that will be pass to it for its own purposes.
+	 */
+	ReadStreamBlockNumberCB callback;
+	void	   *callback_private_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* The read operation we are currently preparing. */
+	BlockNumber pending_read_blocknum;
+	int16		pending_read_nblocks;
+
+	/* Space for buffers and optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Read operations that have been started but not waited for yet. */
+	InProgressIO *ios;
+	int16		oldest_io_index;
+	int16		next_io_index;
+
+	/* Circular queue of buffers. */
+	int16		oldest_buffer_index;	/* Next pinned buffer to return */
+	int16		next_buffer_index;	/* Index of next buffer to pin */
+	Buffer		buffers[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static inline void *
+get_per_buffer_data(ReadStream *stream, int16 buffer_index)
+{
+	return (char *) stream->per_buffer_data +
+		stream->per_buffer_data_size * buffer_index;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow streaming_unget_block() to work.
+ */
+static inline BlockNumber
+read_stream_get_block(ReadStream *stream, void *per_buffer_data)
+{
+	if (!stream->have_unget_blocknum)
+		return stream->callback(stream,
+								stream->callback_private_data,
+								per_buffer_data);
+
+	/*
+	 * You can only unget one block, and next_buffer_index can't change across
+	 * a get, unget, get sequence, so the callback's per_buffer_data, if any,
+	 * is still present in the correct slot.  We just have to return the
+	 * previous block number.
+	 */
+	stream->have_unget_blocknum = false;
+	return stream->unget_blocknum;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.
+ */
+static inline void
+read_stream_unget_block(ReadStream *stream, BlockNumber blocknum)
+{
+	Assert(!stream->have_unget_blocknum);
+	stream->have_unget_blocknum = true;
+	stream->unget_blocknum = blocknum;
+}
+
+static void
+read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
+{
+	bool		need_wait;
+	int			nblocks;
+	int			flags;
+	int16		io_index;
+	int16		overflow;
+	int16		buffer_index;
+
+	/* This should only be called with a pending read. */
+	Assert(stream->pending_read_nblocks > 0);
+	Assert(stream->pending_read_nblocks <= io_combine_limit);
+
+	/* We had better not exceed the pin limit by starting this read. */
+	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+		   stream->max_pinned_buffers);
+
+	/* We had better not be overwriting an existing pinned buffer. */
+	if (stream->pinned_buffers > 0)
+		Assert(stream->next_buffer_index != stream->oldest_buffer_index);
+	else
+		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
+
+	/*
+	 * If advice hasn't been suppressed, this system supports it, and this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (!suppress_advice &&
+		stream->advice_enabled &&
+		stream->pending_read_blocknum != stream->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+	/* We say how many blocks we want to read, but may be smaller on return. */
+	buffer_index = stream->next_buffer_index;
+	io_index = stream->next_io_index;
+	nblocks = stream->pending_read_nblocks;
+	need_wait = StartReadBuffers(&stream->ios[io_index].op,
+								 &stream->buffers[buffer_index],
+								 stream->pending_read_blocknum,
+								 &nblocks,
+								 flags);
+	stream->pinned_buffers += nblocks;
+
+	/* Remember whether we need to wait before returning this buffer. */
+	if (!need_wait)
+	{
+		/* Look-ahead distance decays, no I/O necessary (behavior A). */
+		if (stream->distance > 1)
+			stream->distance--;
+	}
+	else
+	{
+		/*
+		 * Remember to call WaitReadBuffers() before returning head buffer.
+		 * Look-ahead distance will be adjusted after waiting.
+		 */
+		stream->ios[io_index].buffer_index = buffer_index;
+		if (++stream->next_io_index == stream->max_ios)
+			stream->next_io_index = 0;
+		Assert(stream->ios_in_progress < stream->max_ios);
+		stream->ios_in_progress++;
+		stream->seq_blocknum = stream->pending_read_blocknum + nblocks;
+	}
+
+	/*
+	 * We gave a contiguous range of buffer space to StartReadBuffers(), but
+	 * we want it to wrap around at queue_size.  Slide overflowing buffers to
+	 * the front of the array.
+	 */
+	overflow = (buffer_index + nblocks) - stream->queue_size;
+	if (overflow > 0)
+		memmove(&stream->buffers[0],
+				&stream->buffers[stream->queue_size],
+				sizeof(stream->buffers[0]) * overflow);
+
+	/* Compute location of start of next read, without using % operator. */
+	buffer_index += nblocks;
+	if (buffer_index >= stream->queue_size)
+		buffer_index -= stream->queue_size;
+	Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+	stream->next_buffer_index = buffer_index;
+
+	/* Adjust the pending read to cover the remaining portion, if any. */
+	stream->pending_read_blocknum += nblocks;
+	stream->pending_read_nblocks -= nblocks;
+}
+
+static void
+read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
+{
+	while (stream->ios_in_progress < stream->max_ios &&
+		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+	{
+		BlockNumber blocknum;
+		int16		buffer_index;
+		void	   *per_buffer_data;
+
+		if (stream->pending_read_nblocks == io_combine_limit)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			continue;
+		}
+
+		/*
+		 * See which block the callback wants next in the stream.  We need to
+		 * compute the index of the Nth block of the pending read including
+		 * wrap-around, but we don't want to use the expensive % operator.
+		 */
+		buffer_index = stream->next_buffer_index + stream->pending_read_nblocks;
+		if (buffer_index >= stream->queue_size)
+			buffer_index -= stream->queue_size;
+		Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+		per_buffer_data = get_per_buffer_data(stream, buffer_index);
+		blocknum = read_stream_get_block(stream, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			stream->distance = 0;
+			break;
+		}
+
+		/* Can we merge it with the pending read? */
+		if (stream->pending_read_nblocks > 0 &&
+			stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
+		{
+			stream->pending_read_nblocks++;
+			continue;
+		}
+
+		/* We have to start the pending read before we can build another. */
+		if (stream->pending_read_nblocks > 0)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			if (stream->ios_in_progress == stream->max_ios)
+			{
+				/* And we've hit the limit.  Rewind, and stop here. */
+				read_stream_unget_block(stream, blocknum);
+				return;
+			}
+		}
+
+		/* This is the start of a new pending read. */
+		stream->pending_read_blocknum = blocknum;
+		stream->pending_read_nblocks = 1;
+	}
+
+	/*
+	 * We don't start the pending read just because we've hit the distance
+	 * limit, preferring to give it another chance to grow to full
+	 * io_combine_limit size once more buffers have been consumed.  However,
+	 * if we've already reached that size or we're at the end of the stream of
+	 * blocks, we might as well start the read immediately.
+	 */
+	if (stream->pending_read_nblocks > 0 &&
+		(stream->pending_read_nblocks == io_combine_limit ||
+		 stream->distance == 0) &&
+		stream->ios_in_progress < stream->max_ios)
+		read_stream_start_pending_read(stream, suppress_advice);
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.  The callback should return block numbers or InvalidBlockNumber to
+ * signal end-of-stream, and if per_buffer_data_size is non-zero, it may also
+ * write extra data for each block into the space provided to it.  It will
+ * also receive callback_private_data for its own purposes.
+ */
+ReadStream *
+read_stream_begin_relation(int flags,
+						   BufferAccessStrategy strategy,
+						   BufferManagerRelation bmr,
+						   ForkNumber forknum,
+						   ReadStreamBlockNumberCB callback,
+						   void *callback_private_data,
+						   size_t per_buffer_data_size)
+{
+	ReadStream *stream;
+	size_t		size;
+	int16		queue_size;
+	int16		max_ios;
+	uint32		max_pinned_buffers;
+	Oid			tablespace_id;
+
+	/* Make sure our bmr's smgr and persistent are populated. */
+	if (bmr.smgr == NULL)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	/*
+	 * Decide how many I/Os we will allow to run at the same time.  That
+	 * currently means advice to the kernel to tell it that we will soon read.
+	 * This number also affects how far we look ahead for opportunities to
+	 * start more I/Os.
+	 */
+	tablespace_id = bmr.smgr->smgr_rlocator.locator.spcOid;
+	if (!OidIsValid(MyDatabaseId) ||
+		(bmr.rel && IsCatalogRelation(bmr.rel)) ||
+		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
+	{
+		/*
+		 * Avoid circularity while trying to look up tablespace settings or
+		 * before spccache.c is ready.
+		 */
+		max_ios = effective_io_concurrency;
+	}
+	else if (flags & READ_STREAM_MAINTENANCE)
+		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+	else
+		max_ios = get_tablespace_io_concurrency(tablespace_id);
+	max_ios = Min(max_ios, PG_INT16_MAX);
+
+	/*
+	 * Choose the maximum number of buffers we're prepared to pin.  We try to
+	 * pin fewer if we can, though.  We clamp it to at least io_combine_limit
+	 * so that we can have a chance to build up a full io_combine_limit sized
+	 * read, even when max_ios is zero.  Be careful not to allow int16 to
+	 * overflow (even though that's not possible with the current GUC range
+	 * limits), allowing also for the spare entry and the overflow space.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+	max_pinned_buffers = Min(max_pinned_buffers,
+							 PG_INT16_MAX - io_combine_limit - 1);
+
+	/* Don't allow this backend to pin more than its share of buffers. */
+	if (SmgrIsTemp(bmr.smgr))
+		LimitAdditionalLocalPins(&max_pinned_buffers);
+	else
+		LimitAdditionalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * We need one extra entry for buffers and per-buffer data, because users
+	 * of per-buffer data have access to the object until the next call to
+	 * read_stream_next_buffer(), so we need a gap between the head and tail
+	 * of the queue so that we don't clobber it.
+	 */
+	queue_size = max_pinned_buffers + 1;
+
+	/*
+	 * Allocate the object, the buffers, the ios and per_data_data space in
+	 * one big chunk.  Though we have queue_size buffers, we want to be able
+	 * to assume that all the buffers for a single read are contiguous (i.e.
+	 * don't wrap around halfway through), so we allow temporary overflows of
+	 * up to the maximum possible read size by allocating an extra
+	 * io_combine_limit - 1 elements.
+	 */
+	size = offsetof(ReadStream, buffers);
+	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(InProgressIO) * Max(1, max_ios);
+	size += per_buffer_data_size * queue_size;
+	size += MAXIMUM_ALIGNOF * 2;
+	stream = (ReadStream *) palloc(size);
+	memset(stream, 0, offsetof(ReadStream, buffers));
+	stream->ios = (InProgressIO *)
+		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+	if (per_buffer_data_size > 0)
+		stream->per_buffer_data = (void *)
+			MAXALIGN(&stream->ios[Max(1, max_ios)]);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  We can use it as long as
+	 * direct I/O isn't enabled, the caller hasn't promised sequential access
+	 * (overriding our detection heuristics), and max_ios hasn't been set to
+	 * zero.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & READ_STREAM_SEQUENTIAL) == 0 &&
+		max_ios > 0)
+		stream->advice_enabled = true;
+#endif
+
+	/*
+	 * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
+	 * above.  If we had real asynchronous I/O we might need a slightly
+	 * different definition.
+	 */
+	if (max_ios == 0)
+		max_ios = 1;
+
+	stream->max_ios = max_ios;
+	stream->per_buffer_data_size = per_buffer_data_size;
+	stream->max_pinned_buffers = max_pinned_buffers;
+	stream->queue_size = queue_size;
+
+	if (!bmr.smgr)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+	stream->callback = callback;
+	stream->callback_private_data = callback_private_data;
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out assuming we'll be
+	 * doing full io_combine_limit sized reads (behavior B).
+	 */
+	if (flags & READ_STREAM_FULL)
+		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+	else
+		stream->distance = 1;
+
+	/*
+	 * Since we always currently always access the same relation, we can
+	 * initialize parts of the ReadBuffersOperation objects and leave them
+	 * that way, to avoid wasting CPU cycles writing to them for each read.
+	 */
+	for (int i = 0; i < max_ios; ++i)
+	{
+		stream->ios[i].op.bmr = bmr;
+		stream->ios[i].op.forknum = forknum;
+		stream->ios[i].op.strategy = strategy;
+	}
+
+	return stream;
+}
+
+/*
+ * Pull one pinned buffer out of a stream.  Each call returns successive
+ * blocks in the order specified by the callback.  If per_buffer_data_size was
+ * set to a non-zero size, *per_buffer_data receives a pointer to the extra
+ * per-buffer data that the callback had a chance to populate, which remains
+ * valid until the next call to read_stream_next_buffer().  When the stream
+ * runs out of data, InvalidBuffer is returned.  The caller may decide to end
+ * the stream early at any time by calling read_stream_end().
+ */
+Buffer
+read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
+{
+	Buffer		buffer;
+	int16		oldest_buffer_index;
+
+	/*
+	 * A fast path for all-cached scans (behavior A).  This is the same as the
+	 * usual algorithm, but it is specialized for no I/O and no per-buffer
+	 * data, so we can skip the queue management code, stay in the same buffer
+	 * slot and use singular StartReadBuffer().
+	 */
+	if (likely(per_buffer_data == NULL &&
+			   stream->ios_in_progress == 0 &&
+			   stream->pinned_buffers == 1 &&
+			   stream->distance == 1))
+	{
+		BlockNumber next_blocknum;
+
+		/*
+		 * We have a pinned buffer that we need to serve up, but we also want
+		 * to probe the next one before we return, just in case we need to
+		 * start an I/O.  We can re-use the same buffer slot, and an arbitrary
+		 * I/O slot since they're all free.
+		 */
+		oldest_buffer_index = stream->oldest_buffer_index;
+		Assert((oldest_buffer_index + 1) % stream->queue_size ==
+			   stream->next_buffer_index);
+		buffer = stream->buffers[oldest_buffer_index];
+		Assert(buffer != InvalidBuffer);
+		Assert(stream->pending_read_nblocks <= 1);
+		if (unlikely(stream->pending_read_nblocks == 1))
+		{
+			next_blocknum = stream->pending_read_blocknum;
+			stream->pending_read_nblocks = 0;
+		}
+		else
+			next_blocknum = read_stream_get_block(stream, NULL);
+		if (unlikely(next_blocknum == InvalidBlockNumber))
+		{
+			/* End of stream. */
+			stream->distance = 0;
+			stream->next_buffer_index = oldest_buffer_index;
+			/* Pin transferred to caller. */
+			stream->pinned_buffers = 0;
+			return buffer;
+		}
+		/* Call the special single block version, which is marginally faster. */
+		if (unlikely(StartReadBuffer(&stream->ios[0].op,
+									 &stream->buffers[oldest_buffer_index],
+									 next_blocknum,
+									 stream->advice_enabled ?
+									 READ_BUFFERS_ISSUE_ADVICE : 0)))
+		{
+			/* I/O needed.  We'll take the general path next time. */
+			stream->oldest_io_index = 0;
+			stream->next_io_index = stream->max_ios > 1 ? 1 : 0;
+			stream->ios_in_progress = 1;
+			stream->ios[0].buffer_index = oldest_buffer_index;
+			stream->seq_blocknum = next_blocknum + 1;
+			/* Increase look ahead distance (move towards behavior B/C). */
+			stream->distance = Min(2, stream->max_pinned_buffers);
+		}
+		/* Pin transferred to caller, got another one, no net change. */
+		Assert(stream->pinned_buffers == 1);
+		return buffer;
+	}
+
+	if (stream->pinned_buffers == 0)
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+
+		/* End of stream reached?  */
+		if (stream->distance == 0)
+			return InvalidBuffer;
+
+		/*
+		 * The usual order of operations is that we look ahead at the bottom
+		 * of this function after potentially finishing an I/O and making
+		 * space for more, but if we're just starting up we'll need to crank
+		 * the handle to get started.
+		 */
+		read_stream_look_ahead(stream, true);
+
+		/* End of stream reached? */
+		if (stream->pinned_buffers == 0)
+		{
+			Assert(stream->distance == 0);
+			return InvalidBuffer;
+		}
+	}
+
+	/* Grab the oldest pinned buffer and associated per-buffer data. */
+	Assert(stream->pinned_buffers > 0);
+	oldest_buffer_index = stream->oldest_buffer_index;
+	Assert(oldest_buffer_index >= 0 &&
+		   oldest_buffer_index < stream->queue_size);
+	buffer = stream->buffers[oldest_buffer_index];
+	if (per_buffer_data)
+		*per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
+
+	Assert(BufferIsValid(buffer));
+
+	/* Do we have to wait for an associated I/O first? */
+	if (stream->ios_in_progress > 0 &&
+		stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
+	{
+		int16		io_index = stream->oldest_io_index;
+		int16		distance;
+
+		/* Sanity check that we still agree on the buffers. */
+		Assert(stream->ios[io_index].op.buffers ==
+			   &stream->buffers[oldest_buffer_index]);
+
+		WaitReadBuffers(&stream->ios[io_index].op);
+
+		Assert(stream->ios_in_progress > 0);
+		stream->ios_in_progress--;
+		if (++stream->oldest_io_index == stream->max_ios)
+			stream->oldest_io_index = 0;
+
+		if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/* Distance ramps up fast (behavior C). */
+			distance = stream->distance * 2;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
+		else
+		{
+			/* No advice; move towards io_combine_limit (behavior B). */
+			if (stream->distance > io_combine_limit)
+			{
+				stream->distance--;
+			}
+			else
+			{
+				distance = stream->distance * 2;
+				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, stream->max_pinned_buffers);
+				stream->distance = distance;
+			}
+		}
+	}
+
+#ifdef CLOBBER_FREED_MEMORY
+	/* Clobber old buffer and per-buffer data for debugging purposes. */
+	stream->buffers[oldest_buffer_index] = InvalidBuffer;
+
+	/*
+	 * The caller will get access to the per-buffer data, until the next call.
+	 * We wipe the one before, which is never occupied because queue_size
+	 * allowed one extra element.  This will hopefully trip up client code
+	 * that is holding a dangling pointer to it.
+	 */
+	if (stream->per_buffer_data)
+		wipe_mem(get_per_buffer_data(stream,
+									 oldest_buffer_index == 0 ?
+									 stream->queue_size - 1 :
+									 oldest_buffer_index - 1),
+				 stream->per_buffer_data_size);
+#endif
+
+	/* Pin transferred to caller. */
+	Assert(stream->pinned_buffers > 0);
+	stream->pinned_buffers--;
+
+	/* Advance oldest buffer, with wrap-around. */
+	stream->oldest_buffer_index++;
+	if (stream->oldest_buffer_index == stream->queue_size)
+		stream->oldest_buffer_index = 0;
+
+	/* Prepare for the next call. */
+	read_stream_look_ahead(stream, false);
+
+	return buffer;
+}
+
+/*
+ * Reset a read stream by releasing any queued up buffers, allowing the stream
+ * to be used again for different blocks.  This can be used to clear an
+ * end-of-stream condition and start again, or to throw away blocks that were
+ * speculatively read and read some different blocks instead.
+ */
+void
+read_stream_reset(ReadStream *stream)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	stream->distance = 0;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(stream->pinned_buffers == 0);
+	Assert(stream->ios_in_progress == 0);
+
+	/* Start off assuming data is cached. */
+	stream->distance = 1;
+}
+
+/*
+ * Release and free a read stream.
+ */
+void
+read_stream_end(ReadStream *stream)
+{
+	read_stream_reset(stream);
+	pfree(stream);
+}
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca2..739d13293f 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
new file mode 100644
index 0000000000..f5dbc087b0
--- /dev/null
+++ b/src/include/storage/read_stream.h
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.h
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/read_stream.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef READ_STREAM_H
+#define READ_STREAM_H
+
+#include "storage/bufmgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define READ_STREAM_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users, and thus should be governed by maintenance_io_concurrency
+ * instead of effective_io_concurrency.  For example, VACUUM or CREATE INDEX.
+ */
+#define READ_STREAM_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define READ_STREAM_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define READ_STREAM_FULL 0x04
+
+struct ReadStream;
+typedef struct ReadStream ReadStream;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
+												void *callback_private_data,
+												void *per_buffer_data);
+
+extern ReadStream *read_stream_begin_relation(int flags,
+											  BufferAccessStrategy strategy,
+											  BufferManagerRelation bmr,
+											  ForkNumber forknum,
+											  ReadStreamBlockNumberCB callback,
+											  void *callback_private_data,
+											  size_t per_buffer_data_size);
+extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
+extern void read_stream_reset(ReadStream *stream);
+extern void read_stream_end(ReadStream *stream);
+
+#endif							/* READ_STREAM_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 97edd1388e..82fa6c1297 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1214,6 +1214,7 @@ InjectionPointCacheEntry
 InjectionPointEntry
 InjectionPointSharedState
 InlineCodeBlock
+InProgressIO
 InsertStmt
 Instrumentation
 Int128AggState
@@ -2292,6 +2293,7 @@ ReadExtraTocPtrType
 ReadFunc
 ReadLocalXLogPageNoWaitPrivate
 ReadReplicationSlotCmd
+ReadStream
 ReassignOwnedStmt
 RecheckForeignScan_function
 RecordCacheArrayEntry
-- 
2.39.3 (Apple Git-146)

v12-0003-Use-streaming-I-O-in-pg_prewarm.patchapplication/octet-stream; name=v12-0003-Use-streaming-I-O-in-pg_prewarm.patchDownload

From bda8898b476c5bfff34f43510227080d004ff2ea Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 09:28:42 +1200
Subject: [PATCH v12 3/4] Use streaming I/O in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use a ReadStream.  This
commit provides a simple example of such a transformation, and generates
fewer system calls.

Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 contrib/pg_prewarm/pg_prewarm.c | 40 ++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e4..f6269562b5 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -19,6 +19,7 @@
 #include "fmgr.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_read_stream_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_read_stream_next_block(ReadStream *stream,
+								  void *callback_private_data,
+								  void *per_buffer_data)
+{
+	struct pg_prewarm_read_stream_private *p = callback_private_data;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_read_stream_private p;
+		ReadStream *stream;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		stream = read_stream_begin_relation(READ_STREAM_FULL,
+											NULL,
+											BMR_REL(rel),
+											forkNumber,
+											pg_prewarm_read_stream_next_block,
+											&p,
+											0);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = read_stream_next_buffer(stream, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
 	}
 
 	/* Close relation, release lock. */
-- 
2.39.3 (Apple Git-146)

#50

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#49)

3 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

On Fri, Mar 29, 2024 at 12:06 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Small bug fix: the condition in the final test at the end of
read_stream_look_ahead() wasn't quite right. In general when looking
ahead, we don't need to start a read just because the pending read
would bring us up to stream->distance if submitted now (we'd prefer to
build it all the way up to size io_combine_limit if we can), but if
that condition is met AND we have nothing pinned yet, then there is no
chance for the read to grow bigger by a pinned buffer being consumed.
Fixed, comment updated.

Oops, I sent the wrong/unfixed version. This version has the fix
described above.

Attachments:

v13-0001-Provide-vectored-variant-of-ReadBuffer.patchapplication/octet-stream; name=v13-0001-Provide-vectored-variant-of-ReadBuffer.patchDownload

From a9f7fcbe42854fa57c65a85cc5edac3dac103db7 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v13 1/4] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Callers should respect the new GUC io_combine_limit, and the limit on
per-backend pins which is now exposed as a public interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of advice.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de> (optimization tweaks)
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  14 +
 src/backend/storage/buffer/bufmgr.c           | 701 ++++++++++++------
 src/backend/storage/buffer/localbuf.c         |  14 +-
 src/backend/utils/misc/guc_tables.c           |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h                  |  41 +-
 src/tools/pgindent/typedefs.list              |   1 +
 7 files changed, 564 insertions(+), 222 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5468637e2e..f3736000ad 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2719,6 +2719,20 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-io-combine-limit" xreflabel="io_combine_limit">
+       <term><varname>io_combine_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_combine_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the largest I/O size in operations that combine I/O.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
        <term><varname>max_worker_processes</varname> (<type>integer</type>)
        <indexterm>
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c..7123cbbaa2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -160,6 +165,9 @@ int			checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
 int			bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
 int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 
+/* Limit on how many blocks should be handled in single I/O operations. */
+int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
+
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
@@ -471,10 +479,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,7 +507,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -781,7 +788,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -794,15 +800,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +822,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -994,35 +993,146 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
 /*
- * ReadBuffer_common -- common logic for all ReadBuffer variants
- *
- * *hit is set to true if the request was satisfied from shared buffer cache.
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
  */
-static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
-				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
 {
 	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
+/*
+ * Pin a buffer for a given block.  *foundPtr is set to true if the block was
+ * already present, or false if more work is required to either read it in or
+ * zero it.
+ */
+static inline Buffer
+PinBufferForBlock(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
 	IOContext	io_context;
 	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
 
-	*hit = false;
+	Assert(blockNum != P_NEW);
+
+	Assert(bmr.smgr);
+
+	isLocalBuf = bmr.relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
+
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
+	{
+		VacuumPageHit++;
+		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageHit;
+
+		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
+
+	return BufferDescriptorGetBuffer(bufHdr);
+}
+
+/*
+ * ReadBuffer_common -- common logic for all ReadBuffer variants
+ */
+static inline Buffer
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
+				  BlockNumber blockNum, ReadBufferMode mode,
+				  BufferAccessStrategy strategy)
+{
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1041,181 +1151,359 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
-
-	if (isLocalBuf)
+	if (unlikely(mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+				 mode == RBM_ZERO_AND_LOCK))
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
+		bool		found;
+
+		if (bmr.smgr == NULL)
+		{
+			bmr.smgr = RelationGetSmgr(bmr.rel);
+			bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+		}
+
+		buffer = PinBufferForBlock(bmr, forkNum, blockNum, strategy, &found);
+		ZeroBuffer(buffer, mode);
+		return buffer;
 	}
+
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
 	else
+		flags = 0;
+	operation.bmr = bmr;
+	operation.forknum = forkNum;
+	operation.strategy = strategy;
+	if (StartReadBuffer(&operation,
+						&buffer,
+						blockNum,
+						flags))
+		WaitReadBuffers(&operation);
+
+	return buffer;
+}
+
+/*
+ * Single block version of the StartReadBuffers().  This might save a few
+ * instructions when called from another translation unit, if the compiler
+ * inlines the code and specializes for nblocks == 1.
+ */
+bool
+StartReadBuffer(ReadBuffersOperation *operation,
+				Buffer *buffer,
+				BlockNumber blocknum,
+				int flags)
+{
+	int			nblocks = 1;
+	bool		result;
+
+	result = StartReadBuffers(operation, buffer, blocknum, &nblocks, flags);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	return result;
+}
+
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.  Caller sets some of the
+ * members of operation; see struct definition.
+ *
+ * If false is returned, no I/O is necessary.  If true is returned, one I/O
+ * has been started, and WaitReadBuffers() must be called with the same
+ * operation object before the buffers are accessed.  Along with the operation
+ * object, the caller-supplied array of buffers must remain valid until
+ * WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+inline bool
+StartReadBuffers(ReadBuffersOperation *operation,
+				 Buffer *buffers,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 int flags)
+{
+	int			actual_nblocks = *nblocks;
+	int			io_buffers_len = 0;
+
+	Assert(*nblocks > 0);
+	Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
+
+	if (!operation->bmr.smgr)
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
+		operation->bmr.smgr = RelationGetSmgr(operation->bmr.rel);
+		operation->bmr.relpersistence = operation->bmr.rel->rd_rel->relpersistence;
 	}
 
-	/* At this point we do NOT hold any locks. */
-
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	for (int i = 0; i < actual_nblocks; ++i)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
-		VacuumPageHit++;
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		bool		found;
 
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
+		buffers[i] = PinBufferForBlock(operation->bmr,
+									   operation->forknum,
+									   blockNum + i,
+									   operation->strategy,
+									   &found);
 
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = i + 1;
+			break;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			io_buffers_len++;
+		}
+	}
+	*nblocks = actual_nblocks;
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
+	if (io_buffers_len > 0)
+	{
+		/* Populate information needed for I/O. */
+		operation->buffers = buffers;
+		operation->blocknum = blockNum;
+		operation->flags = flags;
+		operation->nblocks = actual_nblocks;
+		operation->io_buffers_len = io_buffers_len;
+
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
+			/*
+			 * In theory we should only do this if PinBufferForBlock() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(operation->bmr.smgr,
+						 operation->forknum,
+						 blockNum,
+						 operation->io_buffers_len);
 		}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
+	}
+	else
+	{
+		return false;
 	}
+}
+
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
+	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
 	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
 	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+
+	isLocalBuf = operation->bmr.relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_IO_COMBINE_LIMIT];
+		void	   *io_pages[MAX_IO_COMBINE_LIMIT];
+		instr_time	io_start;
+		BlockNumber io_first_block;
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PinBufferForBlock().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(operation->bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for PinBufferForBlock.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1511,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1286,19 +1570,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1638,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1673,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +2029,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2294,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2317,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2632,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2646,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3714,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5449,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5470,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a5..985a2c7049 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index abd9029451..313e393262 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3112,6 +3112,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"io_combine_limit",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Limit on the size of data reads and writes."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&io_combine_limit,
+		DEFAULT_IO_COMBINE_LIMIT,
+		1, MAX_IO_COMBINE_LIMIT,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
 			gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2244ee52f7..7fa6d5a64c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -203,6 +203,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d335..241f68c45e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int io_combine_limit;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,38 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+struct ReadBuffersOperation
+{
+	/* The following members should be set by the caller. */
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+	BufferAccessStrategy strategy;
+
+	/* The following private members should not be accessed directly. */
+	Buffer	   *buffers;
+	BlockNumber blocknum;
+	int			flags;
+	int16		nblocks;
+	int16		io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffer(ReadBuffersOperation *operation,
+							Buffer *buffer,
+							BlockNumber blocknum,
+							int flags);
+extern bool StartReadBuffers(ReadBuffersOperation *operation,
+							 Buffer *buffers,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 int flags);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +286,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cfa9d5aaea..97edd1388e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2286,6 +2286,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.39.3 (Apple Git-146)

v13-0002-Provide-API-for-streaming-relation-data.patchapplication/octet-stream; name=v13-0002-Provide-API-for-streaming-relation-data.patchDownload

From 948caab48ac94703fd07ed3ea497b1d63f7e9476 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Feb 2024 00:01:42 +1300
Subject: [PATCH v13 2/4] Provide API for streaming relation data.

Introduce an abstraction where relation data can be accessed as a
stream of buffers, with an implementation that is more efficient than
the equivalent sequence of ReadBuffer() calls.

Client code supplies a callback that can say which block number is
wanted next, and then consumes individual buffers one at a time from the
stream.  This division allows read_stream.c to build up large calls to
StartReadBuffers() up to io_combine_limit, and issue posix_fadvise()
advice ahead of time in a systematic way when random access is detected.

This API is based on an idea from Andres Freund to pave the way for
asynchronous I/O in future work as required to support direct I/O.  The
goal is to have an abstraction that insulates client code from future
changes to the I/O subsystem that would benefit from information about
future needs.

An extended API may be necessary in future for more complicated cases
(for example recovery, whose LsnReadQueue device in xlogprefetcher.c is
a distant cousin of this code and should eventually be replaced by
this), but this basic API is sufficient for many common usage patterns
involving predictable access to a single relation fork.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (contributions)
Author: Melanie Plageman <melanieplageman@gmail.com> (contributions)
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/Makefile          |   2 +-
 src/backend/storage/aio/Makefile      |  14 +
 src/backend/storage/aio/meson.build   |   5 +
 src/backend/storage/aio/read_stream.c | 750 ++++++++++++++++++++++++++
 src/backend/storage/meson.build       |   1 +
 src/include/storage/read_stream.h     |  63 +++
 src/tools/pgindent/typedefs.list      |   2 +
 7 files changed, 836 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/read_stream.c
 create mode 100644 src/include/storage/read_stream.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca2..eec03f6f2b 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 0000000000..2f29a9ec4d
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	read_stream.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 0000000000..10e1aa3b20
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'read_stream.c',
+)
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
new file mode 100644
index 0000000000..6b4b97372c
--- /dev/null
+++ b/src/backend/storage/aio/read_stream.c
@@ -0,0 +1,750 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.c
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ * Code that needs to access relation data typically pins blocks one at a
+ * time, often in a predictable order that might be sequential or data-driven.
+ * Calling the simple ReadBuffer() function for each block is inefficient,
+ * because blocks that are not yet in the buffer pool require I/O operations
+ * that are small and might stall waiting for storage.  This mechanism looks
+ * into the future and calls StartReadBuffers() and WaitReadBuffers() to read
+ * neighboring blocks together and ahead of time, with an adaptive look-ahead
+ * distance.
+ *
+ * A user-provided callback generates a stream of block numbers that is used
+ * to form reads of up to io_combine_limit, by attempting to merge them with a
+ * pending read.  When that isn't possible, the existing pending read is sent
+ * to StartReadBuffers() so that a new one can begin to form.
+ *
+ * The algorithm for controlling the look-ahead distance tries to classify the
+ * stream into three ideal behaviors:
+ *
+ * A) No I/O is necessary, because the requested blocks are fully cached
+ * already.  There is no benefit to looking ahead more than one block, so
+ * distance is 1.  This is the default initial assumption.
+ *
+ * B) I/O is necessary, but fadvise is undesirable because the access is
+ * sequential, or impossible because direct I/O is enabled or the system
+ * doesn't support advice.  There is no benefit in looking ahead more than
+ * io_combine_limit, because in this case only goal is larger read system
+ * calls.  Looking further ahead would pin many buffers and perform
+ * speculative work looking ahead for no benefit.
+ *
+ * C) I/O is necesssary, it appears random, and this system supports fadvise.
+ * We'll look further ahead in order to reach the configured level of I/O
+ * concurrency.
+ *
+ * The distance increases rapidly and decays slowly, so that it moves towards
+ * those levels as different I/O patterns are discovered.  For example, a
+ * sequential scan of fully cached data doesn't bother looking ahead, but a
+ * sequential scan that hits a region of uncached blocks will start issuing
+ * increasingly wide read calls until it plateaus at io_combine_limit.
+ *
+ * The main data structure is a circular queue of buffers of size
+ * max_pinned_buffers plus some extra space for technical reasons, ready to be
+ * returned by read_stream_next_buffer().  Each buffer also has an optional
+ * variable sized object that is passed from the callback to the consumer of
+ * buffers.
+ *
+ * Parallel to the queue of buffers, there is a circular queue of in-progress
+ * I/Os that have been started with StartReadBuffers(), and for which
+ * WaitReadBuffers() must be called before returning the buffer.
+ *
+ * For example, if the callback return block numbers 10, 42, 43, 60 in
+ * successive calls, then these data structures might appear as follows:
+ *
+ *                          buffers buf/data       ios
+ *
+ *                          +----+  +-----+       +--------+
+ *                          |    |  |     |  +----+ 42..44 | <- oldest_io_index
+ *                          +----+  +-----+  |    +--------+
+ *   oldest_buffer_index -> | 10 |  |  ?  |  | +--+ 60..60 |
+ *                          +----+  +-----+  | |  +--------+
+ *                          | 42 |  |  ?  |<-+ |  |        | <- next_io_index
+ *                          +----+  +-----+    |  +--------+
+ *                          | 43 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 44 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 60 |  |  ?  |<---+
+ *                          +----+  +-----+
+ *     next_buffer_index -> |    |  |     |
+ *                          +----+  +-----+
+ *
+ * In the example, 5 buffers are pinned, and the next buffer to be streamed to
+ * the client is block 10.  Block 10 was a hit and has no associated I/O, but
+ * the range 42..44 requires an I/O wait before its buffers are returned, as
+ * does block 60.
+ *
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/aio/read_stream.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "catalog/pg_tablespace.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+#include "storage/read_stream.h"
+#include "utils/memdebug.h"
+#include "utils/rel.h"
+#include "utils/spccache.h"
+
+typedef struct InProgressIO
+{
+	int16		buffer_index;
+	ReadBuffersOperation op;
+} InProgressIO;
+
+/*
+ * State for managing a stream of reads.
+ */
+struct ReadStream
+{
+	int16		max_ios;
+	int16		ios_in_progress;
+	int16		queue_size;
+	int16		max_pinned_buffers;
+	int16		pinned_buffers;
+	int16		distance;
+	bool		advice_enabled;
+
+	/*
+	 * Sometimes we need to be able to 'unget' a block number to resolve a
+	 * flow control problem when I/Os are split.
+	 */
+	BlockNumber unget_blocknum;
+	bool		have_unget_blocknum;
+
+	/*
+	 * The callback that will tell us which block numbers to read, and an
+	 * opaque pointer that will be pass to it for its own purposes.
+	 */
+	ReadStreamBlockNumberCB callback;
+	void	   *callback_private_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* The read operation we are currently preparing. */
+	BlockNumber pending_read_blocknum;
+	int16		pending_read_nblocks;
+
+	/* Space for buffers and optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Read operations that have been started but not waited for yet. */
+	InProgressIO *ios;
+	int16		oldest_io_index;
+	int16		next_io_index;
+
+	/* Circular queue of buffers. */
+	int16		oldest_buffer_index;	/* Next pinned buffer to return */
+	int16		next_buffer_index;	/* Index of next buffer to pin */
+	Buffer		buffers[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static inline void *
+get_per_buffer_data(ReadStream *stream, int16 buffer_index)
+{
+	return (char *) stream->per_buffer_data +
+		stream->per_buffer_data_size * buffer_index;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work.
+ */
+static inline BlockNumber
+read_stream_get_block(ReadStream *stream, void *per_buffer_data)
+{
+	if (!stream->have_unget_blocknum)
+		return stream->callback(stream,
+								stream->callback_private_data,
+								per_buffer_data);
+
+	/*
+	 * You can only unget one block, and next_buffer_index can't change across
+	 * a get, unget, get sequence, so the callback's per_buffer_data, if any,
+	 * is still present in the correct slot.  We just have to return the
+	 * previous block number.
+	 */
+	stream->have_unget_blocknum = false;
+	return stream->unget_blocknum;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.
+ */
+static inline void
+read_stream_unget_block(ReadStream *stream, BlockNumber blocknum)
+{
+	Assert(!stream->have_unget_blocknum);
+	stream->have_unget_blocknum = true;
+	stream->unget_blocknum = blocknum;
+}
+
+static void
+read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
+{
+	bool		need_wait;
+	int			nblocks;
+	int			flags;
+	int16		io_index;
+	int16		overflow;
+	int16		buffer_index;
+
+	/* This should only be called with a pending read. */
+	Assert(stream->pending_read_nblocks > 0);
+	Assert(stream->pending_read_nblocks <= io_combine_limit);
+
+	/* We had better not exceed the pin limit by starting this read. */
+	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+		   stream->max_pinned_buffers);
+
+	/* We had better not be overwriting an existing pinned buffer. */
+	if (stream->pinned_buffers > 0)
+		Assert(stream->next_buffer_index != stream->oldest_buffer_index);
+	else
+		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
+
+	/*
+	 * If advice hasn't been suppressed, this system supports it, and this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (!suppress_advice &&
+		stream->advice_enabled &&
+		stream->pending_read_blocknum != stream->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+	/* We say how many blocks we want to read, but may be smaller on return. */
+	buffer_index = stream->next_buffer_index;
+	io_index = stream->next_io_index;
+	nblocks = stream->pending_read_nblocks;
+	need_wait = StartReadBuffers(&stream->ios[io_index].op,
+								 &stream->buffers[buffer_index],
+								 stream->pending_read_blocknum,
+								 &nblocks,
+								 flags);
+	stream->pinned_buffers += nblocks;
+
+	/* Remember whether we need to wait before returning this buffer. */
+	if (!need_wait)
+	{
+		/* Look-ahead distance decays, no I/O necessary (behavior A). */
+		if (stream->distance > 1)
+			stream->distance--;
+	}
+	else
+	{
+		/*
+		 * Remember to call WaitReadBuffers() before returning head buffer.
+		 * Look-ahead distance will be adjusted after waiting.
+		 */
+		stream->ios[io_index].buffer_index = buffer_index;
+		if (++stream->next_io_index == stream->max_ios)
+			stream->next_io_index = 0;
+		Assert(stream->ios_in_progress < stream->max_ios);
+		stream->ios_in_progress++;
+		stream->seq_blocknum = stream->pending_read_blocknum + nblocks;
+	}
+
+	/*
+	 * We gave a contiguous range of buffer space to StartReadBuffers(), but
+	 * we want it to wrap around at queue_size.  Slide overflowing buffers to
+	 * the front of the array.
+	 */
+	overflow = (buffer_index + nblocks) - stream->queue_size;
+	if (overflow > 0)
+		memmove(&stream->buffers[0],
+				&stream->buffers[stream->queue_size],
+				sizeof(stream->buffers[0]) * overflow);
+
+	/* Compute location of start of next read, without using % operator. */
+	buffer_index += nblocks;
+	if (buffer_index >= stream->queue_size)
+		buffer_index -= stream->queue_size;
+	Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+	stream->next_buffer_index = buffer_index;
+
+	/* Adjust the pending read to cover the remaining portion, if any. */
+	stream->pending_read_blocknum += nblocks;
+	stream->pending_read_nblocks -= nblocks;
+}
+
+static void
+read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
+{
+	while (stream->ios_in_progress < stream->max_ios &&
+		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+	{
+		BlockNumber blocknum;
+		int16		buffer_index;
+		void	   *per_buffer_data;
+
+		if (stream->pending_read_nblocks == io_combine_limit)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			continue;
+		}
+
+		/*
+		 * See which block the callback wants next in the stream.  We need to
+		 * compute the index of the Nth block of the pending read including
+		 * wrap-around, but we don't want to use the expensive % operator.
+		 */
+		buffer_index = stream->next_buffer_index + stream->pending_read_nblocks;
+		if (buffer_index >= stream->queue_size)
+			buffer_index -= stream->queue_size;
+		Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+		per_buffer_data = get_per_buffer_data(stream, buffer_index);
+		blocknum = read_stream_get_block(stream, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			/* End of stream. */
+			stream->distance = 0;
+			break;
+		}
+
+		/* Can we merge it with the pending read? */
+		if (stream->pending_read_nblocks > 0 &&
+			stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
+		{
+			stream->pending_read_nblocks++;
+			continue;
+		}
+
+		/* We have to start the pending read before we can build another. */
+		if (stream->pending_read_nblocks > 0)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			if (stream->ios_in_progress == stream->max_ios)
+			{
+				/* And we've hit the limit.  Rewind, and stop here. */
+				read_stream_unget_block(stream, blocknum);
+				return;
+			}
+		}
+
+		/* This is the start of a new pending read. */
+		stream->pending_read_blocknum = blocknum;
+		stream->pending_read_nblocks = 1;
+	}
+
+	/*
+	 * We don't start the pending read just because we've hit the distance
+	 * limit, preferring to give it another chance to grow to full
+	 * io_combine_limit size once more buffers have been consumed.  However,
+	 * if we've already reached io_combine_limit, or we've reached the
+	 * distance limit and there isn't anything pinned yet, or the callback has
+	 * signaled end-of-stream, we start the read immediately.
+	 */
+	if (stream->pending_read_nblocks > 0 &&
+		(stream->pending_read_nblocks == io_combine_limit ||
+		 (stream->pending_read_nblocks == stream->distance &&
+		  stream->pinned_buffers == 0) ||
+		 stream->distance == 0) &&
+		stream->ios_in_progress < stream->max_ios)
+		read_stream_start_pending_read(stream, suppress_advice);
+}
+
+/*
+ * Create a new read stream object that can be used to perform the equivalent
+ * of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.  The callback should return block numbers or InvalidBlockNumber to
+ * signal end-of-stream, and if per_buffer_data_size is non-zero, it may also
+ * write extra data for each block into the space provided to it.  It will
+ * also receive callback_private_data for its own purposes.
+ */
+ReadStream *
+read_stream_begin_relation(int flags,
+						   BufferAccessStrategy strategy,
+						   BufferManagerRelation bmr,
+						   ForkNumber forknum,
+						   ReadStreamBlockNumberCB callback,
+						   void *callback_private_data,
+						   size_t per_buffer_data_size)
+{
+	ReadStream *stream;
+	size_t		size;
+	int16		queue_size;
+	int16		max_ios;
+	uint32		max_pinned_buffers;
+	Oid			tablespace_id;
+
+	/* Make sure our bmr's smgr and persistent are populated. */
+	if (bmr.smgr == NULL)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	/*
+	 * Decide how many I/Os we will allow to run at the same time.  That
+	 * currently means advice to the kernel to tell it that we will soon read.
+	 * This number also affects how far we look ahead for opportunities to
+	 * start more I/Os.
+	 */
+	tablespace_id = bmr.smgr->smgr_rlocator.locator.spcOid;
+	if (!OidIsValid(MyDatabaseId) ||
+		(bmr.rel && IsCatalogRelation(bmr.rel)) ||
+		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
+	{
+		/*
+		 * Avoid circularity while trying to look up tablespace settings or
+		 * before spccache.c is ready.
+		 */
+		max_ios = effective_io_concurrency;
+	}
+	else if (flags & READ_STREAM_MAINTENANCE)
+		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+	else
+		max_ios = get_tablespace_io_concurrency(tablespace_id);
+	max_ios = Min(max_ios, PG_INT16_MAX);
+
+	/*
+	 * Choose the maximum number of buffers we're prepared to pin.  We try to
+	 * pin fewer if we can, though.  We clamp it to at least io_combine_limit
+	 * so that we can have a chance to build up a full io_combine_limit sized
+	 * read, even when max_ios is zero.  Be careful not to allow int16 to
+	 * overflow (even though that's not possible with the current GUC range
+	 * limits), allowing also for the spare entry and the overflow space.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+	max_pinned_buffers = Min(max_pinned_buffers,
+							 PG_INT16_MAX - io_combine_limit - 1);
+
+	/* Don't allow this backend to pin more than its share of buffers. */
+	if (SmgrIsTemp(bmr.smgr))
+		LimitAdditionalLocalPins(&max_pinned_buffers);
+	else
+		LimitAdditionalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * We need one extra entry for buffers and per-buffer data, because users
+	 * of per-buffer data have access to the object until the next call to
+	 * read_stream_next_buffer(), so we need a gap between the head and tail
+	 * of the queue so that we don't clobber it.
+	 */
+	queue_size = max_pinned_buffers + 1;
+
+	/*
+	 * Allocate the object, the buffers, the ios and per_data_data space in
+	 * one big chunk.  Though we have queue_size buffers, we want to be able
+	 * to assume that all the buffers for a single read are contiguous (i.e.
+	 * don't wrap around halfway through), so we allow temporary overflows of
+	 * up to the maximum possible read size by allocating an extra
+	 * io_combine_limit - 1 elements.
+	 */
+	size = offsetof(ReadStream, buffers);
+	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(InProgressIO) * Max(1, max_ios);
+	size += per_buffer_data_size * queue_size;
+	size += MAXIMUM_ALIGNOF * 2;
+	stream = (ReadStream *) palloc(size);
+	memset(stream, 0, offsetof(ReadStream, buffers));
+	stream->ios = (InProgressIO *)
+		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+	if (per_buffer_data_size > 0)
+		stream->per_buffer_data = (void *)
+			MAXALIGN(&stream->ios[Max(1, max_ios)]);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  We can use it as long as
+	 * direct I/O isn't enabled, the caller hasn't promised sequential access
+	 * (overriding our detection heuristics), and max_ios hasn't been set to
+	 * zero.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & READ_STREAM_SEQUENTIAL) == 0 &&
+		max_ios > 0)
+		stream->advice_enabled = true;
+#endif
+
+	/*
+	 * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
+	 * above.  If we had real asynchronous I/O we might need a slightly
+	 * different definition.
+	 */
+	if (max_ios == 0)
+		max_ios = 1;
+
+	stream->max_ios = max_ios;
+	stream->per_buffer_data_size = per_buffer_data_size;
+	stream->max_pinned_buffers = max_pinned_buffers;
+	stream->queue_size = queue_size;
+
+	if (!bmr.smgr)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+	stream->callback = callback;
+	stream->callback_private_data = callback_private_data;
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out assuming we'll be
+	 * doing full io_combine_limit sized reads (behavior B).
+	 */
+	if (flags & READ_STREAM_FULL)
+		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+	else
+		stream->distance = 1;
+
+	/*
+	 * Since we always currently always access the same relation, we can
+	 * initialize parts of the ReadBuffersOperation objects and leave them
+	 * that way, to avoid wasting CPU cycles writing to them for each read.
+	 */
+	for (int i = 0; i < max_ios; ++i)
+	{
+		stream->ios[i].op.bmr = bmr;
+		stream->ios[i].op.forknum = forknum;
+		stream->ios[i].op.strategy = strategy;
+	}
+
+	return stream;
+}
+
+/*
+ * Pull one pinned buffer out of a stream.  Each call returns successive
+ * blocks in the order specified by the callback.  If per_buffer_data_size was
+ * set to a non-zero size, *per_buffer_data receives a pointer to the extra
+ * per-buffer data that the callback had a chance to populate, which remains
+ * valid until the next call to read_stream_next_buffer().  When the stream
+ * runs out of data, InvalidBuffer is returned.  The caller may decide to end
+ * the stream early at any time by calling read_stream_end().
+ */
+Buffer
+read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
+{
+	Buffer		buffer;
+	int16		oldest_buffer_index;
+
+	/*
+	 * A fast path for all-cached scans (behavior A).  This is the same as the
+	 * usual algorithm, but it is specialized for no I/O and no per-buffer
+	 * data, so we can skip the queue management code, stay in the same buffer
+	 * slot and use singular StartReadBuffer().
+	 */
+	if (likely(per_buffer_data == NULL &&
+			   stream->ios_in_progress == 0 &&
+			   stream->pinned_buffers == 1 &&
+			   stream->distance == 1))
+	{
+		BlockNumber next_blocknum;
+
+		/*
+		 * We have a pinned buffer that we need to serve up, but we also want
+		 * to probe the next one before we return, just in case we need to
+		 * start an I/O.  We can re-use the same buffer slot, and an arbitrary
+		 * I/O slot since they're all free.
+		 */
+		oldest_buffer_index = stream->oldest_buffer_index;
+		Assert((oldest_buffer_index + 1) % stream->queue_size ==
+			   stream->next_buffer_index);
+		buffer = stream->buffers[oldest_buffer_index];
+		Assert(buffer != InvalidBuffer);
+		Assert(stream->pending_read_nblocks <= 1);
+		if (unlikely(stream->pending_read_nblocks == 1))
+		{
+			next_blocknum = stream->pending_read_blocknum;
+			stream->pending_read_nblocks = 0;
+		}
+		else
+			next_blocknum = read_stream_get_block(stream, NULL);
+		if (unlikely(next_blocknum == InvalidBlockNumber))
+		{
+			/* End of stream. */
+			stream->distance = 0;
+			stream->next_buffer_index = oldest_buffer_index;
+			/* Pin transferred to caller. */
+			stream->pinned_buffers = 0;
+			return buffer;
+		}
+		/* Call the special single block version, which is marginally faster. */
+		if (unlikely(StartReadBuffer(&stream->ios[0].op,
+									 &stream->buffers[oldest_buffer_index],
+									 next_blocknum,
+									 stream->advice_enabled ?
+									 READ_BUFFERS_ISSUE_ADVICE : 0)))
+		{
+			/* I/O needed.  We'll take the general path next time. */
+			stream->oldest_io_index = 0;
+			stream->next_io_index = stream->max_ios > 1 ? 1 : 0;
+			stream->ios_in_progress = 1;
+			stream->ios[0].buffer_index = oldest_buffer_index;
+			stream->seq_blocknum = next_blocknum + 1;
+			/* Increase look ahead distance (move towards behavior B/C). */
+			stream->distance = Min(2, stream->max_pinned_buffers);
+		}
+		/* Pin transferred to caller, got another one, no net change. */
+		Assert(stream->pinned_buffers == 1);
+		return buffer;
+	}
+
+	if (stream->pinned_buffers == 0)
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+
+		/* End of stream reached?  */
+		if (stream->distance == 0)
+			return InvalidBuffer;
+
+		/*
+		 * The usual order of operations is that we look ahead at the bottom
+		 * of this function after potentially finishing an I/O and making
+		 * space for more, but if we're just starting up we'll need to crank
+		 * the handle to get started.
+		 */
+		read_stream_look_ahead(stream, true);
+
+		/* End of stream reached? */
+		if (stream->pinned_buffers == 0)
+		{
+			Assert(stream->distance == 0);
+			return InvalidBuffer;
+		}
+	}
+
+	/* Grab the oldest pinned buffer and associated per-buffer data. */
+	Assert(stream->pinned_buffers > 0);
+	oldest_buffer_index = stream->oldest_buffer_index;
+	Assert(oldest_buffer_index >= 0 &&
+		   oldest_buffer_index < stream->queue_size);
+	buffer = stream->buffers[oldest_buffer_index];
+	if (per_buffer_data)
+		*per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
+
+	Assert(BufferIsValid(buffer));
+
+	/* Do we have to wait for an associated I/O first? */
+	if (stream->ios_in_progress > 0 &&
+		stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
+	{
+		int16		io_index = stream->oldest_io_index;
+		int16		distance;
+
+		/* Sanity check that we still agree on the buffers. */
+		Assert(stream->ios[io_index].op.buffers ==
+			   &stream->buffers[oldest_buffer_index]);
+
+		WaitReadBuffers(&stream->ios[io_index].op);
+
+		Assert(stream->ios_in_progress > 0);
+		stream->ios_in_progress--;
+		if (++stream->oldest_io_index == stream->max_ios)
+			stream->oldest_io_index = 0;
+
+		if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/* Distance ramps up fast (behavior C). */
+			distance = stream->distance * 2;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
+		else
+		{
+			/* No advice; move towards io_combine_limit (behavior B). */
+			if (stream->distance > io_combine_limit)
+			{
+				stream->distance--;
+			}
+			else
+			{
+				distance = stream->distance * 2;
+				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, stream->max_pinned_buffers);
+				stream->distance = distance;
+			}
+		}
+	}
+
+#ifdef CLOBBER_FREED_MEMORY
+	/* Clobber old buffer and per-buffer data for debugging purposes. */
+	stream->buffers[oldest_buffer_index] = InvalidBuffer;
+
+	/*
+	 * The caller will get access to the per-buffer data, until the next call.
+	 * We wipe the one before, which is never occupied because queue_size
+	 * allowed one extra element.  This will hopefully trip up client code
+	 * that is holding a dangling pointer to it.
+	 */
+	if (stream->per_buffer_data)
+		wipe_mem(get_per_buffer_data(stream,
+									 oldest_buffer_index == 0 ?
+									 stream->queue_size - 1 :
+									 oldest_buffer_index - 1),
+				 stream->per_buffer_data_size);
+#endif
+
+	/* Pin transferred to caller. */
+	Assert(stream->pinned_buffers > 0);
+	stream->pinned_buffers--;
+
+	/* Advance oldest buffer, with wrap-around. */
+	stream->oldest_buffer_index++;
+	if (stream->oldest_buffer_index == stream->queue_size)
+		stream->oldest_buffer_index = 0;
+
+	/* Prepare for the next call. */
+	read_stream_look_ahead(stream, false);
+
+	return buffer;
+}
+
+/*
+ * Reset a read stream by releasing any queued up buffers, allowing the stream
+ * to be used again for different blocks.  This can be used to clear an
+ * end-of-stream condition and start again, or to throw away blocks that were
+ * speculatively read and read some different blocks instead.
+ */
+void
+read_stream_reset(ReadStream *stream)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	stream->distance = 0;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(stream->pinned_buffers == 0);
+	Assert(stream->ios_in_progress == 0);
+
+	/* Start off assuming data is cached. */
+	stream->distance = 1;
+}
+
+/*
+ * Release and free a read stream.
+ */
+void
+read_stream_end(ReadStream *stream)
+{
+	read_stream_reset(stream);
+	pfree(stream);
+}
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca2..739d13293f 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
new file mode 100644
index 0000000000..f5dbc087b0
--- /dev/null
+++ b/src/include/storage/read_stream.h
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.h
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/read_stream.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef READ_STREAM_H
+#define READ_STREAM_H
+
+#include "storage/bufmgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define READ_STREAM_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users, and thus should be governed by maintenance_io_concurrency
+ * instead of effective_io_concurrency.  For example, VACUUM or CREATE INDEX.
+ */
+#define READ_STREAM_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define READ_STREAM_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define READ_STREAM_FULL 0x04
+
+struct ReadStream;
+typedef struct ReadStream ReadStream;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
+												void *callback_private_data,
+												void *per_buffer_data);
+
+extern ReadStream *read_stream_begin_relation(int flags,
+											  BufferAccessStrategy strategy,
+											  BufferManagerRelation bmr,
+											  ForkNumber forknum,
+											  ReadStreamBlockNumberCB callback,
+											  void *callback_private_data,
+											  size_t per_buffer_data_size);
+extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
+extern void read_stream_reset(ReadStream *stream);
+extern void read_stream_end(ReadStream *stream);
+
+#endif							/* READ_STREAM_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 97edd1388e..82fa6c1297 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1214,6 +1214,7 @@ InjectionPointCacheEntry
 InjectionPointEntry
 InjectionPointSharedState
 InlineCodeBlock
+InProgressIO
 InsertStmt
 Instrumentation
 Int128AggState
@@ -2292,6 +2293,7 @@ ReadExtraTocPtrType
 ReadFunc
 ReadLocalXLogPageNoWaitPrivate
 ReadReplicationSlotCmd
+ReadStream
 ReassignOwnedStmt
 RecheckForeignScan_function
 RecordCacheArrayEntry
-- 
2.39.3 (Apple Git-146)

v13-0003-Use-streaming-I-O-in-pg_prewarm.patchapplication/octet-stream; name=v13-0003-Use-streaming-I-O-in-pg_prewarm.patchDownload

From 6f38d63ecf0d4e03d21ea1fbbba60bc96ae51bc1 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 09:28:42 +1200
Subject: [PATCH v13 3/4] Use streaming I/O in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use a ReadStream.  This
commit provides a simple example of such a transformation, and generates
fewer system calls.

Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 contrib/pg_prewarm/pg_prewarm.c | 40 ++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e4..f6269562b5 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -19,6 +19,7 @@
 #include "fmgr.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_read_stream_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_read_stream_next_block(ReadStream *stream,
+								  void *callback_private_data,
+								  void *per_buffer_data)
+{
+	struct pg_prewarm_read_stream_private *p = callback_private_data;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_read_stream_private p;
+		ReadStream *stream;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		stream = read_stream_begin_relation(READ_STREAM_FULL,
+											NULL,
+											BMR_REL(rel),
+											forkNumber,
+											pg_prewarm_read_stream_next_block,
+											&p,
+											0);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = read_stream_next_buffer(stream, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
 	}
 
 	/* Close relation, release lock. */
-- 
2.39.3 (Apple Git-146)

#51

Heikki Linnakangas

hlinnaka@iki.fi

almost 2 years ago

In reply to: Thomas Munro (#50)

5 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

I spent most of the past few days trying to regain some lost
performance. Thanks to Andres for some key observations and help!
That began with reports from Bilal and Melanie (possibly related to
things Tomas had seen too, not sure) of regressions in all-cached
workloads, which I already improved a bit with the ABC algorithm that
minimised pinning for this case. That is, if there's no recent I/O so
we reach what I call behaviour A, it should try to do as little magic
as possible. But it turns out that wasn't enough! It is very hard to
beat a tight loop that just does ReadBuffer(), ReleaseBuffer() over
millions of already-cached blocks, if you have to do exactly the same
work AND extra instructions for management.

I got a little nerd-sniped by that, and did some micro-benchmarking of
my own. I tested essentially this, with small values of 'nblocks' so
that all pages are in cache:

for (int i = 0; i < niters; i++)
{
for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
{
buf = ReadBuffer(rel, blkno);
ReleaseBuffer(buf);
}
}

The results look like this (lower is better, test program and script
attached):

master (213c959a29): 8.0 s
streaming-api v13: 9.5 s

This test exercises just the ReadBuffer() codepath, to check if there is
a regression there. It does not exercise the new streaming APIs.

So looks like the streaming API patches add some overhead to the simple
non-streaming ReadBuffer() case. This is a highly narrow
micro-benchmark, of course, so even though this is a very
performance-sensitive codepath, we could perhaps accept a small
regression there. In any real workload, you'd at least need to take the
buffer lock and read something from the page.

But can we do better? Aside from performance, I was never quite happy
with the BMR_REL/BMR_SMGR stuff we introduced in PG v16. I like having
one common struct like BufferManagerRelation that is used in all the
functions, instead of having separate Relation and SMgrRelation variants
of every function. But those macros feel a bit hacky and we are not
consistently using them in all the functions. Why is there no
ReadBuffer() variant that takes a BufferManagerRelation?

The attached patch expands the use of BufferManagerRelations. The
principle now is that before calling any bufmgr function, you first
initialize a BufferManagerRelation struct, and pass that to the
function. The initialization is done by the InitBMRForRel() or
InitBMRForSmgr() function, which replace the BMR_REL/BMR_SMGR macros.
They are full-blown functions now because they do more setup upfront
than BMR_REL/BMR_SMGR. For example, InitBMRForRel() always initializes
the 'smgr' field, so that you don't need to repeat this pattern in all
the other functions:

- /* Make sure our bmr's smgr and persistent are populated. */
- if (bmr.smgr == NULL)
- {
- bmr.smgr = RelationGetSmgr(bmr.rel);
- bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
- }

Initializing the BufferManagerRelation is still pretty cheap, so it's
feasible to call it separately for every ReadBuffer() call. But you can
also reuse it across calls, if you read multiple pages in a loop, for
example. That saves a few cycles.

The microbenchmark results with these changes:

master (213c959a29): 8.0 s
streaming-api v13: 9.5 s
bmr-refactor 8.4 s
bmr-refactor, InitBMR once 7.7 s

The difference between the "bmr-refactor" and "initBMR once" is that in
the "initBMR once" test, I modified the benchmark to call
InitBMRForRel() just once, outside the loop. So that shows the benefit
of reusing the BufferManagerRelation. This refactoring seems to make
performance regression smaller, even if you don't take advantage of
reusing the BufferManagerRelation.

This also moves things around a little in ReadBuffer_common() (now
called ReadBufferBMR). Instead of calling StartReadBuffer(), it calls
PinBufferForBlock() directly. I tried doing that before the other
refactorings, but that alone didn't seem to make much difference. Not
sure if it's needed, it's perhaps an orthogonal refactoring, but it's
included here nevertheless.

What do you think? The first three attached patches are your v13 patches
unchanged. The fourth is the micro-benchmark I used. The last patch is
the interesting one.

PS. To be clear, I'm happy with your v13 streaming patch set as it is. I
don't think this BufferManagerRelation refactoring is a show-stopper.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

v13-bmr-0001-Provide-vectored-variant-of-ReadBuffer.patchtext/x-patch; charset=UTF-8; name=v13-bmr-0001-Provide-vectored-variant-of-ReadBuffer.patchDownload

From 99f1cd4c2fc21f98457340613160125c567990fd Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v13-bmr 1/5] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Callers should respect the new GUC io_combine_limit, and the limit on
per-backend pins which is now exposed as a public interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of advice.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de> (optimization tweaks)
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  14 +
 src/backend/storage/buffer/bufmgr.c           | 701 ++++++++++++------
 src/backend/storage/buffer/localbuf.c         |  14 +-
 src/backend/utils/misc/guc_tables.c           |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h                  |  41 +-
 src/tools/pgindent/typedefs.list              |   1 +
 7 files changed, 564 insertions(+), 222 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5468637e2ef..f3736000ad2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2719,6 +2719,20 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-io-combine-limit" xreflabel="io_combine_limit">
+       <term><varname>io_combine_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_combine_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the largest I/O size in operations that combine I/O.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
        <term><varname>max_worker_processes</varname> (<type>integer</type>)
        <indexterm>
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..7123cbbaa2a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -160,6 +165,9 @@ int			checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
 int			bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
 int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 
+/* Limit on how many blocks should be handled in single I/O operations. */
+int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
+
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
@@ -471,10 +479,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,7 +507,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -781,7 +788,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -794,15 +800,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +822,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -994,35 +993,146 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
 /*
- * ReadBuffer_common -- common logic for all ReadBuffer variants
- *
- * *hit is set to true if the request was satisfied from shared buffer cache.
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
  */
-static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
-				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
 {
 	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
+/*
+ * Pin a buffer for a given block.  *foundPtr is set to true if the block was
+ * already present, or false if more work is required to either read it in or
+ * zero it.
+ */
+static inline Buffer
+PinBufferForBlock(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
 	IOContext	io_context;
 	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
 
-	*hit = false;
+	Assert(blockNum != P_NEW);
+
+	Assert(bmr.smgr);
+
+	isLocalBuf = bmr.relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
+
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
+	{
+		VacuumPageHit++;
+		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageHit;
+
+		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
+
+	return BufferDescriptorGetBuffer(bufHdr);
+}
+
+/*
+ * ReadBuffer_common -- common logic for all ReadBuffer variants
+ */
+static inline Buffer
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
+				  BlockNumber blockNum, ReadBufferMode mode,
+				  BufferAccessStrategy strategy)
+{
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1041,181 +1151,359 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
-
-	if (isLocalBuf)
+	if (unlikely(mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+				 mode == RBM_ZERO_AND_LOCK))
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
+		bool		found;
+
+		if (bmr.smgr == NULL)
+		{
+			bmr.smgr = RelationGetSmgr(bmr.rel);
+			bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+		}
+
+		buffer = PinBufferForBlock(bmr, forkNum, blockNum, strategy, &found);
+		ZeroBuffer(buffer, mode);
+		return buffer;
 	}
+
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
 	else
+		flags = 0;
+	operation.bmr = bmr;
+	operation.forknum = forkNum;
+	operation.strategy = strategy;
+	if (StartReadBuffer(&operation,
+						&buffer,
+						blockNum,
+						flags))
+		WaitReadBuffers(&operation);
+
+	return buffer;
+}
+
+/*
+ * Single block version of the StartReadBuffers().  This might save a few
+ * instructions when called from another translation unit, if the compiler
+ * inlines the code and specializes for nblocks == 1.
+ */
+bool
+StartReadBuffer(ReadBuffersOperation *operation,
+				Buffer *buffer,
+				BlockNumber blocknum,
+				int flags)
+{
+	int			nblocks = 1;
+	bool		result;
+
+	result = StartReadBuffers(operation, buffer, blocknum, &nblocks, flags);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	return result;
+}
+
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.  Caller sets some of the
+ * members of operation; see struct definition.
+ *
+ * If false is returned, no I/O is necessary.  If true is returned, one I/O
+ * has been started, and WaitReadBuffers() must be called with the same
+ * operation object before the buffers are accessed.  Along with the operation
+ * object, the caller-supplied array of buffers must remain valid until
+ * WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+inline bool
+StartReadBuffers(ReadBuffersOperation *operation,
+				 Buffer *buffers,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 int flags)
+{
+	int			actual_nblocks = *nblocks;
+	int			io_buffers_len = 0;
+
+	Assert(*nblocks > 0);
+	Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
+
+	if (!operation->bmr.smgr)
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
+		operation->bmr.smgr = RelationGetSmgr(operation->bmr.rel);
+		operation->bmr.relpersistence = operation->bmr.rel->rd_rel->relpersistence;
 	}
 
-	/* At this point we do NOT hold any locks. */
-
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	for (int i = 0; i < actual_nblocks; ++i)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
-		VacuumPageHit++;
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		bool		found;
 
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
+		buffers[i] = PinBufferForBlock(operation->bmr,
+									   operation->forknum,
+									   blockNum + i,
+									   operation->strategy,
+									   &found);
 
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = i + 1;
+			break;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			io_buffers_len++;
+		}
+	}
+	*nblocks = actual_nblocks;
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
+	if (io_buffers_len > 0)
+	{
+		/* Populate information needed for I/O. */
+		operation->buffers = buffers;
+		operation->blocknum = blockNum;
+		operation->flags = flags;
+		operation->nblocks = actual_nblocks;
+		operation->io_buffers_len = io_buffers_len;
+
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
+			/*
+			 * In theory we should only do this if PinBufferForBlock() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(operation->bmr.smgr,
+						 operation->forknum,
+						 blockNum,
+						 operation->io_buffers_len);
 		}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
+	}
+	else
+	{
+		return false;
 	}
+}
+
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
+	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
 	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
 	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+
+	isLocalBuf = operation->bmr.relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_IO_COMBINE_LIMIT];
+		void	   *io_pages[MAX_IO_COMBINE_LIMIT];
+		instr_time	io_start;
+		BlockNumber io_first_block;
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PinBufferForBlock().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(operation->bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for PinBufferForBlock.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1511,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1286,19 +1570,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1638,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1673,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +2029,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2294,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2317,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2632,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2646,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3714,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5449,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5470,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a57..985a2c7049c 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index abd9029451f..313e393262f 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3112,6 +3112,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"io_combine_limit",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Limit on the size of data reads and writes."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&io_combine_limit,
+		DEFAULT_IO_COMBINE_LIMIT,
+		1, MAX_IO_COMBINE_LIMIT,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
 			gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2244ee52f79..7fa6d5a64c8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -203,6 +203,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..241f68c45e1 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int io_combine_limit;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,38 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+struct ReadBuffersOperation
+{
+	/* The following members should be set by the caller. */
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+	BufferAccessStrategy strategy;
+
+	/* The following private members should not be accessed directly. */
+	Buffer	   *buffers;
+	BlockNumber blocknum;
+	int			flags;
+	int16		nblocks;
+	int16		io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffer(ReadBuffersOperation *operation,
+							Buffer *buffer,
+							BlockNumber blocknum,
+							int flags);
+extern bool StartReadBuffers(ReadBuffersOperation *operation,
+							 Buffer *buffers,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 int flags);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +286,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cfa9d5aaeac..97edd1388e9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2286,6 +2286,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.39.2

v13-bmr-0002-Provide-API-for-streaming-relation-data.patchtext/x-patch; charset=UTF-8; name=v13-bmr-0002-Provide-API-for-streaming-relation-data.patchDownload

From e4bb5771b1b343641da1594f3c49f9ee750adcc4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Feb 2024 00:01:42 +1300
Subject: [PATCH v13-bmr 2/5] Provide API for streaming relation data.

Introduce an abstraction where relation data can be accessed as a
stream of buffers, with an implementation that is more efficient than
the equivalent sequence of ReadBuffer() calls.

Client code supplies a callback that can say which block number is
wanted next, and then consumes individual buffers one at a time from the
stream.  This division allows read_stream.c to build up large calls to
StartReadBuffers() up to io_combine_limit, and issue posix_fadvise()
advice ahead of time in a systematic way when random access is detected.

This API is based on an idea from Andres Freund to pave the way for
asynchronous I/O in future work as required to support direct I/O.  The
goal is to have an abstraction that insulates client code from future
changes to the I/O subsystem that would benefit from information about
future needs.

An extended API may be necessary in future for more complicated cases
(for example recovery, whose LsnReadQueue device in xlogprefetcher.c is
a distant cousin of this code and should eventually be replaced by
this), but this basic API is sufficient for many common usage patterns
involving predictable access to a single relation fork.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (contributions)
Author: Melanie Plageman <melanieplageman@gmail.com> (contributions)
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/Makefile          |   2 +-
 src/backend/storage/aio/Makefile      |  14 +
 src/backend/storage/aio/meson.build   |   5 +
 src/backend/storage/aio/read_stream.c | 750 ++++++++++++++++++++++++++
 src/backend/storage/meson.build       |   1 +
 src/include/storage/read_stream.h     |  63 +++
 src/tools/pgindent/typedefs.list      |   2 +
 7 files changed, 836 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/read_stream.c
 create mode 100644 src/include/storage/read_stream.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..2f29a9ec4d1
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	read_stream.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..10e1aa3b20b
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'read_stream.c',
+)
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
new file mode 100644
index 00000000000..6b4b97372c7
--- /dev/null
+++ b/src/backend/storage/aio/read_stream.c
@@ -0,0 +1,750 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.c
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ * Code that needs to access relation data typically pins blocks one at a
+ * time, often in a predictable order that might be sequential or data-driven.
+ * Calling the simple ReadBuffer() function for each block is inefficient,
+ * because blocks that are not yet in the buffer pool require I/O operations
+ * that are small and might stall waiting for storage.  This mechanism looks
+ * into the future and calls StartReadBuffers() and WaitReadBuffers() to read
+ * neighboring blocks together and ahead of time, with an adaptive look-ahead
+ * distance.
+ *
+ * A user-provided callback generates a stream of block numbers that is used
+ * to form reads of up to io_combine_limit, by attempting to merge them with a
+ * pending read.  When that isn't possible, the existing pending read is sent
+ * to StartReadBuffers() so that a new one can begin to form.
+ *
+ * The algorithm for controlling the look-ahead distance tries to classify the
+ * stream into three ideal behaviors:
+ *
+ * A) No I/O is necessary, because the requested blocks are fully cached
+ * already.  There is no benefit to looking ahead more than one block, so
+ * distance is 1.  This is the default initial assumption.
+ *
+ * B) I/O is necessary, but fadvise is undesirable because the access is
+ * sequential, or impossible because direct I/O is enabled or the system
+ * doesn't support advice.  There is no benefit in looking ahead more than
+ * io_combine_limit, because in this case only goal is larger read system
+ * calls.  Looking further ahead would pin many buffers and perform
+ * speculative work looking ahead for no benefit.
+ *
+ * C) I/O is necesssary, it appears random, and this system supports fadvise.
+ * We'll look further ahead in order to reach the configured level of I/O
+ * concurrency.
+ *
+ * The distance increases rapidly and decays slowly, so that it moves towards
+ * those levels as different I/O patterns are discovered.  For example, a
+ * sequential scan of fully cached data doesn't bother looking ahead, but a
+ * sequential scan that hits a region of uncached blocks will start issuing
+ * increasingly wide read calls until it plateaus at io_combine_limit.
+ *
+ * The main data structure is a circular queue of buffers of size
+ * max_pinned_buffers plus some extra space for technical reasons, ready to be
+ * returned by read_stream_next_buffer().  Each buffer also has an optional
+ * variable sized object that is passed from the callback to the consumer of
+ * buffers.
+ *
+ * Parallel to the queue of buffers, there is a circular queue of in-progress
+ * I/Os that have been started with StartReadBuffers(), and for which
+ * WaitReadBuffers() must be called before returning the buffer.
+ *
+ * For example, if the callback return block numbers 10, 42, 43, 60 in
+ * successive calls, then these data structures might appear as follows:
+ *
+ *                          buffers buf/data       ios
+ *
+ *                          +----+  +-----+       +--------+
+ *                          |    |  |     |  +----+ 42..44 | <- oldest_io_index
+ *                          +----+  +-----+  |    +--------+
+ *   oldest_buffer_index -> | 10 |  |  ?  |  | +--+ 60..60 |
+ *                          +----+  +-----+  | |  +--------+
+ *                          | 42 |  |  ?  |<-+ |  |        | <- next_io_index
+ *                          +----+  +-----+    |  +--------+
+ *                          | 43 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 44 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 60 |  |  ?  |<---+
+ *                          +----+  +-----+
+ *     next_buffer_index -> |    |  |     |
+ *                          +----+  +-----+
+ *
+ * In the example, 5 buffers are pinned, and the next buffer to be streamed to
+ * the client is block 10.  Block 10 was a hit and has no associated I/O, but
+ * the range 42..44 requires an I/O wait before its buffers are returned, as
+ * does block 60.
+ *
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/aio/read_stream.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "catalog/pg_tablespace.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+#include "storage/read_stream.h"
+#include "utils/memdebug.h"
+#include "utils/rel.h"
+#include "utils/spccache.h"
+
+typedef struct InProgressIO
+{
+	int16		buffer_index;
+	ReadBuffersOperation op;
+} InProgressIO;
+
+/*
+ * State for managing a stream of reads.
+ */
+struct ReadStream
+{
+	int16		max_ios;
+	int16		ios_in_progress;
+	int16		queue_size;
+	int16		max_pinned_buffers;
+	int16		pinned_buffers;
+	int16		distance;
+	bool		advice_enabled;
+
+	/*
+	 * Sometimes we need to be able to 'unget' a block number to resolve a
+	 * flow control problem when I/Os are split.
+	 */
+	BlockNumber unget_blocknum;
+	bool		have_unget_blocknum;
+
+	/*
+	 * The callback that will tell us which block numbers to read, and an
+	 * opaque pointer that will be pass to it for its own purposes.
+	 */
+	ReadStreamBlockNumberCB callback;
+	void	   *callback_private_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* The read operation we are currently preparing. */
+	BlockNumber pending_read_blocknum;
+	int16		pending_read_nblocks;
+
+	/* Space for buffers and optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Read operations that have been started but not waited for yet. */
+	InProgressIO *ios;
+	int16		oldest_io_index;
+	int16		next_io_index;
+
+	/* Circular queue of buffers. */
+	int16		oldest_buffer_index;	/* Next pinned buffer to return */
+	int16		next_buffer_index;	/* Index of next buffer to pin */
+	Buffer		buffers[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static inline void *
+get_per_buffer_data(ReadStream *stream, int16 buffer_index)
+{
+	return (char *) stream->per_buffer_data +
+		stream->per_buffer_data_size * buffer_index;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work.
+ */
+static inline BlockNumber
+read_stream_get_block(ReadStream *stream, void *per_buffer_data)
+{
+	if (!stream->have_unget_blocknum)
+		return stream->callback(stream,
+								stream->callback_private_data,
+								per_buffer_data);
+
+	/*
+	 * You can only unget one block, and next_buffer_index can't change across
+	 * a get, unget, get sequence, so the callback's per_buffer_data, if any,
+	 * is still present in the correct slot.  We just have to return the
+	 * previous block number.
+	 */
+	stream->have_unget_blocknum = false;
+	return stream->unget_blocknum;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.
+ */
+static inline void
+read_stream_unget_block(ReadStream *stream, BlockNumber blocknum)
+{
+	Assert(!stream->have_unget_blocknum);
+	stream->have_unget_blocknum = true;
+	stream->unget_blocknum = blocknum;
+}
+
+static void
+read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
+{
+	bool		need_wait;
+	int			nblocks;
+	int			flags;
+	int16		io_index;
+	int16		overflow;
+	int16		buffer_index;
+
+	/* This should only be called with a pending read. */
+	Assert(stream->pending_read_nblocks > 0);
+	Assert(stream->pending_read_nblocks <= io_combine_limit);
+
+	/* We had better not exceed the pin limit by starting this read. */
+	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+		   stream->max_pinned_buffers);
+
+	/* We had better not be overwriting an existing pinned buffer. */
+	if (stream->pinned_buffers > 0)
+		Assert(stream->next_buffer_index != stream->oldest_buffer_index);
+	else
+		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
+
+	/*
+	 * If advice hasn't been suppressed, this system supports it, and this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (!suppress_advice &&
+		stream->advice_enabled &&
+		stream->pending_read_blocknum != stream->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+	/* We say how many blocks we want to read, but may be smaller on return. */
+	buffer_index = stream->next_buffer_index;
+	io_index = stream->next_io_index;
+	nblocks = stream->pending_read_nblocks;
+	need_wait = StartReadBuffers(&stream->ios[io_index].op,
+								 &stream->buffers[buffer_index],
+								 stream->pending_read_blocknum,
+								 &nblocks,
+								 flags);
+	stream->pinned_buffers += nblocks;
+
+	/* Remember whether we need to wait before returning this buffer. */
+	if (!need_wait)
+	{
+		/* Look-ahead distance decays, no I/O necessary (behavior A). */
+		if (stream->distance > 1)
+			stream->distance--;
+	}
+	else
+	{
+		/*
+		 * Remember to call WaitReadBuffers() before returning head buffer.
+		 * Look-ahead distance will be adjusted after waiting.
+		 */
+		stream->ios[io_index].buffer_index = buffer_index;
+		if (++stream->next_io_index == stream->max_ios)
+			stream->next_io_index = 0;
+		Assert(stream->ios_in_progress < stream->max_ios);
+		stream->ios_in_progress++;
+		stream->seq_blocknum = stream->pending_read_blocknum + nblocks;
+	}
+
+	/*
+	 * We gave a contiguous range of buffer space to StartReadBuffers(), but
+	 * we want it to wrap around at queue_size.  Slide overflowing buffers to
+	 * the front of the array.
+	 */
+	overflow = (buffer_index + nblocks) - stream->queue_size;
+	if (overflow > 0)
+		memmove(&stream->buffers[0],
+				&stream->buffers[stream->queue_size],
+				sizeof(stream->buffers[0]) * overflow);
+
+	/* Compute location of start of next read, without using % operator. */
+	buffer_index += nblocks;
+	if (buffer_index >= stream->queue_size)
+		buffer_index -= stream->queue_size;
+	Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+	stream->next_buffer_index = buffer_index;
+
+	/* Adjust the pending read to cover the remaining portion, if any. */
+	stream->pending_read_blocknum += nblocks;
+	stream->pending_read_nblocks -= nblocks;
+}
+
+static void
+read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
+{
+	while (stream->ios_in_progress < stream->max_ios &&
+		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+	{
+		BlockNumber blocknum;
+		int16		buffer_index;
+		void	   *per_buffer_data;
+
+		if (stream->pending_read_nblocks == io_combine_limit)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			continue;
+		}
+
+		/*
+		 * See which block the callback wants next in the stream.  We need to
+		 * compute the index of the Nth block of the pending read including
+		 * wrap-around, but we don't want to use the expensive % operator.
+		 */
+		buffer_index = stream->next_buffer_index + stream->pending_read_nblocks;
+		if (buffer_index >= stream->queue_size)
+			buffer_index -= stream->queue_size;
+		Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+		per_buffer_data = get_per_buffer_data(stream, buffer_index);
+		blocknum = read_stream_get_block(stream, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			/* End of stream. */
+			stream->distance = 0;
+			break;
+		}
+
+		/* Can we merge it with the pending read? */
+		if (stream->pending_read_nblocks > 0 &&
+			stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
+		{
+			stream->pending_read_nblocks++;
+			continue;
+		}
+
+		/* We have to start the pending read before we can build another. */
+		if (stream->pending_read_nblocks > 0)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			if (stream->ios_in_progress == stream->max_ios)
+			{
+				/* And we've hit the limit.  Rewind, and stop here. */
+				read_stream_unget_block(stream, blocknum);
+				return;
+			}
+		}
+
+		/* This is the start of a new pending read. */
+		stream->pending_read_blocknum = blocknum;
+		stream->pending_read_nblocks = 1;
+	}
+
+	/*
+	 * We don't start the pending read just because we've hit the distance
+	 * limit, preferring to give it another chance to grow to full
+	 * io_combine_limit size once more buffers have been consumed.  However,
+	 * if we've already reached io_combine_limit, or we've reached the
+	 * distance limit and there isn't anything pinned yet, or the callback has
+	 * signaled end-of-stream, we start the read immediately.
+	 */
+	if (stream->pending_read_nblocks > 0 &&
+		(stream->pending_read_nblocks == io_combine_limit ||
+		 (stream->pending_read_nblocks == stream->distance &&
+		  stream->pinned_buffers == 0) ||
+		 stream->distance == 0) &&
+		stream->ios_in_progress < stream->max_ios)
+		read_stream_start_pending_read(stream, suppress_advice);
+}
+
+/*
+ * Create a new read stream object that can be used to perform the equivalent
+ * of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.  The callback should return block numbers or InvalidBlockNumber to
+ * signal end-of-stream, and if per_buffer_data_size is non-zero, it may also
+ * write extra data for each block into the space provided to it.  It will
+ * also receive callback_private_data for its own purposes.
+ */
+ReadStream *
+read_stream_begin_relation(int flags,
+						   BufferAccessStrategy strategy,
+						   BufferManagerRelation bmr,
+						   ForkNumber forknum,
+						   ReadStreamBlockNumberCB callback,
+						   void *callback_private_data,
+						   size_t per_buffer_data_size)
+{
+	ReadStream *stream;
+	size_t		size;
+	int16		queue_size;
+	int16		max_ios;
+	uint32		max_pinned_buffers;
+	Oid			tablespace_id;
+
+	/* Make sure our bmr's smgr and persistent are populated. */
+	if (bmr.smgr == NULL)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	/*
+	 * Decide how many I/Os we will allow to run at the same time.  That
+	 * currently means advice to the kernel to tell it that we will soon read.
+	 * This number also affects how far we look ahead for opportunities to
+	 * start more I/Os.
+	 */
+	tablespace_id = bmr.smgr->smgr_rlocator.locator.spcOid;
+	if (!OidIsValid(MyDatabaseId) ||
+		(bmr.rel && IsCatalogRelation(bmr.rel)) ||
+		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
+	{
+		/*
+		 * Avoid circularity while trying to look up tablespace settings or
+		 * before spccache.c is ready.
+		 */
+		max_ios = effective_io_concurrency;
+	}
+	else if (flags & READ_STREAM_MAINTENANCE)
+		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+	else
+		max_ios = get_tablespace_io_concurrency(tablespace_id);
+	max_ios = Min(max_ios, PG_INT16_MAX);
+
+	/*
+	 * Choose the maximum number of buffers we're prepared to pin.  We try to
+	 * pin fewer if we can, though.  We clamp it to at least io_combine_limit
+	 * so that we can have a chance to build up a full io_combine_limit sized
+	 * read, even when max_ios is zero.  Be careful not to allow int16 to
+	 * overflow (even though that's not possible with the current GUC range
+	 * limits), allowing also for the spare entry and the overflow space.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+	max_pinned_buffers = Min(max_pinned_buffers,
+							 PG_INT16_MAX - io_combine_limit - 1);
+
+	/* Don't allow this backend to pin more than its share of buffers. */
+	if (SmgrIsTemp(bmr.smgr))
+		LimitAdditionalLocalPins(&max_pinned_buffers);
+	else
+		LimitAdditionalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * We need one extra entry for buffers and per-buffer data, because users
+	 * of per-buffer data have access to the object until the next call to
+	 * read_stream_next_buffer(), so we need a gap between the head and tail
+	 * of the queue so that we don't clobber it.
+	 */
+	queue_size = max_pinned_buffers + 1;
+
+	/*
+	 * Allocate the object, the buffers, the ios and per_data_data space in
+	 * one big chunk.  Though we have queue_size buffers, we want to be able
+	 * to assume that all the buffers for a single read are contiguous (i.e.
+	 * don't wrap around halfway through), so we allow temporary overflows of
+	 * up to the maximum possible read size by allocating an extra
+	 * io_combine_limit - 1 elements.
+	 */
+	size = offsetof(ReadStream, buffers);
+	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(InProgressIO) * Max(1, max_ios);
+	size += per_buffer_data_size * queue_size;
+	size += MAXIMUM_ALIGNOF * 2;
+	stream = (ReadStream *) palloc(size);
+	memset(stream, 0, offsetof(ReadStream, buffers));
+	stream->ios = (InProgressIO *)
+		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+	if (per_buffer_data_size > 0)
+		stream->per_buffer_data = (void *)
+			MAXALIGN(&stream->ios[Max(1, max_ios)]);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  We can use it as long as
+	 * direct I/O isn't enabled, the caller hasn't promised sequential access
+	 * (overriding our detection heuristics), and max_ios hasn't been set to
+	 * zero.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & READ_STREAM_SEQUENTIAL) == 0 &&
+		max_ios > 0)
+		stream->advice_enabled = true;
+#endif
+
+	/*
+	 * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
+	 * above.  If we had real asynchronous I/O we might need a slightly
+	 * different definition.
+	 */
+	if (max_ios == 0)
+		max_ios = 1;
+
+	stream->max_ios = max_ios;
+	stream->per_buffer_data_size = per_buffer_data_size;
+	stream->max_pinned_buffers = max_pinned_buffers;
+	stream->queue_size = queue_size;
+
+	if (!bmr.smgr)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+	stream->callback = callback;
+	stream->callback_private_data = callback_private_data;
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out assuming we'll be
+	 * doing full io_combine_limit sized reads (behavior B).
+	 */
+	if (flags & READ_STREAM_FULL)
+		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+	else
+		stream->distance = 1;
+
+	/*
+	 * Since we always currently always access the same relation, we can
+	 * initialize parts of the ReadBuffersOperation objects and leave them
+	 * that way, to avoid wasting CPU cycles writing to them for each read.
+	 */
+	for (int i = 0; i < max_ios; ++i)
+	{
+		stream->ios[i].op.bmr = bmr;
+		stream->ios[i].op.forknum = forknum;
+		stream->ios[i].op.strategy = strategy;
+	}
+
+	return stream;
+}
+
+/*
+ * Pull one pinned buffer out of a stream.  Each call returns successive
+ * blocks in the order specified by the callback.  If per_buffer_data_size was
+ * set to a non-zero size, *per_buffer_data receives a pointer to the extra
+ * per-buffer data that the callback had a chance to populate, which remains
+ * valid until the next call to read_stream_next_buffer().  When the stream
+ * runs out of data, InvalidBuffer is returned.  The caller may decide to end
+ * the stream early at any time by calling read_stream_end().
+ */
+Buffer
+read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
+{
+	Buffer		buffer;
+	int16		oldest_buffer_index;
+
+	/*
+	 * A fast path for all-cached scans (behavior A).  This is the same as the
+	 * usual algorithm, but it is specialized for no I/O and no per-buffer
+	 * data, so we can skip the queue management code, stay in the same buffer
+	 * slot and use singular StartReadBuffer().
+	 */
+	if (likely(per_buffer_data == NULL &&
+			   stream->ios_in_progress == 0 &&
+			   stream->pinned_buffers == 1 &&
+			   stream->distance == 1))
+	{
+		BlockNumber next_blocknum;
+
+		/*
+		 * We have a pinned buffer that we need to serve up, but we also want
+		 * to probe the next one before we return, just in case we need to
+		 * start an I/O.  We can re-use the same buffer slot, and an arbitrary
+		 * I/O slot since they're all free.
+		 */
+		oldest_buffer_index = stream->oldest_buffer_index;
+		Assert((oldest_buffer_index + 1) % stream->queue_size ==
+			   stream->next_buffer_index);
+		buffer = stream->buffers[oldest_buffer_index];
+		Assert(buffer != InvalidBuffer);
+		Assert(stream->pending_read_nblocks <= 1);
+		if (unlikely(stream->pending_read_nblocks == 1))
+		{
+			next_blocknum = stream->pending_read_blocknum;
+			stream->pending_read_nblocks = 0;
+		}
+		else
+			next_blocknum = read_stream_get_block(stream, NULL);
+		if (unlikely(next_blocknum == InvalidBlockNumber))
+		{
+			/* End of stream. */
+			stream->distance = 0;
+			stream->next_buffer_index = oldest_buffer_index;
+			/* Pin transferred to caller. */
+			stream->pinned_buffers = 0;
+			return buffer;
+		}
+		/* Call the special single block version, which is marginally faster. */
+		if (unlikely(StartReadBuffer(&stream->ios[0].op,
+									 &stream->buffers[oldest_buffer_index],
+									 next_blocknum,
+									 stream->advice_enabled ?
+									 READ_BUFFERS_ISSUE_ADVICE : 0)))
+		{
+			/* I/O needed.  We'll take the general path next time. */
+			stream->oldest_io_index = 0;
+			stream->next_io_index = stream->max_ios > 1 ? 1 : 0;
+			stream->ios_in_progress = 1;
+			stream->ios[0].buffer_index = oldest_buffer_index;
+			stream->seq_blocknum = next_blocknum + 1;
+			/* Increase look ahead distance (move towards behavior B/C). */
+			stream->distance = Min(2, stream->max_pinned_buffers);
+		}
+		/* Pin transferred to caller, got another one, no net change. */
+		Assert(stream->pinned_buffers == 1);
+		return buffer;
+	}
+
+	if (stream->pinned_buffers == 0)
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+
+		/* End of stream reached?  */
+		if (stream->distance == 0)
+			return InvalidBuffer;
+
+		/*
+		 * The usual order of operations is that we look ahead at the bottom
+		 * of this function after potentially finishing an I/O and making
+		 * space for more, but if we're just starting up we'll need to crank
+		 * the handle to get started.
+		 */
+		read_stream_look_ahead(stream, true);
+
+		/* End of stream reached? */
+		if (stream->pinned_buffers == 0)
+		{
+			Assert(stream->distance == 0);
+			return InvalidBuffer;
+		}
+	}
+
+	/* Grab the oldest pinned buffer and associated per-buffer data. */
+	Assert(stream->pinned_buffers > 0);
+	oldest_buffer_index = stream->oldest_buffer_index;
+	Assert(oldest_buffer_index >= 0 &&
+		   oldest_buffer_index < stream->queue_size);
+	buffer = stream->buffers[oldest_buffer_index];
+	if (per_buffer_data)
+		*per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
+
+	Assert(BufferIsValid(buffer));
+
+	/* Do we have to wait for an associated I/O first? */
+	if (stream->ios_in_progress > 0 &&
+		stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
+	{
+		int16		io_index = stream->oldest_io_index;
+		int16		distance;
+
+		/* Sanity check that we still agree on the buffers. */
+		Assert(stream->ios[io_index].op.buffers ==
+			   &stream->buffers[oldest_buffer_index]);
+
+		WaitReadBuffers(&stream->ios[io_index].op);
+
+		Assert(stream->ios_in_progress > 0);
+		stream->ios_in_progress--;
+		if (++stream->oldest_io_index == stream->max_ios)
+			stream->oldest_io_index = 0;
+
+		if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/* Distance ramps up fast (behavior C). */
+			distance = stream->distance * 2;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
+		else
+		{
+			/* No advice; move towards io_combine_limit (behavior B). */
+			if (stream->distance > io_combine_limit)
+			{
+				stream->distance--;
+			}
+			else
+			{
+				distance = stream->distance * 2;
+				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, stream->max_pinned_buffers);
+				stream->distance = distance;
+			}
+		}
+	}
+
+#ifdef CLOBBER_FREED_MEMORY
+	/* Clobber old buffer and per-buffer data for debugging purposes. */
+	stream->buffers[oldest_buffer_index] = InvalidBuffer;
+
+	/*
+	 * The caller will get access to the per-buffer data, until the next call.
+	 * We wipe the one before, which is never occupied because queue_size
+	 * allowed one extra element.  This will hopefully trip up client code
+	 * that is holding a dangling pointer to it.
+	 */
+	if (stream->per_buffer_data)
+		wipe_mem(get_per_buffer_data(stream,
+									 oldest_buffer_index == 0 ?
+									 stream->queue_size - 1 :
+									 oldest_buffer_index - 1),
+				 stream->per_buffer_data_size);
+#endif
+
+	/* Pin transferred to caller. */
+	Assert(stream->pinned_buffers > 0);
+	stream->pinned_buffers--;
+
+	/* Advance oldest buffer, with wrap-around. */
+	stream->oldest_buffer_index++;
+	if (stream->oldest_buffer_index == stream->queue_size)
+		stream->oldest_buffer_index = 0;
+
+	/* Prepare for the next call. */
+	read_stream_look_ahead(stream, false);
+
+	return buffer;
+}
+
+/*
+ * Reset a read stream by releasing any queued up buffers, allowing the stream
+ * to be used again for different blocks.  This can be used to clear an
+ * end-of-stream condition and start again, or to throw away blocks that were
+ * speculatively read and read some different blocks instead.
+ */
+void
+read_stream_reset(ReadStream *stream)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	stream->distance = 0;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(stream->pinned_buffers == 0);
+	Assert(stream->ios_in_progress == 0);
+
+	/* Start off assuming data is cached. */
+	stream->distance = 1;
+}
+
+/*
+ * Release and free a read stream.
+ */
+void
+read_stream_end(ReadStream *stream)
+{
+	read_stream_reset(stream);
+	pfree(stream);
+}
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
new file mode 100644
index 00000000000..f5dbc087b0b
--- /dev/null
+++ b/src/include/storage/read_stream.h
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.h
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/read_stream.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef READ_STREAM_H
+#define READ_STREAM_H
+
+#include "storage/bufmgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define READ_STREAM_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users, and thus should be governed by maintenance_io_concurrency
+ * instead of effective_io_concurrency.  For example, VACUUM or CREATE INDEX.
+ */
+#define READ_STREAM_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define READ_STREAM_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define READ_STREAM_FULL 0x04
+
+struct ReadStream;
+typedef struct ReadStream ReadStream;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
+												void *callback_private_data,
+												void *per_buffer_data);
+
+extern ReadStream *read_stream_begin_relation(int flags,
+											  BufferAccessStrategy strategy,
+											  BufferManagerRelation bmr,
+											  ForkNumber forknum,
+											  ReadStreamBlockNumberCB callback,
+											  void *callback_private_data,
+											  size_t per_buffer_data_size);
+extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
+extern void read_stream_reset(ReadStream *stream);
+extern void read_stream_end(ReadStream *stream);
+
+#endif							/* READ_STREAM_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 97edd1388e9..82fa6c12970 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1214,6 +1214,7 @@ InjectionPointCacheEntry
 InjectionPointEntry
 InjectionPointSharedState
 InlineCodeBlock
+InProgressIO
 InsertStmt
 Instrumentation
 Int128AggState
@@ -2292,6 +2293,7 @@ ReadExtraTocPtrType
 ReadFunc
 ReadLocalXLogPageNoWaitPrivate
 ReadReplicationSlotCmd
+ReadStream
 ReassignOwnedStmt
 RecheckForeignScan_function
 RecordCacheArrayEntry
-- 
2.39.2

v13-bmr-0003-Use-streaming-I-O-in-pg_prewarm.patchtext/x-patch; charset=UTF-8; name=v13-bmr-0003-Use-streaming-I-O-in-pg_prewarm.patchDownload

From aeeb4e67c69570f42deb8db2f3538d7135ba29c6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 09:28:42 +1200
Subject: [PATCH v13-bmr 3/5] Use streaming I/O in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use a ReadStream.  This
commit provides a simple example of such a transformation, and generates
fewer system calls.

Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 contrib/pg_prewarm/pg_prewarm.c | 40 ++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..f6269562b54 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -19,6 +19,7 @@
 #include "fmgr.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_read_stream_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_read_stream_next_block(ReadStream *stream,
+								  void *callback_private_data,
+								  void *per_buffer_data)
+{
+	struct pg_prewarm_read_stream_private *p = callback_private_data;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_read_stream_private p;
+		ReadStream *stream;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		stream = read_stream_begin_relation(READ_STREAM_FULL,
+											NULL,
+											BMR_REL(rel),
+											forkNumber,
+											pg_prewarm_read_stream_next_block,
+											&p,
+											0);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = read_stream_next_buffer(stream, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
 	}
 
 	/* Close relation, release lock. */
-- 
2.39.2

v13-bmr-0004-Add-readbufferbench.patchtext/x-patch; charset=UTF-8; name=v13-bmr-0004-Add-readbufferbench.patchDownload

From 9e27451318e4a4a2e9061e79b5139560610bd78a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 28 Mar 2024 20:21:25 +0200
Subject: [PATCH v13-bmr 4/5] Add readbufferbench

---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 src/test/modules/readbufferbench/Makefile     | 21 ++++
 src/test/modules/readbufferbench/bench.sql    | 20 ++++
 src/test/modules/readbufferbench/meson.build  | 28 ++++++
 .../readbufferbench/readbufferbench--1.0.sql  |  9 ++
 .../modules/readbufferbench/readbufferbench.c | 98 +++++++++++++++++++
 .../readbufferbench/readbufferbench.control   |  4 +
 8 files changed, 182 insertions(+)
 create mode 100644 src/test/modules/readbufferbench/Makefile
 create mode 100644 src/test/modules/readbufferbench/bench.sql
 create mode 100644 src/test/modules/readbufferbench/meson.build
 create mode 100644 src/test/modules/readbufferbench/readbufferbench--1.0.sql
 create mode 100644 src/test/modules/readbufferbench/readbufferbench.c
 create mode 100644 src/test/modules/readbufferbench/readbufferbench.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 1cbd532156d..6c96a83f60e 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -12,6 +12,7 @@ SUBDIRS = \
 		  dummy_seclabel \
 		  libpq_pipeline \
 		  plsample \
+		  readbufferbench \
 		  spgist_name_ops \
 		  test_bloomfilter \
 		  test_copy_callbacks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 7c11fb97f21..8ad342bcc73 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -10,6 +10,7 @@ subdir('injection_points')
 subdir('ldap_password_func')
 subdir('libpq_pipeline')
 subdir('plsample')
+subdir('readbufferbench')
 subdir('spgist_name_ops')
 subdir('ssl_passphrase_callback')
 subdir('test_bloomfilter')
diff --git a/src/test/modules/readbufferbench/Makefile b/src/test/modules/readbufferbench/Makefile
new file mode 100644
index 00000000000..8b421b1cb1d
--- /dev/null
+++ b/src/test/modules/readbufferbench/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/readbufferbench/Makefile
+
+MODULE_big = readbufferbench
+OBJS = \
+	$(WIN32RES) \
+	readbufferbench.o
+PGFILEDESC = "readbufferbench - micro-benchmarking code ReadBuffer and friends"
+
+EXTENSION = readbufferbench
+DATA = readbufferbench--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/readbufferbench
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/readbufferbench/bench.sql b/src/test/modules/readbufferbench/bench.sql
new file mode 100644
index 00000000000..76b6be5f3e1
--- /dev/null
+++ b/src/test/modules/readbufferbench/bench.sql
@@ -0,0 +1,20 @@
+drop table foo;
+create table foo (i integer) with (fillfactor=10);
+insert into foo select g from generate_series(1, 100000) g;
+
+do $$
+declare
+  begints timestamptz;
+  endts timestamptz;
+  results float[];
+  i int;
+begin
+  for i in 1..5 loop
+    begints = clock_timestamp();
+    perform readbuffer_bench('foo', 10, 10000000);
+    endts = clock_timestamp();
+    results[i] = extract(epoch from endts) - extract(epoch from begints);
+    raise notice 'run %: %', i, results[i];
+  end loop;
+end;
+$$;
diff --git a/src/test/modules/readbufferbench/meson.build b/src/test/modules/readbufferbench/meson.build
new file mode 100644
index 00000000000..1b7e7863c26
--- /dev/null
+++ b/src/test/modules/readbufferbench/meson.build
@@ -0,0 +1,28 @@
+# Copyright (c) 2022-2024, PostgreSQL Global Development Group
+
+readbufferbench_sources = files(
+  'readbufferbench.c',
+)
+
+if host_system == 'windows'
+  readbufferbench_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'readbufferbench',
+    '--FILEDESC', 'readbufferbench - micro-benchmarking code ReadBuffer and friends',])
+endif
+
+readbufferbench = shared_module('readbufferbench',
+  readbufferbench_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += readbufferbench
+
+test_install_data += files(
+  'readbufferbench.control',
+  'readbufferbench--1.0.sql',
+)
+
+tests += {
+  'name': 'readbufferbench',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+}
diff --git a/src/test/modules/readbufferbench/readbufferbench--1.0.sql b/src/test/modules/readbufferbench/readbufferbench--1.0.sql
new file mode 100644
index 00000000000..0ab5a4015a3
--- /dev/null
+++ b/src/test/modules/readbufferbench/readbufferbench--1.0.sql
@@ -0,0 +1,9 @@
+/* src/test/modules/readbufferbench/readbufferbench--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION readbufferbench" to load this file. \quit
+
+
+CREATE FUNCTION readbuffer_bench(regclass, int4, int4)
+RETURNS void
+AS 'MODULE_PATHNAME' LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/src/test/modules/readbufferbench/readbufferbench.c b/src/test/modules/readbufferbench/readbufferbench.c
new file mode 100644
index 00000000000..8d4c8281c29
--- /dev/null
+++ b/src/test/modules/readbufferbench/readbufferbench.c
@@ -0,0 +1,98 @@
+/*--------------------------------------------------------------------------
+ *
+ * readbufferbench.c
+ *		Test false positive rate of Bloom filter.
+ *
+ * Copyright (c) 2018-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/readbufferbench/readbufferbench.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/rel.h"
+
+PG_MODULE_MAGIC;
+
+
+PG_FUNCTION_INFO_V1(readbuffer_bench);
+
+Datum
+readbuffer_bench(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	int32		nblocks = PG_GETARG_INT32(1);
+	uint32		niters = PG_GETARG_UINT32(2);
+	Relation	rel;
+	Buffer		buf;
+
+	if (PG_NARGS() != 3)
+		elog(ERROR, "wrong number of arguments");
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("must be superuser")));
+
+	rel = relation_open(relid, AccessShareLock);
+
+	if (!RELKIND_HAS_STORAGE(rel->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("cannot get page from relation \"%s\"",
+						RelationGetRelationName(rel)),
+				 errdetail_relkind_not_supported(rel->rd_rel->relkind)));
+
+	/*
+	 * Reject attempts to read non-local temporary relations; we would be
+	 * likely to get wrong data since we have no visibility into the owning
+	 * session's local buffers.
+	 */
+	if (RELATION_IS_OTHER_TEMP(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot access temporary tables of other sessions")));
+
+	if (nblocks < 0)
+		nblocks = RelationGetNumberOfBlocksInFork(rel, MAIN_FORKNUM);
+	else if (nblocks > RelationGetNumberOfBlocksInFork(rel, MAIN_FORKNUM))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("relation has only %d blocks",
+						nblocks)));
+
+#if 1
+	{
+		BufferManagerRelation bmr;
+
+		InitBMRForRel(&bmr, rel, MAIN_FORKNUM, NULL);
+		for (int i = 0; i < niters; i++)
+		{
+			for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
+			{
+				buf = ReadBufferBMR(&bmr, blkno, RBM_NORMAL);
+				ReleaseBuffer(buf);
+			}
+		}
+	}
+#else
+	for (int i = 0; i < niters; i++)
+	{
+		for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
+		{
+			buf = ReadBuffer(rel, blkno);
+			ReleaseBuffer(buf);
+		}
+	}
+#endif
+
+	relation_close(rel, AccessShareLock);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/test/modules/readbufferbench/readbufferbench.control b/src/test/modules/readbufferbench/readbufferbench.control
new file mode 100644
index 00000000000..c7488504c86
--- /dev/null
+++ b/src/test/modules/readbufferbench/readbufferbench.control
@@ -0,0 +1,4 @@
+comment = 'Micro-benchmark of ReadBuffer and friends'
+default_version = '1.0'
+module_pathname = '$libdir/readbufferbench'
+relocatable = true
-- 
2.39.2

v13-bmr-0005-Refactor-how-BufferManagerRelations-are-used.patchtext/x-patch; charset=UTF-8; name=v13-bmr-0005-Refactor-how-BufferManagerRelations-are-used.patchDownload

From 3da0557b06ef03b190ba1f22522b4cf56fb2a52d Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 28 Mar 2024 21:48:02 +0200
Subject: [PATCH v13-bmr 5/5] Refactor how BufferManagerRelations are used

---
 contrib/bloom/blutils.c                   |   8 +-
 contrib/pg_prewarm/pg_prewarm.c           |   7 +-
 src/backend/access/brin/brin.c            |  10 +-
 src/backend/access/brin/brin_revmap.c     |   8 +-
 src/backend/access/gin/gininsert.c        |   9 +-
 src/backend/access/gin/ginutil.c          |   5 +-
 src/backend/access/gist/gist.c            |   5 +-
 src/backend/access/gist/gistutil.c        |   5 +-
 src/backend/access/hash/hashpage.c        |   9 +-
 src/backend/access/heap/hio.c             |   5 +-
 src/backend/access/heap/visibilitymap.c   |   4 +-
 src/backend/access/nbtree/nbtpage.c       |   7 +-
 src/backend/access/spgist/spgutils.c      |   8 +-
 src/backend/access/transam/xlogutils.c    |  10 +-
 src/backend/commands/sequence.c           |   6 +-
 src/backend/storage/aio/read_stream.c     |  32 +-
 src/backend/storage/buffer/bufmgr.c       | 374 +++++++++++-----------
 src/backend/storage/buffer/localbuf.c     |  11 +-
 src/backend/storage/freespace/freespace.c |   6 +-
 src/include/storage/buf_internals.h       |   3 +-
 src/include/storage/bufmgr.h              |  67 ++--
 src/include/storage/read_stream.h         |   4 +-
 22 files changed, 304 insertions(+), 299 deletions(-)

diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6836129c90d..177c8698278 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -354,8 +354,11 @@ BloomPageAddItem(BloomState *state, Page page, BloomTuple *tuple)
 Buffer
 BloomNewBuffer(Relation index)
 {
+	BufferManagerRelation bmr;
 	Buffer		buffer;
 
+	InitBMRForRel(&bmr, index, MAIN_FORKNUM, NULL);
+
 	/* First, try to get a page from FSM */
 	for (;;)
 	{
@@ -364,7 +367,7 @@ BloomNewBuffer(Relation index)
 		if (blkno == InvalidBlockNumber)
 			break;
 
-		buffer = ReadBuffer(index, blkno);
+		buffer = ReadBufferBMR(&bmr, blkno, RBM_NORMAL);
 
 		/*
 		 * We have to guard against the possibility that someone else already
@@ -388,8 +391,7 @@ BloomNewBuffer(Relation index)
 	}
 
 	/* Must extend the file */
-	buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
-							   EB_LOCK_FIRST);
+	buffer = ExtendBufferedRel(&bmr, EB_LOCK_FIRST);
 
 	return buffer;
 }
diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index f6269562b54..72ec9378581 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -205,6 +205,9 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	{
 		struct pg_prewarm_read_stream_private p;
 		ReadStream *stream;
+		BufferManagerRelation bmr;
+
+		InitBMRForRel(&bmr, rel, forkNumber, NULL);
 
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
@@ -215,9 +218,7 @@ pg_prewarm(PG_FUNCTION_ARGS)
 		p.last_block = last_block;
 
 		stream = read_stream_begin_relation(READ_STREAM_FULL,
-											NULL,
-											BMR_REL(rel),
-											forkNumber,
+											&bmr,
 											pg_prewarm_read_stream_next_block,
 											&p,
 											0);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 041415a40e7..32b4189314d 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1088,6 +1088,7 @@ brinbuildCallbackParallel(Relation index,
 IndexBuildResult *
 brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
+	BufferManagerRelation bmr;
 	IndexBuildResult *result;
 	double		reltuples;
 	double		idxtuples;
@@ -1108,8 +1109,8 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	 * whole relation will be rolled back.
 	 */
 
-	meta = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
-							 EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
+	InitBMRForRel(&bmr, index, MAIN_FORKNUM, NULL);
+	meta = ExtendBufferedRel(&bmr, EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
 	Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
 
 	brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
@@ -1258,11 +1259,12 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 brinbuildempty(Relation index)
 {
+	BufferManagerRelation bmr;
 	Buffer		metabuf;
 
 	/* An empty BRIN index has a metapage only. */
-	metabuf = ExtendBufferedRel(BMR_REL(index), INIT_FORKNUM, NULL,
-								EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
+	InitBMRForRel(&bmr, index, INIT_FORKNUM, NULL);
+	metabuf = ExtendBufferedRel(&bmr, EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
 
 	/* Initialize and xlog metabuffer. */
 	START_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 5a9ed40ab64..15746942f55 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -528,6 +528,9 @@ revmap_physical_extend(BrinRevmap *revmap)
 	BlockNumber mapBlk;
 	BlockNumber nblocks;
 	Relation	irel = revmap->rm_irel;
+	BufferManagerRelation bmr;
+
+	InitBMRForRel(&bmr, irel, MAIN_FORKNUM, NULL);
 
 	/*
 	 * Lock the metapage. This locks out concurrent extensions of the revmap,
@@ -553,14 +556,13 @@ revmap_physical_extend(BrinRevmap *revmap)
 	nblocks = RelationGetNumberOfBlocks(irel);
 	if (mapBlk < nblocks)
 	{
-		buf = ReadBuffer(irel, mapBlk);
+		buf = ReadBufferBMR(&bmr, mapBlk, RBM_NORMAL);
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 		page = BufferGetPage(buf);
 	}
 	else
 	{
-		buf = ExtendBufferedRel(BMR_REL(irel), MAIN_FORKNUM, NULL,
-								EB_LOCK_FIRST);
+		buf = ExtendBufferedRel(&bmr, EB_LOCK_FIRST);
 		if (BufferGetBlockNumber(buf) != mapBlk)
 		{
 			/*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 71f38be90c3..ccddba919f6 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -433,14 +433,15 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 ginbuildempty(Relation index)
 {
+	BufferManagerRelation bmr;
 	Buffer		RootBuffer,
 				MetaBuffer;
 
+	InitBMRForRel(&bmr, index, INIT_FORKNUM, NULL);
+
 	/* An empty GIN index has two pages. */
-	MetaBuffer = ExtendBufferedRel(BMR_REL(index), INIT_FORKNUM, NULL,
-								   EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
-	RootBuffer = ExtendBufferedRel(BMR_REL(index), INIT_FORKNUM, NULL,
-								   EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
+	MetaBuffer = ExtendBufferedRel(&bmr, EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
+	RootBuffer = ExtendBufferedRel(&bmr, EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
 
 	/* Initialize and xlog metabuffer and root buffer. */
 	START_CRIT_SECTION();
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 5747ae6a4ca..0da4a218402 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -299,6 +299,7 @@ gintuple_get_key(GinState *ginstate, IndexTuple tuple,
 Buffer
 GinNewBuffer(Relation index)
 {
+	BufferManagerRelation bmr;
 	Buffer		buffer;
 
 	/* First, try to get a page from FSM */
@@ -328,8 +329,8 @@ GinNewBuffer(Relation index)
 	}
 
 	/* Must extend the file */
-	buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
-							   EB_LOCK_FIRST);
+	InitBMRForRel(&bmr, index, MAIN_FORKNUM, NULL);
+	buffer = ExtendBufferedRel(&bmr, EB_LOCK_FIRST);
 
 	return buffer;
 }
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index ed4ffa63a77..d1761b73fe2 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -132,11 +132,12 @@ createTempGistContext(void)
 void
 gistbuildempty(Relation index)
 {
+	BufferManagerRelation bmr;
 	Buffer		buffer;
 
 	/* Initialize the root page */
-	buffer = ExtendBufferedRel(BMR_REL(index), INIT_FORKNUM, NULL,
-							   EB_SKIP_EXTENSION_LOCK | EB_LOCK_FIRST);
+	InitBMRForRel(&bmr, index, INIT_FORKNUM, NULL);
+	buffer = ExtendBufferedRel(&bmr, EB_SKIP_EXTENSION_LOCK | EB_LOCK_FIRST);
 
 	/* Initialize and xlog buffer */
 	START_CRIT_SECTION();
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index d2d0b36d4ea..1d8a955b093 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -823,6 +823,7 @@ gistcheckpage(Relation rel, Buffer buf)
 Buffer
 gistNewBuffer(Relation r, Relation heaprel)
 {
+	BufferManagerRelation bmr;
 	Buffer		buffer;
 
 	/* First, try to get a page from FSM */
@@ -877,8 +878,8 @@ gistNewBuffer(Relation r, Relation heaprel)
 	}
 
 	/* Must extend the file */
-	buffer = ExtendBufferedRel(BMR_REL(r), MAIN_FORKNUM, NULL,
-							   EB_LOCK_FIRST);
+	InitBMRForRel(&bmr, r, MAIN_FORKNUM, NULL);
+	buffer = ExtendBufferedRel(&bmr, EB_LOCK_FIRST);
 
 	return buffer;
 }
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index d09c349e28f..bab6b22b0b3 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -197,6 +197,7 @@ _hash_initbuf(Buffer buf, uint32 max_bucket, uint32 num_bucket, uint32 flag,
 Buffer
 _hash_getnewbuf(Relation rel, BlockNumber blkno, ForkNumber forkNum)
 {
+	BufferManagerRelation bmr;
 	BlockNumber nblocks = RelationGetNumberOfBlocksInFork(rel, forkNum);
 	Buffer		buf;
 
@@ -206,19 +207,19 @@ _hash_getnewbuf(Relation rel, BlockNumber blkno, ForkNumber forkNum)
 		elog(ERROR, "access to noncontiguous page in hash index \"%s\"",
 			 RelationGetRelationName(rel));
 
+	InitBMRForRel(&bmr, rel, forkNum, NULL);
+
 	/* smgr insists we explicitly extend the relation */
 	if (blkno == nblocks)
 	{
-		buf = ExtendBufferedRel(BMR_REL(rel), forkNum, NULL,
-								EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
+		buf = ExtendBufferedRel(&bmr, EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
 		if (BufferGetBlockNumber(buf) != blkno)
 			elog(ERROR, "unexpected hash relation size: %u, should be %u",
 				 BufferGetBlockNumber(buf), blkno);
 	}
 	else
 	{
-		buf = ReadBufferExtended(rel, forkNum, blkno, RBM_ZERO_AND_LOCK,
-								 NULL);
+		buf = ReadBufferBMR(&bmr, blkno, RBM_ZERO_AND_LOCK);
 	}
 
 	/* ref count and lock type are correct */
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 7c662cdf46e..f17ffe8a64b 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -246,6 +246,7 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate,
 	uint32		not_in_fsm_pages;
 	Buffer		buffer;
 	Page		page;
+	BufferManagerRelation bmr;
 
 	/*
 	 * Determine by how many pages to try to extend by.
@@ -338,8 +339,8 @@ RelationAddBlocks(Relation relation, BulkInsertState bistate,
 	 * [auto]vacuum trying to truncate later pages as REL_TRUNCATE_MINIMUM is
 	 * way larger.
 	 */
-	first_block = ExtendBufferedRelBy(BMR_REL(relation), MAIN_FORKNUM,
-									  bistate ? bistate->strategy : NULL,
+	InitBMRForRel(&bmr, relation, MAIN_FORKNUM, bistate ? bistate->strategy : NULL);
+	first_block = ExtendBufferedRelBy(&bmr,
 									  EB_LOCK_FIRST,
 									  extend_by_pages,
 									  victim_buffers,
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 1ab6c865e3c..e04752adecd 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -626,9 +626,11 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
 static Buffer
 vm_extend(Relation rel, BlockNumber vm_nblocks)
 {
+	BufferManagerRelation bmr;
 	Buffer		buf;
 
-	buf = ExtendBufferedRelTo(BMR_REL(rel), VISIBILITYMAP_FORKNUM, NULL,
+	InitBMRForRel(&bmr, rel, VISIBILITYMAP_FORKNUM, NULL);
+	buf = ExtendBufferedRelTo(&bmr,
 							  EB_CREATE_FORK_IF_NEEDED |
 							  EB_CLEAR_SIZE_CACHE,
 							  vm_nblocks,
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 01bbece6bfd..6f34ed36ad3 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -868,12 +868,15 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 Buffer
 _bt_allocbuf(Relation rel, Relation heaprel)
 {
+	BufferManagerRelation bmr;
 	Buffer		buf;
 	BlockNumber blkno;
 	Page		page;
 
 	Assert(heaprel != NULL);
 
+	InitBMRForRel(&bmr, rel, MAIN_FORKNUM, NULL);
+
 	/*
 	 * First see if the FSM knows of any free pages.
 	 *
@@ -903,7 +906,7 @@ _bt_allocbuf(Relation rel, Relation heaprel)
 		blkno = GetFreeIndexPage(rel);
 		if (blkno == InvalidBlockNumber)
 			break;
-		buf = ReadBuffer(rel, blkno);
+		buf = ReadBufferBMR(&bmr, blkno, RBM_NORMAL);
 		if (_bt_conditionallockbuf(rel, buf))
 		{
 			page = BufferGetPage(buf);
@@ -975,7 +978,7 @@ _bt_allocbuf(Relation rel, Relation heaprel)
 	 * otherwise would make, as we can't use _bt_lockbuf() without introducing
 	 * a race.
 	 */
-	buf = ExtendBufferedRel(BMR_REL(rel), MAIN_FORKNUM, NULL, EB_LOCK_FIRST);
+	buf = ExtendBufferedRel(&bmr, EB_LOCK_FIRST);
 	if (!RelationUsesLocalBuffers(rel))
 		VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
 
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 3f793125f74..528bd6d3bfa 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -374,8 +374,11 @@ initSpGistState(SpGistState *state, Relation index)
 Buffer
 SpGistNewBuffer(Relation index)
 {
+	BufferManagerRelation bmr;
 	Buffer		buffer;
 
+	InitBMRForRel(&bmr, index, MAIN_FORKNUM, NULL);
+
 	/* First, try to get a page from FSM */
 	for (;;)
 	{
@@ -391,7 +394,7 @@ SpGistNewBuffer(Relation index)
 		if (SpGistBlockIsFixed(blkno))
 			continue;
 
-		buffer = ReadBuffer(index, blkno);
+		buffer = ReadBufferBMR(&bmr, blkno, RBM_NORMAL);
 
 		/*
 		 * We have to guard against the possibility that someone else already
@@ -414,8 +417,7 @@ SpGistNewBuffer(Relation index)
 		ReleaseBuffer(buffer);
 	}
 
-	buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
-							   EB_LOCK_FIRST);
+	buffer = ExtendBufferedRel(&bmr, EB_LOCK_FIRST);
 
 	return buffer;
 }
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5295b85fe07..ca055a85bb1 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -472,6 +472,7 @@ XLogReadBufferExtended(RelFileLocator rlocator, ForkNumber forknum,
 					   BlockNumber blkno, ReadBufferMode mode,
 					   Buffer recent_buffer)
 {
+	BufferManagerRelation bmr;
 	BlockNumber lastblock;
 	Buffer		buffer;
 	SMgrRelation smgr;
@@ -502,11 +503,11 @@ XLogReadBufferExtended(RelFileLocator rlocator, ForkNumber forknum,
 
 	lastblock = smgrnblocks(smgr, forknum);
 
+	InitBMRForSMgr(&bmr, smgr, RELPERSISTENCE_PERMANENT, forknum, NULL);
 	if (blkno < lastblock)
 	{
 		/* page exists in file */
-		buffer = ReadBufferWithoutRelcache(rlocator, forknum, blkno,
-										   mode, NULL, true);
+		buffer = ReadBufferBMR(&bmr, blkno, mode);
 	}
 	else
 	{
@@ -521,9 +522,8 @@ XLogReadBufferExtended(RelFileLocator rlocator, ForkNumber forknum,
 		/* OK to extend the file */
 		/* we do this in recovery only - no rel-extension lock needed */
 		Assert(InRecovery);
-		buffer = ExtendBufferedRelTo(BMR_SMGR(smgr, RELPERSISTENCE_PERMANENT),
-									 forknum,
-									 NULL,
+
+		buffer = ExtendBufferedRelTo(&bmr,
 									 EB_PERFORMING_RECOVERY |
 									 EB_SKIP_EXTENSION_LOCK,
 									 blkno + 1,
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 46103561c31..ac125407932 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -362,11 +362,13 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum)
 	Page		page;
 	sequence_magic *sm;
 	OffsetNumber offnum;
+	BufferManagerRelation bmr;
+
+	InitBMRForRel(&bmr, rel, forkNum, NULL);
 
 	/* Initialize first page of relation with special magic number */
 
-	buf = ExtendBufferedRel(BMR_REL(rel), forkNum, NULL,
-							EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
+	buf = ExtendBufferedRel(&bmr, EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
 	Assert(BufferGetBlockNumber(buf) == 0);
 
 	page = BufferGetPage(buf);
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 6b4b97372c7..6ee81e5fee9 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -116,6 +116,8 @@ struct ReadStream
 	int16		distance;
 	bool		advice_enabled;
 
+	BufferManagerRelation bmr;
+
 	/*
 	 * Sometimes we need to be able to 'unget' a block number to resolve a
 	 * flow control problem when I/Os are split.
@@ -375,9 +377,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
  */
 ReadStream *
 read_stream_begin_relation(int flags,
-						   BufferAccessStrategy strategy,
-						   BufferManagerRelation bmr,
-						   ForkNumber forknum,
+						   BufferManagerRelation *bmr,
 						   ReadStreamBlockNumberCB callback,
 						   void *callback_private_data,
 						   size_t per_buffer_data_size)
@@ -389,23 +389,16 @@ read_stream_begin_relation(int flags,
 	uint32		max_pinned_buffers;
 	Oid			tablespace_id;
 
-	/* Make sure our bmr's smgr and persistent are populated. */
-	if (bmr.smgr == NULL)
-	{
-		bmr.smgr = RelationGetSmgr(bmr.rel);
-		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
-	}
-
 	/*
 	 * Decide how many I/Os we will allow to run at the same time.  That
 	 * currently means advice to the kernel to tell it that we will soon read.
 	 * This number also affects how far we look ahead for opportunities to
 	 * start more I/Os.
 	 */
-	tablespace_id = bmr.smgr->smgr_rlocator.locator.spcOid;
+	tablespace_id = bmr->smgr->smgr_rlocator.locator.spcOid;
 	if (!OidIsValid(MyDatabaseId) ||
-		(bmr.rel && IsCatalogRelation(bmr.rel)) ||
-		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
+		(bmr->rel && IsCatalogRelation(bmr->rel)) ||
+		IsCatalogRelationOid(bmr->smgr->smgr_rlocator.locator.relNumber))
 	{
 		/*
 		 * Avoid circularity while trying to look up tablespace settings or
@@ -432,7 +425,7 @@ read_stream_begin_relation(int flags,
 							 PG_INT16_MAX - io_combine_limit - 1);
 
 	/* Don't allow this backend to pin more than its share of buffers. */
-	if (SmgrIsTemp(bmr.smgr))
+	if (SmgrIsTemp(bmr->smgr))
 		LimitAdditionalLocalPins(&max_pinned_buffers);
 	else
 		LimitAdditionalPins(&max_pinned_buffers);
@@ -493,12 +486,6 @@ read_stream_begin_relation(int flags,
 	stream->per_buffer_data_size = per_buffer_data_size;
 	stream->max_pinned_buffers = max_pinned_buffers;
 	stream->queue_size = queue_size;
-
-	if (!bmr.smgr)
-	{
-		bmr.smgr = RelationGetSmgr(bmr.rel);
-		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
-	}
 	stream->callback = callback;
 	stream->callback_private_data = callback_private_data;
 
@@ -517,11 +504,10 @@ read_stream_begin_relation(int flags,
 	 * initialize parts of the ReadBuffersOperation objects and leave them
 	 * that way, to avoid wasting CPU cycles writing to them for each read.
 	 */
+	memcpy(&stream->bmr, bmr, sizeof(BufferManagerRelation));
 	for (int i = 0; i < max_ios; ++i)
 	{
-		stream->ios[i].op.bmr = bmr;
-		stream->ios[i].op.forknum = forknum;
-		stream->ios[i].op.strategy = strategy;
+		stream->ios[i].op.bmr = &stream->bmr;
 	}
 
 	return stream;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7123cbbaa2a..359981b4293 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -479,20 +479,13 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(BufferManagerRelation bmr,
-								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy);
-static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
-										   ForkNumber fork,
-										   BufferAccessStrategy strategy,
+static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation *bmr,
 										   uint32 flags,
 										   uint32 extend_by,
 										   BlockNumber extend_upto,
 										   Buffer *buffers,
 										   uint32 *extended_by);
-static BlockNumber ExtendBufferedRelShared(BufferManagerRelation bmr,
-										   ForkNumber fork,
-										   BufferAccessStrategy strategy,
+static BlockNumber ExtendBufferedRelShared(BufferManagerRelation *bmr,
 										   uint32 flags,
 										   uint32 extend_by,
 										   BlockNumber extend_upto,
@@ -788,6 +781,7 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
+	BufferManagerRelation bmr;
 	Buffer		buf;
 
 	/*
@@ -800,8 +794,8 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	buf = ReadBuffer_common(BMR_REL(reln),
-							forkNum, blockNum, mode, strategy);
+	InitBMRForRel(&bmr, reln, forkNum, strategy);
+	buf = ReadBufferBMR(&bmr, blockNum, mode);
 
 	return buf;
 }
@@ -822,27 +816,86 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
+	BufferManagerRelation bmr;
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-									  RELPERSISTENCE_UNLOGGED),
-							 forkNum, blockNum,
-							 mode, strategy);
+	InitBMRForSMgr(&bmr, smgr,
+				   permanent ? RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED,
+				   forkNum,
+				   strategy);
+
+	return ReadBufferBMR(&bmr, blockNum, mode);
+}
+
+void
+InitBMRForRel(BufferManagerRelation *bmr, Relation rel, ForkNumber forkNum, BufferAccessStrategy strategy)
+{
+	bool		isLocalBuf;
+
+	bmr->rel = rel;
+	bmr->smgr = RelationGetSmgr(rel);
+	bmr->relpersistence = rel->rd_rel->relpersistence;
+	bmr->forkNum = forkNum;
+	bmr->strategy = strategy;
+
+	/*
+	 * Look up some further information that's needed in ReadBufferBMR
+	 * upfront.
+	 */
+	isLocalBuf = bmr->relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		bmr->io_context = IOCONTEXT_NORMAL;
+		bmr->io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		bmr->io_context = IOContextForStrategy(strategy);
+		bmr->io_object = IOOBJECT_RELATION;
+	}
+
+	if (pgstat_should_count_relation(rel))
+		bmr->pgstat_info = rel->pgstat_info;
+	else
+		bmr->pgstat_info = NULL;
+}
+
+void
+InitBMRForSMgr(BufferManagerRelation *bmr, struct SMgrRelationData *srel, char relpersistence, ForkNumber forkNum, BufferAccessStrategy strategy)
+{
+	bool		isLocalBuf;
+
+	bmr->rel = NULL;
+	bmr->smgr = srel;
+	bmr->relpersistence = relpersistence;
+	bmr->forkNum = forkNum;
+	bmr->strategy = strategy;
+
+	isLocalBuf = relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		bmr->io_context = IOCONTEXT_NORMAL;
+		bmr->io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		bmr->io_context = IOContextForStrategy(strategy);
+		bmr->io_object = IOOBJECT_RELATION;
+	}
+
+	bmr->pgstat_info = NULL;
 }
 
 /*
  * Convenience wrapper around ExtendBufferedRelBy() extending by one block.
  */
 Buffer
-ExtendBufferedRel(BufferManagerRelation bmr,
-				  ForkNumber forkNum,
-				  BufferAccessStrategy strategy,
-				  uint32 flags)
+ExtendBufferedRel(BufferManagerRelation *bmr, uint32 flags)
 {
 	Buffer		buf;
 	uint32		extend_by = 1;
 
-	ExtendBufferedRelBy(bmr, forkNum, strategy, flags, extend_by,
+	ExtendBufferedRelBy(bmr, flags, extend_by,
 						&buf, &extend_by);
 
 	return buf;
@@ -866,25 +919,16 @@ ExtendBufferedRel(BufferManagerRelation bmr,
  * be empty.
  */
 BlockNumber
-ExtendBufferedRelBy(BufferManagerRelation bmr,
-					ForkNumber fork,
-					BufferAccessStrategy strategy,
+ExtendBufferedRelBy(BufferManagerRelation *bmr,
 					uint32 flags,
 					uint32 extend_by,
 					Buffer *buffers,
 					uint32 *extended_by)
 {
-	Assert((bmr.rel != NULL) != (bmr.smgr != NULL));
-	Assert(bmr.smgr == NULL || bmr.relpersistence != 0);
+	Assert(bmr->smgr != NULL);
 	Assert(extend_by > 0);
 
-	if (bmr.smgr == NULL)
-	{
-		bmr.smgr = RelationGetSmgr(bmr.rel);
-		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
-	}
-
-	return ExtendBufferedRelCommon(bmr, fork, strategy, flags,
+	return ExtendBufferedRelCommon(bmr, flags,
 								   extend_by, InvalidBlockNumber,
 								   buffers, extended_by);
 }
@@ -898,9 +942,7 @@ ExtendBufferedRelBy(BufferManagerRelation bmr,
  * crash recovery).
  */
 Buffer
-ExtendBufferedRelTo(BufferManagerRelation bmr,
-					ForkNumber fork,
-					BufferAccessStrategy strategy,
+ExtendBufferedRelTo(BufferManagerRelation *bmr,
 					uint32 flags,
 					BlockNumber extend_to,
 					ReadBufferMode mode)
@@ -910,33 +952,27 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	Buffer		buffer = InvalidBuffer;
 	Buffer		buffers[64];
 
-	Assert((bmr.rel != NULL) != (bmr.smgr != NULL));
-	Assert(bmr.smgr == NULL || bmr.relpersistence != 0);
+	Assert(bmr->smgr != NULL);
 	Assert(extend_to != InvalidBlockNumber && extend_to > 0);
 
-	if (bmr.smgr == NULL)
-	{
-		bmr.smgr = RelationGetSmgr(bmr.rel);
-		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
-	}
-
 	/*
 	 * If desired, create the file if it doesn't exist.  If
 	 * smgr_cached_nblocks[fork] is positive then it must exist, no need for
 	 * an smgrexists call.
 	 */
 	if ((flags & EB_CREATE_FORK_IF_NEEDED) &&
-		(bmr.smgr->smgr_cached_nblocks[fork] == 0 ||
-		 bmr.smgr->smgr_cached_nblocks[fork] == InvalidBlockNumber) &&
-		!smgrexists(bmr.smgr, fork))
+		(bmr->smgr->smgr_cached_nblocks[bmr->forkNum] == 0 ||
+		 bmr->smgr->smgr_cached_nblocks[bmr->forkNum] == InvalidBlockNumber) &&
+		!smgrexists(bmr->smgr, bmr->forkNum))
 	{
-		LockRelationForExtension(bmr.rel, ExclusiveLock);
+		Assert(bmr->rel);
+		LockRelationForExtension(bmr->rel, ExclusiveLock);
 
 		/* recheck, fork might have been created concurrently */
-		if (!smgrexists(bmr.smgr, fork))
-			smgrcreate(bmr.smgr, fork, flags & EB_PERFORMING_RECOVERY);
+		if (!smgrexists(bmr->smgr, bmr->forkNum))
+			smgrcreate(bmr->smgr, bmr->forkNum, flags & EB_PERFORMING_RECOVERY);
 
-		UnlockRelationForExtension(bmr.rel, ExclusiveLock);
+		UnlockRelationForExtension(bmr->rel, ExclusiveLock);
 	}
 
 	/*
@@ -944,13 +980,13 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 * kernel.
 	 */
 	if (flags & EB_CLEAR_SIZE_CACHE)
-		bmr.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;
+		bmr->smgr->smgr_cached_nblocks[bmr->forkNum] = InvalidBlockNumber;
 
 	/*
 	 * Estimate how many pages we'll need to extend by. This avoids acquiring
 	 * unnecessarily many victim buffers.
 	 */
-	current_size = smgrnblocks(bmr.smgr, fork);
+	current_size = smgrnblocks(bmr->smgr, bmr->forkNum);
 
 	/*
 	 * Since no-one else can be looking at the page contents yet, there is no
@@ -969,7 +1005,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 		if ((uint64) current_size + num_pages > extend_to)
 			num_pages = extend_to - current_size;
 
-		first_block = ExtendBufferedRelCommon(bmr, fork, strategy, flags,
+		first_block = ExtendBufferedRelCommon(bmr, flags,
 											  num_pages, extend_to,
 											  buffers, &extended_by);
 
@@ -994,7 +1030,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	if (buffer == InvalidBuffer)
 	{
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
+		buffer = ReadBufferBMR(bmr, extend_to - 1, mode);
 	}
 
 	return buffer;
@@ -1047,75 +1083,59 @@ ZeroBuffer(Buffer buffer, ReadBufferMode mode)
  * zero it.
  */
 static inline Buffer
-PinBufferForBlock(BufferManagerRelation bmr,
-				  ForkNumber forkNum,
+PinBufferForBlock(BufferManagerRelation *const bmr,
 				  BlockNumber blockNum,
-				  BufferAccessStrategy strategy,
 				  bool *foundPtr)
 {
 	BufferDesc *bufHdr;
 	bool		isLocalBuf;
-	IOContext	io_context;
-	IOObject	io_object;
 
 	Assert(blockNum != P_NEW);
+	Assert(bmr->smgr);
 
-	Assert(bmr.smgr);
-
-	isLocalBuf = bmr.relpersistence == RELPERSISTENCE_TEMP;
-	if (isLocalBuf)
-	{
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-	}
-	else
-	{
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-	}
-
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   bmr.smgr->smgr_rlocator.locator.spcOid,
-									   bmr.smgr->smgr_rlocator.locator.dbOid,
-									   bmr.smgr->smgr_rlocator.locator.relNumber,
-									   bmr.smgr->smgr_rlocator.backend);
+	TRACE_POSTGRESQL_BUFFER_READ_START(bmr->forkNum, blockNum,
+									   bmr->smgr->smgr_rlocator.locator.spcOid,
+									   bmr->smgr->smgr_rlocator.locator.dbOid,
+									   bmr->smgr->smgr_rlocator.locator.relNumber,
+									   bmr->smgr->smgr_rlocator.backend);
 
+	isLocalBuf = bmr->relpersistence == RELPERSISTENCE_TEMP;
 	if (isLocalBuf)
 	{
-		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		bufHdr = LocalBufferAlloc(bmr->smgr, bmr->forkNum, blockNum, foundPtr);
 		if (*foundPtr)
 			pgBufferUsage.local_blks_hit++;
 	}
 	else
 	{
-		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
-							 strategy, foundPtr, io_context);
+		bufHdr = BufferAlloc(bmr->smgr, bmr->relpersistence, bmr->forkNum, blockNum,
+							 bmr->strategy, foundPtr, bmr->io_context);
 		if (*foundPtr)
 			pgBufferUsage.shared_blks_hit++;
 	}
-	if (bmr.rel)
+	if (bmr->pgstat_info)
 	{
 		/*
 		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
 		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
 		 * zeroed instead), the per-relation stats always count them.
 		 */
-		pgstat_count_buffer_read(bmr.rel);
+		bmr->pgstat_info->counts.blocks_fetched++;
 		if (*foundPtr)
-			pgstat_count_buffer_hit(bmr.rel);
+			bmr->pgstat_info->counts.blocks_hit++;
 	}
 	if (*foundPtr)
 	{
 		VacuumPageHit++;
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		pgstat_count_io_op(bmr->io_object, bmr->io_context, IOOP_HIT);
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  bmr.smgr->smgr_rlocator.locator.spcOid,
-										  bmr.smgr->smgr_rlocator.locator.dbOid,
-										  bmr.smgr->smgr_rlocator.locator.relNumber,
-										  bmr.smgr->smgr_rlocator.backend,
+		TRACE_POSTGRESQL_BUFFER_READ_DONE(bmr->forkNum, blockNum,
+										  bmr->smgr->smgr_rlocator.locator.spcOid,
+										  bmr->smgr->smgr_rlocator.locator.dbOid,
+										  bmr->smgr->smgr_rlocator.locator.relNumber,
+										  bmr->smgr->smgr_rlocator.backend,
 										  true);
 	}
 
@@ -1125,14 +1145,12 @@ PinBufferForBlock(BufferManagerRelation bmr,
 /*
  * ReadBuffer_common -- common logic for all ReadBuffer variants
  */
-static inline Buffer
-ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
-				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy)
+Buffer
+ReadBufferBMR(BufferManagerRelation *bmr,
+			  BlockNumber blockNum, ReadBufferMode mode)
 {
-	ReadBuffersOperation operation;
 	Buffer		buffer;
-	int			flags;
+	bool		found;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1151,37 +1169,32 @@ ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, flags);
 	}
 
+	buffer = PinBufferForBlock(bmr, blockNum, &found);
 	if (unlikely(mode == RBM_ZERO_AND_CLEANUP_LOCK ||
 				 mode == RBM_ZERO_AND_LOCK))
 	{
-		bool		found;
-
-		if (bmr.smgr == NULL)
-		{
-			bmr.smgr = RelationGetSmgr(bmr.rel);
-			bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
-		}
-
-		buffer = PinBufferForBlock(bmr, forkNum, blockNum, strategy, &found);
 		ZeroBuffer(buffer, mode);
-		return buffer;
 	}
+	else if (!found)
+	{
+		ReadBuffersOperation operation;
+
+		if (mode == RBM_ZERO_ON_ERROR)
+			operation.flags = READ_BUFFERS_ZERO_ON_ERROR;
+		else
+			operation.flags = 0;
+
+		operation.bmr = bmr;
+		operation.buffers = &buffer;
+		operation.blocknum = blockNum;
+		operation.nblocks = 1;
+		operation.io_buffers_len = 1;
 
-	if (mode == RBM_ZERO_ON_ERROR)
-		flags = READ_BUFFERS_ZERO_ON_ERROR;
-	else
-		flags = 0;
-	operation.bmr = bmr;
-	operation.forknum = forkNum;
-	operation.strategy = strategy;
-	if (StartReadBuffer(&operation,
-						&buffer,
-						blockNum,
-						flags))
 		WaitReadBuffers(&operation);
+	}
 
 	return buffer;
 }
@@ -1223,33 +1236,26 @@ StartReadBuffer(ReadBuffersOperation *operation,
  * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
  * could be initiated here.
  */
-inline bool
+bool
 StartReadBuffers(ReadBuffersOperation *operation,
 				 Buffer *buffers,
 				 BlockNumber blockNum,
 				 int *nblocks,
 				 int flags)
 {
+	BufferManagerRelation *bmr = operation->bmr;
 	int			actual_nblocks = *nblocks;
 	int			io_buffers_len = 0;
 
 	Assert(*nblocks > 0);
 	Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
 
-	if (!operation->bmr.smgr)
-	{
-		operation->bmr.smgr = RelationGetSmgr(operation->bmr.rel);
-		operation->bmr.relpersistence = operation->bmr.rel->rd_rel->relpersistence;
-	}
-
 	for (int i = 0; i < actual_nblocks; ++i)
 	{
 		bool		found;
 
-		buffers[i] = PinBufferForBlock(operation->bmr,
-									   operation->forknum,
+		buffers[i] = PinBufferForBlock(bmr,
 									   blockNum + i,
-									   operation->strategy,
 									   &found);
 
 		if (found)
@@ -1294,8 +1300,8 @@ StartReadBuffers(ReadBuffersOperation *operation,
 			 * true asynchronous version we might choose to process only one
 			 * real I/O at a time in that case.
 			 */
-			smgrprefetch(operation->bmr.smgr,
-						 operation->forknum,
+			smgrprefetch(bmr->smgr,
+						 bmr->forkNum,
 						 blockNum,
 						 operation->io_buffers_len);
 		}
@@ -1328,10 +1334,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 	Buffer	   *buffers;
 	int			nblocks;
 	BlockNumber blocknum;
-	ForkNumber	forknum;
 	bool		isLocalBuf;
-	IOContext	io_context;
-	IOObject	io_object;
 
 	/*
 	 * Currently operations are only allowed to include a read of some range,
@@ -1348,19 +1351,6 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 
 	buffers = &operation->buffers[0];
 	blocknum = operation->blocknum;
-	forknum = operation->forknum;
-
-	isLocalBuf = operation->bmr.relpersistence == RELPERSISTENCE_TEMP;
-	if (isLocalBuf)
-	{
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-	}
-	else
-	{
-		io_context = IOContextForStrategy(operation->strategy);
-		io_object = IOOBJECT_RELATION;
-	}
 
 	/*
 	 * We count all these blocks as read by this backend.  This is traditional
@@ -1370,6 +1360,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 	 * count this as a "hit", and we don't have a separate counter for "miss,
 	 * but another backend completed the read".
 	 */
+	isLocalBuf = operation->bmr->relpersistence == RELPERSISTENCE_TEMP;
 	if (isLocalBuf)
 		pgBufferUsage.local_blks_read += nblocks;
 	else
@@ -1395,11 +1386,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 			 * Report this as a 'hit' for this backend, even though it must
 			 * have started out as a miss in PinBufferForBlock().
 			 */
-			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
-											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
-											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
-											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
-											  operation->bmr.smgr->smgr_rlocator.backend,
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(operation->bmr->forkNum, blocknum + i,
+											  operation->bmr->smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr->smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr->smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr->smgr->smgr_rlocator.backend,
 											  true);
 			continue;
 		}
@@ -1429,9 +1420,9 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 		}
 
 		io_start = pgstat_prepare_io_time(track_io_timing);
-		smgrreadv(operation->bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
-		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
-								io_buffers_len);
+		smgrreadv(operation->bmr->smgr, operation->bmr->forkNum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(operation->bmr->io_object, operation->bmr->io_context,
+								IOOP_READ, io_start, io_buffers_len);
 
 		/* Verify each block we read, and terminate the I/O. */
 		for (int j = 0; j < io_buffers_len; ++j)
@@ -1460,7 +1451,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 							(errcode(ERRCODE_DATA_CORRUPTED),
 							 errmsg("invalid page in block %u of relation %s; zeroing out page",
 									io_first_block + j,
-									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+									relpath(operation->bmr->smgr->smgr_rlocator, operation->bmr->forkNum))));
 					memset(bufBlock, 0, BLCKSZ);
 				}
 				else
@@ -1468,7 +1459,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 							(errcode(ERRCODE_DATA_CORRUPTED),
 							 errmsg("invalid page in block %u of relation %s",
 									io_first_block + j,
-									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+									relpath(operation->bmr->smgr->smgr_rlocator, operation->bmr->forkNum))));
 			}
 
 			/* Terminate I/O and set BM_VALID. */
@@ -1486,11 +1477,11 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 			}
 
 			/* Report I/Os as completing individually. */
-			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
-											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
-											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
-											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
-											  operation->bmr.smgr->smgr_rlocator.backend,
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(operation->bmr->forkNum, io_first_block + j,
+											  operation->bmr->smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr->smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr->smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr->smgr->smgr_rlocator.backend,
 											  false);
 		}
 
@@ -2061,9 +2052,7 @@ LimitAdditionalPins(uint32 *additional_pins)
  * avoid duplicating the tracing and relpersistence related logic.
  */
 static BlockNumber
-ExtendBufferedRelCommon(BufferManagerRelation bmr,
-						ForkNumber fork,
-						BufferAccessStrategy strategy,
+ExtendBufferedRelCommon(BufferManagerRelation *bmr,
 						uint32 flags,
 						uint32 extend_by,
 						BlockNumber extend_upto,
@@ -2072,28 +2061,28 @@ ExtendBufferedRelCommon(BufferManagerRelation bmr,
 {
 	BlockNumber first_block;
 
-	TRACE_POSTGRESQL_BUFFER_EXTEND_START(fork,
-										 bmr.smgr->smgr_rlocator.locator.spcOid,
-										 bmr.smgr->smgr_rlocator.locator.dbOid,
-										 bmr.smgr->smgr_rlocator.locator.relNumber,
-										 bmr.smgr->smgr_rlocator.backend,
+	TRACE_POSTGRESQL_BUFFER_EXTEND_START(bmr->forkNum,
+										 bmr->smgr->smgr_rlocator.locator.spcOid,
+										 bmr->smgr->smgr_rlocator.locator.dbOid,
+										 bmr->smgr->smgr_rlocator.locator.relNumber,
+										 bmr->smgr->smgr_rlocator.backend,
 										 extend_by);
 
-	if (bmr.relpersistence == RELPERSISTENCE_TEMP)
-		first_block = ExtendBufferedRelLocal(bmr, fork, flags,
+	if (bmr->relpersistence == RELPERSISTENCE_TEMP)
+		first_block = ExtendBufferedRelLocal(bmr, flags,
 											 extend_by, extend_upto,
 											 buffers, &extend_by);
 	else
-		first_block = ExtendBufferedRelShared(bmr, fork, strategy, flags,
+		first_block = ExtendBufferedRelShared(bmr, flags,
 											  extend_by, extend_upto,
 											  buffers, &extend_by);
 	*extended_by = extend_by;
 
-	TRACE_POSTGRESQL_BUFFER_EXTEND_DONE(fork,
-										bmr.smgr->smgr_rlocator.locator.spcOid,
-										bmr.smgr->smgr_rlocator.locator.dbOid,
-										bmr.smgr->smgr_rlocator.locator.relNumber,
-										bmr.smgr->smgr_rlocator.backend,
+	TRACE_POSTGRESQL_BUFFER_EXTEND_DONE(bmr->forkNum,
+										bmr->smgr->smgr_rlocator.locator.spcOid,
+										bmr->smgr->smgr_rlocator.locator.dbOid,
+										bmr->smgr->smgr_rlocator.locator.relNumber,
+										bmr->smgr->smgr_rlocator.backend,
 										*extended_by,
 										first_block);
 
@@ -2105,9 +2094,7 @@ ExtendBufferedRelCommon(BufferManagerRelation bmr,
  * shared buffers.
  */
 static BlockNumber
-ExtendBufferedRelShared(BufferManagerRelation bmr,
-						ForkNumber fork,
-						BufferAccessStrategy strategy,
+ExtendBufferedRelShared(BufferManagerRelation *bmr,
 						uint32 flags,
 						uint32 extend_by,
 						BlockNumber extend_upto,
@@ -2115,7 +2102,6 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 						uint32 *extended_by)
 {
 	BlockNumber first_block;
-	IOContext	io_context = IOContextForStrategy(strategy);
 	instr_time	io_start;
 
 	LimitAdditionalPins(&extend_by);
@@ -2134,7 +2120,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 	{
 		Block		buf_block;
 
-		buffers[i] = GetVictimBuffer(strategy, io_context);
+		buffers[i] = GetVictimBuffer(bmr->strategy, bmr->io_context);
 		buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
 
 		/* new buffers are zero-filled */
@@ -2152,16 +2138,16 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 	 * we get the lock.
 	 */
 	if (!(flags & EB_SKIP_EXTENSION_LOCK))
-		LockRelationForExtension(bmr.rel, ExclusiveLock);
+		LockRelationForExtension(bmr->rel, ExclusiveLock);
 
 	/*
 	 * If requested, invalidate size cache, so that smgrnblocks asks the
 	 * kernel.
 	 */
 	if (flags & EB_CLEAR_SIZE_CACHE)
-		bmr.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;
+		bmr->smgr->smgr_cached_nblocks[bmr->forkNum] = InvalidBlockNumber;
 
-	first_block = smgrnblocks(bmr.smgr, fork);
+	first_block = smgrnblocks(bmr->smgr, bmr->forkNum);
 
 	/*
 	 * Now that we have the accurate relation size, check if the caller wants
@@ -2193,7 +2179,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 		if (extend_by == 0)
 		{
 			if (!(flags & EB_SKIP_EXTENSION_LOCK))
-				UnlockRelationForExtension(bmr.rel, ExclusiveLock);
+				UnlockRelationForExtension(bmr->rel, ExclusiveLock);
 			*extended_by = extend_by;
 			return first_block;
 		}
@@ -2204,7 +2190,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 				 errmsg("cannot extend relation %s beyond %u blocks",
-						relpath(bmr.smgr->smgr_rlocator, fork),
+						relpath(bmr->smgr->smgr_rlocator, bmr->forkNum),
 						MaxBlockNumber)));
 
 	/*
@@ -2226,7 +2212,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 		ResourceOwnerEnlarge(CurrentResourceOwner);
 		ReservePrivateRefCountEntry();
 
-		InitBufferTag(&tag, &bmr.smgr->smgr_rlocator.locator, fork, first_block + i);
+		InitBufferTag(&tag, &bmr->smgr->smgr_rlocator.locator, bmr->forkNum, first_block + i);
 		hash = BufTableHashCode(&tag);
 		partition_lock = BufMappingPartitionLock(hash);
 
@@ -2258,7 +2244,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			 * Pin the existing buffer before releasing the partition lock,
 			 * preventing it from being evicted.
 			 */
-			valid = PinBuffer(existing_hdr, strategy);
+			valid = PinBuffer(existing_hdr, bmr->strategy);
 
 			LWLockRelease(partition_lock);
 
@@ -2275,7 +2261,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			if (valid && !PageIsNew((Page) buf_block))
 				ereport(ERROR,
 						(errmsg("unexpected data beyond EOF in block %u of relation %s",
-								existing_hdr->tag.blockNum, relpath(bmr.smgr->smgr_rlocator, fork)),
+								existing_hdr->tag.blockNum, relpath(bmr->smgr->smgr_rlocator, bmr->forkNum)),
 						 errhint("This has been seen to occur with buggy kernels; consider updating your system.")));
 
 			/*
@@ -2309,7 +2295,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			victim_buf_hdr->tag = tag;
 
 			buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-			if (bmr.relpersistence == RELPERSISTENCE_PERMANENT || fork == INIT_FORKNUM)
+			if (bmr->relpersistence == RELPERSISTENCE_PERMANENT || bmr->forkNum == INIT_FORKNUM)
 				buf_state |= BM_PERMANENT;
 
 			UnlockBufHdr(victim_buf_hdr, buf_state);
@@ -2333,7 +2319,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 	 *
 	 * We don't need to set checksum for all-zero pages.
 	 */
-	smgrzeroextend(bmr.smgr, fork, first_block, extend_by, false);
+	smgrzeroextend(bmr->smgr, bmr->forkNum, first_block, extend_by, false);
 
 	/*
 	 * Release the file-extension lock; it's now OK for someone else to extend
@@ -2343,9 +2329,9 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 	 * take noticeable time.
 	 */
 	if (!(flags & EB_SKIP_EXTENSION_LOCK))
-		UnlockRelationForExtension(bmr.rel, ExclusiveLock);
+		UnlockRelationForExtension(bmr->rel, ExclusiveLock);
 
-	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context, IOOP_EXTEND,
+	pgstat_count_io_op_time(IOOBJECT_RELATION, bmr->io_context, IOOP_EXTEND,
 							io_start, extend_by);
 
 	/* Set BM_VALID, terminate IO, and wake up any waiters */
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 985a2c7049c..414109c69fd 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -310,8 +310,7 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
  * temporary buffers.
  */
 BlockNumber
-ExtendBufferedRelLocal(BufferManagerRelation bmr,
-					   ForkNumber fork,
+ExtendBufferedRelLocal(BufferManagerRelation *bmr,
 					   uint32 flags,
 					   uint32 extend_by,
 					   BlockNumber extend_upto,
@@ -340,7 +339,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
 		MemSet((char *) buf_block, 0, BLCKSZ);
 	}
 
-	first_block = smgrnblocks(bmr.smgr, fork);
+	first_block = smgrnblocks(bmr->smgr, bmr->forkNum);
 
 	if (extend_upto != InvalidBlockNumber)
 	{
@@ -359,7 +358,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
 		ereport(ERROR,
 				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 				 errmsg("cannot extend relation %s beyond %u blocks",
-						relpath(bmr.smgr->smgr_rlocator, fork),
+						relpath(bmr->smgr->smgr_rlocator, bmr->forkNum),
 						MaxBlockNumber)));
 
 	for (uint32 i = 0; i < extend_by; i++)
@@ -376,7 +375,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
 		/* in case we need to pin an existing buffer below */
 		ResourceOwnerEnlarge(CurrentResourceOwner);
 
-		InitBufferTag(&tag, &bmr.smgr->smgr_rlocator.locator, fork, first_block + i);
+		InitBufferTag(&tag, &bmr->smgr->smgr_rlocator.locator, bmr->forkNum, first_block + i);
 
 		hresult = (LocalBufferLookupEnt *)
 			hash_search(LocalBufHash, (void *) &tag, HASH_ENTER, &found);
@@ -416,7 +415,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
 	io_start = pgstat_prepare_io_time(track_io_timing);
 
 	/* actually extend relation */
-	smgrzeroextend(bmr.smgr, fork, first_block, extend_by, false);
+	smgrzeroextend(bmr->smgr, bmr->forkNum, first_block, extend_by, false);
 
 	pgstat_count_io_op_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EXTEND,
 							io_start, extend_by);
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index bcdb1821938..5dd584217ac 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -605,7 +605,11 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
 static Buffer
 fsm_extend(Relation rel, BlockNumber fsm_nblocks)
 {
-	return ExtendBufferedRelTo(BMR_REL(rel), FSM_FORKNUM, NULL,
+	BufferManagerRelation bmr;
+
+	InitBMRForRel(&bmr, rel, FSM_FORKNUM, NULL);
+
+	return ExtendBufferedRelTo(&bmr,
 							   EB_CREATE_FORK_IF_NEEDED |
 							   EB_CLEAR_SIZE_CACHE,
 							   fsm_nblocks,
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index f190e6e5e46..88c91c4867c 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -451,8 +451,7 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
 												BlockNumber blockNum);
 extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
 									BlockNumber blockNum, bool *foundPtr);
-extern BlockNumber ExtendBufferedRelLocal(BufferManagerRelation bmr,
-										  ForkNumber fork,
+extern BlockNumber ExtendBufferedRelLocal(BufferManagerRelation *bmr,
 										  uint32 flags,
 										  uint32 extend_by,
 										  BlockNumber extend_upto,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 241f68c45e1..1c7748019db 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "pgstat.h"
 #include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
@@ -92,22 +93,6 @@ typedef enum ExtendBufferedFlags
 	EB_LOCK_TARGET = (1 << 5),
 }			ExtendBufferedFlags;
 
-/*
- * Some functions identify relations either by relation or smgr +
- * relpersistence.  Used via the BMR_REL()/BMR_SMGR() macros below.  This
- * allows us to use the same function for both recovery and normal operation.
- */
-typedef struct BufferManagerRelation
-{
-	Relation	rel;
-	struct SMgrRelationData *smgr;
-	char		relpersistence;
-} BufferManagerRelation;
-
-#define BMR_REL(p_rel) ((BufferManagerRelation){.rel = p_rel})
-#define BMR_SMGR(p_smgr, p_relpersistence) ((BufferManagerRelation){.smgr = p_smgr, .relpersistence = p_relpersistence})
-
-
 /* forward declared, to avoid having to expose buf_internals.h here */
 struct WritebackContext;
 
@@ -163,9 +148,36 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * BufferManagerRelation encapsulates what the buffer manager functions like
+ * ReadBuffer() need to know about a relation.
+ *
+ * Initialize with InitBMRForRel() if you have access to the relcache, or
+ * InitBMRForSMgr() for raw access to the underlying smgr. This allows us to
+ * use the same functions for both recovery and normal operation.
+ */
+typedef struct BufferManagerRelation
+{
+	Relation	rel;
+	struct SMgrRelationData *smgr;
+	char		relpersistence;
+	ForkNumber	forkNum;
+	/* admittedly 'strategy' isn't really a property of the relation */
+	BufferAccessStrategy strategy;
+
+	/* private data initialized by InitBMR functions */
+	struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+	IOContext	io_context;
+	IOObject	io_object;
+} BufferManagerRelation;
+
+extern void InitBMRForRel(BufferManagerRelation *bmr, Relation rel, ForkNumber forkNum, BufferAccessStrategy strategy);
+extern void InitBMRForSMgr(BufferManagerRelation *bmr, struct SMgrRelationData *srel, char relpersistence, ForkNumber forkNum, BufferAccessStrategy strategy);
+
 /*
  * prototypes for functions in bufmgr.c
  */
+
 extern PrefetchBufferResult PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln,
 												 ForkNumber forkNum,
 												 BlockNumber blockNum);
@@ -173,6 +185,13 @@ extern PrefetchBufferResult PrefetchBuffer(Relation reln, ForkNumber forkNum,
 										   BlockNumber blockNum);
 extern bool ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum,
 							 BlockNumber blockNum, Buffer recent_buffer);
+
+extern Buffer ReadBufferBMR(BufferManagerRelation *bmr, BlockNumber blockNum, ReadBufferMode mode);
+
+/*
+ * compatibility and convenience wrappers which call InitBMRForRel or
+ * InitBMRForSmgr, followed by ReadBufferBMR.
+ */
 extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
 extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 								 BlockNumber blockNum, ReadBufferMode mode,
@@ -188,9 +207,7 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 struct ReadBuffersOperation
 {
 	/* The following members should be set by the caller. */
-	BufferManagerRelation bmr;
-	ForkNumber	forknum;
-	BufferAccessStrategy strategy;
+	BufferManagerRelation *bmr;
 
 	/* The following private members should not be accessed directly. */
 	Buffer	   *buffers;
@@ -223,20 +240,14 @@ extern void CheckBufferIsPinnedOnce(Buffer buffer);
 extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
 								   BlockNumber blockNum);
 
-extern Buffer ExtendBufferedRel(BufferManagerRelation bmr,
-								ForkNumber forkNum,
-								BufferAccessStrategy strategy,
+extern Buffer ExtendBufferedRel(BufferManagerRelation *bmr,
 								uint32 flags);
-extern BlockNumber ExtendBufferedRelBy(BufferManagerRelation bmr,
-									   ForkNumber fork,
-									   BufferAccessStrategy strategy,
+extern BlockNumber ExtendBufferedRelBy(BufferManagerRelation *bmr,
 									   uint32 flags,
 									   uint32 extend_by,
 									   Buffer *buffers,
 									   uint32 *extended_by);
-extern Buffer ExtendBufferedRelTo(BufferManagerRelation bmr,
-								  ForkNumber fork,
-								  BufferAccessStrategy strategy,
+extern Buffer ExtendBufferedRelTo(BufferManagerRelation *bmr,
 								  uint32 flags,
 								  BlockNumber extend_to,
 								  ReadBufferMode mode);
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index f5dbc087b0b..db962699c15 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -50,9 +50,7 @@ typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
 												void *per_buffer_data);
 
 extern ReadStream *read_stream_begin_relation(int flags,
-											  BufferAccessStrategy strategy,
-											  BufferManagerRelation bmr,
-											  ForkNumber forknum,
+											  BufferManagerRelation *bmr,
 											  ReadStreamBlockNumberCB callback,
 											  void *callback_private_data,
 											  size_t per_buffer_data_size);
-- 
2.39.2

#52

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Heikki Linnakangas (#51)

Re: Streaming I/O, vectored I/O (WIP)

On Fri, Mar 29, 2024 at 9:45 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

master (213c959a29): 8.0 s
streaming-api v13: 9.5 s

Hmm, that's not great, and I think I know one factor that has
confounded my investigation and the conflicting reports I have
received from a couple of people: some are using meson, which is
defaulting to -O3 by default, and others are using make which gives
you -O2 by default, but at -O2, GCC doesn't inline that
StartReadBuffer specialisation that is used in the "fast path", and
possibly more. Some of that gap is closed by using
pg_attribute_inline_always. Clang fails to inline at any level. So I
should probably use the "always" macro there because that is the
intention. Still processing the rest of your email...

#53

Heikki Linnakangas

hlinnaka@iki.fi

almost 2 years ago

In reply to: Thomas Munro (#52)

Re: Streaming I/O, vectored I/O (WIP)

On 29/03/2024 09:01, Thomas Munro wrote:

On Fri, Mar 29, 2024 at 9:45 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

master (213c959a29): 8.0 s
streaming-api v13: 9.5 s

Hmm, that's not great, and I think I know one factor that has
confounded my investigation and the conflicting reports I have
received from a couple of people: some are using meson, which is
defaulting to -O3 by default, and others are using make which gives
you -O2 by default, but at -O2, GCC doesn't inline that
StartReadBuffer specialisation that is used in the "fast path", and
possibly more. Some of that gap is closed by using
pg_attribute_inline_always. Clang fails to inline at any level. So I
should probably use the "always" macro there because that is the
intention. Still processing the rest of your email...

Ah yeah, I also noticed that the inlining didn't happen with some
compilers and flags. I use a mix of gcc and clang and meson and autoconf
in my local environment.

The above micro-benchmarks were with meson and gcc -O3. GCC version:

$ gcc --version
gcc (Debian 12.2.0-14) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

--
Heikki Linnakangas
Neon (https://neon.tech)

#54

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Heikki Linnakangas (#53)

6 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

1. I tried out Tomas's suggestion ALTER TABLESPACE ts SET
(io_combine_limit = ...). I like it, it's simple and works nicely.
Unfortunately we don't have support for units like '128kB' in
reloptions.c, so for now it requires a number of blocks. That's not
great, so we should probably fix that before merging it, so I'm
leaving that patch (v14-0004) separate, hopefully for later.

2. I also tried Tomas's suggestion of inventing a way to tell
PostgreSQL what the OS readahead window size is. That allows for
better modelling of whether the kernel is going to consider an access
pattern to be sequential. Again, the ALTER TABLESPACE version would
probably need unit support. This is initially just for
experimentation as it came up in discussions of BHS behaviour. I was
against this idea originally as it seemed like more complexity to have
to explain and tune and I'm not sure if it's worth the trouble... but
in fact we are already in the business of second guessing kernel
logic, so there doesn't seem to be any good reason not to do it a bit
better if we can. Do you think I have correctly understood what Linux
is doing? The name I came up with is: effective_io_readahead_window.
I wanted to make clear that it's a property of the system we are
telling it about, not to be confused with our own look-ahead concept
or whatever. Better names very welcome. This is also in a separate
patch (v14-0005), left for later.

3. Another question I wondered about while retesting: does this need
to be so low? I don't think so, so I've added a patch for that.

src/include/port/pg_iovec.h:#define PG_IOV_MAX Min(IOV_MAX, 32)

Now that I'm not using an array full of arrays of that size, I don't
care so much how big we make that 32 (= 256kB @ 8kB), which clamps
io_combine_limit. I think 128 (= 1MB @ 8kB) might be a decent
arbitrary number. Sometimes we use it to size stack arrays, so I
don't want to make it insanely large, but 128 should be fine. I think
it would be good to be able to at least experiment with up to 1MB (I'm
not saying it's a good idea to do it, who knows?, just that there
isn't a technical reason why not to allow it AFAIK). FWIW every
system on our target list that has p{read,write}v has IOV_MAX == 1024
(I checked {Free,Net,Open}BSD, macOS, illumos and Linux), so the
Min(IOV_MAX, ...) really only clamps the systems where
pg_{read,write}v fall back to loop-based emulation (Windows, Solaris)
which is fine.

PG_IOV_MAX also affects the routine that initialises new WAL files. I
don't currently see a downside to doing that in 1MB chunks, as there
was nothing sacred about the previous arbitrary number and the code
deals with short writes by retrying as it should.

4. I agree with Heikki's complaints about the BMR interface. It
should be made more consistent and faster. I didn't want to make all
of those changes touching AMs etc a dependency though, so I spent some
time trying to squeeze out regressions using some of these clues about
calling conventions, likely hints, memory access and batching. I'm
totally open to later improvements and refactoring of that stuff
later!

Attached is the version with the best results I've managed to get. My
test is GCC -O3, pg_prewarm of a table of 220_000_000 integers =
7.6GB, which sometimes comes out around the same ~250ms on master and
streaming pg_prewarm v14 on a random cloud ARM box I'm testing with,
but not always, sometimes it's ~5-7ms more. (Unfortunately I don't
have access to good benchmarking equipment right now, better numbers
welcome.) Two new ideas:

* give fast path mode a single flag, instead of testing all the
conditions for every block
* give fast path mode a little buffer of future block numbers, so it
can call the callback in batches

I'd tried that batch-calling thing before, and results were
inconclusive, but I think sometimes it helps a bit. Note that it
replaces the 'unget' thing from before and it is possibly a tiny bit
nicer anyway.

I'm a bit stumped about how to improve this further -- if anyone has
any ideas for further improvements I'm all ears.

Zooming back out of micro-benchmark mode, it must be pretty hard to
see in a real workload that actually does something with the buffers,
like a sequential scan. Earlier complaints about all-cached
sequential scan regressions were resolved many versions ago AFAIK by
minimising pin count in that case. I just tried Melanie's streaming
sequential scan patch, with a simple SELECT COUNT(*) WHERE i = -1,
with the same all-cached table of 220 million integers. Patched
consistently comes out ahead for all-in-kernel-cache none-in-PG-cache:
~14.7-> ~14.4, and all-in-PG-cache ~13.5s -> ~13.3s (which I don't
have an explanation for). I don't claim any of that is particularly
scientific, I just wanted to note that single digit numbers of
milliseconds of regression while pinning a million pages is clearly
lost in the noise of other effects once you add in real query
execution. That's half a dozen nanoseconds per page if I counted
right.

So, I am finally starting to think we should commit this, and decide
which user patches are candidates.

Attachments:

v14-0001-Provide-vectored-variant-of-ReadBuffer.patchtext/x-patch; charset=US-ASCII; name=v14-0001-Provide-vectored-variant-of-ReadBuffer.patchDownload

From 2af81f4eaabfafc6cc7cbb6b61fd569ca53a1a1c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v14 1/7] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Callers should respect the new GUC io_combine_limit, and the limit on
per-backend pins which is now exposed as a public interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of advice.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de> (optimization tweaks)
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  14 +
 src/backend/storage/buffer/bufmgr.c           | 726 ++++++++++++------
 src/backend/storage/buffer/localbuf.c         |  14 +-
 src/backend/utils/misc/guc_tables.c           |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h                  |  41 +-
 src/tools/pgindent/typedefs.list              |   1 +
 7 files changed, 579 insertions(+), 232 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f65c17e5ae4..241a6079688 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2720,6 +2720,20 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-io-combine-limit" xreflabel="io_combine_limit">
+       <term><varname>io_combine_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_combine_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the largest I/O size in operations that combine I/O.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
        <term><varname>max_worker_processes</varname> (<type>integer</type>)
        <indexterm>
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..70b19238b78 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -152,6 +157,13 @@ int			effective_io_concurrency = DEFAULT_EFFECTIVE_IO_CONCURRENCY;
  */
 int			maintenance_io_concurrency = DEFAULT_MAINTENANCE_IO_CONCURRENCY;
 
+/*
+ * Limit on how many blocks should be handled in single I/O operations.
+ * StartReadBuffers() callers should respect it, as should other operations
+ * that call smgr APIs directly.
+ */
+int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
+
 /*
  * GUC variables about triggering kernel writeback for buffers written; OS
  * dependent defaults are set via the GUC mechanism.
@@ -471,10 +483,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation *bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,18 +511,18 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
 static void shared_buffer_write_error_callback(void *arg);
 static void local_buffer_write_error_callback(void *arg);
-static BufferDesc *BufferAlloc(SMgrRelation smgr,
-							   char relpersistence,
-							   ForkNumber forkNum,
-							   BlockNumber blockNum,
-							   BufferAccessStrategy strategy,
-							   bool *foundPtr, IOContext io_context);
+static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
+									  char relpersistence,
+									  ForkNumber forkNum,
+									  BlockNumber blockNum,
+									  BufferAccessStrategy strategy,
+									  bool *foundPtr, IOContext io_context);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
@@ -781,8 +792,8 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
+	BufferManagerRelation bmr;
 
 	/*
 	 * Reject attempts to read non-local temporary relations; we would be
@@ -794,15 +805,12 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	bmr.rel = reln;
+	bmr.smgr = RelationGetSmgr(reln);
+	bmr.relpersistence = reln->rd_rel->relpersistence;
+	buf = ReadBuffer_common(&bmr,
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +830,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(&BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									   RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -875,7 +882,7 @@ ExtendBufferedRelBy(BufferManagerRelation bmr,
 					Buffer *buffers,
 					uint32 *extended_by)
 {
-	Assert((bmr.rel != NULL) != (bmr.smgr != NULL));
+	/* Assert((bmr.rel != NULL) != (bmr.smgr != NULL)); */
 	Assert(bmr.smgr == NULL || bmr.relpersistence != 0);
 	Assert(extend_by > 0);
 
@@ -911,7 +918,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	Buffer		buffer = InvalidBuffer;
 	Buffer		buffers[64];
 
-	Assert((bmr.rel != NULL) != (bmr.smgr != NULL));
+	/* Assert((bmr.rel != NULL) != (bmr.smgr != NULL)); */
 	Assert(bmr.smgr == NULL || bmr.relpersistence != 0);
 	Assert(extend_to != InvalidBlockNumber && extend_to > 0);
 
@@ -994,35 +1001,149 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(&bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
 /*
- * ReadBuffer_common -- common logic for all ReadBuffer variants
- *
- * *hit is set to true if the request was satisfied from shared buffer cache.
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
  */
-static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
-				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
 {
 	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
+/*
+ * Pin a buffer for a given block.  *foundPtr is set to true if the block was
+ * already present, or false if more work is required to either read it in or
+ * zero it.
+ */
+static inline Buffer
+PinBufferForBlock(BufferManagerRelation *bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
 	IOContext	io_context;
 	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
 
-	*hit = false;
+	Assert(blockNum != P_NEW);
+
+	Assert(bmr->smgr);
+
+	isLocalBuf = bmr->relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr->smgr->smgr_rlocator.locator.spcOid,
+									   bmr->smgr->smgr_rlocator.locator.dbOid,
+									   bmr->smgr->smgr_rlocator.locator.relNumber,
+									   bmr->smgr->smgr_rlocator.backend);
+
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr->smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr->smgr, bmr->relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr->rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr->rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr->rel);
+	}
+	if (*foundPtr)
+	{
+		VacuumPageHit++;
+		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageHit;
+
+		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
+										  bmr->smgr->smgr_rlocator.locator.spcOid,
+										  bmr->smgr->smgr_rlocator.locator.dbOid,
+										  bmr->smgr->smgr_rlocator.locator.relNumber,
+										  bmr->smgr->smgr_rlocator.backend,
+										  true);
+	}
+
+	return BufferDescriptorGetBuffer(bufHdr);
+}
+
+/*
+ * ReadBuffer_common -- common logic for all ReadBuffer variants
+ */
+static pg_attribute_always_inline Buffer
+ReadBuffer_common(BufferManagerRelation *bmr, ForkNumber forkNum,
+				  BlockNumber blockNum, ReadBufferMode mode,
+				  BufferAccessStrategy strategy)
+{
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			flags;
+
+	/* Caller must make sure smgr is initialized. */
+	Assert(bmr->smgr != NULL);
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1041,181 +1162,353 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(*bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
-
-	if (isLocalBuf)
+	if (unlikely(mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+				 mode == RBM_ZERO_AND_LOCK))
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
+		bool		found;
+
+#if 0
+		if (bmr->smgr == NULL)
+		{
+			bmr->smgr = RelationGetSmgr(bmr->rel);
+			bmr->relpersistence = bmr->rel->rd_rel->relpersistence;
+		}
+#endif
+
+		buffer = PinBufferForBlock(bmr, forkNum, blockNum, strategy, &found);
+		ZeroBuffer(buffer, mode);
+		return buffer;
 	}
+
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
 	else
-	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
-	}
+		flags = 0;
+	operation.bmr = *bmr;
+	operation.forknum = forkNum;
+	operation.strategy = strategy;
+	if (StartReadBuffer(&operation,
+						&buffer,
+						blockNum,
+						flags))
+		WaitReadBuffers(&operation);
+
+	return buffer;
+}
+
+/*
+ * Single block version of the StartReadBuffers().  This might save a few
+ * instructions when called from another translation unit, if the compiler
+ * inlines the code and specializes for nblocks == 1.
+ */
+bool
+StartReadBuffer(ReadBuffersOperation *operation,
+				Buffer *buffer,
+				BlockNumber blocknum,
+				int flags)
+{
+	int			nblocks = 1;
+	bool		result;
+
+	result = StartReadBuffers(operation, buffer, blocknum, &nblocks, flags);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	return result;
+}
+
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.  Caller sets some of the
+ * members of operation; see struct definition.
+ *
+ * If false is returned, no I/O is necessary.  If true is returned, one I/O
+ * has been started, and WaitReadBuffers() must be called with the same
+ * operation object before the buffers are accessed.  Along with the operation
+ * object, the caller-supplied array of buffers must remain valid until
+ * WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice if
+ * requested by the caller with READ_BUFFERS_ISSUE_ADVICE, and the real I/O
+ * happens synchronously in WaitReadBuffers().  In future work, true I/O could
+ * be initiated here.
+ *
+ * Make sure that StartReadBuffer() inlines this function with an attribute.
+ */
+pg_attribute_always_inline bool
+StartReadBuffers(ReadBuffersOperation *operation,
+				 Buffer *buffers,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 int flags)
+{
+	int			actual_nblocks = *nblocks;
+	int			io_buffers_len = 0;
 
-	/* At this point we do NOT hold any locks. */
+	Assert(*nblocks > 0);
+	Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	for (int i = 0; i < actual_nblocks; ++i)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
-		VacuumPageHit++;
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		bool		found;
 
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
+		buffers[i] = PinBufferForBlock(&operation->bmr,
+									   operation->forknum,
+									   blockNum + i,
+									   operation->strategy,
+									   &found);
 
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = i + 1;
+			break;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			io_buffers_len++;
+		}
+	}
+	*nblocks = actual_nblocks;
+
+	if (likely(io_buffers_len == 0))
+		return false;
+
+	/* Populate information needed for I/O. */
+	operation->buffers = buffers;
+	operation->blocknum = blockNum;
+	operation->flags = flags;
+	operation->nblocks = actual_nblocks;
+	operation->io_buffers_len = io_buffers_len;
 
+	if (flags & READ_BUFFERS_ISSUE_ADVICE)
+	{
 		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
+		 * In theory we should only do this if PinBufferForBlock() had to
+		 * allocate new buffers above.  That way, if two calls to
+		 * StartReadBuffers() were made for the same blocks before
+		 * WaitReadBuffers(), only the first would issue the advice. That'd be
+		 * a better simulation of true asynchronous I/O, which would only
+		 * start the I/O once, but isn't done here for simplicity.  Note also
+		 * that the following call might actually issue two advice calls if we
+		 * cross a segment boundary; in a true asynchronous version we might
+		 * choose to process only one real I/O at a time in that case.
 		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+		smgrprefetch(operation->bmr.smgr,
+					 operation->forknum,
+					 blockNum,
+					 operation->io_buffers_len);
+	}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+	/* Indicate that WaitReadBuffers() should be called. */
+	return true;
+}
+
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
 	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
 	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+
+	isLocalBuf = operation->bmr.relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_IO_COMBINE_LIMIT];
+		void	   *io_pages[MAX_IO_COMBINE_LIMIT];
+		instr_time	io_start;
+		BlockNumber io_first_block;
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PinBufferForBlock().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(operation->bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for PinBufferForBlock.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1516,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1235,7 +1524,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * No locks are held either at entry or exit.
  */
-static BufferDesc *
+static inline BufferDesc *
 BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			BlockNumber blockNum,
 			BufferAccessStrategy strategy,
@@ -1286,19 +1575,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1643,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1678,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +2034,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2299,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2322,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2637,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2651,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3719,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5454,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5475,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a57..985a2c7049c 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 92fcd5fa4d5..c12784cbec8 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3129,6 +3129,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"io_combine_limit",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Limit on the size of data reads and writes."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&io_combine_limit,
+		DEFAULT_IO_COMBINE_LIMIT,
+		1, MAX_IO_COMBINE_LIMIT,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
 			gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index adcc0257f91..baecde28410 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -203,6 +203,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..241f68c45e1 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int io_combine_limit;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,38 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+struct ReadBuffersOperation
+{
+	/* The following members should be set by the caller. */
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+	BufferAccessStrategy strategy;
+
+	/* The following private members should not be accessed directly. */
+	Buffer	   *buffers;
+	BlockNumber blocknum;
+	int			flags;
+	int16		nblocks;
+	int16		io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffer(ReadBuffersOperation *operation,
+							Buffer *buffer,
+							BlockNumber blocknum,
+							int flags);
+extern bool StartReadBuffers(ReadBuffersOperation *operation,
+							 Buffer *buffers,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 int flags);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +286,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a8d7bed411f..826874135e5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2287,6 +2287,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.44.0

v14-0002-Provide-API-for-streaming-relation-data.patchtext/x-patch; charset=US-ASCII; name=v14-0002-Provide-API-for-streaming-relation-data.patchDownload

From 6f610f971fa66eb7eeb1018257b226dfea581822 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Feb 2024 00:01:42 +1300
Subject: [PATCH v14 2/7] Provide API for streaming relation data.

Introduce an abstraction where relation data can be accessed as a
stream of buffers, with an implementation that is more efficient than
the equivalent sequence of ReadBuffer() calls.

Client code supplies a callback that can say which block number is
wanted next, and then consumes individual buffers one at a time from the
stream.  This division allows read_stream.c to build up large calls to
StartReadBuffers() up to io_combine_limit, and issue posix_fadvise()
advice ahead of time in a systematic way when random access is detected.

This API is based on an idea from Andres Freund to pave the way for
asynchronous I/O in future work as required to support direct I/O.  The
goal is to have an abstraction that insulates client code from future
changes to the I/O subsystem that would benefit from information about
future needs.

An extended API may be necessary in future for more complicated cases
(for example recovery, whose LsnReadQueue device in xlogprefetcher.c is
a distant cousin of this code and should eventually be replaced by
this), but this basic API is sufficient for many common usage patterns
involving predictable access to a single relation fork.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (contributions)
Author: Melanie Plageman <melanieplageman@gmail.com> (contributions)
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/Makefile          |   2 +-
 src/backend/storage/aio/Makefile      |  14 +
 src/backend/storage/aio/meson.build   |   5 +
 src/backend/storage/aio/read_stream.c | 819 ++++++++++++++++++++++++++
 src/backend/storage/meson.build       |   1 +
 src/include/storage/read_stream.h     |  63 ++
 src/tools/pgindent/typedefs.list      |   2 +
 7 files changed, 905 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/read_stream.c
 create mode 100644 src/include/storage/read_stream.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..2f29a9ec4d1
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	read_stream.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..10e1aa3b20b
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'read_stream.c',
+)
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
new file mode 100644
index 00000000000..f7e9dc1138b
--- /dev/null
+++ b/src/backend/storage/aio/read_stream.c
@@ -0,0 +1,819 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.c
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ * Code that needs to access relation data typically pins blocks one at a
+ * time, often in a predictable order that might be sequential or data-driven.
+ * Calling the simple ReadBuffer() function for each block is inefficient,
+ * because blocks that are not yet in the buffer pool require I/O operations
+ * that are small and might stall waiting for storage.  This mechanism looks
+ * into the future and calls StartReadBuffers() and WaitReadBuffers() to read
+ * neighboring blocks together and ahead of time, with an adaptive look-ahead
+ * distance.
+ *
+ * A user-provided callback generates a stream of block numbers that is used
+ * to form reads of up to io_combine_limit, by attempting to merge them with a
+ * pending read.  When that isn't possible, the existing pending read is sent
+ * to StartReadBuffers() so that a new one can begin to form.
+ *
+ * The algorithm for controlling the look-ahead distance tries to classify the
+ * stream into three ideal behaviors:
+ *
+ * A) No I/O is necessary, because the requested blocks are fully cached
+ * already.  There is no benefit to looking ahead more than one block, so
+ * distance is 1.  This is the default initial assumption.
+ *
+ * B) I/O is necessary, but fadvise is undesirable because the access is
+ * sequential, or impossible because direct I/O is enabled or the system
+ * doesn't support advice.  There is no benefit in looking ahead more than
+ * io_combine_limit, because in this case only goal is larger read system
+ * calls.  Looking further ahead would pin many buffers and perform
+ * speculative work looking ahead for no benefit.
+ *
+ * C) I/O is necesssary, it appears random, and this system supports fadvise.
+ * We'll look further ahead in order to reach the configured level of I/O
+ * concurrency.
+ *
+ * The distance increases rapidly and decays slowly, so that it moves towards
+ * those levels as different I/O patterns are discovered.  For example, a
+ * sequential scan of fully cached data doesn't bother looking ahead, but a
+ * sequential scan that hits a region of uncached blocks will start issuing
+ * increasingly wide read calls until it plateaus at io_combine_limit.
+ *
+ * The main data structure is a circular queue of buffers of size
+ * max_pinned_buffers plus some extra space for technical reasons, ready to be
+ * returned by read_stream_next_buffer().  Each buffer also has an optional
+ * variable sized object that is passed from the callback to the consumer of
+ * buffers.
+ *
+ * Parallel to the queue of buffers, there is a circular queue of in-progress
+ * I/Os that have been started with StartReadBuffers(), and for which
+ * WaitReadBuffers() must be called before returning the buffer.
+ *
+ * For example, if the callback return block numbers 10, 42, 43, 60 in
+ * successive calls, then these data structures might appear as follows:
+ *
+ *                          buffers buf/data       ios
+ *
+ *                          +----+  +-----+       +--------+
+ *                          |    |  |     |  +----+ 42..44 | <- oldest_io_index
+ *                          +----+  +-----+  |    +--------+
+ *   oldest_buffer_index -> | 10 |  |  ?  |  | +--+ 60..60 |
+ *                          +----+  +-----+  | |  +--------+
+ *                          | 42 |  |  ?  |<-+ |  |        | <- next_io_index
+ *                          +----+  +-----+    |  +--------+
+ *                          | 43 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 44 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 60 |  |  ?  |<---+
+ *                          +----+  +-----+
+ *     next_buffer_index -> |    |  |     |
+ *                          +----+  +-----+
+ *
+ * In the example, 5 buffers are pinned, and the next buffer to be streamed to
+ * the client is block 10.  Block 10 was a hit and has no associated I/O, but
+ * the range 42..44 requires an I/O wait before its buffers are returned, as
+ * does block 60.
+ *
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/aio/read_stream.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "catalog/pg_tablespace.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+#include "storage/read_stream.h"
+#include "utils/memdebug.h"
+#include "utils/rel.h"
+#include "utils/spccache.h"
+
+typedef struct InProgressIO
+{
+	int16		buffer_index;
+	ReadBuffersOperation op;
+} InProgressIO;
+
+/*
+ * State for managing a stream of reads.
+ */
+struct ReadStream
+{
+	int16		max_ios;
+	int16		ios_in_progress;
+	int16		queue_size;
+	int16		max_pinned_buffers;
+	int16		pinned_buffers;
+	int16		distance;
+	bool		advice_enabled;
+
+	/*
+	 * Small buffer of block numbers, useful for 'ungetting' to resolve flow
+	 * control problems when I/Os are split.  Also useful for batch-loading
+	 * block numbers in the fast path.
+	 */
+	BlockNumber blocknums[16];
+	int16		blocknums_count;
+	int16		blocknums_next;
+
+	/*
+	 * The callback that will tell us which block numbers to read, and an
+	 * opaque pointer that will be pass to it for its own purposes.
+	 */
+	ReadStreamBlockNumberCB callback;
+	void	   *callback_private_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* The read operation we are currently preparing. */
+	BlockNumber pending_read_blocknum;
+	int16		pending_read_nblocks;
+
+	/* Space for buffers and optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Read operations that have been started but not waited for yet. */
+	InProgressIO *ios;
+	int16		oldest_io_index;
+	int16		next_io_index;
+
+	bool		fast_path;
+
+	/* Circular queue of buffers. */
+	int16		oldest_buffer_index;	/* Next pinned buffer to return */
+	int16		next_buffer_index;	/* Index of next buffer to pin */
+	Buffer		buffers[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static inline void *
+get_per_buffer_data(ReadStream *stream, int16 buffer_index)
+{
+	return (char *) stream->per_buffer_data +
+		stream->per_buffer_data_size * buffer_index;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work and to allow the
+ * fast path to work in batches.
+ */
+static inline BlockNumber
+read_stream_get_block(ReadStream *stream, void *per_buffer_data)
+{
+	if (stream->blocknums_next < stream->blocknums_count)
+		return stream->blocknums[stream->blocknums_next++];
+
+	/*
+	 * We only bother to fetch one at a time here (but see the fast path which
+	 * uses more).
+	 */
+	return stream->callback(stream,
+							stream->callback_private_data,
+							per_buffer_data);
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.
+ */
+static inline void
+read_stream_unget_block(ReadStream *stream, BlockNumber blocknum)
+{
+	if (stream->blocknums_next == stream->blocknums_count)
+	{
+		/* Never initialized or entirely consumed.  Re-initialize. */
+		stream->blocknums[0] = blocknum;
+		stream->blocknums_count = 1;
+		stream->blocknums_next = 0;
+	}
+	else
+	{
+		/* Must be the last value return from blocknums array. */
+		Assert(stream->blocknums_next > 0);
+		stream->blocknums_next--;
+		Assert(stream->blocknums[stream->blocknums_next] == blocknum);
+	}
+}
+
+#ifndef READ_STREAM_DISABLE_FAST_PATH
+static void
+read_stream_fill_blocknums(ReadStream *stream)
+{
+	BlockNumber blocknum;
+	int			i = 0;
+
+	do
+	{
+		blocknum = stream->callback(stream,
+									stream->callback_private_data,
+									NULL);
+		stream->blocknums[i++] = blocknum;
+	} while (i < lengthof(stream->blocknums) &&
+			 blocknum != InvalidBlockNumber);
+	stream->blocknums_count = i;
+	stream->blocknums_next = 0;
+}
+#endif
+
+static void
+read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
+{
+	bool		need_wait;
+	int			nblocks;
+	int			flags;
+	int16		io_index;
+	int16		overflow;
+	int16		buffer_index;
+
+	/* This should only be called with a pending read. */
+	Assert(stream->pending_read_nblocks > 0);
+	Assert(stream->pending_read_nblocks <= io_combine_limit);
+
+	/* We had better not exceed the pin limit by starting this read. */
+	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+		   stream->max_pinned_buffers);
+
+	/* We had better not be overwriting an existing pinned buffer. */
+	if (stream->pinned_buffers > 0)
+		Assert(stream->next_buffer_index != stream->oldest_buffer_index);
+	else
+		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
+
+	/*
+	 * If advice hasn't been suppressed, this system supports it, and this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (!suppress_advice &&
+		stream->advice_enabled &&
+		stream->pending_read_blocknum != stream->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+	/* We say how many blocks we want to read, but may be smaller on return. */
+	buffer_index = stream->next_buffer_index;
+	io_index = stream->next_io_index;
+	nblocks = stream->pending_read_nblocks;
+	need_wait = StartReadBuffers(&stream->ios[io_index].op,
+								 &stream->buffers[buffer_index],
+								 stream->pending_read_blocknum,
+								 &nblocks,
+								 flags);
+	stream->pinned_buffers += nblocks;
+
+	/* Remember whether we need to wait before returning this buffer. */
+	if (!need_wait)
+	{
+		/* Look-ahead distance decays, no I/O necessary (behavior A). */
+		if (stream->distance > 1)
+			stream->distance--;
+	}
+	else
+	{
+		/*
+		 * Remember to call WaitReadBuffers() before returning head buffer.
+		 * Look-ahead distance will be adjusted after waiting.
+		 */
+		stream->ios[io_index].buffer_index = buffer_index;
+		if (++stream->next_io_index == stream->max_ios)
+			stream->next_io_index = 0;
+		Assert(stream->ios_in_progress < stream->max_ios);
+		stream->ios_in_progress++;
+		stream->seq_blocknum = stream->pending_read_blocknum + nblocks;
+	}
+
+	/*
+	 * We gave a contiguous range of buffer space to StartReadBuffers(), but
+	 * we want it to wrap around at queue_size.  Slide overflowing buffers to
+	 * the front of the array.
+	 */
+	overflow = (buffer_index + nblocks) - stream->queue_size;
+	if (overflow > 0)
+		memmove(&stream->buffers[0],
+				&stream->buffers[stream->queue_size],
+				sizeof(stream->buffers[0]) * overflow);
+
+	/* Compute location of start of next read, without using % operator. */
+	buffer_index += nblocks;
+	if (buffer_index >= stream->queue_size)
+		buffer_index -= stream->queue_size;
+	Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+	stream->next_buffer_index = buffer_index;
+
+	/* Adjust the pending read to cover the remaining portion, if any. */
+	stream->pending_read_blocknum += nblocks;
+	stream->pending_read_nblocks -= nblocks;
+}
+
+static void
+read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
+{
+	while (stream->ios_in_progress < stream->max_ios &&
+		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+	{
+		BlockNumber blocknum;
+		int16		buffer_index;
+		void	   *per_buffer_data;
+
+		if (stream->pending_read_nblocks == io_combine_limit)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			continue;
+		}
+
+		/*
+		 * See which block the callback wants next in the stream.  We need to
+		 * compute the index of the Nth block of the pending read including
+		 * wrap-around, but we don't want to use the expensive % operator.
+		 */
+		buffer_index = stream->next_buffer_index + stream->pending_read_nblocks;
+		if (buffer_index >= stream->queue_size)
+			buffer_index -= stream->queue_size;
+		Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+		per_buffer_data = get_per_buffer_data(stream, buffer_index);
+		blocknum = read_stream_get_block(stream, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			/* End of stream. */
+			stream->distance = 0;
+			break;
+		}
+
+		/* Can we merge it with the pending read? */
+		if (stream->pending_read_nblocks > 0 &&
+			stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
+		{
+			stream->pending_read_nblocks++;
+			continue;
+		}
+
+		/* We have to start the pending read before we can build another. */
+		if (stream->pending_read_nblocks > 0)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			if (stream->ios_in_progress == stream->max_ios)
+			{
+				/* And we've hit the limit.  Rewind, and stop here. */
+				read_stream_unget_block(stream, blocknum);
+				return;
+			}
+		}
+
+		/* This is the start of a new pending read. */
+		stream->pending_read_blocknum = blocknum;
+		stream->pending_read_nblocks = 1;
+	}
+
+	/*
+	 * We don't start the pending read just because we've hit the distance
+	 * limit, preferring to give it another chance to grow to full
+	 * io_combine_limit size once more buffers have been consumed.  However,
+	 * if we've already reached io_combine_limit, or we've reached the
+	 * distance limit and there isn't anything pinned yet, or the callback has
+	 * signaled end-of-stream, we start the read immediately.
+	 */
+	if (stream->pending_read_nblocks > 0 &&
+		(stream->pending_read_nblocks == io_combine_limit ||
+		 (stream->pending_read_nblocks == stream->distance &&
+		  stream->pinned_buffers == 0) ||
+		 stream->distance == 0) &&
+		stream->ios_in_progress < stream->max_ios)
+		read_stream_start_pending_read(stream, suppress_advice);
+}
+
+/*
+ * Create a new read stream object that can be used to perform the equivalent
+ * of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.  The callback should return block numbers or InvalidBlockNumber to
+ * signal end-of-stream, and if per_buffer_data_size is non-zero, it may also
+ * write extra data for each block into the space provided to it.  It will
+ * also receive callback_private_data for its own purposes.
+ */
+ReadStream *
+read_stream_begin_relation(int flags,
+						   BufferAccessStrategy strategy,
+						   BufferManagerRelation bmr,
+						   ForkNumber forknum,
+						   ReadStreamBlockNumberCB callback,
+						   void *callback_private_data,
+						   size_t per_buffer_data_size)
+{
+	ReadStream *stream;
+	size_t		size;
+	int16		queue_size;
+	int16		max_ios;
+	uint32		max_pinned_buffers;
+	Oid			tablespace_id;
+
+	/* Make sure our bmr's smgr and persistent are populated. */
+	if (bmr.smgr == NULL)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	/*
+	 * Decide how many I/Os we will allow to run at the same time.  That
+	 * currently means advice to the kernel to tell it that we will soon read.
+	 * This number also affects how far we look ahead for opportunities to
+	 * start more I/Os.
+	 */
+	tablespace_id = bmr.smgr->smgr_rlocator.locator.spcOid;
+	if (!OidIsValid(MyDatabaseId) ||
+		(bmr.rel && IsCatalogRelation(bmr.rel)) ||
+		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
+	{
+		/*
+		 * Avoid circularity while trying to look up tablespace settings or
+		 * before spccache.c is ready.
+		 */
+		max_ios = effective_io_concurrency;
+	}
+	else if (flags & READ_STREAM_MAINTENANCE)
+		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+	else
+		max_ios = get_tablespace_io_concurrency(tablespace_id);
+	max_ios = Min(max_ios, PG_INT16_MAX);
+
+	/*
+	 * Choose the maximum number of buffers we're prepared to pin.  We try to
+	 * pin fewer if we can, though.  We clamp it to at least io_combine_limit
+	 * so that we can have a chance to build up a full io_combine_limit sized
+	 * read, even when max_ios is zero.  Be careful not to allow int16 to
+	 * overflow (even though that's not possible with the current GUC range
+	 * limits), allowing also for the spare entry and the overflow space.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+	max_pinned_buffers = Min(max_pinned_buffers,
+							 PG_INT16_MAX - io_combine_limit - 1);
+
+	/* Don't allow this backend to pin more than its share of buffers. */
+	if (SmgrIsTemp(bmr.smgr))
+		LimitAdditionalLocalPins(&max_pinned_buffers);
+	else
+		LimitAdditionalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * We need one extra entry for buffers and per-buffer data, because users
+	 * of per-buffer data have access to the object until the next call to
+	 * read_stream_next_buffer(), so we need a gap between the head and tail
+	 * of the queue so that we don't clobber it.
+	 */
+	queue_size = max_pinned_buffers + 1;
+
+	/*
+	 * Allocate the object, the buffers, the ios and per_data_data space in
+	 * one big chunk.  Though we have queue_size buffers, we want to be able
+	 * to assume that all the buffers for a single read are contiguous (i.e.
+	 * don't wrap around halfway through), so we allow temporary overflows of
+	 * up to the maximum possible read size by allocating an extra
+	 * io_combine_limit - 1 elements.
+	 */
+	size = offsetof(ReadStream, buffers);
+	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(InProgressIO) * Max(1, max_ios);
+	size += per_buffer_data_size * queue_size;
+	size += MAXIMUM_ALIGNOF * 2;
+	stream = (ReadStream *) palloc(size);
+	memset(stream, 0, offsetof(ReadStream, buffers));
+	stream->ios = (InProgressIO *)
+		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+	if (per_buffer_data_size > 0)
+		stream->per_buffer_data = (void *)
+			MAXALIGN(&stream->ios[Max(1, max_ios)]);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  We can use it as long as
+	 * direct I/O isn't enabled, the caller hasn't promised sequential access
+	 * (overriding our detection heuristics), and max_ios hasn't been set to
+	 * zero.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & READ_STREAM_SEQUENTIAL) == 0 &&
+		max_ios > 0)
+		stream->advice_enabled = true;
+#endif
+
+	/*
+	 * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
+	 * above.  If we had real asynchronous I/O we might need a slightly
+	 * different definition.
+	 */
+	if (max_ios == 0)
+		max_ios = 1;
+
+	stream->max_ios = max_ios;
+	stream->per_buffer_data_size = per_buffer_data_size;
+	stream->max_pinned_buffers = max_pinned_buffers;
+	stream->queue_size = queue_size;
+	stream->callback = callback;
+	stream->callback_private_data = callback_private_data;
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out assuming we'll be
+	 * doing full io_combine_limit sized reads (behavior B).
+	 */
+	if (flags & READ_STREAM_FULL)
+		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+	else
+		stream->distance = 1;
+
+	/*
+	 * Since we always currently always access the same relation, we can
+	 * initialize parts of the ReadBuffersOperation objects and leave them
+	 * that way, to avoid wasting CPU cycles writing to them for each read.
+	 */
+	if (!bmr.smgr)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+	for (int i = 0; i < max_ios; ++i)
+	{
+		stream->ios[i].op.bmr = bmr;
+		stream->ios[i].op.forknum = forknum;
+		stream->ios[i].op.strategy = strategy;
+	}
+
+	return stream;
+}
+
+/*
+ * Pull one pinned buffer out of a stream.  Each call returns successive
+ * blocks in the order specified by the callback.  If per_buffer_data_size was
+ * set to a non-zero size, *per_buffer_data receives a pointer to the extra
+ * per-buffer data that the callback had a chance to populate, which remains
+ * valid until the next call to read_stream_next_buffer().  When the stream
+ * runs out of data, InvalidBuffer is returned.  The caller may decide to end
+ * the stream early at any time by calling read_stream_end().
+ */
+Buffer
+read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
+{
+	Buffer		buffer;
+	int16		oldest_buffer_index;
+
+#ifndef READ_STREAM_DISABLE_FAST_PATH
+
+	/*
+	 * A fast path for all-cached scans (behavior A).  This is the same as the
+	 * usual algorithm, but it is specialized for no I/O and no per-buffer
+	 * data, so we can skip the queue management code, stay in the same buffer
+	 * slot and use singular StartReadBuffer().
+	 */
+	if (likely(stream->fast_path))
+	{
+		BlockNumber next_blocknum;
+		bool		need_wait;
+
+		/* Fast path assumptions. */
+		Assert(stream->ios_in_progress == 0);
+		Assert(stream->pinned_buffers == 1);
+		Assert(stream->distance == 1);
+		Assert(stream->pending_read_nblocks == 1);
+		Assert(stream->per_buffer_data_size == 0);
+
+		/* We're going to return the buffer we pinned last time. */
+		oldest_buffer_index = stream->oldest_buffer_index;
+		Assert((oldest_buffer_index + 1) % stream->queue_size ==
+			   stream->next_buffer_index);
+		buffer = stream->buffers[oldest_buffer_index];
+		Assert(buffer != InvalidBuffer);
+
+		/*
+		 * Pin a buffer for the next call.  Same buffer entry, and arbitrary
+		 * I/O entry (they're all free).
+		 */
+		need_wait = StartReadBuffer(&stream->ios[0].op,
+									&stream->buffers[oldest_buffer_index],
+									stream->pending_read_blocknum,
+									stream->advice_enabled ?
+									READ_BUFFERS_ISSUE_ADVICE : 0);
+
+		/* Choose the block the next call will pin. */
+		if (unlikely(stream->blocknums_next == stream->blocknums_count))
+			read_stream_fill_blocknums(stream);
+		next_blocknum = stream->blocknums[stream->blocknums_next++];
+
+		/*
+		 * Fast return if the next call doesn't require I/O for the buffer we
+		 * just pinned, and we have a block number to give it as a pending
+		 * read.
+		 */
+		if (likely(!need_wait && next_blocknum != InvalidBlockNumber))
+		{
+			stream->pending_read_blocknum = next_blocknum;
+			return buffer;
+		}
+
+		/*
+		 * For anything more complex, set up some more state and take the slow
+		 * path next time.
+		 */
+		stream->fast_path = false;
+
+		if (need_wait)
+		{
+			/* Next call must wait for I/O for the newly pinned buffer. */
+			stream->oldest_io_index = 0;
+			stream->next_io_index = stream->max_ios > 1 ? 1 : 0;
+			stream->ios_in_progress = 1;
+			stream->ios[0].buffer_index = oldest_buffer_index;
+			stream->seq_blocknum = next_blocknum + 1;
+		}
+		if (next_blocknum == InvalidBlockNumber)
+		{
+			/* Next call hits end of stream and can't pin anything more. */
+			stream->distance = 0;
+			stream->pending_read_nblocks = 0;
+		}
+		else
+		{
+			/* Set up the pending read. */
+			stream->pending_read_blocknum = next_blocknum;
+		}
+		return buffer;
+	}
+#endif
+
+	if (unlikely(stream->pinned_buffers == 0))
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+
+		/* End of stream reached?  */
+		if (stream->distance == 0)
+			return InvalidBuffer;
+
+		/*
+		 * The usual order of operations is that we look ahead at the bottom
+		 * of this function after potentially finishing an I/O and making
+		 * space for more, but if we're just starting up we'll need to crank
+		 * the handle to get started.
+		 */
+		read_stream_look_ahead(stream, true);
+
+		/* End of stream reached? */
+		if (stream->pinned_buffers == 0)
+		{
+			Assert(stream->distance == 0);
+			return InvalidBuffer;
+		}
+	}
+
+	/* Grab the oldest pinned buffer and associated per-buffer data. */
+	Assert(stream->pinned_buffers > 0);
+	oldest_buffer_index = stream->oldest_buffer_index;
+	Assert(oldest_buffer_index >= 0 &&
+		   oldest_buffer_index < stream->queue_size);
+	buffer = stream->buffers[oldest_buffer_index];
+	if (per_buffer_data)
+		*per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
+
+	Assert(BufferIsValid(buffer));
+
+	/* Do we have to wait for an associated I/O first? */
+	if (stream->ios_in_progress > 0 &&
+		stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
+	{
+		int16		io_index = stream->oldest_io_index;
+		int16		distance;
+
+		/* Sanity check that we still agree on the buffers. */
+		Assert(stream->ios[io_index].op.buffers ==
+			   &stream->buffers[oldest_buffer_index]);
+
+		WaitReadBuffers(&stream->ios[io_index].op);
+
+		Assert(stream->ios_in_progress > 0);
+		stream->ios_in_progress--;
+		if (++stream->oldest_io_index == stream->max_ios)
+			stream->oldest_io_index = 0;
+
+		if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/* Distance ramps up fast (behavior C). */
+			distance = stream->distance * 2;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
+		else
+		{
+			/* No advice; move towards io_combine_limit (behavior B). */
+			if (stream->distance > io_combine_limit)
+			{
+				stream->distance--;
+			}
+			else
+			{
+				distance = stream->distance * 2;
+				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, stream->max_pinned_buffers);
+				stream->distance = distance;
+			}
+		}
+	}
+
+#ifdef CLOBBER_FREED_MEMORY
+	/* Clobber old buffer and per-buffer data for debugging purposes. */
+	stream->buffers[oldest_buffer_index] = InvalidBuffer;
+
+	/*
+	 * The caller will get access to the per-buffer data, until the next call.
+	 * We wipe the one before, which is never occupied because queue_size
+	 * allowed one extra element.  This will hopefully trip up client code
+	 * that is holding a dangling pointer to it.
+	 */
+	if (stream->per_buffer_data)
+		wipe_mem(get_per_buffer_data(stream,
+									 oldest_buffer_index == 0 ?
+									 stream->queue_size - 1 :
+									 oldest_buffer_index - 1),
+				 stream->per_buffer_data_size);
+#endif
+
+	/* Pin transferred to caller. */
+	Assert(stream->pinned_buffers > 0);
+	stream->pinned_buffers--;
+
+	/* Advance oldest buffer, with wrap-around. */
+	stream->oldest_buffer_index++;
+	if (stream->oldest_buffer_index == stream->queue_size)
+		stream->oldest_buffer_index = 0;
+
+	/* Prepare for the next call. */
+	read_stream_look_ahead(stream, false);
+
+#ifndef READ_STREAM_DISABLE_FAST_PATH
+	/* See if we can take the fast path for all-cached scans next time. */
+	if (stream->ios_in_progress == 0 &&
+		stream->pinned_buffers == 1 &&
+		stream->distance == 1 &&
+		stream->pending_read_nblocks == 1 &&
+		stream->per_buffer_data_size == 0)
+	{
+		stream->fast_path = true;
+	}
+	else
+	{
+		stream->fast_path = false;
+	}
+#endif
+
+	return buffer;
+}
+
+/*
+ * Reset a read stream by releasing any queued up buffers, allowing the stream
+ * to be used again for different blocks.  This can be used to clear an
+ * end-of-stream condition and start again, or to throw away blocks that were
+ * speculatively read and read some different blocks instead.
+ */
+void
+read_stream_reset(ReadStream *stream)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	stream->distance = 0;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(stream->pinned_buffers == 0);
+	Assert(stream->ios_in_progress == 0);
+
+	/* Start off assuming data is cached. */
+	stream->distance = 1;
+}
+
+/*
+ * Release and free a read stream.
+ */
+void
+read_stream_end(ReadStream *stream)
+{
+	read_stream_reset(stream);
+	pfree(stream);
+}
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
new file mode 100644
index 00000000000..f5dbc087b0b
--- /dev/null
+++ b/src/include/storage/read_stream.h
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.h
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/read_stream.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef READ_STREAM_H
+#define READ_STREAM_H
+
+#include "storage/bufmgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define READ_STREAM_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users, and thus should be governed by maintenance_io_concurrency
+ * instead of effective_io_concurrency.  For example, VACUUM or CREATE INDEX.
+ */
+#define READ_STREAM_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define READ_STREAM_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define READ_STREAM_FULL 0x04
+
+struct ReadStream;
+typedef struct ReadStream ReadStream;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
+												void *callback_private_data,
+												void *per_buffer_data);
+
+extern ReadStream *read_stream_begin_relation(int flags,
+											  BufferAccessStrategy strategy,
+											  BufferManagerRelation bmr,
+											  ForkNumber forknum,
+											  ReadStreamBlockNumberCB callback,
+											  void *callback_private_data,
+											  size_t per_buffer_data_size);
+extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
+extern void read_stream_reset(ReadStream *stream);
+extern void read_stream_end(ReadStream *stream);
+
+#endif							/* READ_STREAM_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 826874135e5..0806c32a7bb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1214,6 +1214,7 @@ InjectionPointCacheEntry
 InjectionPointEntry
 InjectionPointSharedState
 InlineCodeBlock
+InProgressIO
 InsertStmt
 Instrumentation
 Int128AggState
@@ -2293,6 +2294,7 @@ ReadExtraTocPtrType
 ReadFunc
 ReadLocalXLogPageNoWaitPrivate
 ReadReplicationSlotCmd
+ReadStream
 ReassignOwnedStmt
 RecheckForeignScan_function
 RecordCacheArrayEntry
-- 
2.44.0

v14-0003-Use-streaming-I-O-in-pg_prewarm.patchtext/x-patch; charset=US-ASCII; name=v14-0003-Use-streaming-I-O-in-pg_prewarm.patchDownload

From ab18713ce5b26bb09bf3264ab3ab4d28220728ae Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 09:28:42 +1200
Subject: [PATCH v14 3/7] Use streaming I/O in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use the new streaming read
interfaces.  This commit provides a very simple example of such a
transformation, and generates fewer system calls.

Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 contrib/pg_prewarm/pg_prewarm.c | 40 ++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..f6269562b54 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -19,6 +19,7 @@
 #include "fmgr.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_read_stream_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_read_stream_next_block(ReadStream *stream,
+								  void *callback_private_data,
+								  void *per_buffer_data)
+{
+	struct pg_prewarm_read_stream_private *p = callback_private_data;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_read_stream_private p;
+		ReadStream *stream;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		stream = read_stream_begin_relation(READ_STREAM_FULL,
+											NULL,
+											BMR_REL(rel),
+											forkNumber,
+											pg_prewarm_read_stream_next_block,
+											&p,
+											0);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = read_stream_next_buffer(stream, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
 	}
 
 	/* Close relation, release lock. */
-- 
2.44.0

v14-0004-ALTER-TABLESPACE-.-SET-io_combine_limit.patchtext/x-patch; charset=US-ASCII; name=v14-0004-ALTER-TABLESPACE-.-SET-io_combine_limit.patchDownload

From 163c2ec677c0f57206c53f57c6ac9aff93905246 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 30 Mar 2024 19:09:44 +1300
Subject: [PATCH v14 4/7] ALTER TABLESPACE ... SET (io_combine_limit = ...).

This is the per-tablespace version of the GUC of the same name.

XXX reloptions.c lacks the ability to accept units eg '64kB'!  Which is
why I haven't included it with the main feature commit.

Suggested-by: Tomas Vondra <tomas.vondra@enterprisedb.com
Discussion: https://postgr.es/m/f603ac51-a7ff-496a-99c1-76673635692e%40enterprisedb.com
---
 doc/src/sgml/ref/alter_tablespace.sgml |  9 +++---
 src/backend/access/common/reloptions.c | 12 +++++++-
 src/backend/storage/aio/read_stream.c  | 39 ++++++++++++++++----------
 src/backend/utils/cache/spccache.c     | 14 +++++++++
 src/bin/psql/tab-complete.c            |  3 +-
 src/include/commands/tablespace.h      |  1 +
 src/include/utils/spccache.h           |  1 +
 7 files changed, 58 insertions(+), 21 deletions(-)

diff --git a/doc/src/sgml/ref/alter_tablespace.sgml b/doc/src/sgml/ref/alter_tablespace.sgml
index 6ec863400d1..faf0c6e7fbc 100644
--- a/doc/src/sgml/ref/alter_tablespace.sgml
+++ b/doc/src/sgml/ref/alter_tablespace.sgml
@@ -84,16 +84,17 @@ ALTER TABLESPACE <replaceable>name</replaceable> RESET ( <replaceable class="par
      <para>
       A tablespace parameter to be set or reset.  Currently, the only
       available parameters are <varname>seq_page_cost</varname>,
-      <varname>random_page_cost</varname>, <varname>effective_io_concurrency</varname>
-      and <varname>maintenance_io_concurrency</varname>.
+      <varname>random_page_cost</varname>, <varname>effective_io_concurrency</varname>,
+      <varname>maintenance_io_concurrency</varname> and <varname>io_combine_limit</varname>.
       Setting these values for a particular tablespace will override the
       planner's usual estimate of the cost of reading pages from tables in
-      that tablespace, and the executor's prefetching behavior, as established
+      that tablespace, and the executor's prefetching and I/O sizing behavior, as established
       by the configuration parameters of the
       same name (see <xref linkend="guc-seq-page-cost"/>,
       <xref linkend="guc-random-page-cost"/>,
       <xref linkend="guc-effective-io-concurrency"/>,
-      <xref linkend="guc-maintenance-io-concurrency"/>).  This may be useful if
+      <xref linkend="guc-maintenance-io-concurrency"/>),
+      <xref linkend="guc-io-combine-limit"/>).  This may be useful if
       one tablespace is located on a disk which is faster or slower than the
       remainder of the I/O subsystem.
      </para>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 963995388bb..573bf1e8606 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -372,6 +372,15 @@ static relopt_int intRelOpts[] =
 		0, 0, 0
 #endif
 	},
+	{
+		{
+			"io_combine_limit",
+			"Limit on the size of data reads and writes.",
+			RELOPT_KIND_TABLESPACE,
+			ShareUpdateExclusiveLock
+		},
+		-1, 1, MAX_IO_COMBINE_LIMIT
+	},
 	{
 		{
 			"parallel_workers",
@@ -2091,7 +2100,8 @@ tablespace_reloptions(Datum reloptions, bool validate)
 		{"random_page_cost", RELOPT_TYPE_REAL, offsetof(TableSpaceOpts, random_page_cost)},
 		{"seq_page_cost", RELOPT_TYPE_REAL, offsetof(TableSpaceOpts, seq_page_cost)},
 		{"effective_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, effective_io_concurrency)},
-		{"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)}
+		{"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)},
+		{"io_combine_limit", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, io_combine_limit)},
 	};
 
 	return (bytea *) build_reloptions(reloptions, validate,
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f7e9dc1138b..8dd19ed3390 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -114,6 +114,7 @@ struct ReadStream
 	int16		max_pinned_buffers;
 	int16		pinned_buffers;
 	int16		distance;
+	int16		io_combine_limit;
 	bool		advice_enabled;
 
 	/*
@@ -241,7 +242,7 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 
 	/* This should only be called with a pending read. */
 	Assert(stream->pending_read_nblocks > 0);
-	Assert(stream->pending_read_nblocks <= io_combine_limit);
+	Assert(stream->pending_read_nblocks <= stream->io_combine_limit);
 
 	/* We had better not exceed the pin limit by starting this read. */
 	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
@@ -329,7 +330,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		int16		buffer_index;
 		void	   *per_buffer_data;
 
-		if (stream->pending_read_nblocks == io_combine_limit)
+		if (stream->pending_read_nblocks == stream->io_combine_limit)
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
@@ -389,7 +390,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 	 * signaled end-of-stream, we start the read immediately.
 	 */
 	if (stream->pending_read_nblocks > 0 &&
-		(stream->pending_read_nblocks == io_combine_limit ||
+		(stream->pending_read_nblocks == stream->io_combine_limit ||
 		 (stream->pending_read_nblocks == stream->distance &&
 		  stream->pinned_buffers == 0) ||
 		 stream->distance == 0) &&
@@ -419,6 +420,7 @@ read_stream_begin_relation(int flags,
 	size_t		size;
 	int16		queue_size;
 	int16		max_ios;
+	int16		my_io_combine_limit;
 	uint32		max_pinned_buffers;
 	Oid			tablespace_id;
 
@@ -441,15 +443,21 @@ read_stream_begin_relation(int flags,
 		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
 	{
 		/*
-		 * Avoid circularity while trying to look up tablespace settings or
-		 * before spccache.c is ready.
+		 * Avoid circularity while trying to look up tablespace catalog itself
+		 * or before spccache.c is ready: just use the plain GUC values in
+		 * those cases.
 		 */
 		max_ios = effective_io_concurrency;
+		my_io_combine_limit = io_combine_limit;
 	}
-	else if (flags & READ_STREAM_MAINTENANCE)
-		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
 	else
-		max_ios = get_tablespace_io_concurrency(tablespace_id);
+	{
+		if (flags & READ_STREAM_MAINTENANCE)
+			max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+		else
+			max_ios = get_tablespace_io_concurrency(tablespace_id);
+		my_io_combine_limit = get_tablespace_io_combine_limit(tablespace_id);
+	}
 	max_ios = Min(max_ios, PG_INT16_MAX);
 
 	/*
@@ -460,9 +468,9 @@ read_stream_begin_relation(int flags,
 	 * overflow (even though that's not possible with the current GUC range
 	 * limits), allowing also for the spare entry and the overflow space.
 	 */
-	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+	max_pinned_buffers = Max(max_ios * 4, my_io_combine_limit);
 	max_pinned_buffers = Min(max_pinned_buffers,
-							 PG_INT16_MAX - io_combine_limit - 1);
+							 PG_INT16_MAX - my_io_combine_limit - 1);
 
 	/* Don't allow this backend to pin more than its share of buffers. */
 	if (SmgrIsTemp(bmr.smgr))
@@ -488,14 +496,14 @@ read_stream_begin_relation(int flags,
 	 * io_combine_limit - 1 elements.
 	 */
 	size = offsetof(ReadStream, buffers);
-	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(Buffer) * (queue_size + my_io_combine_limit - 1);
 	size += sizeof(InProgressIO) * Max(1, max_ios);
 	size += per_buffer_data_size * queue_size;
 	size += MAXIMUM_ALIGNOF * 2;
 	stream = (ReadStream *) palloc(size);
 	memset(stream, 0, offsetof(ReadStream, buffers));
 	stream->ios = (InProgressIO *)
-		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+		MAXALIGN(&stream->buffers[queue_size + my_io_combine_limit - 1]);
 	if (per_buffer_data_size > 0)
 		stream->per_buffer_data = (void *)
 			MAXALIGN(&stream->ios[Max(1, max_ios)]);
@@ -523,6 +531,7 @@ read_stream_begin_relation(int flags,
 		max_ios = 1;
 
 	stream->max_ios = max_ios;
+	stream->io_combine_limit = my_io_combine_limit;
 	stream->per_buffer_data_size = per_buffer_data_size;
 	stream->max_pinned_buffers = max_pinned_buffers;
 	stream->queue_size = queue_size;
@@ -535,7 +544,7 @@ read_stream_begin_relation(int flags,
 	 * doing full io_combine_limit sized reads (behavior B).
 	 */
 	if (flags & READ_STREAM_FULL)
-		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+		stream->distance = Min(max_pinned_buffers, my_io_combine_limit);
 	else
 		stream->distance = 1;
 
@@ -720,14 +729,14 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		else
 		{
 			/* No advice; move towards io_combine_limit (behavior B). */
-			if (stream->distance > io_combine_limit)
+			if (stream->distance > stream->io_combine_limit)
 			{
 				stream->distance--;
 			}
 			else
 			{
 				distance = stream->distance * 2;
-				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, stream->io_combine_limit);
 				distance = Min(distance, stream->max_pinned_buffers);
 				stream->distance = distance;
 			}
diff --git a/src/backend/utils/cache/spccache.c b/src/backend/utils/cache/spccache.c
index ec63cdc8e52..97664824a4d 100644
--- a/src/backend/utils/cache/spccache.c
+++ b/src/backend/utils/cache/spccache.c
@@ -235,3 +235,17 @@ get_tablespace_maintenance_io_concurrency(Oid spcid)
 	else
 		return spc->opts->maintenance_io_concurrency;
 }
+
+/*
+ * get_tablespace_io_combine_limit
+ */
+int
+get_tablespace_io_combine_limit(Oid spcid)
+{
+	TableSpaceCacheEntry *spc = get_tablespace(spcid);
+
+	if (!spc->opts || spc->opts->io_combine_limit < 0)
+		return io_combine_limit;
+	else
+		return spc->opts->io_combine_limit;
+}
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index fc6865fc703..a4c746cf63a 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2633,7 +2633,8 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER TABLESPACE <foo> SET|RESET ( */
 	else if (Matches("ALTER", "TABLESPACE", MatchAny, "SET|RESET", "("))
 		COMPLETE_WITH("seq_page_cost", "random_page_cost",
-					  "effective_io_concurrency", "maintenance_io_concurrency");
+					  "effective_io_concurrency", "maintenance_io_concurrency",
+					  "io_combine_limit");
 
 	/* ALTER TEXT SEARCH */
 	else if (Matches("ALTER", "TEXT", "SEARCH"))
diff --git a/src/include/commands/tablespace.h b/src/include/commands/tablespace.h
index b6cec632db9..c187283a6dd 100644
--- a/src/include/commands/tablespace.h
+++ b/src/include/commands/tablespace.h
@@ -43,6 +43,7 @@ typedef struct TableSpaceOpts
 	float8		seq_page_cost;
 	int			effective_io_concurrency;
 	int			maintenance_io_concurrency;
+	int			io_combine_limit;
 } TableSpaceOpts;
 
 extern Oid	CreateTableSpace(CreateTableSpaceStmt *stmt);
diff --git a/src/include/utils/spccache.h b/src/include/utils/spccache.h
index 11cfa719955..f8a19764a8f 100644
--- a/src/include/utils/spccache.h
+++ b/src/include/utils/spccache.h
@@ -17,5 +17,6 @@ extern void get_tablespace_page_costs(Oid spcid, float8 *spc_random_page_cost,
 									  float8 *spc_seq_page_cost);
 extern int	get_tablespace_io_concurrency(Oid spcid);
 extern int	get_tablespace_maintenance_io_concurrency(Oid spcid);
+extern int	get_tablespace_io_combine_limit(Oid spcid);
 
 #endif							/* SPCCACHE_H */
-- 
2.44.0

v14-0005-Add-effective_io_readahead_window-setting.patchtext/x-patch; charset=US-ASCII; name=v14-0005-Add-effective_io_readahead_window-setting.patchDownload

From a72c51302f3ca71ab7524bcead0179a6304a677b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 31 Mar 2024 05:48:20 +1300
Subject: [PATCH v14 5/7] Add effective_io_readahead_window setting.

In a few places we estimate whether the kernel will consider a pattern
of reads to be sequential, for readahead purposes.  If we think so, we
don't issue POSIX_FADV_WILLNEED advice, because we know that it's better
to get out of the kernel's way.

Linux uses a sliding window to detect sequential access, defaulting to
128kB (see blockdev --report).  This setting allows the user to tell
PostgreSQL that size, so we can try to model kernel behavior more
accurately.  Several other known operating systems seem to use a strict
next-block-only detection, so for other systems we just use 0 for now
(it must be exactly sequential), until we have specific information
about other OSes.  It's not a very critical question, because most other
systems either lack or ignore POSIX_FADV_WILLNEED anyway.

XXX Is this a good idea?
XXX This doesn't support units like '128kB' in ALTER TABLE, to be
like the GUC.

Suggested-by: Tomas Vondra <tomas.vondra@enterprisedb.com
Discussion: https://postgr.es/m/f603ac51-a7ff-496a-99c1-76673635692e%40enterprisedb.com
---
 doc/src/sgml/config.sgml                      | 22 +++++++++++++++++++
 doc/src/sgml/ref/alter_tablespace.sgml        |  6 +++--
 src/backend/access/common/reloptions.c        | 10 +++++++++
 src/backend/storage/aio/read_stream.c         | 14 ++++++++++--
 src/backend/storage/buffer/bufmgr.c           |  7 ++++++
 src/backend/utils/cache/spccache.c            | 14 ++++++++++++
 src/backend/utils/misc/guc_tables.c           | 14 ++++++++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/bin/psql/tab-complete.c                   |  2 +-
 src/include/commands/tablespace.h             |  1 +
 src/include/storage/bufmgr.h                  |  7 ++++++
 src/include/utils/spccache.h                  |  1 +
 12 files changed, 94 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 241a6079688..fe718b8228e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2734,6 +2734,28 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-effective-io-readahead-window" xreflabel="effective_io_readahead_window">
+       <term><varname>effective_io_readahead_window</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>effective_io_readahead_window</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the size of the window that <productname>PostgreSQL</productname>
+         expects the OS to use to detect sequential reads.  This is used to
+         make decisions about when to issue hints to the kernel on future
+         random reads.  Does not apply when direct I/O is used.
+        </para>
+        <para>
+         The default is 128kB on Linux, and 0 otherwise.  This value can
+         be overridden for tables in a particular tablespace by setting the
+         tablespace parameter of the same name (see
+         <xref linkend="sql-altertablespace"/>).
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
        <term><varname>max_worker_processes</varname> (<type>integer</type>)
        <indexterm>
diff --git a/doc/src/sgml/ref/alter_tablespace.sgml b/doc/src/sgml/ref/alter_tablespace.sgml
index faf0c6e7fbc..25dbbea66ac 100644
--- a/doc/src/sgml/ref/alter_tablespace.sgml
+++ b/doc/src/sgml/ref/alter_tablespace.sgml
@@ -85,7 +85,8 @@ ALTER TABLESPACE <replaceable>name</replaceable> RESET ( <replaceable class="par
       A tablespace parameter to be set or reset.  Currently, the only
       available parameters are <varname>seq_page_cost</varname>,
       <varname>random_page_cost</varname>, <varname>effective_io_concurrency</varname>,
-      <varname>maintenance_io_concurrency</varname> and <varname>io_combine_limit</varname>.
+      <varname>maintenance_io_concurrency</varname>,
+      <varname>effective_io_readahead_window</varname> and <varname>io_combine_limit</varname>.
       Setting these values for a particular tablespace will override the
       planner's usual estimate of the cost of reading pages from tables in
       that tablespace, and the executor's prefetching and I/O sizing behavior, as established
@@ -94,7 +95,8 @@ ALTER TABLESPACE <replaceable>name</replaceable> RESET ( <replaceable class="par
       <xref linkend="guc-random-page-cost"/>,
       <xref linkend="guc-effective-io-concurrency"/>,
       <xref linkend="guc-maintenance-io-concurrency"/>),
-      <xref linkend="guc-io-combine-limit"/>).  This may be useful if
+      <xref linkend="guc-io-combine-limit"/>),
+      <xref linkend="guc-effective-io-readahead-window"/>).  This may be useful if
       one tablespace is located on a disk which is faster or slower than the
       remainder of the I/O subsystem.
      </para>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 573bf1e8606..8ee22723db9 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -381,6 +381,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, MAX_IO_COMBINE_LIMIT
 	},
+	{
+		{
+			"effective_io_readahead_window",
+			"Size of the window used by the OS to detect sequential buffered access.",
+			RELOPT_KIND_TABLESPACE,
+			ShareUpdateExclusiveLock
+		},
+		-1, 1, PG_INT16_MAX,
+	},
 	{
 		{
 			"parallel_workers",
@@ -2102,6 +2111,7 @@ tablespace_reloptions(Datum reloptions, bool validate)
 		{"effective_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, effective_io_concurrency)},
 		{"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)},
 		{"io_combine_limit", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, io_combine_limit)},
+		{"effective_io_readahead_window", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, effective_io_readahead_window)},
 	};
 
 	return (bytea *) build_reloptions(reloptions, validate,
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 8dd19ed3390..a32177b148c 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -115,6 +115,7 @@ struct ReadStream
 	int16		pinned_buffers;
 	int16		distance;
 	int16		io_combine_limit;
+	int16		effective_io_readahead_window;
 	bool		advice_enabled;
 
 	/*
@@ -256,11 +257,15 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 
 	/*
 	 * If advice hasn't been suppressed, this system supports it, and this
-	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 * isn't sequential according to the effective_io_readahead_window
+	 * setting, (which should ideally tell us how the OS detects sequential
+	 * buffered access), then we'll issue advice.
 	 */
 	if (!suppress_advice &&
 		stream->advice_enabled &&
-		stream->pending_read_blocknum != stream->seq_blocknum)
+		!(stream->pending_read_blocknum >= stream->seq_blocknum &&
+		  stream->pending_read_blocknum <= (stream->seq_blocknum +
+											stream->effective_io_readahead_window)))
 		flags = READ_BUFFERS_ISSUE_ADVICE;
 	else
 		flags = 0;
@@ -422,6 +427,7 @@ read_stream_begin_relation(int flags,
 	int16		max_ios;
 	int16		my_io_combine_limit;
 	uint32		max_pinned_buffers;
+	int16		my_effective_io_readahead_window;
 	Oid			tablespace_id;
 
 	/* Make sure our bmr's smgr and persistent are populated. */
@@ -449,6 +455,7 @@ read_stream_begin_relation(int flags,
 		 */
 		max_ios = effective_io_concurrency;
 		my_io_combine_limit = io_combine_limit;
+		my_effective_io_readahead_window = effective_io_readahead_window;
 	}
 	else
 	{
@@ -457,6 +464,8 @@ read_stream_begin_relation(int flags,
 		else
 			max_ios = get_tablespace_io_concurrency(tablespace_id);
 		my_io_combine_limit = get_tablespace_io_combine_limit(tablespace_id);
+		my_effective_io_readahead_window =
+			get_tablespace_io_readahead_window(tablespace_id);
 	}
 	max_ios = Min(max_ios, PG_INT16_MAX);
 
@@ -535,6 +544,7 @@ read_stream_begin_relation(int flags,
 	stream->per_buffer_data_size = per_buffer_data_size;
 	stream->max_pinned_buffers = max_pinned_buffers;
 	stream->queue_size = queue_size;
+	stream->effective_io_readahead_window = my_effective_io_readahead_window;
 	stream->callback = callback;
 	stream->callback_private_data = callback_private_data;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 70b19238b78..8960fb02a01 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -164,6 +164,13 @@ int			maintenance_io_concurrency = DEFAULT_MAINTENANCE_IO_CONCURRENCY;
  */
 int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
 
+/*
+ * In order to detect sequential access in the same way as the OS would, we
+ * need to know the size of the window it uses.  Overridden by the tablespace
+ * setting of the same name.
+ */
+int			effective_io_readahead_window = DEFAULT_EFFECTIVE_IO_READAHEAD_WINDOW;
+
 /*
  * GUC variables about triggering kernel writeback for buffers written; OS
  * dependent defaults are set via the GUC mechanism.
diff --git a/src/backend/utils/cache/spccache.c b/src/backend/utils/cache/spccache.c
index 97664824a4d..b29d7abda67 100644
--- a/src/backend/utils/cache/spccache.c
+++ b/src/backend/utils/cache/spccache.c
@@ -249,3 +249,17 @@ get_tablespace_io_combine_limit(Oid spcid)
 	else
 		return spc->opts->io_combine_limit;
 }
+
+/*
+ * get_tablespace_io_readahead_window
+ */
+int
+get_tablespace_io_readahead_window(Oid spcid)
+{
+	TableSpaceCacheEntry *spc = get_tablespace(spcid);
+
+	if (!spc->opts || spc->opts->effective_io_readahead_window < 0)
+		return effective_io_readahead_window;
+	else
+		return spc->opts->effective_io_readahead_window;
+}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index c12784cbec8..ab5e5f50df9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3143,6 +3143,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"effective_io_readahead_window",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Size of the window the OS uses to detect sequential buffered file access."),
+			NULL,
+			GUC_EXPLAIN
+		},
+		&effective_io_readahead_window,
+		DEFAULT_EFFECTIVE_IO_READAHEAD_WINDOW,
+		0, PG_INT16_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
 			gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index baecde28410..41598104b32 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -204,6 +204,7 @@
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
 #io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+#effective_io_readahead_window = 16	# 1-32767; expected system readahead window
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index a4c746cf63a..e5e3b8a0366 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2634,7 +2634,7 @@ psql_completion(const char *text, int start, int end)
 	else if (Matches("ALTER", "TABLESPACE", MatchAny, "SET|RESET", "("))
 		COMPLETE_WITH("seq_page_cost", "random_page_cost",
 					  "effective_io_concurrency", "maintenance_io_concurrency",
-					  "io_combine_limit");
+					  "io_combine_limit", "effective_io_readahead_window");
 
 	/* ALTER TEXT SEARCH */
 	else if (Matches("ALTER", "TEXT", "SEARCH"))
diff --git a/src/include/commands/tablespace.h b/src/include/commands/tablespace.h
index c187283a6dd..94cd1a29985 100644
--- a/src/include/commands/tablespace.h
+++ b/src/include/commands/tablespace.h
@@ -44,6 +44,7 @@ typedef struct TableSpaceOpts
 	int			effective_io_concurrency;
 	int			maintenance_io_concurrency;
 	int			io_combine_limit;
+	int			effective_io_readahead_window;
 } TableSpaceOpts;
 
 extern Oid	CreateTableSpace(CreateTableSpaceStmt *stmt);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 241f68c45e1..d131a0fb9f4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -138,6 +138,13 @@ extern PGDLLIMPORT int maintenance_io_concurrency;
 #define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
 extern PGDLLIMPORT int io_combine_limit;
 
+#ifdef __linux__
+#define DEFAULT_EFFECTIVE_IO_READAHEAD_WINDOW ((128 * 1024) / BLCKSZ)
+#else
+#define DEFAULT_EFFECTIVE_IO_READAHEAD_WINDOW 0
+#endif
+extern PGDLLIMPORT int effective_io_readahead_window;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
diff --git a/src/include/utils/spccache.h b/src/include/utils/spccache.h
index f8a19764a8f..a9d163400f7 100644
--- a/src/include/utils/spccache.h
+++ b/src/include/utils/spccache.h
@@ -18,5 +18,6 @@ extern void get_tablespace_page_costs(Oid spcid, float8 *spc_random_page_cost,
 extern int	get_tablespace_io_concurrency(Oid spcid);
 extern int	get_tablespace_maintenance_io_concurrency(Oid spcid);
 extern int	get_tablespace_io_combine_limit(Oid spcid);
+extern int	get_tablespace_io_readahead_window(Oid spcid);
 
 #endif							/* SPCCACHE_H */
-- 
2.44.0

v14-0006-Increase-PG_IOV_MAX-for-bigger-io_combine_limit.patchtext/x-patch; charset=US-ASCII; name=v14-0006-Increase-PG_IOV_MAX-for-bigger-io_combine_limit.patchDownload

From 05bb918126ff01559808e66a63eea2f2f6b832ce Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 31 Mar 2024 17:24:44 +1300
Subject: [PATCH v14 6/7] Increase PG_IOV_MAX for bigger io_combine_limit.

PG_IOV_MAX, our clamped version of IOV_MAX, constrains io_combine_limit.
Change the clamp from 32 up to 128 vectors so that we can set
io_combine_limit up to 1MB instead of 256kB (assuming BLCKSZ == 8kB).
These numbers are fairly arbitrary and it's not at all clear that it's a
good idea to run with large numbers, but we might as well at least allow
experimentation.

This will also affect the size of the writes when initializing WAL
segments, which will change from 256kB to 1MB (assuming BLCKSZ == 8kB).
---
 src/include/port/pg_iovec.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h
index 7255c1bd911..5055ede2c8e 100644
--- a/src/include/port/pg_iovec.h
+++ b/src/include/port/pg_iovec.h
@@ -33,8 +33,11 @@ struct iovec
 
 #endif
 
-/* Define a reasonable maximum that is safe to use on the stack. */
-#define PG_IOV_MAX Min(IOV_MAX, 32)
+/*
+ * Define a reasonable maximum that is safe to use on the stack but also allows
+ * io_combine_limit to reach large sizes.
+ */
+#define PG_IOV_MAX Min(IOV_MAX, 128)
 
 /*
  * Like preadv(), but with a prefix to remind us of a side-effect: on Windows
-- 
2.44.0

#55

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#54)

3 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

I had been planning to commit v14 this morning but got cold feet with
the BMR-based interface. Heikki didn't like it much, and in the end,
neither did I. I have now removed it, and it seems much better. No
other significant changes, just parameter types and inlining details.
For example:

* read_stream_begin_relation() now takes a Relation, likes its name says
* StartReadBuffers()'s operation takes smgr and optional rel
* ReadBuffer_common() takes smgr and optional rel

ReadBuffer() (which calls ReadBuffer_common() which calls
StartReadBuffer() as before) now shows no regression in a tight loop
over ~1 million already-in-cache pages (something Heikki had observed
before and could only completely fix with a change that affected all
callers). The same test using read_stream.c is still slightly slower,
~1 million pages -in-cache pages 301ms -> 308ms, which seems
acceptable to me and could perhaps be chased down with more study of
inlining/specialisation. As mentioned before, it doesn't seem to be
measurable once you actually do something with the pages.

In some ways BMR was better than the "fake RelationData" concept
(another attempt at wrestling with the relation vs storage duality,
that is, the online vs recovery duality). But in other ways it was
worse: a weird inconsistent mixture of pass-by-pointer and
pass-by-value interfaces that required several code paths to handle it
being only partially initialised, which turned out to be wasted cycles
implicated in regressions, despite which it is not even very nice to
use anyway. I'm sure it could be made to work better, but I'm not yet
sure it's really needed. In later work for recovery I will need to
add a separate constructor read_stream_begin_smgr_something() anyway
for other reasons (multi-relation streaming, different callback) and
perhaps also a separate StartReadBuffersSmgr() if it saves measurable
cycles to strip out branches. Maybe it was all just premature
pessimisation.

So this is the version I'm going to commit shortly, barring objections.

Attachments:

v15-0001-Provide-vectored-variant-of-ReadBuffer.patchtext/x-patch; charset=US-ASCII; name=v15-0001-Provide-vectored-variant-of-ReadBuffer.patchDownload

From 4c3ec42aabaaf0b54f8c4393bef3411fed3a054f Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Tue, 2 Apr 2024 14:40:40 +1300
Subject: [PATCH v15 1/4] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two main advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the
kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.

A new GUC io_combine_limit is defined, and the functions for limiting
per-backend pin counts are now made into public APIs.  Those are
provided for the benefit of callers of StartReadBuffers(), who should
respect them when deciding how many buffers to read at once.  A later
commit will add a higher level mechanism for doing that automatically
with a more practical interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of just issuing
advice and leaving WaitReadBuffers() to do the work synchronously.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de> (some optimization tweaks)
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  14 +
 src/backend/storage/buffer/bufmgr.c           | 720 ++++++++++++------
 src/backend/storage/buffer/localbuf.c         |  14 +-
 src/backend/utils/misc/guc_tables.c           |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h                  |  50 ++
 src/tools/pgindent/typedefs.list              |   2 +
 7 files changed, 592 insertions(+), 223 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0e9617bcff4..624518e0b01 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2708,6 +2708,20 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-io-combine-limit" xreflabel="io_combine_limit">
+       <term><varname>io_combine_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_combine_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the largest I/O size in operations that combine I/O.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
        <term><varname>max_worker_processes</varname> (<type>integer</type>)
        <indexterm>
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..944ee271ba4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,10 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffer() -- as above, with separate wait step
+ * StartReadBuffers() -- multiple block version
+ * WaitReadBuffers() -- second step of above
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -152,6 +156,13 @@ int			effective_io_concurrency = DEFAULT_EFFECTIVE_IO_CONCURRENCY;
  */
 int			maintenance_io_concurrency = DEFAULT_MAINTENANCE_IO_CONCURRENCY;
 
+/*
+ * Limit on how many blocks should be handled in single I/O operations.
+ * StartReadBuffers() callers should respect it, as should other operations
+ * that call smgr APIs directly.
+ */
+int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
+
 /*
  * GUC variables about triggering kernel writeback for buffers written; OS
  * dependent defaults are set via the GUC mechanism.
@@ -471,10 +482,10 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(SMgrRelation smgr, Relation rel,
+								bool force_permament,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,18 +511,18 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
 static void shared_buffer_write_error_callback(void *arg);
 static void local_buffer_write_error_callback(void *arg);
-static BufferDesc *BufferAlloc(SMgrRelation smgr,
-							   char relpersistence,
-							   ForkNumber forkNum,
-							   BlockNumber blockNum,
-							   BufferAccessStrategy strategy,
-							   bool *foundPtr, IOContext io_context);
+static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
+									  char relpersistence,
+									  ForkNumber forkNum,
+									  BlockNumber blockNum,
+									  BufferAccessStrategy strategy,
+									  bool *foundPtr, IOContext io_context);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
@@ -777,11 +788,10 @@ ReadBuffer(Relation reln, BlockNumber blockNum)
  * If strategy is not NULL, a nondefault buffer access strategy is used.
  * See buffer/README for details.
  */
-Buffer
+inline Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -798,11 +808,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 	 * Read the buffer, and update pgstat counters to reflect a cache hit or
 	 * miss.
 	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(RelationGetSmgr(reln), reln, false,
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +830,11 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(smgr, NULL, permanent,
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -994,55 +1000,90 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr.smgr, bmr.rel, false,
+								   fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
 /*
- * ReadBuffer_common -- common logic for all ReadBuffer variants
- *
- * *hit is set to true if the request was satisfied from shared buffer cache.
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
  */
-static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
-				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
 {
 	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
+	uint32		buf_state;
 
-	*hit = false;
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
 
-	/*
-	 * Backward compatibility path, most code should use ExtendBufferedRel()
-	 * instead, as acquiring the extension lock inside ExtendBufferedRel()
-	 * scales a lot better.
-	 */
-	if (unlikely(blockNum == P_NEW))
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
 	{
-		uint32		flags = EB_SKIP_EXTENSION_LOCK;
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
 
-		/*
-		 * Since no-one else can be looking at the page contents yet, there is
-		 * no difference between an exclusive lock and a cleanup-strength
-		 * lock.
-		 */
-		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-			flags |= EB_LOCK_FIRST;
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
+/*
+ * Pin a buffer for a given block.  *foundPtr is set to true if the block was
+ * already present, or false if more work is required to either read it in or
+ * zero it.
+ */
+static pg_attribute_always_inline Buffer
+PinBufferForBlock(SMgrRelation smgr,
+				  Relation rel,
+				  bool force_permanent,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	IOContext	io_context;
+	IOObject	io_object;
+	char		persistence;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+	Assert(blockNum != P_NEW);
+
+	if (force_permanent || !rel)
+		persistence = RELPERSISTENCE_PERMANENT;
+	else
+		persistence = rel->rd_rel->relpersistence;
+
+	if (persistence == RELPERSISTENCE_TEMP)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
 	}
 
 	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
@@ -1051,50 +1092,34 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 									   smgr->smgr_rlocator.locator.relNumber,
 									   smgr->smgr_rlocator.backend);
 
-	if (isLocalBuf)
+	if (persistence == RELPERSISTENCE_TEMP)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
+		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
 			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
+	{
+		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (rel)
 	{
 		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
 		 */
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
+		pgstat_count_buffer_read(rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(rel);
 	}
-
-	/* At this point we do NOT hold any locks. */
-
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
@@ -1103,119 +1128,397 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 										  smgr->smgr_rlocator.locator.dbOid,
 										  smgr->smgr_rlocator.locator.relNumber,
 										  smgr->smgr_rlocator.backend,
-										  found);
+										  true);
+	}
+
+	return BufferDescriptorGetBuffer(bufHdr);
+}
+
+/*
+ * ReadBuffer_common -- common logic for all ReadBuffer variants
+ *
+ * smgr is required, rel is optional unless using P_NEW.
+ */
+static pg_attribute_always_inline Buffer
+ReadBuffer_common(SMgrRelation smgr, Relation rel, bool force_permanent,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum, ReadBufferMode mode,
+				  BufferAccessStrategy strategy)
+{
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			flags;
+
+	/*
+	 * Backward compatibility path, most code should use ExtendBufferedRel()
+	 * instead, as acquiring the extension lock inside ExtendBufferedRel()
+	 * scales a lot better.
+	 */
+	if (unlikely(blockNum == P_NEW))
+	{
+		uint32		flags = EB_SKIP_EXTENSION_LOCK;
 
 		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
+		 * Since no-one else can be looking at the page contents yet, there is
+		 * no difference between an exclusive lock and a cleanup-strength
+		 * lock.
 		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
+			flags |= EB_LOCK_FIRST;
 
-		return BufferDescriptorGetBuffer(bufHdr);
+		return ExtendBufferedRel(BMR_REL(rel), forkNum, strategy, flags);
 	}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	if (unlikely(mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+				 mode == RBM_ZERO_AND_LOCK))
+	{
+		bool		found;
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+		buffer = PinBufferForBlock(smgr, rel, force_permanent,
+								   forkNum, blockNum, strategy, &found);
+		ZeroBuffer(buffer, mode);
+		return buffer;
+	}
 
-	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
-	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
 	else
-	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		flags = 0;
+	operation.smgr = smgr;
+	operation.rel = rel;
+	operation.forknum = forkNum;
+	operation.strategy = strategy;
+	if (StartReadBuffer(&operation,
+						&buffer,
+						blockNum,
+						flags))
+		WaitReadBuffers(&operation);
+
+	return buffer;
+}
+
+static pg_attribute_always_inline bool
+StartReadBuffersImpl(ReadBuffersOperation *operation,
+					 Buffer *buffers,
+					 BlockNumber blockNum,
+					 int *nblocks,
+					 int flags)
+{
+	int			actual_nblocks = *nblocks;
+	int			io_buffers_len = 0;
+
+	Assert(*nblocks > 0);
+	Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+	for (int i = 0; i < actual_nblocks; ++i)
+	{
+		bool		found;
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		buffers[i] = PinBufferForBlock(operation->smgr,
+									   operation->rel,
+									   false,
+									   operation->forknum,
+									   blockNum + i,
+									   operation->strategy,
+									   &found);
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		if (found)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
-			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
-			}
-			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = i + 1;
+			break;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			io_buffers_len++;
 		}
 	}
+	*nblocks = actual_nblocks;
 
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
+	if (likely(io_buffers_len == 0))
+		return false;
+
+	/* Populate information needed for I/O. */
+	operation->buffers = buffers;
+	operation->blocknum = blockNum;
+	operation->flags = flags;
+	operation->nblocks = actual_nblocks;
+	operation->io_buffers_len = io_buffers_len;
+
+	if (flags & READ_BUFFERS_ISSUE_ADVICE)
 	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
+		/*
+		 * In theory we should only do this if PinBufferForBlock() had to
+		 * allocate new buffers above.  That way, if two calls to
+		 * StartReadBuffers() were made for the same blocks before
+		 * WaitReadBuffers(), only the first would issue the advice. That'd be
+		 * a better simulation of true asynchronous I/O, which would only
+		 * start the I/O once, but isn't done here for simplicity.  Note also
+		 * that the following call might actually issue two advice calls if we
+		 * cross a segment boundary; in a true asynchronous version we might
+		 * choose to process only one real I/O at a time in that case.
+		 */
+		smgrprefetch(operation->smgr,
+					 operation->forknum,
+					 blockNum,
+					 operation->io_buffers_len);
 	}
 
-	if (isLocalBuf)
+	/* Indicate that WaitReadBuffers() should be called. */
+	return true;
+}
+
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.  Caller sets some of the
+ * members of operation; see struct definition.
+ *
+ * If false is returned, no I/O is necessary.  If true is returned, one I/O
+ * has been started, and WaitReadBuffers() must be called with the same
+ * operation object before the buffers are accessed.  Along with the operation
+ * object, the caller-supplied array of buffers must remain valid until
+ * WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice if
+ * requested by the caller with READ_BUFFERS_ISSUE_ADVICE, and the real I/O
+ * happens synchronously in WaitReadBuffers().  In future work, true I/O could
+ * be initiated here.
+ */
+bool
+StartReadBuffers(ReadBuffersOperation *operation,
+				 Buffer *buffers,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 int flags)
+{
+	return StartReadBuffersImpl(operation, buffers, blockNum, nblocks, flags);
+}
+
+/*
+ * Single block version of the StartReadBuffers().  This might save a few
+ * instructions when called from another translation unit, because it is
+ * specialized for nblocks == 1.
+ */
+bool
+StartReadBuffer(ReadBuffersOperation *operation,
+				Buffer *buffer,
+				BlockNumber blocknum,
+				int flags)
+{
+	int			nblocks = 1;
+	bool		result;
+
+	result = StartReadBuffersImpl(operation, buffer, blocknum, &nblocks, flags);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	return result;
+}
+
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
 	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
+	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	IOContext	io_context;
+	IOObject	io_object;
+	char		persistence;
+
+	/*
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
+	 */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
+
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+
+	persistence = operation->rel
+		? operation->rel->rd_rel->relpersistence
+		: RELPERSISTENCE_PERMANENT;
+	if (persistence == RELPERSISTENCE_TEMP)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
 	}
 	else
 	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
 	}
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+	/*
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
+	 */
+	if (persistence == RELPERSISTENCE_TEMP)
+		pgBufferUsage.local_blks_read += nblocks;
+	else
+		pgBufferUsage.shared_blks_read += nblocks;
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+	for (int i = 0; i < nblocks; ++i)
+	{
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_IO_COMBINE_LIMIT];
+		void	   *io_pages[MAX_IO_COMBINE_LIMIT];
+		instr_time	io_start;
+		BlockNumber io_first_block;
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PinBufferForBlock().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  operation->smgr->smgr_rlocator.locator.spcOid,
+											  operation->smgr->smgr_rlocator.locator.dbOid,
+											  operation->smgr->smgr_rlocator.locator.relNumber,
+											  operation->smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
+
+		/*
+		 * How many neighboring-on-disk blocks can we can scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(operation->smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
+		{
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (persistence == RELPERSISTENCE_TEMP)
+			{
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
+			}
+			else
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
+
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(operation->smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(operation->smgr->smgr_rlocator, forknum))));
+			}
+
+			/* Terminate I/O and set BM_VALID. */
+			if (persistence == RELPERSISTENCE_TEMP)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
+
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  operation->smgr->smgr_rlocator.locator.spcOid,
+											  operation->smgr->smgr_rlocator.locator.dbOid,
+											  operation->smgr->smgr_rlocator.locator.relNumber,
+											  operation->smgr->smgr_rlocator.backend,
+											  false);
+		}
+
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for PinBufferForBlock.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1526,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1235,7 +1534,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * No locks are held either at entry or exit.
  */
-static BufferDesc *
+static pg_attribute_always_inline BufferDesc *
 BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			BlockNumber blockNum,
 			BufferAccessStrategy strategy,
@@ -1286,19 +1585,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1653,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1688,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +2044,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2309,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2332,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2647,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2661,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3729,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5464,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5485,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a57..985a2c7049c 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 92fcd5fa4d5..c12784cbec8 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3129,6 +3129,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"io_combine_limit",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Limit on the size of data reads and writes."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&io_combine_limit,
+		DEFAULT_IO_COMBINE_LIMIT,
+		1, MAX_IO_COMBINE_LIMIT,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
 			gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index adcc0257f91..baecde28410 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -203,6 +203,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..a0de7aaee3f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -106,6 +107,36 @@ typedef struct BufferManagerRelation
 #define BMR_REL(p_rel) ((BufferManagerRelation){.rel = p_rel})
 #define BMR_SMGR(p_smgr, p_relpersistence) ((BufferManagerRelation){.smgr = p_smgr, .relpersistence = p_relpersistence})
 
+typedef enum ReadBuffersFlags
+{
+	/* Zero out page if reading fails. */
+	READ_BUFFERS_ZERO_ON_ERROR = (1 << 0),
+
+	/* Call smgrprefetch() if I/O necessary. */
+	READ_BUFFERS_ISSUE_ADVICE = (1 << 1),
+} ReadBuffersFlags;
+
+struct ReadBuffersOperation
+{
+	/* The following members should be set by the caller. */
+	struct SMgrRelationData *smgr;
+	Relation	rel;
+	ForkNumber	forknum;
+	BufferAccessStrategy strategy;
+
+	/*
+	 * The following private members are private state for communication
+	 * between StartReadBuffers() and WaitReadBuffers(), initialized only if
+	 * an actual read is required, and should not be modified.
+	 */
+	Buffer	   *buffers;
+	BlockNumber blocknum;
+	int			flags;
+	int16		nblocks;
+	int16		io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
 
 /* forward declared, to avoid having to expose buf_internals.h here */
 struct WritebackContext;
@@ -133,6 +164,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int io_combine_limit;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -177,6 +212,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+extern bool StartReadBuffer(ReadBuffersOperation *operation,
+							Buffer *buffer,
+							BlockNumber blocknum,
+							int flags);
+extern bool StartReadBuffers(ReadBuffersOperation *operation,
+							 Buffer *buffers,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 int flags);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +297,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 79745ba9134..04484d41601 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2288,6 +2288,8 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersFlags
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.44.0

v15-0002-Provide-API-for-streaming-relation-data.patchtext/x-patch; charset=US-ASCII; name=v15-0002-Provide-API-for-streaming-relation-data.patchDownload

From 885659e1c982f7e206878d3d38d4f5de5ffddf2d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Feb 2024 00:01:42 +1300
Subject: [PATCH v15 2/4] Provide API for streaming relation data.

Introduce an abstraction where relation data can be accessed as a
stream of buffers, with an implementation that is more efficient than
the equivalent sequence of ReadBuffer() calls.

Client code supplies a callback that can say which block number is
wanted next, and then consumes individual buffers one at a time from the
stream.  This division allows read_stream.c to build up large calls to
StartReadBuffers() up to io_combine_limit, and issue posix_fadvise()
advice ahead of time in a systematic way when random access is detected.

This API is based on an idea from Andres Freund to pave the way for
asynchronous I/O in future work as required to support direct I/O.  The
goal is to have an abstraction that insulates client code from future
changes to the I/O subsystem that would benefit from information about
future needs.

An extended API may be necessary in future for more complicated cases
(for example recovery, whose LsnReadQueue device in xlogprefetcher.c is
a distant cousin of this code and should eventually be replaced by
this), but this basic API is sufficient for many common usage patterns
involving predictable access to a single relation fork.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (contributions)
Author: Melanie Plageman <melanieplageman@gmail.com> (contributions)
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/backend/storage/Makefile          |   2 +-
 src/backend/storage/aio/Makefile      |  14 +
 src/backend/storage/aio/meson.build   |   5 +
 src/backend/storage/aio/read_stream.c | 811 ++++++++++++++++++++++++++
 src/backend/storage/meson.build       |   1 +
 src/include/storage/read_stream.h     |  63 ++
 src/tools/pgindent/typedefs.list      |   2 +
 7 files changed, 897 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/read_stream.c
 create mode 100644 src/include/storage/read_stream.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..2f29a9ec4d1
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	read_stream.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..10e1aa3b20b
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'read_stream.c',
+)
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
new file mode 100644
index 00000000000..1a3167d74ae
--- /dev/null
+++ b/src/backend/storage/aio/read_stream.c
@@ -0,0 +1,811 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.c
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ * Code that needs to access relation data typically pins blocks one at a
+ * time, often in a predictable order that might be sequential or data-driven.
+ * Calling the simple ReadBuffer() function for each block is inefficient,
+ * because blocks that are not yet in the buffer pool require I/O operations
+ * that are small and might stall waiting for storage.  This mechanism looks
+ * into the future and calls StartReadBuffers() and WaitReadBuffers() to read
+ * neighboring blocks together and ahead of time, with an adaptive look-ahead
+ * distance.
+ *
+ * A user-provided callback generates a stream of block numbers that is used
+ * to form reads of up to io_combine_limit, by attempting to merge them with a
+ * pending read.  When that isn't possible, the existing pending read is sent
+ * to StartReadBuffers() so that a new one can begin to form.
+ *
+ * The algorithm for controlling the look-ahead distance tries to classify the
+ * stream into three ideal behaviors:
+ *
+ * A) No I/O is necessary, because the requested blocks are fully cached
+ * already.  There is no benefit to looking ahead more than one block, so
+ * distance is 1.  This is the default initial assumption.
+ *
+ * B) I/O is necessary, but fadvise is undesirable because the access is
+ * sequential, or impossible because direct I/O is enabled or the system
+ * doesn't support advice.  There is no benefit in looking ahead more than
+ * io_combine_limit, because in this case only goal is larger read system
+ * calls.  Looking further ahead would pin many buffers and perform
+ * speculative work looking ahead for no benefit.
+ *
+ * C) I/O is necesssary, it appears random, and this system supports fadvise.
+ * We'll look further ahead in order to reach the configured level of I/O
+ * concurrency.
+ *
+ * The distance increases rapidly and decays slowly, so that it moves towards
+ * those levels as different I/O patterns are discovered.  For example, a
+ * sequential scan of fully cached data doesn't bother looking ahead, but a
+ * sequential scan that hits a region of uncached blocks will start issuing
+ * increasingly wide read calls until it plateaus at io_combine_limit.
+ *
+ * The main data structure is a circular queue of buffers of size
+ * max_pinned_buffers plus some extra space for technical reasons, ready to be
+ * returned by read_stream_next_buffer().  Each buffer also has an optional
+ * variable sized object that is passed from the callback to the consumer of
+ * buffers.
+ *
+ * Parallel to the queue of buffers, there is a circular queue of in-progress
+ * I/Os that have been started with StartReadBuffers(), and for which
+ * WaitReadBuffers() must be called before returning the buffer.
+ *
+ * For example, if the callback return block numbers 10, 42, 43, 60 in
+ * successive calls, then these data structures might appear as follows:
+ *
+ *                          buffers buf/data       ios
+ *
+ *                          +----+  +-----+       +--------+
+ *                          |    |  |     |  +----+ 42..44 | <- oldest_io_index
+ *                          +----+  +-----+  |    +--------+
+ *   oldest_buffer_index -> | 10 |  |  ?  |  | +--+ 60..60 |
+ *                          +----+  +-----+  | |  +--------+
+ *                          | 42 |  |  ?  |<-+ |  |        | <- next_io_index
+ *                          +----+  +-----+    |  +--------+
+ *                          | 43 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 44 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 60 |  |  ?  |<---+
+ *                          +----+  +-----+
+ *     next_buffer_index -> |    |  |     |
+ *                          +----+  +-----+
+ *
+ * In the example, 5 buffers are pinned, and the next buffer to be streamed to
+ * the client is block 10.  Block 10 was a hit and has no associated I/O, but
+ * the range 42..44 requires an I/O wait before its buffers are returned, as
+ * does block 60.
+ *
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/aio/read_stream.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "catalog/pg_tablespace.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+#include "storage/read_stream.h"
+#include "utils/memdebug.h"
+#include "utils/rel.h"
+#include "utils/spccache.h"
+
+typedef struct InProgressIO
+{
+	int16		buffer_index;
+	ReadBuffersOperation op;
+} InProgressIO;
+
+/*
+ * State for managing a stream of reads.
+ */
+struct ReadStream
+{
+	int16		max_ios;
+	int16		ios_in_progress;
+	int16		queue_size;
+	int16		max_pinned_buffers;
+	int16		pinned_buffers;
+	int16		distance;
+	bool		advice_enabled;
+
+	/*
+	 * Small buffer of block numbers, useful for 'ungetting' to resolve flow
+	 * control problems when I/Os are split.  Also useful for batch-loading
+	 * block numbers in the fast path.
+	 */
+	BlockNumber blocknums[16];
+	int16		blocknums_count;
+	int16		blocknums_next;
+
+	/*
+	 * The callback that will tell us which block numbers to read, and an
+	 * opaque pointer that will be pass to it for its own purposes.
+	 */
+	ReadStreamBlockNumberCB callback;
+	void	   *callback_private_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* The read operation we are currently preparing. */
+	BlockNumber pending_read_blocknum;
+	int16		pending_read_nblocks;
+
+	/* Space for buffers and optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Read operations that have been started but not waited for yet. */
+	InProgressIO *ios;
+	int16		oldest_io_index;
+	int16		next_io_index;
+
+	bool		fast_path;
+
+	/* Circular queue of buffers. */
+	int16		oldest_buffer_index;	/* Next pinned buffer to return */
+	int16		next_buffer_index;	/* Index of next buffer to pin */
+	Buffer		buffers[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static inline void *
+get_per_buffer_data(ReadStream *stream, int16 buffer_index)
+{
+	return (char *) stream->per_buffer_data +
+		stream->per_buffer_data_size * buffer_index;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work and to allow the
+ * fast path to work in batches.
+ */
+static inline BlockNumber
+read_stream_get_block(ReadStream *stream, void *per_buffer_data)
+{
+	if (stream->blocknums_next < stream->blocknums_count)
+		return stream->blocknums[stream->blocknums_next++];
+
+	/*
+	 * We only bother to fetch one at a time here (but see the fast path which
+	 * uses more).
+	 */
+	return stream->callback(stream,
+							stream->callback_private_data,
+							per_buffer_data);
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.
+ */
+static inline void
+read_stream_unget_block(ReadStream *stream, BlockNumber blocknum)
+{
+	if (stream->blocknums_next == stream->blocknums_count)
+	{
+		/* Never initialized or entirely consumed.  Re-initialize. */
+		stream->blocknums[0] = blocknum;
+		stream->blocknums_count = 1;
+		stream->blocknums_next = 0;
+	}
+	else
+	{
+		/* Must be the last value return from blocknums array. */
+		Assert(stream->blocknums_next > 0);
+		stream->blocknums_next--;
+		Assert(stream->blocknums[stream->blocknums_next] == blocknum);
+	}
+}
+
+#ifndef READ_STREAM_DISABLE_FAST_PATH
+static void
+read_stream_fill_blocknums(ReadStream *stream)
+{
+	BlockNumber blocknum;
+	int			i = 0;
+
+	do
+	{
+		blocknum = stream->callback(stream,
+									stream->callback_private_data,
+									NULL);
+		stream->blocknums[i++] = blocknum;
+	} while (i < lengthof(stream->blocknums) &&
+			 blocknum != InvalidBlockNumber);
+	stream->blocknums_count = i;
+	stream->blocknums_next = 0;
+}
+#endif
+
+static void
+read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
+{
+	bool		need_wait;
+	int			nblocks;
+	int			flags;
+	int16		io_index;
+	int16		overflow;
+	int16		buffer_index;
+
+	/* This should only be called with a pending read. */
+	Assert(stream->pending_read_nblocks > 0);
+	Assert(stream->pending_read_nblocks <= io_combine_limit);
+
+	/* We had better not exceed the pin limit by starting this read. */
+	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+		   stream->max_pinned_buffers);
+
+	/* We had better not be overwriting an existing pinned buffer. */
+	if (stream->pinned_buffers > 0)
+		Assert(stream->next_buffer_index != stream->oldest_buffer_index);
+	else
+		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
+
+	/*
+	 * If advice hasn't been suppressed, this system supports it, and this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (!suppress_advice &&
+		stream->advice_enabled &&
+		stream->pending_read_blocknum != stream->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+	/* We say how many blocks we want to read, but may be smaller on return. */
+	buffer_index = stream->next_buffer_index;
+	io_index = stream->next_io_index;
+	nblocks = stream->pending_read_nblocks;
+	need_wait = StartReadBuffers(&stream->ios[io_index].op,
+								 &stream->buffers[buffer_index],
+								 stream->pending_read_blocknum,
+								 &nblocks,
+								 flags);
+	stream->pinned_buffers += nblocks;
+
+	/* Remember whether we need to wait before returning this buffer. */
+	if (!need_wait)
+	{
+		/* Look-ahead distance decays, no I/O necessary (behavior A). */
+		if (stream->distance > 1)
+			stream->distance--;
+	}
+	else
+	{
+		/*
+		 * Remember to call WaitReadBuffers() before returning head buffer.
+		 * Look-ahead distance will be adjusted after waiting.
+		 */
+		stream->ios[io_index].buffer_index = buffer_index;
+		if (++stream->next_io_index == stream->max_ios)
+			stream->next_io_index = 0;
+		Assert(stream->ios_in_progress < stream->max_ios);
+		stream->ios_in_progress++;
+		stream->seq_blocknum = stream->pending_read_blocknum + nblocks;
+	}
+
+	/*
+	 * We gave a contiguous range of buffer space to StartReadBuffers(), but
+	 * we want it to wrap around at queue_size.  Slide overflowing buffers to
+	 * the front of the array.
+	 */
+	overflow = (buffer_index + nblocks) - stream->queue_size;
+	if (overflow > 0)
+		memmove(&stream->buffers[0],
+				&stream->buffers[stream->queue_size],
+				sizeof(stream->buffers[0]) * overflow);
+
+	/* Compute location of start of next read, without using % operator. */
+	buffer_index += nblocks;
+	if (buffer_index >= stream->queue_size)
+		buffer_index -= stream->queue_size;
+	Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+	stream->next_buffer_index = buffer_index;
+
+	/* Adjust the pending read to cover the remaining portion, if any. */
+	stream->pending_read_blocknum += nblocks;
+	stream->pending_read_nblocks -= nblocks;
+}
+
+static void
+read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
+{
+	while (stream->ios_in_progress < stream->max_ios &&
+		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+	{
+		BlockNumber blocknum;
+		int16		buffer_index;
+		void	   *per_buffer_data;
+
+		if (stream->pending_read_nblocks == io_combine_limit)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			continue;
+		}
+
+		/*
+		 * See which block the callback wants next in the stream.  We need to
+		 * compute the index of the Nth block of the pending read including
+		 * wrap-around, but we don't want to use the expensive % operator.
+		 */
+		buffer_index = stream->next_buffer_index + stream->pending_read_nblocks;
+		if (buffer_index >= stream->queue_size)
+			buffer_index -= stream->queue_size;
+		Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+		per_buffer_data = get_per_buffer_data(stream, buffer_index);
+		blocknum = read_stream_get_block(stream, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			/* End of stream. */
+			stream->distance = 0;
+			break;
+		}
+
+		/* Can we merge it with the pending read? */
+		if (stream->pending_read_nblocks > 0 &&
+			stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
+		{
+			stream->pending_read_nblocks++;
+			continue;
+		}
+
+		/* We have to start the pending read before we can build another. */
+		if (stream->pending_read_nblocks > 0)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			if (stream->ios_in_progress == stream->max_ios)
+			{
+				/* And we've hit the limit.  Rewind, and stop here. */
+				read_stream_unget_block(stream, blocknum);
+				return;
+			}
+		}
+
+		/* This is the start of a new pending read. */
+		stream->pending_read_blocknum = blocknum;
+		stream->pending_read_nblocks = 1;
+	}
+
+	/*
+	 * We don't start the pending read just because we've hit the distance
+	 * limit, preferring to give it another chance to grow to full
+	 * io_combine_limit size once more buffers have been consumed.  However,
+	 * if we've already reached io_combine_limit, or we've reached the
+	 * distance limit and there isn't anything pinned yet, or the callback has
+	 * signaled end-of-stream, we start the read immediately.
+	 */
+	if (stream->pending_read_nblocks > 0 &&
+		(stream->pending_read_nblocks == io_combine_limit ||
+		 (stream->pending_read_nblocks == stream->distance &&
+		  stream->pinned_buffers == 0) ||
+		 stream->distance == 0) &&
+		stream->ios_in_progress < stream->max_ios)
+		read_stream_start_pending_read(stream, suppress_advice);
+}
+
+/*
+ * Create a new read stream object that can be used to perform the equivalent
+ * of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.  The callback should return block numbers or InvalidBlockNumber to
+ * signal end-of-stream, and if per_buffer_data_size is non-zero, it may also
+ * write extra data for each block into the space provided to it.  It will
+ * also receive callback_private_data for its own purposes.
+ */
+ReadStream *
+read_stream_begin_relation(int flags,
+						   BufferAccessStrategy strategy,
+						   Relation rel,
+						   ForkNumber forknum,
+						   ReadStreamBlockNumberCB callback,
+						   void *callback_private_data,
+						   size_t per_buffer_data_size)
+{
+	ReadStream *stream;
+	size_t		size;
+	int16		queue_size;
+	int16		max_ios;
+	uint32		max_pinned_buffers;
+	Oid			tablespace_id;
+	SMgrRelation smgr;
+
+	smgr = RelationGetSmgr(rel);
+
+	/*
+	 * Decide how many I/Os we will allow to run at the same time.  That
+	 * currently means advice to the kernel to tell it that we will soon read.
+	 * This number also affects how far we look ahead for opportunities to
+	 * start more I/Os.
+	 */
+	tablespace_id = smgr->smgr_rlocator.locator.spcOid;
+	if (!OidIsValid(MyDatabaseId) ||
+		IsCatalogRelation(rel) ||
+		IsCatalogRelationOid(smgr->smgr_rlocator.locator.relNumber))
+	{
+		/*
+		 * Avoid circularity while trying to look up tablespace settings or
+		 * before spccache.c is ready.
+		 */
+		max_ios = effective_io_concurrency;
+	}
+	else if (flags & READ_STREAM_MAINTENANCE)
+		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+	else
+		max_ios = get_tablespace_io_concurrency(tablespace_id);
+	max_ios = Min(max_ios, PG_INT16_MAX);
+
+	/*
+	 * Choose the maximum number of buffers we're prepared to pin.  We try to
+	 * pin fewer if we can, though.  We clamp it to at least io_combine_limit
+	 * so that we can have a chance to build up a full io_combine_limit sized
+	 * read, even when max_ios is zero.  Be careful not to allow int16 to
+	 * overflow (even though that's not possible with the current GUC range
+	 * limits), allowing also for the spare entry and the overflow space.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+	max_pinned_buffers = Min(max_pinned_buffers,
+							 PG_INT16_MAX - io_combine_limit - 1);
+
+	/* Don't allow this backend to pin more than its share of buffers. */
+	if (SmgrIsTemp(smgr))
+		LimitAdditionalLocalPins(&max_pinned_buffers);
+	else
+		LimitAdditionalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * We need one extra entry for buffers and per-buffer data, because users
+	 * of per-buffer data have access to the object until the next call to
+	 * read_stream_next_buffer(), so we need a gap between the head and tail
+	 * of the queue so that we don't clobber it.
+	 */
+	queue_size = max_pinned_buffers + 1;
+
+	/*
+	 * Allocate the object, the buffers, the ios and per_data_data space in
+	 * one big chunk.  Though we have queue_size buffers, we want to be able
+	 * to assume that all the buffers for a single read are contiguous (i.e.
+	 * don't wrap around halfway through), so we allow temporary overflows of
+	 * up to the maximum possible read size by allocating an extra
+	 * io_combine_limit - 1 elements.
+	 */
+	size = offsetof(ReadStream, buffers);
+	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(InProgressIO) * Max(1, max_ios);
+	size += per_buffer_data_size * queue_size;
+	size += MAXIMUM_ALIGNOF * 2;
+	stream = (ReadStream *) palloc(size);
+	memset(stream, 0, offsetof(ReadStream, buffers));
+	stream->ios = (InProgressIO *)
+		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+	if (per_buffer_data_size > 0)
+		stream->per_buffer_data = (void *)
+			MAXALIGN(&stream->ios[Max(1, max_ios)]);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  We can use it as long as
+	 * direct I/O isn't enabled, the caller hasn't promised sequential access
+	 * (overriding our detection heuristics), and max_ios hasn't been set to
+	 * zero.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & READ_STREAM_SEQUENTIAL) == 0 &&
+		max_ios > 0)
+		stream->advice_enabled = true;
+#endif
+
+	/*
+	 * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
+	 * above.  If we had real asynchronous I/O we might need a slightly
+	 * different definition.
+	 */
+	if (max_ios == 0)
+		max_ios = 1;
+
+	stream->max_ios = max_ios;
+	stream->per_buffer_data_size = per_buffer_data_size;
+	stream->max_pinned_buffers = max_pinned_buffers;
+	stream->queue_size = queue_size;
+	stream->callback = callback;
+	stream->callback_private_data = callback_private_data;
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out assuming we'll be
+	 * doing full io_combine_limit sized reads (behavior B).
+	 */
+	if (flags & READ_STREAM_FULL)
+		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+	else
+		stream->distance = 1;
+
+	/*
+	 * Since we always currently always access the same relation, we can
+	 * initialize parts of the ReadBuffersOperation objects and leave them
+	 * that way, to avoid wasting CPU cycles writing to them for each read.
+	 */
+	for (int i = 0; i < max_ios; ++i)
+	{
+		stream->ios[i].op.smgr = RelationGetSmgr(rel);
+		stream->ios[i].op.rel = rel;
+		stream->ios[i].op.forknum = forknum;
+		stream->ios[i].op.strategy = strategy;
+	}
+
+	return stream;
+}
+
+/*
+ * Pull one pinned buffer out of a stream.  Each call returns successive
+ * blocks in the order specified by the callback.  If per_buffer_data_size was
+ * set to a non-zero size, *per_buffer_data receives a pointer to the extra
+ * per-buffer data that the callback had a chance to populate, which remains
+ * valid until the next call to read_stream_next_buffer().  When the stream
+ * runs out of data, InvalidBuffer is returned.  The caller may decide to end
+ * the stream early at any time by calling read_stream_end().
+ */
+Buffer
+read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
+{
+	Buffer		buffer;
+	int16		oldest_buffer_index;
+
+#ifndef READ_STREAM_DISABLE_FAST_PATH
+
+	/*
+	 * A fast path for all-cached scans (behavior A).  This is the same as the
+	 * usual algorithm, but it is specialized for no I/O and no per-buffer
+	 * data, so we can skip the queue management code, stay in the same buffer
+	 * slot and use singular StartReadBuffer().
+	 */
+	if (likely(stream->fast_path))
+	{
+		BlockNumber next_blocknum;
+		bool		need_wait;
+
+		/* Fast path assumptions. */
+		Assert(stream->ios_in_progress == 0);
+		Assert(stream->pinned_buffers == 1);
+		Assert(stream->distance == 1);
+		Assert(stream->pending_read_nblocks == 1);
+		Assert(stream->per_buffer_data_size == 0);
+
+		/* We're going to return the buffer we pinned last time. */
+		oldest_buffer_index = stream->oldest_buffer_index;
+		Assert((oldest_buffer_index + 1) % stream->queue_size ==
+			   stream->next_buffer_index);
+		buffer = stream->buffers[oldest_buffer_index];
+		Assert(buffer != InvalidBuffer);
+
+		/*
+		 * Pin a buffer for the next call.  Same buffer entry, and arbitrary
+		 * I/O entry (they're all free).
+		 */
+		need_wait = StartReadBuffer(&stream->ios[0].op,
+									&stream->buffers[oldest_buffer_index],
+									stream->pending_read_blocknum,
+									stream->advice_enabled ?
+									READ_BUFFERS_ISSUE_ADVICE : 0);
+
+		/* Choose the block the next call will pin. */
+		if (unlikely(stream->blocknums_next == stream->blocknums_count))
+			read_stream_fill_blocknums(stream);
+		next_blocknum = stream->blocknums[stream->blocknums_next++];
+
+		/*
+		 * Fast return if the next call doesn't require I/O for the buffer we
+		 * just pinned, and we have a block number to give it as a pending
+		 * read.
+		 */
+		if (likely(!need_wait && next_blocknum != InvalidBlockNumber))
+		{
+			stream->pending_read_blocknum = next_blocknum;
+			return buffer;
+		}
+
+		/*
+		 * For anything more complex, set up some more state and take the slow
+		 * path next time.
+		 */
+		stream->fast_path = false;
+
+		if (need_wait)
+		{
+			/* Next call must wait for I/O for the newly pinned buffer. */
+			stream->oldest_io_index = 0;
+			stream->next_io_index = stream->max_ios > 1 ? 1 : 0;
+			stream->ios_in_progress = 1;
+			stream->ios[0].buffer_index = oldest_buffer_index;
+			stream->seq_blocknum = next_blocknum + 1;
+		}
+		if (next_blocknum == InvalidBlockNumber)
+		{
+			/* Next call hits end of stream and can't pin anything more. */
+			stream->distance = 0;
+			stream->pending_read_nblocks = 0;
+		}
+		else
+		{
+			/* Set up the pending read. */
+			stream->pending_read_blocknum = next_blocknum;
+		}
+		return buffer;
+	}
+#endif
+
+	if (unlikely(stream->pinned_buffers == 0))
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+
+		/* End of stream reached?  */
+		if (stream->distance == 0)
+			return InvalidBuffer;
+
+		/*
+		 * The usual order of operations is that we look ahead at the bottom
+		 * of this function after potentially finishing an I/O and making
+		 * space for more, but if we're just starting up we'll need to crank
+		 * the handle to get started.
+		 */
+		read_stream_look_ahead(stream, true);
+
+		/* End of stream reached? */
+		if (stream->pinned_buffers == 0)
+		{
+			Assert(stream->distance == 0);
+			return InvalidBuffer;
+		}
+	}
+
+	/* Grab the oldest pinned buffer and associated per-buffer data. */
+	Assert(stream->pinned_buffers > 0);
+	oldest_buffer_index = stream->oldest_buffer_index;
+	Assert(oldest_buffer_index >= 0 &&
+		   oldest_buffer_index < stream->queue_size);
+	buffer = stream->buffers[oldest_buffer_index];
+	if (per_buffer_data)
+		*per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
+
+	Assert(BufferIsValid(buffer));
+
+	/* Do we have to wait for an associated I/O first? */
+	if (stream->ios_in_progress > 0 &&
+		stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
+	{
+		int16		io_index = stream->oldest_io_index;
+		int16		distance;
+
+		/* Sanity check that we still agree on the buffers. */
+		Assert(stream->ios[io_index].op.buffers ==
+			   &stream->buffers[oldest_buffer_index]);
+
+		WaitReadBuffers(&stream->ios[io_index].op);
+
+		Assert(stream->ios_in_progress > 0);
+		stream->ios_in_progress--;
+		if (++stream->oldest_io_index == stream->max_ios)
+			stream->oldest_io_index = 0;
+
+		if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/* Distance ramps up fast (behavior C). */
+			distance = stream->distance * 2;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
+		else
+		{
+			/* No advice; move towards io_combine_limit (behavior B). */
+			if (stream->distance > io_combine_limit)
+			{
+				stream->distance--;
+			}
+			else
+			{
+				distance = stream->distance * 2;
+				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, stream->max_pinned_buffers);
+				stream->distance = distance;
+			}
+		}
+	}
+
+#ifdef CLOBBER_FREED_MEMORY
+	/* Clobber old buffer and per-buffer data for debugging purposes. */
+	stream->buffers[oldest_buffer_index] = InvalidBuffer;
+
+	/*
+	 * The caller will get access to the per-buffer data, until the next call.
+	 * We wipe the one before, which is never occupied because queue_size
+	 * allowed one extra element.  This will hopefully trip up client code
+	 * that is holding a dangling pointer to it.
+	 */
+	if (stream->per_buffer_data)
+		wipe_mem(get_per_buffer_data(stream,
+									 oldest_buffer_index == 0 ?
+									 stream->queue_size - 1 :
+									 oldest_buffer_index - 1),
+				 stream->per_buffer_data_size);
+#endif
+
+	/* Pin transferred to caller. */
+	Assert(stream->pinned_buffers > 0);
+	stream->pinned_buffers--;
+
+	/* Advance oldest buffer, with wrap-around. */
+	stream->oldest_buffer_index++;
+	if (stream->oldest_buffer_index == stream->queue_size)
+		stream->oldest_buffer_index = 0;
+
+	/* Prepare for the next call. */
+	read_stream_look_ahead(stream, false);
+
+#ifndef READ_STREAM_DISABLE_FAST_PATH
+	/* See if we can take the fast path for all-cached scans next time. */
+	if (stream->ios_in_progress == 0 &&
+		stream->pinned_buffers == 1 &&
+		stream->distance == 1 &&
+		stream->pending_read_nblocks == 1 &&
+		stream->per_buffer_data_size == 0)
+	{
+		stream->fast_path = true;
+	}
+	else
+	{
+		stream->fast_path = false;
+	}
+#endif
+
+	return buffer;
+}
+
+/*
+ * Reset a read stream by releasing any queued up buffers, allowing the stream
+ * to be used again for different blocks.  This can be used to clear an
+ * end-of-stream condition and start again, or to throw away blocks that were
+ * speculatively read and read some different blocks instead.
+ */
+void
+read_stream_reset(ReadStream *stream)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	stream->distance = 0;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(stream->pinned_buffers == 0);
+	Assert(stream->ios_in_progress == 0);
+
+	/* Start off assuming data is cached. */
+	stream->distance = 1;
+}
+
+/*
+ * Release and free a read stream.
+ */
+void
+read_stream_end(ReadStream *stream)
+{
+	read_stream_reset(stream);
+	pfree(stream);
+}
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
new file mode 100644
index 00000000000..fae09d2b4cc
--- /dev/null
+++ b/src/include/storage/read_stream.h
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.h
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/read_stream.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef READ_STREAM_H
+#define READ_STREAM_H
+
+#include "storage/bufmgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define READ_STREAM_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users, and thus should be governed by maintenance_io_concurrency
+ * instead of effective_io_concurrency.  For example, VACUUM or CREATE INDEX.
+ */
+#define READ_STREAM_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define READ_STREAM_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define READ_STREAM_FULL 0x04
+
+struct ReadStream;
+typedef struct ReadStream ReadStream;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
+												void *callback_private_data,
+												void *per_buffer_data);
+
+extern ReadStream *read_stream_begin_relation(int flags,
+											  BufferAccessStrategy strategy,
+											  Relation rel,
+											  ForkNumber forknum,
+											  ReadStreamBlockNumberCB callback,
+											  void *callback_private_data,
+											  size_t per_buffer_data_size);
+extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
+extern void read_stream_reset(ReadStream *stream);
+extern void read_stream_end(ReadStream *stream);
+
+#endif							/* READ_STREAM_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 04484d41601..8bc8dd6f1c6 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1215,6 +1215,7 @@ InjectionPointCacheEntry
 InjectionPointEntry
 InjectionPointSharedState
 InlineCodeBlock
+InProgressIO
 InsertStmt
 Instrumentation
 Int128AggState
@@ -2295,6 +2296,7 @@ ReadExtraTocPtrType
 ReadFunc
 ReadLocalXLogPageNoWaitPrivate
 ReadReplicationSlotCmd
+ReadStream
 ReassignOwnedStmt
 RecheckForeignScan_function
 RecordCacheArrayEntry
-- 
2.44.0

v15-0003-Use-streaming-I-O-in-pg_prewarm.patchtext/x-patch; charset=US-ASCII; name=v15-0003-Use-streaming-I-O-in-pg_prewarm.patchDownload

From 149e7a7fb6f632fd63458e3c9f08014ef00dffde Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 23 Jul 2023 09:28:42 +1200
Subject: [PATCH v15 3/4] Use streaming I/O in pg_prewarm.

Instead of calling ReadBuffer() repeatedly, use the new streaming read
interfaces.  This commit provides a very simple example of such a
transformation, and generates fewer system calls.

Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 contrib/pg_prewarm/pg_prewarm.c | 40 ++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..5c859e983c5 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -19,6 +19,7 @@
 #include "fmgr.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_read_stream_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_read_stream_next_block(ReadStream *stream,
+								  void *callback_private_data,
+								  void *per_buffer_data)
+{
+	struct pg_prewarm_read_stream_private *p = callback_private_data;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_read_stream_private p;
+		ReadStream *stream;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		stream = read_stream_begin_relation(READ_STREAM_FULL,
+											NULL,
+											rel,
+											forkNumber,
+											pg_prewarm_read_stream_next_block,
+											&p,
+											0);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = read_stream_next_buffer(stream, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
 	}
 
 	/* Close relation, release lock. */
-- 
2.44.0

#56

Thomas Munro

thomas.munro@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#55)

3 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

On Tue, Apr 2, 2024 at 9:39 PM Thomas Munro <thomas.munro@gmail.com> wrote:

So this is the version I'm going to commit shortly, barring objections.

And done, after fixing a small snafu with smgr-only reads coming from
CreateAndCopyRelationData() (BM_PERMANENT would be
incorrectly/unnecessarily set for unlogged tables).

Here are the remaining patches discussed in this thread. They give
tablespace-specific io_combine_limit, effective_io_readahead_window
(is this useful?), and up-to-1MB io_combine_limit (is this useful?).
I think the first two would probably require teaching reloption.c how
to use guc.c's parse_int() and unit flags, but I won't have time to
look at that for this release so I'll just leave these here.

On the subject of guc.c, this is a terrible error message... did I do
something wrong?

postgres=# set io_combine_limit = '42MB';
ERROR: 5376 8kB is outside the valid range for parameter
"io_combine_limit" (1 .. 32)

Attachments:

v16-0001-ALTER-TABLESPACE-.-SET-io_combine_limit.patchtext/x-patch; charset=US-ASCII; name=v16-0001-ALTER-TABLESPACE-.-SET-io_combine_limit.patchDownload

From 84b8280481312cdd1efcb7efa1182d4647cbe00a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 30 Mar 2024 19:09:44 +1300
Subject: [PATCH v16 1/4] ALTER TABLESPACE ... SET (io_combine_limit = ...).

This is the per-tablespace version of the GUC of the same name.

XXX reloptions.c lacks the ability to accept units eg '64kB'!  Which is
why I haven't included it with the main feature commit.

Suggested-by: Tomas Vondra <tomas.vondra@enterprisedb.com
Discussion: https://postgr.es/m/f603ac51-a7ff-496a-99c1-76673635692e%40enterprisedb.com
---
 doc/src/sgml/ref/alter_tablespace.sgml |  9 +++---
 src/backend/access/common/reloptions.c | 12 +++++++-
 src/backend/storage/aio/read_stream.c  | 39 ++++++++++++++++----------
 src/backend/utils/cache/spccache.c     | 14 +++++++++
 src/bin/psql/tab-complete.c            |  3 +-
 src/include/commands/tablespace.h      |  1 +
 src/include/utils/spccache.h           |  1 +
 7 files changed, 58 insertions(+), 21 deletions(-)

diff --git a/doc/src/sgml/ref/alter_tablespace.sgml b/doc/src/sgml/ref/alter_tablespace.sgml
index 6ec863400d1..faf0c6e7fbc 100644
--- a/doc/src/sgml/ref/alter_tablespace.sgml
+++ b/doc/src/sgml/ref/alter_tablespace.sgml
@@ -84,16 +84,17 @@ ALTER TABLESPACE <replaceable>name</replaceable> RESET ( <replaceable class="par
      <para>
       A tablespace parameter to be set or reset.  Currently, the only
       available parameters are <varname>seq_page_cost</varname>,
-      <varname>random_page_cost</varname>, <varname>effective_io_concurrency</varname>
-      and <varname>maintenance_io_concurrency</varname>.
+      <varname>random_page_cost</varname>, <varname>effective_io_concurrency</varname>,
+      <varname>maintenance_io_concurrency</varname> and <varname>io_combine_limit</varname>.
       Setting these values for a particular tablespace will override the
       planner's usual estimate of the cost of reading pages from tables in
-      that tablespace, and the executor's prefetching behavior, as established
+      that tablespace, and the executor's prefetching and I/O sizing behavior, as established
       by the configuration parameters of the
       same name (see <xref linkend="guc-seq-page-cost"/>,
       <xref linkend="guc-random-page-cost"/>,
       <xref linkend="guc-effective-io-concurrency"/>,
-      <xref linkend="guc-maintenance-io-concurrency"/>).  This may be useful if
+      <xref linkend="guc-maintenance-io-concurrency"/>),
+      <xref linkend="guc-io-combine-limit"/>).  This may be useful if
       one tablespace is located on a disk which is faster or slower than the
       remainder of the I/O subsystem.
      </para>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index d6eb5d85599..1e1c611fab2 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -371,6 +371,15 @@ static relopt_int intRelOpts[] =
 		0, 0, 0
 #endif
 	},
+	{
+		{
+			"io_combine_limit",
+			"Limit on the size of data reads and writes.",
+			RELOPT_KIND_TABLESPACE,
+			ShareUpdateExclusiveLock
+		},
+		-1, 1, MAX_IO_COMBINE_LIMIT
+	},
 	{
 		{
 			"parallel_workers",
@@ -2089,7 +2098,8 @@ tablespace_reloptions(Datum reloptions, bool validate)
 		{"random_page_cost", RELOPT_TYPE_REAL, offsetof(TableSpaceOpts, random_page_cost)},
 		{"seq_page_cost", RELOPT_TYPE_REAL, offsetof(TableSpaceOpts, seq_page_cost)},
 		{"effective_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, effective_io_concurrency)},
-		{"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)}
+		{"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)},
+		{"io_combine_limit", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, io_combine_limit)},
 	};
 
 	return (bytea *) build_reloptions(reloptions, validate,
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 4f21262ff5e..907c80e6bf9 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -114,6 +114,7 @@ struct ReadStream
 	int16		max_pinned_buffers;
 	int16		pinned_buffers;
 	int16		distance;
+	int16		io_combine_limit;
 	bool		advice_enabled;
 
 	/*
@@ -241,7 +242,7 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 
 	/* This should only be called with a pending read. */
 	Assert(stream->pending_read_nblocks > 0);
-	Assert(stream->pending_read_nblocks <= io_combine_limit);
+	Assert(stream->pending_read_nblocks <= stream->io_combine_limit);
 
 	/* We had better not exceed the pin limit by starting this read. */
 	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
@@ -329,7 +330,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		int16		buffer_index;
 		void	   *per_buffer_data;
 
-		if (stream->pending_read_nblocks == io_combine_limit)
+		if (stream->pending_read_nblocks == stream->io_combine_limit)
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
@@ -389,7 +390,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 	 * signaled end-of-stream, we start the read immediately.
 	 */
 	if (stream->pending_read_nblocks > 0 &&
-		(stream->pending_read_nblocks == io_combine_limit ||
+		(stream->pending_read_nblocks == stream->io_combine_limit ||
 		 (stream->pending_read_nblocks == stream->distance &&
 		  stream->pinned_buffers == 0) ||
 		 stream->distance == 0) &&
@@ -419,6 +420,7 @@ read_stream_begin_relation(int flags,
 	size_t		size;
 	int16		queue_size;
 	int16		max_ios;
+	int16		my_io_combine_limit;
 	uint32		max_pinned_buffers;
 	Oid			tablespace_id;
 	SMgrRelation smgr;
@@ -437,15 +439,21 @@ read_stream_begin_relation(int flags,
 		IsCatalogRelationOid(smgr->smgr_rlocator.locator.relNumber))
 	{
 		/*
-		 * Avoid circularity while trying to look up tablespace settings or
-		 * before spccache.c is ready.
+		 * Avoid circularity while trying to look up tablespace catalog itself
+		 * or before spccache.c is ready: just use the plain GUC values in
+		 * those cases.
 		 */
 		max_ios = effective_io_concurrency;
+		my_io_combine_limit = io_combine_limit;
 	}
-	else if (flags & READ_STREAM_MAINTENANCE)
-		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
 	else
-		max_ios = get_tablespace_io_concurrency(tablespace_id);
+	{
+		if (flags & READ_STREAM_MAINTENANCE)
+			max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+		else
+			max_ios = get_tablespace_io_concurrency(tablespace_id);
+		my_io_combine_limit = get_tablespace_io_combine_limit(tablespace_id);
+	}
 	max_ios = Min(max_ios, PG_INT16_MAX);
 
 	/*
@@ -456,9 +464,9 @@ read_stream_begin_relation(int flags,
 	 * overflow (even though that's not possible with the current GUC range
 	 * limits), allowing also for the spare entry and the overflow space.
 	 */
-	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+	max_pinned_buffers = Max(max_ios * 4, my_io_combine_limit);
 	max_pinned_buffers = Min(max_pinned_buffers,
-							 PG_INT16_MAX - io_combine_limit - 1);
+							 PG_INT16_MAX - my_io_combine_limit - 1);
 
 	/* Don't allow this backend to pin more than its share of buffers. */
 	if (SmgrIsTemp(smgr))
@@ -484,14 +492,14 @@ read_stream_begin_relation(int flags,
 	 * io_combine_limit - 1 elements.
 	 */
 	size = offsetof(ReadStream, buffers);
-	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(Buffer) * (queue_size + my_io_combine_limit - 1);
 	size += sizeof(InProgressIO) * Max(1, max_ios);
 	size += per_buffer_data_size * queue_size;
 	size += MAXIMUM_ALIGNOF * 2;
 	stream = (ReadStream *) palloc(size);
 	memset(stream, 0, offsetof(ReadStream, buffers));
 	stream->ios = (InProgressIO *)
-		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+		MAXALIGN(&stream->buffers[queue_size + my_io_combine_limit - 1]);
 	if (per_buffer_data_size > 0)
 		stream->per_buffer_data = (void *)
 			MAXALIGN(&stream->ios[Max(1, max_ios)]);
@@ -519,6 +527,7 @@ read_stream_begin_relation(int flags,
 		max_ios = 1;
 
 	stream->max_ios = max_ios;
+	stream->io_combine_limit = my_io_combine_limit;
 	stream->per_buffer_data_size = per_buffer_data_size;
 	stream->max_pinned_buffers = max_pinned_buffers;
 	stream->queue_size = queue_size;
@@ -531,7 +540,7 @@ read_stream_begin_relation(int flags,
 	 * doing full io_combine_limit sized reads (behavior B).
 	 */
 	if (flags & READ_STREAM_FULL)
-		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+		stream->distance = Min(max_pinned_buffers, my_io_combine_limit);
 	else
 		stream->distance = 1;
 
@@ -713,14 +722,14 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		else
 		{
 			/* No advice; move towards io_combine_limit (behavior B). */
-			if (stream->distance > io_combine_limit)
+			if (stream->distance > stream->io_combine_limit)
 			{
 				stream->distance--;
 			}
 			else
 			{
 				distance = stream->distance * 2;
-				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, stream->io_combine_limit);
 				distance = Min(distance, stream->max_pinned_buffers);
 				stream->distance = distance;
 			}
diff --git a/src/backend/utils/cache/spccache.c b/src/backend/utils/cache/spccache.c
index ec63cdc8e52..97664824a4d 100644
--- a/src/backend/utils/cache/spccache.c
+++ b/src/backend/utils/cache/spccache.c
@@ -235,3 +235,17 @@ get_tablespace_maintenance_io_concurrency(Oid spcid)
 	else
 		return spc->opts->maintenance_io_concurrency;
 }
+
+/*
+ * get_tablespace_io_combine_limit
+ */
+int
+get_tablespace_io_combine_limit(Oid spcid)
+{
+	TableSpaceCacheEntry *spc = get_tablespace(spcid);
+
+	if (!spc->opts || spc->opts->io_combine_limit < 0)
+		return io_combine_limit;
+	else
+		return spc->opts->io_combine_limit;
+}
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 82eb3955abf..e2115e01ad4 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2633,7 +2633,8 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER TABLESPACE <foo> SET|RESET ( */
 	else if (Matches("ALTER", "TABLESPACE", MatchAny, "SET|RESET", "("))
 		COMPLETE_WITH("seq_page_cost", "random_page_cost",
-					  "effective_io_concurrency", "maintenance_io_concurrency");
+					  "effective_io_concurrency", "maintenance_io_concurrency",
+					  "io_combine_limit");
 
 	/* ALTER TEXT SEARCH */
 	else if (Matches("ALTER", "TEXT", "SEARCH"))
diff --git a/src/include/commands/tablespace.h b/src/include/commands/tablespace.h
index b6cec632db9..c187283a6dd 100644
--- a/src/include/commands/tablespace.h
+++ b/src/include/commands/tablespace.h
@@ -43,6 +43,7 @@ typedef struct TableSpaceOpts
 	float8		seq_page_cost;
 	int			effective_io_concurrency;
 	int			maintenance_io_concurrency;
+	int			io_combine_limit;
 } TableSpaceOpts;
 
 extern Oid	CreateTableSpace(CreateTableSpaceStmt *stmt);
diff --git a/src/include/utils/spccache.h b/src/include/utils/spccache.h
index 11cfa719955..f8a19764a8f 100644
--- a/src/include/utils/spccache.h
+++ b/src/include/utils/spccache.h
@@ -17,5 +17,6 @@ extern void get_tablespace_page_costs(Oid spcid, float8 *spc_random_page_cost,
 									  float8 *spc_seq_page_cost);
 extern int	get_tablespace_io_concurrency(Oid spcid);
 extern int	get_tablespace_maintenance_io_concurrency(Oid spcid);
+extern int	get_tablespace_io_combine_limit(Oid spcid);
 
 #endif							/* SPCCACHE_H */
-- 
2.44.0

v16-0002-Add-effective_io_readahead_window-setting.patchtext/x-patch; charset=US-ASCII; name=v16-0002-Add-effective_io_readahead_window-setting.patchDownload

From 166a81b4d4639ddf5ee89c6f97634a453988e2bb Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 31 Mar 2024 05:48:20 +1300
Subject: [PATCH v16 2/4] Add effective_io_readahead_window setting.

In a few places we estimate whether the kernel will consider a pattern
of reads to be sequential, for readahead purposes.  If we think so, we
don't issue POSIX_FADV_WILLNEED advice, because we know that it's better
to get out of the kernel's way.

Linux uses a sliding window to detect sequential access, defaulting to
128kB (see blockdev --report).  This setting allows the user to tell
PostgreSQL that size, so we can try to model kernel behavior more
accurately.  Several other known operating systems seem to use a strict
next-block-only detection, so for other systems we just use 0 for now
(it must be exactly sequential), until we have specific information
about other OSes.  It's not a very critical question, because most other
systems either lack or ignore POSIX_FADV_WILLNEED anyway.

XXX Is this a good idea?
XXX This doesn't support units like '128kB' in ALTER TABLE, to be
like the GUC.

Suggested-by: Tomas Vondra <tomas.vondra@enterprisedb.com
Discussion: https://postgr.es/m/f603ac51-a7ff-496a-99c1-76673635692e%40enterprisedb.com
---
 doc/src/sgml/config.sgml                      | 22 +++++++++++++++++++
 doc/src/sgml/ref/alter_tablespace.sgml        |  6 +++--
 src/backend/access/common/reloptions.c        | 10 +++++++++
 src/backend/storage/aio/read_stream.c         | 14 ++++++++++--
 src/backend/storage/buffer/bufmgr.c           |  7 ++++++
 src/backend/utils/cache/spccache.c            | 14 ++++++++++++
 src/backend/utils/misc/guc_tables.c           | 14 ++++++++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/bin/psql/tab-complete.c                   |  2 +-
 src/include/commands/tablespace.h             |  1 +
 src/include/storage/bufmgr.h                  |  7 ++++++
 src/include/utils/spccache.h                  |  1 +
 12 files changed, 94 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 624518e0b01..fcfe38a823a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2722,6 +2722,28 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-effective-io-readahead-window" xreflabel="effective_io_readahead_window">
+       <term><varname>effective_io_readahead_window</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>effective_io_readahead_window</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the size of the window that <productname>PostgreSQL</productname>
+         expects the OS to use to detect sequential reads.  This is used to
+         make decisions about when to issue hints to the kernel on future
+         random reads.  Does not apply when direct I/O is used.
+        </para>
+        <para>
+         The default is 128kB on Linux, and 0 otherwise.  This value can
+         be overridden for tables in a particular tablespace by setting the
+         tablespace parameter of the same name (see
+         <xref linkend="sql-altertablespace"/>).
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
        <term><varname>max_worker_processes</varname> (<type>integer</type>)
        <indexterm>
diff --git a/doc/src/sgml/ref/alter_tablespace.sgml b/doc/src/sgml/ref/alter_tablespace.sgml
index faf0c6e7fbc..25dbbea66ac 100644
--- a/doc/src/sgml/ref/alter_tablespace.sgml
+++ b/doc/src/sgml/ref/alter_tablespace.sgml
@@ -85,7 +85,8 @@ ALTER TABLESPACE <replaceable>name</replaceable> RESET ( <replaceable class="par
       A tablespace parameter to be set or reset.  Currently, the only
       available parameters are <varname>seq_page_cost</varname>,
       <varname>random_page_cost</varname>, <varname>effective_io_concurrency</varname>,
-      <varname>maintenance_io_concurrency</varname> and <varname>io_combine_limit</varname>.
+      <varname>maintenance_io_concurrency</varname>,
+      <varname>effective_io_readahead_window</varname> and <varname>io_combine_limit</varname>.
       Setting these values for a particular tablespace will override the
       planner's usual estimate of the cost of reading pages from tables in
       that tablespace, and the executor's prefetching and I/O sizing behavior, as established
@@ -94,7 +95,8 @@ ALTER TABLESPACE <replaceable>name</replaceable> RESET ( <replaceable class="par
       <xref linkend="guc-random-page-cost"/>,
       <xref linkend="guc-effective-io-concurrency"/>,
       <xref linkend="guc-maintenance-io-concurrency"/>),
-      <xref linkend="guc-io-combine-limit"/>).  This may be useful if
+      <xref linkend="guc-io-combine-limit"/>),
+      <xref linkend="guc-effective-io-readahead-window"/>).  This may be useful if
       one tablespace is located on a disk which is faster or slower than the
       remainder of the I/O subsystem.
      </para>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 1e1c611fab2..5a5bc7c5a2e 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -380,6 +380,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 1, MAX_IO_COMBINE_LIMIT
 	},
+	{
+		{
+			"effective_io_readahead_window",
+			"Size of the window used by the OS to detect sequential buffered access.",
+			RELOPT_KIND_TABLESPACE,
+			ShareUpdateExclusiveLock
+		},
+		-1, 1, PG_INT16_MAX,
+	},
 	{
 		{
 			"parallel_workers",
@@ -2100,6 +2109,7 @@ tablespace_reloptions(Datum reloptions, bool validate)
 		{"effective_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, effective_io_concurrency)},
 		{"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)},
 		{"io_combine_limit", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, io_combine_limit)},
+		{"effective_io_readahead_window", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, effective_io_readahead_window)},
 	};
 
 	return (bytea *) build_reloptions(reloptions, validate,
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 907c80e6bf9..84a5b714b11 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -115,6 +115,7 @@ struct ReadStream
 	int16		pinned_buffers;
 	int16		distance;
 	int16		io_combine_limit;
+	int16		effective_io_readahead_window;
 	bool		advice_enabled;
 
 	/*
@@ -256,11 +257,15 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 
 	/*
 	 * If advice hasn't been suppressed, this system supports it, and this
-	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 * isn't sequential according to the effective_io_readahead_window
+	 * setting, (which should ideally tell us how the OS detects sequential
+	 * buffered access), then we'll issue advice.
 	 */
 	if (!suppress_advice &&
 		stream->advice_enabled &&
-		stream->pending_read_blocknum != stream->seq_blocknum)
+		!(stream->pending_read_blocknum >= stream->seq_blocknum &&
+		  stream->pending_read_blocknum <= (stream->seq_blocknum +
+											stream->effective_io_readahead_window)))
 		flags = READ_BUFFERS_ISSUE_ADVICE;
 	else
 		flags = 0;
@@ -422,6 +427,7 @@ read_stream_begin_relation(int flags,
 	int16		max_ios;
 	int16		my_io_combine_limit;
 	uint32		max_pinned_buffers;
+	int16		my_effective_io_readahead_window;
 	Oid			tablespace_id;
 	SMgrRelation smgr;
 
@@ -445,6 +451,7 @@ read_stream_begin_relation(int flags,
 		 */
 		max_ios = effective_io_concurrency;
 		my_io_combine_limit = io_combine_limit;
+		my_effective_io_readahead_window = effective_io_readahead_window;
 	}
 	else
 	{
@@ -453,6 +460,8 @@ read_stream_begin_relation(int flags,
 		else
 			max_ios = get_tablespace_io_concurrency(tablespace_id);
 		my_io_combine_limit = get_tablespace_io_combine_limit(tablespace_id);
+		my_effective_io_readahead_window =
+			get_tablespace_io_readahead_window(tablespace_id);
 	}
 	max_ios = Min(max_ios, PG_INT16_MAX);
 
@@ -531,6 +540,7 @@ read_stream_begin_relation(int flags,
 	stream->per_buffer_data_size = per_buffer_data_size;
 	stream->max_pinned_buffers = max_pinned_buffers;
 	stream->queue_size = queue_size;
+	stream->effective_io_readahead_window = my_effective_io_readahead_window;
 	stream->callback = callback;
 	stream->callback_private_data = callback_private_data;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 929eb8f175f..502e93fd7af 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -163,6 +163,13 @@ int			maintenance_io_concurrency = DEFAULT_MAINTENANCE_IO_CONCURRENCY;
  */
 int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
 
+/*
+ * In order to detect sequential access in the same way as the OS would, we
+ * need to know the size of the window it uses.  Overridden by the tablespace
+ * setting of the same name.
+ */
+int			effective_io_readahead_window = DEFAULT_EFFECTIVE_IO_READAHEAD_WINDOW;
+
 /*
  * GUC variables about triggering kernel writeback for buffers written; OS
  * dependent defaults are set via the GUC mechanism.
diff --git a/src/backend/utils/cache/spccache.c b/src/backend/utils/cache/spccache.c
index 97664824a4d..b29d7abda67 100644
--- a/src/backend/utils/cache/spccache.c
+++ b/src/backend/utils/cache/spccache.c
@@ -249,3 +249,17 @@ get_tablespace_io_combine_limit(Oid spcid)
 	else
 		return spc->opts->io_combine_limit;
 }
+
+/*
+ * get_tablespace_io_readahead_window
+ */
+int
+get_tablespace_io_readahead_window(Oid spcid)
+{
+	TableSpaceCacheEntry *spc = get_tablespace(spcid);
+
+	if (!spc->opts || spc->opts->effective_io_readahead_window < 0)
+		return effective_io_readahead_window;
+	else
+		return spc->opts->effective_io_readahead_window;
+}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index c12784cbec8..ab5e5f50df9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3143,6 +3143,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"effective_io_readahead_window",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Size of the window the OS uses to detect sequential buffered file access."),
+			NULL,
+			GUC_EXPLAIN
+		},
+		&effective_io_readahead_window,
+		DEFAULT_EFFECTIVE_IO_READAHEAD_WINDOW,
+		0, PG_INT16_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
 			gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index baecde28410..41598104b32 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -204,6 +204,7 @@
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
 #io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+#effective_io_readahead_window = 16	# 1-32767; expected system readahead window
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index e2115e01ad4..2af489ec447 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2634,7 +2634,7 @@ psql_completion(const char *text, int start, int end)
 	else if (Matches("ALTER", "TABLESPACE", MatchAny, "SET|RESET", "("))
 		COMPLETE_WITH("seq_page_cost", "random_page_cost",
 					  "effective_io_concurrency", "maintenance_io_concurrency",
-					  "io_combine_limit");
+					  "io_combine_limit", "effective_io_readahead_window");
 
 	/* ALTER TEXT SEARCH */
 	else if (Matches("ALTER", "TEXT", "SEARCH"))
diff --git a/src/include/commands/tablespace.h b/src/include/commands/tablespace.h
index c187283a6dd..94cd1a29985 100644
--- a/src/include/commands/tablespace.h
+++ b/src/include/commands/tablespace.h
@@ -44,6 +44,7 @@ typedef struct TableSpaceOpts
 	int			effective_io_concurrency;
 	int			maintenance_io_concurrency;
 	int			io_combine_limit;
+	int			effective_io_readahead_window;
 } TableSpaceOpts;
 
 extern Oid	CreateTableSpace(CreateTableSpaceStmt *stmt);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index f380f9d9a6c..aa1ef8bf797 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -173,6 +173,13 @@ extern PGDLLIMPORT int maintenance_io_concurrency;
 #define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
 extern PGDLLIMPORT int io_combine_limit;
 
+#ifdef __linux__
+#define DEFAULT_EFFECTIVE_IO_READAHEAD_WINDOW ((128 * 1024) / BLCKSZ)
+#else
+#define DEFAULT_EFFECTIVE_IO_READAHEAD_WINDOW 0
+#endif
+extern PGDLLIMPORT int effective_io_readahead_window;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
diff --git a/src/include/utils/spccache.h b/src/include/utils/spccache.h
index f8a19764a8f..a9d163400f7 100644
--- a/src/include/utils/spccache.h
+++ b/src/include/utils/spccache.h
@@ -18,5 +18,6 @@ extern void get_tablespace_page_costs(Oid spcid, float8 *spc_random_page_cost,
 extern int	get_tablespace_io_concurrency(Oid spcid);
 extern int	get_tablespace_maintenance_io_concurrency(Oid spcid);
 extern int	get_tablespace_io_combine_limit(Oid spcid);
+extern int	get_tablespace_io_readahead_window(Oid spcid);
 
 #endif							/* SPCCACHE_H */
-- 
2.44.0

v16-0003-Increase-PG_IOV_MAX-for-bigger-io_combine_limit.patchtext/x-patch; charset=US-ASCII; name=v16-0003-Increase-PG_IOV_MAX-for-bigger-io_combine_limit.patchDownload

From 7f0d9adfeec4db374970996c53fd3ccdd0ee857a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 31 Mar 2024 17:24:44 +1300
Subject: [PATCH v16 3/4] Increase PG_IOV_MAX for bigger io_combine_limit.

PG_IOV_MAX, our clamped version of IOV_MAX, constrains io_combine_limit.
Change the clamp from 32 up to 128 vectors so that we can set
io_combine_limit up to 1MB instead of 256kB (assuming BLCKSZ == 8kB).
These numbers are fairly arbitrary and it's not at all clear that it's a
good idea to run with large numbers, but we might as well at least allow
experimentation.

This will also affect the size of the writes when initializing WAL
segments, which will change from 256kB to 1MB (assuming BLCKSZ == 8kB).
---
 src/backend/utils/misc/postgresql.conf.sample | 2 +-
 src/include/port/pg_iovec.h                   | 7 +++++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 41598104b32..26612c938ac 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -203,7 +203,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
-#io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
+#io_combine_limit = 128kB		# usually 1-128 blocks (depends on OS)
 #effective_io_readahead_window = 16	# 1-32767; expected system readahead window
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h
index 7255c1bd911..5055ede2c8e 100644
--- a/src/include/port/pg_iovec.h
+++ b/src/include/port/pg_iovec.h
@@ -33,8 +33,11 @@ struct iovec
 
 #endif
 
-/* Define a reasonable maximum that is safe to use on the stack. */
-#define PG_IOV_MAX Min(IOV_MAX, 32)
+/*
+ * Define a reasonable maximum that is safe to use on the stack but also allows
+ * io_combine_limit to reach large sizes.
+ */
+#define PG_IOV_MAX Min(IOV_MAX, 128)
 
 /*
  * Like preadv(), but with a prefix to remind us of a side-effect: on Windows
-- 
2.44.0

#57

Melanie Plageman

melanieplageman@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#56)

Re: Streaming I/O, vectored I/O (WIP)

On Tue, Apr 2, 2024 at 8:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Here are the remaining patches discussed in this thread. They give
tablespace-specific io_combine_limit, effective_io_readahead_window
(is this useful?), and up-to-1MB io_combine_limit (is this useful?).
I think the first two would probably require teaching reloption.c how
to use guc.c's parse_int() and unit flags, but I won't have time to
look at that for this release so I'll just leave these here.

On the subject of guc.c, this is a terrible error message... did I do
something wrong?

postgres=# set io_combine_limit = '42MB';
ERROR: 5376 8kB is outside the valid range for parameter
"io_combine_limit" (1 .. 32)

Well, GUC_UNIT_BLOCKS interpolates the block limit into the error
message string (get_config_unit_name()). But, I can't imagine this
error message is clear for any of the GUCs using GUC_UNIT_BLOCKS. I
would think some combination of the two would be helpful, like "43008
kB (5376 blocks) is outside of the valid range for parameter". The
user can check what their block size is. I don't think we need to
interpolate and print the block size in the error message.

On another note, since io_combine_limit, when specified in size,
rounds up to the nearest multiple of blocksize, it might be worth
mentioning this in the io_combine_limit docs at some point. I checked
docs for another GUC_UNIT_BLOCKS guc, backend_flush_after, and it
alludes to this.

- Melanie

#58

Nazir Bilal Yavuz

byavuz81@gmail.com

almost 2 years ago

In reply to: Thomas Munro (#55)

1 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

Hi,

On Tue, 2 Apr 2024 at 11:40, Thomas Munro <thomas.munro@gmail.com> wrote:

I had been planning to commit v14 this morning but got cold feet with
the BMR-based interface. Heikki didn't like it much, and in the end,
neither did I. I have now removed it, and it seems much better. No
other significant changes, just parameter types and inlining details.
For example:

* read_stream_begin_relation() now takes a Relation, likes its name says
* StartReadBuffers()'s operation takes smgr and optional rel
* ReadBuffer_common() takes smgr and optional rel

Read stream objects can be created only using Relations now. There
could be read stream users which do not have a Relation but
SMgrRelations. So, I created another constructor for the read streams
which use SMgrRelations instead of Relations. Related patch is
attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

Add-a-way-to-create-read-stream-object-by-using-SMgr.patchtext/x-patch; charset=US-ASCII; name=Add-a-way-to-create-read-stream-object-by-using-SMgr.patchDownload

From 38b57ec7e54a54a0c7b0117a0ecaaf68c643e1b0 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Sun, 7 Apr 2024 20:16:00 +0300
Subject: [PATCH] Add a way to create read stream object by using SMgrRelation

Currently read stream object can be created only by using Relation,
there is no way to create it by using SMgrRelation.

To achieve that, read_stream_begin_impl() function is created and
contents of the read_stream_begin_relation() is moved into this
function.

Both read_stream_begin_relation() and read_stream_begin_sgmr_relation()
calls read_stream_begin_impl() by Relation and SMgrRelation
respectively.
---
 src/include/storage/read_stream.h     |  8 +++
 src/backend/storage/aio/read_stream.c | 74 ++++++++++++++++++++++-----
 2 files changed, 69 insertions(+), 13 deletions(-)

diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index fae09d2b4cc..601a7fcf92b 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -15,6 +15,7 @@
 #define READ_STREAM_H
 
 #include "storage/bufmgr.h"
+#include "storage/smgr.h"
 
 /* Default tuning, reasonable for many users. */
 #define READ_STREAM_DEFAULT 0x00
@@ -56,6 +57,13 @@ extern ReadStream *read_stream_begin_relation(int flags,
 											  ReadStreamBlockNumberCB callback,
 											  void *callback_private_data,
 											  size_t per_buffer_data_size);
+extern ReadStream *read_stream_begin_sgmr_relation(int flags,
+												   BufferAccessStrategy strategy,
+												   SMgrRelation smgr,
+												   ForkNumber forknum,
+												   ReadStreamBlockNumberCB callback,
+												   void *callback_private_data,
+												   size_t per_buffer_data_size);
 extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
 extern void read_stream_reset(ReadStream *stream);
 extern void read_stream_end(ReadStream *stream);
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..a9a3b0de6c9 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -406,14 +406,15 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
  * write extra data for each block into the space provided to it.  It will
  * also receive callback_private_data for its own purposes.
  */
-ReadStream *
-read_stream_begin_relation(int flags,
-						   BufferAccessStrategy strategy,
-						   Relation rel,
-						   ForkNumber forknum,
-						   ReadStreamBlockNumberCB callback,
-						   void *callback_private_data,
-						   size_t per_buffer_data_size)
+static ReadStream *
+read_stream_begin_impl(int flags,
+					   BufferAccessStrategy strategy,
+					   Relation rel,
+					   SMgrRelation smgr,
+					   ForkNumber forknum,
+					   ReadStreamBlockNumberCB callback,
+					   void *callback_private_data,
+					   size_t per_buffer_data_size)
 {
 	ReadStream *stream;
 	size_t		size;
@@ -422,9 +423,6 @@ read_stream_begin_relation(int flags,
 	int			strategy_pin_limit;
 	uint32		max_pinned_buffers;
 	Oid			tablespace_id;
-	SMgrRelation smgr;
-
-	smgr = RelationGetSmgr(rel);
 
 	/*
 	 * Decide how many I/Os we will allow to run at the same time.  That
@@ -434,7 +432,7 @@ read_stream_begin_relation(int flags,
 	 */
 	tablespace_id = smgr->smgr_rlocator.locator.spcOid;
 	if (!OidIsValid(MyDatabaseId) ||
-		IsCatalogRelation(rel) ||
+		(rel && IsCatalogRelation(rel)) ||
 		IsCatalogRelationOid(smgr->smgr_rlocator.locator.relNumber))
 	{
 		/*
@@ -548,7 +546,7 @@ read_stream_begin_relation(int flags,
 	for (int i = 0; i < max_ios; ++i)
 	{
 		stream->ios[i].op.rel = rel;
-		stream->ios[i].op.smgr = RelationGetSmgr(rel);
+		stream->ios[i].op.smgr = smgr;
 		stream->ios[i].op.smgr_persistence = 0;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
@@ -557,6 +555,56 @@ read_stream_begin_relation(int flags,
 	return stream;
 }
 
+/*
+ * Create a new read stream for reading a relation.
+ * See read_stream_begin_impl() for the detailed explanation.
+ */
+ReadStream *
+read_stream_begin_relation(int flags,
+						   BufferAccessStrategy strategy,
+						   Relation rel,
+						   ForkNumber forknum,
+						   ReadStreamBlockNumberCB callback,
+						   void *callback_private_data,
+						   size_t per_buffer_data_size)
+{
+	SMgrRelation smgr;
+
+	smgr = RelationGetSmgr(rel);
+
+	return read_stream_begin_impl(flags,
+								  strategy,
+								  rel,
+								  smgr,
+								  forknum,
+								  callback,
+								  callback_private_data,
+								  per_buffer_data_size);
+}
+
+/*
+ * Create a new read stream for reading a SMgr relation.
+ * See read_stream_begin_impl() for the detailed explanation.
+ */
+ReadStream *
+read_stream_begin_sgmr_relation(int flags,
+								BufferAccessStrategy strategy,
+								SMgrRelation smgr,
+								ForkNumber forknum,
+								ReadStreamBlockNumberCB callback,
+								void *callback_private_data,
+								size_t per_buffer_data_size)
+{
+	return read_stream_begin_impl(flags,
+								  strategy,
+								  NULL,
+								  smgr,
+								  forknum,
+								  callback,
+								  callback_private_data,
+								  per_buffer_data_size);
+}
+
 /*
  * Pull one pinned buffer out of a stream.  Each call returns successive
  * blocks in the order specified by the callback.  If per_buffer_data_size was
-- 
2.43.0

#59

Melanie Plageman

melanieplageman@gmail.com

almost 2 years ago

In reply to: Nazir Bilal Yavuz (#58)

Re: Streaming I/O, vectored I/O (WIP)

On Sun, Apr 7, 2024 at 1:33 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

Hi,

On Tue, 2 Apr 2024 at 11:40, Thomas Munro <thomas.munro@gmail.com> wrote:

I had been planning to commit v14 this morning but got cold feet with
the BMR-based interface. Heikki didn't like it much, and in the end,
neither did I. I have now removed it, and it seems much better. No
other significant changes, just parameter types and inlining details.
For example:

* read_stream_begin_relation() now takes a Relation, likes its name says
* StartReadBuffers()'s operation takes smgr and optional rel
* ReadBuffer_common() takes smgr and optional rel

Read stream objects can be created only using Relations now. There
could be read stream users which do not have a Relation but
SMgrRelations. So, I created another constructor for the read streams
which use SMgrRelations instead of Relations. Related patch is
attached.

This patch LGTM

- Melanie

#60

Nazir Bilal Yavuz

byavuz81@gmail.com

almost 2 years ago

In reply to: Nazir Bilal Yavuz (#58)

2 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

Hi,

On Sun, 7 Apr 2024 at 20:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

Hi,

On Tue, 2 Apr 2024 at 11:40, Thomas Munro <thomas.munro@gmail.com> wrote:

I had been planning to commit v14 this morning but got cold feet with
the BMR-based interface. Heikki didn't like it much, and in the end,
neither did I. I have now removed it, and it seems much better. No
other significant changes, just parameter types and inlining details.
For example:

* read_stream_begin_relation() now takes a Relation, likes its name says
* StartReadBuffers()'s operation takes smgr and optional rel
* ReadBuffer_common() takes smgr and optional rel

Read stream objects can be created only using Relations now. There
could be read stream users which do not have a Relation but
SMgrRelations. So, I created another constructor for the read streams
which use SMgrRelations instead of Relations. Related patch is
attached.

After sending this, I realized that I forgot to add persistence value
to the new constructor. While working on it I also realized that
current code sets persistence in PinBufferForBlock() function and this
function is called for each block, which can be costly. So, I moved
setting persistence to the out of PinBufferForBlock() function.

Setting persistence outside of the PinBufferForBlock() function (0001)
and creating the new constructor that uses SMgrRelations (0002) are
attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

0001-Refactor-PinBufferForBlock-to-remove-if-checks-about.patchtext/x-patch; charset=US-ASCII; name=0001-Refactor-PinBufferForBlock-to-remove-if-checks-about.patchDownload

From 04fd860ce8c4c57830930cb362799fd155c92613 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Sun, 7 Apr 2024 22:33:36 +0300
Subject: [PATCH 1/2] Refactor PinBufferForBlock() to remove if checks about
 persistence

There are if checks in PinBufferForBlock() function to set persistence
of the relation and this function is called for the each block in the
relation. Instead of that, set persistence of the relation before
PinBufferForBlock() function.
---
 src/backend/storage/aio/read_stream.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 31 +++++++++++----------------
 2 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..d155dde5ce3 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
 	{
 		stream->ios[i].op.rel = rel;
 		stream->ios[i].op.smgr = RelationGetSmgr(rel);
-		stream->ios[i].op.smgr_persistence = 0;
+		stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 06e9ffd2b00..b4fcefed78a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1067,24 +1067,10 @@ PinBufferForBlock(Relation rel,
 	BufferDesc *bufHdr;
 	IOContext	io_context;
 	IOObject	io_object;
-	char		persistence;
 
 	Assert(blockNum != P_NEW);
 
-	/*
-	 * If there is no Relation it usually implies recovery and thus permanent,
-	 * but we take an argmument because CreateAndCopyRelationData can reach us
-	 * with only an SMgrRelation for an unlogged relation that we don't want
-	 * to flag with BM_PERMANENT.
-	 */
-	if (rel)
-		persistence = rel->rd_rel->relpersistence;
-	else if (smgr_persistence == 0)
-		persistence = RELPERSISTENCE_PERMANENT;
-	else
-		persistence = smgr_persistence;
-
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
@@ -1101,7 +1087,7 @@ PinBufferForBlock(Relation rel,
 									   smgr->smgr_rlocator.locator.relNumber,
 									   smgr->smgr_rlocator.backend);
 
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
 		if (*foundPtr)
@@ -1109,7 +1095,7 @@ PinBufferForBlock(Relation rel,
 	}
 	else
 	{
-		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
+		bufHdr = BufferAlloc(smgr, smgr_persistence, forkNum, blockNum,
 							 strategy, foundPtr, io_context);
 		if (*foundPtr)
 			pgBufferUsage.shared_blks_hit++;
@@ -1157,6 +1143,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 	ReadBuffersOperation operation;
 	Buffer		buffer;
 	int			flags;
+	char		persistence;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1195,7 +1182,15 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 		flags = 0;
 	operation.smgr = smgr;
 	operation.rel = rel;
-	operation.smgr_persistence = smgr_persistence;
+
+	if (rel)
+		persistence = rel->rd_rel->relpersistence;
+	else if (smgr_persistence == 0)
+		persistence = RELPERSISTENCE_PERMANENT;
+	else
+		persistence = smgr_persistence;
+	operation.smgr_persistence = persistence;
+
 	operation.forknum = forkNum;
 	operation.strategy = strategy;
 	if (StartReadBuffer(&operation,
-- 
2.43.0

0002-Add-a-way-to-create-read-stream-object-by-using-SMgr.patchtext/x-patch; charset=US-ASCII; name=0002-Add-a-way-to-create-read-stream-object-by-using-SMgr.patchDownload

From 08783eddeb913839c919999264fb7a61da9cda5b Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Sun, 7 Apr 2024 23:32:48 +0300
Subject: [PATCH 2/2] Add a way to create read stream object by using
 SMgrRelation

Currently read stream object can be created only by using Relation,
there is no way to create it by using SMgrRelation.

To achieve that, read_stream_begin_impl() function is created and
contents of the read_stream_begin_relation() is moved into this
function.

Both read_stream_begin_relation() and read_stream_begin_sgmr_relation()
calls read_stream_begin_impl() by Relation and SMgrRelation
respectively.
---
 src/include/storage/read_stream.h     |  9 +++
 src/backend/storage/aio/read_stream.c | 83 ++++++++++++++++++++++-----
 2 files changed, 78 insertions(+), 14 deletions(-)

diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index fae09d2b4cc..7c47cc90b4f 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -15,6 +15,7 @@
 #define READ_STREAM_H
 
 #include "storage/bufmgr.h"
+#include "storage/smgr.h"
 
 /* Default tuning, reasonable for many users. */
 #define READ_STREAM_DEFAULT 0x00
@@ -56,6 +57,14 @@ extern ReadStream *read_stream_begin_relation(int flags,
 											  ReadStreamBlockNumberCB callback,
 											  void *callback_private_data,
 											  size_t per_buffer_data_size);
+extern ReadStream *read_stream_begin_sgmr_relation(int flags,
+												   BufferAccessStrategy strategy,
+												   SMgrRelation smgr,
+												   char smgr_persistence,
+												   ForkNumber forknum,
+												   ReadStreamBlockNumberCB callback,
+												   void *callback_private_data,
+												   size_t per_buffer_data_size);
 extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
 extern void read_stream_reset(ReadStream *stream);
 extern void read_stream_end(ReadStream *stream);
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index d155dde5ce3..cc90f71ab2e 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -406,14 +406,16 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
  * write extra data for each block into the space provided to it.  It will
  * also receive callback_private_data for its own purposes.
  */
-ReadStream *
-read_stream_begin_relation(int flags,
-						   BufferAccessStrategy strategy,
-						   Relation rel,
-						   ForkNumber forknum,
-						   ReadStreamBlockNumberCB callback,
-						   void *callback_private_data,
-						   size_t per_buffer_data_size)
+static ReadStream *
+read_stream_begin_impl(int flags,
+					   BufferAccessStrategy strategy,
+					   Relation rel,
+					   SMgrRelation smgr,
+					   char smgr_persistence,
+					   ForkNumber forknum,
+					   ReadStreamBlockNumberCB callback,
+					   void *callback_private_data,
+					   size_t per_buffer_data_size)
 {
 	ReadStream *stream;
 	size_t		size;
@@ -422,9 +424,6 @@ read_stream_begin_relation(int flags,
 	int			strategy_pin_limit;
 	uint32		max_pinned_buffers;
 	Oid			tablespace_id;
-	SMgrRelation smgr;
-
-	smgr = RelationGetSmgr(rel);
 
 	/*
 	 * Decide how many I/Os we will allow to run at the same time.  That
@@ -434,7 +433,7 @@ read_stream_begin_relation(int flags,
 	 */
 	tablespace_id = smgr->smgr_rlocator.locator.spcOid;
 	if (!OidIsValid(MyDatabaseId) ||
-		IsCatalogRelation(rel) ||
+		(rel && IsCatalogRelation(rel)) ||
 		IsCatalogRelationOid(smgr->smgr_rlocator.locator.relNumber))
 	{
 		/*
@@ -548,8 +547,8 @@ read_stream_begin_relation(int flags,
 	for (int i = 0; i < max_ios; ++i)
 	{
 		stream->ios[i].op.rel = rel;
-		stream->ios[i].op.smgr = RelationGetSmgr(rel);
-		stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
+		stream->ios[i].op.smgr = smgr;
+		stream->ios[i].op.smgr_persistence = smgr_persistence;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
 	}
@@ -557,6 +556,62 @@ read_stream_begin_relation(int flags,
 	return stream;
 }
 
+/*
+ * Create a new read stream for reading a relation.
+ * See read_stream_begin_impl() for the detailed explanation.
+ */
+ReadStream *
+read_stream_begin_relation(int flags,
+						   BufferAccessStrategy strategy,
+						   Relation rel,
+						   ForkNumber forknum,
+						   ReadStreamBlockNumberCB callback,
+						   void *callback_private_data,
+						   size_t per_buffer_data_size)
+{
+	return read_stream_begin_impl(flags,
+								  strategy,
+								  rel,
+								  RelationGetSmgr(rel),
+								  rel->rd_rel->relpersistence,
+								  forknum,
+								  callback,
+								  callback_private_data,
+								  per_buffer_data_size);
+}
+
+/*
+ * Create a new read stream for reading a SMgr relation.
+ * See read_stream_begin_impl() for the detailed explanation.
+ */
+ReadStream *
+read_stream_begin_sgmr_relation(int flags,
+								BufferAccessStrategy strategy,
+								SMgrRelation smgr,
+								char smgr_persistence,
+								ForkNumber forknum,
+								ReadStreamBlockNumberCB callback,
+								void *callback_private_data,
+								size_t per_buffer_data_size)
+{
+	char		persistence;
+
+	if (smgr_persistence == 0)
+		persistence = RELPERSISTENCE_PERMANENT;
+	else
+		persistence = smgr_persistence;
+
+	return read_stream_begin_impl(flags,
+								  strategy,
+								  NULL,
+								  smgr,
+								  persistence,
+								  forknum,
+								  callback,
+								  callback_private_data,
+								  per_buffer_data_size);
+}
+
 /*
  * Pull one pinned buffer out of a stream.  Each call returns successive
  * blocks in the order specified by the callback.  If per_buffer_data_size was
-- 
2.43.0

#61

Nazir Bilal Yavuz

byavuz81@gmail.com

almost 2 years ago

In reply to: Nazir Bilal Yavuz (#60)

2 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

Hi,

On Mon, 8 Apr 2024 at 00:01, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

Hi,

On Sun, 7 Apr 2024 at 20:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

Hi,

On Tue, 2 Apr 2024 at 11:40, Thomas Munro <thomas.munro@gmail.com> wrote:

I had been planning to commit v14 this morning but got cold feet with
the BMR-based interface. Heikki didn't like it much, and in the end,
neither did I. I have now removed it, and it seems much better. No
other significant changes, just parameter types and inlining details.
For example:

* read_stream_begin_relation() now takes a Relation, likes its name says
* StartReadBuffers()'s operation takes smgr and optional rel
* ReadBuffer_common() takes smgr and optional rel

Read stream objects can be created only using Relations now. There
could be read stream users which do not have a Relation but
SMgrRelations. So, I created another constructor for the read streams
which use SMgrRelations instead of Relations. Related patch is
attached.

After sending this, I realized that I forgot to add persistence value
to the new constructor. While working on it I also realized that
current code sets persistence in PinBufferForBlock() function and this
function is called for each block, which can be costly. So, I moved
setting persistence to the out of PinBufferForBlock() function.

Setting persistence outside of the PinBufferForBlock() function (0001)
and creating the new constructor that uses SMgrRelations (0002) are
attached.

Melanie noticed there was a 'sgmr -> smgr' typo in 0002. Fixed in attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

0001-Refactor-PinBufferForBlock-to-remove-if-checks-about.patchtext/x-patch; charset=US-ASCII; name=0001-Refactor-PinBufferForBlock-to-remove-if-checks-about.patchDownload

From 04fd860ce8c4c57830930cb362799fd155c92613 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Sun, 7 Apr 2024 22:33:36 +0300
Subject: [PATCH 1/2] Refactor PinBufferForBlock() to remove if checks about
 persistence

There are if checks in PinBufferForBlock() function to set persistence
of the relation and this function is called for the each block in the
relation. Instead of that, set persistence of the relation before
PinBufferForBlock() function.
---
 src/backend/storage/aio/read_stream.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 31 +++++++++++----------------
 2 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..d155dde5ce3 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
 	{
 		stream->ios[i].op.rel = rel;
 		stream->ios[i].op.smgr = RelationGetSmgr(rel);
-		stream->ios[i].op.smgr_persistence = 0;
+		stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
 	}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 06e9ffd2b00..b4fcefed78a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1067,24 +1067,10 @@ PinBufferForBlock(Relation rel,
 	BufferDesc *bufHdr;
 	IOContext	io_context;
 	IOObject	io_object;
-	char		persistence;
 
 	Assert(blockNum != P_NEW);
 
-	/*
-	 * If there is no Relation it usually implies recovery and thus permanent,
-	 * but we take an argmument because CreateAndCopyRelationData can reach us
-	 * with only an SMgrRelation for an unlogged relation that we don't want
-	 * to flag with BM_PERMANENT.
-	 */
-	if (rel)
-		persistence = rel->rd_rel->relpersistence;
-	else if (smgr_persistence == 0)
-		persistence = RELPERSISTENCE_PERMANENT;
-	else
-		persistence = smgr_persistence;
-
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
@@ -1101,7 +1087,7 @@ PinBufferForBlock(Relation rel,
 									   smgr->smgr_rlocator.locator.relNumber,
 									   smgr->smgr_rlocator.backend);
 
-	if (persistence == RELPERSISTENCE_TEMP)
+	if (smgr_persistence == RELPERSISTENCE_TEMP)
 	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
 		if (*foundPtr)
@@ -1109,7 +1095,7 @@ PinBufferForBlock(Relation rel,
 	}
 	else
 	{
-		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
+		bufHdr = BufferAlloc(smgr, smgr_persistence, forkNum, blockNum,
 							 strategy, foundPtr, io_context);
 		if (*foundPtr)
 			pgBufferUsage.shared_blks_hit++;
@@ -1157,6 +1143,7 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 	ReadBuffersOperation operation;
 	Buffer		buffer;
 	int			flags;
+	char		persistence;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1195,7 +1182,15 @@ ReadBuffer_common(Relation rel, SMgrRelation smgr, char smgr_persistence,
 		flags = 0;
 	operation.smgr = smgr;
 	operation.rel = rel;
-	operation.smgr_persistence = smgr_persistence;
+
+	if (rel)
+		persistence = rel->rd_rel->relpersistence;
+	else if (smgr_persistence == 0)
+		persistence = RELPERSISTENCE_PERMANENT;
+	else
+		persistence = smgr_persistence;
+	operation.smgr_persistence = persistence;
+
 	operation.forknum = forkNum;
 	operation.strategy = strategy;
 	if (StartReadBuffer(&operation,
-- 
2.43.0

0002-Add-a-way-to-create-read-stream-object-by-using-SMgr.patchtext/x-patch; charset=US-ASCII; name=0002-Add-a-way-to-create-read-stream-object-by-using-SMgr.patchDownload

From edb23556e7fdfbbb5e237704b210f219e143f771 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 8 Apr 2024 00:14:52 +0300
Subject: [PATCH 2/2] Add a way to create read stream object by using
 SMgrRelation

Currently read stream object can be created only by using Relation,
there is no way to create it by using SMgrRelation.

To achieve that, read_stream_begin_impl() function is created and
contents of the read_stream_begin_relation() is moved into this
function.

Both read_stream_begin_relation() and read_stream_begin_smgr_relation()
calls read_stream_begin_impl() by Relation and SMgrRelation
respectively.
---
 src/include/storage/read_stream.h     |  9 +++
 src/backend/storage/aio/read_stream.c | 83 ++++++++++++++++++++++-----
 2 files changed, 78 insertions(+), 14 deletions(-)

diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index fae09d2b4cc..c7d688d8c9c 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -15,6 +15,7 @@
 #define READ_STREAM_H
 
 #include "storage/bufmgr.h"
+#include "storage/smgr.h"
 
 /* Default tuning, reasonable for many users. */
 #define READ_STREAM_DEFAULT 0x00
@@ -56,6 +57,14 @@ extern ReadStream *read_stream_begin_relation(int flags,
 											  ReadStreamBlockNumberCB callback,
 											  void *callback_private_data,
 											  size_t per_buffer_data_size);
+extern ReadStream *read_stream_begin_smgr_relation(int flags,
+												   BufferAccessStrategy strategy,
+												   SMgrRelation smgr,
+												   char smgr_persistence,
+												   ForkNumber forknum,
+												   ReadStreamBlockNumberCB callback,
+												   void *callback_private_data,
+												   size_t per_buffer_data_size);
 extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
 extern void read_stream_reset(ReadStream *stream);
 extern void read_stream_end(ReadStream *stream);
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index d155dde5ce3..8dd100f016c 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -406,14 +406,16 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
  * write extra data for each block into the space provided to it.  It will
  * also receive callback_private_data for its own purposes.
  */
-ReadStream *
-read_stream_begin_relation(int flags,
-						   BufferAccessStrategy strategy,
-						   Relation rel,
-						   ForkNumber forknum,
-						   ReadStreamBlockNumberCB callback,
-						   void *callback_private_data,
-						   size_t per_buffer_data_size)
+static ReadStream *
+read_stream_begin_impl(int flags,
+					   BufferAccessStrategy strategy,
+					   Relation rel,
+					   SMgrRelation smgr,
+					   char smgr_persistence,
+					   ForkNumber forknum,
+					   ReadStreamBlockNumberCB callback,
+					   void *callback_private_data,
+					   size_t per_buffer_data_size)
 {
 	ReadStream *stream;
 	size_t		size;
@@ -422,9 +424,6 @@ read_stream_begin_relation(int flags,
 	int			strategy_pin_limit;
 	uint32		max_pinned_buffers;
 	Oid			tablespace_id;
-	SMgrRelation smgr;
-
-	smgr = RelationGetSmgr(rel);
 
 	/*
 	 * Decide how many I/Os we will allow to run at the same time.  That
@@ -434,7 +433,7 @@ read_stream_begin_relation(int flags,
 	 */
 	tablespace_id = smgr->smgr_rlocator.locator.spcOid;
 	if (!OidIsValid(MyDatabaseId) ||
-		IsCatalogRelation(rel) ||
+		(rel && IsCatalogRelation(rel)) ||
 		IsCatalogRelationOid(smgr->smgr_rlocator.locator.relNumber))
 	{
 		/*
@@ -548,8 +547,8 @@ read_stream_begin_relation(int flags,
 	for (int i = 0; i < max_ios; ++i)
 	{
 		stream->ios[i].op.rel = rel;
-		stream->ios[i].op.smgr = RelationGetSmgr(rel);
-		stream->ios[i].op.smgr_persistence = rel->rd_rel->relpersistence;
+		stream->ios[i].op.smgr = smgr;
+		stream->ios[i].op.smgr_persistence = smgr_persistence;
 		stream->ios[i].op.forknum = forknum;
 		stream->ios[i].op.strategy = strategy;
 	}
@@ -557,6 +556,62 @@ read_stream_begin_relation(int flags,
 	return stream;
 }
 
+/*
+ * Create a new read stream for reading a relation.
+ * See read_stream_begin_impl() for the detailed explanation.
+ */
+ReadStream *
+read_stream_begin_relation(int flags,
+						   BufferAccessStrategy strategy,
+						   Relation rel,
+						   ForkNumber forknum,
+						   ReadStreamBlockNumberCB callback,
+						   void *callback_private_data,
+						   size_t per_buffer_data_size)
+{
+	return read_stream_begin_impl(flags,
+								  strategy,
+								  rel,
+								  RelationGetSmgr(rel),
+								  rel->rd_rel->relpersistence,
+								  forknum,
+								  callback,
+								  callback_private_data,
+								  per_buffer_data_size);
+}
+
+/*
+ * Create a new read stream for reading a SMgr relation.
+ * See read_stream_begin_impl() for the detailed explanation.
+ */
+ReadStream *
+read_stream_begin_smgr_relation(int flags,
+								BufferAccessStrategy strategy,
+								SMgrRelation smgr,
+								char smgr_persistence,
+								ForkNumber forknum,
+								ReadStreamBlockNumberCB callback,
+								void *callback_private_data,
+								size_t per_buffer_data_size)
+{
+	char		persistence;
+
+	if (smgr_persistence == 0)
+		persistence = RELPERSISTENCE_PERMANENT;
+	else
+		persistence = smgr_persistence;
+
+	return read_stream_begin_impl(flags,
+								  strategy,
+								  NULL,
+								  smgr,
+								  persistence,
+								  forknum,
+								  callback,
+								  callback_private_data,
+								  per_buffer_data_size);
+}
+
 /*
  * Pull one pinned buffer out of a stream.  Each call returns successive
  * blocks in the order specified by the callback.  If per_buffer_data_size was
-- 
2.43.0

#62

David Rowley

dgrowleyml@gmail.com

over 1 year ago

In reply to: Nazir Bilal Yavuz (#61)

1 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

I've attached a patch with a few typo fixes and what looks like an
incorrect type for max_ios. It's an int16 and I think it needs to be
an int. Doing "max_ios = Min(max_ios, PG_INT16_MAX);" doesn't do
anything when max_ios is int16.

David

Attachments:

read_stream_fixes.patchtext/plain; charset=US-ASCII; name=read_stream_fixes.patchDownload

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 634cf4f0d1..74b9bae631 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -26,12 +26,12 @@
  *
  * B) I/O is necessary, but fadvise is undesirable because the access is
  * sequential, or impossible because direct I/O is enabled or the system
- * doesn't support advice.  There is no benefit in looking ahead more than
- * io_combine_limit, because in this case only goal is larger read system
+ * doesn't support fadvise.  There is no benefit in looking ahead more than
+ * io_combine_limit, because in this case the only goal is larger read system
  * calls.  Looking further ahead would pin many buffers and perform
  * speculative work looking ahead for no benefit.
  *
- * C) I/O is necesssary, it appears random, and this system supports fadvise.
+ * C) I/O is necessary, it appears random, and this system supports fadvise.
  * We'll look further ahead in order to reach the configured level of I/O
  * concurrency.
  *
@@ -418,7 +418,7 @@ read_stream_begin_relation(int flags,
 	ReadStream *stream;
 	size_t		size;
 	int16		queue_size;
-	int16		max_ios;
+	int			max_ios;
 	int			strategy_pin_limit;
 	uint32		max_pinned_buffers;
 	Oid			tablespace_id;
@@ -447,6 +447,8 @@ read_stream_begin_relation(int flags,
 		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
 	else
 		max_ios = get_tablespace_io_concurrency(tablespace_id);
+
+	/* Cap to INT16_MAX to avoid overflowing below */
 	max_ios = Min(max_ios, PG_INT16_MAX);
 
 	/*

#63

David Rowley

dgrowleyml@gmail.com

over 1 year ago

In reply to: David Rowley (#62)

Re: Streaming I/O, vectored I/O (WIP)

On Wed, 24 Apr 2024 at 14:32, David Rowley <dgrowleyml@gmail.com> wrote:

I've attached a patch with a few typo fixes and what looks like an
incorrect type for max_ios. It's an int16 and I think it needs to be
an int. Doing "max_ios = Min(max_ios, PG_INT16_MAX);" doesn't do
anything when max_ios is int16.

No feedback, so I'll just push this in a few hours unless anyone has anything.

David

#64

Thomas Munro

thomas.munro@gmail.com

over 1 year ago

In reply to: David Rowley (#63)

Re: Streaming I/O, vectored I/O (WIP)

On Wed, May 1, 2024 at 2:51 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Wed, 24 Apr 2024 at 14:32, David Rowley <dgrowleyml@gmail.com> wrote:

I've attached a patch with a few typo fixes and what looks like an
incorrect type for max_ios. It's an int16 and I think it needs to be
an int. Doing "max_ios = Min(max_ios, PG_INT16_MAX);" doesn't do
anything when max_ios is int16.

No feedback, so I'll just push this in a few hours unless anyone has anything.

Patch looks correct, thanks. Please do. (Sorry, running a bit behind
on email ATM... I also have a few more typos around here from an
off-list email from Mr Lakhin, will get to that soon...)

#65

Thomas Munro

thomas.munro@gmail.com

over 1 year ago

In reply to: Thomas Munro (#5)

Re: Streaming I/O, vectored I/O (WIP)

On Wed, Nov 29, 2023 at 1:17 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Done. I like it, I just feel a bit bad about moving the p*v()
replacement functions around a couple of times already! I figured it
might as well be static inline even if we use the fallback (= Solaris
and Windows).

Just for the record, since I'd said things like the above a few times
while writing about this stuff: Solaris 11.4.69 has gained preadv()
and pwritev(). That's interesting because it means that there will
soon be no liive Unixoid operating systems left without them, and the
fallback code in src/include/port/pg_iovec.h will, in practice, be
only for Windows. I wondered if that might have implications for how
we code or comment stuff like that, but it still seems to make sense
as we have it.

(I don't think Windows can have a real synchronous implementation; the
kernel knows how to do scatter/gather, a feature implemented
specifically for databases, but only in asynchronous ("overlapped") +
direct I/O mode, a difference I don't know how to hide at this level.
In later AIO work we should be able to use it as intended, but not by
pretending to be Unix like this.)

#66

Nazir Bilal Yavuz

byavuz81@gmail.com

over 1 year ago

In reply to: Heikki Linnakangas (#41)

1 attachment(s)

Re: Streaming I/O, vectored I/O (WIP)

Hi,

It seems that Heikki's 'v9.heikki-0007-Trivial-comment-fixes.patch'
[1]: /messages/by-id/289a1c0e-8444-4009-a8c2-c2d77ced6f07@iki.fi
patch just updates it.

[1]: /messages/by-id/289a1c0e-8444-4009-a8c2-c2d77ced6f07@iki.fi

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

Fix-the-top-comment-at-read_stream.c.patchtext/x-patch; charset=US-ASCII; name=Fix-the-top-comment-at-read_stream.c.patchDownload

From 76efb37b03b295070dc5259049825fe56460383f Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Wed, 10 Jul 2024 19:04:51 +0300
Subject: [PATCH] Fix the top comment at read_stream.c

---
 src/backend/storage/aio/read_stream.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 74b9bae6313..238ff448a91 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -51,7 +51,7 @@
  * I/Os that have been started with StartReadBuffers(), and for which
  * WaitReadBuffers() must be called before returning the buffer.
  *
- * For example, if the callback return block numbers 10, 42, 43, 60 in
+ * For example, if the callback returns block numbers 10, 42, 43, 44, 60 in
  * successive calls, then these data structures might appear as follows:
  *
  *                          buffers buf/data       ios
-- 
2.45.2

#67

Bruce Momjian

bruce@momjian.us

over 1 year ago

In reply to: Nazir Bilal Yavuz (#66)

Re: Streaming I/O, vectored I/O (WIP)

On Wed, Jul 10, 2024 at 07:21:59PM +0300, Nazir Bilal Yavuz wrote:

Hi,

It seems that Heikki's 'v9.heikki-0007-Trivial-comment-fixes.patch'
[1] is partially applied, the top comment is not updated. The attached
patch just updates it.

[1] /messages/by-id/289a1c0e-8444-4009-a8c2-c2d77ced6f07@iki.fi

Thanks, patch applied to master.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.